Monitoring Jobs & Troubleshooting
Monitoring and Controlling Jobs
Whether using a GUI or the command line, monitoring the progress of your jobs and understanding how to control them is essential. Below we provide commands to accomplish both.
These commands report the job state, which will typically be one of PENDING, RUNNING, COMPLETED, CANCELLED, or FAILED:
- PENDING: Job is awaiting a slot suitable for the requested resources or you've gone over your limit on resource usage. Jobs with high resource demands may spend significant time PENDING if the compute grid is busy.
- RUNNING: Job is running.
- COMPLETED: Job has finished and the command(s) have returned successfully (i.e., exit code 0).
- CANCELLED: Job has been terminated by the user or administrator using bkill.
- FAILED: Job finished with an exit code other than 0.
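On the command line, bjobs normally shows these states as abbreviated codes in its STAT column. The sketch below assumes the usual LSF abbreviations (PEND, RUN, DONE, EXIT) and maps them back to the names used above. `describe_state` is a hypothetical helper, not an LSF command. Note that LSF reports both killed and failed jobs as EXIT, so the code alone cannot tell the two apart.

```shell
#!/usr/bin/env bash
# Hypothetical helper: translate a bjobs STAT code into the state names above.
# Assumes the standard LSF abbreviations; suspended states are lumped together.
describe_state() {
  case "$1" in
    PEND) echo "PENDING" ;;
    RUN)  echo "RUNNING" ;;
    DONE) echo "COMPLETED" ;;
    EXIT) echo "FAILED or CANCELLED" ;;   # non-zero exit code, or killed with bkill
    PSUSP|USUSP|SSUSP) echo "SUSPENDED" ;;
    *)    echo "UNKNOWN ($1)" ;;
  esac
}

# Annotate your current jobs. This part runs only where LSF is installed.
# In default bjobs output, JOBID is column 1 and STAT is column 3.
if command -v bjobs >/dev/null 2>&1; then
  bjobs 2>/dev/null | awk '$1 ~ /^[0-9]+$/ { print $1, $3 }' |
    while read -r jobid stat; do
      echo "$jobid: $(describe_state "$stat")"
    done
fi
```

For example, `describe_state DONE` prints `COMPLETED`.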
Please also see our technote on Monitoring CPU Usage for your Jobs in order to understand the runtime behavior of your code.
Note: If using the NoMachine NX client, you will need to open a terminal window to execute these commands. From the menu bar, select Applications > Accessories > gnome-terminal.
Summary of LSF Commands for Monitoring and Controlling Jobs
The table below shows a summary of LSF commands. A link to the official LSF documentation is included at the bottom of the table. Note that JOBID refers to one of the job IDs listed from the generic bjobs command or from your log of jobs you've run.
Action | LSF command | Example |
---|---|---|
Submit a batch (background) script or run an application script | | |
Get an interactive session (shell) or run a script interactively | | |
Kill a job or kill all jobs | | |
View current/pending jobs | | |
View specific job | | |
View details (long format) of job | | |
View the output and error files of a job | | |
View queue status | | |
View queue status for all users | | |
View recent past job info | | |
View list of past jobs (failed or good) from date to now | | |
View details (long format) of past job | | |
Check how busy the cluster is by user | | |
View cluster load by several criteria | | |

Note: When viewing details of a past job, the option (where n = 0 or 1 - 400) must be used for jobs older than a day or two. It indicates how many job log files to look back through. 0 indicates the first 100 logs, and may result in the command taking several tens of seconds to return any information.
See additional commands in the official LSF documentation.
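As a rough companion to the actions listed in the table above, the sketch below prints one standard LSF command per action. It assumes a stock LSF installation; the examples used on the HBSGrid may differ (for instance in queue names or wrapper scripts). `JOBID` and `my_script.sh` are placeholders, and `lsf_cheatsheet` is a hypothetical helper that only prints text and never contacts the scheduler.

```shell
#!/usr/bin/env bash
# Prints a cheat sheet only; nothing here is executed against the cluster.
# JOBID and my_script.sh are placeholders (assumptions, not site defaults).
lsf_cheatsheet() {
  cat <<'EOF'
bsub ./my_script.sh      # submit a batch (background) job
bsub -Is bash            # start an interactive shell session
bkill JOBID              # kill one job
bkill 0                  # kill all of your own jobs
bjobs                    # view your current and pending jobs
bjobs JOBID              # view a specific job
bjobs -l JOBID           # view job details (long format)
bpeek JOBID              # view stdout/stderr of a job that is still running
bqueues                  # view queue status
bjobs -u all             # view jobs for all users
bjobs -a                 # include recently finished jobs
bhist -a                 # list past jobs, finished or not
bhist -l JOBID           # view details (long format) of a past job
bhist -n 0 -l JOBID      # same, searching further back through the event logs
busers all               # job counts per user
bhosts                   # host status and job slots
lsload                   # cluster load by several criteria
EOF
}

lsf_cheatsheet
```

Copy individual lines into a terminal on the cluster, replacing `JOBID` with a real job ID from bjobs.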
Troubleshooting Jobs and Resources
A variety of problems can arise when running jobs and applications on the HBSGrid. Many are related to resource misallocation, but there are other common problems as well. Be sure to check for email messages from the scheduler, which may explain the problem, and check the log files from your application.
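One way to check a scheduler message or job report quickly is to search it for a known failure pattern. The sketch below assumes LSF's usual termination reasons (`TERM_RUNLIMIT`, `TERM_MEMLIMIT`) and its "Exited with exit code N" line appear in the report text; `diagnose_job_log` is a hypothetical helper, and the exact wording on the HBSGrid may differ.

```shell
#!/usr/bin/env bash
# Hypothetical helper: guess why a job ended from the text of its LSF report.
# Assumes the standard TERM_* reasons and "Exited with exit code N" wording.
diagnose_job_log() {
  local log="$1"
  if grep -q 'TERM_RUNLIMIT' "$log"; then
    echo "run time limit reached: request more time"
  elif grep -q 'TERM_MEMLIMIT' "$log"; then
    echo "memory limit reached: request more memory or reduce usage"
  elif grep -Eq 'Exited with exit code [0-9]+' "$log"; then
    echo "application error: check the application's own logs"
  else
    echo "no known failure pattern found"
  fi
}
```

Call it with the path to a saved job report, for example `diagnose_job_log my_job.out`, where `my_job.out` is a placeholder file name. The limit checks come first because a job killed for exceeding a limit also reports a non-zero exit code.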
| Error | Likely Cause |
|---|---|
| | You did not specify enough time in your submission script. The run-time limit option sets the time in minutes, or can also take HH:MM form (12:30 for 12.5 hours). |
| | Your job is attempting to use more memory than you've requested for it. Either increase the amount of memory requested or, if possible, reduce the amount your application is trying to use. For example, many Java programs set heap space using a JVM option; this value could potentially be reduced. |
| | Your job failed because your application exited with an error. Please look at the job or application logs to determine why your program exited abnormally. |
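The time and heap-size fixes described in the table can be sketched as follows, assuming the standard `bsub -W` run limit option and the JVM's `-Xmx` heap flag. `minutes_to_hhmm` and `build_bsub_cmd` are hypothetical helpers, and `app.jar` and the heap size are placeholders. The memory request itself is left out because its option and units depend on how the site is configured.

```shell
#!/usr/bin/env bash
# Convert a run time in minutes to the HH:MM form (e.g. 750 -> 12:30).
minutes_to_hhmm() {
  printf '%d:%02d\n' "$(( $1 / 60 ))" "$(( $1 % 60 ))"
}

# Build (but do not run) a submission with a run time limit and a capped Java heap.
# app.jar is a placeholder; adjust the command to your own application.
build_bsub_cmd() {
  local minutes="$1" heap="$2"
  echo "bsub -W $(minutes_to_hhmm "$minutes") java -Xmx${heap} -jar app.jar"
}

build_bsub_cmd 750 4g   # prints: bsub -W 12:30 java -Xmx4g -jar app.jar
```

Printing the command rather than running it makes it easy to review the limits before submitting.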
If you are unable to determine why your jobs are not running correctly, please contact RCS for assistance.