Compute Cluster
Monitoring Jobs & Troubleshooting
Monitoring and Controlling Jobs
Whether using a GUI or the command line, monitoring the progress of your jobs and understanding how to control them is essential. Below we provide commands to accomplish both.
These commands report the job state, which will typically be one of PENDING, RUNNING, COMPLETED, CANCELLED, or FAILED. (In its STAT column, bjobs shows LSF's own abbreviations, such as PEND, RUN, DONE, and EXIT.)
- PENDING: Job is waiting for a slot that satisfies the requested resources, or you have exceeded your resource usage limit. Jobs with high resource demands may spend significant time PENDING if the compute grid is busy.
- RUNNING: Job is running.
- COMPLETED: Job has finished and the command(s) have returned successfully (i.e., exit code 0).
- CANCELLED: Job has been terminated by the user or administrator using bkill.
- FAILED: Job finished with an exit code other than 0.
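As a rough illustration of how the states above relate to what bjobs prints, the sketch below maps LSF's STAT abbreviations to the names used in this list. The function name describe_stat is hypothetical, and the SUSPENDED label for the suspend codes is our own shorthand, not a state listed above. Note that LSF reports both failed and killed jobs as EXIT.

```shell
#!/bin/sh
# describe_stat STAT
# Hypothetical helper: translate an LSF STAT code (as shown by bjobs)
# into the state names used in the list above.
describe_stat() {
    case "$1" in
        PEND)  echo "PENDING" ;;
        RUN)   echo "RUNNING" ;;
        DONE)  echo "COMPLETED" ;;
        # LSF uses EXIT both for non-zero exit codes and for jobs killed with bkill.
        EXIT)  echo "FAILED or CANCELLED" ;;
        # Suspended jobs; "SUSPENDED" is our own label, not one of the states above.
        PSUSP|USUSP|SSUSP) echo "SUSPENDED" ;;
        *)     echo "UNKNOWN" ;;
    esac
}
```

For example, describe_stat PEND prints PENDING.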
Please also see our technote on Monitoring CPU Usage for your Jobs in order to understand the runtime behavior of your code.
Note: If using the NoMachine NX client, you will need to open a terminal window to execute these commands. From the menu bar, select Applications > Accessories > gnome-terminal.
Summary of LSF Commands
The table below shows a summary of LSF commands. A link to the official LSF documentation is included at the bottom of the table. Note that JOBID refers to one of the job IDs listed from the generic bjobs command or from your log of jobs you've run.
Action | LSF command | Example |
---|---|---|
Submit a batch (background) script | bsub | bsub runscript.sh |
Run an application script | bsub | bsub -q long matlab -r "myscript" |
Get an interactive session (shell) | bsub -Ip | bsub -q long_int -Is -W 2:00 /bin/bash |
Run a script interactively | bsub -Ip | bsub -q long_int -Is -W 10 /bin/bash /bin/hostname |
Kill a job | bkill | bkill JOBID |
Kill all jobs | bkill | bkill 0 |
View current/pending jobs | bjobs | bjobs |
View specific job | bjobs | bjobs JOBID |
View details (long format) of job | bjobs | bjobs -l JOBID |
View the output and error files of a job | bpeek | bpeek JOBID |
View queue status | bqueues | bqueues |
View queue status for all users | bqueues | bqueues -u all |
View recent past job info | bhist | bhist JOBID |
View list of past jobs (failed or good) from date to now | bhist | bhist -a [-n #] -S 2017/09/01, |
View details (long format) of past job | bhist | bhist -l [-n #] JOBID |
Check how busy the cluster is by user | bjobs | bjobs -u all |
View cluster load by several criteria | bqueues, bhosts, lsload | bqueues && bhosts && lsload |

Note: For bhist, the -n option (where n = 0 or 1 - 400) must be used for jobs older than a day or two. It indicates how many job logs to look backwards through. 0 indicates the first 100 logs, and may result in the command taking several tens of seconds to return any information.
See additional commands in the official LSF documentation.
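The monitoring commands can also be scripted. The following is a minimal sketch of a polling loop that waits until a job reaches a final state. It assumes your LSF version supports the bjobs options -noheader and -o (customized output). The function name wait_for_job and the default 30-second interval are our own choices, not part of LSF.

```shell
#!/bin/sh
# wait_for_job JOBID [INTERVAL_SECONDS]
# Hypothetical helper: poll bjobs until the job is DONE or EXIT.
# Prints the final state; returns 0 for DONE, 1 for EXIT,
# and 2 if bjobs reports nothing for this job ID.
wait_for_job() {
    wfj_jobid="$1"
    wfj_interval="${2:-30}"   # seconds between polls (assumed default)
    while true; do
        # Ask for the STAT field only; strip padding and newlines.
        wfj_stat="$(bjobs -noheader -o stat "$wfj_jobid" 2>/dev/null | tr -d '[:space:]')"
        case "$wfj_stat" in
            DONE) echo "DONE"; return 0 ;;
            EXIT) echo "EXIT"; return 1 ;;
            "")   echo "UNKNOWN"; return 2 ;;
        esac
        # Any other state (PEND, RUN, suspended, ...) means keep waiting.
        sleep "$wfj_interval"
    done
}
```

For example, wait_for_job JOBID 60 checks once a minute. Keep the interval generous so that polling does not add load to the scheduler.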
Troubleshooting Jobs and Resources
A variety of problems can arise when running jobs and applications on the HBSGrid. Many are related to resource misallocation, but there are other common problems as well. Be sure to check for email messages from the scheduler, which may explain the problem, and check the log files from your application.
Error | Likely Cause |
---|---|
JOB <jobid> CANCELLED AT <time> DUE TO TIME LIMIT | You did not specify enough time in your submission script. The -W option sets the run time limit in minutes, or in HH:MM form (for example, 12:30 for 12.5 hours). |
Job <jobid> exceeded <mem> memory limit, being killed | Your job is attempting to use more memory than you've requested for it. Either increase the amount of memory requested or, if possible, reduce the amount your application is trying to use. For example, many Java programs set heap space using the -Xmx JVM option. This could potentially be reduced. |
Exited with exit code N | Your job failed because your application exited with an error. Please look at the job or application logs to determine why your program exited abnormally. |
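Because -W accepts either plain minutes or HH:MM, it is easy to request far less time than intended. As a quick sanity check, the sketch below converts either form to total minutes. The function name to_minutes is hypothetical and not part of LSF.

```shell
#!/bin/sh
# to_minutes VALUE
# Hypothetical helper: convert a -W run limit, given as minutes
# or as HH:MM, to total minutes.
to_minutes() {
    # With a colon, the first field is hours and the second is minutes;
    # without one, the value is already in minutes.
    echo "$1" | awk -F: '{ if (NF == 2) print $1 * 60 + $2; else print $1 + 0 }'
}
```

For example, to_minutes 12:30 prints 750, while to_minutes 10 prints 10, which shows that -W 10 requests ten minutes rather than ten hours.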
If you are unable to determine why your jobs are not running correctly, please contact RCS for assistance.