Monitoring Jobs & Troubleshooting

Monitoring and Controlling Jobs

Whether you use a GUI or the command line, you need to be able to monitor the progress of your jobs and control them. The commands below do both.

These commands report the job state, which will typically be one of PENDING, RUNNING, COMPLETED, CANCELLED, or FAILED:

PENDING: Job is waiting for a slot that can satisfy the requested resources, or you have exceeded your limit on resource usage. Jobs with high resource demands may spend significant time PENDING if the compute grid is busy.
RUNNING: Job is running.
COMPLETED: Job has finished and the command(s) have returned successfully (i.e., exit code 0).
CANCELLED: Job has been terminated by the user or administrator using bkill.
FAILED: Job finished with an exit code other than 0.
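
LSF's `bjobs` command reports state in an abbreviated `STAT` column (for example `PEND`, `RUN`, `DONE`, `EXIT`). The sketch below shows one way to translate those abbreviations into the state names listed above. The helper names `describe_state` and `is_finished` are hypothetical, and reporting `EXIT` as "FAILED or CANCELLED" is a simplifying assumption, since the abbreviation alone does not say which of the two applies.

```python
# Map LSF's abbreviated STAT values to the descriptive state names in this guide.
# Assumption: EXIT covers both FAILED and CANCELLED, because the STAT value
# by itself does not distinguish a non-zero exit from a bkill.
LSF_STATES = {
    "PEND": "PENDING",
    "RUN": "RUNNING",
    "DONE": "COMPLETED",
    "EXIT": "FAILED or CANCELLED",
}


def describe_state(stat: str) -> str:
    """Return the descriptive name for an LSF STAT value (hypothetical helper)."""
    key = stat.strip().upper()
    # Anything not in the table (e.g. suspended states) is passed through.
    return LSF_STATES.get(key, "OTHER (" + key + ")")


def is_finished(stat: str) -> bool:
    """True once the job has stopped, whether or not it succeeded."""
    return stat.strip().upper() in {"DONE", "EXIT"}


for stat in ("PEND", "RUN", "DONE", "EXIT"):
    print(stat, "->", describe_state(stat))
```

A check such as `is_finished` is useful in a polling loop that waits for a job before starting the next step.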

Please also see our technote on Monitoring CPU Usage for your Jobs in order to understand the runtime behavior of your code.

Note: If using the NoMachine NX client, you will need to open a terminal window to execute these commands. From the menu bar, select Applications > Accessories > gnome-terminal.

Summary of LSF Commands

The table below summarizes the most common LSF commands. A link to the official LSF documentation is included at the bottom of the table. Note that JOBID refers to one of the job IDs listed by the plain `bjobs` command or recorded in your own log of jobs you've run.

Action	LSF command	Example
Submit a batch (background) script	`bsub`	`bsub runscript.sh`
Run an application script	`bsub`	`bsub -q long matlab -r "myscript"`
Get an interactive session (shell)	`bsub -Ip`	`bsub -q long_int -Is -W 2:00 /bin/bash`
Run a script interactively	`bsub -Ip`	`bsub -q long_int -Is -W 10 /bin/bash /bin/hostname`
Kill a job	`bkill`	`bkill JOBID`
Kill all jobs	`bkill`	`bkill 0`
View current/pending jobs	`bjobs`	`bjobs`
View a specific job	`bjobs`	`bjobs JOBID`
View details (long format) of a job	`bjobs`	`bjobs -l JOBID`
View the output and error files of a job	`bpeek`	`bpeek JOBID`
View queue status	`bqueues`	`bqueues`
View queue status for all users	`bqueues`	`bqueues -u all`
View recent past job info	`bhist`	`bhist JOBID`
View list of past jobs (failed or good) from a date to now	`bhist`	`bhist -a [-n #] -S 2017/09/01,`
View details (long format) of a past job	`bhist`	`bhist -l [-n #] JOBID`
Check how busy the cluster is by user	`bjobs`	`bjobs -u all`
View cluster load by several criteria	`bqueues`, `bhosts`, `lsload`	`bqueues && bhosts && lsload`

Note: The `-n #` option (where # is 0 or 1-400) must be used for jobs older than a day or two. It sets how many job logs to search backwards through. A value of 0 searches the first 100 logs and may cause the command to take several tens of seconds to return any information.

See additional commands in the official LSF documentation.
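
Because most of the commands above take a JOBID, it can help to capture the ID at submission time. The sketch below assumes `bsub` prints its usual confirmation line of the form `Job <12345> is submitted to queue <long>.` The helper names `parse_job_id` and `follow_up_commands` are hypothetical, and the code only builds command lines as strings; it does not run them.

```python
import re

# Assumed format of bsub's confirmation line, e.g.:
#   Job <12345> is submitted to queue <long>.
_SUBMIT_RE = re.compile(r"Job <(\d+)> is submitted")


def parse_job_id(bsub_output: str) -> str:
    """Extract the job ID from bsub's confirmation message."""
    match = _SUBMIT_RE.search(bsub_output)
    if match is None:
        raise ValueError("no job ID found in: " + repr(bsub_output))
    return match.group(1)


def follow_up_commands(job_id: str) -> dict:
    """Build the per-job monitoring and control commands from the table."""
    return {
        "status": " ".join(["bjobs", job_id]),
        "details": " ".join(["bjobs", "-l", job_id]),
        "peek": " ".join(["bpeek", job_id]),
        "history": " ".join(["bhist", "-l", job_id]),
        "kill": " ".join(["bkill", job_id]),
    }


# Example with a made-up job ID and queue name.
job_id = parse_job_id("Job <12345> is submitted to queue <long>.")
for name, command in follow_up_commands(job_id).items():
    print(name, "->", command)
```

Recording the ID alongside the script name and submission date also gives you the "log of jobs you've run" that `bhist` lookups rely on later.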

Troubleshooting Jobs and Resources

A variety of problems can arise when running jobs and applications on the HBSGrid. Many are related to resource misallocation, but there are other common problems as well. Be sure to check for email messages from the scheduler, which may explain the problem, and check the log files from your application.

Error	Likely Cause
`JOB <jobid> CANCELLED AT <time> DUE TO TIME LIMIT`	You did not request enough time when you submitted the job. The `-W` option sets the time limit in minutes, or in HH:MM form (for example, `12:30` for 12.5 hours).
`Job <jobid> exceeded <mem> memory limit, being killed`	Your job is attempting to use more memory than you requested for it. Either increase the amount of memory requested or, if possible, reduce the amount your application is trying to use. For example, many Java programs set their heap space with the `-Xmx` JVM option, which could potentially be reduced.
`Exited with exit code N`	Your job failed because your application exited with an error. Please look at the job or application logs to determine why your program exited abnormally.
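
To make the two `-W` forms concrete, here is a small sketch that converts either form to minutes, based only on the description in the table above (plain minutes, or HH:MM). The function name `w_limit_minutes` is hypothetical and is not part of LSF.

```python
def w_limit_minutes(value: str) -> int:
    """Convert a -W value (plain minutes, or HH:MM) to a number of minutes."""
    text = value.strip()
    if ":" in text:
        # HH:MM form: hours before the colon, minutes after it.
        hours, minutes = text.split(":", 1)
        return int(hours) * 60 + int(minutes)
    # Plain form: the value is already a number of minutes.
    return int(text)


for example in ("10", "2:00", "12:30"):
    print(example, "=", w_limit_minutes(example), "minutes")
```

Checking a planned limit this way before submitting makes it easier to spot a request such as `-W 10` that was meant to be ten hours rather than ten minutes.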

If you are unable to determine why your jobs are not running correctly, please contact RCS for assistance.
