Monitoring Jobs & Troubleshooting

Monitoring and Controlling Jobs

Whether you use a GUI or the command line, it is essential to monitor the progress of your jobs and to understand how to control them. The commands below let you do both.

These commands report the job state, which will typically be one of PENDING, RUNNING, COMPLETED, CANCELLED, or FAILED:

  • PENDING: Job is awaiting a slot suitable for the requested resources, or you have exceeded your limit on resource usage. Jobs with high resource demands may spend significant time PENDING if the compute grid is busy.
  • RUNNING: Job is running.
  • COMPLETED: Job has finished and the command(s) have returned successfully (i.e., exit code 0).
  • CANCELLED: Job has been terminated by the user or administrator using bkill.
  • FAILED: Job finished with an exit code other than 0.
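
On the command line, bjobs reports these states in abbreviated form in its STAT column. The sketch below assumes the standard LSF abbreviations (PEND, RUN, DONE, EXIT) and the default bjobs column layout, in which STAT is the third field. The function names describe_state and stat_from_bjobs are hypothetical helpers for illustration, not part of LSF.

```shell
# Map a bjobs STAT abbreviation to the state names used on this page.
# Assumption: standard LSF abbreviations. LSF reports EXIT both for jobs
# that failed and for jobs that were killed.
describe_state() {
  case "$1" in
    PEND) echo "PENDING" ;;
    RUN)  echo "RUNNING" ;;
    DONE) echo "COMPLETED" ;;
    EXIT) echo "FAILED or CANCELLED" ;;
    *)    echo "OTHER ($1)" ;;
  esac
}

# Read default `bjobs JOBID` output on stdin and print the STAT column
# (third field of the first row after the header).
stat_from_bjobs() {
  awk 'NR == 2 { print $3 }'
}

# Usage on the cluster (not run here):
#   describe_state "$(bjobs JOBID | stat_from_bjobs)"
```

Other values (for example, suspended states) fall through to the OTHER branch rather than being guessed at.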

Please also see our technote on Monitoring CPU Usage for your Jobs in order to understand the runtime behavior of your code.

Note: If using the NoMachine NX client, you will need to open a terminal window to execute these commands. From the menu bar, select Applications > Accessories > gnome-terminal.

Summary of LSF Commands for Monitoring and Controlling Jobs

The table below shows a summary of LSF commands. A link to the official LSF documentation is included at the bottom of the table. Note that JOBID refers to one of the job IDs listed from the generic bjobs command or from your log of jobs you've run.

Action | LSF command | Example
--- | --- | ---
Submit a batch (background) script | bsub | bsub runscript.sh
Run an application script | bsub | bsub -q long matlab -r "myscript"
Get an interactive session (shell) | bsub -Ip | bsub -q long_int -Is -W 2:00 /bin/bash
Run a script interactively | bsub -Ip | bsub -q long_int -Is -W 10 /bin/bash /bin/hostname
Kill a job | bkill | bkill JOBID
Kill all jobs | bkill | bkill 0
View current/pending jobs | bjobs | bjobs
View specific job | bjobs | bjobs JOBID
View details (long format) of job | bjobs | bjobs -l JOBID
View the output and error files of a job | bpeek | bpeek JOBID
View queue status | bqueues | bqueues
View queue status for all users | bqueues | bqueues -u all
View recent past job info | bhist | bhist JOBID
View list of past jobs (failed or good) from date to now | bhist | bhist -a [-n #] -S 2017/09/01,
View details (long format) of past job | bhist | bhist -l [-n #] JOBID
Check how busy the cluster is by user | bjobs | bjobs -u all
View cluster load by several criteria | bqueues, bhosts, lsload | bqueues && bhosts && lsload

Note on bhist: The -n option (where n = 0 or 1 - 400) must be used for jobs older than a day or two. It indicates how many job logs to search backwards through. 0 indicates the first 100 logs, and may result in the command taking several tens of seconds to return any information.

See additional commands in the official LSF documentation.
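
The commands in the table chain together naturally once you have the job ID. As a minimal sketch, the helper below pulls the ID out of the confirmation line that bsub prints on submission, assuming the usual LSF wording ("Job <ID> is submitted to ..."). The name jobid_from_bsub is a hypothetical helper, not an LSF command.

```shell
# Print the job ID found in bsub's confirmation line read from stdin.
# Assumption: the line has the form "Job <12345> is submitted to queue <long>."
jobid_from_bsub() {
  sed -n 's/^Job <\([0-9][0-9]*\)> is submitted.*/\1/p'
}

# Usage on the cluster (not run here):
#   jobid=$(bsub runscript.sh | jobid_from_bsub)
#   bjobs -l "$jobid"    # detailed status
#   bpeek "$jobid"       # output and error files so far
#   bkill "$jobid"       # terminate the job if needed
```

Saving the ID at submission time also gives you the "log of jobs you've run" that the JOBID placeholder refers to.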

Troubleshooting Jobs and Resources

A variety of problems can arise when running jobs and applications on the HBSGrid. Many are related to resource misallocation, but there are other common problems as well. Be sure to check for email messages from the scheduler, which may explain the problem, and check the log files from your application.

Error | Likely Cause
--- | ---
JOB <jobid> CANCELLED AT <time> DUE TO TIME LIMIT | You did not specify enough time in your submission script. The -W option sets the time limit in minutes, or can take HH:MM form (12:30 for 12.5 hours).
Job <jobid> exceeded <mem> memory limit, being killed | Your job is attempting to use more memory than you've requested for it. Either increase the amount of memory requested or, if possible, reduce the amount your application is trying to use. For example, many Java programs set heap space using the -Xmx JVM option. This could potentially be reduced.
Exited with exit code N | Your job failed because your application exited with an error. Please look at the job or application logs to determine why your program exited abnormally.
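
Two of the entries above involve small calculations that are easy to get wrong. The sketch below converts a -W value (plain minutes or HH:MM) to minutes, and gives a first reading of an exit code. The second function relies on the common Unix shell convention that codes above 128 mean the process was terminated by a signal (code minus 128); treat that as an assumption to confirm against your job logs. Both function names are hypothetical helpers.

```shell
# Convert a -W value, given as minutes or as HH:MM, to minutes.
wtime_to_minutes() {
  echo "$1" | awk -F: '{ if (NF == 2) print $1 * 60 + $2; else print $1 + 0 }'
}

# Give a first reading of a job's exit code.
# Assumption: codes above 128 follow the shell convention 128 + signal number.
explain_exit_code() {
  code="$1"
  if [ "$code" -eq 0 ]; then
    echo "success"
  elif [ "$code" -gt 128 ]; then
    echo "terminated by signal $((code - 128))"
  else
    echo "application error $code"
  fi
}

# Usage:
#   wtime_to_minutes 12:30    # the 12.5-hour example from the table
#   explain_exit_code 137
```

An exit code in the signal range often points to the job being killed from outside (for example, for exceeding a limit) rather than to a bug in the application itself.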

If you are unable to determine why your jobs are not running correctly, please contact RCS for assistance.
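
When asking for help, it can save time to include what LSF itself reports about the job. The following is a minimal sketch that writes the long-format output of bjobs and bhist for one job into a text file. The function name collect_job_report and the report file name are arbitrary choices for illustration; for jobs older than a day or two, bhist may also need the -n option described above.

```shell
# Gather LSF's view of one job into a single text file.
# Hypothetical helper: the file name pattern is an arbitrary choice.
collect_job_report() {
  jobid="$1"
  report="job_${jobid}_report.txt"
  {
    echo "== bjobs -l ${jobid} =="
    bjobs -l "$jobid" 2>&1
    echo "== bhist -l ${jobid} =="
    bhist -l "$jobid" 2>&1
  } > "$report"
  # Print the file name so the caller knows what to attach.
  echo "$report"
}

# Usage on the cluster (not run here):
#   collect_job_report JOBID
```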

Research Computing Services (RCS) 
Harvard Business School
Baker Library, B90, 25 Harvard Way
Boston, MA 02163
Phone: 617.495.6100
Email: research@hbs.edu
Copyright © President & Fellows of Harvard College