Technical How-To’s & Notes

Monitoring CPU usage of your jobs on the HBSGrid

There are two ways to monitor your running job:

1. Run the Unix command top on the compute node where your job is executing, or
2. Query LSF via bjobs for your currently running jobs

Each has advantages and disadvantages, which we'll discuss below. Either method gives you feedback on how your code is behaving while the job is still running.

Running top on the compute node:

1. Use bjobs to find the execution host (EXEC_HOST) that each of your jobs is running on.
2. For each host, start an interactive session on that machine, replacing EXEC_HOST with the actual machine name (e.g. rhrcsnod05):
       bsub -q interactive -Is -W 24:00 -R "rusage[mem=1000]" -m EXEC_HOST /bin/bash
   This command opens a bash shell on the named machine with 1 core and 1000 MB of RAM for 24 hours.
3. Run top and watch for your processes.
4. (Optional) Press u and enter your username to view only your processes.
5. (Optional) Press 1 (the number one) to see the utilization of each CPU core or hyperthread.
6. Press q or Ctrl-C to stop monitoring.
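
The host lookup in the steps above can be partly scripted. The following is a minimal sketch, not an official HBSGrid tool: the function names exec_hosts and session_cmd are hypothetical, and it assumes that bjobs reports multi-core jobs as N*hostname, with several hosts separated by colons. It only prints the bsub command from step 2 so that you can review it before running it.

```shell
# Strip LSF's slot-count prefixes from an EXEC_HOST value and print each
# distinct host name on its own line, e.g. "4*hostA:2*hostB" -> hostA, hostB.
# (Assumes the N*host and colon-separated formats described above.)
exec_hosts() {
    printf '%s\n' "$1" | tr ':' '\n' | sed 's/^[0-9]*\*//' | sort -u
}

# Print (rather than run) the interactive-session command for one host,
# using the same bsub options as in the steps above.
session_cmd() {
    printf 'bsub -q interactive -Is -W 24:00 -R "rusage[mem=1000]" -m %s /bin/bash\n' "$1"
}

# Example with a hand-copied EXEC_HOST value; no LSF is needed to try it.
for host in $(exec_hosts '4*rhrcsnod05'); do
    session_cmd "$host"
done
```

Copy the EXEC_HOST column from your own bjobs output in place of the sample value. Once you are on the node, top -b -n 1 -u "$USER" prints a single non-interactive snapshot of your processes, which can be handy for logging.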

Monitoring execution via bjobs output

This method is less precise than top because LSF (the scheduler) collects runtime data from the execution hosts only periodically, so the information lags behind the job's actual state. Also, you may not have any information for the first 1 to 5 minutes of your job.

1. Use bjobs to find the job IDs of your jobs.
2. For each job, use bjobs -l jobID to get the job details.
3. Look at the IDLE_FACTOR statistic. For a 1-core job at 100% efficiency, this will be 1; for a 2-core job, 2; and so on. A job that is "threading out" (running more threads than the cores it requested) shows a value well above the requested core count: for example, asking for 2 cores and seeing a value greater than 2.5 (e.g. 4 or 6).
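
The IDLE_FACTOR check can also be sketched as a small script. This is an illustration under stated assumptions, not an LSF feature: both function names are hypothetical, the parser assumes that bjobs -l prints a line of the form IDLE_FACTOR(cputime/runtime): 3.80, and the 1.25x margin is simply taken from the "2 cores, greater than 2.5" example above.

```shell
# Read `bjobs -l` output on stdin and print the first IDLE_FACTOR value.
# (Assumes a line such as "IDLE_FACTOR(cputime/runtime):   3.80".)
idle_factor_from_bjobs_l() {
    sed -n 's/.*IDLE_FACTOR[^:]*:[[:space:]]*\([0-9.]*\).*/\1/p' | head -n 1
}

# classify_idle_factor IDLE_FACTOR CORES
# Prints "over" when the job uses noticeably more CPU than it requested,
# otherwise "ok". The 1.25x margin is an illustrative threshold only.
classify_idle_factor() {
    awk -v f="$1" -v c="$2" 'BEGIN { if (f > c * 1.25) print "over"; else print "ok" }'
}

# Example with a hand-written sample line for a job that asked for 2 cores.
sample=' IDLE_FACTOR(cputime/runtime):   3.80'
factor=$(printf '%s\n' "$sample" | idle_factor_from_bjobs_l)
echo "IDLE_FACTOR=$factor requested=2 status=$(classify_idle_factor "$factor" 2)"
```

On the grid, the sample line would be replaced by piping real output, as in bjobs -l jobID | idle_factor_from_bjobs_l. Because LSF reports this statistic with a lag, treat a single reading as a hint rather than a verdict.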