Scaling Work
Parallel Processing
Requesting multiple CPUs on the HBSGrid
When using parallel processing on shared compute systems, you need to indicate to the scheduler, the system software that manages workloads, that you wish to use multiple cores. On your personal desktop or laptop this isn't necessary, as you control all the resources on that machine. On a compute cluster, however, you only control the resources that the scheduler has given you, and it has given you only the resources that you've requested, whether explicitly via a custom job submission script or implicitly via the default values or default submission scripts available on the HBS compute grid. This is because jobs (work sessions) from multiple (and possibly different) people are often running side-by-side on a given compute node of the compute cluster.
To use multiple CPUs on the HBSGrid you can start your application using a wrapper script {link to Working with wrapper (default submission) scripts subsection} or a custom submission script {link to Working with custom submission scripts subsection} and specify the number of CPUs you will use. For example, starting R via its wrapper script in the terminal on a login node will start R with 5 CPUs reserved (specifying the RAM footprint, e.g. 5 GB, is optional for the smallest footprint). Note that one cannot use the NoMachine drop-down menus to start parallel jobs except for Stata: the menus indicate whether Stata starts with 1 core (Stata-SE), 4 cores (Stata-MP4), or 8 cores (Stata-MP8). If you wish to use multiple cores with other tools, you must issue the appropriate commands yourself in the terminal to request and reserve these resources from the scheduler. As an illustration, a custom interactive request might look like the following sketch (standard LSF options; the queue name and memory syntax are placeholders for your site's conventions):
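    bsub -Is -q interactive -n 5 -M 5G R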
Implicit Parallelism

Implicit parallelism is easiest to use but limited to the features offered by your application or programming language. Most of the applications commonly used for data analysis on the HBSGrid provide some degree of implicit parallelization. The system-wide installation of RStudio / Microsoft R Open uses the Intel Math Kernel Library (MKL) for fast multi-threaded computations. The system-wide installation of Spyder / Python also uses MKL to speed up some computations. Similarly, many Stata commands have been parallelized, as have some MATLAB algorithms. Note that for all these applications only some computations use implicit parallelization; many computations will only use a single CPU. To speed up other computations you may be able to use explicit parallelization.
Most system-wide applications started on the HBSGrid via wrapper scripts will use the number of CPUs you specify when starting your application. For example, starting R via a wrapper script with five CPUs requested from the command line will start R with MKL correctly configured to use five cores.

Explicit Parallelism
Explicit parallelism can be achieved using application or library features to leverage multiple CPUs on a single compute node, or using LSF job arrays to leverage multiple CPUs across multiple compute nodes. For this to work, your script or code must be parallelizable -- that is, it can be broken into parts that execute independently. This is often the case with for loops or functions that can perform work independently. A good example is the apply family of functions in R (lapply, sapply, etc.). As when using implicit parallelism, you must request the number of CPUs you will use when submitting a job. We also recommend that you do not statically indicate ('hard code') the number of cores that you'll be using in your code. Instead, set this value dynamically based on job/runtime environment variables that are set as the job executes. You'll see examples of this in the following sections.
Finally, one must also factor in memory requirements for explicit parallelization. If the parallelization all happens within one job (e.g. with Python's multiprocessing, certain approaches in R or MATLAB, or others), one must also determine how memory will be consumed by each fork/thread/process/branch of the code. If each has its own copy of the data and data structures, then memory requirements will increase significantly with the number of parallel executions. Conversely, if each shares the data in memory with the parent program, then significantly less memory will be needed. Each application/programming framework works differently; consult the documentation and adjust the memory requirement appropriately when submitting the job.

Explicit parallelism uses application-specific libraries and features, and is described below for each of the most commonly used programs on the HBSGrid cluster.
- MATLAB
Introduction
The following has been adapted from FAS RC’s Parallel MATLAB page (https://docs.rc.fas.harvard.edu/kb/parallel-matlab-pct-dcs/). As the Odyssey cluster uses a different workload manager, the code has been adapted to the workload manager on the HBS compute grid.
This page is intended to help you with running parallel MATLAB codes on the HBS compute grid. Parallel processing with MATLAB is performed with the help of two products, Parallel Computing Toolbox (PCT) and Distributed Computing Server (DCS). HBS is licensed only for use of the PCT.
Supported Versions: On the HBS compute grid, the following versions of MATLAB with the Parallel Computing Toolbox (PCT) are available:
MATLAB Version          Executable name
MATLAB 2018a (64-bit)   matlab

Maximum Workers: PCT uses workers, MATLAB computational engines, to execute parallelized applications and their parts on CPU cores. Each compute node on the Grid has 32 physical cores; therefore (in theory) users should request no more than 32 cores when using MATLAB with PCT. However, due to current user resource limits, you should request no more than 12 (interactive) or 16 (batch) cores. If you request more than this, your job will not run, as it will sit in a PEND state.

Code Example
The following simple code illustrates the use of PCT to calculate pi via a parallel Monte-Carlo method. This example also illustrates the use of parfor (parallel for) loops: in this scheme, suitable for loops can simply be replaced by parallel parfor loops without other changes to the code. A sketch along the lines of the FAS RC example (the log file name and pool setup are assumptions consistent with the explanation below):
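    % code.m: estimate pi via a parallel Monte-Carlo method
    diary('code.out');  % first line: write session output to a log file (name is an assumption)
    ncpus = str2num(getenv('LSB_MAX_NUM_PROCESSORS'));  % cores reserved by LSF
    parpool('local', ncpus);

    darts = 1e7;  % number of random points to throw
    count = 0;
    tic
    parfor i = 1:darts
        x = rand(1);
        y = rand(1);
        if x^2 + y^2 <= 1
            count = count + 1;  % parfor reduction variable
        end
    end
    myPI = 4 * count / darts;
    fprintf('Computed value of pi: %8.7f\n', myPI);
    fprintf('Elapsed time: %8.2f seconds\n', toc);

    delete(gcp);  % close the parallel pool and release the workers
    exit;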
Code with Job Submission Script

To run the above code (named code.m) using 5 CPU cores with the Grid's default wrapper scripts, use the MATLAB wrapper command in the terminal with 5 CPUs specified (see the wrapper scripts section).
Using custom job submission commands, the following will similarly submit the job with 5 cores for 10 minutes with 100 MB of memory (a sketch using standard bsub options; the queue name is a placeholder):
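    bsub -q short -W 10 -n 5 -M 100 matlab -nodisplay -nosplash -r code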
This will cause a log file to be created (code.out in the sketch above) owing to the diary call on the first line of our MATLAB code. If you do not use MATLAB's diary function, then you may instead enter the following command to have the output sent to a file of your choosing:
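    bsub -q short -W 10 -n 5 -M 100 matlab -nodisplay -nosplash -r code \> code.out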
The > is escaped here so that it becomes part of the submitted command rather than being interpreted by your current shell. If you wish to use a submission script to run this code and include LSF job option parameters, create a text file (named, for example, code.sh) containing the following sketch:
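    #BSUB -q short
    #BSUB -W 10
    #BSUB -n 5
    #BSUB -M 100
    matlab -nodisplay -nosplash -r code

Once your script is ready, you may run it with 5 cores by entering:

    bsub < code.sh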
The < character is used here so that the #BSUB directives in the script file are parsed by LSF.

Explanation of Parallel Code
Starting and stopping the parallel pool
The parpool function is used to initiate the parallel pool. To dynamically set the number of workers to the CPU cores you requested, we ask MATLAB to query the LSF environment variable LSB_MAX_NUM_PROCESSORS. Once the parallelized portion of your code has been run, you should explicitly close the parallel pool and release the workers. A sketch:
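    ncpus = str2num(getenv('LSB_MAX_NUM_PROCESSORS'));  % cores reserved by LSF
    parpool('local', ncpus);                            % start the pool

    % ... parallelized portion of the code ...

    delete(gcp);                                        % close the pool, release the workers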
Parallelized portion of the code
The actual portion of the code that takes advantage of multiple CPUs is the parfor loop (http://www.mathworks.com/help/distcomp/parfor.html). A parfor loop behaves similarly to a for loop, though the iterations of the loop are passed to different workers. It is therefore important that iterations do not rely on the output of any other iteration in the same loop.

- Python
Introduction
This section is intended to help you with running parallel Python code on the HBSGrid cluster or on your local multicore machine. The version of Python on the cluster uses MKL to automatically parallelize some computations. Python started via a wrapper script will use the number of CPUs you specify when starting your application; for example, starting Python via a wrapper script with 5 CPUs requested will start Python with MKL correctly configured to use 5 cores. You can configure the number of cores MKL uses with the mkl-service package.
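For example, a sketch (assumes the mkl-service package is installed; the thread count is illustrative):

    import mkl              # provided by the mkl-service package
    mkl.set_num_threads(4)  # limit MKL's implicit parallelism to 4 threads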
In addition to the implicit parallelization provided by MKL, you can explicitly parallelize your own analysis code using the 'multiprocessing' package. Instructions and examples are provided below. Note that this guide does NOT cover distributed computing, which distributes the workload over multiple machines.
Maximum Workers: Most compute nodes on the cluster have at least 32 physical cores; therefore (in theory) users should request no more than 32 cores. For short queue jobs you may request up to 16 cores, while the limit is 12 cores for long queue jobs. Nota Bene! The number of workers should be dynamically determined by asking LSF (the scheduler) how many cores you have reserved via the LSB_MAX_NUM_PROCESSORS environment variable. DO NOT use multiprocessing.cpu_count() or similar; instead retrieve the value of this environment variable, e.g., os.getenv('LSB_MAX_NUM_PROCESSORS').

Example: Parallel Processing Basics
This sample code will provide a basic introduction to parallel processing. You will be shown how to set up your parallel pool with the appropriate number of workers, how to define which function is to be run in parallel, and how to gather the results.
For this example, we will calculate the square of a list of numbers in parallel.
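A minimal sketch (the file name example1.py is an assumption; the pool is sized from LSB_MAX_NUM_PROCESSORS per the guidance above):

    # example1.py: parallel processing basics with multiprocessing
    import os
    from multiprocessing import Pool

    def square(x):
        # the function each worker will run
        return x * x

    if __name__ == '__main__':
        # size the pool from the scheduler's reservation, never os.cpu_count()
        ncpus = int(os.getenv('LSB_MAX_NUM_PROCESSORS', '1'))
        with Pool(processes=ncpus) as pool:
            results = pool.map(square, range(10))
        print(results)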
Example: Parallel Processing with Pools
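Pools can also run tasks asynchronously, gathering results as workers finish. A sketch (the file name example2.py is an assumption):

    # example2.py: asynchronous task submission with a worker pool
    import os
    from multiprocessing import Pool

    def square(x):
        return x * x

    if __name__ == '__main__':
        ncpus = int(os.getenv('LSB_MAX_NUM_PROCESSORS', '1'))
        with Pool(processes=ncpus) as pool:
            # submit all tasks without blocking, then collect the results
            handles = [pool.apply_async(square, (x,)) for x in range(10)]
            results = [h.get() for h in handles]
        print(results)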
Code with Job Submission Script
To run the above code using 5 CPU cores via the cluster's default wrapper scripts, use the Python wrapper command in the terminal with 5 CPUs specified (see the wrapper scripts section). To submit the two example code files above directly to the scheduler instead, commands like the following will work (a sketch; the queue name and file names are placeholders matching the sketches above):
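    bsub -q short -n 5 python example1.py
    bsub -q short -n 5 python example2.py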
If you wish to use a submission script to run this code and include LSF job option parameters, note that the cluster offers several queues with increasing run time limits (consult the cluster documentation for the current queue names). Create a text file (named, for example, code.sh) containing the following sketch:
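    #BSUB -q short
    #BSUB -n 5
    python example2.py

Once your script is ready, you may run it with 5 cores by entering:

    bsub code.sh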
Note: the < character is no longer needed when submitting jobs for LSF to parse directives; this is done by default.

- R
Implicit Parallelization
The system-wide installation of RStudio on the Grid uses the Intel Math Kernel Library (MKL) for fast multithreaded computations. R started on the Grid via a wrapper script will use the number of CPUs you specify when starting your application; for example, starting R via a wrapper script with 5 CPUs requested will start R with MKL correctly configured to use 5 cores. You can set the number of cores used by MKL using the setMKLthreads function in the RevoUtilsMath package (more information about MKL in R is available). Some popular R packages, including data.table, also provide some degree of implicit parallelization; the number of threads used by data.table can be set using the setDTthreads function.
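For example, a sketch (RevoUtilsMath ships with Microsoft R Open; the thread counts are illustrative):

    RevoUtilsMath::setMKLthreads(4)  # cap MKL's implicit parallelism at 4 threads
    data.table::setDTthreads(4)      # cap data.table's threads at 4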
Explicit Parallelization
It is also possible to explicitly parallelize your own analysis code. There are a large number of R packages available for parallel computing.
The future package is simple, easy to use, and can make use of several backends to enable parallelization across CPUs, ad-hoc clusters, and HPC clusters (including LSF on the grid, via future.batchtools), among others. A number of front-ends are available, including future.apply and furrr.
The foreach package is another popular option with a number of available backends, including doFuture, which lets you use foreach as a frontend for future.
For a more comprehensive survey of parallel computing in R refer to the High Performance Computing Task View.
Code Examples
The following code examples were adapted from the Texas Advanced Computing Center (TACC) seminar "R for High Performance Computing" given through XSEDE (now ACCESS).
Below are a number of very simple examples to highlight how these frameworks can be included in your code. Nota Bene! The number of workers should be dynamically determined by asking LSF (the scheduler) how many cores you have reserved via the LSB_MAX_NUM_PROCESSORS environment variable. DO NOT use the parallel::detectCores() routine or anything similar, as this will oversubscribe your reservation and degrade both your code and any other code running on the same compute node. All the following examples will use the following example function (a sketch; slow_sqrt stands in for the original, which was not shown):
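    ## example function used throughout: simulate an expensive computation
    slow_sqrt <- function(x) {
      Sys.sleep(0.5)  # pretend this takes a while
      sqrt(x)
    }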
It is important not to use more cores than we've reserved:
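    ## size workers from LSF's reservation rather than detectCores()
    ncores <- as.integer(Sys.getenv("LSB_MAX_NUM_PROCESSORS", "1"))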
The future.apply package provides *apply functions that use future backends.
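For example (a sketch building on slow_sqrt and ncores above):

    library(future.apply)
    plan(multisession, workers = ncores)      # one R process per reserved core
    results <- future_sapply(1:8, slow_sqrt)  # parallel drop-in for sapply()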
The doFuture package makes it easy to write parallel loops:
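A sketch:

    library(foreach)
    library(doFuture)
    registerDoFuture()                    # use futures as the foreach backend
    plan(multisession, workers = ncores)
    results <- foreach(i = 1:8, .combine = c) %dopar% slow_sqrt(i)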
Scheduler Submission (Job) Script
If submitted via the terminal, the following batch submission script will submit your R code to the compute grid and will allocate 4 CPU cores for the work (as well as 5 GB of RAM, with a run time limit of 12 hrs). If your code is written as above, using LSB_MAX_NUM_PROCESSORS, then your code will detect that 4 cores have been allocated. (Please note that the second, long command may wrap on the page, but should be submitted as one line.)
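A sketch (queue name, file names, and memory syntax are placeholders for your site's conventions):

    #BSUB -q long
    #BSUB -n 4
    #BSUB -M 5G
    #BSUB -W 12:00
    Rscript my_analysis.R

Submit with bsub < my_r_job.sh, or equivalently as one long command:

    bsub -q long -n 4 -M 5G -W 12:00 Rscript my_analysis.R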
- Stata
Introduction
Stata/MP provides implicitly parallel implementations of many functions, which are documented in its 330-page Stata/MP Performance Report, describing which functions are parallelized and how efficient each is (how perfectly parallelized a given function is):
Stata/MP is the version of Stata that is programmed to take full advantage of multicore and multiprocessor computers. It is exactly like Stata/SE in all ways except that it distributes many of Stata’s most computationally demanding tasks across all the cores in your computer and thereby runs faster—much faster.
The results can be impressive. But a caveat:
With multiple cores, one might expect to achieve the theoretical upper bound of doubling the speed by doubling the number of cores—2 cores run twice as fast as 1, 4 run twice as fast as 2, and so on. However, there are three reasons why such perfect scalability cannot be expected: 1) some calculations have parts that cannot be partitioned into parallel processes; 2) even when there are parts that can be partitioned, determining how to partition them takes computer time; and 3) multicore/multiprocessor systems only duplicate processors and cores, not all the other system resources.
In general:
Stata/MP achieved 75% parallelization efficiency overall and 85% efficiency among estimation commands... Speed is more important for problems that are quantified as large in terms of the size of the dataset or some other aspect of the problem, such as the number of covariates. On large problems, Stata/MP with 2 cores runs half of Stata’s commands at least 1.7 times faster than on a single core. With 4 cores, the same commands run at least 2.4 times faster than on a single core. NOTE: This is already a drop to 60% efficiency on 4 cores.
How to Utilize This?
This parallelization benefit is mostly realized when running code in batch mode. If using Stata interactively, Stata is predominantly waiting for user input, and so the parallelization gains diminish rapidly. If one intends to do intense, focused work for short periods of time (up to a few days) and subsequently exit the software, choosing multiple cores is fine. But if you plan to run an interactive session over the course of a day or two, please select Stata-SE, as the multiple cores that you have requested are reserved only for you and will sit idle during this time, decreasing the resources available to other people.
No additional work is needed for you to utilize the multiple CPU cores in your code. Stata will handle this transparently for you. But you do need to ensure that you ask the compute grid to reserve the cores for your use:
Using NoMachine (interactive only):
From the Applications menu, select the Stata-SE menus for single-core or the Stata-MP4 menus for 4-core Stata. Under each, select the appropriate memory footprint for your work (see Choosing Resources). An example screenshot can be seen here. The wrapper scripts that drive these menu items include all the necessary commands to start Stata with the designated number of CPU cores within your session.
Using PAC (batch only):
Note that PAC is not yet available for Grid 2.5. Please contact RCS if you have any questions.
Using the command-line (interactive or batch):
Both interactive and batch jobs can be started from the command line, and both default wrapper scripts and custom submission scripts can be employed. (Again, wrapper scripts are great for out-of-the-box configurations; custom scripts are needed for atypical RAM or CPU requirements.) For example, sketches using Stata's standard executable names (queue names and the do-file name are placeholders):
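    # interactive 4-core Stata/MP session
    bsub -Is -q interactive -n 4 xstata-mp

    # batch 4-core Stata/MP job running a do-file
    bsub -q short -n 4 stata-mp -b do myanalysis.do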
Explicit Parallelization
Explicit parallelization in Stata can be achieved using the community-contributed parallel module. A sketch of the basic pattern (illustrative only; see the module's own documentation for proper usage):
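    * install once from SSC, then split the dataset across 4 child Stata processes
    ssc install parallel
    sysuse auto
    parallel setclusters 4
    parallel: replace price = price / 100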
GPU Computing
About GPU Computing
A GPU (graphics processing unit) is a processor that is great at handling specialized computations. We can contrast this to the Central Processing Unit (CPU), which is great at handling general computations. CPUs power most of the computations performed on the devices we use daily.
A GPU can be faster at completing tasks than a CPU, though this is not true in every case; performance hugely depends on the type of computation being performed. GPUs are great at tasks that can be run in parallel, and are often used for Machine Learning's 'embarrassingly parallel' tasks (i.e., a huge task that can be broken down into many smaller ones that are completely independent of one another).
-- Taken and adapted from Why Deep Learning Uses GPUs
The HBSGrid cluster has five NVIDIA Tesla V100 graphics processing units (GPUs) attached to one compute node. Computational workflows that make use of GPUs can see significant speedups in execution time, though one's code must be written using frameworks that will leverage these special resources (e.g. Tensorflow, PyTorch, etc). The GPU node is available for both interactive and batch sessions.
At this point in time, one must submit jobs via custom LSF commands. That is, wrapper scripts cannot be used to start interactive or batch jobs on the GPU node. Due to the unusual nature of the compute node's architecture, there are some decisions one must make that can affect how soon a job is dispatched and how efficient the processing can be, both of which are explained below.
Submitting Jobs
To request any GPU resources as part of your job submission, you must include the -gpu flag and options, and we recommend that you submit to the gpu queue (-q gpu). Your job can be either an interactive or a batch session, as your work requires. The easiest route is to use the default GPU configuration, -gpu -. For example (a sketch; any site-specific environment-loading step is omitted):
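    bsub -q gpu -Is -gpu - spyder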
This command will load the AI/Machine Learning Python environment and submit a job to the gpu queue with default GPU options that launches the Spyder IDE. The default GPU options are "num=1:mode=shared:mps=no:j_exclusive=no" (with quotes). If you wish to do something other than the default, simply indicate the option name and its preferred value, or supply the whole option string. For example (sketches):
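    bsub -q gpu -Is -gpu "num=2" spyder

OR

    bsub -q gpu -Is -gpu "num=2:mode=shared:mps=no:j_exclusive=no" spyder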
Again, as with all job submissions, one can specify the parameters on the command line or include them in a job submission script.
Note: In December 2020, we will begin phasing out separate AI/ML environments and instead include the most common frameworks (Tensorflow, Keras, OpenCV, etc) as a part of our standard python environments. Please see our Machine Learning page for more information.
Common GPU Options and Their Definitions
The full range of options for use of the GPU resources is documented on LSF's Submitting Jobs that Require GPU Resources page. These five options should handle most use cases (defaults are listed first in each option below; text is copied or paraphrased from the LSF page):
num=<number> (default: 1): The number of GPUs to request. Note that after your job is dispatched, no matter which GPUs are allocated, they will be indexed starting from 0. For security purposes, we enforce GPU sandboxing via Linux cgroups, so one cannot use an incorrect index.
aff=no | yes: CPU-GPU affinity. This indicates whether the job should enforce strict GPU-CPU affinity binding; that is, whether the CPU cores allocated are on the same socket (group of CPU cores) as the GPU. This GPU-CPU affinity translates to a higher communication rate, and thus better performance. If set to no, LSF does not bind the job's cores on the CPU socket to the GPU, but does ensure that the job is pinned to one or more cores (it does not bounce around, which would reduce performance) and that cgroups are active (the job is sandboxed). NOTE: due to the unusual nature of the compute node, if you request aff=yes and the node has filled the lower 48 cores, your job will not dispatch until some of the lower cores are released. This is because the upper 16 cores do not share a CPU socket with any GPU. If you wish to use aff=yes and are submitting an interactive job (and are concerned about immediate job dispatch), we advise that you first check how busy the GPU node is (e.g., with LSF's bhosts or lsload commands).

mode=shared | exclusive_process: The GPU mode while the job is running, either shared or exclusive_process. The shared mode corresponds to the NVIDIA DEFAULT compute mode: multiple processes can use the GPU simultaneously, and individual threads of each process may submit work to the GPU simultaneously. The exclusive_process mode corresponds to the NVIDIA EXCLUSIVE_PROCESS compute mode: the GPU is assigned to only one process at a time, though individual process threads may submit work to the GPU concurrently.

mps=no | yes: Enables or disables the NVIDIA Multi-Process Service (MPS) for the GPUs that are allocated to the job. We are not using this service at this time; if you have a need for it or feel that it should be in play, please contact RCS to consult with us.
j_exclusive=no | yes: Specifies whether the allocated GPUs can be used by other jobs. When the mode is set to exclusive_process, the j_exclusive=yes option is set automatically.

Further Resources
For more information, please see: