Slurm is a queuing system used for cluster/resource management and job scheduling. Slurm is responsible for allocating resources to users, providing a framework for starting, executing and monitoring work on allocated resources, and scheduling work for future execution.
Job Submission Commands
sbatch
sbatch submits a batch script to Slurm. The batch script may be given to sbatch through a file name on the command line, or if no file name is specified, sbatch will read in a script from standard input. The batch script may contain options preceded with “#SBATCH” before any executable commands in the script.
sbatch exits immediately after the script is successfully transferred to the Slurm controller and assigned a Slurm job ID. The batch script is not necessarily granted resources immediately. Rather it may sit in the queue of pending jobs for some time before its required resources become available.
When a job is submitted, Slurm responds with the job’s ID. This JOBID is used to identify this job in reports from Slurm.
login:~> sbatch run_example
Submitted batch job 1234567
Several example submission scripts can be found in the /nfs/apps/Submission/ directory. Also, two documented examples are provided later in this guide.
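For reference, a minimal batch script might look like the following. The job name, time limit, and program name are placeholders to be replaced with your own values, and the general partition is used as in the interactive examples below.

#!/bin/bash
#SBATCH --job-name=example        # Job name shown in squeue output
#SBATCH -N 1                      # Request one compute node
#SBATCH -p general                # Partition to run in
#SBATCH --time=01:00:00           # One-hour wall-clock limit
#SBATCH -o slurm-%j.out           # Output file; %j expands to the job ID

echo "Job $SLURM_JOB_ID running on $(hostname)"
srun ./my_program                 # Placeholder executable

The script can then be submitted with "sbatch" as shown above.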
salloc
salloc is used to allocate resources for a job in real time by spawning a shell on a compute node. This is known as an interactive job and the shell can then be used to launch parallel tasks. Unlike sbatch, salloc will block until the allocation is made. To exit a salloc session, type “exit” at the command prompt and press Enter.
login:~> salloc -N 1 -p general
salloc: Granted job allocation 1234567
salloc: Waiting for resource configuration
salloc: Nodes n0001 are ready for your job
n0001:~> exit
The “-N 1” flag specifies that one compute node will be used for the interactive session. The “-p general” flag is used to launch the job in the general partition.
srun
The srun command is very similar to salloc in that it can be used to submit a job for real-time execution. Typically srun is used to submit specific commands to the queuing system. The srun command will block until the command has been executed.
login:~> srun -N 1 -p general my_script.sh
An interactive shell can also be started by using the "--pty bash" flags.
login:~> srun -N 1 -p general --pty bash
n0001:~>
The srun command can also be used inside an sbatch script to launch multiple programs at once within a single allocation. This is considered an advanced technique and should be tested before being used in a large job.
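As a sketch of this pattern, the following script (with placeholder program names) launches two job steps concurrently across a two-node allocation and waits for both to finish:

#!/bin/bash
#SBATCH -N 2                      # Two nodes for the whole allocation
#SBATCH -p general                # Partition to run in

srun -N 1 -n 1 ./program_a &      # First job step on one node, in the background
srun -N 1 -n 1 ./program_b &      # Second job step on another node
wait                              # Do not exit until both steps complete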
Jobs with Graphical User Interfaces
To run a program with a graphical user interface through the queuing system, it is recommended to use either the srun or salloc commands in conjunction with the "--x11" flag inside of a CAP Client GUI Session. Exporting the DISPLAY variable is no longer supported on Joule 3.0.
As an example, the following command could be used to launch Paraview while using the post partition:
login:~> salloc --x11 -N1 -p post
salloc: Granted job allocation 1234567
salloc: Waiting for resource configuration
salloc: Nodes post01 are ready for your job
post01:~> module load paraview
post01:~> vglrun paraview
For programs that rely on hardware acceleration, it is necessary to prepend the name of the program with vglrun. Unlike Joule 2.0, vglrun is not offered as an Environment Module; instead, it is only available in the default environment on login servers and post partition nodes.
Monitoring Jobs
squeue
Jobs can be monitored by using the squeue command. To view only your own jobs, use "squeue -u your_username". Please do not script or "watch" squeue, as this can put unnecessary load on the queuing system.
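In addition to filtering by user, squeue can report on a single job or estimate when pending jobs will start; for example:

login:~> squeue -u your_username
login:~> squeue -j 1234567
login:~> squeue --start

These are standard Slurm options; consult the squeue man page for the full list.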
Email Notifications
Users can receive notifications when a job starts, ends, or fails. To enable this functionality, the following lines can be added to an sbatch script:
#SBATCH --mail-type=begin,end,fail
#SBATCH --mail-user=first.last@netl.doe.gov
Note: Only NETL email addresses can be used for these notifications.
Canceling Jobs
scancel
Jobs can be stopped by using the scancel command followed by the appropriate JOBID.
login:~> scancel $JOBID
Holding and Releasing Jobs
A pending job can be prevented from starting by placing a hold on it:
login:~> scontrol hold $JOBID
The hold can then be released with:
login:~> scontrol release $JOBID
Slurm Documentation
For additional details related to Slurm commands, please consult the documentation at https://slurm.schedmd.com/man_index.html.