SLURM (Simple Linux Utility for Resource Management) is the go-to scheduler for many of the world’s most powerful supercomputers. It efficiently schedules and manages computational workloads across clusters of computers. Whether you’re new to SLURM or need a refresher, this cheat sheet covers the main commands and parameters you should know.
1. Basic Commands
- sinfo: Displays the status of nodes and partitions.
sinfo
- squeue: Shows the status of jobs.
squeue -u [username]
- sbatch: Submits a job script for execution.
sbatch my_script.sh
- scancel: Cancels a pending or running job.
scancel [job_id]
- salloc: Allocates resources for an interactive session.
salloc --nodes=1 --time=1:00:00
- srun: Runs a command on allocated nodes.
srun --pty bash
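The commands above come together in a batch script submitted with sbatch. Below is a minimal sketch; the job name, time limit, and output filename pattern are placeholder choices you should adapt to your cluster:

```shell
#!/bin/bash
#SBATCH --job-name=hello        # name shown in squeue
#SBATCH --nodes=1               # run on a single node
#SBATCH --ntasks=1              # a single task
#SBATCH --time=00:05:00         # 5-minute time limit
#SBATCH --output=hello_%j.out   # %j expands to the job ID

# The #SBATCH lines are comments to bash but directives to SLURM.
HOST="$(hostname)"
echo "Running on host: ${HOST}"
```

Submit it with `sbatch my_script.sh`; when the job runs, its output lands in `hello_<jobid>.out`.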
2. Common Parameters for Job Submission
The following table contains common parameters that can be used in job scripts or with salloc/srun.
Abbreviated Command | Full Command | Description |
---|---|---|
-A | --account | Specifies the account for job charging. |
-c | --cpus-per-task | Number of CPU cores per task. |
-J | --job-name | Sets the name of the job. |
-N | --nodes | The number of nodes required. |
-n | --ntasks | Total number of tasks (processes) to run. |
-t | --time | Time limit for the job (e.g., 1:00:00 for 1 hour). |
-p | --partition | Specifies the partition or queue. |
-G | --gpus | Number of GPUs required. |
-o | --output | Directs job’s standard output to a file. |
-e | --error | Directs job’s standard error to a file. |
 | --mem | Memory required per node (e.g., 4G for 4 gigabytes). |
-C | --constraint | Specifies node feature constraints, like a specific GPU type. |
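The table's parameters are most often set as #SBATCH directives in a job script. The sketch below combines several of them; the account and partition names are hypothetical and the resource amounts are placeholders:

```shell
#!/bin/bash
#SBATCH --job-name=demo
#SBATCH --account=my_account    # hypothetical account; use your site's
#SBATCH --partition=compute     # hypothetical partition name
#SBATCH --nodes=1
#SBATCH --ntasks=4
#SBATCH --cpus-per-task=2
#SBATCH --time=01:00:00
#SBATCH --mem=4G
#SBATCH --output=demo_%j.out
#SBATCH --error=demo_%j.err

# Inside a job, SLURM exports SLURM_NTASKS; default to 0 when the
# script is run outside an allocation.
NTASKS="${SLURM_NTASKS:-0}"
echo "Job configured for ${NTASKS} task(s)"
```

Command-line flags passed to sbatch, salloc, or srun override the corresponding #SBATCH directives, so the same script can be reused with different resources.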
3. Tips and Tricks
- Job Arrays: Submit similar jobs using arrays.
sbatch --array=1-10 my_array_job.sh
- Parallel Tasks: Use srun inside your job script to launch parallel tasks.
srun my_parallel_program
- Interactive GPU Session: For an interactive session with a GPU:
salloc --gpus=1
srun --pty bash
- Node Status: To view detailed information about nodes, you can combine -l (a lowercase L) with -N.
sinfo -lN
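To make the job-array tip concrete, here is a sketch of what my_array_job.sh might contain; the data_N.txt input-naming scheme is a hypothetical example:

```shell
#!/bin/bash
#SBATCH --job-name=array_demo
#SBATCH --array=1-10
#SBATCH --output=array_%A_%a.out  # %A = parent job ID, %a = array index

# SLURM sets SLURM_ARRAY_TASK_ID to a different value (1..10) in each
# array task; default to 1 so the script also runs outside SLURM.
TASK_ID="${SLURM_ARRAY_TASK_ID:-1}"
INPUT_FILE="data_${TASK_ID}.txt"  # hypothetical per-task input file
echo "Task ${TASK_ID} processes ${INPUT_FILE}"
```

Each of the ten array tasks runs this same script with its own SLURM_ARRAY_TASK_ID, which is the usual way to fan one script out over many input files.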
4. Understanding Partitions in SLURM
In SLURM, a partition is essentially a group of nodes configured for specific types of jobs. Think of them as queues; you submit your job to a queue, and SLURM schedules it based on the rules and resources of that queue. Partitions can be configured based on many factors, including:
- Priority: Some partitions might be configured for high-priority jobs.
- Resource Types: Partitions could be specifically for GPU jobs, high memory jobs, etc.
- User Groups: Some partitions might be reserved for specific user groups or departments.
- Job Duration: Short jobs might have a different partition than long-running jobs.
You can specify the partition using the -p or --partition flag. Use sinfo to see available partitions and their statuses.
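As a sketch of the queue idea, a submission wrapper might pick a partition based on the expected runtime; the "short" and "long" partition names here are hypothetical, so list your cluster's real ones with sinfo first:

```shell
#!/bin/bash
# Pick a partition based on expected runtime (partition names are
# hypothetical; list the real ones with `sinfo`).
RUNTIME_HOURS=2
if [ "$RUNTIME_HOURS" -le 4 ]; then
    PARTITION="short"
else
    PARTITION="long"
fi
# Print the command instead of running it (dry run).
echo "sbatch --partition=${PARTITION} --time=${RUNTIME_HOURS}:00:00 my_script.sh"
```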
5. GPU Requests Examples
5.1 Requesting a Specific GPU Memory Size
To request a GPU with a specific memory size, say 32GB, you can use the --constraint option (the available constraint names depend on your cluster's configuration).
sbatch --gres=gpu:1 --constraint="gpu_mem=32GB" my_gpu_script.sh
5.2 Requesting Multiple GPUs
sbatch --gres=gpu:4 my_multi_gpu_script.sh
5.3 Requesting Specific GPU Type
If your cluster has multiple GPU types, you can request a specific one with the --constraint option (feature names vary by site).
sbatch --gres=gpu:1 --constraint=gpu_type:V100 my_script.sh
5.4 Requesting a Node with High Memory
sbatch --mem=256G my_high_memory_script.sh
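Putting the GPU examples together, a training job script might look like the following sketch; the resource amounts are placeholders, and the script only reports the GPUs SLURM exposes to it:

```shell
#!/bin/bash
#SBATCH --job-name=gpu_train
#SBATCH --gres=gpu:2            # two GPUs on one node
#SBATCH --mem=64G               # placeholder memory request
#SBATCH --time=04:00:00
#SBATCH --output=gpu_train_%j.out

# Inside an allocation, SLURM sets CUDA_VISIBLE_DEVICES to the granted
# GPU indices; default to "none" so this also runs outside SLURM.
GPUS="${CUDA_VISIBLE_DEVICES:-none}"
echo "Visible GPUs: ${GPUS}"
```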
6. General Tips
- Always monitor your jobs with squeue to ensure they are running as expected.
- Optimize your resource requests: asking for more than you need can delay job starts, while asking for too little can lead to job failures.
- Always read the documentation of the specific cluster you’re working on. SLURM configurations can vary!
7. Closing Thoughts
SLURM provides an efficient way to harness the power of high-performance computing clusters. With the commands and parameters covered in this cheat sheet, you’ll be well on your way to effectively submitting, monitoring, and managing your computational tasks.
If you want to review some structured Slurm tutorials, please refer to Slurm’s official tutorials. For more of our infrastructure-related blogs, please refer to infra.
Happy computing!
[Credit: The featured image is proudly generated by Midjourney]