SLURM (Simple Linux Utility for Resource Management) is the go-to scheduler for many of the world’s most powerful supercomputers. It efficiently schedules and manages computational workloads across clusters of computers. Whether you’re new to SLURM or need a refresher, this cheat sheet covers the main commands and parameters you should know.

1. Basic Commands

  • sinfo: Displays the status of nodes and partitions.
  sinfo
  • squeue: Shows the status of jobs.
  squeue -u [username]
  • sbatch: Submits a job script for execution.
  sbatch my_script.sh
  • scancel: Cancels a pending or running job.
  scancel [job_id]
  • salloc: Allocates resources for an interactive session.
  salloc --nodes=1 --time=1:00:00
  • srun: Runs a command on allocated nodes.
  srun --pty bash
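
The commands above are often chained: submit a script, note its job ID, then check on it. The following is a minimal sketch of that workflow, not a definitive recipe. It assumes the placeholder script name my_script.sh from the list above, and the helper job_id_from_parsable is a hypothetical name introduced only for illustration. It relies on sbatch's --parsable flag, which prints the job ID, optionally followed by a semicolon and the cluster name.

```shell
#!/bin/bash
# Sketch: submit a job, remember its ID, and look it up in the queue.
# job_id_from_parsable is a hypothetical helper, not part of Slurm.

# `sbatch --parsable` prints "jobid" or "jobid;cluster"; keep only the ID.
job_id_from_parsable() {
    printf '%s\n' "${1%%;*}"
}

if command -v sbatch >/dev/null 2>&1 && [ -f my_script.sh ]; then
    job_id=$(job_id_from_parsable "$(sbatch --parsable my_script.sh)")
    squeue -j "$job_id"       # show the state of this job only
    # scancel "$job_id"       # uncomment to cancel the job
else
    echo "sbatch or my_script.sh not found; nothing submitted"
fi
```

Capturing the ID at submission time avoids having to search the squeue output for the right job later.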

2. Common Parameters for Job Submission

The following table contains common parameters that can be used in job scripts or with salloc/srun.

Short Option   Long Option       Description
-A             --account         Specifies the account to charge the job to.
-c             --cpus-per-task   Number of CPU cores per task.
-J             --job-name        Sets the name of the job.
-N             --nodes           Number of nodes required.
-n             --ntasks          Number of tasks to run.
-t             --time            Time limit for the job (e.g., 1:00:00 for 1 hour).
-p             --partition       Specifies the partition or queue.
-G             --gpus            Number of GPUs required.
-o             --output          Directs the job’s standard output to a file.
-e             --error           Directs the job’s standard error to a file.
(none)         --mem             Memory required per node (e.g., 4G for 4 gigabytes).
-C             --constraint      Specifies node feature constraints, like a specific GPU type.
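
In a job script, these parameters are usually written as #SBATCH directives at the top of the file. The sketch below shows one possible layout. The job name, resource values, and the describe_job function are illustrative assumptions; the SLURM_* variables are set by Slurm inside a running job, and the defaults let the script also run outside the scheduler. The %j in the file names is replaced with the job ID.

```shell
#!/bin/bash
# Illustrative resource requests; adjust every value for your cluster.
#SBATCH --job-name=demo
#SBATCH --nodes=1
#SBATCH --ntasks=1
#SBATCH --cpus-per-task=4
#SBATCH --mem=4G
#SBATCH --time=1:00:00
#SBATCH --output=demo_%j.out
#SBATCH --error=demo_%j.err

# describe_job is a hypothetical helper that reports what Slurm granted.
# The fallbacks after ":-" apply when the script runs outside a job.
describe_job() {
    echo "Job ${SLURM_JOB_ID:-unknown} (${SLURM_JOB_NAME:-unnamed}) with ${SLURM_CPUS_PER_TASK:-1} CPU(s) per task"
}

describe_job
```

Submitting this file with sbatch applies the directives; options given on the sbatch command line take precedence over the ones in the script.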

3. Tips and Tricks

  • Job Arrays: Submit similar jobs using arrays.
  sbatch --array=1-10 my_array_job.sh
  • Parallel Tasks: For parallel tasks, use srun inside your job script.
  srun my_parallel_program
  • Interactive GPU Session: For an interactive session with a GPU:
  salloc --gpus=1
  srun --pty bash
  • Node Status: To view detailed, per-node information, combine -l (a lowercase L, for long format) with -N (one line per node).
  sinfo -lN
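
Inside an array job, each task can tell which element it is from the SLURM_ARRAY_TASK_ID environment variable. The following sketch of my_array_job.sh assumes a hypothetical naming scheme of input_1.txt through input_10.txt; the function input_for_task is likewise only an illustration. In the output file name, %A stands for the array's job ID and %a for the task index.

```shell
#!/bin/bash
#SBATCH --array=1-10
#SBATCH --output=array_%A_%a.out

# Hypothetical mapping from an array index to an input file name.
input_for_task() {
    printf 'input_%s.txt\n' "$1"
}

# Slurm sets SLURM_ARRAY_TASK_ID per task; default to 1 outside a job.
task_id="${SLURM_ARRAY_TASK_ID:-1}"
echo "Task ${task_id} would process $(input_for_task "$task_id")"
```

Each of the ten tasks runs the same script, so the index is the only thing that distinguishes their work.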

4. Understanding Partitions in SLURM

In SLURM, a partition is essentially a group of nodes configured for specific types of jobs. Think of them as queues; you submit your job to a queue, and SLURM schedules it based on the rules and resources of that queue. Partitions can be configured based on many factors, including:

  • Priority: Some partitions might be configured for high-priority jobs.
  • Resource Types: Partitions could be specifically for GPU jobs, high memory jobs, etc.
  • User Groups: Some partitions might be reserved for specific user groups or departments.
  • Job Duration: Short jobs might have a different partition than long-running jobs.

You can specify the partition using the -p or --partition flag. Use sinfo to see available partitions and their statuses.
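
In sinfo output, the default partition is marked with a trailing asterisk. As a small sketch of how that could be used, the hypothetical helper default_partition below reads partition names on standard input and prints the default one without the marker. It assumes the input looks like the output of sinfo with the %P format field.

```shell
#!/bin/bash
# default_partition is a hypothetical helper, not a Slurm command.
# It expects one partition name per line, e.g. from `sinfo --format=%P`.
default_partition() {
    while read -r name; do
        case "$name" in
            # The default partition carries a trailing "*"; strip it.
            *\*) printf '%s\n' "${name%\*}" ;;
        esac
    done
}

if command -v sinfo >/dev/null 2>&1; then
    sinfo --noheader --format="%P" | default_partition
else
    echo "sinfo not found; run this on a cluster login node"
fi
```

Jobs submitted without -p go to this default partition, so it is worth knowing which one it is before relying on it.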

5. GPU Requests Examples

5.1 Requesting a Specific GPU Memory Size

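To request a GPU with a specific memory size, say 32GB, you can use the --constraint option. Constraint values are feature names defined by your cluster’s administrators, so gpu_mem=32GB below is only an illustration; running sinfo -o "%N %f" lists the features each node actually advertises.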

sbatch --gres=gpu:1 --constraint="gpu_mem=32GB" my_gpu_script.sh

5.2 Requesting Multiple GPUs

sbatch --gres=gpu:4 my_multi_gpu_script.sh

5.3 Requesting Specific GPU Type

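If your cluster has several GPU types, you can request a specific one with the --constraint option. The feature name used below is site-specific; some clusters instead expose the GPU type through --gres, as in --gres=gpu:v100:1.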

sbatch --gres=gpu:1 --constraint=gpu_type:V100 my_script.sh

5.4 Requesting a Node with High Memory

sbatch --mem=256G my_high_memory_script.sh
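
The same requests can be placed in the job script itself, which also makes it easy to check what was granted. The sketch below assumes a cluster where Slurm sets CUDA_VISIBLE_DEVICES for jobs that request GPUs through --gres; that depends on the site configuration. The count_gpus helper and the resource values are illustrative, not part of Slurm.

```shell
#!/bin/bash
#SBATCH --gres=gpu:1
#SBATCH --mem=16G
#SBATCH --time=0:30:00

# count_gpus is a hypothetical helper: it counts the comma-separated
# entries in a device list such as "0,1,2,3".
count_gpus() {
    local devices="$1"
    if [ -z "$devices" ]; then
        echo 0
        return
    fi
    local IFS=','
    # Split the list on commas into positional parameters.
    set -- $devices
    echo "$#"
}

# Assumption: CUDA_VISIBLE_DEVICES is set by Slurm on GPU nodes.
echo "Visible GPUs: $(count_gpus "${CUDA_VISIBLE_DEVICES:-}")"
```

Printing the count at the start of a job gives an early signal when a GPU request was not honored as expected.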

6. Best Practices

  • Always monitor your jobs with squeue to ensure they are running as expected.
  • Optimize resource requests. Asking for more than you need can delay job starts, while asking for too little can lead to job failures (for example, when a job exceeds its time or memory limit).
  • Always read the documentation of the specific cluster you’re working on. SLURM configurations can vary!

7. Closing Thoughts

SLURM provides an efficient way to harness the power of high-performance computing clusters. With the commands and parameters covered in this cheat sheet, you’ll be well on your way to effectively submitting, monitoring, and managing your computational tasks.

If you want to review some structured Slurm tutorials, please refer to Slurm’s official tutorials. For more of our infrastructure-related blog posts, please refer to infra.

Happy computing!

[Credit: The featured image is proudly generated by Midjourney]
