SLURM (Simple Linux Utility for Resource Management) is the go-to scheduler for many of the world’s most powerful supercomputers. It efficiently schedules and manages computational workloads across clusters of computers. Whether you’re new to SLURM or need a refresher, this cheat sheet covers the main commands and parameters you should know.
1. Basic Commands
- sinfo: Displays the status of nodes and partitions.
sinfo
- squeue: Shows the status of jobs.
squeue -u [username]
- sbatch: Submits a job script for execution.
sbatch my_script.sh
- scancel: Cancels a pending or running job.
scancel [job_id]
- salloc: Allocates resources for an interactive session.
salloc --nodes=1 --time=1:00:00
- srun: Runs a command on allocated nodes.
srun --pty bash
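The commands above come together in a batch script submitted with sbatch. Below is a minimal sketch; the job name, time limit, and output filename pattern are placeholder choices you should adapt to your cluster:

```shell
#!/bin/bash
#SBATCH --job-name=hello        # name shown in squeue
#SBATCH --nodes=1               # run on a single node
#SBATCH --ntasks=1              # a single task
#SBATCH --time=00:05:00         # 5-minute time limit
#SBATCH --output=hello_%j.out   # %j expands to the job ID

# The #SBATCH lines are comments to bash but directives to SLURM.
HOST="$(hostname)"
echo "Running on host: ${HOST}"
```

Submit it with `sbatch my_script.sh`; when the job runs, its output lands in `hello_<jobid>.out`.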
2. Common Parameters for Job Submission
The following table contains common parameters that can be used in job scripts or with salloc/srun.
Abbreviated Command | Full Command | Description |
---|---|---|
-A | --account | Specifies the account for job charging. |
-c | --cpus-per-task | Number of CPU cores per task. |
-J | --job-name | Sets the name of the job. |
-N | --nodes | The number of nodes required. |
-n | --ntasks | Total number of tasks (processes) to run. |
-t | --time | Time limit for the job (e.g., 1:00:00 for 1 hour). |
-p | --partition | Specifies the partition or queue. |
-G | --gpus | Number of GPUs required. |
-o | --output | Directs job’s standard output to a file. |
-e | --error | Directs job’s standard error to a file. |
 | --mem | Memory required per node (e.g., 4G for 4 gigabytes). |
-C | --constraint | Specifies node feature constraints, like a specific GPU type. |
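The table's parameters are most often set as #SBATCH directives in a job script. The sketch below combines several of them; the account and partition names are hypothetical and the resource amounts are placeholders:

```shell
#!/bin/bash
#SBATCH --job-name=demo
#SBATCH --account=my_account    # hypothetical account; use your site's
#SBATCH --partition=compute     # hypothetical partition name
#SBATCH --nodes=1
#SBATCH --ntasks=4
#SBATCH --cpus-per-task=2
#SBATCH --time=01:00:00
#SBATCH --mem=4G
#SBATCH --output=demo_%j.out
#SBATCH --error=demo_%j.err

# Inside a job, SLURM exports SLURM_NTASKS; default to 0 when the
# script is run outside an allocation.
NTASKS="${SLURM_NTASKS:-0}"
echo "Job configured for ${NTASKS} task(s)"
```

Command-line flags passed to sbatch, salloc, or srun override the corresponding #SBATCH directives, so the same script can be reused with different resources.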
3. Tips and Tricks
- Job Arrays: Submit similar jobs using arrays.
sbatch --array=1-10 my_array_job.sh
- Parallel Tasks: Use srun inside your job script to launch parallel tasks.
srun my_parallel_program
- Interactive GPU Session: For an interactive session with a GPU:
salloc --gpus=1
srun --pty bash
- Node Status: To view detailed information about nodes, you can combine -l (a lowercase L) with -N.
sinfo -lN
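To make the job-array tip concrete, here is a sketch of what my_array_job.sh might contain; the data_N.txt input-naming scheme is a hypothetical example:

```shell
#!/bin/bash
#SBATCH --job-name=array_demo
#SBATCH --array=1-10
#SBATCH --output=array_%A_%a.out  # %A = parent job ID, %a = array index

# SLURM sets SLURM_ARRAY_TASK_ID to a different value (1..10) in each
# array task; default to 1 so the script also runs outside SLURM.
TASK_ID="${SLURM_ARRAY_TASK_ID:-1}"
INPUT_FILE="data_${TASK_ID}.txt"  # hypothetical per-task input file
echo "Task ${TASK_ID} processes ${INPUT_FILE}"
```

Each of the ten array tasks runs this same script with its own SLURM_ARRAY_TASK_ID, which is the usual way to fan one script out over many input files.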
4. Understanding Partitions in SLURM
In SLURM, a partition is essentially a group of nodes configured for specific types of jobs. Think of them as queues; you submit your job to a queue, and SLURM schedules it based on the rules and resources of that queue. Partitions can be configured based on many factors, including:
- Priority: Some partitions might be configured for high-priority jobs.
- Resource Types: Partitions could be specifically for GPU jobs, high memory jobs, etc.
- User Groups: Some partitions might be reserved for specific user groups or departments.
- Job Duration: Short jobs might have a different partition than long-running jobs.
You can specify the partition using the -p or --partition flag. Use sinfo to see available partitions and their statuses.
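As a sketch of the queue idea, a submission wrapper might pick a partition based on the expected runtime; the "short" and "long" partition names here are hypothetical, so list your cluster's real ones with sinfo first:

```shell
#!/bin/bash
# Pick a partition based on expected runtime (partition names are
# hypothetical; list the real ones with `sinfo`).
RUNTIME_HOURS=2
if [ "$RUNTIME_HOURS" -le 4 ]; then
    PARTITION="short"
else
    PARTITION="long"
fi
# Print the command instead of running it (dry run).
echo "sbatch --partition=${PARTITION} --time=${RUNTIME_HOURS}:00:00 my_script.sh"
```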
5. GPU Requests Examples
5.1 Requesting a Specific GPU Memory Size
To request a GPU with a specific memory size, say 32GB, you can use the --constraint option (the available constraint names depend on your cluster's configuration).
sbatch --gres=gpu:1 --constraint="gpu_mem=32GB" my_gpu_script.sh
5.2 Requesting Multiple GPUs
sbatch --gres=gpu:4 my_multi_gpu_script.sh
5.3 Requesting Specific GPU Type
If your cluster has multiple GPU types, you can request a specific one with the --constraint option (feature names vary by site).
sbatch --gres=gpu:1 --constraint=gpu_type:V100 my_script.sh
5.4 Requesting a Node with High Memory
sbatch --mem=256G my_high_memory_script.sh
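Putting the GPU examples together, a training job script might look like the following sketch; the resource amounts are placeholders, and the script only reports the GPUs SLURM exposes to it:

```shell
#!/bin/bash
#SBATCH --job-name=gpu_train
#SBATCH --gres=gpu:2            # two GPUs on one node
#SBATCH --mem=64G               # placeholder memory request
#SBATCH --time=04:00:00
#SBATCH --output=gpu_train_%j.out

# Inside an allocation, SLURM sets CUDA_VISIBLE_DEVICES to the granted
# GPU indices; default to "none" so this also runs outside SLURM.
GPUS="${CUDA_VISIBLE_DEVICES:-none}"
echo "Visible GPUs: ${GPUS}"
```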
6. General Tips
- Always monitor your jobs with squeue to ensure they are running as expected.
- Optimize your resource requests: asking for more than you need can delay job starts, while asking for too little can lead to job failures.
- Always read the documentation of the specific cluster you’re working on. SLURM configurations can vary!
7. Closing Thoughts
SLURM provides an efficient way to harness the power of high-performance computing clusters. With the commands and parameters covered in this cheat sheet, you’ll be well on your way to effectively submitting, monitoring, and managing your computational tasks.
If you want to review some structured Slurm tutorials, please refer to Slurm’s official tutorials. For more of our infrastructure-related blogs, please refer to infra.
Happy computing!
[Credit: The featured image is proudly generated by Midjourney]