Ubuntu HPC Slurm Resource Management Setup


Slurm Resource Management Setup

For setting up accounting and database support, refer to: Ubuntu HPC Slurm DB setup for Slurm accounting

This section covers only the resource-management side of Slurm: accounts, QOS, partitions, and per-user resource limits.

1. Create and Register the Cluster

Run the following on the Slurm master node:

sacctmgr -i add cluster test_hpc-cluster

Note: If the cluster already exists, the command reports this and makes no changes.

To verify:

sacctmgr list cluster

This will show the cluster name added in the previous command.
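
If the full listing is too wide, you can restrict the columns (Cluster, ControlHost and ControlPort are standard sacctmgr format fields; your controller host will differ):

sacctmgr list cluster format=cluster,controlhost,controlport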

2. Create Account

Create an account associated with the cluster:

sacctmgr add account account1 Cluster=test_hpc-cluster

When prompted with:

Would you like to commit changes?

Type `y` and press Enter.
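
To confirm the account exists and is associated with the cluster, the following standard sacctmgr queries can be used:

sacctmgr show account account1 format=account,description,organization
sacctmgr show assoc format=cluster,account,user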

3. Create QOS Levels

List existing QOS:

sacctmgr show qos format=name,priority,GrpTRES

Create new QOS levels:

sacctmgr add qos example1
sacctmgr add qos example2

Verify:

sacctmgr show qos format=name,priority,GrpTRES
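
Example output (column widths may differ; the built-in `normal` QOS created by Slurm also appears, and new QOS entries start with priority 0 and no GrpTRES):

      Name   Priority       GrpTRES
---------- ---------- -------------
    normal          0
  example1          0
  example2          0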

4. Update slurm.conf for Resource Management

Edit your shared `slurm.conf` file:

vim /export/tmp/slurm/slurm.conf

Ensure the following lines are present or updated:

AccountingStorageEnforce=associations,limits,qos,safe
GresTypes=gpu
AccountingStorageTRES=gres/gpu
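
The partition used later on this page (Test_partition1) must also be defined in `slurm.conf`. A minimal sketch, assuming the node names node1-node4 used in the GRES examples below (adapt Nodes, Default and MaxTime to your site):

PartitionName=Test_partition1 Nodes=node[1-4] Default=NO MaxTime=INFINITE State=UP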

Then copy this config to all nodes and restart services:

Master:

cp /export/tmp/slurm/slurm.conf /etc/slurm/slurm.conf
systemctl restart slurmctld

Compute Nodes:

cp /export/tmp/slurm/slurm.conf /etc/slurm/slurm.conf
systemctl restart slurmd

Login Nodes:

cp /export/tmp/slurm/slurm.conf /etc/slurm/slurm.conf
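
If passwordless SSH as root (or a user that can write /etc/slurm and restart services) is available from the master, a small loop can push the file and restart slurmd on the compute nodes. A sketch, assuming the hostnames node1-node4:

for node in node1 node2 node3 node4; do
    scp /export/tmp/slurm/slurm.conf ${node}:/etc/slurm/slurm.conf
    ssh ${node} systemctl restart slurmd
done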

5. Add User to Account and QOS

sacctmgr add user test_user Accounts=account1 Partitions=Test_partition1 QOSLevel=example1

Verify:

sacctmgr show assoc format=cluster,user,qos

Expected output:

Cluster             User        QOS
----------------    ---------   --------
test_hpc-cluster    test_user   example1

Configure GPU Resources (GRES)

To enable GPU (Graphics Processing Unit) resource tracking and allocation in SLURM, you must configure the `gres.conf` file on all nodes that have GPU devices.

Step 1: Identify GPU Devices and Minor Numbers

Use the following command to list GPU Bus IDs and their minor numbers:

nvidia-smi -q | grep -i -e 'Bus Id' -e Minor

Example output:

Minor Number   : 0
Bus Id         : 00000000:01:00.0
Minor Number   : 1
Bus Id         : 00000000:41:00.0
Minor Number   : 2
Bus Id         : 00000000:81:00.0
Minor Number   : 3
Bus Id         : 00000000:C1:00.0

Each GPU is exposed to SLURM via a device file at `/dev/nvidiaX`, where `X` is the minor number.
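
You can cross-check that a device file exists for each minor number reported above:

ls -l /dev/nvidia[0-9]*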

Step 2: Create the `gres.conf` File

Create or edit the file `/etc/slurm/gres.conf` on all relevant nodes using the following format:

NodeName=<node_name> Name=gpu Type=<gpu_model> File=/dev/nvidia<minor_number>

Example `gres.conf` file:

# Node1 has 2x P100 GPUs
NodeName=node1 Name=gpu Type=p100 File=/dev/nvidia0
NodeName=node1 Name=gpu Type=p100 File=/dev/nvidia1

# Node2 has 1x V100 GPU
NodeName=node2 Name=gpu Type=v100 File=/dev/nvidia0

# Node3 has 4x A100 GPUs
NodeName=node3 Name=gpu Type=a100 File=/dev/nvidia0
NodeName=node3 Name=gpu Type=a100 File=/dev/nvidia1
NodeName=node3 Name=gpu Type=a100 File=/dev/nvidia2
NodeName=node3 Name=gpu Type=a100 File=/dev/nvidia3

# Node4 has 4x A100 GPUs
NodeName=node4 Name=gpu Type=a100 File=/dev/nvidia0
NodeName=node4 Name=gpu Type=a100 File=/dev/nvidia1
NodeName=node4 Name=gpu Type=a100 File=/dev/nvidia2
NodeName=node4 Name=gpu Type=a100 File=/dev/nvidia3

Save this file as `/export/tmp/slurm/gres.conf` for central access, then copy it to the master, all compute nodes, and the login node.
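
For these GRES entries to be schedulable, matching Gres= counts must also be declared on the NodeName lines in `slurm.conf`. A sketch based on the node/GPU layout above (CPU and memory values are placeholders; keep your existing node definitions and only add the Gres= part):

NodeName=node1 Gres=gpu:p100:2 CPUs=32 RealMemory=128000 State=UNKNOWN
NodeName=node2 Gres=gpu:v100:1 CPUs=32 RealMemory=128000 State=UNKNOWN
NodeName=node3 Gres=gpu:a100:4 CPUs=64 RealMemory=256000 State=UNKNOWN
NodeName=node4 Gres=gpu:a100:4 CPUs=64 RealMemory=256000 State=UNKNOWN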

Step 3: Deploy `gres.conf` to All Nodes

Copy the `gres.conf` file to all compute and controller nodes:

sudo cp /export/tmp/slurm/gres.conf /etc/slurm/gres.conf

Restart SLURM services:

# On compute nodes:
sudo systemctl restart slurmd

# On master/controller node:
sudo systemctl restart slurmctld

Step 4: Verify GPU Resources in SLURM

Use the following command to ensure SLURM sees the GPU resources correctly:

sinfo --format="%P %n %f %G" --Node

Example output:

PARTITION HOSTNAMES AVAIL_FEATURES GRES
gpu-p100  node1     (null)         gpu:p100:2
gpu*      node1     (null)         gpu:p100:2
cpu       node1     (null)         gpu:p100:2
gpu-v100  node2     (null)         gpu:v100:1
gpu*      node2     (null)         gpu:v100:1
cpu       node2     (null)         gpu:v100:1
gpu*      node3     (null)         gpu:a100:4
cpu       node3     (null)         gpu:a100:4
gpu-a100  node3     (null)         gpu:a100:4
gpu*      node4     (null)         gpu:a100:4
cpu       node4     (null)         gpu:a100:4
gpu-a100  node4     (null)         gpu:a100:4
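
You can also inspect an individual node; the Gres value reported by scontrol should match gres.conf:

scontrol show node node1 | grep -i gres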


Step 5: Test GPU Resource Selection with SLURM

You can test if GPU selection is working correctly using:

srun --nodelist=node1 --gres=gpu:p100:2 --pty bash

Inside the session:

echo $CUDA_VISIBLE_DEVICES

Expected output:

0,1

This indicates that SLURM correctly allocated two GPUs for the job.
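
The same request can be made non-interactively. A minimal batch-script sketch (the script name, job name and output file are assumptions):

#!/bin/bash
#SBATCH --job-name=gpu-test
#SBATCH --nodelist=node1
#SBATCH --gres=gpu:p100:2
#SBATCH --output=gpu-test.%j.out

# GPUs assigned to this job by Slurm
echo $CUDA_VISIBLE_DEVICES
nvidia-smi -L

Submit it with:

sbatch gpu-test.sh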

6. Apply Per-User Resource Limits using QOS

Limit CPU and GPU usage per user:

sacctmgr modify qos example1 set MaxTRESPerUser=cpu=2,gres/gpu=1

Limit maximum wall time:

sacctmgr modify qos example1 set MaxWall=12:00:00

Set maximum number of jobs per account:

sacctmgr modify qos example1 set MaxJobsPA=10
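
To check the limits that are now in force on the QOS (MaxTRESPU, MaxWall and MaxJobsPA are standard sacctmgr format fields):

sacctmgr show qos example1 format=name,maxtrespu,maxwall,maxjobspa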

Refer to official documentation: https://slurm.schedmd.com/resource_limits.html

7. Test the QOS and Limits

Try submitting a job that exceeds the per-user GPU limit (request the QOS explicitly so the limit applies):

srun --gres=gpu:2 --cpus-per-task=1 --account=account1 --qos=example1 -p Test_partition1 --pty bash

Expected message (queued due to limit):

srun: job 216 queued and waiting for resources
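
To see why the job is held back, check its pending reason; with the limits above it is typically a QOS-related reason (the exact string depends on the Slurm version):

squeue -u test_user -o "%.8i %.10P %.10T %r"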

8. Manage Node States

Check node status:

sinfo -l

If any node is `DOWN` or `DRAINED`, bring it back up:

scontrol update NodeName=<node-name> state=idle
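
Conversely, to take a node out of service for maintenance (the reason text is free-form):

scontrol update NodeName=<node-name> state=drain reason="maintenance"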
