Install slurmctld on master node


Home > Ubuntu > Ubuntu HPC setup with slurm and linux containers > Install slurmctld on master node

Slurm Controller Daemon (slurmctld) Setup on slurm-master (LXC)

Note: To access the shell of any Linux container (e.g., slurm-master), run the following command from the infra node:

 lxc exec <container-name> bash 

Example:

 lxc exec slurm-master bash 

1. Install Required Packages

Run the following command inside the slurm-master container:

 sudo apt install munge slurmctld 

This installs:

  • munge – for authentication between Slurm components.
  • slurmctld – the Slurm controller daemon.
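
Munge only authenticates successfully when every node in the cluster shares the same /etc/munge/munge.key, so it is worth a quick check right after installation. The commands below are a minimal sketch using the standard munge utilities shipped with Ubuntu; run them inside the slurm-master container:

 # Ensure the munge daemon is enabled and running
 sudo systemctl enable --now munge
 # Local round-trip test: encode a credential and decode it again
 munge -n | unmunge
 # Note: the same /etc/munge/munge.key must later be copied to every
 # compute node (restart munge there afterwards), otherwise slurmctld
 # and slurmd cannot authenticate to each other.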


Sample slurm.conf Configuration

Note: The following is a sample Slurm controller configuration file (slurm.conf). You can use this as a reference template for setting up your cluster.

Make sure to modify the node names, IP addresses, memory, CPU configuration, and other values according to your actual cluster setup.
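
One convenient way to obtain correct values for the NodeName lines further down in this file is to let slurmd print the hardware it detects. This is only a sketch: it assumes the slurmd package is already installed on the compute node, and the example output is illustrative rather than taken from this cluster.

 # Run on each compute node: print the detected hardware as a
 # ready-to-paste NodeName line
 sudo slurmd -C
 # Example output (values will differ on your hardware):
 # NodeName=node1 CPUs=32 Boards=1 SocketsPerBoard=1 CoresPerSocket=16 ThreadsPerCore=2 RealMemory=128537
 # UpTime=0-02:13:45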

Create slurm.conf File

Create the slurm.conf file at the following location on all nodes:

 
/etc/slurm/slurm.conf 

Paste the content below into it:

# **Note:** This file needs to have identical contents on all nodes of
# the cluster.  See the `slurm.conf` man page for more information.
#

# Unique name for identifying this cluster's entries in the DB

ClusterName=Cluster_Name


## scheduler settings
#
SchedulerType=sched/backfill
SelectType=select/linear


## accounting settings
#

AccountingStorageType=accounting_storage/none

# the "job completion" info is redundant if the accounting
# infrastructure is enabled, so turn it off as it's an endless source
# of authentication and DB connection problems ...
JobCompType=jobcomp/none

# No power consumption acct
AcctGatherEnergyType=acct_gather_energy/none

# No IB usage accounting
AcctGatherInfinibandType=acct_gather_infiniband/none

# No filesystem accounting (only works with Lustre)
AcctGatherFilesystemType=acct_gather_filesystem/none

# No job profiling (for now)
AcctGatherProfileType=acct_gather_profile/none
#AcctGatherProfileType=acct_gather_profile/hdf5

JobAcctGatherType=jobacct_gather/linux
JobAcctGatherFrequency=60


## job execution settings
#

# requeue jobs on node failure, unless users ask otherwise
JobRequeue=1

# max number of jobs in a job array
MaxArraySize=1000

# max number of jobs pending + running
MaxJobCount=10000


MpiDefault=none
# Note: Apparently, the `MpiParams` option is needed also for non-mpi
# jobs in slurm 2.5.3.
MpiParams=ports=12000-12999

# track resource usage via Linux /proc tree
ProctrackType=proctrack/linuxproc
#ProctrackType=proctrack/cgroup

# do not propagate `ulimit` restrictions found on login nodes
PropagateResourceLimits=NONE

# automatically return nodes to service, unless they have been marked DOWN by admins
ReturnToService=1


TaskPlugin=task/none
#TaskPlugin=task/cgroup
#TaskEpilog=/etc/slurm/task_epilog
#TaskProlog=/etc/slurm/task_prolog

TmpFs=/tmp

# limit virtual mem usage to 101% of real mem usage
VSizeFactor=101


# misc timeout settings (commented lines show the default)
#
BatchStartTimeout=60
CompleteWait=35
#EpilogMsgTime=2000
#HealthCheckInterval=0
#HealthCheckProgram=
InactiveLimit=0
KillWait=30
#MessageTimeout=10
#ResvOverRun=0
MinJobAge=300
#OverTimeLimit=0
#UnkillableStepTimeout=60
#VSizeFactor=0
Waittime=0


## `slurmctld` settings (controller nodes)
#
ControlMachine=master
ControlAddr=192.168.2.5

SlurmctldPidFile=/var/run/slurmctld.pid
SlurmctldPort=6817
SlurmctldTimeout=300

StateSaveLocation=/var/spool/slurm

SlurmctldDebug=error
SlurmctldLogFile=/var/log/slurm/slurmctld.log
DebugFlags=backfill,cpu_bind,priority,reservation,selecttype,steps

MailProg=/usr/bin/mail


## `slurmd` settings (compute nodes)
#
SlurmdPort=6818
SlurmdPidFile=/var/run/slurmd.pid
SlurmdSpoolDir=/var/lib/slurm/slurmd
SlurmdTimeout=300

SlurmdDebug=error
SlurmdLogFile=/var/log/slurm/slurmd.log

AuthType=auth/munge
CryptoType=crypto/munge

DisableRootJobs=NO


## Cluster nodes
#
NodeName=node1 CPUs=32 Boards=1 SocketsPerBoard=1 CoresPerSocket=16 ThreadsPerCore=2 RealMemory=126000 State=UNKNOWN
NodeName=node2 CPUs=32 Boards=1 SocketsPerBoard=2 CoresPerSocket=16 ThreadsPerCore=1 RealMemory=126000 State=UNKNOWN


## cpu partitions
#
PartitionName=cpu Nodes=node1,node2 MaxTime=INFINITE State=UP Default=YES

## GPU partition (only one partition may be the default, so keep Default=NO here)
PartitionName=gpu Nodes=node1,node2 Default=NO MaxTime=INFINITE State=UP

For a detailed description of every parameter, refer to the official documentation: https://slurm.schedmd.com/slurm.conf.html
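
After the controller configuration is in place, the same slurm.conf has to reach every compute node and the daemons have to be restarted. The commands below are a rough sketch; node1 and node2 are the example host names used in the configuration above, and the copy step assumes SSH access with permission to write to /etc/slurm on the compute nodes.

 # On the master: enable and restart the controller after editing slurm.conf
 sudo systemctl enable --now slurmctld
 sudo systemctl restart slurmctld
 # Copy the identical slurm.conf to each compute node (example host names)
 scp /etc/slurm/slurm.conf node1:/etc/slurm/slurm.conf
 scp /etc/slurm/slurm.conf node2:/etc/slurm/slurm.conf
 # On each compute node: restart slurmd so it picks up the new configuration
 sudo systemctl restart slurmd
 # Back on the master: confirm that nodes and partitions are visible
 sinfo
 scontrol show nodes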
