Ubuntu HPC slurm control daemon on master node
Slurm Controller Daemon (slurmctld) Setup on slurm-master (LXC)
Note: To access the shell of any Linux container (e.g., slurm-master), run the following command from the infra node:
lxc exec <container-name> bash
Example:
lxc exec slurm-master bash
1. Install Required Packages
Run the following command inside the slurm-master container:
sudo apt install slurmctld
This installs:
- munge – for authentication between Slurm components.
- slurmctld – the Slurm controller daemon.
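As an optional sanity check (not part of the original steps), you can confirm both packages are present and print the Slurm version; dpkg and slurmctld -V are standard on Ubuntu:
dpkg -l slurmctld munge
slurmctld -V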
MUNGE Key Generation and Distribution
Note: After installing munge, you must generate the MUNGE key on the slurm-master node and share the same key with all other nodes in the cluster, including:
- Infra node
- slurm-login (LXC)
- Compute nodes (e.g., node1, node2)
The MUNGE key must be identical on all nodes for authentication between Slurm components to work.
Run the following commands on the slurm-master container:
sudo create-munge-key -f -r
sudo cksum /etc/munge/munge.key
sudo mkdir /export/tmp/munge
sudo chmod go-rwx /export/tmp/munge
sudo cp /etc/munge/munge.key /export/tmp/munge/munge.key
These commands create the key, print its checksum (so you can verify that the copies on other nodes match), and place a copy in a shared path (e.g., /export/tmp/munge) accessible to the other nodes.
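On each of the other nodes (infra, slurm-login, node1, node2), the copied key then needs to be installed under /etc/munge and the munge service restarted. A minimal sketch, assuming the shared /export/tmp path is reachable from that node and the munge package is already installed there:
sudo cp /export/tmp/munge/munge.key /etc/munge/munge.key
sudo chown munge:munge /etc/munge/munge.key
sudo chmod 400 /etc/munge/munge.key
sudo systemctl restart munge
A quick local test on any node is munge -n | unmunge, which should report STATUS: Success.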
Sample slurm.conf Configuration
Note: The following is a sample Slurm configuration file (slurm.conf). You can use this as a reference template for setting up your cluster.
- Make sure to modify the node names, IP addresses, memory, CPU configuration, and other values according to your actual cluster setup.
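To get accurate CPU, socket, and memory values for the NodeName lines, you can run slurmd's hardware probe on each compute node (this assumes the slurmd package is already installed there); it prints a ready-made configuration line. The output below is only illustrative:
sudo slurmd -C
# NodeName=node1 CPUs=32 Boards=1 SocketsPerBoard=1 CoresPerSocket=16 ThreadsPerCore=2 RealMemory=128810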
Create slurm.conf File
Create the slurm.conf file at the following location on all nodes:
/etc/slurm/slurm.conf
Paste the content below into it:
# **Note:** This file needs to have identical contents on all nodes of
# the cluster. See the `slurm.conf` man page for more information.
#
# Unique name for identifying this cluster entries in the DB
ClusterName=<Cluster_Name>

## scheduler settings
#
SchedulerType=sched/backfill
SelectType=select/linear

## accounting settings
#
AccountingStorageType=accounting_storage/none
# the "job completion" info is redundant if the accounting
# infrastructure is enabled, so turn it off as it's an endless source
# of authentication and DB connection problems ...
JobCompType=jobcomp/none
# No power consumption acct
AcctGatherEnergyType=acct_gather_energy/none
# No IB usage accounting
AcctGatherInfinibandType=acct_gather_infiniband/none
# No filesystem accounting (only works with Lustre)
AcctGatherFilesystemType=acct_gather_filesystem/none
# No job profiling (for now)
AcctGatherProfileType=acct_gather_profile/none
#AcctGatherProfileType=acct_gather_profile/hdf5
JobAcctGatherType=jobacct_gather/linux
JobAcctGatherFrequency=60

## job execution settings
#
# requeue jobs on node failure, unless users ask otherwise
JobRequeue=1
# max number of jobs in a job array
MaxArraySize=1000
# max number of jobs pending + running
MaxJobCount=10000
MpiDefault=none
# Note: Apparently, the `MpiParams` option is needed also for non-mpi
# jobs in slurm 2.5.3.
MpiParams=ports=12000-12999
# track resource usage via Linux /proc tree
ProctrackType=proctrack/linuxproc
#ProctrackType=proctrack/cgroup
# do not propagate `ulimit` restrictions found on login nodes
PropagateResourceLimits=NONE
# automatically return nodes to service, unless they have been marked DOWN by admins
ReturnToService=1
TaskPlugin=task/none
#TaskPlugin=task/cgroup
#TaskEpilog=/etc/slurm/task_epilog
#TaskProlog=/etc/slurm/task_prolog
TmpFs=/tmp
# limit virtual mem usage to 101% of real mem usage
VSizeFactor=101

# misc timeout settings (commented lines show the default)
#
BatchStartTimeout=60
CompleteWait=35
#EpilogMsgTime=2000
#HealthCheckInterval=0
#HealthCheckProgram=
InactiveLimit=0
KillWait=30
#MessageTimeout=10
#ResvOverRun=0
MinJobAge=300
#OverTimeLimit=0
#UnkillableStepTimeout=60
#VSizeFactor=0
Waittime=0

## `slurmctld` settings (controller nodes)
#
ControlMachine=<master>
ControlAddr=<192.168.2.5>
SlurmctldPidFile=/var/run/slurmctld.pid
SlurmctldPort=6817
SlurmctldTimeout=300
StateSaveLocation=/var/spool/slurm
SlurmctldDebug=error
SlurmctldLogFile=/var/log/slurm/slurmctld.log
DebugFlags=backfill,cpu_bind,priority,reservation,selecttype,steps
MailProg=/usr/bin/mail

## `slurmd` settings (compute nodes)
#
SlurmdPort=6818
SlurmdPidFile=/var/run/slurmd.pid
SlurmdSpoolDir=/var/lib/slurm/slurmd
SlurmdTimeout=300
SlurmdDebug=error
SlurmdLogFile=/var/log/slurm/slurmd.log
AuthType=auth/munge
CryptoType=crypto/munge
DisableRootJobs=NO

## Cluster nodes
#
NodeName=node1 CPUs=32 Boards=1 SocketsPerBoard=1 CoresPerSocket=16 ThreadsPerCore=2 RealMemory=126000 State=UNKNOWN
NodeName=node2 CPUs=32 Boards=1 SocketsPerBoard=2 CoresPerSocket=16 ThreadsPerCore=1 RealMemory=126000 State=UNKNOWN

## cpu partitions
#
PartitionName=cpu Nodes=node1,node2 MaxTime=INFINITE State=UP Default=YES

## Gpu Partition
PartitionName=gpu Nodes=node1,node2 Default=Yes MaxTime=INFINITE State=UP
Restarting Slurm Controller Service
Once the slurm.conf file is properly configured and placed in /etc/slurm/, you can restart the Slurm controller (slurmctld) service using:
sudo systemctl restart slurmctld
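To verify the controller came up cleanly, you can check the service and query the cluster (an optional check; node and partition output will only be complete once slurmd is running on the compute nodes):
sudo systemctl status slurmctld
sinfo
scontrol show nodes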
Note: The slurm.conf file must be identical on all nodes in the cluster. To ensure consistency, it's a best practice to store it in a shared location (e.g., /export/tmp/slurm/) and copy it from there to each node.
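For example, a minimal sketch of pushing the file onto a node, assuming the shared export is mounted at /export/tmp on that node:
sudo mkdir -p /etc/slurm
sudo cp /export/tmp/slurm/slurm.conf /etc/slurm/slurm.conf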
For more details on each configuration directive, refer to the official documentation: https://slurm.schedmd.com/slurm.conf.html