Install slurmctld on master node
Slurm Controller Daemon (slurmctld) Setup on slurm-master (LXC)
Note: To access the shell of any Linux container (e.g., slurm-master), run the following command from the infra node:
lxc exec <container-name> bash
Example:
lxc exec slurm-master bash
1. Install Required Packages
Run the following command inside the slurm-master container:
sudo apt install munge slurmctld
This installs:
- munge – for authentication between Slurm components.
- slurmctld – the Slurm controller daemon.
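Optional sanity check (not part of the original steps; it assumes the default service and binary names shipped by the Ubuntu packages): confirm that munge is running and that the controller binary is installed before continuing.
sudo systemctl status munge
munge -n | unmunge
slurmctld -V
The first command should show munge as active, the second encodes and immediately decodes a test credential locally, and the third prints the installed Slurm version.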
Sample slurm.conf Configuration
Note: The following is a sample Slurm controller configuration file (slurm.conf). You can use this as a reference template for setting up your cluster.
- Make sure to modify the node names, IP addresses, memory, CPU configuration, and other values according to your actual cluster setup.
Create slurm.conf File
Create the slurm.conf file at the following location on all nodes:
/etc/slurm/slurm.conf
Paste the below content into it:
# **Note:** This file needs to have identical contents on all nodes of
# the cluster. See the `slurm.conf` man page for more information.
#
# Unique name for identifying this cluster entries in the DB
ClusterName=Cluster_Name

## scheduler settings
#
SchedulerType=sched/backfill
SelectType=select/linear

## accounting settings
#
AccountingStorageType=accounting_storage/none
# the "job completion" info is redundant if the accounting
# infrastructure is enabled, so turn it off as it's an endless source
# of authentication and DB connection problems ...
JobCompType=jobcomp/none
# No power consumption acct
AcctGatherEnergyType=acct_gather_energy/none
# No IB usage accounting
AcctGatherInfinibandType=acct_gather_infiniband/none
# No filesystem accounting (only works with Lustre)
AcctGatherFilesystemType=acct_gather_filesystem/none
# No job profiling (for now)
AcctGatherProfileType=acct_gather_profile/none
#AcctGatherProfileType=acct_gather_profile/hdf5
JobAcctGatherType=jobacct_gather/linux
JobAcctGatherFrequency=60

## job execution settings
#
# requeue jobs on node failure, unless users ask otherwise
JobRequeue=1
# max number of jobs in a job array
MaxArraySize=1000
# max number of jobs pending + running
MaxJobCount=10000
MpiDefault=none
# Note: Apparently, the `MpiParams` option is needed also for non-mpi
# jobs in slurm 2.5.3.
MpiParams=ports=12000-12999
# track resource usage via Linux /proc tree
ProctrackType=proctrack/linuxproc
#ProctrackType=proctrack/cgroup
# do not propagate `ulimit` restrictions found on login nodes
PropagateResourceLimits=NONE
# automatically return nodes to service, unless they have been marked DOWN by admins
ReturnToService=1
TaskPlugin=task/none
#TaskPlugin=task/cgroup
#TaskEpilog=/etc/slurm/task_epilog
#TaskProlog=/etc/slurm/task_prolog
TmpFs=/tmp
# limit virtual mem usage to 101% of real mem usage
VSizeFactor=101

# misc timeout settings (commented lines show the default)
#
BatchStartTimeout=60
CompleteWait=35
#EpilogMsgTime=2000
#HealthCheckInterval=0
#HealthCheckProgram=
InactiveLimit=0
KillWait=30
#MessageTimeout=10
#ResvOverRun=0
MinJobAge=300
#OverTimeLimit=0
#UnkillableStepTimeout=60
#VSizeFactor=0
Waittime=0

## `slurmctld` settings (controller nodes)
#
ControlMachine=master
ControlAddr=192.168.2.5
SlurmctldPidFile=/var/run/slurmctld.pid
SlurmctldPort=6817
SlurmctldTimeout=300
StateSaveLocation=/var/spool/slurm
SlurmctldDebug=error
SlurmctldLogFile=/var/log/slurm/slurmctld.log
DebugFlags=backfill,cpu_bind,priority,reservation,selecttype,steps
MailProg=/usr/bin/mail

## `slurmd` settings (compute nodes)
#
SlurmdPort=6818
SlurmdPidFile=/var/run/slurmd.pid
SlurmdSpoolDir=/var/lib/slurm/slurmd
SlurmdTimeout=300
SlurmdDebug=error
SlurmdLogFile=/var/log/slurm/slurmd.log
AuthType=auth/munge
CryptoType=crypto/munge
DisableRootJobs=NO

## Cluster nodes
#
NodeName=node1 CPUs=32 Boards=1 SocketsPerBoard=1 CoresPerSocket=16 ThreadsPerCore=2 RealMemory=126000 State=UNKNOWN
NodeName=node2 CPUs=32 Boards=1 SocketsPerBoard=2 CoresPerSocket=16 ThreadsPerCore=1 RealMemory=126000 State=UNKNOWN

## cpu partition
#
PartitionName=cpu Nodes=node1,node2 MaxTime=INFINITE State=UP Default=YES

## gpu partition
PartitionName=gpu Nodes=node1,node2 Default=Yes MaxTime=INFINITE State=UP
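If you are unsure which values to put in the NodeName lines above, Slurm can report a node's hardware for you. As an optional example (run on a compute node once slurmd is installed there; not part of the original steps):
slurmd -C
This prints a ready-made NodeName line with the detected CPUs, Boards, SocketsPerBoard, CoresPerSocket, ThreadsPerCore, and RealMemory values, which you can copy into slurm.conf.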
For more details on slurm.conf parameters, refer to: https://slurm.schedmd.com/slurm.conf.html
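Once slurm.conf is in place on the master, the controller daemon can be enabled and restarted so it picks up the configuration. A minimal sketch, assuming the systemd unit name used by the Ubuntu slurmctld package:
sudo systemctl enable slurmctld
sudo systemctl restart slurmctld
sudo systemctl status slurmctld
If the service fails to start, check /var/log/slurm/slurmctld.log (the SlurmctldLogFile path configured above) for the reason.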