[[Main Page|Home]] > [[Ubuntu]] > [[Ubuntu HPC setup with slurm and linux containers]] > [[Install slurmctld on master node]]


= Slurm Controller Daemon (slurmctld) Setup on slurm-master (LXC) =
'''Note:'''
To access the shell of any Linux container (e.g., slurm-master), run the following command from the infra node:
<pre> lxc exec &lt;container-name&gt; bash </pre>
Example:
<pre> lxc exec slurm-master bash </pre>
== 1. Install Required Packages ==
Run the following command inside the slurm-master container:
<pre> sudo apt install munge slurmctld </pre>
This installs:
* munge – for authentication between Slurm components.
* slurmctld – the Slurm controller daemon.
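After the packages install, you can optionally confirm that both services are registered. This is a minimal verification sketch, assuming a systemd-based Ubuntu container; slurmctld will normally fail to start until /etc/slurm/slurm.conf exists, so a failed state at this stage is expected:
<pre>
# Check that the munge authentication service is registered and running
sudo systemctl status munge

# Check the controller daemon; it typically stays failed/inactive until
# /etc/slurm/slurm.conf has been created (next section)
sudo systemctl status slurmctld
</pre>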
= Sample slurm.conf Configuration =
'''Note:'''
The following is a sample Slurm controller configuration file (slurm.conf). You can use this as a reference template for setting up your cluster.
: Make sure to modify the node names, IP addresses, memory, CPU configuration, and other values according to your actual cluster setup.
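To fill in accurate CPU, socket, thread, and memory values for the NodeName lines, Slurm can print a node's hardware in slurm.conf format. This is a hedged sketch, assuming slurmd is already installed on the compute node being queried:
<pre>
# Run on a compute node (e.g. node1); prints a ready-to-paste NodeName line
slurmd -C

# Example output (actual values depend on the node's hardware):
# NodeName=node1 CPUs=32 Boards=1 SocketsPerBoard=1 CoresPerSocket=16 ThreadsPerCore=2 RealMemory=126000
</pre>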
== Create slurm.conf File ==
Create the slurm.conf file at the following location on all nodes:
<pre>
/etc/slurm/slurm.conf
</pre>
Paste the below content into it:
<pre>
# **Note:** This file needs to have identical contents on all nodes of
# the cluster.  See the `slurm.conf` man page for more information.
#
# Unique name for identifying this cluster's entries in the DB
ClusterName=Cluster_Name
## scheduler settings
#
SchedulerType=sched/backfill
SelectType=select/linear
## accounting settings
#
AccountingStorageType=accounting_storage/none
# the "job completion" info is redundant if the accounting
# infrastructure is enabled, so turn it off as it's an endless source
# of authentication and DB connection problems ...
JobCompType=jobcomp/none
# No power consumption acct
AcctGatherEnergyType=acct_gather_energy/none
# No IB usage accounting
AcctGatherInfinibandType=acct_gather_infiniband/none
# No filesystem accounting (only works with Lustre)
AcctGatherFilesystemType=acct_gather_filesystem/none
# No job profiling (for now)
AcctGatherProfileType=acct_gather_profile/none
#AcctGatherProfileType=acct_gather_profile/hdf5
JobAcctGatherType=jobacct_gather/linux
JobAcctGatherFrequency=60
## job execution settings
#
# requeue jobs on node failure, unless users ask otherwise
JobRequeue=1
# max number of jobs in a job array
MaxArraySize=1000
# max number of jobs pending + running
MaxJobCount=10000
MpiDefault=none
# Note: Apparently, the `MpiParams` option is needed also for non-mpi
# jobs in slurm 2.5.3.
MpiParams=ports=12000-12999
# track resource usage via Linux /proc tree
ProctrackType=proctrack/linuxproc
#ProctrackType=proctrack/cgroup
# do not propagate `ulimit` restrictions found on login nodes
PropagateResourceLimits=NONE
# automatically return nodes to service, unless they have been marked DOWN by admins
ReturnToService=1
TaskPlugin=task/none
#TaskPlugin=task/cgroup
#TaskEpilog=/etc/slurm/task_epilog
#TaskProlog=/etc/slurm/task_prolog
TmpFs=/tmp
# limit virtual mem usage to 101% of real mem usage
VSizeFactor=101
# misc timeout settings (commented lines show the default)
#
BatchStartTimeout=60
CompleteWait=35
#EpilogMsgTime=2000
#HealthCheckInterval=0
#HealthCheckProgram=
InactiveLimit=0
KillWait=30
#MessageTimeout=10
#ResvOverRun=0
MinJobAge=300
#OverTimeLimit=0
#UnkillableStepTimeout=60
#VSizeFactor=0
Waittime=0
## `slurmctld` settings (controller nodes)
#
ControlMachine=master
ControlAddr=192.168.2.5
SlurmctldPidFile=/var/run/slurmctld.pid
SlurmctldPort=6817
SlurmctldTimeout=300
StateSaveLocation=/var/spool/slurm
SlurmctldDebug=error
SlurmctldLogFile=/var/log/slurm/slurmctld.log
DebugFlags=backfill,cpu_bind,priority,reservation,selecttype,steps
MailProg=/usr/bin/mail
## `slurmd` settings (compute nodes)
#
SlurmdPort=6818
SlurmdPidFile=/var/run/slurmd.pid
SlurmdSpoolDir=/var/lib/slurm/slurmd
SlurmdTimeout=300
SlurmdDebug=error
SlurmdLogFile=/var/log/slurm/slurmd.log
AuthType=auth/munge
CryptoType=crypto/munge
DisableRootJobs=NO
## Cluster nodes
#
NodeName=node1 CPUs=32 Boards=1 SocketsPerBoard=1 CoresPerSocket=16 ThreadsPerCore=2 RealMemory=126000 State=UNKNOWN
NodeName=node2 CPUs=32 Boards=1 SocketsPerBoard=2 CoresPerSocket=16 ThreadsPerCore=1 RealMemory=126000 State=UNKNOWN
## cpu partitions
#
PartitionName=cpu Nodes=node1,node2 MaxTime=INFINITE State=UP Default=YES
## gpu partition
#
# only one partition may be Default=YES; the cpu partition above is the default
PartitionName=gpu Nodes=node1,node2 MaxTime=INFINITE State=UP
</pre>
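Once slurm.conf is in place, restart the controller and check that it can see the configured nodes and partitions. A minimal sketch, assuming the same slurm.conf has also been copied to the compute nodes (for example via scp) and that munge with a shared key is running everywhere:
<pre>
# On slurm-master: restart the controller and confirm it is running
sudo systemctl restart slurmctld
sudo systemctl status slurmctld

# Verify that the controller reports the configured partitions and nodes
sinfo

# Dump the configuration the controller actually loaded
scontrol show config | head -n 20
</pre>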
For more details, refer to the slurm.conf documentation: https://slurm.schedmd.com/slurm.conf.html
[[Main Page|Home]] > [[Ubuntu]] > [[Ubuntu HPC setup with slurm and linux containers]] > [[Install slurmctld on master node]]
