[[Main Page|Home]] > [[Ubuntu]] > [[Ubuntu HPC setup with slurm and linux containers]] > [[Ubuntu HPC slurm control daemon on master node]]


= Slurm Controller Daemon (slurmctld) Setup on slurm-master (LXC) =
'''Note:'''
To access the shell of any Linux container (e.g., slurm-master), run the following command from the infra node:
<pre>
lxc exec &lt;container-name&gt; bash
</pre>
Example:
<pre>
lxc exec slurm-master bash
</pre>
== 1. Install Required Packages ==
Run the following command inside the slurm-master container:
<pre>
sudo apt install slurmctld
</pre>
This installs:
* munge – for authentication between Slurm components.
* slurmctld – the Slurm controller daemon.
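After the installation completes, you can optionally confirm that both services are present (a generic check; slurmctld will not start successfully until a valid slurm.conf is in place):
<pre>
sudo systemctl status munge
sudo systemctl status slurmctld
</pre>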
= MUNGE Key Generation and Distribution =
'''Note:'''
After installing munge, you must generate the MUNGE key on the slurm-master node and share the same key with all other nodes in the cluster, including:
* Infra node
* slurm-login (LXC)
* Compute nodes (e.g., node1, node2)
: The MUNGE key must be identical on all nodes for authentication between Slurm components to work.
== Generate and Share MUNGE Key on slurm-master ==
Run the following commands on the slurm-master container:
<pre>
sudo create-munge-key -f -r
sudo cksum /etc/munge/munge.key
sudo mkdir /export/tmp/munge
sudo chmod go-rwx /export/tmp/munge
sudo cp /etc/munge/munge.key /export/tmp/munge/munge.key
</pre>
This creates the key, prints its checksum (so that copies on other nodes can later be verified against it), and places a copy in a shared path (e.g., /export/tmp/munge) accessible to other nodes.
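On each of the other nodes (infra node, slurm-login, node1, node2), the shared key then needs to be copied into /etc/munge with the correct ownership and permissions. The following is a rough sketch, assuming the shared path /export/tmp/munge is reachable from those nodes:
<pre>
sudo cp /export/tmp/munge/munge.key /etc/munge/munge.key
sudo chown munge:munge /etc/munge/munge.key
sudo chmod 400 /etc/munge/munge.key
sudo systemctl restart munge
sudo cksum /etc/munge/munge.key   # should match the checksum printed on slurm-master
</pre>
Once the key is in place on a compute node such as node1, cross-node authentication can be tested from slurm-master with:
<pre>
munge -n | ssh node1 unmunge
</pre>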
= Sample slurm.conf Configuration =
'''Note:'''
The following is a sample Slurm configuration file (slurm.conf). You can use this as a reference template for setting up your cluster.
: Make sure to modify the node names, IP addresses, memory, CPU configuration, and other values according to your actual cluster setup.
=== Create slurm.conf File ===
Create the slurm.conf file at the following location on all nodes:
<pre>
/etc/slurm/slurm.conf
</pre>
Paste the content below into it:
<pre>
# **Note:** This file needs to have identical contents on all nodes of
# the cluster.  See the `slurm.conf` man page for more information.
#
# Unique name for identifying this cluster's entries in the DB
ClusterName=<Cluster_Name>
## scheduler settings
#
SchedulerType=sched/backfill
SelectType=select/linear
## accounting settings
#
AccountingStorageType=accounting_storage/none
# the "job completion" info is redundant if the accounting
# infrastructure is enabled, so turn it off as it's an endless source
# of authentication and DB connection problems ...
JobCompType=jobcomp/none
# No power consumption acct
AcctGatherEnergyType=acct_gather_energy/none
# No IB usage accounting
AcctGatherInfinibandType=acct_gather_infiniband/none
# No filesystem accounting (only works with Lustre)
AcctGatherFilesystemType=acct_gather_filesystem/none
# No job profiling (for now)
AcctGatherProfileType=acct_gather_profile/none
#AcctGatherProfileType=acct_gather_profile/hdf5
JobAcctGatherType=jobacct_gather/linux
JobAcctGatherFrequency=60
## job execution settings
#
# requeue jobs on node failure, unless users ask otherwise
JobRequeue=1
# max number of jobs in a job array
MaxArraySize=1000
# max number of jobs pending + running
MaxJobCount=10000
MpiDefault=none
# Note: Apparently, the `MpiParams` option is needed also for non-mpi
# jobs in slurm 2.5.3.
MpiParams=ports=12000-12999
# track resource usage via Linux /proc tree
ProctrackType=proctrack/linuxproc
#ProctrackType=proctrack/cgroup
# do not propagate `ulimit` restrictions found on login nodes
PropagateResourceLimits=NONE
# automatically return nodes to service, unless they have been marked DOWN by admins
ReturnToService=1
TaskPlugin=task/none
#TaskPlugin=task/cgroup
#TaskEpilog=/etc/slurm/task_epilog
#TaskProlog=/etc/slurm/task_prolog
TmpFs=/tmp
# limit virtual mem usage to 101% of real mem usage
VSizeFactor=101
# misc timeout settings (commented lines show the default)
#
BatchStartTimeout=60
CompleteWait=35
#EpilogMsgTime=2000
#HealthCheckInterval=0
#HealthCheckProgram=
InactiveLimit=0
KillWait=30
#MessageTimeout=10
#ResvOverRun=0
MinJobAge=300
#OverTimeLimit=0
#UnkillableStepTimeout=60
#VSizeFactor=0
Waittime=0
## `slurmctld` settings (controller nodes)
#
ControlMachine=<master>
ControlAddr=<192.168.2.5>
SlurmctldPidFile=/var/run/slurmctld.pid
SlurmctldPort=6817
SlurmctldTimeout=300
StateSaveLocation=/var/spool/slurm
SlurmctldDebug=error
SlurmctldLogFile=/var/log/slurm/slurmctld.log
DebugFlags=backfill,cpu_bind,priority,reservation,selecttype,steps
MailProg=/usr/bin/mail
## `slurmd` settings (compute nodes)
#
SlurmdPort=6818
SlurmdPidFile=/var/run/slurmd.pid
SlurmdSpoolDir=/var/lib/slurm/slurmd
SlurmdTimeout=300
SlurmdDebug=error
SlurmdLogFile=/var/log/slurm/slurmd.log
AuthType=auth/munge
CryptoType=crypto/munge
DisableRootJobs=NO
## Cluster nodes
#
NodeName=node1 CPUs=32 Boards=1 SocketsPerBoard=1 CoresPerSocket=16 ThreadsPerCore=2 RealMemory=126000 State=UNKNOWN
NodeName=node2 CPUs=32 Boards=1 SocketsPerBoard=2 CoresPerSocket=16 ThreadsPerCore=1 RealMemory=126000 State=UNKNOWN
## cpu partitions
#
PartitionName=cpu Nodes=node1,node2 MaxTime=INFINITE State=UP Default=YES
## GPU partition
PartitionName=gpu Nodes=node1,node2 Default=NO MaxTime=INFINITE State=UP
</pre>
= Restarting Slurm Controller Service =
Once the slurm.conf file is properly configured and placed in /etc/slurm/, you can restart the Slurm controller (slurmctld) service using:
<pre>
sudo systemctl restart slurmctld
</pre>
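To confirm that the controller came up cleanly, check the service status and the node/partition view (a generic sanity check, not specific to this setup):
<pre>
sudo systemctl status slurmctld
sinfo
scontrol show nodes
</pre>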
'''Note:'''
The slurm.conf file must be identical on all nodes in the cluster.
To ensure consistency, it's a best practice to store it in a shared location (e.g., /export/tmp/slurm/) and copy it from there to each node.
For more details on each configuration directive, refer to the official documentation:
https://slurm.schedmd.com/slurm.conf.html
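As a sketch of the shared-location approach mentioned above (assuming /export/tmp/slurm is used as the shared path, analogous to the MUNGE key directory):
<pre>
sudo mkdir -p /export/tmp/slurm
sudo cp /etc/slurm/slurm.conf /export/tmp/slurm/slurm.conf
</pre>
Then, on every other node:
<pre>
sudo cp /export/tmp/slurm/slurm.conf /etc/slurm/slurm.conf
</pre>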
[[Main Page|Home]] > [[Ubuntu]] > [[Ubuntu HPC setup with slurm and linux containers]] > [[Ubuntu HPC slurm control daemon on master node]]
