[[Main Page|Home]] > [[Ubuntu]] > [[Ubuntu HPC setup with slurm and linux containers]] > [[Install slurmctld on master node]]

= Slurm Controller Daemon (slurmctld) Setup on slurm-master (LXC) =

'''Note:'''
To access the shell of any Linux container (e.g., slurm-master), run the following command from the infra node:

<pre> lxc exec <container-name> bash </pre>

Example:

<pre> lxc exec slurm-master bash </pre>

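'''Tip:''' You can also run a single command inside a container without opening an interactive shell by putting it after a double dash. For example, to check the hostname of the slurm-master container from the infra node:

<pre> lxc exec slurm-master -- hostname </pre>
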
== 1. Install Required Packages ==

Run the following command inside the slurm-master container:

<pre> sudo apt install munge slurmctld </pre>

This installs:

* munge – for authentication between Slurm components.
* slurmctld – the Slurm controller daemon.

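Once the installation finishes, a quick sanity check can confirm that everything is in place (this assumes the stock Ubuntu packages, which normally start the munge service automatically on install):

<pre>
# confirm both packages are installed
dpkg -l munge slurmctld

# print the installed Slurm version
slurmctld -V

# generate and decode a test credential; this only works if munged is running
munge -n | unmunge
</pre>
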
= Sample slurm.conf Configuration =

'''Note:'''
The following is a sample Slurm controller configuration file (slurm.conf). You can use this as a reference template for setting up your cluster.

: Make sure to modify the node names, IP addresses, memory, CPU configuration, and other values according to your actual cluster setup.

== Create slurm.conf File ==

Create the slurm.conf file at the following location on all nodes:

<pre>
/etc/slurm/slurm.conf
</pre>
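
If the /etc/slurm directory does not already exist on a node, create it before adding the file (the configuration directory can differ between Ubuntu releases; older packages shipped it as /etc/slurm-llnl):

<pre> sudo mkdir -p /etc/slurm </pre>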

Paste the content below into it:

<pre>
# **Note:** This file needs to have identical contents on all nodes of
# the cluster. See the `slurm.conf` man page for more information.
#

# Unique name for identifying this cluster's entries in the DB
ClusterName=Cluster_Name

## scheduler settings
#
SchedulerType=sched/backfill
SelectType=select/linear

## accounting settings
#
AccountingStorageType=accounting_storage/none

# the "job completion" info is redundant if the accounting
# infrastructure is enabled, so turn it off as it's an endless source
# of authentication and DB connection problems ...
JobCompType=jobcomp/none

# No power consumption acct
AcctGatherEnergyType=acct_gather_energy/none

# No IB usage accounting
AcctGatherInfinibandType=acct_gather_infiniband/none

# No filesystem accounting (only works with Lustre)
AcctGatherFilesystemType=acct_gather_filesystem/none

# No job profiling (for now)
AcctGatherProfileType=acct_gather_profile/none
#AcctGatherProfileType=acct_gather_profile/hdf5

JobAcctGatherType=jobacct_gather/linux
JobAcctGatherFrequency=60

## job execution settings
#

# requeue jobs on node failure, unless users ask otherwise
JobRequeue=1

# max number of jobs in a job array
MaxArraySize=1000

# max number of jobs pending + running
MaxJobCount=10000

MpiDefault=none
# Note: Apparently, the `MpiParams` option is needed also for non-mpi
# jobs in slurm 2.5.3.
MpiParams=ports=12000-12999

# track resource usage via Linux /proc tree
ProctrackType=proctrack/linuxproc
#ProctrackType=proctrack/cgroup

# do not propagate `ulimit` restrictions found on login nodes
PropagateResourceLimits=NONE

# automatically return nodes to service, unless they have been marked DOWN by admins
ReturnToService=1

TaskPlugin=task/none
#TaskPlugin=task/cgroup
#TaskEpilog=/etc/slurm/task_epilog
#TaskProlog=/etc/slurm/task_prolog

TmpFs=/tmp

# limit virtual mem usage to 101% of real mem usage
VSizeFactor=101

# misc timeout settings (commented lines show the default)
#
BatchStartTimeout=60
CompleteWait=35
#EpilogMsgTime=2000
#HealthCheckInterval=0
#HealthCheckProgram=
InactiveLimit=0
KillWait=30
#MessageTimeout=10
#ResvOverRun=0
MinJobAge=300
#OverTimeLimit=0
#UnkillableStepTimeout=60
#VSizeFactor=0
Waittime=0

## `slurmctld` settings (controller nodes)
#
ControlMachine=master
ControlAddr=192.168.2.5

SlurmctldPidFile=/var/run/slurmctld.pid
SlurmctldPort=6817
SlurmctldTimeout=300

StateSaveLocation=/var/spool/slurm

SlurmctldDebug=error
SlurmctldLogFile=/var/log/slurm/slurmctld.log
DebugFlags=backfill,cpu_bind,priority,reservation,selecttype,steps

MailProg=/usr/bin/mail

## `slurmd` settings (compute nodes)
#
SlurmdPort=6818
SlurmdPidFile=/var/run/slurmd.pid
SlurmdSpoolDir=/var/lib/slurm/slurmd
SlurmdTimeout=300

SlurmdDebug=error
SlurmdLogFile=/var/log/slurm/slurmd.log

AuthType=auth/munge
CryptoType=crypto/munge

DisableRootJobs=NO

## Cluster nodes
#
NodeName=node1 CPUs=32 Boards=1 SocketsPerBoard=1 CoresPerSocket=16 ThreadsPerCore=2 RealMemory=126000 State=UNKNOWN
NodeName=node2 CPUs=32 Boards=1 SocketsPerBoard=2 CoresPerSocket=16 ThreadsPerCore=1 RealMemory=126000 State=UNKNOWN

## cpu partitions
#
PartitionName=cpu Nodes=node1,node2 MaxTime=INFINITE State=UP Default=YES

## gpu partition
# note: only one partition may be flagged Default; the cpu partition above is the default here
PartitionName=gpu Nodes=node1,node2 MaxTime=INFINITE State=UP
</pre>
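
The NodeName lines must describe the hardware that slurmd actually detects on each compute node, otherwise the node can end up in an invalid/drained state. A convenient way to get correct values is to run slurmd -C on each compute node (this assumes the slurmd package is already installed there); it prints a ready-made NodeName line with the detected socket, core, thread, and memory figures:

<pre>
# run on each compute node and copy the printed NodeName line into slurm.conf
slurmd -C
</pre>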

Please refer to the following link for more details: https://slurm.schedmd.com/slurm.conf.html
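
Once slurm.conf is in place on all nodes (and the same /etc/munge/munge.key has been copied to every node, since munge requires an identical key cluster-wide), restart the services on slurm-master and verify that the controller is running. The service names below are the ones used by the stock Ubuntu packages; sinfo comes from the slurm-client package:

<pre>
sudo systemctl restart munge slurmctld
sudo systemctl status slurmctld

# show partitions and node states as seen by the controller
sinfo
</pre>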

[[Main Page|Home]] > [[Ubuntu]] > [[Ubuntu HPC setup with slurm and linux containers]] > [[Install slurmctld on master node]]