# Slurm Installation and Configuration

- You can use vim/nano (or any editor you like) to create the scripts below and paste in their contents.
- This page is adapted from [HPC Docs - Building and Installing Slurm](https://www.notion.so/sdc-nycu/Building-and-Installing-Slurm-1e27dadd80408063b5bff22abe230159).

## Scripts

### Init installation

- command 0: create the script

```bash!
nano initInstall.sh
```

- script: just paste this in

```bash!
#!/bin/bash
# install build tools and fetch the Slurm source
sudo apt install -y lbzip2 build-essential fakeroot devscripts equivs wget nano vim openssh-server
cd /opt
wget https://download.schedmd.com/slurm/slurm-25.11.2.tar.bz2
tar -xaf slurm*tar.bz2
cd /opt/slurm-25.11.2
# install the Debian build dependencies, then build the .deb packages
sudo mk-build-deps -i debian/control
sudo debuild -b -uc -us
dpkg-buildpackage -T clean
sudo apt-get install --reinstall libmunge-dev libmunge2
```

- command 1: run the script

```bash!
chmod +x initInstall.sh
./initInstall.sh
```

## Node installation

| RPM name        | DEB name            | Login | Controller | Compute | DBD |
|-----------------|---------------------|-------|------------|---------|-----|
| slurm           | slurm-smd           | X     | X          | X       | X   |
| slurm-perlapi   | slurm-smd-client    | X     | X          | X       |     |
| slurm-slurmctld | slurm-smd-slurmctld |       | X          |         |     |
| slurm-slurmd    | slurm-smd-slurmd    |       |            | X       |     |
| slurm-slurmdbd  | slurm-smd-slurmdbd  |       |            |         | X   |

### Controller Node Scripts

#### commands

```bash!
# put what you need into the script with nano (see below), and check which platform you are on
nano ControllerNode.sh
chmod +x ControllerNode.sh
./ControllerNode.sh
```

#### Per platform

- amd64 (x86)

```bash!
#!/bin/bash
cd /opt
sudo apt install ./slurm-smd_25.11.2-1_amd64.deb
sudo apt install ./slurm-smd-client_25.11.2-1_amd64.deb
sudo apt install ./slurm-smd-slurmctld_25.11.2-1_amd64.deb
```

- arm64

```bash!
#!/bin/bash
cd /opt
sudo apt install ./slurm-smd_25.11.2-1_arm64.deb
sudo apt install ./slurm-smd-client_25.11.2-1_arm64.deb
sudo apt install ./slurm-smd-slurmctld_25.11.2-1_arm64.deb
```

### Login Node Scripts

If you are building a small, self-hosted cluster, you can simply use the controller node as the login node.

#### commands

```bash!
# put what you need into the script with nano (see below), and check which platform you are on
nano LoginNode.sh
chmod +x LoginNode.sh
./LoginNode.sh
```

#### Per platform

- amd64 (x86)

```bash!
#!/bin/bash
cd /opt
sudo apt install ./slurm-smd_25.11.2-1_amd64.deb
sudo apt install ./slurm-smd-client_25.11.2-1_amd64.deb
```

- arm64

```bash!
#!/bin/bash
cd /opt
sudo apt install ./slurm-smd_25.11.2-1_arm64.deb
sudo apt install ./slurm-smd-client_25.11.2-1_arm64.deb
```

### Compute Node Scripts

#### commands

```bash!
# put what you need into the script with nano (see below), and check which platform you are on
nano ComputeNode.sh
chmod +x ComputeNode.sh
./ComputeNode.sh
```

#### Per platform

- amd64 (x86)

```bash!
#!/bin/bash
cd /opt
sudo apt install ./slurm-smd_25.11.2-1_amd64.deb
sudo apt install ./slurm-smd-client_25.11.2-1_amd64.deb
sudo apt install ./slurm-smd-slurmd_25.11.2-1_amd64.deb
```

- arm64

```bash!
#!/bin/bash
cd /opt
sudo apt install ./slurm-smd_25.11.2-1_arm64.deb
sudo apt install ./slurm-smd-client_25.11.2-1_arm64.deb
sudo apt install ./slurm-smd-slurmd_25.11.2-1_arm64.deb
```
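### Copying the packages to the other nodes

The install scripts above expect the built `.deb` files to already sit in `/opt` on each node. If you built them on a single machine (and all nodes share that machine's architecture), a minimal sketch for copying them over could look like the following; the `c0`/`c1` hostnames are only examples matching the SSH aliases used later in this guide.

```bash!
# sketch: run on the build host, adjust hostnames and paths to your cluster
for host in c0 c1; do
    scp /opt/slurm-smd*.deb "$host":~/
    ssh -t "$host" 'sudo mv ~/slurm-smd*.deb /opt/'   # -t so sudo can ask for a password
done
```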
---

## Slurm Configuration

- these steps need to be done on every node

### Time Sync

```bash!
sudo apt update
sudo apt install -y chrony
sudo systemctl enable --now chrony
timedatectl
chronyc sources -v
```

### User/Group Sync

- simple: use plain local Linux users/groups

```bash!
export MUNGEUSER=1005
sudo groupadd -g $MUNGEUSER munge
sudo useradd -m -c "MUNGE Uid 'N' Gid Emporium" -d /var/lib/munge -u $MUNGEUSER -g munge -s /sbin/nologin munge

export SlurmUSER=1001
sudo groupadd -g $SlurmUSER slurm
sudo useradd -m -c "Slurm workload manager" -d /var/lib/slurm -u $SlurmUSER -g slurm -s /bin/bash slurm
```

- advanced: LDAP!

### ONLY DO THIS ON THE CONTROL NODE!!!

- generate the key!

```bash!
sudo dd if=/dev/urandom bs=1 count=1024 of=/etc/munge/munge.key
sudo chown munge:munge /etc/munge/munge.key
sudo chmod 0400 /etc/munge/munge.key
```

- send it to the compute nodes!!!
  - [scp tutorial (Chinese)](https://blog.gtwang.org/linux/linux-scp-command-tutorial-examples/), or just run `man scp` to learn it!
  - if your user is not in sudoers, get to root somehow and run the line below (here `compute` is the user on the compute nodes)

```bash!
usermod -aG sudo compute
```

```bash!
# do this on the control node
sudo cp /etc/munge/munge.key ~/munge.key
sudo chown ryansuc:ryansuc ~/munge.key   # replace ryansuc with your own user
chmod 600 ~/munge.key
scp ~/munge.key c0:~/munge.key
scp ~/munge.key c1:~/munge.key

# do this on all three nodes and compare the output
ls -l ~/munge.key
md5sum ~/munge.key
```

```bash!
# do this on the compute nodes
sudo mv ~/munge.key /etc/munge/munge.key
sudo chown munge:munge /etc/munge/munge.key
sudo chmod 0400 /etc/munge/munge.key
```

```bash!
## start munge (on every node)
sudo systemctl enable munge
sudo systemctl start munge

## check its status
sudo systemctl status munge
```
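A quick way to confirm that authentication works across nodes is a credential round trip. This is only a sketch: it assumes munge is already running everywhere and reuses the `c0` SSH alias from the scp commands above.

```bash!
# on the control node
munge -n | unmunge            # local encode/decode, should report STATUS: Success (0)
munge -n | ssh c0 unmunge     # decode on a compute node, verifies both share the same key
```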
### slurm.conf tools

- [Full configurator](https://slurm.schedmd.com/configurator.html)
- [Simple configurator](https://slurm.schedmd.com/configurator.easy.html)
- command

```bash
sudo vim /etc/slurm/slurm.conf
```

- example config below: adjust it for your own cluster (at least the hostnames), then put the same file on every node

```bash=
# slurm.conf file generated by configurator.html.
# Put this file on all nodes of your cluster.
# See the slurm.conf man page for more information.
#
ClusterName=miniTest
SlurmctldHost=ubuntu-master
#SlurmctldHost=
#
#DisableRootJobs=NO
#EnforcePartLimits=NO
#Epilog=
#EpilogSlurmctld=
#FirstJobId=1
#MaxJobId=67043328
#GresTypes=
#GroupUpdateForce=0
#GroupUpdateTime=600
#JobFileAppend=0
#JobRequeue=1
#JobSubmitPlugins=lua
#KillOnBadExit=0
#LaunchType=launch/slurm
#Licenses=foo*4,bar
#MailProg=/bin/mail
#MaxJobCount=10000
#MaxStepCount=40000
#MaxTasksPerNode=512
#MpiDefault=
#MpiParams=ports=#-#
#PluginDir=
#PlugStackConfig=
#PrivateData=jobs
ProctrackType=proctrack/cgroup
#Prolog=
#PrologFlags=
#PrologSlurmctld=
#PropagatePrioProcess=0
#PropagateResourceLimits=
#PropagateResourceLimitsExcept=
#RebootProgram=
ReturnToService=1
SlurmctldPidFile=/var/run/slurmctld.pid
SlurmctldPort=6817
SlurmdPidFile=/var/run/slurmd.pid
SlurmdPort=6818
SlurmdSpoolDir=/var/spool/slurmd
SlurmUser=slurm
#SlurmdUser=root
#SrunEpilog=
#SrunProlog=
StateSaveLocation=/var/spool/slurmctld
#SwitchType=
#TaskEpilog=
TaskPlugin=task/affinity,task/cgroup
#TaskProlog=
#TopologyPlugin=topology/tree
#TmpFS=/tmp
#TrackWCKey=no
#TreeWidth=
#UnkillableStepProgram=
#UsePAM=0
#
#
# TIMERS
#BatchStartTimeout=10
#CompleteWait=0
#EpilogMsgTime=2000
#GetEnvTimeout=2
#HealthCheckInterval=0
#HealthCheckProgram=
InactiveLimit=0
KillWait=30
#MessageTimeout=10
#ResvOverRun=0
MinJobAge=300
#OverTimeLimit=0
SlurmctldTimeout=120
SlurmdTimeout=300
#UnkillableStepTimeout=60
#VSizeFactor=0
Waittime=0
#
#
# SCHEDULING
#DefMemPerCPU=0
#MaxMemPerCPU=0
#SchedulerTimeSlice=30
SchedulerType=sched/backfill
SelectType=select/cons_tres
#
#
# JOB PRIORITY
#PriorityFlags=
#PriorityType=priority/multifactor
#PriorityDecayHalfLife=
#PriorityCalcPeriod=
#PriorityFavorSmall=
#PriorityMaxAge=
#PriorityUsageResetPeriod=
#PriorityWeightAge=
#PriorityWeightFairshare=
#PriorityWeightJobSize=
#PriorityWeightPartition=
#PriorityWeightQOS=
#
#
# LOGGING AND ACCOUNTING
#AccountingStorageEnforce=0
#AccountingStorageHost=
#AccountingStoragePort=
#AccountingStorageType=
#AccountingStoreFlags=
#JobCompHost=
#JobCompLoc=
#JobCompParams=
#JobCompPass=
#JobCompPort=
JobCompType=jobcomp/none
#JobCompUser=
#JobContainerType=
JobAcctGatherFrequency=30
#JobAcctGatherType=
SlurmctldDebug=info
SlurmctldLogFile=/var/log/slurm/slurmctld.log
SlurmdDebug=info
SlurmdLogFile=/var/log/slurmd.log
#SlurmSchedLogFile=
#SlurmSchedLogLevel=
#DebugFlags=
#
#
# POWER SAVE SUPPORT FOR IDLE NODES (optional)
#SuspendProgram=
#ResumeProgram=
#SuspendTimeout=
#ResumeTimeout=
#ResumeRate=
#SuspendExcNodes=
#SuspendExcParts=
#SuspendRate=
#SuspendTime=
#
#
# COMPUTE NODES
NodeName=ubuntu CPUs=1 RealMemory=1024 State=UNKNOWN
NodeName=ubuntu-1 CPUs=1 RealMemory=1024 State=UNKNOWN
PartitionName=debug Nodes=ALL Default=YES MaxTime=INFINITE State=UP
```

- on the control node

```bash!
# log directory
sudo mkdir -p /var/log/slurm
sudo chown slurm:slurm /var/log/slurm
sudo chmod 755 /var/log/slurm

# state / spool directory
sudo mkdir -p /var/spool/slurmctld
sudo chown slurm:slurm /var/spool/slurmctld
sudo chmod 755 /var/spool/slurmctld

sudo systemctl start slurmctld
sudo systemctl status slurmctld
```

- on the compute nodes

```bash!
sudo mkdir -p /var/spool/slurmd
sudo mkdir -p /var/log/slurm
sudo chown -R slurm:slurm /var/spool/slurmd /var/log/slurm
sudo chmod 755 /var/spool/slurmd /var/log/slurm

sudo systemctl start slurmd
sudo systemctl enable slurmd
sudo systemctl status slurmd
```
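Once slurmctld and slurmd are running, a quick sanity check from the control/login node might look like this (just a sketch, assuming the two example compute nodes `ubuntu` and `ubuntu-1` from the config above):

```bash!
sinfo                  # both nodes should eventually show as idle in the debug partition
scontrol show nodes    # per-node details, including any Reason= set by the controller
srun -N2 hostname      # runs hostname on both compute nodes and prints their names
```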
### With GPU?

- check that the GPU driver is alive (e.g. with `nvidia-smi`); if not, fix that first!

![image](https://hackmd.io/_uploads/r1l55qPPIWe.png)

- create gres.conf

```bash
sudo vim /etc/slurm/gres.conf
```

```bash
NodeName=aaslab4070-System-Product-Name Name=gpu File=/dev/nvidia0
```

- note that `gres.conf` alone is not enough: `slurm.conf` also needs `GresTypes=gpu`, and that node's `NodeName` line needs a matching `Gres=gpu:1` entry

### success!!

![image](https://hackmd.io/_uploads/SkxhmpjSbx.png)

- multi-task? check with squeue!!

![image](https://hackmd.io/_uploads/SyW_tvPL-e.png)

---

## Appendix

### OrbStack stuff

- why is this needed? Although they are called VMs, OrbStack machines are actually Docker-based with some extras added to make them behave like virtual machines, so they come with no user/password.
- So when you try to SSH from the control node to a compute node, there is no user or password to log in with.
- create the users and give them passwords

```bash!
useradd -m compute
passwd compute
usermod -aG sudo compute
```

```bash!
useradd -m controll
passwd controll
usermod -aG sudo controll
```

- add an SSH config on the controller (`~/.ssh/config`)

```bash!
Host c0
    # use your own IP; check it with `ip a`
    HostName 192.168.139.77
    User compute

Host c1
    # use your own IP; check it with `ip a`
    HostName 192.168.139.38
    User compute
```

- command: since the machines are Docker-based, the session is not started with bash, so run

```bash!
bash
```

### vimrc

- a vim config for you to use

```vim
syntax enable
syntax on
set ts=4
set expandtab
set shiftwidth=4
set autoindent
set number
set relativenumber
```

---

## Troubleshoot

### all active but nothing happens?

- Question: the commands below show that everything is active

```bash
# on the control node
sudo systemctl status slurmctld

# on the compute nodes
sudo systemctl status slurmd
```

- however, it shows

![image](https://hackmd.io/_uploads/ryenesjH-g.png)

- Answer 1: you might need to add the host entries on the control node!

```bash
sudo vim /etc/hosts
```

```bash
192.168.139.233 ubuntu-master
192.168.139.77  ubuntu
192.168.139.38  ubuntu-1
```

- Answer 2: check whether your munge.key is the same on every node
  - ![image](https://hackmd.io/_uploads/SJ7ZUnoBWx.png)
  - with the commands

```bash
sudo ls -l /etc/munge/munge.key
sudo md5sum /etc/munge/munge.key
```

### munge permission denied?

- ![image](https://hackmd.io/_uploads/r1dTZpirbx.png)
- check the situation: if `/var` or `/var/log` lack the execute (`--x`) bit for others, you need to grant it

```bash
namei -l /var/log/munge/munged.log
```

- let everyone traverse /var/log

```bash
sudo chmod 711 /var/log
```

- make the munge user/group able to write

```bash
sudo mkdir -p /var/log/munge
sudo chown munge:munge /var/log/munge
sudo chmod 2770 /var/log/munge
```

- set the permissions of the log file

```bash
sudo touch /var/log/munge/munged.log
sudo chown munge:munge /var/log/munge/munged.log
sudo chmod 0640 /var/log/munge/munged.log
```

- restart munge

```bash
sudo systemctl restart munge
sudo systemctl status munge --no-pager
```

- munge should now be able to write to its log

### everything is correct, but it still acts weird?

- nodes sometimes stay down/drained after a restart; resume them manually

```bash
sudo scontrol update NodeName=<node_name> State=RESUME
```
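Before forcing a resume, it can help to see why Slurm marked the node down. A short check, using the `ubuntu-1` node name from the example config, might be:

```bash
sinfo -R                                        # list down/drained nodes with the recorded Reason
scontrol show node ubuntu-1 | grep -i reason    # inspect a single node's recorded reason, if any
```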