# Slurm Installation and Configuration
- You can use vim/nano (or any editor you like) to create the scripts below and paste the contents in.
- This page is adapted from [HPC Docs - Building and Installing Slurm](https://www.notion.so/sdc-nycu/Building-and-Installing-Slurm-1e27dadd80408063b5bff22abe230159).
## Scripts
### Initial installation
- command 0: create the script
```bash!
nano initInstall.sh
```
- script
- Just paste the following into the file:
```bash!
#!/bin/bash
# build tools, editors, and SSH server
sudo apt install -y lbzip2 build-essential fakeroot devscripts equivs wget nano vim openssh-server
# download and unpack the Slurm source into /opt
cd /opt
sudo wget https://download.schedmd.com/slurm/slurm-25.11.2.tar.bz2
sudo tar -xaf slurm*tar.bz2
cd /opt/slurm-25.11.2
# install the build dependencies declared in debian/control
sudo mk-build-deps -i debian/control
# build the unsigned .deb packages (they end up in /opt)
sudo debuild -b -uc -us
sudo dpkg-buildpackage -T clean
# make sure the MUNGE libraries are present
sudo apt-get install --reinstall libmunge-dev libmunge2
```
- command 1: run the script
```bash!
chmod +x initInstall.sh
./initInstall.sh
```
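If the build succeeds, the unsigned `.deb` packages land in `/opt`, next to the extracted source tree. A quick sanity check before moving on (the exact filenames depend on the version you downloaded):
```bash
# the .deb packages should be sitting in /opt after debuild finishes
ls -lh /opt/slurm-smd*.deb
```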
## Node installation
| RPM name | DEB name | Login | Controller | Compute | DBD |
|-----------------|-----------------------|-------|------------|---------|-----|
| slurm | slurm-smd | X | X | X | X |
| slurm-perlapi | slurm-smd-client | X | X | X | |
| slurm-slurmctld | slurm-smd-slurmctld | | X | | |
| slurm-slurmd | slurm-smd-slurmd | | | X | |
| slurm-slurmdbd | slurm-smd-slurmdbd | | | | X |
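If you later want to confirm which of these packages actually ended up on a given node, a plain `dpkg` query is enough (nothing Slurm-specific here):
```bash
# list installed slurm-smd* packages and their versions on this node
dpkg -l 'slurm-smd*' | grep '^ii'
```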
### Controller Node Scripts
#### commands
```bash!
# paste the script for your platform (see the per-platform scripts below) into the editor
nano ControllerNode.sh
chmod +x ControllerNode.sh
./ControllerNode.sh
```
#### Per-platform scripts
- amd64 (x86_64)
```bash!
#!/bin/bash
cd /opt
sudo apt install ./slurm-smd_25.11.2-1_amd64.deb
sudo apt install ./slurm-smd-client_25.11.2-1_amd64.deb
sudo apt install ./slurm-smd-slurmctld_25.11.2-1_amd64.deb
```
- arm64 (aarch64)
```bash!
#!/bin/bash
cd /opt
sudo apt install ./slurm-smd_25.11.2-1_arm64.deb
sudo apt install ./slurm-smd-client_25.11.2-1_arm64.deb
sudo apt install ./slurm-smd-slurmctld_25.11.2-1_arm64.deb
```
### Login Node Scripts
If you are building a small, self-used cluster, you can simply use the controller node as the login node.
#### commands
```bash!
# paste the script for your platform (see the per-platform scripts below) into the editor
nano LoginNode.sh
chmod +x LoginNode.sh
./LoginNode.sh
```
#### Per-platform scripts
- amd64 (x86_64)
```bash!
#!/bin/bash
cd /opt
sudo apt install ./slurm-smd_25.11.2-1_amd64.deb
sudo apt install ./slurm-smd-client_25.11.2-1_amd64.deb
```
- arm64 (aarch64)
```bash!
#!/bin/bash
cd /opt
sudo apt install ./slurm-smd_25.11.2-1_arm64.deb
sudo apt install ./slurm-smd-client_25.11.2-1_arm64.deb
```
### Compute Node Scripts
#### commands
```bash!
# paste the script for your platform (see the per-platform scripts below) into the editor
nano ComputeNode.sh
chmod +x ComputeNode.sh
./ComputeNode.sh
```
#### Per-platform scripts
- amd64 (x86_64)
```bash!
#!/bin/bash
cd /opt
sudo apt install ./slurm-smd_25.11.2-1_amd64.deb
sudo apt install ./slurm-smd-client_25.11.2-1_amd64.deb
sudo apt install ./slurm-smd-slurmd_25.11.2-1_amd64.deb
```
- arm64 (aarch64)
```bash!
#!/bin/bash
cd /opt
sudo apt install ./slurm-smd_25.11.2-1_arm64.deb
sudo apt install ./slurm-smd-client_25.11.2-1_arm64.deb
sudo apt install ./slurm-smd-slurmd_25.11.2-1_arm64.deb
```
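After installing, you can confirm each node type reports the expected version (25.11.2 in this guide):
```bash
# controller node
slurmctld -V
# compute node
slurmd -V
# login node (client tools)
sinfo --version
```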
---
## Slurm Configuration
- These steps need to be done on every node.
### Time Sync
```bash!
sudo apt update
sudo apt install -y chrony
sudo systemctl enable --now chrony
timedatectl
chronyc sources -v
```
### User/Group Sync
- Simple: create the users locally on each node (the UIDs/GIDs must be identical on every node).
```bash!
export MUNGEUSER=1005
sudo groupadd -g $MUNGEUSER munge
sudo useradd -m -c "MUNGE Uid 'N' Gid Emporium" -d /var/lib/munge -u $MUNGEUSER -g munge -s /sbin/nologin munge
export SlurmUSER=1001
sudo groupadd -g $SlurmUSER slurm
sudo useradd -m -c "Slurm workload manager" -d /var/lib/slurm -u $SlurmUSER -g slurm -s /bin/bash slurm
```
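The whole point of the script above is that the `munge` and `slurm` UIDs/GIDs match across the cluster; a quick way to verify is to compare `id` output on each machine:
```bash
# run on every node; the uid/gid numbers must be identical everywhere
id munge
id slurm
```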
- Advanced: use LDAP (or another directory service) to keep users and groups in sync across all nodes.
### ONLY DO ON THE CONTROLLER NODE!!!
- Generate the MUNGE key:
```bash!
sudo dd if=/dev/urandom bs=1 count=1024 of=/etc/munge/munge.key
sudo chown munge:munge /etc/munge/munge.key
sudo chmod 0400 /etc/munge/munge.key
```
- Send the key to the compute nodes!!!
- [Chinese scp tutorial](https://blog.gtwang.org/linux/linux-scp-command-tutorial-examples/), or just run `man scp` to learn it.
- If your user is not in sudoers, switch to root somehow (e.g. `su -`) and run:
```bash!
usermod -aG sudo compute
```
```bash!
# on the controller node
sudo cp /etc/munge/munge.key ~/munge.key
sudo chown ryansuc:ryansuc ~/munge.key   # replace ryansuc with your own user
chmod 600 ~/munge.key
scp ~/munge.key c0:~/munge.key
scp ~/munge.key c1:~/munge.key
# on all three nodes: verify the copies match
ls -l ~/munge.key
md5sum ~/munge.key
```
```bash!
# on the compute nodes: move the key into place
sudo mv ~/munge.key /etc/munge/munge.key
sudo chown munge:munge /etc/munge/munge.key
sudo chmod 0400 /etc/munge/munge.key
```
```bash!
## start MUNGE (on every node)
sudo systemctl enable munge
sudo systemctl start munge
## check its status
sudo systemctl status munge
```
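Having the service "active" is not the same as authentication working. A round-trip credential test confirms MUNGE end to end; the `c0` alias below is the SSH host alias defined in the appendix, so substitute your own compute node name.
```bash
# local test: create a credential and decode it on the same node
munge -n | unmunge
# cross-node test: create a credential on the controller, decode it on a compute node
munge -n | ssh c0 unmunge
```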
### slurm.conf tools
- [Full](https://slurm.schedmd.com/configurator.html)
- [Simple](https://slurm.schedmd.com/configurator.easy.html)
- command
```bash
sudo vim /etc/slurm/slurm.conf
```
- Config (example; adjust it for your own cluster, at the very least the hostnames):
```bash=
# slurm.conf file generated by configurator.html.
# Put this file on all nodes of your cluster.
# See the slurm.conf man page for more information.
#
ClusterName=miniTest
SlurmctldHost=ubuntu-master
#SlurmctldHost=
#
#DisableRootJobs=NO
#EnforcePartLimits=NO
#Epilog=
#EpilogSlurmctld=
#FirstJobId=1
#MaxJobId=67043328
#GresTypes=
#GroupUpdateForce=0
#GroupUpdateTime=600
#JobFileAppend=0
#JobRequeue=1
#JobSubmitPlugins=lua
#KillOnBadExit=0
#LaunchType=launch/slurm
#Licenses=foo*4,bar
#MailProg=/bin/mail
#MaxJobCount=10000
#MaxStepCount=40000
#MaxTasksPerNode=512
#MpiDefault=
#MpiParams=ports=#-#
#PluginDir=
#PlugStackConfig=
#PrivateData=jobs
ProctrackType=proctrack/cgroup
#Prolog=
#PrologFlags=
#PrologSlurmctld=
#PropagatePrioProcess=0
#PropagateResourceLimits=
#PropagateResourceLimitsExcept=
#RebootProgram=
ReturnToService=1
SlurmctldPidFile=/var/run/slurmctld.pid
SlurmctldPort=6817
SlurmdPidFile=/var/run/slurmd.pid
SlurmdPort=6818
SlurmdSpoolDir=/var/spool/slurmd
SlurmUser=slurm
#SlurmdUser=root
#SrunEpilog=
#SrunProlog=
StateSaveLocation=/var/spool/slurmctld
#SwitchType=
#TaskEpilog=
TaskPlugin=task/affinity,task/cgroup
#TaskProlog=
#TopologyPlugin=topology/tree
#TmpFS=/tmp
#TrackWCKey=no
#TreeWidth=
#UnkillableStepProgram=
#UsePAM=0
#
#
# TIMERS
#BatchStartTimeout=10
#CompleteWait=0
#EpilogMsgTime=2000
#GetEnvTimeout=2
#HealthCheckInterval=0
#HealthCheckProgram=
InactiveLimit=0
KillWait=30
#MessageTimeout=10
#ResvOverRun=0
MinJobAge=300
#OverTimeLimit=0
SlurmctldTimeout=120
SlurmdTimeout=300
#UnkillableStepTimeout=60
#VSizeFactor=0
Waittime=0
#
#
# SCHEDULING
#DefMemPerCPU=0
#MaxMemPerCPU=0
#SchedulerTimeSlice=30
SchedulerType=sched/backfill
SelectType=select/cons_tres
#
#
# JOB PRIORITY
#PriorityFlags=
#PriorityType=priority/multifactor
#PriorityDecayHalfLife=
#PriorityCalcPeriod=
#PriorityFavorSmall=
#PriorityMaxAge=
#PriorityUsageResetPeriod=
#PriorityWeightAge=
#PriorityWeightFairshare=
#PriorityWeightJobSize=
#PriorityWeightPartition=
#PriorityWeightQOS=
#
#
# LOGGING AND ACCOUNTING
#AccountingStorageEnforce=0
#AccountingStorageHost=
#AccountingStoragePort=
#AccountingStorageType=
#AccountingStoreFlags=
#JobCompHost=
#JobCompLoc=
#JobCompParams=
#JobCompPass=
#JobCompPort=
JobCompType=jobcomp/none
#JobCompUser=
#JobContainerType=
JobAcctGatherFrequency=30
#JobAcctGatherType=
SlurmctldDebug=info
SlurmctldLogFile=/var/log/slurm/slurmctld.log
SlurmdDebug=info
SlurmdLogFile=/var/log/slurm/slurmd.log
#SlurmSchedLogFile=
#SlurmSchedLogLevel=
#DebugFlags=
#
#
# POWER SAVE SUPPORT FOR IDLE NODES (optional)
#SuspendProgram=
#ResumeProgram=
#SuspendTimeout=
#ResumeTimeout=
#ResumeRate=
#SuspendExcNodes=
#SuspendExcParts=
#SuspendRate=
#SuspendTime=
#
#
# COMPUTE NODES
NodeName=ubuntu CPUs=1 RealMemory=1024 State=UNKNOWN
NodeName=ubuntu-1 CPUs=1 RealMemory=1024 State=UNKNOWN
PartitionName=debug Nodes=ALL Default=YES MaxTime=INFINITE State=UP
```
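To fill in accurate `CPUs`/`RealMemory` values for the `NodeName` lines above, you can run `slurmd -C` on each compute node; it prints the hardware it detects in slurm.conf syntax, ready to paste in.
```bash
# run on each compute node and copy the printed NodeName line into slurm.conf
slurmd -C
```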
- Controller node
```bash!
# log
sudo mkdir -p /var/log/slurm
sudo chown slurm:slurm /var/log/slurm
sudo chmod 755 /var/log/slurm
# state / spool
sudo mkdir -p /var/spool/slurmctld
sudo chown slurm:slurm /var/spool/slurmctld
sudo chmod 755 /var/spool/slurmctld
sudo systemctl enable slurmctld
sudo systemctl start slurmctld
sudo systemctl status slurmctld
```
- Compute nodes
```bash!
sudo mkdir -p /var/spool/slurmd
sudo mkdir -p /var/log/slurm
sudo chown -R slurm:slurm /var/spool/slurmd /var/log/slurm
sudo chmod 755 /var/spool/slurmd /var/log/slurm
sudo systemctl start slurmd
sudo systemctl enable slurmd
sudo systemctl status slurmd
```
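Once slurmctld and slurmd are both running, a quick end-to-end check from the controller (or login) node; `-N2` assumes the two compute nodes from the example config above:
```bash
# the nodes should show up as idle in the debug partition
sinfo
# run a trivial job on both compute nodes
srun -N2 hostname
```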
### With GPU?
- Check that the GPU driver is alive (e.g. with `nvidia-smi`); if not, fix that first!

- create gres.conf
```bash
sudo vim /etc/slurm/gres.conf
```
```bash
NodeName=aaslab4070-System-Product-Name Name=gpu File=/dev/nvidia0
```
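gres.conf alone is not enough: slurm.conf also has to declare the GRES type and attach it to the node. A minimal sketch, assuming a single GPU on the node named above; the `CPUs`/`RealMemory` values here are placeholders, keep whatever you already use for that node (e.g. from `slurmd -C`).
```bash
# additions to /etc/slurm/slurm.conf, then restart slurmctld and slurmd
GresTypes=gpu
NodeName=aaslab4070-System-Product-Name Gres=gpu:1 CPUs=1 RealMemory=1024 State=UNKNOWN
```
After restarting the daemons, `srun --gres=gpu:1 nvidia-smi` should land on the GPU node.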
### success!!

- Running multiple tasks? Check with `squeue` (see the example below)!
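A simple way to watch several jobs flow through the queue, assuming the `debug` partition from the example slurm.conf:
```bash
# submit a few short dummy jobs
for i in 1 2 3; do sbatch --wrap "sleep 60"; done
# see them running / pending
squeue
```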

---
## Appendix
### OrbStack stuff
- Why is this needed? Although OrbStack calls these machines VMs, they are Docker-based with extra tooling bolted on to make them behave like virtual machines, so they come without a user/password.
- So when you try to SSH from the controller node to a compute node, there is no user and password to log in with.
- Create the users and give them passwords:
```bash!
useradd -m compute
passwd compute
usermod -aG sudo compute
```
```bash!
useradd -m controll
passwd controll
usermod -aG sudo controll
```
- add ssh config in controller
```bash!
# replace the IPs with your own (check them with `ip a`)
Host c0
    HostName 192.168.139.77
    User compute
Host c1
    HostName 192.168.139.38
    User compute
```
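After saving the config, test it from the controller; each command should print the compute node's hostname without asking for a username:
```bash
ssh c0 hostname
ssh c1 hostname
```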
- Command: since the machines are Docker-based, the shell may not start as bash by default; switch to it with:
```bash!
bash
```
### vimrc
- A vim config you can use (put it in `~/.vimrc`):
```bash
syntax enable
syntax on
set ts=4
set expandtab
set shiftwidth=4
set autoindent
set number
set relativenumber
```
---
## Troubleshooting
### All active but nothing happens?
- Question:
- You run the commands below and every service shows as active:
```bash
# on control node
sudo systemctl status slurmctld
# on compute node
sudo systemctl status slurmd
```
- but the cluster still does not behave (e.g. the nodes never show up as available and jobs never start).
- Answer 1
- You might need to add host entries on the controller node (the compute nodes need to resolve these names too):
```bash
sudo vim /etc/hosts
```
```bash
192.168.139.233 ubuntu-master
192.168.139.77 ubuntu
192.168.139.38 ubuntu-1
```
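You can confirm the entries resolve, and that the names match `SlurmctldHost`/`NodeName` in slurm.conf:
```bash
# each name should resolve to the IP you put in /etc/hosts
getent hosts ubuntu-master ubuntu ubuntu-1
```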
- Answer 2
- Check whether the munge.key is identical on every node, with:
```bash
sudo ls -l /etc/munge/munge.key
sudo md5sum /etc/munge/munge.key
```
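To compare the key across all nodes in one shot (using the `c0`/`c1` SSH aliases from the appendix; this assumes passwordless sudo on the compute nodes, otherwise add `-t` to ssh), the checksums must be identical everywhere:
```bash
sudo md5sum /etc/munge/munge.key
for h in c0 c1; do ssh "$h" sudo md5sum /etc/munge/munge.key; done
```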
### MUNGE permission denied?
- Check the situation with `namei`: if `/var` or `/var/log` do not have the execute (`x`) bit for others, you need to grant it.
```bash
namei -l /var/log/munge/munged.log
```
- Let everyone traverse `/var/log`:
```bash
sudo chmod 711 /var/log
```
- Make the munge group able to write to its log directory:
```bash
sudo mkdir -p /var/log/munge
sudo chown munge:munge /var/log/munge
sudo chmod 2770 /var/log/munge
```
- Set the permissions on the log file:
```bash
sudo touch /var/log/munge/munged.log
sudo chown munge:munge /var/log/munge/munged.log
sudo chmod 0640 /var/log/munge/munged.log
```
- restart munge
```bash
sudo systemctl restart munge
sudo systemctl status munge --no-pager
```
- MUNGE should now be able to write to its log file.
### Everything looks correct, but it still acts weird?
```bash
# bring a DOWN/DRAINED node back into service
sudo scontrol update NodeName=<node_name> State=RESUME
```
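Before (or after) resuming, it helps to see why Slurm took the node out of service; `sinfo -R` lists the recorded reason and `scontrol show node` gives the full node state:
```bash
# reason each down/drained node was marked unavailable
sinfo -R
# detailed state of one node (replace <node_name>)
scontrol show node <node_name>
```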