###### tags: `env`
{%hackmd Eg_dogKwTYGxkdzZ3O5Gkg %}

# Slurm Installation

## slurm 23.02.2

### step1: download & install
Extremely easy: go to [NI SP](https://www.ni-sp.com/slurm-build-script-and-container-commercial-support/), download the [script](http://www.ni-sp.com/wp-content/uploads/2019/10/SLURM_Ubuntu_installation.sh), and just run it.
- For the 23.02.2 version, you should either
    - `export VER=23.02.2`, or
    - add `VER=23.02.2` at the top of the script

---

### step2: modify settings
After the script finishes, modify the config file `/etc/slurm/slurm.conf`. The following is an example:
```
SlurmctldHost=srv109(192.168.0.109)
MpiDefault=none
AuthType=auth/munge
ProctrackType=proctrack/cgroup
ReturnToService=2
SlurmctldPidFile=/run/slurmctld.pid
SlurmdPidFile=/run/slurmd.pid
SlurmdSpoolDir=/var/spool/slurm/slurmd
SlurmUser=slurm
StateSaveLocation=/var/spool/slurm/
SwitchType=switch/none
TaskPlugin=task/affinity
SchedulerType=sched/backfill
SelectType=select/cons_res
SelectTypeParameters=CR_Core
AccountingStorageType=accounting_storage/none
ClusterName=compute
JobAcctGatherType=jobacct_gather/none
SlurmctldLogFile=/var/log/slurmctld.log
SlurmdLogFile=/var/log/slurmd.log
NodeName=srv109 NodeAddr=192.168.0.109 State=UNKNOWN CPUs=24 Boards=1 SocketsPerBoard=1 CoresPerSocket=24 ThreadsPerCore=1 RealMemory=64081
NodeName=srv102 NodeAddr=192.168.0.102 State=UNKNOWN CPUs=12 Boards=1 SocketsPerBoard=1 CoresPerSocket=12 ThreadsPerCore=1 RealMemory=31385
PartitionName=compute Nodes=srv109,srv102 Default=YES MaxTime=INFINITE State=UP
```
- Note that if you do not specify the hardware specification of each node (CPUs, Boards, SocketsPerBoard, CoresPerSocket, ThreadsPerCore, ...etc), the default setting will only utilize one CPU per node.
- You can use the command `slurmd -C` to view the hardware specification automatically detected by slurmd. Take srv109 as an example:
![](https://hackmd.io/_uploads/rkPP0vaI3.png)
- If a node is designed with heterogeneous cores, the equation socket * core * thread does not equal the total number of CPUs. Taking srv109 as an example, it has an i7-13700 CPU with 8 P-cores and 8 E-cores; each P-core corresponds to 2 threads, while each E-core corresponds to 1 thread. Currently, there are two approaches (see the config sketch after this list):
    - Not using E-cores: the E-cores are not utilized, and only the P-cores are counted as the available processing units. In this case, the number of CPUs equals the number of P-cores, which is 8.
    - Treating all threads as individual cores: all threads, from both P-cores and E-cores, are treated as separate cores. The total number of CPUs then equals the number of threads, calculated as P-cores (8) * threads per P-core (2) + E-cores (8) * threads per E-core (1) = 24 cores.

    Please note that the term "CPU" can refer to either physical cores or logical threads depending on the context.
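As an illustration of these two approaches, here is a sketch of how the srv109 `NodeName` line could be written in `/etc/slurm/slurm.conf`. The values are assumptions derived from the i7-13700 layout described above, so check them against your own `slurmd -C` output. Also note that option 1 only lowers the CPU count that Slurm advertises; actually pinning jobs to the P-cores would need extra affinity configuration not covered here.
```
# Option 1 (assumed values): ignore the E-cores and expose only the 8 P-cores
NodeName=srv109 NodeAddr=192.168.0.109 State=UNKNOWN CPUs=8 Boards=1 SocketsPerBoard=1 CoresPerSocket=8 ThreadsPerCore=1 RealMemory=64081

# Option 2: treat every hardware thread as a core, 8*2 + 8*1 = 24 CPUs
# (this is what the example slurm.conf above uses)
NodeName=srv109 NodeAddr=192.168.0.109 State=UNKNOWN CPUs=24 Boards=1 SocketsPerBoard=1 CoresPerSocket=24 ThreadsPerCore=1 RealMemory=64081
```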
---

### step3: auth/munge
You need a key that can be used by all nodes, both slurmctld and slurmd.
- If you followed the previous steps for installation, choose the key from one of the nodes (`/etc/munge/munge.key`),
- or you can generate a new one:
```bash
sudo rm /etc/munge/munge.key
sudo /usr/sbin/mungekey
```
It will generate a key `/etc/munge/munge.key`.
- Copy this key to the folder `/etc/munge/` of all nodes, and change the ownership and the permissions of the key:
```bash
sudo chmod 400 /etc/munge/munge.key
sudo chown munge: /etc/munge/munge.key
```

### step4: restart munge & slurm services
You should restart these two services on all nodes, and munge should be restarted before Slurm:
```bash
systemctl restart munge
systemctl restart slurmctld
systemctl restart slurmd
```

## Prescription
- no guarantee of effectiveness
- `STATE=drain`
    - wake it up by
```bash
sudo scontrol update nodename=YOUR_NODE_NAME state=resume
```
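After restarting the services, and before (or instead of) blindly resuming a drained node, it can help to check the cluster state. This is a minimal sketch using standard Slurm commands; `srv109` is just the example node name from the config above.
```bash
# show partition and node states; healthy nodes report "idle" or "mix", not "down"/"drain"
sinfo

# inspect a single node; for a drained node the Reason= field explains why
scontrol show node srv109

# quick end-to-end test: run hostname on both example nodes through Slurm
srun -N2 hostname
```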