# Delft RaspberryPi HPC workshop notes
# Workshop 1: Building a RaspberryPi cluster
:::info
Second part of this workshop can be found here:
[Workshop 2: Parallel Computing with a RaspberryPi cluster](https://hackmd.io/lPtoZMKjSJ-oGiGCzObtLQ)
:::
![](https://i.imgur.com/E2cv6iB.jpg)
**Notes prepared by:** Dennis Palagin, Jose Urra Llanusa, Jerry de Vos, Maurits Kok, Santosh Ilamparuthi, Andjelika Tomic, Manuel Garcia Alvarez, Arco Van Geest
**Event description: https://www.eventbrite.co.uk/e/delft-open-hardware-hack-and-play-working-with-a-raspberrypi-cluster-tickets-169068136347**
<iframe src="https://docs.google.com/presentation/d/e/2PACX-1vQE2eRRAoXbhbQKGnTRw6Ekc-dBVktXxwKVUCxZqW4Pg7IC2S9x3yAtbgi_9U2FSA/embed?start=false&loop=false&delayms=3000" frameborder="0" width="960" height="749" allowfullscreen="true" mozallowfullscreen="true" webkitallowfullscreen="true"></iframe>
**What we will do:** Building a Raspberry Pi cluster with one Pi 3 controller node and two Pi Zero compute nodes.
::: info
**By the end of this workshop you will**
- Have set up a Slurm cluster with two nodes, ready to receive computational job requests and run them.
**Learning objectives**
- Build a cluster computer using Raspberry Pis to introduce architectural aspects of clusters and supercomputers.
- Get exposed to Linux operations for working with remote computers via SSH, and practice with the command line.
- Get familiar with the general workflow for performing parallel computations on a cluster computer.
- Learn how to manage processes and allocate resources (memory, CPUs, etc.).
**Activities and tasks you will go through:**
- Use the command line
- Remotely access computers and perform tasks using SSH
- Submit jobs to Slurm
- Transfer files from your working environment to a supercomputer
- Check the queue and learn the commands for submitting to it
- Write submission scripts
- After submitting a job, see which node it is running on and access the output files
- Retrieve the output files and visualise the results
:::
We are generally following these tutorials:
[The Missing ClusterHat Tutorial Part 1 by Davin L.](https://medium.com/@dhuck/the-missing-clusterhat-tutorial-45ad2241d738)
[Building a Raspberry Pi cluster tutorial Part 2 by Garrett Mills](https://glmdev.medium.com/building-a-raspberry-pi-cluster-aaa8d1f3d2ca)
[Building a Raspberry Pi cluster tutorial Part 3 by Garrett Mills](https://glmdev.medium.com/building-a-raspberry-pi-cluster-f5f2446702e8)
ClusterHAT software and documentation:
[ClusterCTRL.com](https://clusterctrl.com/setup-software)
Slurm documentation:
[docs.massopen.cloud](https://docs.massopen.cloud/en/latest/hpc/Slurm.html)
Further reading for the curious:
[HPC intro carpentry](https://carpentries-incubator.github.io/hpc-intro/)
[HPC basics presentation](https://docs.google.com/presentation/d/10A0_0eNRBYd87E1h1YN6bsIFaZaua5qJkfBbnBKAr6o/present?slide=id.p)
# Cluster setup
## Installation images
1. Download the OS images:
    - [OS image for the controller node](https://dist1.8086.net/clusterctrl/buster/2020-12-02/2020-12-02-1-ClusterCTRL-armhf-full-CNAT.zip)
    - [Image for node P1](https://dist.8086.net/clusterctrl/buster/2020-12-02/2020-12-02-1-ClusterCTRL-armhf-lite-p1.zip)
    - [Image for node P2](https://dist.8086.net/clusterctrl/buster/2020-12-02/2020-12-02-1-ClusterCTRL-armhf-lite-p2.zip)
    More details: https://clusterctrl.com/setup-software
2. Install [**Etcher**](https://www.balena.io/etcher/) and flash the above images to empty microSD cards. The image for the controller needs to be flashed onto a 32 GB microSD card; the node images can be flashed onto 16 GB microSD cards. This step requires admin rights on your system.
## Working headless with the controller over SSH
:::info
Enabling SSH allows you to access a computer remotely from another computer.
:::
3. Enable SSH on the flashed images for the **controller** and the **Pi Zeros**. To do this, create an empty file named “ssh” in the “boot” folder (the root folder of the microSD card as it appears when mounted). Each node in the cluster must have SSH enabled for the cluster to operate.
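For example, on a Linux machine with the freshly flashed card still mounted, the file can be created like this (a minimal sketch; the mount point `/media/$USER/boot` is an assumption and varies per operating system):
```
# Create the empty "ssh" file on the boot partition of the flashed card
touch /media/$USER/boot/ssh
```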
4. Assemble the Pi cluster with the microSD card in the controller node (Pi 3), connect the cluster to your local network with an Ethernet cable. Power on the cluster. More details: https://clusterctrl.com/setup-assembly
5. Find Raspberry Pi’s IP address. For this, install [Advanced IP Scanner](https://www.advanced-ip-scanner.com/) and scan your network. The device with a manufacturer “Raspberry Pi Foundation” is your Raspberry Pi:
![](https://i.imgur.com/uEZNBhI.png)
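If you prefer the command line over a GUI scanner, a ping sweep with nmap works too (a sketch; the 192.168.178.0/24 subnet is an assumption taken from the example in the next step, adjust it to your network):
```
# Ping scan of the local subnet; look for a host whose MAC vendor is "Raspberry Pi"
sudo nmap -sn 192.168.178.0/24
```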
6. Now you can ssh to your Raspberry Pi, for example with [Bitvise SSH Client](https://www.bitvise.com/ssh-client-download). In my case, I use the Host “192.168.178.24”, as identified in the previous step. Use port “22”, username: “pi”, and password: “clusterctrl”.
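Any plain SSH client works as well; for example, from a terminal on Linux, macOS, or a recent Windows (the IP address below is the one found in the previous step, yours will differ):
```
# Connect to the controller; when prompted, use the default password "clusterctrl"
ssh pi@192.168.178.24
```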
#### 7. Change password and set correct Locale on the master Pi, [read more on the tutorial]( https://medium.com/@dhuck/the-missing-clusterhat-tutorial-45ad2241d738)
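If you only want to change the password quickly, the standard `passwd` command on the controller does the job (the locale is set up again in step 15 below):
```
# Run on the controller while logged in as user "pi"
passwd
```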
## If you only have one SD card
:::info
Enabling USB boot for the nodes. If you have an SD card in each node, ignore these steps.
:::
#### 8. Download the image to be used for the Pi Zero nodes: [more details here](https://8086.support/index.php?action=faq&cat=23&id=97&artlang=en), [direct download here](https://dist1.8086.net/clusterctrl/usbboot/buster/2020-12-02/2020-12-02-1-ClusterCTRL-armhf-lite-usbboot.tar.xz)
#### 9. Copy the tar.xz file to the Pi, for example to /home/pi/Documents
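One way to do this is `scp` from the machine where you downloaded the archive (a sketch; the controller IP below is the one found earlier and is an assumption, adjust it to yours):
```
# Copy the usbboot archive to the controller's Documents folder
scp 2020-12-02-1-ClusterCTRL-armhf-lite-usbboot.tar.xz pi@192.168.178.24:/home/pi/Documents/
```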
#### 10. Extract and copy the contents of the tar.xz file to the folders containing the file systems of your respective Pi Zero nodes /var/lib/clusterctrl/nfs/p1/, /var/lib/clusterctrl/nfs/p2/, /var/lib/clusterctrl/nfs/p3/, and /var/lib/clusterctrl/nfs/p4/, i.e.:
```
sudo tar -axf 2020-12-02-1-ClusterCTRL-armhf-lite-usbboot.tar.xz -C /var/lib/clusterctrl/nfs/p1/
sudo tar -axf 2020-12-02-1-ClusterCTRL-armhf-lite-usbboot.tar.xz -C /var/lib/clusterctrl/nfs/p2/
sudo tar -axf 2020-12-02-1-ClusterCTRL-armhf-lite-usbboot.tar.xz -C /var/lib/clusterctrl/nfs/p3/
sudo tar -axf 2020-12-02-1-ClusterCTRL-armhf-lite-usbboot.tar.xz -C /var/lib/clusterctrl/nfs/p4/
```
#### 11. Configure the USB boot for all Pi Zero nodes:
```
sudo usbboot-init 1
sudo usbboot-init 2
sudo usbboot-init 3
sudo usbboot-init 4
```
#### 12. Before powering on the Pi Zero, enable SSH by creating the "ssh" file in the "/boot" directory of each Pi Zero:
```
sudo touch /var/lib/clusterctrl/nfs/p1/boot/ssh
sudo touch /var/lib/clusterctrl/nfs/p2/boot/ssh
sudo touch /var/lib/clusterctrl/nfs/p3/boot/ssh
sudo touch /var/lib/clusterctrl/nfs/p4/boot/ssh
```
## Powering up the cluster and checking that SSH is working
:::info
Let's check the basic SSH functionality now.
:::
#### 13. Power on Pi Zero nodes:
```
clusterhat on
```
Alternatively you can power on each node individually
```
clusterctrl on p1
clusterctrl on p2
clusterctrl on p3
clusterctrl on p4
```
#### 14. Setting up SSH to Pi Zero nodes [more details here](https://xaviergeerinck.com/post/infrastructure/clusterhat-setup):
By default, the ClusterHAT exposes its USB hub interface on the controller at 172.19.181.254; the Pi Zeros can then be reached at IPs 172.19.181.1 – 172.19.181.4. To make our lives a bit easier, we set up host aliases in ~/.ssh/config (adjust to your number of nodes):
Note 1: in order to edit text files via SSH, use the following command:
```
nano ~/.ssh/config
```
Note 2: to save the file, press "Ctrl+X", then confirm with "Y".
Note 3: you might need to create the .ssh folder first. Use the following command:
```
mkdir -p ~/.ssh
```
Now, to editing!
```
Host p1
    HostName 172.19.181.1
    User pi
Host p2
    HostName 172.19.181.2
    User pi
Host p3
    HostName 172.19.181.3
    User pi
Host p4
    HostName 172.19.181.4
    User pi
```
By adding an SSH public key to the nodes, we can log in more quickly, without typing a password each time. So first create an SSH key with:
```
ssh-keygen -t rsa -b 4096
```
Use the default file location /home/pi/.ssh/id_rsa (just press Enter).
Use "clusterctrl" as the passphrase.
:::info
For every ssh command you need to type your passphrase.
To avoid this you can use ssh-agent.
Start ssh-agent:
```
eval $( ssh-agent )
```
Store your passphrase for 4 hours:
```
ssh-add -t 4h ~/.ssh/id_rsa
```
:::
**Consideration:** When you create a key with ssh-keygen, the access rights should be fine by default. If there are any problems with access rights, apply the following:
```
chmod 700 /home/pi/.ssh
chmod 600 /home/pi/.ssh/id_rsa
chmod 644 /home/pi/.ssh/id_rsa.pub
```
And copy it over to the Pi Zeros with:
```
cat ~/.ssh/id_rsa.pub | ssh pi@p1 -T "mkdir -p ~/.ssh && cat > ~/.ssh/authorized_keys"
cat ~/.ssh/id_rsa.pub | ssh pi@p2 -T "mkdir -p ~/.ssh && cat > ~/.ssh/authorized_keys"
cat ~/.ssh/id_rsa.pub | ssh pi@p3 -T "mkdir -p ~/.ssh && cat > ~/.ssh/authorized_keys"
cat ~/.ssh/id_rsa.pub | ssh pi@p4 -T "mkdir -p ~/.ssh && cat > ~/.ssh/authorized_keys"
```
Alternatively, you can use the `ssh-copy-id` command, to avoid complex syntax of the previous example:
```
ssh-copy-id -i ~/.ssh/id_rsa.pub pi@p1
ssh-copy-id -i ~/.ssh/id_rsa.pub pi@p2
ssh-copy-id -i ~/.ssh/id_rsa.pub pi@p3
ssh-copy-id -i ~/.ssh/id_rsa.pub pi@p4
```
After this step we can connect to our Pi Zeros without having to enter a password anymore! Just do `ssh p1`, `ssh p2`, `ssh p3`, or `ssh p4`.
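A quick way to verify this from the controller is to ask every node for its hostname in one go (a small sketch; trim the list to the nodes you actually have):
```
# Each node should answer with its own hostname, without prompting for a password
for node in p1 p2 p3 p4; do ssh "$node" hostname; done
```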
Congrats, the basic cluster functionality is there!
## Setting up locales, time zones, and synchronising time
:::info
This step is crucial for enabling automatic authentication for Munge and Slurm (see below).
:::
Note: *From now on, some of the instructions will say "on a controller" and some "on the node". We recommend opening three separate instances of the terminal (one for the controller and one for each node) where you have logged into the respective nodes (ssh p1 and ssh p2):*
![](https://i.imgur.com/DXp8H5O.png)
#### 15. Set up locales. Do these steps on the controller and on all nodes!
```
sudo raspi-config
```
+ *navigate to "5 localisation options"*
+ press enter
+ *navigate to "L1 Locale"*
+ press enter
+ press enter
+ *wait for the generation of locales*
+ *navigate to "5 localisation options"*
+ press enter
+ *navigate to "L2 Timezone"*
+ press enter
+ *navigate to Europe*
+ *navigate to Amsterdam*
+ press enter
+ press TAB to *navigate to "Finish"*
+ press enter
> DO THE ABOVE FOR ALL PIS
#### 16. Syncing the time
Install the time syncing tool. To do this, run the following commands on your controller and on all the nodes as well:
`sudo apt-get install -y ntpdate`
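You can trigger a one-off synchronisation right away to check that it works (the same command is used again during the Munge setup below):
```
sudo ntpdate pool.ntp.org
```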
## Set up a shared drive or folder
:::info
Setting up a shared folder/drive enables all nodes to access the same storage. This will become crucial for executing parallel jobs with Slurm.
:::
#### 17. Option 1: Set up a shared USB drive. Beware: do not reboot the Pi without unmounting the drive! [read more here](https://medium.com/@dhuck/the-missing-clusterhat-tutorial-45ad2241d738)
#### 17. Option 2: I recommend setting up a shared folder. [read more here](https://epcced.github.io/wee_archlet/)
Create a shared folder on the controller, and give it the correct permissions:
```
sudo mkdir /home/pi/cluster_shared_folder
sudo chown -R nobody:nogroup /home/pi/cluster_shared_folder
sudo chmod -R 777 /home/pi/cluster_shared_folder
```
Ensure that the NFS server is installed on your controller:
```
sudo apt-get install -y nfs-kernel-server
```
Update your `/etc/exports` file to contain the following line at the bottom:
Note: in order to edit text files via SSH, use the following command:
```
sudo nano /etc/exports
```
Now, to editing!
```
/home/pi/cluster_shared_folder 172.19.181.0/24(rw,sync,no_root_squash,no_subtree_check)
```
After editing the exports file, run the following command to update the NFS server:
```
sudo exportfs -a
```
Now we can mount the same folder on every node:
```
sudo apt-get install -y nfs-common
sudo mkdir /home/pi/cluster_shared_folder
sudo chown nobody:nogroup /home/pi/cluster_shared_folder
sudo chmod -R 777 /home/pi/cluster_shared_folder
sudo mount 172.19.181.254:/home/pi/cluster_shared_folder /home/pi/cluster_shared_folder
```
This will keep the folder mounted until the next reboot. If you want it to re-mount automatically, you will need to edit your `/etc/fstab`, and add the following line:
```
172.19.181.254:/home/pi/cluster_shared_folder /home/pi/cluster_shared_folder nfs defaults 0 0
```
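You can check the new fstab entry without rebooting by letting mount process it (run this on the node):
```
# Mount everything listed in /etc/fstab and confirm the share is there
sudo mount -a
df -h /home/pi/cluster_shared_folder
```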
Create a test file inside the shared directory to ensure that you can see the file across all of the nodes:
```
echo "This is a test" >> /home/pi/cluster_shared_folder/test.txt
```
You should be able to see this file on all of the nodes and edit it as well.
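For example, from the controller you can read the file on the nodes over SSH (assuming the p1/p2 aliases from step 14):
```
ssh p1 cat /home/pi/cluster_shared_folder/test.txt
ssh p2 cat /home/pi/cluster_shared_folder/test.txt
```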
## Install the Munge authentication daemon
:::info
In order to turn our Pi cluster into an HPC system, we need to install the scheduler Slurm. However, first we need to install the authentication daemon Munge, to make sure that all the nodes can securely talk to each other.
:::
#### 18. Install Munge [read more here](https://medium.com/@dhuck/the-missing-clusterhat-tutorial-45ad2241d738)
Beware of munge synchronisation problems!!!!
Before installing munge, make sure of the following things. More info in the [slurm documentation website](https://docs.massopen.cloud/en/latest/hpc/Slurm.html).
**a. Very important. Check user IDs, file permissions and time synchronisation!!!!!**
SLURM and MUNGE require consistent UID and GID across all servers and nodes in the cluster. Create the users/groups for slurm and munge, for example:
```
export MUNGEUSER=991
sudo groupadd -g $MUNGEUSER munge
sudo useradd -m -c "MUNGE Uid 'N' Gid Emporium" -d /var/lib/munge -u $MUNGEUSER -g munge -s /sbin/nologin munge
export SLURMUSER=992
sudo groupadd -g $SLURMUSER slurm
sudo useradd -m -c "SLURM workload manager" -d /var/lib/slurm -u $SLURMUSER -g slurm -s /bin/bash slurm
```
These same user accounts **must be created identically on all nodes**, and this must be done before installing munge or slurm (otherwise the packages create their own users, whose UIDs/GIDs may differ between nodes).
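To avoid typos, one possible approach is to run the same commands on each node over SSH from the controller (a sketch, not part of the original tutorial; it assumes the passwordless login from step 14 and that the pi user has passwordless sudo, which is the Raspberry Pi OS default; the `-c` comment fields are omitted for brevity):
```
# Create the munge and slurm users with the same fixed UIDs/GIDs on node p1 (repeat for p2, ...)
ssh p1 sudo groupadd -g 991 munge
ssh p1 sudo useradd -m -d /var/lib/munge -u 991 -g munge -s /sbin/nologin munge
ssh p1 sudo groupadd -g 992 slurm
ssh p1 sudo useradd -m -d /var/lib/slurm -u 992 -g slurm -s /bin/bash slurm
```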
Check that everything is consistent with the following commands. For Munge:
```
$ id munge
```
You should see something like this, the same on all nodes:
```
uid=991(munge) gid=991(munge) groups=991(munge)
```
For Slurm:
```
$ id slurm
```
You should see something like this, the same on all nodes:
```
uid=992(slurm) gid=992(slurm) groups=992(slurm)
```
**b. Edit the Hosts File**
To set up Munge, we first need to edit our `/etc/hosts` file to contain the addresses and hostnames of the other nodes in the cluster. This makes name resolution much easier and takes the guesswork out for the Pis. You will need to edit the `/etc/hosts` file on each of the nodes and on the controller, adding the IP addresses and hostnames of all machines except the one you are logged into. Do this on the controller and on the Pi Zeros.
For example, add these lines to your controller’s `/etc/hosts` file:
```
172.19.181.1 p1
172.19.181.2 p2
```
On the first node (p1), add these lines:
```
172.19.181.254 <cluster-name>
172.19.181.2 p2
```
On the second node (p2), add these lines:
```
172.19.181.254 <cluster-name>
172.19.181.1 p1
```
Repeat this process for other nodes. After editing the hosts file, you will want to install and configure Munge on each of the Pis. We will start with the controller.
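Before moving on, you can quickly sanity-check that name resolution works, e.g. from the controller (hostnames as added above):
```
# Each ping should resolve the name via /etc/hosts and get a reply from the node
ping -c 1 p1
ping -c 1 p2
```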
**c. Install Munge on the controller**
```
sudo apt-get install -y munge
```
**d. Create a secret key**
```
sudo /usr/sbin/create-munge-key
```
**e. Change the ownership of the munge.key and update the time**
```
sudo chown munge: /etc/munge/munge.key
sudo chmod 400 /etc/munge/munge.key
sudo ntpdate pool.ntp.org
```
**f. Install Munge on the nodes:**
```
sudo apt-get install -y munge
```
**g. Securely copy the munge.key file to the nodes:**
After creating the munge.key securely copy `/etc/munge/munge.key` (e.g., via SSH) to all other hosts within the same security realm [read more in "using rsync with sudo destination machine"](https://askubuntu.com/questions/719439/using-rsync-with-sudo-on-the-destination-machine).
Run the following command on each node/Pi Zero (it copies the munge.key from the controller to the node you are running it on). In the code below, replace `<SUDOPASS>` with clusterctrl (or your new controller password, if you have changed it) and replace `<cluster-name>` with the hostname of your controller, e.g. cnat or cluster4.
```
sudo rsync -avz --stats --rsync-path="echo <SUDOPASS> | sudo -Sv && sudo rsync" pi@<cluster-name>:/etc/munge/munge.key /etc/munge/munge.key
```
Alternatively, if you set a shared folder, you could copy the `munge.key` to `cluster_shared_folder`, and from there copy the key to the munge directory in each node (pi-zero), as follows:
```bash
# On the controller:
sudo cp /etc/munge/munge.key /home/pi/cluster_shared_folder/
# on each node (Pi-zero):
sudo cp /home/pi/cluster_shared_folder/munge.key /etc/munge/
```
Note: make sure all permissions/ownerships of the file stay the same. If the `munge -n | ssh p1 unmunge` test (see step 18.j below) gives you an authentication error, retry with `rsync` instead.
**h. Make sure to set the correct ownership and update the time on all the nodes (but not on the controller):**
```
sudo chown -R munge: /etc/munge/ /var/log/munge/
sudo chmod 0700 /etc/munge/ /var/log/munge/
sudo ntpdate pool.ntp.org
```
**i. Enable and start the MUNGE service (both on the controller, and all nodes)**
```
sudo systemctl enable munge
sudo systemctl start munge
```
You can check the status of munge with the following command:
```
sudo systemctl status munge
```
This should display something similar to the following (watch out for error messages!):
```
● munge.service - MUNGE authentication service
Loaded: loaded (/lib/systemd/system/munge.service; enabled; vendor preset: enabled)
Active: active (running) since Sun 2021-12-12 00:00:23 CET; 12h ago
Docs: man:munged(8)
Process: 24600 ExecStart=/usr/sbin/munged (code=exited, status=0/SUCCESS)
Main PID: 24602 (munged)
Tasks: 4 (limit: 2063)
CGroup: /system.slice/munge.service
└─24602 /usr/sbin/munged
Dec 12 00:00:23 cnat systemd[1]: Starting MUNGE authentication service...
Dec 12 00:00:23 cnat systemd[1]: Started MUNGE authentication service.
```
**j. Run some tests:**
```
munge -n                  # Generates a Munge credential locally
munge -n | unmunge        # Generates and decodes the credential locally
munge -n | ssh p1 unmunge # Generates a credential locally and decodes it over SSH
```
The last test is crucial! No "Authentication error" messages should appear at this stage. If you receive any errors, please repeat the munge.key generation and propagation.
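With more nodes, a small loop saves some typing (a sketch; adjust the node list to your cluster):
```
# Generate a credential locally and decode it on every node in turn
for node in p1 p2; do
  echo "== $node =="
  munge -n | ssh "$node" unmunge
done
```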
## Install Slurm scheduler
:::info
Now we are about to turn our Pi cluster into a real HPC system.
:::
#### 19. Install Slurm
[**Just follow this tutorial.**]( https://medium.com/@dhuck/the-missing-clusterhat-tutorial-45ad2241d738)
*The relevant steps from the tutorial are below*
**a. Install Slurm on Controller Pi**
On the controller run the following commands to install slurm:
```
sudo apt-get install -y slurm-wlm
```
:::warning
<details>
<summary>Got an error?</summary>
Quite possibly, this operation fails because the installer can't find the correct repositories. In this case, you have to update the repositories first:
```
sudo apt-get update
```
Pay attention to error messages! Sometimes, repositories change their "status". If this is the case, you might need to allow release info change on the repo if it complains that the repo release info has been changed from "stable" to "old-stable":
```
sudo apt-get update --allow-releaseinfo-change
```
After this, the Slurm installation should proceed as normal with the usual command:
```
sudo apt-get install -y slurm-wlm
```
</details>
:::
This will take a moment. After it finishes, we will use the default slurm configuration and modify it to meet our needs. Copy the config file over from the slurm documentation folder:
```
cd /etc/slurm-llnl
sudo cp /usr/share/doc/slurm-client/examples/slurm.conf.simple.gz .
sudo gzip -d slurm.conf.simple.gz
sudo mv slurm.conf.simple slurm.conf
```
Open the `/etc/slurm-llnl/slurm.conf` file and make the following edits:
*Remember to use "sudo" and then the text editor command to open the file.*
Set the control machine information:
```
SlurmctldHost=<cluster-name>(172.19.181.254)
```
*Here `<cluster-name>` stands for the hostname of your controller node (e.g. `cnat`).*
Ensure that the SelectType and SelectTypeParameters parameters are set to the following values:
```
SelectType=select/cons_res
SelectTypeParameters=CR_Core
```
If you wish to change or set the name of your cluster, you can set it with the `ClusterName` parameter. I set mine to merely be cluster:
```
ClusterName=cluster
```
At the end of the file, there should be an entry for a compute node. Delete it, and put this in its place:
```
NodeName=<cluster-name> NodeAddr=172.19.181.254 CPUs=2 Weight=2 State=UNKNOWN
NodeName=p1 NodeAddr=172.19.181.1 CPUs=1 Weight=1 State=UNKNOWN
NodeName=p2 NodeAddr=172.19.181.2 CPUs=1 Weight=1 State=UNKNOWN
```
NOTE: I am only allocating 2 CPUs from the controller Pi towards the cluster. If you wish to add all of the CPUs, you can change CPUs=2 to CPUs=4. However, it is recommended to leave a couple of CPUs free on the controller Pi to manage the cluster.
You will also need to remove the default entry for PartitionName at the end of the file and replace it with our own custom Partition name. The following codeblock should be entirely on one line:
```
PartitionName=mycluster Nodes=<cluster-name>,p1,p2 Default=YES MaxTime=INFINITE State=UP
```
Save and close the `slurm.conf` file.
We will now need to tell slurm which resources it can access on the nodes. Create the following file: `/etc/slurm-llnl/cgroup.conf` and add the following lines to it:
```
CgroupMountpoint="/sys/fs/cgroup"
CgroupAutomount=yes
CgroupReleaseAgentDir="/etc/slurm-llnl/cgroup"
AllowedDevicesFile="/etc/slurm-llnl/cgroup_allowed_devices_file.conf"
ConstrainCores=no
TaskAffinity=no
ConstrainRAMSpace=yes
ConstrainSwapSpace=no
ConstrainDevices=no
AllowedRamSpace=100
AllowedSwapSpace=0
MaxRAMPercent=100
MaxSwapPercent=100
MinRAMSpace=30
```
Next, create the file `/etc/slurm-llnl/cgroup_allowed_devices_file.conf` and add the following lines to whitelist system devices:
```
/dev/null
/dev/urandom
/dev/zero
/dev/sda*
/dev/cpu/*/*
/dev/pts/*
/home/pi/cluster_shared_folder*
# note that the final line is the path of your NFS share and should be
# edited to reflect that.
```
Copy these configuration files to the NFS drive (shared folder) that we set up earlier:
```
sudo cp slurm.conf cgroup.conf cgroup_allowed_devices_file.conf /home/pi/cluster_shared_folder
```
**b. Install Slurm on the nodes**
Install the slurm client on each of the nodes:
```
sudo apt-get install -y slurmd slurm-client
```
:::warning
<details>
<summary>Got an error?</summary>
Quite possibly, this operation fails because the installer can't find the correct repositories. In this case, you have to update the repositories first:
```
sudo apt-get update
```
Pay attention to error messages! Sometimes, repositories change their "status". If this is the case, you might need to allow release info change on the repo if it complains that the repo release info has been changed from "stable" to "old-stable":
```
sudo apt-get update --allow-releaseinfo-change
```
After this, the Slurm installation should proceed as normal with the usual command:
```
sudo apt-get install -y slurmd slurm-client
```
</details>
:::
Copy the configuration files that we made for Slurm over to each of the nodes:
```
sudo cp cluster_shared_folder/slurm.conf /etc/slurm-llnl/slurm.conf
sudo cp cluster_shared_folder/cgroup* /etc/slurm-llnl
```
**c. Enable and start Slurm**
Enable and start the Slurm daemon (slurmd) on the controller Pi:
```
sudo systemctl enable slurmd
sudo systemctl start slurmd
```
Enable and start the slurm daemon on each node:
```
sudo systemctl enable slurmd
sudo systemctl start slurmd
```
Finally, enable and start the Slurm control daemon (slurmctld) on the controller Pi:
```
sudo systemctl enable slurmctld
sudo systemctl start slurmctld
```
You can check the status of slurmd and slurmctld with the following commands. For slurmd:
```
sudo systemctl status slurmd
```
This should display something similar to the following (watch out for error messages; the warning about opening the PID file can in most cases be safely ignored).
```
● slurmd.service - Slurm node daemon
Loaded: loaded (/lib/systemd/system/slurmd.service; enabled; vendor preset: enabled)
Active: active (running) since Wed 2021-12-08 16:26:40 CET; 3 days ago
Docs: man:slurmd(8)
Process: 865 ExecStart=/usr/sbin/slurmd $SLURMD_OPTIONS (code=exited, status=0/SUCCES
Main PID: 888 (slurmd)
Tasks: 1
CGroup: /system.slice/slurmd.service
└─888 /usr/sbin/slurmd
Dec 08 16:26:40 cnat systemd[1]: Starting Slurm node daemon...
Dec 08 16:26:40 cnat systemd[1]: slurmd.service: Can't open PID file /run/slurmd.pid (y
Dec 08 16:26:40 cnat systemd[1]: Started Slurm node daemon.
```
For slurmctld:
```
sudo systemctl status slurmctld
```
This should display something similar to the following (watch out for error messages!)
```
● slurmctld.service - Slurm controller daemon
Loaded: loaded (/lib/systemd/system/slurmctld.service; enabled; vendor preset: enabl
Active: active (running) since Wed 2021-12-08 16:26:38 CET; 3 days ago
Docs: man:slurmctld(8)
Process: 821 ExecStart=/usr/sbin/slurmctld $SLURMCTLD_OPTIONS (code=exited, status=0/
Main PID: 839 (slurmctld)
Tasks: 7
CGroup: /system.slice/slurmctld.service
└─839 /usr/sbin/slurmctld
Dec 08 16:26:38 cnat systemd[1]: Starting Slurm controller daemon...
Dec 08 16:26:38 cnat systemd[1]: slurmctld.service: Can't open PID file /run/slurmctld.
Dec 08 16:26:38 cnat systemd[1]: Started Slurm controller daemon.
```
And that should be it to get things set up! Log in to the controller node and test Slurm to make sure it works. Run the `sinfo` command and you should get the following output:
```
PARTITION AVAIL TIMELIMIT NODES STATE NODELIST
mycluster* up infinite 3 idle <cluster-name>,p[1-2]
```
Furthermore, you can print the hostname on all of the nodes with the following command:
```
srun --nodes=3 hostname
```
Which should produce a similar result to the following:
```
<cluster-name>
p1
p2
```
To run several tasks concurrently (here four, spread over the available CPUs), you can use the following command:
```
srun --ntasks=4 hostname
```
Which should produce a similar result to the following:
```
<cluster-name>
<cluster-name>
p1
p2
```
Check which partitions and which nodes you configured:
```
scontrol show partitions
scontrol show nodes
```
**To check the Slurm status:**
```
sudo systemctl status slurmd
sudo systemctl status slurmctld
```
**Test slurm (here, for 2 Pi zero nodes):**
```
sinfo
srun --nodes=3 hostname
srun --ntasks=4 hostname
```
## Run jobs in parallel !!!
:::info
Congratulations! Finally, our Pi cluster is fully setup and ready to go. You can tell your friends that your Slurm is more stable than their Slurm, and that the uptime of your cluster has reached infinity at least three times already.
:::
Now, just follow parts 2 and 3 of the original tutorial and have fun with your cluster:
[Building a Raspberry Pi cluster tutorial Part 2 by Garrett Mills](https://glmdev.medium.com/building-a-raspberry-pi-cluster-aaa8d1f3d2ca)
[Building a Raspberry Pi cluster tutorial Part 3 by Garrett Mills](https://glmdev.medium.com/building-a-raspberry-pi-cluster-f5f2446702e8)
# Appendix 1: Error messages
**Some possible error messages are discussed below.**
## Error 1: Raspbian servers are unresponsive
Raspbian servers might be unresponsive sometimes. Try again in a couple of minutes.
## Error 2: Raspbian servers seem to be down all the time
Disabling IPv6 could help sometimes:
```
sudo apt-get -o Acquire::ForceIPv4=true update
```
## Error 3: Allow release info change on the repo
You might need to allow release info change on the repo if it complains that the repo release info has been changed from "stable" to "old-stable":
```
sudo apt-get update --allow-releaseinfo-change
```
## Error 4: Slurm gets stuck after re-install or re-start
If you need to re-install or re-start slurm, it might get stuck sometimes, [read more about it here](https://bitsanddragons.wordpress.com/2020/08/24/slurm-20-02-4-error-slurmd-service-start-operation-timed-out-on-centos-7-8/)
## Error 5: Nodes report their status as "down"
Nodes report their status as "down", not "idle". First, check the status of nodes and see what is the "reason" of them being "down":
```
scontrol show nodes
```
If the "reason" implies any authentication problems, double-check your Munge installation. If the nodes report as "down" after a reboot of the node, a simple re-initialisation might help:
```
sudo scontrol update nodename=p1 state=idle
```
## Error 6: Slurm does not work after changing the slurm.conf file
To update Slurm after making changes to the slurm.conf file:
```
sudo scontrol reconfigure
```
## Error 7: MPI/ORTE errors
Sometimes, when Slurm and/or user IDs are not set properly, you might encounter the following message when running MPI jobs:
```
--------------------------------------------------------------------------
An ORTE daemon has unexpectedly failed after launch and before
communicating back to mpirun. This could be caused by a number
of factors, including an inability to create a connection back
to mpirun due to a lack of common network interfaces and/or no
route found between them. Please check network connectivity
(including firewalls and network routing requirements).
--------------------------------------------------------------------------
```
In some cases, this error might only be visible if the MPI job exclusively uses the worker nodes (`p1`, `p2`, `p3`, `p4`); if the job runs on a mixture of the controller (`cnat`) and the worker nodes, the error may not manifest itself.
Things to check:
1. Make sure that the `slurm.conf` file is the same on all nodes!
If the `slurm.conf` is correct on the controller, but not on the worker nodes, the error might not manifest itself until it is running on worker nodes exclusively.
2. Make sure the user exists on every node!
If the job was submitted by non-default user (other than `pi`), and the job is exclusively using the worker nodes (`p1`, `p2`, `p3`, `p4`), it is required that the user exist on the worker nodes as well as on `cnat`, and that in all cases the user has the same `UID`!
How to add/delete users with a particular `UID`:
```
# Add user:
sudo adduser $username
# In order to assign a particular UID (e.g. 1002):
sudo adduser --uid 1002 $username
# Remove user:
sudo userdel $username
```
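After creating the user, it is worth verifying that the UID really is identical everywhere, using the same `id` check as for the munge and slurm users above:
```
# All outputs should report the same uid and gid for the user
id $username
ssh p1 id $username
ssh p2 id $username
```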
A similar discussion can be found on [StackOverflow](https://stackoverflow.com/questions/66523304/what-am-i-missing-getting-mpirun-to-schedule-across-multiple-nodes).
# Appendix 2: How to turn off the Raspberry Pi safely
When using the command line or a terminal window, you can enter the following to do a clean shutdown:
```
sudo shutdown -h now
```
To reboot:
```
sudo reboot
```
Check out [this discussion](https://raspberrypi.stackexchange.com/questions/381/how-do-i-turn-off-my-raspberry-pi) for more details.
# Appendix 3: How to connect to wifi when preparing the SD cards
To tell the Raspberry Pi to automatically connect to your WiFi network you need to edit a file called: `wpa_supplicant.conf`.
To open the file in nano type the following command:
`sudo nano /etc/wpa_supplicant/wpa_supplicant.conf`
Scroll to the end of the file and add the following to the file to configure your network:
```
# country: your 2-letter country code
country=NL
ctrl_interface=DIR=/var/run/wpa_supplicant GROUP=netdev
network={
    ssid="Test Wifi Network"
    psk="SecretPassWord"
}
```
# Appendix 4 (Specific to this workshop's setup!!!): Setting up static ip address to connect to gateway network
1. Connect to the OPENHARDWARE WiFi (no password required)
2. Open a terminal (Win+R -> cmd)
3. Check the IP address of your wifi (ipconfig)
4. Find the IP address of the format 10.0.0.x, where "x" corresponds to the IP address of your WiFi connection.
5. Use this IP address to set up a static IP to allow internet connection via OPENHARDWARE (Control Panel -> Network and Internet -> Network and Sharing Center -> Change Adapter Settings -> Wifi -> right click, Properties -> Internet Protocol Version 4 (TCP/IPv4) -> Properties)
![](https://i.imgur.com/LPC0ScO.png)
# Appendix 5: Setting up a static ip address on the Raspberry Pi
Open a terminal and type:
```
ifconfig
```
Find the entry for your wired interface (something like "eth0") and note down the IP address.
Then type the following command:
```
sudo nano /etc/dhcpcd.conf
```
Scroll down to the bottom, and add the following lines:
```
interface eth0
static ip_address=[THE_IP_ADDRESS_FROM_EARLIER]/24
static routers=10.0.0.111
static domain_name_servers=8.8.8.8
```
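For the new settings to take effect, restart the dhcpcd service or simply reboot (dhcpcd is the default network configuration service on Raspberry Pi OS of this generation):
```
sudo systemctl restart dhcpcd
# or: sudo reboot
```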
# Summary
**We have gone through several practices and concepts that are essential for working with computational tools, and you have built a baby supercomputer in just one afternoon:**
- hardware and software
- ssh
- linux
- networking
- Hacking and experimenting
- Troubleshooting
- Community work
# Ideas, suggestions for next workshop
We would very much appreciate your input and new ideas for the next workshop. The main goal of these workshops is to hack, play and try things out to understand technology. It should be accessible and fun for a wide audience, with the only requirement of not being afraid of the command line :D
## Here is some nice material to get you inspired
- https://homebrewserver.club/category/fundamentals.html
- Interesting list of videos shared by @Yehor: https://www.youtube.com/user/cscsch/playlists
## Here are some starter ideas:
- Setting up kubernetes in the cluster to demonstrate cloud technologies
## What do you have in mind?
- Would be nice to install a file system such as HDFS and analyze some big data with Apache Spark or Dask.
- For geoscientist: [install Pangeo on HPC](https://pangeo.io/setup_guides/hpc.html#hpc) (uses Dask)