# Installation notes
This document contains notes related to the installation of the cluster.
It is written to allow step-by-step reproduction of the installation. Each section details the configuration of a specific topic/feature. Later sections might depend on previous ones and modify their configurations. For instance, the minimal DHCP configuration is later extended in the PXE and DNS sections.
A git repository located on the head at `/root/cluster_config/` contains hard links or copies of the complete configuration files used in the cluster.
## Investments
* RAID without spare drives is much less useful. We should buy 2 of these (https://www.ldlc-pro.com/fiche/PB00271800.html) to replace head drives in case of failures.
* Cable management: https://www.ldlc-pro.com/fiche/PB00231648.html, https://www.ldlc-pro.com/fiche/PB00256200.html (check size)
* 375x420mm boards + brackets to make shelves in the small corner at the right of the entrance (in place of the garbage pile).
## Head generalities
### Hardware
The head is a Supermicro server. It has the following hardware configuration:
* 2 × 8 GiB DDR4 2666 MHz DIMMs
* 2 Intel Xeon Silver 4110 CPUs @ 2.10 GHz, with 8 cores each
* 2 × 240 GB Intel SSDs
* 6 × 4 TB Toshiba HDDs
* MegaRAID card (unused)
### OS
Ubuntu-server 20.04 is installed on the head. The system is installed on the 2 SSD mirrored in a software RAID1 setup. The raid utility is the standard `md`, and setup is performed at install time directly from the installer program.
The 6 remaining disks are used for `/home`. They are not formatted at install time. We configure them later as a ZFS pool, with a RAIDz setup allowing 2 disk failures (see ZFS section). The usable space for `/home` is 14.1T.
### RAID storage
Software RAID is preferred to the hardware RAID card because:
* Hardware RAID cards use proprietary solutions that are not standard. If the card fails, the exact same card has to be dug up from somewhere to recover the data on the disks. Software RAID by `md` or `zfs` is standard: data can be recovered from any machine running Linux with the required packages.
* The user base of software RAID is much wider than that of any proprietary hardware RAID solution. Therefore, software RAID tends to have fewer bugs. Moreover, bugs in software can easily be fixed by an update, while hardware solutions are typically never updated.
* Software RAID is configured on the fly on the running system, while hardware RAID typically requires a reboot and BIOS access, which is difficult to get remotely.
* On modern hardware, software RAID consumes a tiny fraction of the resources. The head has a beefy configuration and only runs a handful of services for the cluster, so RAID performance will never be an issue.
* Software RAID (at least ZFS) offers extra features such as automatic detection and fixing of corrupted bits, which proves valuable for long-term storage when "bit rot" becomes a problem.
The MegaRAID card cannot be completely deactivated because it plays the role of the disk controller for the whole machine. To access the disks directly, one has to enter the setup utility of the card on boot, and set all disks on "JBOD" mode (Just a Bunch Of Disks).
The two SSD drives need to be flagged as bootable devices in the MegaRAID setup for the rest of the machine to consider them as possible boot candidates.
## Grub
Grub is the default Ubuntu bootloader. By default, it does not show at boot, which can be inconvenient if, after a faulty manipulation, a kernel becomes unbootable.
In `/etc/default/grub`, set `GRUB_TIMEOUT_STYLE=menu` and `GRUB_TIMEOUT=10` to display the Grub menu for 10 seconds at boot, and give a chance to boot older kernels if necessary.
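The relevant part of `/etc/default/grub` then reads:
GRUB_TIMEOUT_STYLE=menu
GRUB_TIMEOUT=10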
Update the Grub configuration by running `update-grub`.
## ZFS
ZFS manages storage as logical devices with many capabilities. We use it to set up software RAID and enforce quotas on `/home`.
A nice general introduction to ZFS can be found at https://arstechnica.com/information-technology/2020/05/zfs-101-understanding-zfs-storage-and-performance/.
### Zpool
First, install ZFS.
sudo apt install zfsutils-linux
Next, create a zpool, *ie* a logical device built on top of an array of physical devices, with RAID.
We use disk IDs as identifiers because they should not change. We choose to allow for 2 disk failures (`raidz2`), name the zpool `homedirs` and mount the resulting filesystem on `/home`.
sudo zpool create -m /home homedirs raidz2 /dev/disk/by-id/scsi-STOSHIBA*
ZFS is a powerful beast. Zpools are not mounted via the standard `/etc/fstab` but by dedicated systemd units, which need to be enabled.
sudo systemctl enable zfs.target
sudo systemctl enable zfs-mount.service
sudo systemctl enable zfs-import-scan.service
Finally, run `mount /home`.
Use `zpool status` to display info about the zpool. In particular, this command shows the health state of the disks.
Use `zfs list` to see the available space.
### Quotas
The purpose of quotas is to prevent disk filling by a faulty job, and encourage users not to use the head as a storage space (groups should have their storage on some other machine).
Quotas for users *on the `/home` directory* can simply be enforced with
zfs set userquota@<username>=<value> homedirs
We chose to set a quota of 200G for each user. This should be done at user creation.
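For instance, for a hypothetical user alice:
zfs set userquota@alice=200G homedirs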
The full list of user quotas can be seen by administrators with
zfs userspace homedirs
Users can see their quota with `zfs get userquota@<username> homedirs` and the currently used space with `zfs get userused@<username> homedirs`.
### Scrub
ZFS stores extra parity bits that allow it to detect and correct random bit corruption in zpools. The check operation is called a "scrub". To check a zpool, run
zpool scrub <poolname>
The status and result of the scrub can be obtained by `zpool status <poolname>`.
Scrubs can take quite long and put the disks under stress, so it is advised to perform them once a month on servers. Scrubbing is a low-priority operation, to limit the impact on other disk operations.
An ongoing scrub can be paused or stopped with
zpool scrub -p <poolname> # Pause scrub
zpool scrub -s <poolname> # Stop scrub
Scrubs can be performed automatically every month using `cron`. To start a scrub on the first day of every month at 4am, add the line `0 4 1 * * root { zpool scrub homedirs; zpool status homedirs; } >> /var/log/zfs/homedirs.log 2>&1` at the end of `/etc/crontab`.
Create the log directory with `mkdir /var/log/zfs`.
### Alarms
After the SMTP server is installed (see SMTP section), the ZFS Event Daemon (ZED) can be configured to send emails when errors are detected during a scrub.
We want ZED to send notification emails as user ceres. However, `zed` runs as root, and checks that the supplied mail command exists with `command -v`. Therefore, we need to write a small script to send an email as ceres.
In `/usr/local/bin`, create a file `ceresmail.sh` with the following content:
/usr/local/bin/ceresmail.sh:
#!/bin/bash
exec sudo -u ceres mail "$@"
Make this script executable by root and change permissions to `700` so that it does not appear as a valid program to normal users.
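For instance:
chown root:root /usr/local/bin/ceresmail.sh
chmod 700 /usr/local/bin/ceresmail.sh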
Then, in `/etc/zfs/zed.d/zed.rc`, set `ZED_EMAIL_ADDR="root cluster-admin.lps@universite-paris-saclay.fr"` and `ZED_EMAIL_PROG="/usr/local/bin/ceresmail.sh"`.
Set `ZED_NOTIFY_VERBOSE=1` to receive notifications after scrubs regardless of the pool health (useful for testing).
Restart the daemon and make sure it is enabled with `systemctl restart zed` and `systemctl status zed`.
For testing purposes, a test zpool can be created from a file and scrubbed.
cd /tmp
truncate -s 500M sparse_file
zpool create test sparse_file
zpool scrub test
Destroy the zpool when you are done testing with
zpool destroy test
rm sparse_file
## Network
### Network conventions
The internal cluster network is 192.168.0.0/22. The whole network is managed as a single subnet to avoid routing issues. We define a few conventions to clarify the network configuration.
* Addresses in the range 192.168.0.* are reserved for special machines, configured and administrated individually. Those include the head, NAS etc...
* The head has the address 192.168.0.1.
* Addresses in the range 192.168.1.* are reserved for known nodes that have been manually added to the cluster.
* Addresses in the range 192.168.3.* are reserved for unknown nodes that have not been configured and run on default settings.
* Addresses in the range 192.168.2.* are available for future use.
`systemd-networkd` is used to configure network on the head and nodes.
No routing is done on the head, which means that nodes cannot access machines outside of the internal network.
### Head network configuration
The head needs one interface connected to the outer world, and one on the internal network.
The interface open to the rest of the world is configured via DHCP. Edit the file `/etc/systemd/network/10-eno0.network` as follows:
/etc/systemd/network/10-eno0.network:
[Match]
Name=eno0
[Network]
DHCP=ipv4
Manually assign a static IP address to the interface on the internal network by editing the file `/etc/systemd/network/10-ens4f0.network` as follows:
/etc/systemd/network/10-ens4f0.network:
[Match]
Name=ens4f0
[Network]
Address=192.168.0.1/22
Finally restart the network daemon
systemctl restart systemd-networkd
Check that the network configuration is as expected with `ip a`.
## DHCP
Install the dhcp server.
apt install isc-dhcp-server
Edit `/etc/dhcp/dhcpd.conf`
default-lease-time 43200;
ddns-update-style none;
subnet 192.168.0.0 netmask 255.255.252.0 {
    authoritative;
    option broadcast-address 192.168.3.255;
    option subnet-mask 255.255.252.0;
    include "/etc/dhcp/dhcpd.d/cluster_nodes.conf";
    pool {
        range 192.168.3.1 192.168.3.254;
        deny known-clients;
    }
}
This configuration assigns static IP addresses to known nodes specified in `/etc/dhcp/dhcpd.d/cluster_nodes.conf`, and an automatic IP in the range 192.168.3.* to unknown nodes. The `cluster_nodes.conf` file will be generated by a script from a node database later on. FIXME
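For reference, an entry in `cluster_nodes.conf` is expected to look like the following (host name, MAC address and IP are placeholders):
host node1 {
    hardware ethernet aa:bb:cc:dd:ee:ff;
    fixed-address 192.168.1.1;
}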
On startup, the DHCP daemon issues a warning that no configuration is provided for interface `eno0`. This could be fixed by adding an empty subnet declaration as:
subnet 129.175.80.0 netmask 255.255.252.0 {
}
However, with such a declaration the DHCP server would receive (and ignore) DHCP requests from the lab network. Some machines there are ill-configured and keep sending DHCP requests that would flood the logs.
I prefer having a warning message at daemon startup.
## PXE boot
### General description
Setup possibilities with PXE are vast.
We chose to set up a diskless configuration in which the nodes are stateless. Node disks contain neither system nor user data; they are only used as scratch space to write data during calculations.
To do so, nodes perform a PXE boot, during which they are instructed to fetch their root filesystem as an image that they load and keep in RAM.
The root image is generated from scratch by a script on the head.
This allows centralising the node configuration. Updating the node distribution only requires rebuilding the image by running the script. Since the full configuration is scripted, no fiddling and patching of individual nodes is required.
Finally, since the whole system resides in RAM, system calls on the nodes are very fast. This comes at the cost of some RAM, about 500M, which is negligible on recent machines.
User data in `/home` and specific libraries or programs in `/opt` are mounted from a NFS export (see NFS section).
### Boot sequence
The PXE boot sequence is as follows:
* The node is powered up.
* BIOS starts PXE utility.
* PXE performs a basic DHCP request.
* The DHCP server on the head answers with an IP address, and the path to the bootloader on the TFTP server (also on the head).
* The node loads and executes the bootloader.
* The bootloader fetches the kernel and the initrd from the TFTP server. It starts the kernel with the configured options, including the location of the root filesystem stored as a squashfs image on a HTTP server (on the head).
* The squashfs image is downloaded, mounted, and systemd is started. The system is up!
### PXE installation
On the head,
apt install tftpd-hpa syslinux pxelinux initramfs-tools
`pxelinux` provides the required bootloader, `syslinux` provides small scripts used by the bootloader to perform various operations, and `initramfs-tools` is used to generate the `initrd`.
Configure the TFTP server.
/etc/default/tftpd-hpa:
TFTP_USERNAME="tftp"
TFTP_DIRECTORY="/tftpboot"
TFTP_ADDRESS=":69"
By default `/etc/default/tftpd-hpa` contains a line `TFTP_OPTIONS="secure"`, which makes all served paths relative to the TFTP root directory. This is of no use here and makes it unclear what path to announce via DHCP, so we drop the option.
Restart the TFTP daemon with `systemctl restart tftpd-hpa.service`.
Create the TFTP directory and populate it with the bootloader and syslinux scripts.
mkdir /tftpboot
cp /usr/lib/PXELINUX/pxelinux.0 /tftpboot/
cp /usr/lib/syslinux/modules/bios/{ldlinux.c32,libcom32.c32,libutil.c32} /tftpboot/
Create a configuration directory for the bootloader.
mkdir /tftpboot/pxelinux.cfg
Create a configuration file `default` inside the directory.
/tftpboot/pxelinux.cfg/default:
DEFAULT linuxnode
LABEL linuxnode
KERNEL vmlinuz
APPEND initrd=initrd rooturl=http://192.168.0.1/rootfs.sq boot=pxe
`initrd`, `rootfs.sq` and `pxe` will be generated by the node building script.
`rootfs.sq` is the squashfs image containing the root filesystem of the node. `pxe` is a custom boot script built into the `initrd` in the node building script.
Modify the DHCP server configuration in `/etc/dhcp/dhcpd.conf` to announce the location of the PXE bootloader:
* At the top of the file add the lines `allow booting;` and `allow bootp;`
* Inside the internal subnet declaration, add the line `filename "/tftpboot/pxelinux.0";`
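After these two changes, the relevant parts of `/etc/dhcp/dhcpd.conf` should look like this (existing lines elided with `...`):
allow booting;
allow bootp;
subnet 192.168.0.0 netmask 255.255.252.0 {
    ...
    filename "/tftpboot/pxelinux.0";
    ...
}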
Restart the DHCP server daemon with `systemctl restart isc-dhcp-server`.
Install a HTTP server to serve the squashfs image.
apt install nginx
This should be enough to get a working HTTP server. Check that it is working by running `curl 192.168.0.1`. It should output a welcome message from nginx with ugly html tags.
Anything copied into `/var/www/html/` will now be available for download over HTTP.
The node building script will take care of copying the squashfs image there.
The HTTP server only needs to be accessible from the internal cluster network. To instruct nginx to listen only on the internal network, edit `/etc/nginx/sites-available/default`. In the server section, replace the line `listen 80 default_server;` with `listen 192.168.0.1:80 default_server;` and comment out the next line `listen [::]:80 default_server`.
Restart the service to apply changes with `systemctl restart nginx`.
If you have another machine on the network, check that it cannot access the Nginx server with `curl <external IP or hostname>`.
### Nodes building script
We use `debootstrap` to generate the root of a minimal Ubuntu system, to which the configuration files and required packages are added. All the steps are performed by a script `/root/cluster_config/node/build.sh`. To function properly, the script requires `debootstrap` and the `zstd` compression program.
apt install debootstrap zstd
The node building script performs the following steps:
* Bootstrap a minimal Ubuntu root filesystem with `debootstrap`.
* Add squashfs and overlay modules to initramfs.
* Add a custom boot script that mounts the squashfs root as an overlayfs to initramfs.
* Generate the updated initramfs.
* Write various configuration files to the root filesystem.
* Extract kernel and initramfs (initrd).
* Compress the root filesystem into a squashfs image.
All node configurations should be performed via this script such that all nodes eventually have the same configuration.
To avoid re-downloading all packages even for small configuration changes, the script caches the last generated root in `rootcache.tar.zst`. If the cache is present, the bootstrap step is skipped. Therefore, if one wants to update, add or remove packages, the cache must be removed before running the script.
Finally, a small script `/root/cluster_config/nodes/update.sh` is provided to automatically copy the latest kernel and initrd to the TFTP server and the squashfs image to the HTTP server.
All in one, updating the nodes distribution is as simple as
cd /root/cluster_config/nodes
rm rootcache.tar.zst # Only required if packages are to be updated/removed/added
<Edit build.sh>
bash build.sh
bash update.sh
On next boot, the nodes will run the new system.
Since debootstrap fetches up-to-date packages from the Ubuntu repositories, upgrading the node packages can simply be performed by following the above procedure, without changing the script (but removing the cache).
The following sections sometimes describe node configurations. These can be implemented by hand on a node for testing purposes, but should then be included into the `build.sh` script.
## NFS
NFS allows to make a filesystem stored on one machine available to others over the network.
We use it to share `/home`, stored on the head, across the cluster.
In addition, we share `/opt`, in which we will install heavy software compiled by hand that needs to be available on all machines of the cluster (eg: Slurm, PMIx, MPI...).
### Server configuration
Install NFS server:
apt install nfs-kernel-server
Define exports in `/etc/exports`. `/home` needs to be writable on the nodes while `/opt` can be read only.
/etc/exports:
/home 192.168.0.0/22(rw,no_subtree_check,no_root_squash)
/opt 192.168.0.0/22(ro,no_subtree_check,no_root_squash)
The option `no_root_squash` disables the security feature that maps files owned by user 0 (root) to user nobody in the NFS share.
Update the table of exported filesystems.
exportfs -rv
Restart NFS service
systemctl restart nfs-kernel-server
### Client configuration
Install NFS client:
apt install nfs-common
Edit `/etc/fstab` to mount filesystems automatically at startup.
/etc/fstab:
head:/home /home nfs rw,defaults 0 0
head:/opt /opt nfs ro,defaults 0 0
NFS will need to know head's address for mounting. Therefore, make sure that `/etc/hosts` contains a line `192.168.0.1 head` on the nodes. This is a workaround for an issue described in Systemd-networkd-wait-online section.
## Systemd-networkd-wait-online
At system startup, systemd runs a service `systemd-networkd-wait-online` that ensures that the network is configured. The default configuration of `systemd-networkd-wait-online` is crappy: the network is considered configured only when *all* interfaces managed by `systemd-networkd` are in a state at least "degraded".
On the nodes, since we don't know in advance which interface is going to be used, all interfaces are managed by `systemd-networkd`. Interfaces that are not plugged in remain in the configuring state and cause `systemd-networkd-wait-online` to hang for the default timeout of 2 minutes. This delays the node boot.
To specify that one configured interface is enough to consider the network usable, add the option `--any` to the ExecStart line of `/lib/systemd/system/systemd-networkd-wait-online.service`.
Also specify that the interface should be in state "routable" to be considered online with the option `-o routable`.
After the changes, the ExecStart line of `/lib/systemd/system/systemd-networkd-wait-online.service` should be
ExecStart=/lib/systemd/systemd-networkd-wait-online --any -o routable
This change should be scripted in the nodes build script.
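One possible way to script it, assuming the build script keeps the path of the bootstrapped root in a variable `$ROOT` (an assumption, adapt to the actual script):
sed -i 's|^ExecStart=.*|ExecStart=/lib/systemd/systemd-networkd-wait-online --any -o routable|' \
    "$ROOT/lib/systemd/system/systemd-networkd-wait-online.service"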
### NFS issue
With the configuration described above, the 2-minute delay is indeed avoided. However, on some reboots, nodes fail to mount the NFS-shared `/opt` and `/home`, because NFS is unable to resolve the name `head` used in `/etc/fstab`. This is a recurring issue on the web, with no satisfactory solution. I tried tweaking the mount options in `/etc/fstab` (`x-systemd.after`, `x-systemd.requires`, `_netdev`, `bg` etc...) and forcing name resolution to be working before NFS mounts by adding an extra systemd service, but none of this worked.
In the end, having the head address defined in `/etc/hosts` seems to be the only solution. This is not very satisfactory, since DNS should take care of this, but we will continue like this for the moment.
## SSH
### SSHD
If the ssh server was not installed with the system, install it with `apt install openssh-server`.
Host keys are generated at installation. However, not all key types provide the same level of security.
To harden a bit the configuration, write a configuration file for sshd in `/etc/ssh/sshd_config.d`. The file name must end with `.conf` to be properly included by `/etc/ssh/sshd_config`.
/etc/ssh/sshd_config.d/hardening.conf:
# Restrict to strong Key exchange algorithms
KexAlgorithms curve25519-sha256@libssh.org,diffie-hellman-group-exchange-sha256
# Only use robust keys for server authentication
HostKey /etc/ssh/ssh_host_ed25519_key
# Only accept robust keys for client authentication
PubkeyAcceptedKeyTypes ssh-ed25519-cert-v01@openssh.com,ssh-ed25519
# Restrict to strong symmetric ciphers
Ciphers chacha20-poly1305@openssh.com,aes256-gcm@openssh.com,aes128-gcm@openssh.com,aes256-ctr,aes192-ctr,aes128-ctr
# Use strong message authentication codes if the cipher does not ensure integrity already
MACs hmac-sha2-512-etm@openssh.com,hmac-sha2-256-etm@openssh.com,umac-128-etm@openssh.com,hmac-sha2-512,hmac-sha2-256,umac-128@openssh.com
With this configuration, the ssh server restricts possible choices to the safest ones at each step of the connection. In particular, only ed25519 keys are accepted for user authentication, and the server only offers its ed25519 key to identify itself.
The fingerprint of the host key is `SHA256:nP5I0+w5KJZ7em4M3WExx6bE7I5YLWVvEWzHFN7H9uE`.
Remove completely host dsa and ecdsa keys to be certain they cannot be used.
rm /etc/ssh/ssh_host_dsa_key*
rm /etc/ssh/ssh_host_ecdsa_key*
We keep the RSA key in case someone comes with a very old SSH client that does not support ed25519 (sshd configuration would then have to be adapted).
### SSH inside of the cluster for users
MPI requires ssh connections without keyboard input.
Therefore, users need a passwordless ssh key for internal connections.
Create a key pair for the user:
sudo -u <user> ssh-keygen -t ed25519 -N "" -f /home/<user>/.ssh/id_ed25519_cluster
Since `/home` is shared via NFS, so is each user's `/home/<user>/.ssh`. It is therefore enough to copy the public key to the `authorized_keys` file on the head for the key to be recognised on all nodes.
sudo -u <user> sh -c 'cat /home/<user>/.ssh/id_ed25519_cluster.pub >> /home/<user>/.ssh/authorized_keys'
These steps can be performed at user creation.
For every new host, ssh asks to accept the host key fingerprint. This will block MPI connections. Moreover, with $n$ hosts and $m$ users, there would be $m n^2$ fingerprints to add to the `known_hosts`. This will quickly become impractical.
To avoid this inconvenience, we add a configuration file to ssh that disables host key checking for hosts in the local network, and does not write the key to `known_hosts`. In addition, we specify the key to be used for internal connections.
Add the following file to `/etc/ssh/ssh_config.d/` on the head and the nodes to change the system-wide configuration:
/etc/ssh/ssh_config.d/disable_host_key_check.conf:
Match exec "host %h | awk '/ has address/ {print $4}' | grep 192.168.1.*"
IdentityFile ~/.ssh/id_ed25519_cluster
StrictHostKeyChecking no
UserKnownHostsFile /dev/null
LogLevel ERROR
FIXME: This might slow down internode communications. If that is the case, StrictHostKeyChecking could simply be disabled in the `/etc/ssh/ssh_config` on the nodes, and the IP check only used on the head. However, I think MPI only uses the SSH connection once to set up communications, which then use other protocols, in which case the IP check shouldn't be a problem.
The exec line resolves the IP for the host name, and checks whether it belongs to the local network. LogLevel is lowered to disable `Warning: Permanently added <host> to the list of known hosts`.
### SSH root to the nodes
For administration purposes, it is useful to be able to log as root on the nodes over ssh. If the NFS-shared `/home` breaks for some reason, normal users cannot ssh anymore and only root will be able to access the nodes.
To allow root login over ssh, change the line `pam_unix.so nullok_secure` to `pam_unix.so nullok` (in `/etc/pam.d/common-auth` on the nodes). FIXME: Is that really necessary?
Copy the root public key to `/root/.ssh/authorized_keys` on the nodes.
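A possible way to do it in the build script, assuming the bootstrapped root is in `$ROOT` and the head root key is `/root/.ssh/id_ed25519.pub` (both are assumptions):
mkdir -p "$ROOT/root/.ssh"
chmod 700 "$ROOT/root/.ssh"
cat /root/.ssh/id_ed25519.pub >> "$ROOT/root/.ssh/authorized_keys"
chmod 600 "$ROOT/root/.ssh/authorized_keys"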
## DNS
Managing `/etc/hosts` on all the nodes is not convenient. Whenever a node is added or removed, the corresponding entry has to be pushed to all nodes, and the build script modified to make changes permanent across reboots.
Therefore we install a DNS on the head that will centralise name resolution for the cluster.
### Server configuration
We use the standard `bind9` DNS implementation.
A nice tutorial: https://ubuntu.com/server/docs/service-domain-name-service-dns. Beware though, it does not take into account the presence of `systemd-resolved`.
By default on Ubuntu, `systemd-resolved` handles DNS queries. It acts as a small DNS server listening on the local loopback, forwards requests to actual DNS servers and allows some operations in between, such as local caching and DNS query encryption. This is fine for a client machine, but causes problems on a server, because `systemd-resolved` listens on the same port as a DNS server, and there is no way to change it. The DNS server port could be changed instead, but clients would then need to know which port to use.
It is simpler to disable `systemd-resolved` on the server and replace it with `bind9` altogether.
The DNS will handle fully qualified domain names (FQDN), so we need to choose a domain name for the cluster. We choose `ceres.lps.u-psud.fr` as a natural domain name.
Machines in the cluster will have names like `nodex.ceres.lps.u-psud.fr`.
#### Get rid of `systemd-resolved`
Rewrite `/etc/resolv.conf` to specify that DNS queries should be sent to the local DNS server listening on the local loopback, and that names that are not fully qualified (eg `nodex`) should also be searched in the cluster domain. This way, the host name alone is enough to resolve machines in the cluster.
/etc/resolv.conf:
nameserver 127.0.0.1
search ceres.lps.u-psud.fr
Make sure `/etc/resolv.conf` is not a symlink to some `systemd-resolved` file.
In `/etc/systemd/resolved.conf` set
DNSStubListener=no
Disable `systemd-resolved` entirely.
systemctl disable systemd-resolved
systemctl stop systemd-resolved
#### `bind9` configuration
First install `bind9`.
apt install bind9
We will configure `bind9` as a primary DNS on the cluster zone.
See https://bind9.readthedocs.io/en/v9_16_5/reference.html# for details on the configuration options.
Modify the global DNS options in `/etc/bind/named.conf.options`:
* Add lines `listen-on {127.0.0.1; 192.168.0.0/22; };` and `allow-query-on {127.0.0.1; 192.168.0.0/22; };` to ensure that the DNS only serves the cluster network and local name resolution on the head.
* Add the line `forwarders {129.175.80.8; };` to forward DNS queries that the local server cannot answer to the university DNS. Actually, for more reliability, check the DNS servers advertised by DHCP in `/var/lib/dhcp/dhclient.leases` and add them all to the forwarders.
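For reference, the resulting additions to the `options { ... }` block look like this (existing lines elided):
options {
    ...
    listen-on { 127.0.0.1; 192.168.0.0/22; };
    allow-query-on { 127.0.0.1; 192.168.0.0/22; };
    forwarders { 129.175.80.8; };
};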
Declare a new zone on which the DNS will be authoritative in `/etc/bind/named.conf.local`.
/etc/bind/named.conf.local:
zone "ceres.lps.u-psud.fr" {
type master;
file "/etc/bind/db.ceres.lps.u-psud.fr";
};
Copy the local zone database as a starting point for the cluster zone database.
cp /etc/bind/db.local /etc/bind/db.ceres.lps.u-psud.fr
Zone databases contain a list of records, each on one line. The first field is the label of the record; `@` is a placeholder for the zone name (here `ceres.lps.u-psud.fr.`). The second field is the class of the record; `IN`, for "internet", is almost always used. The third field is the record type. We use `SOA`, which describes the source of authority, `NS`, which declares a name server, `A`, which maps a name to an IPv4 address, and `CNAME`, which defines an alias for an existing name.
Modify `db.ceres.lps.u-psud.fr` as follows to set the zone parameters and add a first record for the head along with an alias.
/etc/bind/db.ceres.lps.u-psud.fr:
$TTL 604800
@ IN SOA ceres.lps.u-psud.fr. root.ceres.lps.u-psud.fr. (
2 ; Serial
604800 ; Refresh
86400 ; Retry
2419200 ; Expire
604800 ) ; Negative Cache TTL
;
@ IN NS ceres.lps.u-psud.fr.
@ IN A 192.168.0.1
head IN A 192.168.0.1
ceres IN CNAME head
Node records will be appended at the end of this file with the format `<nodename> IN A <nodeaddr>`. FIXME
Check the DNS configuration with `named-checkconf` and the cluster zone configuration with `named-checkzone ceres.lps.u-psud.fr /etc/bind/db.ceres.lps.u-psud.fr`.
Restart the DNS server with `systemctl restart bind9`.
#### DHCP configuration
Instruct the DHCP server to send DNS address and search domains to clients by adding the following lines in the subnet declaration of `/etc/dhcp/dhcpd.conf`:
option domain-name-servers 192.168.0.1;
option domain-name "ceres.lps.u-psud.fr";
### Client configuration
On Ubuntu, by default, `systemd-networkd` automatically uses DNS and search domains provided by the DHCP server. Therefore, no configuration is required on the client side.
Note that because of the issue described in Systemd-networkd-wait-online section, the head's address still needs to be hardcoded in `/etc/hosts`.
## Users management
Users account credentials will be stored in a LDAP database to ensure that they are consistent across the cluster (see next section).
In addition, we create a local user "ceres" on the head that will be used to build and install software in `/opt`. Performing these installations as a normal user instead of root ensures that the installation scripts will not write outside of the ceres-owned installation directory and mess with the system. This is standard good administrative practice.
In addition, we create this user on the nodes as well by copying the lines starting with `ceres:` in `/etc/passwd` and `/etc/shadow` on the head to the nodes image.
This is done by the image building script.
## LDAP, NSS and PAM
https://ubuntu.com/server/docs/service-ldap
Users need to be present and have consistent UIDs across the cluster. Copying `/etc/passwd`, `/etc/shadow`, `/etc/group` and `/etc/gshadow` to all hosts and keeping them up to date is impractical, and the risk of ending up in an inconsistent state is high.
In the same way that DNS allows centralising host name resolution, LDAP, in conjunction with NSS and PAM, allows centralising user authentication.
LDAP is a general remotely accessible hierarchical database. It can be used to store user credentials which can then be used by NSS and PAM to authenticate users against a centralised database.
LDAP can be used for much more than user centralisation. Its configuration is therefore a bit tedious, and its semantics are quite cryptic.
### LDAP server configuration
The following tutorial explains how to set up a LDAP server for user authentication: https://computingforgeeks.com/install-and-configure-openldap-server-ubuntu/.
Fully specify the hostname of the head.
hostnamectl set-hostname head.ceres.lps.u-psud.fr
Make sure that the host name can be resolved either via DNS (see DNS section) or by adding a line to `/etc/hosts`.
Install the LDAP server.
apt install slapd
The LDAP database (or tree) will be initialised with an admin entry, specifying the LDAP account used for LDAP administration.
During installation, enter a password for this admin account.
The content of the freshly installed LDAP tree can be displayed with `slapcat`. It contains two entries, identified by their *distinguished name* (`dn`) each containing a set of *attributes* in "key: value" format.
The first entry `dc=ceres,dc=lps,dc=u-psud,dc=fr` is the root of the tree, and the second `cn=admin,dc=ceres,dc=lps,dc=u-psud,dc=fr` is the admin account.
By default, the LDAP server listens on all interfaces. Moreover, when authenticating to the server, the connection is not protected: LDAP credentials (such as those of the admin account) are exchanged in cleartext, and thus vulnerable to eavesdropping.
To limit the risk, instruct the LDAP daemon to only listen on local sockets (for local connections on the head) and on the internal network. In `/etc/default/slapd`, set `SLAPD_SERVICES="ldapi:/// ldap://192.168.0.1"`.
Restart the daemon with `systemctl restart slapd`.
Connections are still unprotected, but at least they do not leave the internal network, and the LDAP server cannot be accessed from outside. Connections could be encrypted with SSL, but it is quite complicated to set up. FIXME: should we do it?
In order to finish the server configuration as an authentication database, we need to manipulate the LDAP tree and add information about users. For this, we need the LDAP client programs.
### LDAP client configuration on the head, database set-up
Install the LDAP client package.
apt install ldap-utils
This package provides client programs such as `ldapsearch`, `ldapadd`, `ldapmodify` to query and modify a LDAP tree.
Client programs use the configuration file `/etc/ldap/ldap.conf`. In this file, specify the LDAP tree root with `BASE dc=ceres,dc=lps,dc=u-psud,dc=fr` and the server address as `URI ldapi:///` to connect locally via sockets. This saves passing the `-H` and `-b` options in later command-line calls to specify the server address and tree base respectively.
We will now populate the tree with the required entries to make it usable as an authentication database.
Entries are added by writing so-called "ldif" modification files which are then submitted to the server.
Write a file `basedn.ldif` to add one entry to store users and one to store groups in the LDAP tree.
basedn.ldif:
dn: ou=people,dc=ceres,dc=lps,dc=u-psud,dc=fr
objectClass: organizationalUnit
ou: people

dn: ou=groups,dc=ceres,dc=lps,dc=u-psud,dc=fr
objectClass: organizationalUnit
ou: groups
Modification of the database requires authenticated access as admin to the LDAP server. Add the 2 entries using the following command:
ldapadd -x -D cn=admin,dc=ceres,dc=lps,dc=u-psud,dc=fr -W -f basedn.ldif
The `-x` option requests a "simple" connection to the LDAP server (ie. not protected), `-D` specifies the account to connect as (here admin), `-W` requests a password prompt to enter the account password interactively and `-f` specifies the file to be submitted.
`slapcat` should now display the additional entries.
Let's now write a file to create a new entry for user "testuser".
ldapuser.ldif:
dn: uid=testuser,ou=people,dc=ceres,dc=lps,dc=u-psud,dc=fr
objectClass: inetOrgPerson
objectClass: posixAccount
objectClass: shadowAccount
cn: testuser
sn: Test User
userPassword: {SSHA}Zn4/E5f+Ork7WZF/alrpMuHHGufC3x0k
loginShell: /bin/bash
uidNumber: 2000
gidNumber: 2000
homeDirectory: /home/testuser
The UID and GID must be unused. The hash of the password, to be stored in the `userPassword` attribute, can be generated with `slappasswd`.
Add the file to the LDAP tree with
ldapadd -x -D cn=admin,dc=ceres,dc=lps,dc=u-psud,dc=fr -W -f ldapuser.ldif
Likewise, add a group "testgroup".
ldapgroup.ldif:
dn: cn=testgroup,ou=groups,dc=ceres,dc=lps,dc=u-psud,dc=fr
objectClass: posixGroup
cn: testgroup
gidNumber: 2000
memberUid: testuser
And add the entry with the usual `ldapadd -x -D cn=admin,dc=ceres,dc=lps,dc=u-psud,dc=fr -W -f ldapgroup.ldif`.
### NSS and PAM configuration on the head
Now that the LDAP tree contains user credentials, we can configure the head to use the LDAP server as an authentication source.
The following instructions are taken mostly from: https://computingforgeeks.com/how-to-configure-ubuntu-as-ldap-client/.
NSS, which handles name resolution sources, and PAM, which takes care of authentication, need to be configured to use the LDAP server as a source of authentication data.
Install the required libraries.
apt install libnss-ldap libpam-ldap
During installation, enter the LDAP server address `ldapi:///`, the base of the LDAP tree `dc=ceres,dc=lps,dc=u-psud,dc=fr` and the LDAP version `3`. Answer `Yes` to making the local root the database admin and `No` to requiring login. Enter the distinguished name of the admin account of the database `cn=admin,dc=ceres,dc=lps,dc=u-psud,dc=fr` and its password.
After the installation, edit `/etc/nsswitch.conf` and add `ldap` at the end of `passwd`, `group` and `shadow` lines to add the LDAP server as a source of authentication data.
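Assuming the stock Ubuntu defaults, the relevant lines of `/etc/nsswitch.conf` then look like:
passwd:         files systemd ldap
group:          files systemd ldap
shadow:         files ldap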
In order to allow users to change their password themselves, look for the line `password [success=1 user_unknown=ignore default=die] pam_ldap.so use_authtok try_first_pass` in `/etc/pam.d/common-password` and remove the keyword `use_authtok`.
Enable creation of home directory on first login by adding the line `session optional pam_mkhomedir.so skel=/etc/skel umask=077` after the last `session optional ...` entry in `/etc/pam.d/common-session`.
Check that everything is working by trying to switch user with `su testuser`.
The list of users known to the system can be displayed with `getent passwd`. If everything is in order, users on the LDAP server should be displayed along with the ones defined locally in `/etc/passwd`.
In case of problems, check the server address and tree base in `/etc/ldap.conf`. **Note that this file, used by NSS, is different from `/etc/ldap/ldap.conf` used by the LDAP client programs.**
Also check `/etc/ldap.secret` which should contain the LDAP admin password.
### Node configuration
Make sure that the head is reachable from the nodes with `host head`.
Install the client package and authentication libraries.
apt install ldap-utils
In `/etc/ldap/ldap.conf` set `BASE dc=ceres,dc=lps,dc=u-psud,dc=fr` and `URI ldap://head`.
To configure the LDAP client on the nodes, we cannot rely on the interactive configuration. The interactive step can be skipped in the script by prefixing `apt install` with `DEBIAN_FRONTEND=noninteractive`.
DEBIAN_FRONTEND=noninteractive apt install libnss-ldap libpam-ldap
`/etc/ldap.conf` must then be written explicitly as:
/etc/ldap.conf:
uri ldap://head
base dc=ceres,dc=lps,dc=u-psud,dc=fr
ldap_version 3
Since users are not supposed to change their password from the nodes, it is not necessary to edit `/etc/pam.d/common-password`. Home directories are created from the head at user creation, so it is not necessary to edit `/etc/pam.d/common-session` on the nodes either.
This should be enough for `getent passwd` to display users stored in the LDAP tree, and to connect to their accounts (with `su <user>` for instance).
## NSCD
To fetch credentials faster, NSS calls can be cached by a small daemon, `nscd`.
It requires no additional configuration and can be simply installed on both head and nodes with
apt install nscd
When a user or group is updated, the cache contains inconsistent data. It can be refreshed with `systemctl reload nscd`.
The daemon also caches DNS lookups, therefore it also needs to be refreshed when DNS records are altered. To avoid flushing all caches one can use `nscd --invalidate=hosts` to only refresh host cache.
## SMTP
An SMTP server allows sending emails from the head. It can be used to report errors to admins automatically, by Slurm to send mail to users about their jobs, and to automate mailing list management by sending commands to Sympa by mail.
We use the OpenSMTPD implementation of the SMTP server.
### OpenSMTP installation
Make sure that the head's hostname is properly set to its domain name (ceres.lps.u-psud.fr). If not, update it with `hostnamectl set-hostname <hostname>`.
Install the package with `apt install opensmtpd`.
The installer asks for the domain that the server will handle. It should be `ceres.lps.u-psud.fr`.
It then asks for an alias for root. You can put user ceres, although it seems that this configuration is ignored.
The default configuration file `/etc/smtpd.conf` is enough for what we need (listening on local and sending mails away). In particular, this configuration does not allow receiving emails from outside.
Make sure that the daemon is running and enabled with `systemctl status opensmtpd`.
### Usage
In order to communicate with the SMTP daemon, a few utility programs must be installed.
apt install mailutils
This provides a `mail` command that can be used to send emails.
The university SMTP server seems to accept all emails coming from a domain in the internal network, so emails can apparently be sent to @u-psud.fr and @universite-paris-saclay.fr addresses without issues. In particular, mails can be sent to the cluster mailing lists.
User ceres should be used to send alert emails to admins. Add ceres@ceres.lps.u-psud.fr to the subscribers of the cluster-admin.lps@universite-paris-saclay.fr list. Since the head cannot receive emails, change the "receive" setting of this address to "no mail" to avoid getting errors from the list.
Then, as user ceres, an alert message can be sent with
mail -s "Mail subject" cluster-admin.lps@universite-paris-saclay.fr << EOF
Mail content
blablabla
EOF
See ZFS section to configure automatic email alerts.
Sympa mailing lists can be managed by sending commands in the body of a mail. We use it to automatically add new users to the cluster-user mailing list.
Add ceres@ceres.lps.u-psud.fr to the owners of the list.
Then commands can be sent to the Sympa server by sending emails to lps-sympa@listes2.di.u-psud.fr. The list of possible commands can be found at https://listes.universite-paris-saclay.fr/lps/help/mail_commands.
## WakeOnLAN
WakeOnLAN allows machines to be powered up from the network, by sending a "magic packet" in the broadcast domain.
It is useful to have WakeOnLAN enabled on the head to have a chance to bring it back up without moving to the lab if it goes down. Note that if you don't have access to another alive machine in the same subnet, WakeOnLAN cannot help you.
WakeOnLAN is required on the nodes so that idle nodes can be powered off and woken up again later.
WakeOnLAN needs to be supported by the network card (all modern ones support it), enabled in BIOS (usually is on servers) and enabled at the OS level (depends on systems).
### WakeOnLAN on the head
Check whether WOL is already enabled with `ethtool <interface connected to the outside world (ens4f0)>`. The "Supports Wake-on" field should include the flag "g"; if not, check the BIOS configuration (usually under "Power management" or equivalent) and the network card capabilities. If the "Wake-on" field already shows the flag "g", WOL is enabled; if not, run `ethtool -s <interface> wol g`.
In any case, to make the change persistent and OS-controlled, modify the default link configuration by copying `/lib/systemd/network/99-default.link` to `/etc/systemd/network/99-default.link`. Add a line `WakeOnLan=magic` in the `[Link]` section of the copied file.
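For reference, the sequence is:
cp /lib/systemd/network/99-default.link /etc/systemd/network/99-default.link
and the `[Link]` section of the copy then contains, in addition to the stock entries:
WakeOnLan=magic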
### WakeOnLAN on the nodes
Make sure that WOL is enabled in the BIOS.
Add a file to enable WOL on all interfaces.
/etc/systemd/network/10-enable_wol.link:
[Match]
OriginalName=*
[Link]
WakeOnLan=magic
This change should be made in the nodes build script.
## NTP
For Slurm to work properly, clocks must be synchronised across the cluster. We use the standard Network Time Protocol (NTP) for that.
https://ubuntu.com/server/docs/network-ntp
On the head, `chrony` is used for both syncing and serving NTP to the rest of the cluster.
On the nodes, `systemd-timesyncd` is used as a more lightweight NTP client.
### Head configuration
Since the head needs to act as both client and server, we use `chrony` instead of `systemd-timesyncd` which can only act as a client.
Install `chrony`:
apt install chrony
#### Head as client
Set correct local timezone:
timedatectl set-timezone Europe/Paris
Synchronize with university NTP server. Add the following line to `/etc/chrony/chrony.conf`:
/etc/chrony/chrony.conf:
server ntp.u-psud.fr iburst prefer
Restart the daemon:
systemctl restart chronyd
After a few minutes, check that everything is fine with `chronyc sources -v`. The university server should appear prefixed with `^*`, which indicates the synchronised server.
#### Head as server
To let the head act as a NTP server, simply add the following line to `/etc/chrony/chrony.conf`. The same daemon `chronyd` acts as both server and client.
/etc/chrony/chrony.conf:
allow 192.168.0.0/22
Restart the daemon with `systemctl restart chronyd`.
Add the following line to `/etc/dhcp/dhcpd.conf` to send the NTP server address via DHCP.
option ntp-servers 192.168.0.1;
Restart the server with `systemctl restart isc-dhcp-server`.
### Node configuration
The nodes only need a basic NTP client. We use `systemd-timesyncd` for that.
The NTP address is obtained from DHCP by default if the server sends it, so only the timezone has to be configured.
This can be done on the booted system with
timedatectl set-timezone Europe/Paris
or by overwriting `/etc/timezone` with `Europe/Paris` and creating a symlink to the proper timezone file at `/etc/localtime` with `ln -s /usr/share/zoneinfo/Europe/Paris /etc/localtime`.
Restart the service with
systemctl restart systemd-timesyncd
Check that the clock is synchronised with `timedatectl status` or `timedatectl timesync-status`.
#### Issue with DHCP
It should be possible to get the timezone by DHCP, by setting `option pcode "CET-1CEST,M3.5.0,M10.5.0/3";` and `option tcode "Europe/Paris";` in `/etc/dhcp/dhcpd.conf` on the head, and adding a section
[DHCPv4]
UseTimezone=yes
in the network configuration of the nodes. However, this does not work. Networkd on the node reports an error with some permission issue for setting the timezone. FIXME
## Compilers and build tools
Install essential tools for building software from source, as well as standard compilers.
Available compilers are automatically detected when building software and affect which plugins are built. Make sure that all the languages you want support for have their corresponding compiler installed.
apt install gcc g++ gfortran make cmake
## Munge
Munge must be installed before Slurm. It provides a unified authentication method between the hosts in the cluster, and allows them to spawn processes on each other without root privileges.
The clocks of the communicating machines need to be synchronised (see NTP section).
The easiest way of installing Munge is via the repositories. Slurm needs the development package as well.
apt install munge libmunge-dev
During the apt installation, a default key is generated with `/dev/urandom`. It is recommended to generate a new one using `/dev/random` with the command
create-munge-key -r -f
This can take quite some time (it took 20 min for me). After the new key is generated, restart the daemon with `systemctl restart munge`.
The apt installer automatically creates a new munge system user. The munged daemon should run as this user (the default systemd unit file `/lib/systemd/system/munge.service` normally does that), and permissions for the various directories and files should be set according to the documentation in `/usr/share/doc/munge/QUICKSTART.gz` (directories are created with proper permissions when Munge is installed from the apt repositories).
On the nodes, the development libraries are not needed, so simply `apt install munge`.
All machines in the cluster need to have the same munge key in `/etc/munge/munge.key`. The key should be owned by the munge user and have permissions `400`. The munge user doesn't need to have the same UID/GID on all hosts (and it won't). Therefore, when copying the munge key from the head to a node, remember to fix the ownership with `chown munge:munge /etc/munge/munge.key`.
Test that the installation is working with
munge -n
munge -n | unmunge
munge -n | ssh <host> unmunge
## PMIx
PMIx is the "glue" between Slurm and the cluster's HPC components (MPI, OMP, Network fabric...).
OpenMPI comes with its own version of PMIx, however it does not work well with Slurm.
We decide to build a proper PMIx and link all other programs against it. This ensures that they are all using the same version of PMIx.
We build PMIx from source and install it to `/opt` as user "ceres", along with the other HPC components, to have a recent version and avoid bloating the node image.
Make sure that `munge` is installed (see Munge section above).
Install `hwloc` and `libevent` along with their development headers on the head.
apt install hwloc libhwloc-dev libevent-dev
On the nodes, the development headers are not necessary so only install `hwloc`, `libevent-2.1-7`, `libevent-pthreads-2.1-7` and `libevent-core-2.1-7`.
Download the source code from https://openpmix.github.io/downloads and extract it in `/usr/local/src`. Slurm 21.08 only supports PMIx up to version 3.
Create the installation directory `mkdir /opt/pmix-<version>` and transfer ownership with `chown -R ceres:users /opt/pmix-<version>`. Move into the source tree and run as user "ceres"
./configure --prefix=/opt/pmix-<version>
make -j 16
make install
Make sure that `/opt` is properly exported via NFS on the head, and mounted on the nodes (see NFS section).
Both on the head and nodes, add `/opt/pmix-<version>/lib` to `LD_LIBRARY_PATH` by adding a line `export LD_LIBRARY_PATH=/opt/pmix-<version>/lib:$LD_LIBRARY_PATH` to `/etc/bash.bashrc`, **above the non-interactive kill-switch**.
## Slurm
### Building from source
Make sure that `munge` is installed along with its development tools (see Munge section).
Make sure that PMIx is installed (see PMIx section).
Install extra packages for Slurm. `hwloc` allows using CPU cores as consumable resources; otherwise, the CPU is the finest level of granularity. `json-c` is required for the power-save features.
apt install libz4-dev hwloc libhwloc-dev libjson-c-dev
Download Slurm code as a tarball.
Extract it (for instance in `/usr/local/src/slurm-<version>`).
Create the installation directory (eg: `/opt/slurm-<version>`) and transfer ownership to "ceres:users".
Move into the extracted slurm directory and run as user "ceres"
./configure --prefix=<install directory> --with-systemdsystemunitdir=/etc/systemd/system --with-pmix=<pmix install dir>
make -j <n>
make install
with $n$ the number of available cores on the head, for faster parallel build.
### Slurm controller setup
Add the installation directory to `PATH` by adding the line `export PATH=<install dir>/bin:<install dir>/sbin:$PATH` at the top of `/etc/bash.bashrc`.
The Slurm controller daemon does not need to run as root. Therefore, create a `slurm` user. It must exist on all hosts so it is better to add it to the LDAP tree (see LDAP section). To create a system user in LDAP, set `userPassword: *`, `loginShell: /usr/sbin/nologin` and `homeDirectory: /nonexistent`.
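A possible ldif entry for this user, following the pattern used in the LDAP section (the UID/GID value 2001 is a placeholder, pick free ones):
slurmuser.ldif:
dn: uid=slurm,ou=people,dc=ceres,dc=lps,dc=u-psud,dc=fr
objectClass: inetOrgPerson
objectClass: posixAccount
objectClass: shadowAccount
cn: slurm
sn: slurm
userPassword: *
loginShell: /usr/sbin/nologin
uidNumber: 2001
gidNumber: 2001
homeDirectory: /nonexistent
Add it with the usual `ldapadd -x -D cn=admin,dc=ceres,dc=lps,dc=u-psud,dc=fr -W -f slurmuser.ldif` (plus a matching posixGroup entry if needed).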
Create the various directories used by the slurm controller daemon. The required permissions are listed in the "FILE AND DIRECTORY PERMISSIONS" section of `man slurm.conf`.
mkdir /var/log/slurm
chown -R slurm:slurm /var/log/slurm
sudo -u slurm touch /var/log/slurm/slurmctld.log
sudo -u slurm chmod 600 /var/log/slurm/slurmctld.log
mkdir /var/spool/slurmctld #SaveStateLocation
chown slurm:slurm /var/spool/slurmctld
Create the configuration directory. By default it is `<install dir>/etc/`.
Generate a configuration file. The tool https://slurm.schedmd.com/configurator.html can be used to get a first version.
Useful discussions of Slurm configuration can be found in the FAQ https://slurm.schedmd.com/faq.html and at https://wiki.fysik.dtu.dk/niflheim/Slurm_configuration. See the next subsection for a description of our Slurm configuration.
We use the cgroup plugins to track processes and restrict resource usage, which requires a `cgroup.conf` file to be placed in the same folder. An example file is available in the source directory under `etc/cgroup.conf.example` and can be used as a starting point. See the next subsection for our cgroup configuration.
Enable and start the controller daemon.
systemctl enable slurmctld
systemctl start slurmctld
Slurm configuration file, libraries and binaries need to be accessible on all hosts. Thus, share the Slurm installation directory via NFS.
Add a line `<install dir> 192.168.0.0/22(ro,no_subtree_check,no_root_squash)` to `/etc/exports`, update export tables with `exportfs -rv` and restart the NFS server with `systemctl reload nfs-server`.
### Slurm configuration
Slurm configuration files are located in `<install dir>/etc`. `slurm.conf` is the main configuration file. Extra plugins may add additional configuration files.
Here, we discuss the choices made in the Slurm configuration. See also the comments in the `slurm.conf` and `cgroup.conf` files used in the cluster.
#### General options
Name the cluster with `ClusterName=cluster`. Slurm can monitor several clusters at the same time. With only one cluster, this parameter is essentially useless.
Specify the host of the Slurm controller with `SlurmctldHost=head`. Slurm supports backup controllers as well.
The Slurm user running `slurmctld` is specified with `SlurmUser=slurm`.
By default, rebooted nodes will be considered DOWN if they don't come back online in 1 minute. Our nodes, especially old ones, typically take much longer to boot. Increase the resume timeout to 5 minutes with `ResumeTimeout=300`.
The option `ReturnToService=1` instructs that nodes put in the DOWN state by the Slurm controller because they were unresponsive become available again when they register with a valid configuration.
Set `RebootProgram=/usr/sbin/reboot` to enable node reboot from `scontrol reboot`.
Specify the default MPI to be used by `srun` with `MpiDefault=pmix_v3`. An alternative MPI can be specified when running `srun` with `--mpi` option.
Programs using OMP threads get the number of threads per task from the environment variable `OMP_NUM_THREADS`. We want it to be set to the number of CPUs per task specified by Slurm, or 1 if not set (single thread). We do that through the TaskProlog script run by Slurm before every task.
Create a script in `<install dir>/etc/scripts`:
<install dir>/etc/scripts/set_omp_num_threads.sh:
#!/bin/bash
if [[ -z $OMP_NUM_THREADS ]]; then
    if [[ -n $SLURM_CPUS_PER_TASK ]]; then
        echo export OMP_NUM_THREADS=$SLURM_CPUS_PER_TASK
    else
        echo export OMP_NUM_THREADS=1
    fi
fi
Set `TaskProlog=<install dir>/etc/scripts/set_omp_num_threads.sh` in `slurm.conf`.
Deactivate accounting with `AccountingStorageType=accounting_storage/none` and `JobAcctGatherType=jobacct_gather/none`.
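Collected together, the options discussed in this subsection correspond to the following `slurm.conf` excerpt:
ClusterName=cluster
SlurmctldHost=head
SlurmUser=slurm
ResumeTimeout=300
ReturnToService=1
RebootProgram=/usr/sbin/reboot
MpiDefault=pmix_v3
TaskProlog=<install dir>/etc/scripts/set_omp_num_threads.sh
AccountingStorageType=accounting_storage/none
JobAcctGatherType=jobacct_gather/none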
#### Process tracking
Slurm can track processes by several methods. The most reliable one uses the cgroup plugin and is enabled with `ProctrackType=proctrack/cgroup`. It requires a configuration file `cgroup.conf` with general configurations for the cgroup plugin.
<install dir>/etc/cgroup.conf:
CgroupAutomount=yes
CgroupMountpoint=/sys/fs/cgroup
#### Job scheduling
Slurm proposes various scheduling algorithms. We use `SchedulerType=sched/backfill`, which schedules jobs in a first-in, first-out (FIFO) way and allows later jobs to run before earlier ones as long as this does not delay the allocation of the earlier jobs. This essentially fills the gaps between long, big jobs with small, short ones. In practice, since we usually don't specify time limits for our jobs and rarely have the cluster running full, backfill scheduling will be of little use and the scheduler will behave like a standard FIFO pipeline.
#### Resources allocation
By default, the finest level of resource granularity is the CPU (*ie* socket), which means that jobs can only be allocated full CPUs, even though they might not use all cores. We use the "cons_tres" plugin, which greatly expands the available types of resources, including sockets, cores, threads and memory. This changes the meaning of "CPUs" in Slurm's terminology, which now refers to hardware threads (the finest resource granularity).
The plugin is enabled with `SelectType=select/cons_tres` and we instruct cores to be consumable resources with `SelectTypeParameters=CR_Core,CR_ONE_TASK_PER_CORE,CR_CORE_DEFAULT_DIST_BLOCK,CR_PACK_NODES`. The system still resolves the resources down to hardware threads, but will allocate one full core per task, even if more than one hardware thread is present in the core.
By default, the task/affinity plugin allocates cores in a cyclic fashion between sockets within one node. If parallel jobs use inter-process communication, it is usually preferable to allocate cores as close as possible to each other. This is requested by the `CR_CORE_DEFAULT_DIST_BLOCK` and `CR_PACK_NODES` parameters.
By default, jobs submitted to a partition with resource requirements exceeding the current available resources in the partition are put in PENDING state until more resources are added to the partition. Since nodes and partitions are rarely modified, we prefer to reject jobs exceeding partitions capacities with `EnforcePartLimits=ANY`.
#### Tasks binding
The Task plugin handles binding of tasks to allocated resources. Following the recommendations at the end of the "TASK/CGROUP PLUGIN" section of the `cgroup.conf` man page, we set `TaskPlugin=task/affinity,task/cgroup` in `slurm.conf` and `TaskAffinity=no` and `ConstrainCores=yes` in `cgroup.conf`. With this configuration, the task/affinity plugin binds tasks to the allocated resources in an efficient manner (eg: preferably allocate cores on the same socket, sockets in the same node etc...) and the task/cgroup plugin fences tasks into the allocated resources.
In addition, we specify `ConstrainKmemSpace=no` in `cgroup.conf` to avoid known kernel issues and set `TaskPluginParam=Cores` in `slurm.conf` to use cores as the preferred level of resource granularity. This ensures that when tasks run on hyperthreaded nodes, if the number of requested tasks is smaller than or equal to the number of cores, each task is allocated a full core instead of a hardware thread.
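For reference, the `cgroup.conf` settings described in this section and the previous one, collected in one place:
<install dir>/etc/cgroup.conf:
CgroupAutomount=yes
CgroupMountpoint=/sys/fs/cgroup
ConstrainCores=yes
TaskAffinity=no
ConstrainKmemSpace=no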
#### Scratch setup
We want to use the nodes disks as scratch storage that users can use for IO-intensive jobs. When a job is allocated, users can write to the local `/scratch/job-<jobID>` instead of the NFS-shared `/home/<username>`, and copy the generated files back at the end of the job.
To make sure that a node disk is properly partitioned and mounted on `/scratch` we use Slurm's ability to periodically run health monitoring scripts on all responding compute nodes.
We write a script `scratch_check.sh` that checks whether `/scratch` is mounted. If not, it partitions `/dev/sda` with a single partition spanning the whole disk, creates an XFS filesystem on it and mounts it on `/scratch`.
We choose XFS because it efficiently handles large files, which are likely to be generated by IO-intensive jobs. This requires adding `xfsprogs` to the package list of the nodes image.
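A minimal sketch of what `scratch_check.sh` could look like. The partitioning commands are an assumption (they suppose `parted` is available in the nodes image and that the scratch disk is `/dev/sda`, with the partition appearing as `/dev/sda1`); adapt them to the actual node hardware.
<install dir>/etc/scripts/scratch_check.sh (sketch):
#!/bin/bash
# Nothing to do if /scratch is already mounted.
mountpoint -q /scratch && exit 0
# Create a single partition spanning the whole disk.
parted -s /dev/sda mklabel gpt mkpart scratch xfs 0% 100%
# Create the XFS filesystem and mount it.
mkfs.xfs -f /dev/sda1
mkdir -p /scratch
mount /dev/sda1 /scratch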
In `slurm.conf`, we then set `HealthCheckProgram=<install dir>/etc/scripts/scratch_check.sh` and `HealthCheckInterval=65500`. The script will be run as root on all nodes every 65500s (maximum value is 65536) and at `slurmd` startup, after boot for instance, before registration to `slurmctld`.
Next, we want users to have their own subdirectory `/scratch/job-<jobID>` accessible to them only, similarly to `/home/<username>`. We use Slurm's "Prolog" and "Epilog" scripts to create the user scratch directory for the duration of their job. Those scripts are run as root on the nodes at the beginning and end of each job.
The script `create_user_scratch.sh` creates a `/scratch/job-<jobID>` directory owned by the user running the job, with permissions 700. The script `remove_user_scratch.sh` deletes this directory once the job is complete.
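A minimal sketch of these two scripts, assuming the `SLURM_JOB_ID` and `SLURM_JOB_USER` environment variables that `slurmd` exports to Prolog and Epilog scripts:
<install dir>/etc/scripts/create_user_scratch.sh (sketch):
#!/bin/bash
# Run as root by slurmd at job start: create the per-job scratch directory.
mkdir -p "/scratch/job-${SLURM_JOB_ID}"
chown "${SLURM_JOB_USER}:" "/scratch/job-${SLURM_JOB_ID}"
chmod 700 "/scratch/job-${SLURM_JOB_ID}"
<install dir>/etc/scripts/remove_user_scratch.sh (sketch):
#!/bin/bash
# Run as root by slurmd at job end: remove the per-job scratch directory.
rm -rf "/scratch/job-${SLURM_JOB_ID}"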
In `slurm.conf`, specify Prolog and Epilog scripts with `Prolog=<install dir>/etc/scripts/create_user_scratch.sh` and `Epilog=<install dir>/etc/scripts/remove_user_scratch.sh`.
By default, the Prolog script is only run on allocated nodes when they are actually used by a job step. Since we probably don't use `srun` to start commands as job steps, we prefer having the Prolog script run on all allocated nodes. This is requested with `PrologFlags=Alloc` in `slurm.conf`.
#### Power save
The cluster is rarely fully used, and some nodes remain idle for extended periods of time. A lot of power can be saved by powering them down.
Set `SuspendTime=3600` to power down nodes in idle state for more than one hour.
To power nodes off, Slurm needs a script to run. The script is run on the head as user SlurmUser (slurm). To power off the nodes, Slurm needs to be able to ssh to them as a privileged user. We want to avoid fiddling with the `sudo` configuration: there are many pitfalls with big security implications.
Therefore, we choose to give SlurmUser an SSH key that is granted root login on the nodes, but only permitted to run `/usr/sbin/poweroff`.
Create an SSH key for SlurmUser. Since SlurmUser is a system user, it does not have a home directory, so we store the key pair in `/var/opt/slurm`.
ssh-keygen -t ed25519 -N "" -f /var/opt/slurm/id_ed25519_slurm_poweroff
chown slurm:slurm /var/opt/slurm/id_ed25519_slurm_poweroff*
On the nodes, add the public key to `/root/.ssh/authorized_keys` with options `restrict` to disable all optional features of the SSH connection (tty allocation, agent forwarding, X forwarding etc...) and `command` to force execution of one (and only this one) command at login.
In `/root/.ssh/authorized_keys` on the nodes, add the line:
restrict,command="/usr/sbin/poweroff" <content of /var/opt/slurm/id_ed25519_slurm_poweroff.pub>
With this setting, if SlurmUser is compromised on the head, it can only power off nodes and has no root privileges on the head.
This change should be included in the nodes build script.
Next, write the script used to power nodes off. It takes the list of nodes to be powered off as its first argument. This list can contain Slurm node ranges; they are converted to a simple list of hostnames with `scontrol show hostnames`. The output of the script is not logged by Slurm, so we redirect standard output and error to a log file `/var/log/slurm/poweroff.log` manually.
<install dir>/etc/scripts/poweroff_node.sh:
#!/bin/bash
{
NODELIST=$(/opt/slurm-21.08.0/bin/scontrol show hostnames "$1")
for NODE in $NODELIST
do
echo "$(date): powering off $NODE"
ssh -i /var/opt/slurm/id_ed25519_slurm_poweroff \
-o StrictHostKeyChecking=no \
-o UserKnownHostsFile=/dev/null \
"root@$NODE"
done
} >> /var/log/slurm/poweroff.log 2>&1
In `<install dir>/etc/slurm.conf`, specify the location of the script with `SuspendProgram=<install dir>/etc/scripts/poweroff_node.sh`
To power nodes back up, we use WakeOnLAN. Make sure it is properly enabled on the nodes according to the WakeOnLAN section.
The following script is run on the head as SlurmUser when a job is allocated resources on powered-off nodes. It takes the list of nodes, possibly with Slurm ranges, as its first argument.
The list is converted to a list of hosts, and a magic packet is sent to every host. The node database `/var/opt/slurm/nodedb` is used to obtain the MAC address of the target nodes.
The standard output and error of the script is manually logged to `/var/log/slurm/powerup.log`.
<install dir>/etc/scripts/powerup_node.sh:
#!/bin/bash
{
NODELIST=$(/opt/slurm-21.08.0/bin/scontrol show hostnames "$1")
for NODE in $NODELIST
do
echo "$(date): powering up $NODE"
NODEMAC=$(/usr/bin/awk -v node="$NODE" \
'!/^($|[[:space:]]*#)/ { if ($3 == node) { print $1 } }' \
/var/opt/slurm/nodedb)
wakeonlan -i 192.168.3.255 "$NODEMAC"
done
} >> /var/log/slurm/powerup.log 2>&1
Specify the location of the script in `<install dir>/etc/slurm.conf` with `ResumeProgram=<install dir>/etc/scripts/powerup_node.sh`
Allow a timeout of 5 minutes before a powering-up node is considered down with `ResumeTimeout=300`, and instruct `sbatch` to wait 5 minutes before considering an allocated job that has not started as failed with `BatchStartTimeout=300`.
Finally, restart the Slurm controller with `systemctl restart slurmctld`.
#### Nodes declarations
We put the node declarations in a separate configuration file `<install dir>/etc/slurm.d/nodes.conf`, which we include in the main configuration file with `Include slurm.d/nodes.conf`.
In `nodes.conf` declare each node with one line specifying its name and hardware configuration. It can be obtained as the first output line of `slurmd -C` run on the node.
In order to preferably allocate nodes with little memory, add a parameter `Weight=xxx` for each node, and set the value equal to that of `RealMemory`.
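As an illustration, a node line in `nodes.conf` could look like the following (node name and hardware values are placeholders; use the output of `slurmd -C` on the actual node for the real ones):
NodeName=node01 CPUs=16 Boards=1 SocketsPerBoard=2 CoresPerSocket=8 ThreadsPerCore=1 RealMemory=32000 Weight=32000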
#### Partitions declarations
Declare the partitions (aka. queues). As for the node declarations, we put the partition declarations in a separate file `<install dir>/etc/slurm.d/partitions.conf` that we include **below** the nodes declarations with `Include slurm.d/partitions.conf`. Partition definitions can be as simple as `PartitionName=<name> Nodes=<node list>`. Node ranges can be used in the node list.
It might be convenient to define a default partition with `Default=YES`, to avoid having to select a partition with the `-p` option at job submission.
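As an illustration, a partition declaration could look like the following (partition name and node range are placeholders):
<install dir>/etc/slurm.d/partitions.conf (sketch):
PartitionName=compute Nodes=node[01-16] Default=YES State=UP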
See the `slurm.conf` man page for the list of available options.
### Node configuration
Install only the required libraries.
apt install libhwloc15 libnuma1
As discussed in `cgroup.conf` man page, Debianoid distributions like Ubuntu disable memory and swap cgroups by default. Enable them on the nodes by adding the kernel options `cgroup_enable=memory swapaccount=1` to the `APPEND` line in `/tftpboot/pxelinux.cfg/default`.
Mount the NFS shared Slurm installation by adding the line `head:<install dir> <mount dir> nfs ro,defaults 0 0` to `/etc/fstab`.
Add the install directory to PATH by adding the line `export PATH=<install dir>/bin:<install dir>/sbin:$PATH` at the top of `/etc/bash.bashrc`.
Create the necessary directories. `slurmd` runs as root, so no permission adjustment is required.
mkdir /var/log/slurm
mkdir /var/spool/slurmd
Start the daemon.
systemctl enable slurmd
systemctl start slurmd
Check that the daemons are running with `systemctl status`, and read the logs.
Try running `sinfo` on the head to get a summary of the resources managed by Slurm.
For idle node power-off to work, make sure that you have set up access with the poweroff key as described in the "Power save" subsection of "Slurm configuration" and enabled WakeOnLAN, as described in the "WakeOnLAN" section.
### Usage
Slurm takes care of two main roles on the cluster:
* Allocate resources on the cluster for jobs to run on (resource manager role).
* Start jobs according to configured queueing policies and monitor their execution (scheduler role).
A job is (most often) a bash script that runs on the allocated resources.
In Slurm terminology, queues are called "partitions".
#### Job submission
Jobs can be submitted with
sbatch [OPTIONS] <job script>
Available options are listed in the man page of `sbatch`. A few useful ones:
* `-J jobname`: specify the name of the job. If not specified, the script name is used.
* `-o outputfile`, `-e outputfile`: by default, standard output and error of the jobs are redirected to `slurm-%j.out`, with `%j` the job id. These two options allow specifying alternative files for output and error respectively. See the "filename pattern" section of `man sbatch` for the full list of available placeholders.
* `-p partition`: specify the partition (*ie.* queue) on which to submit the job. If not specified, the job is submitted to the default partition, indicated by a star `*` in `sinfo`. If no default partition is configured, this option is mandatory.
* `-n ntasks`: specify the number of tasks the job will need. Slurm will allocate the right number of CPUs to satisfy the requirement. If not specified, resources for one task per node are allocated.
* `-w nodelist`: specify a list of nodes on which to allocate resources. If the selected nodes are not sufficient to satisfy the requested resources (eg: number of cores), additional nodes are used. If the requested resources are present on the requested nodes but already in use, the job stays pending until the resources are freed. If not specified, Slurm allocates the job to any node(s) satisfying the job requirements. If a serial job is submitted with a nodelist containing several nodes, one copy of the job will run on each specified node, which is usually not what one wants. This option should only be used if you know what you are doing: it is usually better to let Slurm take care of resource allocation.
* `-N nodecount`: specify the number of nodes to allocate to the job. This number cannot be exceeded, so in conjunction with `-w` it can be used to make sure that only specific nodes will be used.
All options can also be hardcoded in the batch script by putting lines with format `#SBATCH [option]` at the top of the file, below the shebang `#!/bin/bash`. If an option is specified both in the command line and in the script, the command line value takes precedence.
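For instance, a minimal batch script with hardcoded options might look like this (job name, partition, task count and program name are placeholders):
#!/bin/bash
#SBATCH -J my_job
#SBATCH -p <partition>
#SBATCH -n 4
#SBATCH -o my_job-%j.out
./my_program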
Running `slurmd -C` on a node prints its hardware configuration on one line, ready for use in `slurm.conf`.
Updating the node hardware declarations requires restarting `slurmctld` and all `slurmd` daemons.
#### Nodes reboot
Nodes can be rebooted by Slurm with `scontrol reboot [ASAP] <Node list>`. Slurm will flag the selected nodes as "REBOOT" and reboot them once they become idle. If the "ASAP" option is used, the selected nodes are put in "DRAIN" state to prevent further jobs from being allocated on them. "ALL" can be used as node list to reboot all responding nodes.
For this to work, a reboot program must be specified in `slurm.conf` with `RebootProgram=/usr/sbin/reboot`.
#### Convenience alias
Slurm does not provide easy access to the exact occupation of each node, probably with the idea of not encouraging user-fiddling with the allocation.
`sinfo -N -o "%8N %6T %13C %z"` prints the number of available, allocated and idle cores on each node, in a format vaguely similar to `qstat` in SGE.
Add a line in `/etc/bash.bashrc`
alias snodes='sinfo -N -o "%8N %6T %13C %z"'
to be able to quickly access this information with `snodes`.
### Notes
#### Memory as a consumable resource
I tried reserving some memory for the system on the nodes. This requires setting memory as a consumable resource by setting `SelectTypeParameters=CR_Core_Memory` in `slurm.conf` and `ConstrainRAMSpace=yes` in `cgroup.conf`.
To reserve some memory for the system on all nodes, add a default node declaration `NodeName=DEFAULT MemSpecLimit=2048` **above** the include line.
With this configuration, jobs taking too much memory are killed by the cgroup policy. However, this makes memory a consumable resource, which affects job allocations. If no memory is requested in `sbatch` calls (with `--mem` for instance), jobs request by default all the memory of the allocated node. Therefore, even if a job only uses a few cores on the node, no other job will be allocated to the same node, because all its memory is already allocated and memory cannot be oversubscribed.
This results in very inefficient use of cluster resources.
I don't believe that users will be willing to request memory in addition to cores for their jobs (they would need to have an idea of how much memory they need). Therefore, I disabled the configuration described in this subsection.
As a consequence, memory usage is not monitored and jobs might fill the memory of the nodes they run on, causing other jobs to be killed by the node's kernel. This is the price to pay for not bothering about memory in job submissions.
#### nss_slurm
As far as Slurm is concerned, the LDAP server could have been avoided by using the nss_slurm plugin, which sends user credentials along with the job information. See https://slurm.schedmd.com/nss_slurm.html. This plugin, however, only provides user details for job steps. The LDAP server remains a much more flexible and reliable solution for the whole cluster.
https://slurm.schedmd.com/power_save.html
## SlurmDBD
https://slurm.schedmd.com/accounting.html
SlurmDBD is the accounting database daemon of Slurm. It interfaces with a third-party database to store data about running and past jobs. It is necessary to set up fair-share priorities for the jobs on the cluster.
FIXME: for the time being, we install SlurmDBD on the head. However, it would be nice to have it on a separate machine (eg NAS) in the future. Also, MariaDB might be a better choice.
SlurmDBD requires a backend database manager to interact with, and currently supports MySQL and MariaDB. Hence, we will first install MySQL on the head.
### MySQL configuration
On the head, install MySQL client libraries and recompile Slurm with them.
apt install libmysqlclient-dev
On the machine that will run the database, install the database server.
apt install mysql-server
Then secure the server by running `mysql_secure_installation`. We don't need the password strength checking plugin.
The root password is set to the same value as the root login password. Answer yes to all subsequent questions to remove the anonymous user, restrict listening to localhost, remove the test database and apply all changes.
Following https://slurm.schedmd.com/accounting.html, change the default InnoDB parameters by adding the following lines to `/etc/mysql/my.cnf`:
/etc/mysql/my.cnf:
[mysqld]
innodb_buffer_pool_size=1024M
innodb_log_file_size=64M
innodb_lock_wait_timeout=900
Restart the database server with `systemctl restart mysql`.
Next, create a slurm user in MySQL and a new database for SlurmDBD.
Run `mysql` to enter the MySQL shell. In that shell run the following commands. Successful commands are followed by `Query OK`.
create user 'slurm'@'localhost' identified by 'PASSWORD';
`PASSWORD` should be replaced with a password. You can generate a random strong one with `openssl rand -base64 12`. Keep good note of this password as it will be required for the SlurmDBD configuration. The single quotes in the command above are required.
Then grant all access to the created slurm user on the future slurm database with
grant all on slurm_acct_db.* TO 'slurm'@'localhost';
Ensure that InnoDB is available by running `show engines`.
Finally, create the slurm accounting database by running
create database slurm_acct_db;
### SlurmDBD configuration
Copy the example `slurmdbd.conf` from source directory (`/usr/local/src/slurm-<version>/etc/slurmdbd.conf.example`) to installation directory (`/opt/slurm-<version>/etc/slurmdbd.conf`).
Since this file will contain credentials for the database, change ownership and permissions to only grant access to the slurm user.
chown slurm: slurmdbd.conf
chmod 600 slurmdbd.conf
In `slurmdbd.conf` change `StoragePass` value to the password of the slurm user in MySQL.
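After editing, the storage-related settings in `slurmdbd.conf` should end up looking like the following sketch, with values matching the MySQL setup above:
StorageType=accounting_storage/mysql
StorageHost=localhost
StorageUser=slurm
StoragePass=<password of the slurm MySQL user>
StorageLoc=slurm_acct_db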
Start and enable `slurmdbd`.
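Assuming a `slurmdbd` systemd unit was installed along with the other Slurm units:
systemctl enable slurmdbd
systemctl start slurmdbd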
### Slurm configuration
Modify `slurm.conf` to enable fair-share priority of the jobs.
Add the lines `AccountingStorageType=accounting_storage/slurmdbd` and `AccountingStorageHost=localhost` to instruct `slurmctld` to send accounting data to the `slurmdbd` running on localhost.
Add the line `JobAcctGatherType=jobacct_gather/linux` to use the standard linux plugin for job accounting.
Enable the multifactor priority plugin with `PriorityType=priority/multifactor`.
Give an equal weight to age and fairshare factors in priority calculation with `PriorityWeightAge=1000` and `PriorityWeightFairshare=1000`.
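For reference, the `slurm.conf` additions described above, collected in one place:
AccountingStorageType=accounting_storage/slurmdbd
AccountingStorageHost=localhost
JobAcctGatherType=jobacct_gather/linux
PriorityType=priority/multifactor
PriorityWeightAge=1000
PriorityWeightFairshare=1000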
Restart `slurmctld`.
Finally, create an account on the Slurm database with `sacctmgr add account standard fairshare=100`.
On this account create users and give them all the same fairshare with `sacctmgr add user <username> defaultaccount=standard fairshare=10`. This should be added to the user creation script.
Fairshares are normalised so only relative fairshare values between accounts and between users in each account matter for the fairshare calculation.
The fairshare of each association (user + account) can be checked with `sshare -a`. It takes a few minutes before the fairshare factors are actually properly calculated.
## OpenMPI
### Build and install
Make sure that PMIx is properly installed (see PMIx section).
Download the tarball from openmpi.org and extract it into the build location (eg: `/usr/local/src/openmpi-<version>`).
Move into the extracted source directory.
Create the installation directory (eg: `/opt/openmpi-<version>`) and transfer ownership to "ceres:users".
As user "ceres", run
./configure --prefix=<install dir> --with-pmix=<pmix install dir>
make all -j <n cores>
make install
OpenMPI is built with Slurm support by default.
Add path to executables and libraries to the environment variables. For the changes to be effective for non-interactive shells, the following lines must be added **above** the non-interactive kill switch in `~/.bashrc`
export PATH=<install dir>/bin:$PATH
export LD_LIBRARY_PATH=<install dir>/lib:$LD_LIBRARY_PATH
Edit `/etc/exports` to share the installation directory. Read-only access is sufficient.
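As an illustration, the export line could look like the following (assuming the internal network is 192.168.3.0/24, as suggested by the WakeOnLAN broadcast address used above; adapt to the actual internal subnet):
/opt/openmpi-<version> 192.168.3.0/24(ro,no_subtree_check)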
### Run
Compile test programs in `/usr/local/src/openmpi-4.1.1/examples` by issuing a `make`.
`hello_c` just prints a message from every process. No inter-process communication is involved. Useful to make sure that mpi binaries and remote hosts are accessible.
`ring_c` passes message around a ring of processes and decrements a value. Useful to check inter-process communication.
`connectivity_c` checks connectivity between all processes.
If Slurm is not installed run MPI programs with
mpirun --host <list of hosts> -np <number of processes> ring_c
By default, the number of processes spawned on one host cannot exceed the number of cores on this host.
If Slurm is installed, `mpirun` gets the host list and host specifications from it. Then, by default, `mpirun` spawns as many processes as allocated cores. The number of processes can still be adjusted with the `-n` option of `mpirun` in the batch script if one wants to run a program on fewer cores than the allocated ones.
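For instance, a minimal MPI batch script could look like this (task count and program name are placeholders):
#!/bin/bash
#SBATCH -n 16
# mpirun picks up the host list and core count from the Slurm allocation.
mpirun ./ring_c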
### Troubleshooting
Issues that can arise:
* Host unreachable => check the network config. Can you ping the host?
* `mpirun` just hangs => check that you can connect to the hosts via ssh without user intervention. Use passwordless keys, accept host keys beforehand or set `StrictHostKeyChecking no` to avoid it altogether.
* Connection from unexpected IP address => this can happen if a host is connected to several networks (eg. the head, connected to both external and internal network). By default OpenMPI tries to use all the available networks. Instruct it to use specific networks with the option `--mca btl_tcp_if_include 192.168.10.0/24`. This should not happen with Slurm.
### Intel MPI Benchmark
Download and unzip https://github.com/intel/mpi-benchmarks/archive/refs/heads/master.zip
Change CC in `src_c/P2P/Makefile` to `CC=mpicc` and CC and CXX in `src_cpp/Makefile` to `CC=mpicc` and `CXX=mpiCC`.
Run `make all`.
There are many benchmarks available. The documentation can be found here: https://software.intel.com/content/www/us/en/develop/documentation/imb-user-guide/top.html.
The PingPong benchmark of the program IMB-MPI1 can be used to assess communication speed between pairs of processes.
mpirun -np 2 --host host1,host2 --mca btl_tcp_if_include 192.168.10.0/24 ./IMB-MPI1 PingPong
## Python
We install a system-wide Python3 on the head. We don't want to bloat the nodes image with Python, so we create a virtual environment in `/opt` (as user ceres), which is NFS-shared with the nodes. The virtualenv is silently activated in the bashrc.
We choose not to use Anaconda because it is bloated. We don't use Miniconda either because the standard python with `virtualenv` does the job.
### Installation
Install `pip` and `virtualenv` from the repositories.
apt install python3-pip python3-virtualenv
These packages automatically pull in the Python development headers, which are required by LAMMPS for instance.
### Virtualenv creation
We create a virtual environment in `/opt` as user ceres. All python packages will be installed in this environment.
As root, create the virtual environment directory in `/opt` and transfer ownership
mkdir /opt/python3-venv
chown -R ceres:users /opt/python3-venv
As user ceres, create the virtual environment in the directory with `virtualenv /opt/python3-venv`.
Activate the virtual environment automatically by adding the line `VIRTUAL_ENV_DISABLE_PROMPT=1 source /opt/python3-venv/bin/activate` above the non-interactive kill switch in `/etc/bash.bashrc`. The environment variable set at the beginning of the line disables the prompt modification, so that the virtual environment is transparent to users.
The same line should be added to the `/etc/bash.bashrc` of the nodes image.
As user "ceres", install packages you need with pip inside the environment by simply running `pip install <package name>` with the virtual environment activated.
For instance
pip install numpy scipy matplotlib
**Make sure you always install new packages as user ceres, inside the virtual environment.**
## FFTW
FFTW is one of the best libraries for FFT calculations.
It can be used with LAMMPS.
We build FFTW from source in `/opt` to get the latest version and control the optimisation options.
Download the latest version of the code as a tarball from https://fftw.org.
Extract it in `/usr/local/src/fftw-<version>`.
Move into the build directory and run as user "ceres"
./configure --prefix=/opt/fftw-<version> --enable-shared --enable-threads --enable-openmp --enable-mpi --enable-sse2 --enable-avx --enable-avx2 --enable-avx512 --enable-generic-simd128 --enable-generic-simd256
This requests building of shared libraries (necessary for LAMMPS), threading support with both internal and OMP plugins, and support for all standard CPU vector instructions.
Then, run `make -j 16` as user "ceres" to compile FFTW and `make install` (also as user "ceres") to install it in the configured location.
Add the FFTW libraries to `LD_LIBRARY_PATH` on both the head and the nodes by adding the line `export LD_LIBRARY_PATH=/opt/fftw-<version>/lib:$LD_LIBRARY_PATH` to `/etc/bash.bashrc`.
You can check that everything is working by running `./bench --verify 256 -onthreads=4` from the `tests` directory inside the source tree.
## LAMMPS
LAMMPS is a powerful, flexible and parallel molecular dynamics program.
We build it from source to get the latest version and choose the optimisation packages.
LAMMPS can be built with `cmake`, which makes the build process and handling of the plugins much simpler than with the standard `make` build.
apt install cmake zlib1g-dev pkg-config
LAMMPS requires `mpicxx` in `/usr/bin` to correctly detect the MPI installation. The easiest solution is to create a symbolic link from the MPI installation with `ln -s /opt/mpi-<version>/bin/mpicxx /usr/bin`.
Download LAMMPS code as a tarball and extract it in `/usr/local/src`.
LAMMPS supports out-of-source compilation, which allows building several LAMMPS binaries with different packages in different directories. In the source directory, create a directory `build_most`.
Move into this directory, and run
cmake -D CMAKE_INSTALL_PREFIX=/opt/lammps-<version> \
-C ../cmake/presets/most.cmake \
-D BUILD_OMP=yes -D PKG_USER-OMP=yes \
-D PKG_USER-INTEL=yes \
-D FFT=FFTW3 \
-D FFTW3_INCLUDE_DIR=/opt/fftw-<version>/include/ \
-D FFTW3_LIBRARY=/opt/fftw-<version>/lib/libfftw3.so \
-D FFTW3_OMP_LIBRARY=/opt/fftw-<version>/lib/libfftw3_omp.so \
-D FFT_FFTW_THREADS=on \
../cmake
This will generate appropriate Makefiles. Build LAMMPS with `make -j <n cores>` as user ceres. Create the installation directory in `/opt`, transfer ownership to "ceres:users", and install LAMMPS to the configured location with `make install`, as user ceres.
Add LAMMPS to PATH by adding a line in `/etc/bash.bashrc` on both the head and the nodes:
export PATH=/opt/lammps-<version>/bin:$PATH
You can check that LAMMPS works as intended by running the simulations available in the `examples` directory of the source tree.
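For instance, assuming the source tree is in `/usr/local/src/lammps-<version>` and that the cmake build produced the `lmp` executable, the melt example can be run with:
cd /usr/local/src/lammps-<version>/examples/melt
mpirun -np 4 lmp -in in.melt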
## Intel suite
A bunch of products from Intel OneAPI are installed in `/opt/intel` as user ceres.
## Message of the day
When users log into the cluster via SSH, a welcome message is displayed, with various information about the system. This message is called the MOTD (Message Of The Day) and is generated by PAM from a bunch of scripts located in `/etc/update-motd.d`.
The script `50-motd-news` displays useless ads from Ubuntu. Moreover, it sends detailed information about the running system to Canonical. Disable it by setting `ENABLED=0` in `/etc/default/motd-news`
Completely disable the dynamic MOTD by making all scripts non executable with `chmod -x /etc/update-motd.d/*`.
Write a simple, static MOTD in `/etc/motd`. It can be edited to communicate with users.
## Prometheus + Grafana
It is sometimes useful to have a monitoring tool that aggregates and displays various data on the running cluster.
Ganglia used to be a popular solution for this, but is no longer maintained and officially deprecated. There exist many alternatives.
We will use Prometheus to gather metrics on the cluster. Prometheus data will be visualised with Grafana.
### Node Exporter installation on head
https://medium.com/devops-dudes/install-prometheus-on-ubuntu-18-04-a51602c6256b
https://grafana.com/oss/prometheus/exporters/node-exporter/
Node Exporter is a lightweight Prometheus exporter gathering essential metrics on a machine. We will install it on the head.
It is distributed as a standalone binary. Download the archive at https://prometheus.io/download/#node_exporter and extract it somewhere. The archive contains a binary `node_exporter`. Move it to `/usr/local/bin` and change ownership to `root`.
tar xf <node_exporter archive>
mv <node_exporter archive>/node_exporter /usr/local/bin
chown root: /usr/local/bin/node_exporter
Remove the remaining files of the archive.
Create a system user that will run `node_exporter`.
useradd -rs /bin/false node_exporter
Create a Systemd unit file with the following content:
/etc/systemd/system/node_exporter.service:
[Unit]
Description=Node Exporter
After=network.target
[Service]
User=node_exporter
Group=node_exporter
Type=simple
ExecStart=/usr/local/bin/node_exporter
[Install]
WantedBy=multi-user.target
Reload units, enable and start the unit.
systemctl daemon-reload
systemctl enable node_exporter.service
systemctl start node_exporter.service
The Node Exporter is now running and is ready to answer queries from a Prometheus server on port 9100.
### Prometheus installation
We now install the Prometheus server on the head which will query Node Exporters.
Download the archive at https://prometheus.io/download/. Extract it somewhere, copy the binaries to `/usr/local/bin` and change their ownership to `root`.
tar xf <prometheus archive>
mv <prometheus archive>/prometheus <prometheus archive>/promtool /usr/local/bin
chown root: /usr/local/bin/prometheus /usr/local/bin/promtool
Create directories for Prometheus configuration files and data.
mkdir /etc/prometheus /var/lib/prometheus
Move the configuration files from the archive to the configuration directory.
mv <prometheus archive>/console_libraries <prometheus archive>/consoles <prometheus archive>/prometheus.yml /etc/prometheus
Remove the remaining files of the archive.
Create a system user that will run the Prometheus server.
useradd -rs /bin/false prometheus
Change the ownership of Prometheus directories to prometheus user.
chown -R prometheus: /etc/prometheus /var/lib/prometheus
Write a Systemd unit file for the prometheus daemon.
/etc/systemd/system/prometheus.service:
[Unit]
Description=Prometheus
After=network.target
[Service]
User=prometheus
Group=prometheus
Type=simple
ExecStart=/usr/local/bin/prometheus \
--config.file /etc/prometheus/prometheus.yml \
--storage.tsdb.path /var/lib/prometheus/ \
--web.console.templates=/etc/prometheus/consoles \
--web.console.libraries=/etc/prometheus/console_libraries
[Install]
WantedBy=multi-user.target
Reload unit files, enable and start the Prometheus server.
systemctl daemon-reload
systemctl enable prometheus
systemctl start prometheus
The http interface can be accessed by pointing a browser to http://ceres.lps.u-psud.fr:9090. On the displayed page, navigate to Status > Targets and check that the Prometheus daemon can access its own metrics on port 9090.
Targets can be added in the configuration file `/etc/prometheus/prometheus.yml`. To scrape metrics from the Node Exporter running on the head, add a target under the `scrape_configs` section:
- job_name: "node_exporter"
static_configs:
- targets: ["localhost:9100"]
Restart the Prometheus server with `systemctl restart prometheus`. The http interface should now display the new target.
### Grafana installation
Prometheus's metrics display interface is quite clumsy and complicated. Instead, we will use Grafana to display the metrics gathered by Prometheus.
We install Grafana on the head following https://grafana.com/docs/grafana/latest/installation/debian/#1-download-and-install.
The OSS version is enough for us, and lighter in corporate crap.
After the installation is complete, enable and start Grafana server.
systemctl daemon-reload
systemctl enable grafana-server
systemctl start grafana-server
The server is reachable over http on port 3000 by default.
https://grafana.com/docs/grafana/latest/getting-started/getting-started-prometheus/
https://grafana.com/docs/grafana/latest/auth/ldap/
### Adding head dashboard to Grafana
Grafana displays metrics in dashboards. Dashboards can be created from scratch, but it is easier to import existing ones.
We will use the standard Node exporter dashboard to display head's metrics. https://grafana.com/grafana/dashboards/1860
Point a browser to http://ceres.lps.u-psud.fr:3000. Click on the "+" in the left panel and click "Import dashboard". Enter the dashboard ID from Grafana's website. Once the dashboard is imported, it should display metrics in various graphs. Save the dashboard to add it permanently to the Grafana server.
### Slurm exporter
Build and install the Slurm exporter on the head, which will gather metrics from `slurmctld`. https://github.com/vpenso/prometheus-slurm-exporter.
Copy the binary `prometheus-slurm-exporter` to `/usr/local/bin` and copy the systemd unit file to `/etc/systemd/system`. In the systemd unit file, change the `ExecStart` path to `/usr/local/bin/prometheus-slurm-exporter` and add a line `Environment=PATH=<slurm path>/bin:/usr/bin:/bin` to help the daemon find Slurm's binaries (systemd does not expand `$PATH` in `Environment=` lines, so spell the path out).
Reload daemons and enable the exporter with `systemctl daemon-reload`, `systemctl enable prometheus-slurm-exporter` and `systemctl start prometheus-slurm-exporter`.
In `/etc/prometheus/prometheus.yml`, add a scrape section.
- job_name: 'slurm_exporter'
scrape_interval: 30s
scrape_timeout: 30s
static_configs:
- targets: ['localhost:8080']
Restart the prometheus daemon.
In Grafana, import the Slurm exporter dashboard to display collected metrics: https://grafana.com/grafana/dashboards/4323.
## Netdata
Sometimes, one might need finer metrics than those gathered by Prometheus.
Netdata gathers thousands of metrics every second. We install it so that it is available if we need it, but we won't let it run all the time.
### Installation
The installation is performed as root with
bash <(curl -Ss https://my-netdata.io/kickstart.sh) --stable-channel --disable-telemetry
The script installs the dependencies with `apt` and then builds Netdata. The provided options disable the collection of usage statistics and enable automatic updates to stable versions only.
The installer also creates a netdata system user and provides a systemd unit file to run `netdata` daemon as netdata user, which is started and enabled by default.
Those steps should be enough to get statistics on the head. The dashboard is accessible from the lab network, by pointing a web browser to http://ceres.lps.u-psud.fr:19999.
By default, Netdata gathers thousands of metrics, every second. Recent data is stored in RAM and then gradually pushed to disk.
### Edit configuration files
Configuration files are stored in `/etc/netdata`.
To edit them, `cd` to the configuration directory and run `edit-config`.
This script fetches stock configuration files from `/usr/lib/netdata/conf.d` and copies them to `/etc/netdata`. This ensures that the changes made by the administrators are not overridden on updates.
The script uses the program specified in `EDITOR` to open the files.
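For instance, to edit the main configuration file:
cd /etc/netdata
./edit-config netdata.conf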
### Head's configuration
#### Restrict dashboard access to cluster users
By default, `netdata` provides access to the metrics via a http server listening on port 19999, and accepts all connections.
Although `netdata` runs as an unprivileged user and the metrics are read-only, they still contain sensitive information about the system (eg: the list of users).
As a first security measure, instruct the http server to only bind to `localhost` and only provide dashboard access by changing the lines in the "[web]" section of `/etc/netdata/netdata.conf`:
bind to = localhost=dashboard
allow connections from = localhost
Restart `netdata` with `systemctl restart netdata` to enforce the changes.
With this configuration, it is no longer possible to access the dashboard by pointing a web browser to http://ceres.lps.u-psud.fr:19999. Instead, one needs to connect from *within* ceres.
This can be achieved with ssh. Forward local port 19999 (or any other) to port 19999 on ceres by running `ssh -L 19999:localhost:19999 -N <username>@ceres.lps.u-psud.fr` on your machine. All connections made to port 19999 on your machine are now forwarded to port 19999 on ceres, on which `netdata` is listening. The dashboard can thus simply be accessed by pointing your browser to http://localhost:19999.
This configuration ensures that only users able to connect to the cluster can access the dashboards. TO DO: further restrict this access to admins.
#### Honor do-not-track policy
By default, the http server does not honor the do-not-track policy of browsers because it can break access to NetdataCloud in some cases. Since we don't use NetdataCloud, we can safely honor DNT policies by changing the following line in the "[web]" section of `netdata.conf`:
respect do not track policy = yes
## Kwant
Kwant is a Python package with external non-Python dependencies, used to model quantum transport. We install it in the virtualenv, after installing all its dependencies under `/opt`.
### OpenBLAS
Build OpenBLAS following the README
## To do / Ideas for improvement
To do:
* benchmark mpi
* SensorIP
* Backup
Possible future improvements:
* Use jumbo frames to speed up local network. (see `all-subnets-local` option of dhcpd).
* Use orlando as a failover head.
* Use a distributed filesystem (such as gluster or lustre) for /home with a second server. This could increase IO speed.
* Use fiber connections on critical sections of the network (eg from head/nas to switches).