# Administrator guide

This page documents various tasks that administrators might need to perform. For convenience, scripts are provided for most tasks. The scripts straightforwardly implement the steps described here, so the commands can always be used explicitly for finer control.

## Warning

**Any machine registered to the cluster and declared as a node to Slurm will have its first disk automatically formatted as scratch space. This will result in irreversible loss of all data contained on that disk.** See the scratch space configuration in the Slurm section of the installation notes for details.

## Administration guidelines

We want the cluster to be easy to maintain and update. This requires homogeneous and good administrative practices.

* Read the docs! It is where you are most likely to find answers to your questions.
* Document thoroughly any change you make, in the right section of the documentation. Explain and motivate any choices you made. Explain the background concepts required to understand what you did.
* If you add a new configuration file, create a hard link under `/root/cluster_configuration/head/"path_to_config"` to keep all custom configurations in one place.
* Keep the hardware inventory up to date.
* Discuss with other administrators if unsure. Several brains are usually more likely to find a satisfying solution.
* Scientific/specific software can be heavy. To avoid bloating the node image, it is usually better to install it in `/opt`, which is shared over NFS.
* When building software by hand (e.g. OpenMPI, Slurm, LAMMPS...), build in `/usr/local/src/<yourprog-version>` and only *install* into `/opt/<yourprog-version>`. It is not necessary to share the whole build tree with the cluster.
* To prevent hand-built software from interfering with the system, make the install directory in `/opt` owned by `ceres:users`. Build and install as user `ceres`. Make sure that group permissions do not allow writing in the installation directory.

## Communication

Two mailing lists are used for cluster-related communications.

* cluster-admin.lps@universite-paris-saclay.fr: This list contains the cluster administrators. It can be used by admins to discuss the machine. Moreover, users can send messages to this list to get help. User messages are only delivered to the list after a moderator has accepted them.
* cluster-user.lps@universite-paris-saclay.fr: This list contains all cluster users. It can be used by administrators to broadcast messages to all users, in a newsletter fashion (warning about planned downtime, explaining an unexpected failure, etc.). Users cannot send messages to this list. Remember to add newly created users to this list!

Lists can be administrated from https://listes.universite-paris-saclay.fr/lps. Log in with your university credentials. In the menu at the top right, select "my lists". The lists you have access to should be displayed.

## Users and groups management

Users and groups are stored in an LDAP database served by the head for the whole cluster. This ensures uniform and centralised authentication on all machines. Modifying the LDAP database is a bit tedious: one needs to write a modification file, and then submit it to the database.

The primary group of all users is `users` (GID 100). Users can then belong to extra groups that define, for instance, the Slurm queues they are allowed to use. User and group IDs in the LDAP database start at 2000.

### User creation

**These steps are performed by the script `create_user.sh`.**

Determine the next available UID. To make sure that LDAP users don't clash with previously locally defined users, we use UIDs > 2000 for LDAP users. The list of users currently known to the system can be obtained with `getent passwd`.
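For example, the next free UID can be read from the `getent passwd` output with a short pipeline; this is only a sketch of the step, not necessarily how `create_user.sh` implements it:

    # Highest UID >= 2000 currently known to the system (local and LDAP), plus one.
    getent passwd | awk -F: '$3 >= 2000 { if ($3 > max) max = $3 } END { print (max >= 2000 ? max + 1 : 2000) }'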
Generate a random password with `openssl rand -base64 10`. Write it down and give it to the user. Users can change their password later on with the `passwd` command (advise them to use a strong one!).

Hash the password with ` slappasswd -s <password>`. **Mind the space at the beginning of the command: it prevents the command, and hence the password, from being stored in the bash history.** There is no such problem if you use the script, or store the password in a file.

Create an LDAP modification file:

    dn: uid="username",ou=people,dc=ceres,dc=lps,dc=u-psud,dc=fr
    objectClass: inetOrgPerson
    objectClass: posixAccount
    objectClass: shadowAccount
    cn: "username"
    sn: "username"
    userPassword: "password hash"
    loginShell: /bin/bash
    uidNumber: "UID"
    gidNumber: 100
    homeDirectory: /home/"username"
    mail: "user email address"

Submit the modification file as the LDAP admin with `ldapadd -x -D cn=admin,dc=ceres,dc=lps,dc=u-psud,dc=fr -W -f <modif file>`. You will need the LDAP administrator password.

Refresh the name service cache with `systemctl reload nscd`.

Enforce the quota for the new user on `/home` with `zfs set userquota@"username"=250G homedirs`.

Create the home directory from `/etc/skel` with `cp -r /etc/skel /home/"username"`. Give ownership to the user with `chown -R "username":users /home/"username"` and set permissions with `chmod -R 700 /home/"username"`.

Generate an SSH key for passwordless communications inside the cluster with `sudo -u "username" ssh-keygen -t ed25519 -N "" -f /home/"username"/.ssh/id_ed25519_cluster`. The name of the key **must** be `id_ed25519_cluster` for the SSH configuration in `/etc/ssh/ssh_config.d/internal_cluster_network.conf` to work properly.

Copy the generated public key to the authorized keys to allow public key authentication inside the cluster: `sudo -u "username" cat /home/"username"/.ssh/id_ed25519_cluster.pub >> /home/"username"/.ssh/authorized_keys`. `sudo` only applies to the command; the output redirection is performed as root. Thus, the ownership of the file needs to be changed with `chown <username>:users /home/<username>/.ssh/authorized_keys`, otherwise users won't be able to add their own keys to it. Since home directories are shared by all machines, this is enough for all nodes to know the public key and no `ssh-copy-id` is necessary.

**Remember to add the user to the cluster-user mailing list. This is not (yet) done automatically.**

### User deletion

**These steps are performed by the script `delete_user.sh`.**

Remove the LDAP entry corresponding to the user with `ldapdelete -x -D cn=admin,dc=ceres,dc=lps,dc=u-psud,dc=fr -W uid="username",ou=people,dc=ceres,dc=lps,dc=u-psud,dc=fr`. You will need the LDAP administrator password.

Refresh the name service cache with `systemctl reload nscd`.

Remove the home directory with `rm -r /home/"username"`. The script asks before deleting it.

**Remember to remove the user from the cluster-user mailing list. This is not (yet) done automatically.**
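Whether you are creating or deleting a user, the current state of the directory can be checked with a standard `ldapsearch` query. The example below assumes anonymous reads are allowed on the server (adapt the bind options otherwise):

    # Print the user's LDAP entry, if it exists (no output means no such entry).
    ldapsearch -x -LLL -b ou=people,dc=ceres,dc=lps,dc=u-psud,dc=fr uid="username"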
### Group creation

**These steps are performed by the script `create_group.sh`.**

Find the next available group ID, above 2000. `getent group` lists the groups currently known to the system (LDAP and local).

Write an LDAP modification file for the new group entry:

    dn: cn="group name",ou=groups,dc=ceres,dc=lps,dc=u-psud,dc=fr
    objectClass: posixGroup
    cn: "group name"
    gidNumber: "group ID"

Submit the modification with `ldapadd -x -D cn=admin,dc=ceres,dc=lps,dc=u-psud,dc=fr -W -f "ldif file"`. You will need the LDAP administrator password.

Refresh the name service cache with `systemctl reload nscd`.

### Adding user to group

**These steps are performed by the script `add_to_group.sh`.**

Write an LDAP modification file:

    dn: cn="group name",ou=groups,dc=ceres,dc=lps,dc=u-psud,dc=fr
    changetype: modify
    add: memberUid
    memberUid: "username"

Confusingly, the `memberUid` field contains the username and not the UID.

Submit the modification with `ldapmodify -x -D cn=admin,dc=ceres,dc=lps,dc=u-psud,dc=fr -W -f "ldif file"`. You will need the LDAP administrator password.

Refresh the name service cache with `systemctl reload nscd`.

For performance reasons, Slurm keeps a cache of the UIDs allowed to access each partition with group-based access. The cache is updated every 10 minutes. To refresh the cache and propagate the group membership to Slurm immediately, run `scontrol reconfigure`.

### Removing user from group

**These steps are performed by the script `remove_from_group.sh`.**

Write an LDAP modification file:

    dn: cn="group name",ou=groups,dc=ceres,dc=lps,dc=u-psud,dc=fr
    changetype: modify
    delete: memberUid
    memberUid: "username"

Confusingly, the `memberUid` field contains the username and not the UID.

Submit the modification with `ldapmodify -x -D cn=admin,dc=ceres,dc=lps,dc=u-psud,dc=fr -W -f "ldif file"`. You will need the LDAP administrator password.

Refresh the name service cache with `systemctl reload nscd`.

### Group deletion

**These steps are performed by the script `delete_group.sh`.**

Simply delete the group entry in LDAP with `ldapdelete -x -D cn=admin,dc=ceres,dc=lps,dc=u-psud,dc=fr -W cn="group name",ou=groups,dc=ceres,dc=lps,dc=u-psud,dc=fr`. You will need the LDAP administrator password.

Refresh the name service cache with `systemctl reload nscd`.
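Once the cache has been refreshed, group changes should be visible through NSS on the head and, since the database is shared, on the nodes as well. A quick way to check membership:

    # List the members of a group as seen by the system (LDAP included).
    getent group "group name"

    # List all the groups a user currently belongs to.
    id "username"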
## Access

### Fail2ban

Fail2ban is running on the machine to prevent brute-force attacks on SSH passwords. If someone enters a bad password more than 5 times in a 10-minute window, their IP address is banned for the next 10 minutes.

The status of the jail can be checked by privileged users with

    fail2ban-client status sshd

Fail2ban logs can be found in `/var/log/fail2ban.log`.

To unban an IP before the ban expires, run

    fail2ban-client set sshd unbanip IP_ADDRESS_TO_UNBAN

## Storage quotas

The purpose of quotas is to prevent disk filling by a faulty job, and to encourage users not to use the head as a storage space (groups should have their storage on some other machine). By default, a quota of 200G is applied at user creation.

Quotas for users *on the `/home` directory* are updated with

    zfs set userquota@<username>=<value> homedirs

The full list of user quotas can be seen by administrators with

    zfs userspace homedirs

Users can see their quota with `zfs get userquota@<username>` and the currently used space with `zfs get userused@<username>`.

## Nodes management

All nodes in the cluster should boot over PXE. Nodes can be *registered* or *unknown*. Registered nodes are given a static IP address in the range `192.168.1.*` and a hostname by DHCP. Unknown nodes are given an automatic IP address in the range `192.168.3.*` and no hostname. Only registered nodes can be included in Slurm queues and used for calculations.

The nodes are named according to their physical location in the cluster. The name pattern is `node-x-yy-z`, where

* `x` is the position of the rack mount. It goes from 0 to 2. The leftmost rack mount, as seen from the hot aisle, has index 0.
* `yy` is the 2-digit vertical position of the node in the rack mount. It corresponds to the value displayed on the rack mount itself.
* Some nodes only occupy half a rack width. `z` is 1 if the node occupies the right half (as seen from the hot aisle) and 0 otherwise.

The list of registered nodes is maintained in the file `/var/opt/slurm/nodedb` on the head. It should be considered the **only source of truth** for the nodes list. Its content is used to generate the DHCP and DNS configurations. The DHCP configuration ensures that the nodes get a fixed IP address and the correct hostname. The DNS configuration ensures that each node name can be resolved in the whole cluster.

### Node registration

Make sure that the node BIOS is configured to perform PXE boot in priority.

Make sure that WakeOnLAN is enabled in the BIOS. This is usually found under some "Power management" section.

**Make sure that hyperthreading is disabled in the BIOS.** Hyperthreading is meant for desktop applications with many pauses in their processes. It is of no use in an HPC cluster, which typically runs programs that make use of all available CPU time. Hyperthreading settings can usually be found under the "Processor" section of the BIOS. It is sometimes called "Logical processors".

Collect the MAC address of the network interface connected to the internal cluster network. It can be obtained:

* From the BIOS.
* By connecting the node to the internal cluster network and booting it. The node will soon appear in the DHCP server logs in the `192.168.3.*` range, along with its MAC address. Sometimes, two MAC addresses can appear for a node; one is usually the IPMI MAC (not the one we want). In that case, the BIOS may be a more convenient way to get the correct MAC.

Run the script `register_node.sh`, which automatically performs the following tasks (a simplified sketch is shown below):

* Determine the next available IP address in the range `192.168.1.*`.
* Add a line `<MAC address> <IP address> <node name>` to `/var/opt/slurm/nodedb`.
* Update DHCP and DNS (see below).

When the node is registered, it can be added to Slurm.
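The sketch below illustrates the registration logic listed above. It is not the actual `register_node.sh`; in particular, the locations of the update scripts and the first host number are assumptions:

    #!/bin/bash
    # Sketch only: register a node by appending it to nodedb and regenerating
    # the DHCP/DNS configuration. The real register_node.sh may differ.
    set -euo pipefail

    NODEDB=/var/opt/slurm/nodedb
    MAC="$1"    # MAC address of the internal-network interface
    NAME="$2"   # node name, e.g. node-1-07-0

    # Smallest unused host number in 192.168.1.*, assuming node addresses start at .2
    next=2
    for n in $(awk '{ split($2, ip, "."); print ip[4] }' "$NODEDB" | sort -n); do
        if [ "$n" -eq "$next" ]; then
            next=$((next + 1))
        fi
    done
    IP="192.168.1.$next"

    # nodedb is the single source of truth: one "<MAC> <IP> <name>" line per node
    echo "$MAC $IP $NAME" >> "$NODEDB"

    # Regenerate DHCP and DNS configuration from nodedb (script paths are assumptions)
    bash /root/cluster_configuration/head/update_dhcp.sh
    bash /root/cluster_configuration/head/update_dns.sh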
### Node decommissioning

Remove the node from Slurm. Run the script `decommission_node.sh`, which automatically performs the following steps:

* Remove the line corresponding to the node in `/var/opt/slurm/nodedb`.
* Update DHCP and DNS (see below).

### DHCP update

The script `update_dhcp.sh` generates the file `/etc/dhcp/dhcpd.d/cluster_nodes.conf` from `/var/opt/slurm/nodedb`. For each entry in `/var/opt/slurm/nodedb`, it writes one host declaration of the form

    host "node name" {
        hardware ethernet "MAC address";
        fixed-address "IP address";
        option host-name ""node name"";
    }

and then restarts the DHCP server with `systemctl restart isc-dhcp-server`.

### DNS update

The script `update_dns.sh` generates the DNS zone file `/etc/bind/db.ceres.lps.u-psud.fr` from `/var/opt/slurm/nodedb`. For each entry in `/var/opt/slurm/nodedb`, it writes one DNS record of the form

    "node name" IN A "IP address"

and then restarts the DNS server with `systemctl restart bind9` and refreshes the hostname cache with `nscd --invalidate=hosts`.

### Rename a node

The simplest and cleanest way to rename a node is to decommission it and register it back with the new name.

## Slurm

The Slurm configuration file is `/opt/slurm-21.08.0/etc/slurm.conf`. For easier management, it includes `/opt/slurm-21.08.0/etc/slurm.d/nodes.conf` and `/opt/slurm-21.08.0/etc/slurm.d/partitions.conf`, which contain the node and partition (aka queue) declarations respectively.

### New node declaration

Only registered nodes can be put under Slurm control (see Node registration in the Nodes management section).

Declare the new registered node to Slurm by adding a line specifying its name and the available resources to `/opt/slurm-21.08.0/etc/slurm.d/nodes.conf`. The line is a space-separated list of `key=value` pairs, starting with `NodeName=<node name>`. It can be obtained automatically by running `slurmd -C` on the node (first output line). In order to allocate nodes with small memory first, add a parameter `Weight=xxx` to the line, with the same value as `RealMemory`.

Note that since we configured Slurm with the consumable resources plugin to allow for finer resource management, the `CPUs` keyword counts the number of hardware threads on the node and not the actual number of CPUs. The number of physical CPUs is the number of sockets.
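For illustration, a declaration built this way could look like the following; all values are made up and should be replaced by the ones reported by `slurmd -C`:

    # /opt/slurm-21.08.0/etc/slurm.d/nodes.conf
    NodeName=node-0-03-0 CPUs=32 Boards=1 SocketsPerBoard=2 CoresPerSocket=16 ThreadsPerCore=1 RealMemory=128000 Weight=128000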
It is good practice to reflect the physical order of the machines in the node declaration file, because Slurm will consider adjacent definitions as adjacent nodes, and preferably allocate jobs on contiguous arrays of nodes.

Adding or removing nodes requires a restart of `slurmctld` with `systemctl restart slurmctld`, and of all `slurmd` daemons running on the nodes. This can be safely performed on a live cluster because the Slurm daemons save the state of the queues and jobs to disk, so running jobs will be recovered and the restart is transparent.

If the node was booted before being declared to Slurm, then `slurmd` will have failed on the node, because it couldn't determine its name. Simply restart the daemon on the new node with `systemctl restart slurmd`.

Check that the node is available (idle) with `sinfo`.

### Node removal

If the node is currently running jobs, put it in "DRAIN" state to prevent further jobs from being allocated to it with `scontrol update NodeName=<node name> State=DRAIN Reason="Removal"`. The node will remain in "DRAINING" state until all jobs running on it terminate, at which point it will be in "DRAINED" state.

Remove the node from all partitions in `/opt/slurm-21.08.0/etc/slurm.d/partitions.conf`.

Remove the line corresponding to the node in `/opt/slurm-21.08.0/etc/slurm.d/nodes.conf`.

The node can now be safely decommissioned by following the steps in the Nodes management section.

### Node reboot

Nodes can be rebooted with `scontrol reboot [ASAP] <node list|ALL>`. If ASAP is specified, the selected nodes are put in DRAIN state. Otherwise, their state is only set to REBOOT, and they will reboot the next time they are idle.

### Partition declaration

Simply add a line in `/opt/slurm-21.08.0/etc/slurm.d/partitions.conf`. A valid partition declaration requires at least `PartitionName=<name>` at the beginning of the line and an item `Nodes=<node list>`. A list of possible options can be found in the "PARTITION CONFIGURATION" section of the `slurm.conf` manual page.

Reload the configuration with `scontrol reconfigure`. No daemon restart is required for partition updates.
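As an illustration, a partition whose access is restricted to one of the LDAP groups described earlier could be declared as follows; the partition name, node list and time limit are made up:

    # /opt/slurm-21.08.0/etc/slurm.d/partitions.conf
    PartitionName=bigmem Nodes=node-0-03-0,node-0-04-0 AllowGroups=bigmem MaxTime=7-00:00:00 State=UP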
### Partition removal

Make sure that the partition is not being used, otherwise running jobs will be killed. The partition can be drained with `scontrol update PartitionName=<name> State=DRAIN Reason="Removal"` if necessary.

Remove the line corresponding to the partition from `/opt/slurm-21.08.0/etc/slurm.d/partitions.conf`.

Reload the configuration with `scontrol reconfigure`. No daemon restart is required for partition updates.

### Nodes and partitions state

Node states can be altered with `scontrol update NodeName=<node name> State=<state>`.

* DRAIN prevents new jobs from being allocated to the node. The node will have state DRAINING until all jobs allocated to it terminate, at which point its state will switch to DRAINED.
* RESUME manually resumes a DOWN or DRAIN node. The node will remain DOWN if it is unreachable or registers with an invalid configuration, and go back to IDLE otherwise.

Likewise, the partition state can be modified with `scontrol update PartitionName=<name> State=<state>`.

* DRAIN prevents new jobs from being allocated to the partition.
* UP makes a DOWN or DRAIN partition available again.

### Power save

Slurm is configured to power off nodes after one hour in the IDLE state. Nodes are powered back up whenever a job is allocated to them. They can also be explicitly powered up by administrators by changing their state to POWER_UP with

    scontrol update NodeName=<node name> State=POWER_UP

Likewise, IDLE nodes can be manually powered off with

    scontrol update NodeName=<node name> State=POWER_DOWN

If a node is powered off or on by other means, Slurm will put it in DOWN state with reason "Node unexpectedly rebooted".

## Python package installation

Python3 is available for all cluster users through a virtualenv that is silently activated for all shells. The virtual environment directory is `/opt/python3-venv`, which is shared with the nodes and owned by user `ceres`. **All Python packages should be installed in the virtual environment as user ceres.**

To install a Python package for all users, run `pip install <package name>` as user ceres, with the virtual environment activated. If the virtual environment is activated, the variable `VIRTUAL_ENV` is set, and can be checked with `echo $VIRTUAL_ENV`.

The virtual environment is activated silently for all shells. If necessary, it can be deactivated by running `deactivate`. It can be activated again manually with `source /opt/python3-venv/bin/activate`. The command can be prefixed with `VIRTUAL_ENV_DISABLE_PROMPT=1` to leave the shell prompt unchanged in the virtualenv.

Users with specific development needs can create their own virtualenv in their home directory.

## Standard package updates / installations

### Apt updates on the head

The head can simply be updated with

    apt update
    apt upgrade

Some upgrades, such as kernel upgrades, require a reboot. Since the head hosts the NFS-shared homes for the nodes, it may only be rebooted when all nodes are idle, otherwise jobs writing to `/home` will block.

New packages can be installed from the Ubuntu repositories with `apt install <package name>`.

### Apt updates on the nodes

The node build script always downloads up-to-date packages from the Ubuntu repositories. Hence, to update packages on the nodes, simply rebuild the image.

To install new packages on the nodes, try adding them to the debootstrap include list at the beginning of the script. Since we bootstrap a minimal system, only core packages can be installed this way and cached in `rootcache.tar.zst`. If the debootstrap install fails, install the package by adding a line `chroot "$WORK_DIR" /usr/bin/apt install -y <package name>` later in the script, after the extra repositories have been added. Since nodes keep their whole root filesystem in memory, it is preferable to install only required packages to avoid bloating the image.

Remember to remove the cache before running the build script, otherwise the core packages won't be downloaded again.

    rm /root/cluster_config/nodes/rootcache.tar.zst
    <Edit build.sh>
    bash /root/cluster_config/nodes/build.sh
    bash /root/cluster_config/nodes/update.sh

Nodes will get the new image on their next reboot. See the Slurm section for a way of planning node reboots.

## Monitoring

Various metrics of the running cluster can be displayed by pointing a web browser to http://ceres.lps.universite-paris-saclay.fr:3000.

## Message of the day

The message of the day, displayed at login, can be edited in `/etc/motd` to communicate with users (warning about a planned downtime, for instance).