# User guide

This page contains guides for the usage of the cluster. If you cannot find answers to your questions here, you can send an email to cluster-admin.lps@universite-paris-saclay.fr. Modification or improvement requests can also be sent to the same address.

## Accessing the cluster

The cluster is only accessible from the internal lab network. You can enter the lab network by

* connecting an ethernet cable to your computer in the lab,
* starting a VPN from outside of the lab (ask IT to set up VPN access for you),
* jumping through a bridge machine.

### First connection

Ceres can be reached over SSH (Secure SHell) from the lab's network using

    ssh <your username>@ceres.lps.u-psud.fr

On the first connection attempt, the server will offer its public key to prove its identity. Ceres' public key has the following fingerprint:

    SHA256:nP5I0+w5KJZ7em4M3WExx6bE7I5YLWVvEWzHFN7H9uE (ED25519)

Check that the fingerprint displayed is the same as this one (this ensures that you are really connecting to Ceres, and not some rogue machine trying to impersonate it), and accept it. The server identity is now known and will be checked on every future SSH connection.

Once the key is accepted, you will be asked for your password. After entering the correct password, a welcome message is displayed: you are connected to Ceres.

### Key based authentication

To avoid typing your username and the server address every time, a section can be added on the client machine (ie. your computer) in `~/.ssh/config`. If the file does not exist already, create it.

~/.ssh/config:

    Host ceres
        Hostname ceres.lps.u-psud.fr
        User <your username>

Now you only need to type `ssh ceres` to connect to the cluster. Your password will still be required though.

To avoid typing the password, you can use public key authentication. A pair of cryptographic keys is then used to authenticate you (see https://en.wikipedia.org/wiki/Public-key_cryptography). Ceres only accepts `ed25519` type public keys. If you don't have a key pair already, you can generate one with:

    ssh-keygen -t ed25519

Accept the default location for the key, and protect it with a passphrase if you like. This defeats the purpose of saving password typing, but ensures a high level of security, since an attacker would then need both your private key and its passphrase to impersonate you.

Once your key pair is generated, send your public key to Ceres using

    ssh-copy-id ceres

You can now connect to the cluster by typing `ssh ceres` without being prompted for your password (if the key is not passphrase protected). Note that the generated private key `id_ed25519` grants access to your account on every machine to which you copied your public key, so it should be **kept secret**.

## Password change

To change your password, log in to the cluster and type `passwd`. Enter your old password, and the new one twice.

Remember that choosing a weak password may allow an attacker to enter the cluster and leverage vulnerabilities to gain higher privileges. Always use **strong** passwords (eg: 10 random characters) to secure not only your data, but everybody's.

## Data storage

The cluster should not be used for storage. Groups usually have a dedicated machine for data storage: discuss with your group to know how to use it.

Moreover, to prevent ill-configured jobs from filling up the disks, a quota of 200G is enforced for every user. You can view your currently used space with `zfs get userused@<user name>`. Contact administrators if you need a larger quota.
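For example, a quick check from a shell on Ceres (a minimal sketch: replace `<user name>` with your login; `userquota@` is the standard ZFS property holding the limit, mentioned here only as an aside):

    # Space currently counted against your 200G quota.
    zfs get userused@<user name>

    # The quota limit itself.
    zfs get userquota@<user name>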
## Slurm

### Overview and jargon

Ceres is a computing cluster, *ie.* a bunch of computers (called *nodes*) working together to run programs. Users can submit *jobs* to the cluster. Slurm is the program that takes care of allocating node resources to jobs, and running the jobs on the allocated resources. The nodes are separated into various *partitions* (*aka* queues) with different properties. A *job* is (most often) a bash script that runs *tasks* on the allocated resources.

In a computer, operations are performed by "the processor". To understand how computing resources are handled by Slurm, it is useful to look more closely at the structure of processors.

* *Sockets* are physical processor receptacles on the motherboard of the computer. One computer can have multiple processors, plugged into different sockets of the motherboard.
* One processor may contain several *cores*, which are relatively independent components performing operations in parallel. On modern hardware, essentially all processors are multi-core. Some nodes in the cluster carry processors with as many as 20 cores. In general, only one task should be allocated per core.

![](https://i.imgur.com/Y1rI5fv.png)

A job running on only one core is called *serial*. A job running on multiple cores at the same time is called *parallel*.

### List partitions and nodes

The partitions you have access to can be displayed with `sinfo`. The partition marked with a star `*` is the default one, to which jobs will be submitted if no partition is explicitly requested (with the `-p` option, see next section). For a node oriented summary, use `sinfo -Nl`. `snodes` is a convenience alias that prints the number of available, allocated and idle cores on each node.

Nodes are grouped into partitions: the partition *main* contains recent nodes and is the default; the partition *slow* contains older nodes.

Nodes can be in several states:

* Idle: all resources on the node are free and can be allocated to jobs.
* Mix: some resources on the node are free and can be allocated to jobs.
* Alloc: all resources on the node are currently used by job(s).
* Down: the node cannot be used.
* Drain: no new jobs can be allocated to that node; already running jobs keep running. Nodes can be put in that state by administrators to prepare removal or update of the node.
* Reboot: a reboot request is pending on the node. The node will reboot when it becomes idle.

A `~` following a node state in `sinfo` output means that the node is currently powered down. In order to save power, Ceres automatically powers off nodes that have been idle for more than one hour. When resources are allocated on powered-off nodes, they are automatically powered up. This can take a couple of minutes, just be patient :).

To get detailed information about a node, including its hardware configuration, run `scontrol show node <nodename>`.

### Submit a job

To submit a job, run

    sbatch [OPTIONS] <job script>

Available options are listed in the man page of `sbatch`. A few useful ones:

General options:

* `-J jobname`: specify the name of the job. If not specified, the script name is used.
* `-o outputfile`, `-e outputfile`: by default, standard output and error of the jobs are redirected to `slurm-%j.out`, with `%j` the job id. These two options allow alternative files to be specified for output and error respectively. See the "filename pattern" section of `man sbatch` for the full list of available placeholders.
* `-p partition`: specify the partition (*ie.* queue) to which the job is submitted. If not specified, the job is submitted to the default partition, indicated by a star `*` in `sinfo`. If no default partition is configured, this option is mandatory.

Resource selection options:

* `--mem size`: requires that allocated nodes have at least the specified memory.
* `--mem-per-cpu`: requires that allocated nodes have at least the specified amount of memory available per allocated cpu.
* `--mincpus n`: specifies a minimum number of logical cpus/processors per node.
* `-w nodelist`: specify a list of nodes on which to allocate resources. **It is best to specify the node features (number of cores, memory...) required by your job and let Slurm decide on which nodes to allocate resources.** If the nodes selected with this option are not sufficient to satisfy the requested resources (eg: number of cores), additional nodes are used. If the requested resources are present on the requested nodes but are already in use, the job stays pending until the resources are freed. If not specified, Slurm allocates the job to any node(s) satisfying the job requirements.
* `-N nodecount`: specify the number of nodes to allocate to the job. This number cannot be exceeded, so it can be used in conjunction with `-w` to make sure that only specific nodes will be used.

Parallel job options:

* `-n ntasks`: specify the number of tasks the job will need. Slurm will allocate the right number of cores to satisfy the requirement. If not specified, resources for one task are allocated. If a number of nodes is requested with `-w` or `-N`, then one task is allocated on each node by default.
* `-c ncpus`: specify the number of cores to be allocated for each task. If not specified, one core will be allocated per task.

You can ask to receive an email to follow the progress of your job with `--mail-type ALL --mail-user <your email address>`.

All options can also be hardcoded in the batch script by putting lines with the format `#SBATCH [option]` at the top of the file, below the shebang `#!/bin/bash`. If an option is specified both on the command line and in the script, the command line value takes precedence.

Some example job scripts are available in your home directory, in `~/slurm_examples`.

### List submitted jobs

Submitted jobs can be viewed with `squeue`. `squeue -u <username>` only displays jobs from the specified user.

Jobs can be in several states, including

* `PD`: pending, the job is awaiting resource allocation.
* `R`: running.
* `CF`: configuring, the job has been allocated resources on powered-off nodes which are currently booting. The job will start as soon as all allocated nodes are up. This can take a couple of minutes.

Detailed information about a job can be obtained with `scontrol show job <job id>`.

### Serial job example

The following script runs a serial program.

slurm_examples/serial.sh:

    #!/bin/bash
    #
    #SBATCH -J serialtest

    echo "Test serial job"
    echo "Running on $(hostname)"

    for i in 1 2 3
    do
        date
        sleep 10s
    done

    echo "Finished !"

It can be submitted with

    sbatch serial.sh

With no option, the job will be allocated one core on an available node of the default partition.

### Parallel job example

Parallel programs use specific mechanisms to distribute their calculations over several processes running at the same time. There are two main tools used to write parallel programs: MPI and OpenMP.

MPI distributes work across multiple communicating processes (called *tasks*). MPI is very flexible and allows tasks to run on different machines. MPI programs are usually started with the `mpirun` command.
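To give a rough idea (a sketch only, not specific to Ceres; `./my_mpi_program` is a placeholder for your own executable), an MPI run of four communicating processes launched by hand would look like:

    # Start 4 MPI processes of a hypothetical program; the MPI library
    # lets these processes exchange messages with each other.
    mpirun -np 4 ./my_mpi_program

Inside job scripts on Ceres, `srun` (described below) should be used instead of `mpirun`.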
OpenMP distributes work across *threads*. Threads are lighter than processes, but also more restrictive: all threads essentially need to run on the same machine. The number of threads used by a parallel program using OpenMP is controlled by the environment variable `OMP_NUM_THREADS`.

Using `srun` is the recommended way of running parallel programs in a job. It creates a *job step* which is under Slurm's control, allowing the fine hardware allocation policies configured in Slurm to be enforced. The parallelisation is specified with the `-n` and `-c` options, which can either be provided to `srun` in the job script or directly at job submission to `sbatch`.

The option `-n <ntasks>` specifies the number of MPI tasks to spawn. If the number of tasks is not specified to `sbatch`, then 1 task is implied. If the `-n` option is not provided to `srun` inside a job script, the value specified in the `sbatch` call is the default. One MPI task occupies one core by default.

The option `-c <nthreads>` specifies the number of cores to be allocated for each task. It allows extra cores to be allocated, on which multiple threads of an OpenMP program can run. If the number of threads is not specified to `sbatch`, then only one is used per MPI task (see `-n`). One OpenMP thread occupies one core.

Some programs support both MPI and OpenMP parallelism. They can be run with both `-n` and `-c` options. This will spawn $ntasks$ tasks, each using $nthreads$ threads. Thus, the job will need at least $ntasks \times nthreads$ cores to run.

The following script runs a parallel job that prints the hostname from all allocated cores, and tests that all spawned processes are able to communicate with each other.

slurm_examples/parallel.sh:

    #!/bin/bash

    sleep 10s
    srun hostname
    sleep 10s
    srun ./connectivity_c

`hostname` is a serial program, so calling it with `srun` simply runs copies of it on each allocated core. `connectivity_c` however is a parallel program: it runs interacting processes on each core.

The job can be submitted with

    sbatch -n <ntasks> parallel.sh

The following script runs a LAMMPS simulation that supports both MPI and OpenMP parallelisation.

lammps.sh:

    #!/bin/bash

    srun lmp -in in.melt

It can be submitted with

    sbatch -n 4 -c 2 lammps.sh

This will run LAMMPS on 4 MPI tasks, each using 2 threads. The environment variable `OMP_NUM_THREADS` is automatically set to the number of CPUs per task specified with `-c`. Note that the number of MPI tasks and CPUs per task could also be specified at the job step level by supplying options to `srun`.

The closer the cores running the processes, the faster the interprocess communication. Processes on the same socket communicate much faster than processes on the same node but on different sockets. Processes on the same node communicate much, much faster than processes on different nodes. In practice, you should **not** run communication intensive programs on more than one node, since the communication time between processes will explode and result in a very inefficient use of cluster resources. Attempting to spread threads of an OpenMP program across different nodes will fail.

### Scratch storage

Jobs can use local storage on nodes, located at `/scratch/job-<jobID>`. This can speed up IO intensive jobs, because read/write operations on the network-shared `/home/<username>` are typically slower than local disk access. The scratch storage is created at job allocation and destroyed when the job terminates, so all produced data need to be copied back to `/home` within the job script.
The following script runs a serial job and uses the local scratch storage.

slurm_examples/scratch.sh:

    #!/bin/bash

    echo "I am an IO intensive job !"
    echo "Let's use scratch"

    sleep 20s

    # Generate 10MB of zeroes...
    dd if=/dev/zero of="/scratch/job-$SLURM_JOB_ID/bigfile.dat" bs=1M count=10

    # Copy the file from scratch to home when the work is done.
    cp "/scratch/job-$SLURM_JOB_ID/bigfile.dat" .

The scratch space is local to each node. Therefore, if you run a parallel job on more than one node, you should expect problems (eg: output scattered in different local scratch storages).

### Interactive job

Slurm can allocate resources for an interactive shell. Commands run in the shell will be executed on the allocated resources. This can be useful to debug a program interactively while having it running on the cluster.

To allocate resources for a shell, run

    salloc [OPTIONS]

The options are essentially the same as for `sbatch` (`-n`, `-c`, `-N`, `-J`...). After the allocation is granted, the prompt is returned and commands can be run interactively on the allocated resources.

When you are done with your job, exit the shell by typing `exit` or hitting `Ctrl-D`. This will free the allocated resources for other uses.

## Available software

Some standard scientific software is available on the cluster. If you need a specific program that you think could be of general use, please contact administrators to add it system-wide. More specific software should however be installed locally by users in their home directory. Ask administrators for help if you don't know how to do it.

### Python

Python 3 is available for all users. A few standard packages are installed as well. If you need a package that is not installed, please contact administrators.

#### Details

Python is provided via a virtual environment silently activated for all users. Should you need to, it can be deactivated with `deactivate`. Note that by doing so, you will not have access to the installed packages anymore. To reactivate the virtual environment, log off and in again.

Users with very specific needs and sufficient know-how can always create their own local `virtualenv` and do all they want there. Anaconda or Miniconda can also be used instead of virtualenv. To activate a conda environment in a Slurm job, use `source /full/path/to/your/conda/install/bin/activate yourenv` instead of `conda activate yourenv`.

### Intel compilers

Intel compilers are available, along with many other Intel tools, under `/opt/intel`.

### FFTW

FFTW 3.3.10 is installed in `/opt/fftw-3.3.10`, with threading and vector instruction support.

### LAMMPS

The parallel molecular dynamics program LAMMPS is installed system-wide and can be called as `lmp`. It includes most optional packages, including accelerator packages for OpenMP threading and Intel specific optimisations.
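For example, a job script using the OpenMP accelerator styles could look like the following sketch (`lammps_omp.sh` is a hypothetical file name; `-sf omp` is the standard LAMMPS suffix flag selecting the OpenMP-accelerated styles, and `in.melt` is the input file from the parallel job example above):

    #!/bin/bash
    #
    #SBATCH -J lammps-omp

    # Run LAMMPS with its OpenMP-accelerated styles (-sf omp).
    # OMP_NUM_THREADS is set automatically from the -c option given to sbatch.
    srun lmp -sf omp -in in.melt

Submitted with `sbatch -n 4 -c 2 lammps_omp.sh`, this would run 4 MPI tasks with 2 OpenMP threads each, as in the parallel job example.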