# Stanford Sherlock, Globus, and AMD Tutorial

There is very detailed documentation available: https://www.sherlock.stanford.edu/docs/. Help is available here: https://www.sherlock.stanford.edu/docs/user-guide/troubleshoot/.

## Getting into Sherlock

Official documentation: https://www.sherlock.stanford.edu/docs/getting-started/connecting/#login

Use `ssh` on the command line to access the Sherlock cluster:

```bash
ssh username@sherlock.stanford.edu
```

Use your SUNet password to log in; it will require Duo authentication.

![](https://hackmd.io/_uploads/S1tnJF0R3.png)

## Sending Files

Official documentation: https://www.sherlock.stanford.edu/docs/storage/data-transfer/

Here is information about the filesystems: https://www.sherlock.stanford.edu/docs/storage/filesystems/?h=file. Your user account has 15 GB of storage, but you can put files in group storage or use Oak.

Send files from your local computer to the cluster using `scp`:

```bash
scp test.txt username@sherlock.stanford.edu:
```

![](https://hackmd.io/_uploads/B1KOeYC0h.png)

### Globus

I use Globus (https://www.sherlock.stanford.edu/docs/storage/data-transfer/?h=globus#globus) to send files because it's more convenient: it handles large transfers better, and you don't have to log in every time you send files.

Log in with your Stanford ID at https://www.globus.org/, then click File Manager on the left, add the SRCC Sherlock collection, and go to /home/users/username/.

![](https://hackmd.io/_uploads/BkBl-F0Rh.png)

You can bookmark this collection.

![](https://hackmd.io/_uploads/SyhQWKAA3.png)

To send files to and from your own computer, install Globus Connect Personal (https://www.globus.org/globus-connect-personal) and select it as the other collection. Then click "Start" to transfer files.

![](https://hackmd.io/_uploads/Hkv1fFRCh.png)

## Submitting Jobs

Official documentation: https://www.sherlock.stanford.edu/docs/user-guide/running-jobs/ and https://www.sherlock.stanford.edu/docs/getting-started/submitting/

You should debug on your local computer first, then use scp or Globus to send your Python (or whichever language you use) file to Sherlock.

#### Modules

`module spider` is useful for seeing which modules are available and can be loaded without installing anything yourself. For example, to see the different versions of Python, run

```bash
module spider python
```

![](https://hackmd.io/_uploads/SyktNK00h.png)

To load a module, in this case Python 3.9.1, run

```bash
module load python/3.9.1
```

or equivalently

```bash
ml python/3.9.1
```

If a package is not available as a module, you can `pip install` it.

### Interactive Session

You can run an "interactive session" using `sh_dev` with various options:

```
$ sh_dev -h
sh_dev: start an interactive shell on a compute node.

Usage: sh_dev [OPTIONS]
    Optional arguments:
        -c      number of CPU cores to request (OpenMP/pthreads, default: 1)
        -n      number of tasks to request (MPI ranks, default: 1)
        -N      number of nodes to request (default: 1)
        -m      memory amount to request (default: 4GB)
        -p      partition to run the job in (default: dev)
        -t      time limit (default: 01:00:00)
        -r      allocate resources from the named reservation (default: none)
        -J      job name (default: sh_dev)
        -q      quality of service to request for the job (default: normal)

Note: the default partition only allows for limited amount of resources. If you
      need more, your job will be rejected unless you specify an alternative
      partition with -p.
```

This can be useful for debugging.

### Batch Jobs

Usually, if you are running a job that will take a long time, you should submit it as a batch job with a .slurm file.
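To give the general shape first, here is a minimal sketch of a batch script. The job name, resource numbers, and `my_script.py` are placeholders to adapt, not anything specific to my setup:

```bash
#!/bin/bash
#SBATCH --job-name=my-job          # placeholder job name
#SBATCH --output=my-job_%A.out     # %A becomes the job number
#SBATCH --time=01:00:00            # wall-time limit (HH:MM:SS)
#SBATCH --ntasks=1
#SBATCH --cpus-per-task=1
#SBATCH --mem-per-cpu=4G

# Load whatever modules your script needs, then run it.
module load python/3.9.1
python3 my_script.py
```

Every `#SBATCH` line is a resource request or job setting; everything after them is just the shell commands that run on the compute node.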
You can write whatever commands you want into the slurm file, and Slurm will run them on a compute node for you. This is my slurm file for the PINN code (line numbers shown because I refer to them below):

```
 1  #!/bin/bash
 2  #
 3  #SBATCH --partition=serc
 4  #SBATCH --job-name=test1
 5
 6  #SBATCH --output=test_%A.out
 7  #SBATCH --error=test_%A.err
 8  #SBATCH --nodes=1
 9  #SBATCH --ntasks=1
10  #SBATCH --cpus-per-task=2
11  #SBATCH --constraint=GPU_SKU:A100_SXM4
12  #SBATCH --time=2-00:00:00
13  #SBATCH --gpus=1
14  #SBATCH --mem-per-cpu=20G
15
16  module load python/3.9.0
17  module load py-tensorflow/2.10.0_py39
18  module load viz
19  module load py-matplotlib/3.4.2_py39
20
21  python3 pde_tester_multistage.py
```

To submit the batch job, run

```bash
sbatch job-file.slurm
```

![](https://hackmd.io/_uploads/Skk5HKC0n.png)

The error/output files will be test_<job_number>.err and test_<job_number>.out (named by the --output and --error lines), and you can see the output of your job there. Your Python code can also create or modify files, and they will appear as the job runs.

#### My .slurm file, in more detail

Line 3, `#SBATCH --partition=serc`, selects the SERC partition, which can sometimes be faster since fewer people use it. Here is more information about partitions: https://www.sherlock.stanford.edu/docs/user-guide/running-jobs/?h=partitions#resource-requests

Line 4 gives the job name.

Lines 8-14 specify the resources needed. Since I want to run TensorFlow on the GPU, I requested 1 GPU on line 13. I also wanted to use the A100 GPU, which is faster, so I added line 11 (which I sometimes comment out). The trade-off is that the more resources you request, the longer you will wait in the queue before your job runs.

Lines 16-19 load the modules I need. I think there is a way to use .yml environments instead, but I don't know how. You can use `module spider` to see which versions of which modules are available.

Line 21 actually runs the Python script.

# SDSS AMD Cluster Tutorial

Email Brian (btempero@stanford.edu) if you have issues or questions.

## Connect/Login to AMD Cluster

Log in using ssh:

```bash
ssh username@sdss-amd.stanford.edu
```

Use your SUNet password to log in. Duo authentication is not required, but you have to be on the Stanford network to access the cluster.

![](https://hackmd.io/_uploads/BypFuNJ16.png)

If you are not at Stanford, you can use the Stanford VPN (download at https://uit.stanford.edu/service/vpn), which is Cisco AnyConnect on Mac. The VPN will ask for Duo authentication.

![](https://hackmd.io/_uploads/ByRruVyy6.png)

## Sending Files

Each user has 500 GB of storage. Please do not store any critical information on this cluster, as it is not backed up.

Use scp to send files; Globus Connect Personal is installed, but it can only transfer to another Globus endpoint (for example, Sherlock), not to your personal computer.

![](https://hackmd.io/_uploads/SknP5Eyk6.png)

## Sending Jobs

The time limit for a job is 2 days. Use the command `sinfo` to see the Slurm queues and their time limits.

![](https://hackmd.io/_uploads/r1NC5V11T.png)

#### Modules

Brian: "If you want you can load anaconda (https://www.anaconda.com/) for your own personal virtual environment. This will help with your personalized needs. Remember that you will need to reload your conda environment once you get a compute node." I did not do this, but I am sure it is useful; a rough sketch follows below.

The cluster uses modules for some of the common tools.
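Since I have not tried Brian's anaconda suggestion myself, treat this as an untested sketch: the module name `anaconda` and the environment name `myenv` are assumptions, so check `module avail` for what actually exists on the cluster.

```bash
# One-time setup on a login node (module name is an assumption; check `module avail`).
module load anaconda
conda create -y -n myenv python=3.9
conda activate myenv            # if this complains, `conda init bash` (then re-login) may be needed
pip install numpy               # install whatever packages you need inside the environment

# Once you are on a compute node (via srun or sbatch), reload the module and
# re-activate the environment, as Brian mentions.
module load anaconda
conda activate myenv
```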
Here are some basic module commands:

```bash
module avail              # show what modules are available
module load <ModuleName>  # load a module
module list               # show what modules you have loaded
module purge              # remove loaded modules
```

If you want to install Python modules/packages, use pip:

```bash
pip3 install --user <python module>
```

If you need an application installed, email Brian and he can get it installed. He responds very quickly.

#### Interactive Mode

The partition on sdss-amd is `sdss`. Here is an example of how to allocate a compute node and run interactively with 3 GPUs. I think this is the same idea as `sh_dev` on Sherlock, but I am not sure.

```bash
srun --partition=sdss --gres=gpu:3 --pty bash   # gives you 3 GPUs
```

This creates a new "job" whose shell you are in, so you can run commands in real time.

![](https://hackmd.io/_uploads/B1eJaV1JT.png)

If I create a Python file containing print("hello"), then I can run the file:

![](https://hackmd.io/_uploads/SyDIp41ya.png)

### Submitting Jobs

Here is an example of how to get to a compute node and run a job in batch mode. Create a job script named slurm-job.sh:

```bash
#!/usr/bin/env bash
#SBATCH -o slurm.sh.out
#SBATCH -p sdss
echo "In the directory: `pwd`"
echo "As the user: `whoami`"
echo "write this to a file" > analysis.output
sleep 60
```

Submit the job:

```
$ sbatch slurm-job.sh
Submitted batch job 106
```

![](https://hackmd.io/_uploads/HymQAVyy6.png)

Then the output is in slurm.sh.out, as specified by `#SBATCH -o slurm.sh.out`:

![](https://hackmd.io/_uploads/H1RD0V11a.png)

#### My Slurm File

I don't know why, but I don't have to `module load` all the modules I am using. This runs the same PINN job as on Sherlock:

```bash
#!/bin/bash
#
#SBATCH --partition=sdss
#SBATCH --job-name=multistage-job
#
#SBATCH --output=test_%A.out
#SBATCH --error=test_%A.err
#SBATCH --nodes=1
#SBATCH --ntasks=1
#SBATCH --cpus-per-task=2
#SBATCH --time=6:00:00
#SBATCH --gpus=1
#SBATCH --mem-per-cpu=20G

# python3 -m pip install tensorflow-rocm==2.12.0.560   (this line is probably unnecessary)
module load rocm/rocmtools

python3 vcurrent_pde_tester_multistage.py
```

ROCm is AMD's software stack for GPU computing (their counterpart to CUDA). I am not sure if my code is actually GPU-enabled or not, but the output says it is (see the quick check at the end of this page).

![](https://hackmd.io/_uploads/B1eV-Hykp.png)

### References

If you have not used a Slurm-based cluster before, you must familiarize yourself with the following documentation (only the "SLURM USERS" section): https://slurm.schedmd.com/

Please look at the following sites for examples, tips, and tricks:

- https://icme.stanford.edu/resources/hpc-compute-resources/icme-cluster
- https://web.stanford.edu/group/farmshare/cgi-bin/wiki/index.php/Main_Page
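Finally, following up on the ROCm note above: a quick way to check whether TensorFlow actually sees a GPU is to print the visible devices from inside a job. This is a generic TensorFlow check, not anything specific to this cluster:

```bash
# Run this inside an interactive session (srun ... --pty bash) or a batch job.
# An empty list means TensorFlow only sees the CPU.
python3 -c "import tensorflow as tf; print(tf.config.list_physical_devices('GPU'))"
```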