# Stanford Sherlock, Globus, and AMD Tutorial
There is very detailed documentation available: https://www.sherlock.stanford.edu/docs/. Help is available here: https://www.sherlock.stanford.edu/docs/user-guide/troubleshoot/.
## Getting into Sherlock
Official documentation: https://www.sherlock.stanford.edu/docs/getting-started/connecting/#login
Use ssh on the command line to access the Sherlock cluster:
ssh username@sherlock.stanford.edu
Use your SUNet password to log in; you will also be prompted for Duo two-factor authentication.
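To avoid retyping the full hostname every time, you can add an entry to ~/.ssh/config on your local machine. This is standard OpenSSH configuration rather than anything Sherlock-specific; the host alias "sherlock" and the connection-sharing options are my own choices, and ControlMaster just reuses one authenticated connection so you go through Duo less often:

# in ~/.ssh/config on your laptop; replace "username" with your SUNet ID
Host sherlock
    HostName sherlock.stanford.edu
    User username
    ControlMaster auto
    ControlPath ~/.ssh/cm-%r@%h-%p
    ControlPersist 1h

After this, ssh sherlock and scp test.txt sherlock: both work, and you only authenticate once per hour.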

## Sending Files
Official documentation: https://www.sherlock.stanford.edu/docs/storage/data-transfer/
Here is information about the filesystems: https://www.sherlock.stanford.edu/docs/storage/filesystems/?h=file. Your home directory has a 15GB quota, but you can put files in group storage or use Oak.
Send files from your local computer to the cluster using scp:
scp test.txt username@sherlock.stanford.edu:
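A few more scp patterns I find useful; the file and directory names are just placeholders, and all of these are run from your local machine:

scp -r my_project/ username@sherlock.stanford.edu:          # copy a whole directory to your home
scp data.csv username@sherlock.stanford.edu:some_subdir/    # copy into a subdirectory
scp username@sherlock.stanford.edu:results.txt .            # copy a file back from Sherlock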

### Globus
I use Globus (https://www.sherlock.stanford.edu/docs/storage/data-transfer/?h=globus#globus) to send files because it's more convenient: it handles large transfers better, and you don't have to log in every time you send files.
Log in at https://www.globus.org/ with your Stanford ID, then click File Manager on the left, search for the SRCC Sherlock collection, and navigate to /home/users/username/.

You can bookmark this collection.

To send files to and from your computer, you need to install Globus Connect Personal: https://www.globus.org/globus-connect-personal and select that as the other collection.
Then click "Start" to transfer files.

## Submitting Jobs
Official documentation: https://www.sherlock.stanford.edu/docs/user-guide/running-jobs/, https://www.sherlock.stanford.edu/docs/getting-started/submitting/
Debug on your local computer first, then use scp or Globus to send your Python (or other language) scripts to Sherlock.
#### Modules
module spider
is useful to see which modules are available and can be loaded without installing anything yourself.
For example, to see the available versions of Python, run
module spider python

To load the module, in this case Python 3.9.1, run
module load python/3.9.1
or equivalently
ml python/3.9.1
If the package you need is not available as a module, you can pip install it yourself.
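For example (the package name here is just an illustration), after loading Python you can install into your user site directory and check that it imports:

ml python/3.9.1
pip3 install --user scipy
python3 -c "import scipy; print(scipy.__version__)"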
### Interactive Session
You can run an "interactive session" using sh_dev with various options:
$ sh_dev -h
sh_dev: start an interactive shell on a compute node.
Usage: sh_dev [OPTIONS]
Optional arguments:
-c number of CPU cores to request (OpenMP/pthreads, default: 1)
-n number of tasks to request (MPI ranks, default: 1)
-N number of nodes to request (default: 1)
-m memory amount to request (default: 4GB)
-p partition to run the job in (default: dev)
-t time limit (default: 01:00:00)
-r allocate resources from the named reservation (default: none)
-J job name (default: sh_dev)
-q quality of service to request for the job (default: normal)
Note: the default partition only allows for limited amount of resources.
If you need more, your job will be rejected unless you specify an
alternative partition with -p.
This can be useful for debugging.
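For example, putting some of the options above together, something like this requests a larger interactive session (I have not tested these exact values, so treat them as a sketch):

sh_dev -c 4 -m 16GB -t 02:00:00
# once you are on the compute node:
ml python/3.9.1
python3 my_script.py        # my_script.py is a placeholder for your own code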
### Submitting Jobs
Usually, if you are running a job that will take a long time, you should submit it using a batch job with a .slurm file.
The .slurm file lists the resources you need and the commands to run; when you submit it, Slurm schedules the job on a compute node and runs it for you.
This is my slurm file for the PINN code:
1 #!/bin/bash
2 #
3 #SBATCH --partition=serc
4 #SBATCH --job-name=test1
5
6 #SBATCH --output=test_%A.out
7 #SBATCH --error=test_%A.err
8 #SBATCH --nodes=1
9 #SBATCH --ntasks=1
10 #SBATCH --cpus-per-task=2
11 #SBATCH --constraint=GPU_SKU:A100_SXM4
12 #SBATCH --time=2-00:00:00
13 #SBATCH --gpus=1
14 #SBATCH --mem-per-cpu=20G
15
16 module load python/3.9.0
17 module load py-tensorflow/2.10.0_py39
18 module load viz
19 module load py-matplotlib/3.4.2_py39
20
21 python3 pde_tester_multistage.py
To submit the batch job, run
sbatch job-file.slurm

The error/output files will be test_<job_number>.err and test_<job_number>.out (the names set by the --error and --output lines), and you can see the output of your job there. Files that your Python code creates or modifies will also appear as the job runs.
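While the job is queued or running, the standard Slurm commands are useful for keeping an eye on it (replace <job_number> with the number that sbatch printed):

squeue -u $USER                  # list your pending/running jobs
scontrol show job <job_number>   # detailed information about one job
tail -f test_<job_number>.out    # follow the output file as it is written
scancel <job_number>             # cancel the job if something went wrong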
#### My .slurm file, in more detail
Line 3
3 #SBATCH --partition=serc
selects the SERC partition, which can have shorter queue times since fewer people use it. Here is more information about partitions: https://www.sherlock.stanford.edu/docs/user-guide/running-jobs/?h=partitions#resource-requests
Line 4 gives the job name. Lines 8-14 specify the resources needed. Since I want to run TensorFlow on the GPU, I requested 1 GPU on line 13. I also wanted the A100 GPU, which is faster, so I added the constraint on line 11 (which I sometimes comment out). The trade-off is that the more resources you request, the longer you wait in the queue before your job runs.
Lines 16-19 load the modules I need. There may be a way to use .yml (conda-style) environments instead, but I have not tried it; a rough alternative using a Python virtual environment is sketched below. You can use module spider to see which versions of each module are available.
Line 21 actually runs the Python script.
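For completeness, here is a rough sketch of the virtual-environment alternative mentioned above. This is standard venv/pip usage rather than something I have set up on Sherlock myself, so the paths and package names are assumptions, and I do not know whether a pip-installed TensorFlow picks up the GPU the same way as the py-tensorflow module:

# one-time setup on a login/dev node
ml python/3.9.0
python3 -m venv ~/envs/pinn
source ~/envs/pinn/bin/activate
pip install tensorflow matplotlib

# then, in the .slurm file, the module load lines become:
ml python/3.9.0
source ~/envs/pinn/bin/activate
python3 pde_tester_multistage.py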
# SDSS AMD Cluster Tutorial
Email Brian: btempero@stanford.edu if you have issues or questions
## Connect/Login to AMD Cluster
Login using ssh:
ssh username@sdss-amd.stanford.edu
Use your SUNet password to log in. Duo authentication is not required, but you must be on the Stanford network to access the cluster.

If you are not at Stanford, you can use the Stanford VPN (download at https://uit.stanford.edu/service/vpn) which is Cisco AnyConnect on Mac. This will ask for Duo authentication.

## Sending Files
Each user has 500GB of storage. Please do not store any critical information on this cluster as it is not backed up.
Use scp to send files; Globus Connect Personal is installed but you can only send to a Globus endpoint (for example, Sherlock) and not to your personal computer.
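So to get results back to your laptop, run scp from the laptop and pull from the cluster (the filenames here are placeholders):

scp username@sdss-amd.stanford.edu:results.tar.gz .
scp -r username@sdss-amd.stanford.edu:my_project/outputs/ ./outputs/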

## Sending Jobs
The time limit for a job is 2 days. Use the command
sinfo
to see the slurm queues and the time limits.
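For example, restricting the output to the sdss partition (the TIMELIMIT column shows the maximum walltime):

sinfo -p sdss
squeue -p sdss     # see what is currently queued or running on that partition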

#### Modules
Brian: "If you want you can load anaconda (https://www.anaconda.com/) for your own personal virtual environment. This will help with your personalized needs. Remember that you will need to reload your conda environment once you get a compute node." -> I did not do this, but I am sure it is useful.
The cluster uses modules for some of the common tools. Here are some basic module commands:
module avail (to show what modules are available)
module load <ModuleName> (to load a module)
module list (to show what modules you have loaded)
module purge (to remove loaded modules)
If you want to install python modules/packages, use pip:
pip3 install --user <python module>
If you need an application installed, email Brian and he can get it installed. He responds very quickly.
#### Interactive Mode
The partition on sdss-amd is "sdss". Here is an example of how to allocate a compute node and work interactively with 3 GPUs (I believe this is the same idea as sh_dev on Sherlock, but I am not sure):
srun --partition=sdss --gres=gpu:3 --pty bash
This creates a new interactive "job" that you are logged into, so you can run commands in real time.

If I create a Python file (say, hello.py) containing
print("hello")
then I can run it directly on the compute node:
python3 hello.py
### Submitting Jobs
Here is an example of how to get to a compute node and run a job in batch mode. Create a job script named slurm-job.sh containing:
#!/usr/bin/env bash
#SBATCH -o slurm.sh.out
#SBATCH -p sdss
echo "In the directory: `pwd`"
echo "As the user: `whoami`"
echo "write this to a file" > analysis.output
sleep 60
Submit the job:
$ sbatch slurm-job.sh
Submitted batch job 106

The job's standard output then goes to slurm.sh.out, as specified by #SBATCH -o slurm.sh.out, and the echoed text ends up in analysis.output.

#### My Slurm File
I am not sure why, but I do not have to module load all of the modules I am using here.
This runs the same PINN job as on Sherlock:
#!/bin/bash
#
#SBATCH --partition=sdss
#SBATCH --job-name=multistage-job
#
#SBATCH --output=test_%A.out
#SBATCH --error=test_%A.err
#SBATCH --nodes=1
#SBATCH --ntasks=1
#SBATCH --cpus-per-task=2
#SBATCH --time=6:00:00
#SBATCH --gpus=1
#SBATCH --mem-per-cpu=20G
# python3 -m pip install tensorflow-rocm==2.12.0.560   (this line is probably unnecessary)
module load rocm/rocmtools
python3 vcurrent_pde_tester_multistage.py
ROCm is AMD's software stack for GPU computing, the AMD counterpart to NVIDIA's CUDA. I am not sure whether my code actually uses the GPU, but TensorFlow reports that it does; see the check below.
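One way to check (this is standard TensorFlow, not specific to this cluster) is to list the GPU devices TensorFlow can see; an empty list means it is running on the CPU only:

python3 -c "import tensorflow as tf; print(tf.config.list_physical_devices('GPU'))"

If rocm-smi is installed on the node, running it while the job is active will also show GPU utilization.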

### References
If you have not used a Slurm-based cluster before, you should familiarize yourself with the following documentation (only the "Slurm Users" section): https://slurm.schedmd.com/
Please look at the following sites for examples, tips, and tricks:
https://icme.stanford.edu/resources/hpc-compute-resources/icme-cluster
https://web.stanford.edu/group/farmshare/cgi-bin/wiki/index.php/Main_Page