# Solutions for hands-on HPC intro

----

## Connecting to Triton

https://scicomp.aalto.fi/triton/tut/connecting/#exercise

1. Connect to Triton. List your home directory and work directory `$WRKDIR`.

```bash
# Time [01:00]
ssh USERNAME@kosh.aalto.fi  # needed if not on the Aalto network
ssh triton.aalto.fi
ls          # list of home files here
cd $WRKDIR
ls          # most likely empty if you have nothing stored in your work folder
```

2. Check the uptime and load of the login node: `uptime` and `htop` (`q` to quit; if `htop` is not available, `top` works almost as well). What else can you learn about the node?

```bash
# Time [01:00]
uptime
# Output will look like:
# 10:02:22 up 133 days, 13:49, 88 users, load average: 4.72, 4.60, 4.34
htop  # displays a monitoring interface; press q to quit
```

3. Check what your default shell is with `echo $SHELL`. Go ahead and change your shell to bash if it isn't yet (see below). [01:00]

```bash
# Time [01:00]
echo $SHELL
ssh kosh.aalto.fi  # the shell must be changed on an Aalto shell server
chsh -s /bin/bash
exit
```

OPTIONAL

4. Test jupyter.triton.aalto.fi [03:00]
5. Test vdi.aalto.fi [05:00]

----

## Applications

https://scicomp.aalto.fi/triton/tut/applications/#exercises

1. Figure out how to use tensorflow (this is not a software problem, but a searching-the-documentation problem). Make it work well enough to start python and `import tensorflow`.

```python
# Time [05:00]
module load anaconda   # run in the shell
ipython                # this starts IPython; the rest is typed into it
import tensorflow
tensorflow.__version__
# Output will look like:
# Out[2]: '2.4.1'
```

2. Find the Applications page link above, and check the list for ways to find out if we already have your software installed. See if we have what you need, using any of the strategies on that list.

3. (optional) From the Applications page, find the Spack package list (warning: it's a very long page and takes a while to load). Does it have anything useful to you?

4. (optional) Discuss among your group what software you need, whether it's available, and how you might get it.

----

## Modules

https://scicomp.aalto.fi/triton/tut/modules/#exercises

Before each exercise, run `module purge` to clear all modules.

1. Run `module avail` and check what you see. Find a software package that has many different versions available, and load the oldest version. [01:00]

`module avail` lists everything; pick the lowest version number and load it with `module load name/version`.

2. `PATH` is an environment variable that lists the directories where programs are searched for. See its current value using `echo $PATH`. Then load a module such as py-gpaw and list what it loaded. Check the value of `PATH` again. Why is there so much stuff?

```bash
# Time [02:00]
echo $PATH  # you will see a list of folders in your $PATH
module load py-gpaw
module list  # list the modules it loaded
echo $PATH  # the list has grown
```

3. (Advanced) Same as number 2, but use `env | sort > filename` to store the environment variables, then swap to py-gpaw/1.3.0-openmpi-scalapack-python3. Do the same, and compare the two outputs using `diff`.

```bash
# Time [05:00]
env | sort > current_env.txt
module load py-gpaw/1.3.0-openmpi-scalapack-python3
env | sort > new_env.txt
diff current_env.txt new_env.txt
```

4. Load a module with many dependencies, such as r-ggplot2, and save it as a collection. Compare the time needed to load the module and the collection. (Does `time` not work? Change your shell to bash, see the previous tutorial.)

```bash
# Time [01:00]
time module load r-ggplot2
```
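The exercise also asks you to save the loaded modules as a collection and compare the load times. A minimal sketch, assuming Lmod's collection commands as available on Triton (the collection name `my-r` is an arbitrary choice):

```bash
time module load r-ggplot2  # first load: resolves and loads many dependencies
module save my-r            # save the currently loaded modules as a collection
module purge
time module restore my-r    # restoring the saved collection is typically faster
```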
5. (Advanced) Load openfoam-org/7-openmpi-metis. Use `which` to find where the executable blockMesh comes from, and then use `ldd` to find out what libraries it uses.

```bash
# Time [01:00]
module load openfoam-org/7-openmpi-metis
which blockMesh
ldd /share/apps/spack/envs/fgci-centos7-generic/software/openfoam-org/7/beuz5lf/platforms/linux64GccDPInt32RpathOpt/bin/blockMesh
```

----

## Data storage

https://scicomp.aalto.fi/triton/tut/storage/#exercises

No terminal commands needed (except for the optional `strace` exercise); the exercises are mostly discussion-based.

----

## Interactive

https://scicomp.aalto.fi/triton/tut/interactive/#exercises

1. The program hpc-examples/slurm/memory-hog.py uses up a lot of memory to do nothing. Let's play with it. It's run as follows: `python hpc-examples/slurm/memory-hog.py 50M`, where the last argument is however much memory you want to eat. You can use `--help` to see the options of the program.

```bash
cd $WRKDIR
mkdir hpcexercises
cd hpcexercises
git clone https://github.com/AaltoSciComp/hpc-examples.git
module load anaconda
python hpc-examples/slurm/memory-hog.py --help
```

a. Try running the program with 50M.

```bash
# Time [05:00]
# Note that we are on the login node now
python hpc-examples/slurm/memory-hog.py 50M
# Output will look like:
#
# Trying to hog 50000000 bytes of memory
# Using 8650752 bytes so far (allocated: 2)
# ...
# Using 75894784 bytes so far (allocated: 67108864)
```

b. Run the program with 50M under `srun`, then again with `srun --mem=500M`.

```bash
srun python hpc-examples/slurm/memory-hog.py 50M
srun --mem=500M python hpc-examples/slurm/memory-hog.py 50M
```

c. Increase the amount of memory the Python process tries to use (not the amount of memory Slurm allocates). How much memory can you use before the job fails?

```bash
# Time [03:00]
# Increasing:
srun --mem=500M python hpc-examples/slurm/memory-hog.py 1000M
# With the next one it will get killed:
srun --mem=500M python hpc-examples/slurm/memory-hog.py 10000M
```

d. Look at the job history using `slurm history` - can you see how much memory it actually used? Note that Slurm only measures memory every 60 seconds or so. To make the program last long enough for the memory use to be measured, give the `--sleep` option to the Python process, like this: `python hpc-examples/slurm/memory-hog.py 50M --sleep=60`, to keep it alive.

```bash
srun --mem=500M python hpc-examples/slurm/memory-hog.py 50M --sleep=60
```
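For the history check, a sketch using Triton's `slurm` wrapper, with plain `sacct` as an alternative (replace `<JOBID>` with the ID from your run; `MaxRSS` is the measured peak memory):

```bash
slurm history  # check the memory column for your job
# or query the accounting database directly:
sacct -j <JOBID> --format=JobID,JobName,MaxRSS,Elapsed
```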
2. The program hpc-examples/slurm/pi.py calculates pi using a simple stochastic algorithm. The program takes one positional argument: the number of trials. The `time` program lets you time any command, e.g. `time python x.py` prints the amount of time it takes.

a. Run the program, timing it with `time`, a few times, increasing the number of trials until it takes about 10 seconds: `time python hpc-examples/slurm/pi.py 500`, then 5000, then 50000, and so on.

```bash
# Time [05:00]
# Note that we are on the login node now
time python hpc-examples/slurm/pi.py 500
# ...
time python hpc-examples/slurm/pi.py 10000000
```

b. Add `srun` in front (`srun python ...`). Use the `seff <jobid>` command to see how much time the program took to run. (If you'd like to use the `time` command, you can run `srun --mem=<m> --time=<t> time python hpc-examples/slurm/pi.py <iters>`.)

```bash
# Time [05:00]
# Try changing the values of the mem and time options and see how the output of "seff" changes
srun --mem=500M --time=00:00:20 time python hpc-examples/slurm/pi.py 10000000
seff <JOBID>  # replace with the job ID from the previous command
```

c. Tell srun to use five CPUs (`-c 5`). Does it go any faster?

```bash
# Time [01:00]
srun --mem=500M --time=00:00:20 --cpus-per-task=5 time python hpc-examples/slurm/pi.py 10000000
seff <JOBID>  # replace with the job ID from the previous command
# Note the CPU efficiency now compared to last time!
```

d. Use the `--threads=5` option to the Python program to tell it to also use five threads: `... python .../pi.py --threads=5`.

```bash
# Time [01:00]
srun --mem=500M --time=00:00:20 --cpus-per-task=5 time python hpc-examples/slurm/pi.py 10000000 --threads=5
seff <JOBID>  # note the CPU efficiency now
```

e. Look at the job history using `slurm history` - can you see how much time each process used? What's the relation between TotalCPUTime and WallTime?

3. Check out some of these commands: `sinfo`, `sinfo -N`, `squeue`. Run `slurm job <jobid>` on some running job - does anything look interesting?

4. Run `scontrol show node csl1`. What is this? (csl1 is the name of a node on Triton - if you are not on Triton, look at the `sinfo -N` output and try one of those names.)

----

## Serial jobs

https://scicomp.aalto.fi/triton/tut/serial/#exercises

1. Submit a batch job that just runs hostname.

a. Set time to 1 hour and 15 minutes, memory to 500MB.
b. Change the job's name and output file.
c. Check the output. Does the printed hostname match the one given by `slurm history`/`sacct -u $USER`?

```bash
# Start a terminal editor like "nano"
nano myfirstjob.sh

# Insert these lines:
#!/bin/bash
#SBATCH --time=01:15:00
#SBATCH --mem=500M
#SBATCH --output=hello.out
#SBATCH --job-name="my first job!"
srun echo "Hello $USER! You are on node $HOSTNAME"
# save and quit nano with CTRL+X

# submit the job:
sbatch myfirstjob.sh
# check history
slurm h
```

2. Create a batch script which does nothing (or some pointless operation for a while), for example `sleep 300`. Check the queue to see when it starts running. Then, cancel the job. What output is produced?

```bash
# Start a terminal editor like "nano"
nano sleepyjob.sh

# Insert these lines:
#!/bin/bash
#SBATCH --time=01:00:00
#SBATCH --mem=50M
#SBATCH --output=sleepy.out
#SBATCH --job-name="I sleep!"
echo "I am going to sleep"
srun sleep 300
# save and quit nano with CTRL+X

# submit the job:
sbatch sleepyjob.sh
# check the queue
slurm q
# cancel the job
scancel JOBID  # get the job ID from the previous outputs
# check history
slurm h
# check output
cat sleepy.out
```

3. Create a slurm script that runs the following program: `for i in $(seq 30); do date; sleep 10; done`

a. Submit the job to the queue.
b. Log out from Triton. Log back in and use `slurm queue`/`squeue -u $USER` to check the job status.
c. Use `cat name_of_outputfile` to check the output periodically.
d. Cancel the job once you're finished.

```bash
# Start a terminal editor like "nano"
nano sleepyfor.sh

# Insert these lines:
#!/bin/bash
#SBATCH --time=01:00:00
#SBATCH --mem=50M
#SBATCH --output=sleepyfor.out
#SBATCH --job-name="Forloopsleep"
echo "I am going to sleep with for loops"
for i in $(seq 30); do
    date
    sleep 10
done
# save and quit nano with CTRL+X

# submit the job:
sbatch sleepyfor.sh
# check the queue
slurm q
# cancel the job
scancel JOBID  # get the job ID from the previous outputs
# check history
slurm h
# check output
cat sleepyfor.out
```

4. (Advanced) What happens if you submit a batch script with `bash` instead of `sbatch`? Does it appear to run? Does it use all the Slurm options?

Bash simply ignores the `#SBATCH` directives (they are comments to it) and runs each command sequentially on the login node, as if you had typed them yourself.

5. (Advanced) Create a batch script that runs in another language, using a different `#!` line. Does it run? What are some of the advantages and problems here?
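A possible solution sketch for exercise 5, assuming Python 3 is available on the node (the file name `pyjob.py` is arbitrary; submit it with `sbatch pyjob.py`):

```python
#!/usr/bin/env python3
#SBATCH --time=00:05:00
#SBATCH --mem=50M
#SBATCH --output=pyjob.out

# Slurm still reads the #SBATCH lines, because "#" starts a comment in Python too.
import platform
print("Hello from", platform.node())
```

It runs because Slurm executes the script through the interpreter named on the `#!` line. One advantage is writing the job logic directly in your language of choice; one problem is that you can no longer mix in shell commands such as `module load` before your program starts.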
----

## Monitoring

https://scicomp.aalto.fi/triton/tut/monitoring/#exercises

1.a

```bash
# Copy this into a text file named pi10e8.sh
#!/bin/bash
#SBATCH --time=00:10:00
#SBATCH --mem=50M
#SBATCH --output=%j.out
srun python hpc-examples/slurm/pi.py 100000000
# and then submit it with sbatch
```

1.b

```bash
# Copy this into a text file named pi10erange.sh
#!/bin/bash
#SBATCH --time=00:10:00
#SBATCH --mem=50M
#SBATCH --output=%j.out
srun python hpc-examples/slurm/pi.py 100
srun python hpc-examples/slurm/pi.py 1000
srun python hpc-examples/slurm/pi.py 10000
srun python hpc-examples/slurm/pi.py 100000
srun python hpc-examples/slurm/pi.py 1000000
srun python hpc-examples/slurm/pi.py 10000000
# and then submit it with sbatch
```

1.c

```bash
```

----

## Array jobs

https://scicomp.aalto.fi/triton/tut/array/#exercises

----

## GPU computing

https://scicomp.aalto.fi/triton/tut/gpu/#exercises

----

## Parallel computing

https://scicomp.aalto.fi/triton/tut/parallel/#exercises