# Gutenberg demos, 2025 winter kickstart
## Intro 10:15 10 min
- What is a cluster?
- What's our problem?
- Language processing
  - What's an n-gram?
- Why is it appropriate for the cluster?
  - data-heavy: disk→processor transfer, not computation, is the expected bottleneck
  - the work splits easily into independent pieces
- How we do it now
- *demos*, not type-along right now.
- We only explain HPC-related things. Not in scope for the demo:
- connection: tutorials, ask in afternoon
- version control (other courses)
- command line usage (video, other tutorials)
- Type-along in afternoon.
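The n-gram concept in one minimal sketch (word-level bigrams here; illustrative only, not the course's `ngrams/count.py`):

```python
from collections import Counter

def ngrams(words, n=2):
    """Yield every run of n consecutive words."""
    return zip(*(words[i:] for i in range(n)))

words = "to be or not to be".split()
counts = Counter(ngrams(words, n=2))
print(counts[("to", "be")])  # 2: this bigram occurs twice
```

Counting pairs like this per book, then summing the per-book counters, is exactly the kind of work that splits cleanly across cluster jobs.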
## Set up our project 10:25 10 min
- Download data to local computer
- https://users.aalto.fi/~darstr1/public/Gutenberg-Fiction-first100.zip
- Create data storage place
- Explanation: storage places, why you have to care
- ls storage places
- mkdir
- Copy our data over
  - Explanation: Open OnDemand (OOD)
- Action:
- Create new directory
- Upload data via OOD interface
- Clone git repository
- Explanation: version control
  - Explanation: OOD terminal & the later connections session
- Action:
- ssh to triton
- cd to the directory
- git clone https://github.com/AaltoSciComp/hpc-examples.git
- cd hpc-examples
- Initial test: run code on login node
  - Explanation: login node vs cluster
  - Explanation: --help
  - Action:
    - python3 ngrams/count.py --help
- Explanation:
- arguments (how we tell our programs what to do)
- CodeRefinery course on scripting
- Action:
- python3 ngrams/count.py ../data/Gutenberg-Fiction-first100.zip
- Discuss how we use zipfile directly
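As discussed, `count.py` reads members straight out of the zip instead of unpacking 100 files onto disk first. A minimal sketch of that pattern with only the standard library (the function name and the assumption of UTF-8 text files are illustrative):

```python
import io
import zipfile

def iter_texts(zip_path):
    """Yield (name, text) for each file inside the archive, without extracting."""
    with zipfile.ZipFile(zip_path) as zf:
        for name in zf.namelist():
            with zf.open(name) as f:
                yield name, io.TextIOWrapper(f, encoding="utf-8", errors="replace").read()

# Usage on the demo data would look like:
# for name, text in iter_texts("../data/Gutenberg-Fiction-first100.zip"):
#     print(name, len(text.split()))
```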
## Run code on cluster 10:35 15 min
- Queue system:
- Explanation:
- Slurm & queue
- Interactive
- Explanation:
- srun
- --pty
- default resources (1 CPU)
- --time=00:05:00
- --mem=200M
- Explain output
- Action:
- srun --pty --time=00:05:00 --mem=200M python3 ngrams/count.py -n 2 -o ngrams-2.out ../data/Gutenberg-Fiction-first100.zip
- Add -n 2
- Add -o ngrams-2.out
- Explanation:
- pager
- Action:
- less ngrams-2.out
- Batch script
- Explanation:
- batch script
- Action:
- nano
  - Explanation:
- shebang
- #SBATCH lines
- Action:
- sbatch
- Explanation:
- slurm history
- Action:
- slurm queue
- slurm history 1hour
- Explanation: looking at output
- stdout output
- Action:
- less slurm-NNN.out
- Check timings
- slurm history
- seff
## Break 10:50
## Run code parallel 11:00 20 min
- Array jobs
- Explanation:
- array job
- Action:
- copy script to `run-ngrams-2-array.sh`
- `#SBATCH --array=0-9`
- `python3 ngrams/count.py --step=10 --start=$SLURM_ARRAY_TASK_ID -n 2 -o ngrams-2-array-$SLURM_ARRAY_TASK_ID.out ../data/Gutenberg-Fiction-first100.zip`
    - `sbatch run-ngrams-2-array.sh`
- Look at array job outputs
- Array data processing:
- Combine array job outputs to one
    - `python3 ngrams/combine.py ngrams-2-array-*.out -o ngrams-2-combined.out`
- `head ngrams-2.out`
- `head ngrams-2-combined.out`
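The `--start`/`--step` split above needs no coordination between array tasks: task *i* takes every *step*-th file beginning at index *i*, so the tasks cover everything with no overlap. The slicing logic, sketched (the file names and helper are illustrative; in `count.py` the flags do this selection internally):

```python
files = [f"book-{i:03d}.txt" for i in range(100)]

def my_share(files, start, step):
    """Files handled by one array task: every `step`-th file from `start`."""
    return files[start::step]

# With --array=0-9 and --step=10, each task gets 10 of the 100 files.
shares = [my_share(files, task_id, 10) for task_id in range(10)]
assert sum(len(s) for s in shares) == len(files)        # no gaps
assert len({f for s in shares for f in s}) == len(files)  # no overlaps
```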
- Multiprocessing:
- Explanation:
- Quick difference between array and multiprocessing
    - If your code uses multiprocessing, you need to tell it how many CPUs to use
    - and you need to ask the queue system for that many CPUs as well
  - Action:
    - Run single-core job: `srun --pty --time=00:05:00 --mem=2G python3 ngrams/count.py -n 2 --words -o ngrams-2.out ../data/Gutenberg-Fiction-first100.zip`
      - memory increased to `--mem=2G`, since `--words` needs more
    - Modify to run the 4-core version:
      - add `--cpus-per-task=4`
      - add `--threads=4`
      - change the script name to `count-multi.py`
- Explanation:
- Check run time. See it isn't worth it.
- Explain that this is *not* optimized at all.
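The array-vs-multiprocessing difference above, in one sketch: with multiprocessing the parallelism lives *inside* a single job, which is why the CPU count must be given both to Slurm (`--cpus-per-task`) and to the program itself. A toy pool-based word count (not the actual `count-multi.py`; names are illustrative):

```python
from collections import Counter
from multiprocessing import Pool

def count_words(text):
    """Per-worker task: count words in one document."""
    return Counter(text.split())

if __name__ == "__main__":
    docs = ["to be or not to be", "be quick", "or else"]
    n_workers = 2  # in a real job, read SLURM_CPUS_PER_TASK instead of hard-coding
    with Pool(n_workers) as pool:
        partial_counts = pool.map(count_words, docs)
    total = sum(partial_counts, Counter())
    print(total["be"])  # 3
```

Note the cost hinted at below: every document and every partial `Counter` crosses a process boundary, so this only pays off when the per-document work outweighs the transfer.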
## Looking at results and outputs 11:20 5 min
- Connecting to OOD: browse results with the file browser
## Speed considerations 11:25 10 min
- i/o speed: try zipfile vs reading from directories
- Result: zipfile slightly faster
- Is this true if the cache is warm?
- If we have to run many times, is it better to repack to an uncompressed format?
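One way to answer the questions above yourself is to time both access patterns on the same data. A rough sketch (helper names are illustrative; on the cluster's shared filesystem, the first cold-cache run is the interesting measurement):

```python
import time
import zipfile
from pathlib import Path

def read_all_from_zip(zip_path):
    """Read every member of the archive; return total bytes read."""
    with zipfile.ZipFile(zip_path) as zf:
        return sum(len(zf.read(name)) for name in zf.namelist())

def read_all_from_dir(dir_path):
    """Read every file in a directory; return total bytes read."""
    return sum(len(p.read_bytes()) for p in Path(dir_path).iterdir() if p.is_file())

def timed(fn, *args):
    """Run fn and report (result, wall-clock seconds)."""
    t0 = time.perf_counter()
    result = fn(*args)
    return result, time.perf_counter() - t0

# Usage: compare timed(read_all_from_zip, "data.zip") against
# timed(read_all_from_dir, "data/") on warm and cold caches.
```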
- Moving data between processes in multiprocessing
- It can't use shared memory, so all data has to be pickled and unpickled.
- Data transfer bandwidth!
- Saving data speed (json serialization/deserialization)
- Data transfer bandwidth!
- It also sorts data when saving
- This is a case to use a better file format that is faster to read/write
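The JSON points above can be checked directly: `json.dumps(..., sort_keys=True)` serializes to text *and* sorts the keys. A sketch comparing it against stdlib `pickle` as one example of a binary alternative (timings vary by data and machine; measure on your own output rather than trusting a toy benchmark):

```python
import json
import pickle
import time

counts = {f"word{i}": i for i in range(50_000)}

def timed(fn):
    t0 = time.perf_counter()
    out = fn()
    return out, time.perf_counter() - t0

js, t_json = timed(lambda: json.dumps(counts, sort_keys=True))
pk, t_pickle = timed(lambda: pickle.dumps(counts))

# Both round-trip losslessly; JSON additionally comes back key-sorted.
assert json.loads(js) == counts
assert pickle.loads(pk) == counts
assert list(json.loads(js)) == sorted(counts)

print(f"json: {t_json:.4f}s, pickle: {t_pickle:.4f}s")
```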
## What's next?