# Gutenberg demos, 2025 winter kickstart
## Intro 10:15 10 min
- What is a cluster?
- What's our problem?
- Language processing
  - What's an n-gram?
- Why is it appropriate for the cluster?
  - data-heavy: disk→processor transfer, not computation, is the expected bottleneck
  - the work splits easily into independent pieces
- How we do it now
- *demos*, not type-along right now.
- We only explain HPC-related things. Not in scope for the demo:
- connection: tutorials, ask in afternoon
- version control (other courses)
- command line usage (video, other tutorials)
- Type-along in afternoon.
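The n-gram concept in one minimal sketch (word-level bigrams here; illustrative only, not the course's `ngrams/count.py`):

```python
from collections import Counter

def ngrams(words, n=2):
    """Yield every run of n consecutive words."""
    return zip(*(words[i:] for i in range(n)))

words = "to be or not to be".split()
counts = Counter(ngrams(words, n=2))
print(counts[("to", "be")])  # 2: this bigram occurs twice
```

Counting pairs like this per book, then summing the per-book counters, is exactly the kind of work that splits cleanly across cluster jobs.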
## Set up our project 10:25 10 min
- Download data to local computer
- https://users.aalto.fi/~darstr1/public/Gutenberg-Fiction-first100.zip
- Create data storage place
- Explanation: storage places, why you have to care
- ls storage places
- mkdir
- Copy our data over
  - Explanation: Open OnDemand (OOD)
- Action:
- Create new directory
- Upload data via OOD interface
- Clone git repository
- Explanation: version control
  - Explanation: OOD terminal & the later connections session
- Action:
- ssh to triton
- cd to the directory
- git clone https://github.com/AaltoSciComp/hpc-examples.git
- cd hpc-examples
- Initial test: run code on login node
  - Explanation: login node vs cluster
  - Explanation: --help
  - Action:
    - python3 ngrams/count.py --help
- Explanation:
- arguments (how we tell our programs what to do)
- CodeRefinery course on scripting
- Action:
- python3 ngrams/count.py ../data/Gutenberg-Fiction-first100.zip
- Discuss how we use zipfile directly
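As discussed, `count.py` reads members straight out of the zip instead of unpacking 100 files onto disk first. A minimal sketch of that pattern with only the standard library (the function name and the assumption of UTF-8 text files are illustrative):

```python
import io
import zipfile

def iter_texts(zip_path):
    """Yield (name, text) for each file inside the archive, without extracting."""
    with zipfile.ZipFile(zip_path) as zf:
        for name in zf.namelist():
            with zf.open(name) as f:
                yield name, io.TextIOWrapper(f, encoding="utf-8", errors="replace").read()

# Usage on the demo data would look like:
# for name, text in iter_texts("../data/Gutenberg-Fiction-first100.zip"):
#     print(name, len(text.split()))
```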
## Run code on cluster 10:35 15 min
- Queue system:
- Explanation:
- Slurm & queue
- Interactive
- Explanation:
- srun
- --pty
- default resources (1 CPU)
- --time=00:05:00
- --mem=200M
- Explain output
- Action:
- srun --pty --time=00:05:00 --mem=200M python3 ngrams/count.py -n 2 -o ngrams-2.out ../data/Gutenberg-Fiction-first100.zip
- Add -n 2
- Add -o ngrams-2.out
- Explanation:
- pager
- Action:
- less ngrams-2.out
- Batch script
- Explanation:
- batch script
- Action:
- nano
  - Explanation:
- shebang
- #SBATCH lines
- Action:
- sbatch
- Explanation:
- slurm history
- Action:
- slurm queue
- slurm history 1hour
- Explanation: looking at output
- stdout output
- Action:
- less slurm-NNN.out
- Check timings
- slurm history
- seff
## Break 10:50
## Run code parallel 11:00 20 min
- Array jobs
- Explanation:
- array job
- Action:
- copy script to `run-ngrams-2-array.sh`
- `#SBATCH --array=0-9`
- `python3 ngrams/count.py --step=10 --start=$SLURM_ARRAY_TASK_ID -n 2 -o ngrams-2-array-$SLURM_ARRAY_TASK_ID.out ../data/Gutenberg-Fiction-first100.zip`
    - `sbatch run-ngrams-2-array.sh`
- Look at array job outputs
- Array data processing:
- Combine array job outputs to one
    - `python3 ngrams/combine.py ngrams-2-array-*.out -o ngrams-2-combined.out`
- `head ngrams-2.out`
- `head ngrams-2-combined.out`
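The `--start`/`--step` split above needs no coordination between array tasks: task *i* takes every *step*-th file beginning at index *i*, so the tasks cover everything with no overlap. The slicing logic, sketched (the file names and helper are illustrative; in `count.py` the flags do this selection internally):

```python
files = [f"book-{i:03d}.txt" for i in range(100)]

def my_share(files, start, step):
    """Files handled by one array task: every `step`-th file from `start`."""
    return files[start::step]

# With --array=0-9 and --step=10, each task gets 10 of the 100 files.
shares = [my_share(files, task_id, 10) for task_id in range(10)]
assert sum(len(s) for s in shares) == len(files)        # no gaps
assert len({f for s in shares for f in s}) == len(files)  # no overlaps
```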
- Multiprocessing:
- Explanation:
- Quick difference between array and multiprocessing
    - If your code uses multiprocessing, you need to tell it how many CPUs to use
    - and you need to ask the queue system for that many CPUs as well
  - Action:
    - Run single-core job: `srun --pty --time=00:05:00 --mem=2G python3 ngrams/count.py -n 2 --words -o ngrams-2.out ../data/Gutenberg-Fiction-first100.zip`
      - memory increased to `--mem=2G`, since `--words` needs more
    - Modify to run the 4-core version:
      - add `--cpus-per-task=4`
      - add `--threads=4`
      - change the script name to `count-multi.py`
- Explanation:
- Check run time. See it isn't worth it.
- Explain that this is *not* optimized at all.
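The array-vs-multiprocessing difference above, in one sketch: with multiprocessing the parallelism lives *inside* a single job, which is why the CPU count must be given both to Slurm (`--cpus-per-task`) and to the program itself. A toy pool-based word count (not the actual `count-multi.py`; names are illustrative):

```python
from collections import Counter
from multiprocessing import Pool

def count_words(text):
    """Per-worker task: count words in one document."""
    return Counter(text.split())

if __name__ == "__main__":
    docs = ["to be or not to be", "be quick", "or else"]
    n_workers = 2  # in a real job, read SLURM_CPUS_PER_TASK instead of hard-coding
    with Pool(n_workers) as pool:
        partial_counts = pool.map(count_words, docs)
    total = sum(partial_counts, Counter())
    print(total["be"])  # 3
```

Note the cost hinted at below: every document and every partial `Counter` crosses a process boundary, so this only pays off when the per-document work outweighs the transfer.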
## Looking at results and outputs 11:20 5 min
- Connecting to OOD: browse results with the file browser
## Speed considerations 11:25 10 min
- i/o speed: try zipfile vs reading from directories
- Result: zipfile slightly faster
- Is this true if the cache is warm?
- If we have to run many times, is it better to repack to an uncompressed format?
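One way to answer the questions above yourself is to time both access patterns on the same data. A rough sketch (helper names are illustrative; on the cluster's shared filesystem, the first cold-cache run is the interesting measurement):

```python
import time
import zipfile
from pathlib import Path

def read_all_from_zip(zip_path):
    """Read every member of the archive; return total bytes read."""
    with zipfile.ZipFile(zip_path) as zf:
        return sum(len(zf.read(name)) for name in zf.namelist())

def read_all_from_dir(dir_path):
    """Read every file in a directory; return total bytes read."""
    return sum(len(p.read_bytes()) for p in Path(dir_path).iterdir() if p.is_file())

def timed(fn, *args):
    """Run fn and report (result, wall-clock seconds)."""
    t0 = time.perf_counter()
    result = fn(*args)
    return result, time.perf_counter() - t0

# Usage: compare timed(read_all_from_zip, "data.zip") against
# timed(read_all_from_dir, "data/") on warm and cold caches.
```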
- Moving data between processes in multiprocessing
- It can't use shared memory, so all data has to be pickled and unpickled.
- Data transfer bandwidth!
- Saving data speed (json serialization/deserialization)
- Data transfer bandwidth!
- It also sorts data when saving
- This is a case to use a better file format that is faster to read/write
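The JSON points above can be checked directly: `json.dumps(..., sort_keys=True)` serializes to text *and* sorts the keys. A sketch comparing it against stdlib `pickle` as one example of a binary alternative (timings vary by data and machine; measure on your own output rather than trusting a toy benchmark):

```python
import json
import pickle
import time

counts = {f"word{i}": i for i in range(50_000)}

def timed(fn):
    t0 = time.perf_counter()
    out = fn()
    return out, time.perf_counter() - t0

js, t_json = timed(lambda: json.dumps(counts, sort_keys=True))
pk, t_pickle = timed(lambda: pickle.dumps(counts))

# Both round-trip losslessly; JSON additionally comes back key-sorted.
assert json.loads(js) == counts
assert pickle.loads(pk) == counts
assert list(json.loads(js)) == sorted(counts)

print(f"json: {t_json:.4f}s, pickle: {t_pickle:.4f}s")
```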
## What's next?