
# CASTIEL2 Multi-GPU AI Train the Trainer course
**Contents**
[TOC]
:::info
**Participating organizations and people**
- NCC Sweden (ENCCS):
    - Ashwin Mohanan (AM)
    - Yonglei Wang (YW)
    - Francesco Fiusco (FF)
- NCC Poland:
    - Klemens Noga (KN)
    - Additional people, to be discussed internally
- NCC Germany:
    - Maksym Deliyergiev (HLRS)
- NCC Belgium:
    - Geert Jan Bex (NCC Belgium/Vlaams Supercomputing Centrum)
- NCC Italy:
    - Domitilla Brandoni
    - Laura Cavalli
    - Michele Visciarelli
- NCC Montenegro:
    - Luka Filipovic (UDG)
    - Stevan Cakic (UDG)
    - Danilo Planinić (UCG)
- NCC Hungary:
    - Gyula Ujlaki
- NCC Netherlands:
    - Caspar van Leeuwen (SURF)
- NCC Finland:
    - Mats Sjöberg (CSC)
    - Oskar Taubert (CSC)
- NCC Romania:
    - Elena Paraschiv
:::
## Final schedule
### Pre-course checklist
Please start adding your material to [GitHub]
Deadlines for follow-up:
- [ ] 28-nov: ==Germany==: Create an allocation with early access for instructors starting 3-dec. Reserve an HPC allocation with GPU nodes for the course period (30-jan to 5-feb). Suggestion: reserve at least 100 GPUs, assuming 50 participants and given that multi-GPU lessons are involved.
- [x] 1-dec: ==Sweden==: Send reminders to all instructors to get access to [GitHub].
- [ ] 2-dec: ==All==: Final course planning meeting
- [ ] 3-dec: ==All==: Everyone should have access to [GitHub] and start adding their course materials.
- [ ] 3-dec: ==Germany==: Invite all instructors for access to the HPC cluster.
- [ ] 5-dec: ==Castiel==: Create registration form and circulate among NCCs. Filtering of participants:
- [ ] 12-jan: ==Sweden==: Send reminders to finalize course material on [GitHub]
- [ ] 17-jan: ==All==: Last chance to add course materials
- [ ] 23-jan: ==Sweden==: Review, modify and publish course page along with materials
- [ ] 23-jan: ==Germany==: Invite all learners for access to the HPC cluster.
- [ ] 30-jan: ==All==: Course starts
[GitHub]: https://github.com/ENCCS/castiel-multi-gpu-ai
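
The 100-GPU suggestion in the checklist corresponds to 2 GPUs per participant for 50 participants. A minimal sketch of that arithmetic follows, in case the participant count changes. The function names are hypothetical, and the figure of 4 GPUs per node is an assumption that should be checked against the actual hardware of the hosting system.

```python
import math


def gpus_to_reserve(participants: int, gpus_per_participant: int = 2) -> int:
    """Total GPUs needed if every participant runs a multi-GPU exercise at once."""
    return participants * gpus_per_participant


def nodes_to_reserve(total_gpus: int, gpus_per_node: int = 4) -> int:
    """Whole nodes needed to cover total_gpus.

    gpus_per_node=4 is an ASSUMPTION, not a figure from these notes.
    """
    return math.ceil(total_gpus / gpus_per_node)


total_gpus = gpus_to_reserve(participants=50)  # 50 * 2 = 100, as in the checklist
nodes = nodes_to_reserve(total_gpus)           # 100 / 4 = 25 under the assumption above
print(f"Reserve {total_gpus} GPUs, i.e. {nodes} nodes")
```

If fewer participants are admitted, or lessons need more than 2 GPUs each, only the arguments change.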
### Course agenda
| Day | Date |
| --- | ------------ |
| 1 | 30-jan (Fri) |
| 2 | 2-feb (Mon) |
| 3 | 3-feb (Tue) |
| 4 | 4-feb (Wed) |
| 5 | 5-feb (Thu) |
:::success
In the table below the following short form is used
`<day>(M|A)(.<part>)`
M = Morning
A = Afternoon
:::
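
The short form above is regular enough to check mechanically, for instance when generating the course page from the table. A minimal sketch, assuming the day and the optional part are plain integers; `parse_slot` and `SLOT_RE` are hypothetical names used only for illustration.

```python
import re

# <day>(M|A)(.<part>): day number, M or A, then an optional ".<part>" suffix.
SLOT_RE = re.compile(r"(?P<day>\d+)(?P<half>[MA])(?:\.(?P<part>\d+))?")
HALVES = {"M": "Morning", "A": "Afternoon"}


def parse_slot(code: str):
    """Split a slot code such as '3A.2' into (day, half, part); part may be None."""
    match = SLOT_RE.fullmatch(code.strip())
    if match is None:
        raise ValueError(f"not a valid slot code: {code!r}")
    part = match.group("part")
    return (
        int(match.group("day")),
        HALVES[match.group("half")],
        int(part) if part is not None else None,
    )


print(parse_slot("3A.2"))  # (3, 'Afternoon', 2)
print(parse_slot("2M"))    # (2, 'Morning', None)
```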
| Day | Lesson | Org | Comments / suggestions |
| --- |:-------------------------------------------------- |:--------------------------------------------------------------------------- |:--------------------------------- |
| 1M.1 | Intro to the HPC system (1.5h) | Italy | |
| 1M.2 | Setup and accessing Jupyter with GPU (1.5h) | Italy | |
| 1A | Introduction to deep-learning | Sweden | Need to reduce course length |
| 2M | PyTorch DDP | Netherlands | Sync with Hungary |
| 2A | Model parallelism with PyTorch | Hungary | Sync with Netherlands |
| 3M | PyTorch Lightning | Sweden | Reusable content from Turkey |
| 3A.1 | LLM, Fine-tuning (1.5h) | Belgium | |
| 3A.2 | HuggingFace Accelerate, DeepSpeed (1.5h) | Italy | |
| 4M | Computer vision (CV) | Romania | |
| 4A | MLOps | Poland | |
| 5M | Ray + Retrieval Augmented Generation (RAG) | Italy | |
| 5A.1 | Hyperparameter tuning (2.5h) | Finland | |
| 5A.2 | Closing session with open discussion (30 min) | All | |
## Fifth meeting
2nd December
:::info
- The EuroCC2 cost-neutral extension has been approved - no issues there.
- Jureca constraints
- Access to GitHub
:::
- Jureca constraints, and possible solutions to allow more participation:
- GPU jobs are short `sbatch` jobs, and not launched
- Some learners get access to only the lectures
- Access to GitHub:
- ...
---
### Agenda (02/12/2025)
::::danger
:::spoiler **The agenda in this red block is obsolete.**
**Day 0 (30/01/2026 - Friday)** / Intro JSC systems (Xin)
**Day 1 (02/02/2026 - Monday)**
• Day 1.a / Introduction to deep-learning (Ashwin) 2.0h / 09:00-11:00
Break (30min)
• Day 1.b / Introduction to deep-learning (Ashwin) 0.5h / 11:30-12:00
Lunch
• Day 1.c / Introduction to deep-learning (Ashwin) 1.5h / 13:00-14:30
Break (30min)
• Day 1.d / Pytorch DDP (Caspar) 2.0h / 15:00-17:00
**Day 2 (03/02/2026 - Tuesday)**
• Day 2.a / Model parallelism with Pytorch (Gyula) 2.0h / 09:00-11:00
Break (30min)
• Day 2.b / LLM, Finetuning (Geert) 0.5h / 11:30-12:00
Lunch
• Day 2.c / LLM, Finetuning (Geert) 1.0h / 13:00-14:00
Break (30min)
• Day 2.d / HuggingFace Accelerate, Deepspeed (Michele) 1.5h / 14:30-16:00
**Day 3 (04/02/2026 - Wednesday)**
• Day 3.a / Pytorch Lightning (Ashwin) 2.0h / 09:00-11:00
Break (30min)
• Day 3.b / Ray + Retrieval Augmented Generation RAG (Michele) 0.5h / 11:30-12:00
Lunch
• Day 3.c / Ray + Retrieval Augmented Generation RAG (Michele) 2.0h / 13:00-15:00
**Day 4 (05/02/2026 - Thursday)**
• Day 4.a / Computer Vision / NCC-Romania <span style="color:red;">**(??)** </span> 2.0h / 9:00-11:00
Break (30min)
• Day 4.b / Hyperparameter tuning (Oskar) 0.5h / 11:30-12:00
Lunch
• Day 4.c / Hyperparameter tuning (Oskar) 1.0h / 13:00-14:00
Break (30min)
• Day 4.d / MLOps / <span style="color:red;">**NCC-Poland (Klemens??) 1.5h / 14:30-16:00??** </span>
:::
::::
## Fourth meeting
4th November
:::info
To do
All
- Alternative plans if the EuroCC2 cost-neutral extension does not happen
- The course has to happen in Jan-Mar 2026
- Alternative: run the course with the participants involved in AIF or AIF antennas.
- Who are **not** involved in AIF and AIF-antennas:
- Castiel2/BSC can find a way
- Hungary involved in antenna, need to clarify
- Montenegro needs further confirmation
- Finalize Date to 2–6 February 2026 as VP suggested?
CASTIEL:
- Licensing
- GitHub use
- Gitlab at code.europa.eu
Germany
- Access & reservation
- Might need to step out as a course instructor, but happy to host.
Sweden
- Show how / where to add material: https://github.com/ENCCS/castiel-multi-gpu-ai
- Only 8 people added to https://github.com/orgs/ENCCS/teams/castiel-multi-gpu-ai/members so far.
:::
## Third meeting
23 October, 10:15 CEST
### Hosting the course material
We really need to get started.
- code.europa.eu or other institutional GitLab:
- restricted access
- may happen only with intervention from CASTIEL
- Our take: we can move to this later **after the course**
- Google Drive or Nextcloud etc:
- Poor version history, harder to copy material to an HPC cluster, no visibility
- Our take: we don't do this
- GitHub:
- We can do this now, and migrate to code.europa.eu when it is ready.
- We can add you to https://github.com/ENCCS/castiel-multi-gpu-ai. Please tell us your GitHub usernames here below or email me `ashwin.mohanan [at] ri [dot] se`:
- ujlaki15 - Hungary (github name - country)
- viscio - Italy
- lcavall11 - Italy
- dbrandoni - Italy
### Finalize topics
- Sweden:
- Intro to deep-learning
-
- ...
- ...
## Second meeting
7 October, 10:00 CEST
:::info
**Attendees**
- Francesco Fiusco (NCC Sweden)
:::
TODO:
- Email NCC Poland regarding MLOps
- Maksym: check with JUWELS
-
## First meeting
Here we discuss the dates, people, rough course outline and other logistics. **Welcome, and check-in below:**
:::info
**Supercomputer**
- [LUMI](https://lumi-supercomputer.eu/)
- [JUWELS](https://www.fz-juelich.de/en/ias/jsc/systems/supercomputers/juwels)
- [JURECA DC](https://apps.fz-juelich.de/jsc/hps/jureca/configuration.html#hardware-configuration-of-the-system-name-dc-module-phase-2-as-of-may-2021)
:::
:::success
**Tentative dates in 2026**:
Let's vote. Add a `+` to the right
- Week 3: 12-16 January ++
- Week 4: 19-23 January +++
- Week 5: 26-30 January +++
- Week 6: 2-6 February ++++
- other suggestions
:::
### OLD Course outline
:::success
M = Morning
A = Afternoon
:::
Here we slightly change the order in which the lessons appear and present some suggestions. Let's discuss!
| Day | Lesson | Org | Comments / suggestions |
| --- |:-------------------------------------------------- |:--------------- |:-------------------------------------------------------------------------------- |
| 1M | ==Access to HPC== | ==Hosting entity== | Needs to demonstrate launching JupyterLab |
| 1M | ==Intro to GPU== | ==Hosting entity== | Retain emphasis on architecture. Skip programming models like CUDA, OpenACC etc. |
| 1A | ==Introduction to deep-learning== | ==Sweden== | Need to reduce course length |
| 2M | MLOps and/or HP tuning | Belgium and/or Finland | Will follow up on this |
| 2A | PyTorch DDP | ==Netherlands== | Sync |
| 3M | Model parallelism with PyTorch | ==Hungary== | Sync |
| 3A | Other modes of parallelism with PyTorch (TBC) | Germany | Sync, and try to not repeat DDP and model parallelism |
| 4M | PyTorch Lightning | ==NCC Sweden== | Reusable content needed, approved |
| 4A | LLM, Fine-tuning, HuggingFace Accelerate, DeepSpeed | ==GJB for first two topics <br> NCC Italy ~1.5h for last two topics== | Belgium + Italy own material |
| 5M | Ray + Retrieval Augmented Generation (RAG) | ==Italy== | |
| 5A | | ==NCC Romania== | |
| 5A | Closing session with open discussion | All | |
### Links and further notes
- https://gitlab.tuwien.ac.at/vsc-public/training/LLMs-on-supercomputers
- MD from DE: Move preprocessing using Ray to the beginning of the course?
- If we use LUMI, then the first session on access to HPC should be someone from CSC Finland / CINECA / BSC
- AMD machine can be tricky?
- KN from Cyfronet:
- Can we still run this course under the EuroCC brand?
- We are still on EuroCC2
- New activities may need to switch under the AI-Factory?
- Ask others from AI-Factory to join us?
- Sima: As CASTIEL, they cannot provide training for the countries with AI Factories, going forward
- Missing topic: MLOps, how to monitor training
- Have to discuss internally
- Ray is versatile, but does MLOps fit better?
- EB from Cineca: ok to leave the topic on Ray
- GJB: on MLOps, has a course which uses DVC rather than MLflow. Not suggesting an alternative, but something that can be included.
- YW and KN: we should try to include people working in AI Factory.
- LF:
- Course format: online.
- (Public) material from last year: https://drive.google.com/drive/folders/1GqULIbJ5wJsvUk6zgu9fFDCkvfmOtjQN?usp=drive_link
## Repository
License:
- CC-BY 4.0 https://creativecommons.org/licenses/by/4.0/
Format:
- Markdown, PDF, Jupyter Notebooks all compiled into a single Sphinx lesson
- See for example, our (ENCCS's) template
- Sources: https://github.com/ENCCS/sphinx-lesson-template
- Rendered: https://enccs.github.io/sphinx-lesson-template
Where to host:
- Does CASTIEL have a GitHub organization or equivalent?
- CASTIEL will check internally. All NCCs and COEs should be able to access https://code.hlrs.de
- Currently under maintenance. Try next time.
- If not, we can add it under https://github.com/ENCCS and help with formatting and instructor access