![](https://www.eurohpc-ju.europa.eu/sites/default/files/styles/oe_theme_medium_2x_no_crop/public/2023-02/eurocc2.jpg?itok=RDXozVhu =20%x)

# CASTIEL2 Multi-GPU AI Train the Trainer course

**Contents**

[TOC]

:::info
**Participating organizations and people**

- NCC Sweden (ENCCS):
    - Ashwin Mohanan (AM)
    - Yonglei Wang (YW)
    - Francesco Fiusco (FF)
- NCC Poland:
    - Klemens Noga (KN)
    - also additional people, to be discussed internally
- NCC Germany:
    - Maksym Deliyergiev (HLRS)
- NCC Belgium:
    - Geert Jan Bex (NCC Belgium/Vlaams Supercomputing Centrum)
- NCC Italy:
    - Domitilla Brandoni
    - Laura Cavalli
    - Michele Visciarelli
- NCC Montenegro:
    - Luka Filipovic (UDG)
    - Stevan Cakic (UDG)
    - Danilo Planinić (UCG)
- NCC Hungary:
    - Gyula Ujlaki
- NCC Netherlands:
    - Caspar van Leeuwen (SURF)
- NCC Finland:
    - Mats Sjöberg (CSC)
    - Oskar Taubert (CSC)
- NCC Romania:
    - Elena Paraschiv
:::

## Final schedule

### Pre-course checklist

Please start adding your material to [GitHub].

Some deadlines for follow-up:

- [ ] 28-nov: ==Germany==: Create an allocation with early access for instructors starting 3-dec. Reserve an HPC allocation with GPU nodes for the course period (30-jan to 5-feb). Suggestion: reserve at least 100 GPUs, assuming 50 participants and considering that multi-GPU lessons are involved.
- [x] 1-dec: ==Sweden==: Send reminders to all instructors to get access to [GitHub].
- [ ] 2-dec: ==All==: Final course planning meeting
- [ ] 3-dec: ==All==: Should get access to [GitHub] and start adding their course materials.
- [ ] 3-dec: ==Germany==: Invite all instructors for access to the HPC cluster.
- [ ] 5-dec: ==Castiel==: Create registration form and circulate among NCCs.
Filtering of participants:
- [ ] 12-jan: ==Sweden==: Send reminders to finalize course material on [GitHub]
- [ ] 17-jan: ==All==: Last chance to add course materials
- [ ] 23-jan: ==Sweden==: Review, modify and publish course page along with materials
- [ ] 23-jan: ==Germany==: Invite all learners for access to the HPC cluster.
- [ ] 30-jan: ==All==: Course starts

[GitHub]: https://github.com/ENCCS/castiel-multi-gpu-ai

### Course agenda

| Day | Date         |
| --- | ------------ |
| 1   | 30-jan (Fri) |
| 2   | 2-feb (Mon)  |
| 3   | 3-feb (Tue)  |
| 4   | 4-feb (Wed)  |
| 5   | 5-feb (Thu)  |

:::success
In the table below the following short form is used: `<day>(M|A)(.<part>)`

M = Morning
A = Afternoon
:::

| Day  | Lesson                                        | Org         | Comments / suggestions       |
| ---- |:--------------------------------------------- |:----------- |:---------------------------- |
| 1M.1 | Intro to the HPC system (1.5h)                | Italy       |                              |
| 1M.2 | Setup and accessing Jupyter with GPU (1.5h)   | Italy       |                              |
| 1A   | Introduction to deep-learning                 | Sweden      | Need to reduce course length |
| 2M   | PyTorch DDP                                   | Netherlands | Sync with Hungary            |
| 2A   | Model parallelism with PyTorch                | Hungary     | Sync with Netherlands        |
| 3M   | PyTorch Lightning                             | Sweden      | Reusable content from Turkey |
| 3A.1 | LLM, Finetuning (1.5h)                        | Belgium     |                              |
| 3A.2 | HuggingFace Accelerate, DeepSpeed (1.5h)      | Italy       |                              |
| 4M   | CV                                            | Romania     |                              |
| 4A   | MLOps                                         | Poland      |                              |
| 5M   | Ray + Retrieval Augmented Generation (RAG)    | Italy       |                              |
| 5A.1 | Hyperparameter tuning (2.5h)                  | Finland     |                              |
| 5A.2 | Closing session with open discussion (30 min) | All         |                              |

## Fifth meeting 2nd December

:::info
- The EuroCC2 cost-neutral extension has been approved - no issues there.
- JURECA constraints
- Access to GitHub
:::

- JURECA constraints, and possible solutions to allow more participation:
    - GPU jobs are short `sbatch` jobs, and not launched
    - Some learners get access to only the lectures
- Access to GitHub:
    - ...

---

### Agenda (02/12/2025)

::::danger
:::spoiler **The agenda in this red block is obsolete.**

**Day 0 (30/01/2026 - Friday)** / Intro JSC systems (Xin)

**Day 1 (02/02/2026 - Monday)**
• Day 1.a / Introduction to deep-learning (Ashwin) 2.0h / 09:00-11:00
Break (30min)
• Day 1.b / Introduction to deep-learning (Ashwin) 0.5h / 11:30-12:00
Lunch
• Day 1.c / Introduction to deep-learning (Ashwin) 1.5h / 13:00-14:30
Break (30min)
• Day 1.d / PyTorch DDP (Caspar) 2.0h / 15:00-17:00

**Day 2 (03/02/2026 - Tuesday)**
• Day 2.a / Model parallelism with PyTorch (Gyula) 2.0h / 09:00-11:00
Break (30min)
• Day 2.b / LLM, Finetuning (Geert) 0.5h / 11:30-12:00
Lunch
• Day 2.c / LLM, Finetuning (Geert) 1.0h / 13:00-14:00
Break (30min)
• Day 2.d / HuggingFace Accelerate, DeepSpeed (Michele) 1.5h / 14:30-16:00

**Day 3 (04/02/2026 - Wednesday)**
• Day 3.a / PyTorch Lightning (Ashwin) 2.0h / 09:00-11:00
Break (30min)
• Day 3.b / Ray + Retrieval Augmented Generation RAG (Michele) 0.5h / 11:30-12:00
Lunch
• Day 3.c / Ray + Retrieval Augmented Generation RAG (Michele) 2.0h / 13:00-15:00

**Day 4 (05/02/2026 - Thursday)**
• Day 4.a / Computer Vision / NCC-Romania <span style="color:red;">**(??)** </span> 2.0h / 09:00-11:00
Break (30min)
• Day 4.b / Hyperparameter tuning (Oskar) 0.5h / 11:30-12:00
Lunch
• Day 4.c / Hyperparameter tuning (Oskar) 1.0h / 13:00-14:00
Break (30min)
• Day 4.d / MLOps / <span style="color:red;">**NCC-Poland (Klemens??) 1.5h / 14:30-16:00??** </span>
:::
::::

## Fourth meeting 4th November

:::info
To do

All
- Alternative plans if the EuroCC2 cost-neutral extension does not happen
    - Has to happen in Jan-Mar 2026
    - Alternative: run the course with the participants involved in AIF or AIF-antennas.
- Who are **not** involved in AIF and AIF-antennas:
    - Castiel2/BSC can find a way
    - Hungary is involved in an antenna, need to clarify
    - Montenegro needs further confirmation
- Finalize the date to 2–6 February 2026, as VP suggested?

CASTIEL:
- Licensing
- GitHub use
- GitLab at code.europa.eu

Germany
- Access & reservation
- Might need to step out as a course instructor, but happy to host.

Sweden
- Show how / where to add material: https://github.com/ENCCS/castiel-multi-gpu-ai
- Only 8 people added to https://github.com/orgs/ENCCS/teams/castiel-multi-gpu-ai/members so far.
:::

## Third meeting 23 October, 10:15 CEST

### Hosting the course material

We really need to get started.

- code.europa.eu or other institutional GitLab:
    - restricted access
    - may happen only with intervention from CASTIEL
    - Our take: we can move to this later, **after the course**
- Google Drive or Nextcloud etc.:
    - Poor version history, harder to copy material to an HPC cluster, no visibility
    - Our take: we don't do this
- GitHub:
    - We can do this now, and migrate to code.europa.eu when it is ready.
    - We can add you to https://github.com/ENCCS/castiel-multi-gpu-ai. Please tell us your GitHub usernames here below or email me `ashwin.mohanan [at] ri [dot] se`:
        - ujlaki15 - Hungary (GitHub name - country)
        - viscio - Italy
        - lcavall11 - Italy
        - dbrandoni - Italy

### Finalize topics

- Sweden:
    - Intro to deep-learning
- ...
    - ...

## Second meeting 7 October, 10:00 CEST

:::info
**Attendees**
- Francesco Fiusco (NCC Sweden)
:::

TODO:
- Email NCC Poland regarding MLOps
- Maksym: check with JUWELS

## First meeting

Here we discuss the dates, people, rough course outline and other logistics.
**Welcome, and check-in below:**

:::info
**Supercomputer**
- [LUMI](https://lumi-supercomputer.eu/)
- [JUWELS](https://www.fz-juelich.de/en/ias/jsc/systems/supercomputers/juwels)
- [JURECA DC](https://apps.fz-juelich.de/jsc/hps/jureca/configuration.html#hardware-configuration-of-the-system-name-dc-module-phase-2-as-of-may-2021)
:::

:::success
**Tentative dates in 2026**: Let's vote. Add a `+` to the right
- Week 3: 12-16 January ++
- Week 4: 19-23 January +++
- Week 5: 26-30 January +++
- Week 6: 2-6 February ++++
- other suggestions
:::

### OLD Course outline

:::success
M = Morning
A = Afternoon
:::

Here we slightly change the order in which the lessons appear and present some suggestions. Let's discuss!

| Day | Lesson | Org | Comments / suggestions |
| --- |:------ |:--- |:---------------------- |
| 1M | ==Access to HPC== | ==Hosting entity== | Needs to demonstrate launching JupyterLab |
| 1M | ==Intro to GPU== | ==Hosting entity== | Retain emphasis on architecture. Skip programming models like CUDA, OpenACC etc. |
| 1A | ==Introduction to deep-learning== | ==Sweden== | Need to reduce course length |
| 2M | MLOps and/or HP tuning | Belgium and/or Finland | Will follow up on this |
| 2A | PyTorch DDP | ==Netherlands== | Sync |
| 3M | Model parallelism with PyTorch | ==Hungary== | Sync |
| 3A | Other modes of parallelism with PyTorch (TBC) | Germany | Sync, and try to not repeat DDP and model parallelism |
| 4M | PyTorch Lightning | ==NCC Sweden== | Reusable content needed, approved |
| 4A | LLM, Finetuning, HuggingFace Accelerate, DeepSpeed | ==GJB for first two topics <br> NCC Italy ~1.5h for last two topics== | Belgium + Italy own material |
| 5M | Ray + Retrieval Augmented Generation (RAG) | ==Italy== | |
| 5A | | ==NCC Romania== | |
| 5A | Closing session with open discussion | All | |

### Links and further notes

- https://gitlab.tuwien.ac.at/vsc-public/training/LLMs-on-supercomputers
- MD from DE: Move preprocessing using Ray to the beginning of the course?
- If we use LUMI, then the first session on access to HPC should be given by someone from CSC Finland / CINECA / BSC
    - AMD machine can be tricky?
- KN from Cyfronet:
    - Can we still run this course under the EuroCC brand?
        - We are still on EuroCC2
        - New activities may need to switch under the AI-Factory?
        - Ask others from AI-Factory to join us?
    - Sima: As CASTIEL, they cannot provide training for the countries with AI Factories, going forward
- Missing topic: MLOps, how to monitor training
    - Have to discuss internally
    - Ray is versatile, but does MLOps fit better?
    - EB from Cineca: OK to leave the topic on Ray
    - GJB: on MLOps, has a course which uses DVC rather than MLflow. Not suggesting an alternative, but something that can be included.
- YW and KN: we should try to include people working in AI Factory.
- LF:
    - Course format: online.
    - (Public) material from last year: https://drive.google.com/drive/folders/1GqULIbJ5wJsvUk6zgu9fFDCkvfmOtjQN?usp=drive_link

## Repository

License:
- CC-BY 4.0 https://creativecommons.org/licenses/by/4.0/

Format:
- Markdown, PDF, Jupyter Notebooks all compiled into a single Sphinx lesson
- See, for example, our (ENCCS's) template
    - Sources: https://github.com/ENCCS/sphinx-lesson-template
    - Rendered: https://enccs.github.io/sphinx-lesson-template

Where to host:
- Does CASTIEL have a GitHub organization or equivalent?
    - CASTIEL will check internally. All NCCs and COEs should be able to access https://code.hlrs.de
        - Currently under maintenance. Try next time.
- If not, we can add it under https://github.com/ENCCS and help with the formatting and with access for the instructors