
Magic Castle: Terraforming the Cloud to Teach HPC

November 12, 2023

General information

Schedule

November 12, 13:30-17:00 MST

Time Topic
13:30-13:45 Welcome and setup
13:45-13:55 Creating a Magic Castle Cluster in 5 minutes
13:55-14:20 Terraforming the Cloud to Teach HPC
14:20-15:00 Magic Castle
15:00-15:30 Break
15:30-16:45 Hands-on exercises
16:45-17:00 Break
11:20-12:00 Q&A

Instructors

  • Félix-Antoine Fortin
  • Alan O’Cais (he/him, CECAM/University of Barcelona, @ocaisa)
  • Lydia Vermeyden
  • Darren Boss

Code of conduct

The SC Conference is dedicated to providing a harassment-free conference experience for everyone, regardless of gender, sexual orientation, disability, physical appearance, race, or religion. We do not tolerate harassment in any form.

Contributor Covenant Code of Conduct

During this tutorial, we strive to follow the Contributor Covenant Code of Conduct to foster an inclusive and welcoming environment for everyone.


In short:

  • Use welcoming and inclusive language
  • Be respectful of different viewpoints and experiences
  • Gracefully accept constructive criticism
  • Focus on what is best for the community
  • Show courtesy and respect towards other community members

Contact details to report CoC violations can be found here.


You can ask questions about the workshop content at the bottom of this page. Please use the videoconferencing chat only to report videoconferencing problems and similar issues.


Questions, answers, discussion and information

  • is this how to ask a question?
    • yes, and an answer will appear like so!
  • Have you tested with OpenTofu?
    • Slide on OpenTofu coming up
    • Turns out the slide is very far down the list. It should work, but I don't believe we have tested it yet.
  • Can you do 'demand-driven' auto-scaling (possibly within a maximum number of configured nodes)?
    • E.g. When a job is submitted, it creates the nodes necessary for the job, then shuts the nodes down & deletes them when the job is done.
  • Just to check: are the on-demand nodes only shut down after the queue is empty? (i.e. do those nodes stay up while jobs are waiting for resources?)
  • Terraform Cloud alternatives:
  • When you change the number of nodes, will it restart the Slurm daemon? In other words, will it kill existing, running Slurm jobs?
    • it will not.
  • What was the password again for Julian?
    • We will all be creating our own clusters after the break so we don't need the password post break

Exercise 1

  • ssh yourusername@sc23.magiccastle.live
  • terraform version
  • source cloud-creds.sh
  • tar xvf magic_castle*.tar.gz
  • mv magic_castle-aws-13.1.0 mycluster
  • cd mycluster
  • nano main.tf
  • # (Set a unique cluster name, save, and then exit nano)
  • terraform init
  • terraform plan -out=myplan.zip
  • terraform apply myplan.zip
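
The "set a unique cluster name" step above can also be done without opening an editor. The following is a minimal sketch, not part of the tutorial material: the helper name `set_cluster_name` is hypothetical, and it assumes that main.tf contains a line of the form `cluster_name = "..."`. Check your own main.tf before relying on it.

```shell
# Sketch: set the cluster name in main.tf from the command line.
# Assumption: main.tf has a line like   cluster_name = "something"
# (set_cluster_name is a hypothetical helper, not a Magic Castle command).
set_cluster_name() {
  # $1: path to main.tf, $2: new cluster name
  # Accept only lowercase letters, digits, and hyphens, since the name
  # may end up in a hostname.
  case "$2" in
    ''|*[!a-z0-9-]*) echo "invalid cluster name: $2" >&2; return 1 ;;
  esac
  # Replace only the quoted value; the original file is kept as main.tf.bak.
  sed -i.bak 's/^\([[:space:]]*cluster_name[[:space:]]*=[[:space:]]*\)"[^"]*"/\1"'"$2"'"/' "$1"
}

# Example (run inside the mycluster directory):
#   set_cluster_name main.tf "yourusername-sc23"
```

After the edit, `grep cluster_name main.tf` shows the new value, and the `terraform init`, `terraform plan`, and `terraform apply` steps proceed as listed.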

Exercise 2

  • ssh -A centos@<your-ip-address>
  • tail -f /var/log/cloud-init-output.log
  • journalctl -u puppet -f # Ctrl-C to leave journalctl
  • ssh mgmt1
  • tail -f /var/log/cloud-init-output.log
  • journalctl -u puppet -f
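
Instead of watching the log by eye, the check for "has cloud-init finished?" can be scripted. This is a rough sketch under one assumption: that cloud-init ends its output log with a line containing `Cloud-init v. <version> finished`. The helper name `cloud_init_finished` is hypothetical.

```shell
# Sketch: succeed (exit status 0) once cloud-init has written its final line.
# Assumption: the last message in the log looks like
#   "Cloud-init v. <version> finished at ..."
# (cloud_init_finished is a hypothetical helper name).
cloud_init_finished() {
  # $1: log path; defaults to the file tailed in the exercise.
  grep -q 'Cloud-init v\. .* finished' "${1:-/var/log/cloud-init-output.log}"
}

# Example: poll every 10 seconds until the marker shows up.
#   until cloud_init_finished; do sleep 10; done; echo "cloud-init done"
```

If the installed cloud-init version supports it, `cloud-init status --wait` is a simpler way to block until cloud-init completes. Puppet keeps running after cloud-init finishes, so `journalctl -u puppet -f` remains useful afterwards.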

Exercise 3

  • nano main.tf
  • Uncomment the dns module by removing # in front of lines 63 to 75
  • source cloud-creds.sh
  • terraform init -upgrade
  • terraform plan -out=my-plan.zip
  • terraform apply my-plan.zip
  • ssh -A centos@<your_username>.magiccastle.live
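
The uncommenting step can be done with `sed` rather than by hand in nano. The sketch below uses a hypothetical helper, `uncomment_lines`. The range 63 to 75 comes from the exercise and may not match a different version of main.tf, so it is worth printing those lines first with `sed -n '63,75p' main.tf`.

```shell
# Sketch: strip one leading "#" from each line in a line range.
# (uncomment_lines is a hypothetical helper; the line range is whatever
# your copy of main.tf actually uses for the dns module.)
uncomment_lines() {
  # $1: file, $2: first line, $3: last line
  # Removes the first "#" on each line in the range, plus one following
  # whitespace character if present, and preserves indentation.
  # The original file is kept with a .bak suffix.
  sed -i.bak "$2,$3"'s/^\([[:space:]]*\)#[[:space:]]\{0,1\}/\1/' "$1"
}

# Example for this exercise:
#   uncomment_lines main.tf 63 75
```

Lines in the range that do not start with `#` are left unchanged, so running the helper twice on the same range is harmless unless a line carries a second, intentional `#`.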

Exercise 4

  • nano main.tf
  • Add "proxy" to login1's tags array
  • terraform plan -out=my-plan.zip
  • terraform apply my-plan.zip
  • nano data.yaml
  • source cloud-creds.sh
  • terraform plan -out=my-plan.zip
  • terraform apply my-plan.zip
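
Adding "proxy" to the tags array can also be scripted. This is only a sketch under stated assumptions: the login instance is defined on a single line in main.tf under a key named `login`, with a non-empty `tags = [...]` array on that same line. The helper name `add_tag` is hypothetical. If your main.tf is laid out differently, edit it in nano as described above.

```shell
# Sketch: append a tag to one instance's tags array in main.tf.
# Assumptions: the instance is defined on one line such as
#   login = { type = "...", tags = ["login", "public"], count = 1 }
# and the tags array is not empty. (add_tag is a hypothetical helper.)
add_tag() {
  # $1: file, $2: instance key (e.g. login), $3: tag to add
  # Do nothing if the tag is already present on that instance's line.
  if grep -q "^[[:space:]]*$2[[:space:]]*=.*\"$3\"" "$1"; then
    return 0
  fi
  # Append the tag inside the brackets; keep the original as a .bak copy.
  sed -i.bak "/^[[:space:]]*$2[[:space:]]*=/ s/tags[[:space:]]*=[[:space:]]*\[\([^]]*\)]/tags = [\1, \"$3\"]/" "$1"
}

# Example for this exercise:
#   add_tag main.tf login proxy
```

Running `terraform plan -out=my-plan.zip` afterwards shows what the tag change would modify before anything is applied.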

Plugs & resources

  • Jetstream2