# EESSI/Azure/SURF sync meeting 20211015

## Agenda

- Update on NeIC project proposal (S4)
- EESSI Stratum-1 in Azure (Bob)
- GitHub runners for EESSI hosted in Azure VM (Bob)
- Use of Terraform (Bob)
- Zen3 build node (Kenneth)
    - Some trouble due to SELinux (?)
- Work on interconnect detection support in archspec (Hugo)
    - See https://github.com/archspec/archspec/pull/60

## Attendees

- Laura Redfern
- Martin Brandt
- Ivar Janmaat
- Bob Dröge
- Kenneth Hoste
- Ahmad Hesam
- Alan O'Cais
- Hugo Meiland

## Notes

- Update on NeIC project proposal (S4)
    - Did not get funded
        - Pretty decent score but competition was tough
    - Were recommended to reapply in next funding round (Feb. 22)
        - Will use feedback to tune proposal
        - Need to make a concrete connection back to users
        - Need some additional Nordic partners (since that was noted)
    - Other opportunities will arise soon (like EOSC calls which are currently being fine-tuned)
    - Laura: Can arrange a letter of support for future bids
- EESSI Stratum-1 in Azure (Bob)
    - Now part of the (latest) configuration package
    - CVMFS uses geoapi, so this Stratum-1 may not be used much since it currently sits in the US
        - Hugo will test it out
        - Can check with `cvmfs_config` which S1 you're talking to
          ```
          # first make sure that CVMFS is mounted, e.g. by doing an ls:
          ls /cvmfs/pilot.eessi-hpc.org
          cvmfs_config stat -v pilot.eessi-hpc.org
          # That should show something like:
          # Connection: http://134.94.88.70/cvmfs/pilot.eessi-hpc.org through proxy DIRECT (online)
          ```
    - (Default) GitHub runners may also be using this
        - Should keep an eye on traffic, as this can be large
    - Azure blob as Stratum-1 is an option that might be interesting
- GitHub runners for EESSI hosted in Azure VM (Bob)
    - Some of our actions exceed the 6h time limit for default runners
        - CVMFS does not provide containers for some archs (ARM + POWER) so we need to build them from source
    - Created our own runners to build containers
        - Only need these intermittently when the containers need updating
    - Any experience with auto-scaling Kubernetes cluster for GitHub Actions workflows?
        - Martin can check with SURF people working on Kubernetes & GitLab runners
        - Hugo can share info on throwaway multi-node clusters used internally
        - Martin: see https://docs.microsoft.com/nl-nl/azure/aks/kubernetes-action
    - For multi-node application testing Magic Castle should work well (and is more secure than Cluster-in-the-Cloud)
        - support for InfiniBand and EFA
        - see https://github.com/ComputeCanada/magic_castle
- Use of Terraform through API access to Azure (Bob)
    - separate 'terraform' account
    - Martin can probably help here, has done this
- Zen3 build node (Kenneth)
    - available now for EESSI in West Europe
    - Some trouble due to SELinux (?)
        - Using our container inside the image gives an error
          ```
          Singularity> mkdir /cvmfs/pilot.eessi-hpc.org/2021.06/software/linux/x86_64/amd/zen3
          mkdir: cannot create directory '/cvmfs/pilot.eessi-hpc.org/2021.06/software/linux/x86_64/amd/zen3': Operation not supported
          ```
        - Can make it work if `/tmp` is used for the overlay
        - Looks like it could be related to mount options and SELinux
    - Once we get this resolved we should have a full stack in a day, given the node is so powerful
- Work on interconnect detection support in archspec (Hugo)
    - See https://github.com/archspec/archspec/pull/60
    - Motivated by issues with using the interconnect, see https://github.com/EESSI/software-layer/issues/136
        - Fixed by setting some environment variables
        - OpenMPI should probably behave nicer
    - Interconnect detection could trigger some appropriate environment variables
- Usage is about 200 euro/month so no alarms trigger :P
    - only Stratum-1 + GitHub Actions runners
- Are there any relevant upcoming events?
    - There will be another EasyBuild User meeting
    - Having an end-user focussed tutorial might be a good idea
        - For example, for someone building on top of the EESSI stack
        - Topics:
            - setting up EESSI from scratch
            - usage
            - building your own software on top
    - Hugo: Marketplace VM image
- When will there be a production stack?
    - Really hard to say
    - Would need to have some monitoring in place... and someone to notify if there is something wrong
    - How do we roll something back? Who can do that?
    - If we have issues with a Stratum-1, who fixes it? And if we can't contact the responsible people, how do we kick it out?
        - Can we use DNS to kick out Stratum-1s?
            - That is possible
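
The `cvmfs_config stat -v` check from the Stratum-1 notes above can be wrapped in a small helper that prints only the server you are connected to. This is a minimal sketch, assuming the `Connection:` line has the shape quoted in the notes; the function name `stratum1_host` is hypothetical and not part of CVMFS.

```shell
# Hypothetical helper (not part of CVMFS): read `cvmfs_config stat -v` output
# on stdin and print the host part of the "Connection:" line.
# Assumes a line of the form shown in the notes:
#   Connection: http://<host>/cvmfs/<repo> through proxy ... (online)
# If the URL contains a port, it is printed as part of the host (host:port).
stratum1_host() {
    sed -n 's|^[[:space:]]*Connection:[[:space:]]*[a-z]*://\([^/]*\)/.*|\1|p'
}
```

With the repository mounted (e.g. after `ls /cvmfs/pilot.eessi-hpc.org`), it could be used as `cvmfs_config stat -v pilot.eessi-hpc.org | stratum1_host`. Nothing is printed if no `Connection:` line is found.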