## December 18th, 2024

* Agenda:
  * Last community call until 2025!
  * Status updates
    * GPU work
      * Can configure and request GPUs
      * Final bits of work required for `slurmutils` and the `slurmctld` + `slurmd` operators.
    * Filesystem client
      * Check it out! https://github.com/charmed-hpc/filesystem-client-operator/
    * IAM stack
      * `glauth-k8s` soon landing ingress support
    * Open Documentation Academy
      * In the new year, starting to look at how new contributors can be onboarded through docs.

## December 11th, 2024

* Agenda:
  * Status updates
    * Sackd operator is merged! 🥳
      * Found a bug in the node-configured action that was fixed.
      * Some cleanups needed.
    * Upgrade to Noble (and problems with mysql-router)
    * GPU work
      * Merged _gres.conf_ support for slurmutils (preparing for GPU support)
    * Storage work
      * Published https://github.com/charmed-hpc/filesystem-client-operator
    * IAM stack work
      * Problems with integrating K8s' DNS with external machine charms over cross-model relations
      * Using Traefik to resolve the DNS names into internal K8s services
  * Infiniband and partition keys
    * Billy: Any experience with Infiniband partitions?
    * James: Only hands-off experience. Needed to set up the network manager.
    * Arif: The generic network manager doesn't scale horizontally that well. OpenSM scales much better. Slurm partitions don't map well to Infiniband partitions. Some clients are interested in [IPoIB](https://docs.nvidia.com/networking/display/mlnxofedv23070512/ip+over+infiniband+(ipoib)#)
    * Jason: Kind of depends. Worked on clusters where some Slurm partitions were Infiniband enabled and some weren't.

## December 4th, 2024

* Agenda:
  * Status updates
    * `apt` charm library fix is almost merged. Waiting for final reviews.
      * Works for the development branch of the 24.04 Slurm charms.
    * GPU work
      * Developing a _gres.conf_ editor for `slurmutils` (see the format sketch after the November 13th notes)
      * Driver installation on compute nodes.
    * IAM stack work
      * Updating `sssd-operator` to work with `glauth-k8s-operator`.
      * Use SQL queries to manage users and groups on your cluster
  * Last community call of the year will be December 18th! Will resume January 8th!

## November 27th, 2024

* Agenda:
  * Happy Thanksgiving :turkey: :cow:
  * Status updates
    * Fixes for the `apt` charm library should be landed by the end of the week :crossed_fingers:
    * `auth/slurm` discussion thread: https://github.com/orgs/charmed-hpc/discussions/11
    * Terraform for Charmed HPC
    * Working on GPU + Infiniband integration.
    * dmtcp checkpointing software is helpful for workflows that run in steps.

## November 20th, 2024

* Agenda:
  * Status updates
    * `apt` charm library doesn't work on Noble.
    * `sackd` discovery work.
      * MUNGE works with sackd.
    * COS is now in Charmed HPC.
    * `slurm_ops` is now in the Slurm charms.
    * Documentation work. Let's make it great!

## November 13th, 2024

* Agenda:
  * Status updates
    * GPU speccing
    * Storage charm speccing
    * `slurm_ops` work
  * `cryptography` vs `pycryptodome`
    * Debating whether we should revert to using pycryptodome in `slurm_ops`, as `cryptography` adds noticeably more time to `charmcraft pack`. Since `cryptography`'s backend is written in Rust, it must be compiled when packing the charm.
    * Charms only use source packages, not prebuilt wheels, so we need to pull in `rustc` to build `cryptography`. `pycryptodome` doesn't require this.
    * https://github.com/canonical/charmcraftcache <- could possibly help, but we need to play with it more.
  * [Slurm authentication plugin](https://slurm.schedmd.com/authentication.html#slurm) and [`sackd`](https://slurm.schedmd.com/sackd.html)
* Action items
  * Investigate different caching mechanisms for `rustc` + `cryptography`
  * Further investigate Slurm's new authentication mechanism `auth/slurm`
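The _gres.conf_ editor work mentioned in the December 4th and 11th notes above revolves around round-tripping Slurm's GPU resource file. As a point of reference, here is a minimal plain-Python sketch of the record format (whitespace-separated `Key=Value` pairs); it is not the `slurmutils` API, and the helper names, node name, and device values are illustrative placeholders.

```python
# Minimal sketch (not the slurmutils API) of round-tripping a gres.conf
# record such as `NodeName=tux01 Name=gpu Type=a100 File=/dev/nvidia0`.
# The node name, GPU type, and device path are placeholders.

def parse_gres_line(line: str) -> dict[str, str]:
    """Split a gres.conf record into its Key=Value pairs."""
    return dict(token.split("=", 1) for token in line.split())


def render_gres_line(record: dict[str, str]) -> str:
    """Render Key=Value pairs back into a gres.conf record."""
    return " ".join(f"{key}={value}" for key, value in record.items())


record = parse_gres_line("NodeName=tux01 Name=gpu Type=a100 File=/dev/nvidia0")
record["Type"] = "h100"  # e.g. correct the GPU model after hardware detection
print(render_gres_line(record))
```

A full editor also has to preserve comments and handle device ranges such as `File=/dev/nvidia[0-3]` and `AutoDetect` lines, which this sketch ignores.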
## November 6th, 2024

* Agenda:
  * How's it going?
    * InfluxDB
    * Add support for constraints to the Slurm charms
  * Terraform plans working nicely so far.
    * Planning to expand by adding modules for COS, storage, etc.
    * Add a README documenting how to use `charmed-hpc-terraform`

## October 23rd, 2024

* Agenda
  * Status updates
    * [Azure + Lustre](https://github.com/jedel1043/azure-lustre-bicep)
    * Terraform testing (https://github.com/charmed-hpc/charmed-hpc-terraform)

## October 16th, 2024

* Agenda
  * Status updates
    * [CI failures caused by GitHub Actions update to Noble](https://github.com/charmed-hpc/slurm-charms/actions/runs/11296721635/job/31422137240)
    * [feat(slurm_ops): loosen environment variable rules and configure openfile limits](https://github.com/charmed-hpc/hpc-libs/pull/44)
  * TiCS analysis tour

## October 9th, 2024

* Agenda
  * Welcome Dominic!! :rocket:
  * Status updates
    * `slurm_ops`
      * Adding the last patches required for the HPC use case
        * `MYSQL_UNIX_PORT`
        * ulimit rules (see the sketch below)
    * slurm-charms PRs in the oven
      * Enable prometheus on main
      * Use `slurm_ops` for management
  * Repository tooling - call for opinions!
    * tox
    * poetry
    * monorepo tooling
  * Anything we want to focus on for OpenInfra NA Days
    * What should I mention about the community?
* Discussion notes:
  * Feedback on `poetry`:
    * poetry doesn't seem to handle workspaces well. Building local components doesn't work right because poetry can't pull the correct metadata. Pulling via a git URL builds the entire repository, which is not desirable.
      * https://github.com/DavidVujic/poetry-multiproject-plugin
      * https://github.com/gerbenoostra/poetry-plugin-mono-repo-deps
    * Solved the local dependency issue by using a tool called `stickywheel`: https://github.com/omnivector-solutions/jobbergate/blob/main/jobbergate-agent/pyproject.toml#L38,L44
    * Check if poetry + charmcraft has integration with pyproject.toml. Make sure that charmcraft can also properly support poetry.
    * How do `uv` and `poetry` differ when it comes to workspaces?
  * Feedback on `tox`:
    * Just document common commands that should be used when creating a new _tox.ini_ file. If there are any major issues, comment on the shared document.
  * Feedback on monorepo tooling:
    * `repository.py` gets the job done, but we should be cognizant that it might not be tenable as we scale up the number of monorepos we have. _One change needs to be merged several times._
    * Consider making a small PoC if there's spare time.
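The ulimit items above (the "ulimit rules" patch from October 9th and the openfile-limit work in the October 16th `hpc-libs` pull request) are about making sure Slurm services can open enough files. As a minimal sketch of the general mechanism only, not the `hpc-libs` implementation, one common approach is a systemd drop-in; the service name, limit value, and helper name below are example choices.

```python
# Sketch of raising LimitNOFILE for a Slurm service via a systemd drop-in.
# Illustrative only (not the hpc-libs implementation); the service name and
# limit are example values, and this needs to run as root.
import subprocess
from pathlib import Path


def set_nofile_limit(service: str, limit: int = 131072) -> None:
    """Write a drop-in that raises the open-file limit for `service`."""
    dropin_dir = Path(f"/etc/systemd/system/{service}.service.d")
    dropin_dir.mkdir(parents=True, exist_ok=True)
    (dropin_dir / "override.conf").write_text(f"[Service]\nLimitNOFILE={limit}\n")
    # Reload unit definitions so the override is picked up on the next restart.
    subprocess.run(["systemctl", "daemon-reload"], check=True)


set_nofile_limit("slurmd")
```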
## October 2nd, 2024

* Agenda
  * Status updates
    * Putting `slurm_ops` into the Slurm charms.
    * COS for the `main` branch.
  * Conferences
    * OpenInfra NA Days (lots of HPC people)

## September 25th, 2024

* Agenda
  * Status updates
    * Final touches for `slurm_ops`
  * Public cloud testing
    * James: https://github.com/omnivector-solutions/aws-plugin-for-slurm is a plugin for AWS that enables deploying dynamic Slurm nodes.
    * James: The plugin creates a CloudFormation stack for an EC2 fleet.
    * Billy: Yeah, we've been talking about public cloud integrations recently.
    * James: The downside is that Juju becomes this thing where other resources deployed in AWS would create things that Juju won't realize exist and won't clean up properly.
    * Billy: What resources are you referring to?
    * James: Spark k8s, since Juju doesn't know about the schedules. In the case of Slurm, when more nodes are requested than are available, Juju won't be able to request more nodes automatically.
    * Billy: It would be a great idea to make Juju more aware of those things, e.g. automatic hardware detection for GPUs.
    * James: We could create public cloud integrators that are filesystem aware, which would allow easily mounting custom storage.
    * James: EFS ends up being NFS, but AWS managed.
    * Billy: We're pretty much aligned on this.

## September 18th, 2024

* Agenda
  * Status updates
    * Terraform plans!
    * `hpc-libs` and slurm ops integration.
    * AMD GPU driver enablement.
    * `systemd` trying to rule the universe
  * Conferences
    * OpenInfra NA Days
      * Charmed HPC is going to the midwest.
  * Actions
    * Evaluate `juju-systemd-notices`

## September 11th, 2024

* Agenda
  * Status updates
    * Working on Terraform plans for deploying Charmed HPC. _It's gonna be slick_
    * AMD GPU driver enablement within compute nodes
    * New `experimental` PPA. Central place for publishing new Debian packages for testing before upstreaming and/or downstreaming
      * https://launchpad.net/~ubuntu-hpc/+archive/ubuntu/experimental
  * Conferences
    * [name=jedel] UbuCon LA
    * [name=nuccitheboss] OpenInfra Summit Asia '24

## September 4th, 2024

* Agenda
  * Status updates
    * [feat: add update method to update configs from other models](https://github.com/charmed-hpc/slurmutils/pull/16)
    * [feat: abstract around base package manager](https://github.com/charmed-hpc/hpc-libs/pull/22)
    * [add common packages to the slurm charms](https://github.com/charmed-hpc/slurm-charms/issues/23)
      - [name=arif-ali]: Maybe 5% of clusters use NFS, and you won't use it if you're using Lustre.
      - [name=wolsen]: We already offer the NFS piece with `nfs-client-operator`, and people will most likely use Lustre instead. We also have CephFS there, but it probably won't be used as a filesystem.
      - [name=wolsen]: NFS is your toy cluster filesystem, but for real use cases people will use Lustre. We should not add `nfs-common` by default, but `libpmix-dev` and `openmpi-bin` should be there.
      - [name=arif-ali]: Maybe also NHC. The Prometheus exporter won't necessarily contact Slurm. NHC offers better integration with Slurm overall.
      - [name=wolsen]: The problem with NHC is that it will check very similar things to Prometheus. It will have gaps because of the longer ping times, but it would be super nice to have only one observability stack instead of multiple ones doing multiple things. We should take time to think about whether NHC should really be there or if we should just replace it with Prometheus.
      - [name=arif-ali]: As long as we have parity with NHC but using Prometheus, then it shouldn't be that bad.
      - [name=wolsen]: Slurm is very customizable, so we are able to run hooks using Python code (see the sketch below). That could enable the functionality. However, we need to determine the features from NHC that we'll need to reimplement for Prometheus first.
      - [name=wolsen]: I'm not saying we should replace NHC today, but we should consider it as an open question for now.
      - [name=arif-ali]: What about the corner cases? e.g. users who don't want pmix, openmpi, etc.
      - [name=wolsen]: We should say: "We provide a default distro, but if you need to differ from this, use Spack".
      - [name=arif-ali]: Some users would want minimalist systems to squeeze the most performance from their system.
      - [name=wolsen]: I think that it's still okay to provide a default for now. Maybe the market will require us to provide minimal systems in the future, but for now we should be good.
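On the NHC question in the September 4th notes above: Slurm can run an operator-supplied health-check program on each node (the `HealthCheckProgram`/`HealthCheckInterval` options in slurm.conf), so an NHC-style check can be a small script that drains the node when something looks wrong. The sketch below is only an illustration of that pattern, not an agreed Charmed HPC design; the check, threshold, and drain reason are placeholders.

```python
#!/usr/bin/env python3
# Sketch of a node health check in the spirit of the NHC discussion above:
# run a local check and drain the node via scontrol if it fails. The check,
# threshold, and reason text are placeholders.
import shutil
import socket
import subprocess


def failed_checks() -> list[str]:
    """Return the reasons this node looks unhealthy (empty means healthy)."""
    reasons = []
    if shutil.disk_usage("/tmp").free < 1 * 1024**3:  # less than 1 GiB free
        reasons.append("low free space on /tmp")
    return reasons


def main() -> None:
    reasons = failed_checks()
    if reasons:
        # Drain the node so Slurm stops scheduling new jobs onto it.
        subprocess.run(
            [
                "scontrol", "update",
                f"NodeName={socket.gethostname()}",
                "State=DRAIN",
                f"Reason=health check failed: {'; '.join(reasons)}",
            ],
            check=True,
        )


if __name__ == "__main__":
    main()
```

Whether checks like this live in a Prometheus exporter or in a Slurm-side hook is exactly the open question from the call.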
## August 28th, 2024

* Agenda
  * Quick status updates
    * Congrats Cory on having your talk accepted at the Ubuntu Summit
  * Conferences
  * https://discourse.charmhub.io/t/discontinuing-new-charmhub-bundle-registrations/15344
  * Prometheus exporter (packaging for Debian)
  * Terraform plans for Charmed HPC
  * AMD has a new packager who will be focused on enabling ROCm for various Linux distributions.

## August 21st, 2024

* Agenda
  * Quick status updates
    * Summit booth
    * slurm-wlm packaging
    * Exporter packaging
    * Adding `from_dict` support to `slurmutils`
    * Congrats to the Omnivector team for releasing Vantage to AWS.
    * [name=nuccitheboss] DevConf.US was a success! Good amount of interest from folks in our community and Charmed HPC.
  * [name=nuccitheboss] `slurmhelpers` discussion
    * Feel free to share your thoughts: https://github.com/orgs/charmed-hpc/discussions/3
  * ORAS discussion

## August 7th, 2024

* Agenda
  * [name=nuccitheboss] Ubuntu Summit updates for HPC.
  * [name=nuccitheboss] Issue labels for the Charmed HPC organization.
    * [name=jedel] Add size labels to help judge tasks.
      * https://github.com/unicode-org/icu4x/labels?page=1&sort=name-asc
  * [name=jedel1043] Monorepo is done!
* Action items
  * [ ] Create a base set of issue labels for the Charmed HPC organisation.
  * [ ] Share labels in a public markdown document.

## July 31st, 2024

* Agenda
  * [name=nuccitheboss] & [name=jamesbeedy] Ubuntu Summit 2024 HPC booth proposal
    * Do we want a booth for the whole weekend?
    * If we can't do the whole time, we can share our booth with the Ubuntu Community. The Ubuntu Community booth will be partitioned for different community groups within Ubuntu, especially those who cannot staff a booth for the whole weekend.
  * [name=jedel] & [name=nuccitheboss] Using `just` for top-level repository actions
    * `just` is like `make`, but it's just a command runner :)
    * Want to use `just` to automate top-level actions on our repos. `just` differs from `make` in that it doesn't need `.PHONY` file targets. `make` is more suited for creating files, while `just` just runs commands.
    * Thoughts?

## July 24th, 2024

* Agenda
  * [name=jedel] Reassemble the Slurm charms monorepo discussion
    * [name=nuccitheboss] I was wrong. I originally thought that having individual repos would be better since we could update charms independently and keep changes small, but it's causing too much grief for integration testing when breaking changes are made 🥲
    * __Benefits of returning to a monorepo:__
      * We can pin the repository to the top of the Charmed HPC org. It's something actively developed that we can show off.
      * Test against the latest commit to the Slurm charms rather than pulling what is currently in edge. Don't burn CI minutes on runs we know will fail :white_frowning_face:
      * Easier CI testing for development branches. Test `experimental` against `experimental` rather than needing to mess with `main`.
      * Only one branch protection rule for all Slurm charms!
      * One set of integration tests for all the Slurm charms!
      * One quickstart README for deploying the Slurm charms!
    * [name=jamesbeedy] This is the third time we've put them back together. Let's keep this documentation.
    * [name=wolsen] Document the reasoning for the monorepo in CONTRIBUTING.md
  * [name=nuccitheboss] Charmed HPC + GitHub Discussions discussion
    * Figured out how to set up Charmed HPC-wide GitHub Discussions. Tied to the special `.github` repository.
    * https://github.com/orgs/charmed-hpc/discussions
    * Why this?
      * Central place for us to discuss Charmed HPC development asynchronously. Better than having to chase individual threads across repositories.
      * We've been looking into how we can provide proper forum and Q&A support for new Charmed HPC users as we get closer to a stable/alpha release.
      * Ideally, if someone going through the Charmed HPC documentation has a question, we want to provide a place where they can post it without getting lost in the chatter of Matrix.
      * Ubuntu Discourse is not an option, as the site-wide policy is that the instance cannot be used as a technical support forum.
      * AskUbuntu was ruled out because it is not beginner-friendly. Lots of rules for what can and can't be posted, and new user questions may be deleted or locked if they are considered duplicates or not high enough quality.
    * Thoughts?
  * [name=nuccitheboss] Documentation site demo
    * Charmed HPC documentation repository: https://github.com/charmed-hpc/docs
  * [name=jedel] poetry for charms discussion
* Action items
  - [ ] File a request to have the Slurm charms transferred to the "HPC charm team" on Charmhub
  - [ ] See if we can have the name changed to _Ubuntu High-Performance Computing_. Have consistent branding across platforms.
  - [ ] Tombstone the READMEs on the existing Slurm charm repositories to point to the new Slurm charm monorepo
  - [ ] Make a PoC for using Poetry to manage a charm project