# hachy OS build & update
We will move this to the infra repo proposals once we have directional agreement.
## Situation
A Hachyderm Infra person needs to repeatably and reliably:
- Build and boot new hosts in our various clouds
- Update existing hosts
- Rollback failed updates to a host
Today (early 2023), these operations are done manually: either through console clicks or ad-hoc commands at the CLI. While this has gotten us to where we are, we hypothesize it will be difficult to scale using this same set of processes. We've already observed several cases where expected configuration does not match actual configuration.
Previously, the [team agreed to leverage NixOS](https://discord.com/channels/858148454953254943/1060701008461824051/1060701495831576627) (needs a better reference), as we think it will provide consistency in our host builds, leading to improved reliability and more consistent operations.
Furthermore, while we have a presence in Linode for our edge nodes, we would like to explore using Hetzner cloud for `sidekiq` processes, as HCloud provides us with [free egress](https://discord.com/channels/858148454953254943/1063853904644808794/1063886051376115794), which we hypothesize will lower our overall burn rate. We also have a presence in Hetzner's dedicated datacenters for our databases.
## Build new host
We need to create a net-new virtual machine in one of our providers and get it into a good running state. We have multiple providers where we boot hosts:
- Hetzner dedicated
- Hetzner cloud (not really used)
- Linode
### Constraints
None of our providers have a base operating system that boots directly into NixOS. This means we need to build an image/artifact to install NixOS. Our cloud-based providers do not support mounting ISO images, so in this case, we need to leverage an external process to "convert" a different operating system into NixOS. Fortunately, the [`nixos-infect` project](https://github.com/elitak/nixos-infect) allows us to do exactly that.
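For reference, the infection itself is a one-liner run as root on a freshly booted host. This sketch follows the `nixos-infect` README; the channel pin is an assumption and should be checked against the project docs:

```shell
# Run as root on a freshly booted Debian/Ubuntu cloud host.
# The NIX_CHANNEL pin is an assumption; check the nixos-infect README.
curl https://raw.githubusercontent.com/elitak/nixos-infect/master/nixos-infect \
  | NIX_CHANNEL=nixos-22.11 bash -x
# The script rewrites the disk in place and reboots into NixOS.
```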
To further complicate things, each provider has different capabilities and APIs -- getting to a host running NixOS is different in each, and they don't all support the same features or levels of automation.
- Hetzner cloud (reasonably automatable, more volatility)
  - CAN create images and snapshot them
  - CANNOT mount a custom ISO
  - CAN be infected
- Hetzner dedicated (least automation, least volatility)
  - CANNOT create images and snapshot them
  - CAN mount a custom ISO
  - CAN be infected
- Linode (reasonably automatable, more volatility)
  - CAN create images and snapshot them
  - CANNOT mount a custom ISO
  - CAN be infected
### Conclusions / Proposal
We will bake base NixOS `golden images`. These images will be a snapshot or AMI-type image stored with the provider. The base image will be as generic as possible, i.e. it will not contain any "role-specific" configuration: that will be delegated to the Hachyderm NixOS repo (and addressed in the **Update existing host** proposal, below).
We hypothesize golden images will improve our boot times. We also think they will abstract the OS prep step such that Hachyderm infra folks focus primarily on configuration.
For cloud:
- Bake base golden images that:
  - Get to the simplest possible NixOS host
  - Account for any cloud-specifics like setting hostnames
- Identify a tool (like `packer`) that:
  - Prepares a temporary host like the above
  - Creates a snapshot to base future host boots on
  - Cleans up after itself
  - Can handle multiple clouds (Hetzner cloud, Linode)
- Likely use `cloud-init` and `nixos-infect` to convert a Debian 11 host into a NixOS host ([WIP PR reference](https://github.com/hachyderm/infrastructure/pull/143))
- Consider periodically testing this process to minimize bitrot and surprise failures
- For new host boots based on the golden images:
  - Author a runbook that includes instructions on how to boot a new VM in HCloud and Linode using the respective base image
  - It is important to us that this process is reasonably automated, repeatable, and reliable, so we may use tools like Terraform to help us roll out infrastructure
  - The host must be added to our observability stack so that we are aware of it and any issues it may have
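As a sketch of what the bake step could look like with `packer` and the Hetzner Cloud plugin (builder and field names follow the `hcloud` plugin docs; the token variable, location, server type, and snapshot name here are assumptions):

```hcl
packer {
  required_plugins {
    hcloud = {
      source  = "github.com/hetznercloud/hcloud"
      version = ">= 1.0.0"
    }
  }
}

variable "hcloud_token" {
  type      = string
  sensitive = true
}

# Boot a throwaway Debian 11 server, infect it, snapshot it, clean up.
source "hcloud" "nixos_base" {
  token         = var.hcloud_token
  image         = "debian-11"
  location      = "fsn1"
  server_type   = "cx11"
  ssh_username  = "root"
  snapshot_name = "hachyderm-nixos-base"
}

build {
  sources = ["source.hcloud.nixos_base"]

  provisioner "shell" {
    inline = [
      "curl https://raw.githubusercontent.com/elitak/nixos-infect/master/nixos-infect | NIX_CHANNEL=nixos-22.11 bash -x",
    ]
  }
}
```

Note: `nixos-infect` reboots the host when it finishes, which may need special handling in the provisioner step; the linked WIP PR is the source of truth here.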
For dedicated:
- Bake a NixOS ISO that can be uploaded to our provider
- Accept that installation will remain a manual process
- Author a runbook to guide the installation of NixOS on a dedicated node
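The ISO bake itself can likely follow the standard pattern from the NixOS manual (`iso.nix` is a hypothetical local file that would import the minimal installer profile plus our SSH keys):

```shell
# Build a custom NixOS installer ISO from a local iso.nix configuration
# (command per the NixOS manual; iso.nix is hypothetical).
nix-build '<nixpkgs/nixos>' \
  -A config.system.build.isoImage \
  -I nixos-config=iso.nix
# The image lands under ./result/iso/ and can be uploaded to Hetzner.
```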
## Update existing host
Once we have a running host, we need to periodically update it for changes like:
- software updates
- configuration changes
- incident remediation
We want a process that works across cloud & dedicated hosts with minimal (or maybe no?) differences.
### Conclusion / Proposal
We will establish the Hachyderm NixOS repository. This will contain our NixOS configurations such that they can be distributed to/consumed by our Hachyderm hosts.
Team members will push updates to the repo and follow a process (TBD) to distribute these updates to hosts.
We will establish the concept of a Hachyderm Host `role` (similar to an [Ansible role](https://docs.ansible.com/ansible/latest/playbook_guide/playbooks_reuse_roles.html)). This will allow us to organize configuration in reusable units.
Hosts will require a configuration element to dictate the `role` to which they are assigned. This will be included as part of the New Host Boot process described above.
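One hedged sketch of how roles could be laid out, if we express them as flake outputs in the Hachyderm NixOS repo (role names and module paths here are hypothetical):

```nix
# flake.nix (sketch; role names and module paths are hypothetical)
{
  inputs.nixpkgs.url = "github:NixOS/nixpkgs/nixos-22.11";

  outputs = { self, nixpkgs }: {
    # One nixosConfigurations entry per role, so a host can run
    # `nixos-rebuild switch --flake github:hachyderm/infrastructure#<role>`.
    nixosConfigurations = {
      edge = nixpkgs.lib.nixosSystem {
        system  = "x86_64-linux";
        modules = [ ./roles/common.nix ./roles/edge.nix ];
      };
      sidekiq = nixpkgs.lib.nixosSystem {
        system  = "x86_64-linux";
        modules = [ ./roles/common.nix ./roles/sidekiq.nix ];
      };
    };
  };
}
```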
In the event a deployment does not go as expected, we will establish a Rollback Process (TBD) to return the host to the previous state.
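NixOS generations give us most of this for free; a rollback sketch using standard NixOS commands:

```shell
# See which system generations exist and when they were created.
sudo nix-env --list-generations --profile /nix/var/nix/profiles/system

# Switch back to the previous generation.
sudo nixos-rebuild switch --rollback
```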
```
🐉🐉🐉 HERE THERE BE DRAGONS 🐉🐉🐉🐉🐉🐉🐉🐉🐉🐉🐉🐉🐉🐉🐉🐉🐉🐉🐉🐉🐉
Below this line needs significant work. Esk doesn't have much NixOS ops
experience (aka zero), so please contribute your ideas.
🐉🐉🐉🐉🐉🐉🐉🐉🐉🐉🐉🐉🐉🐉🐉🐉🐉🐉🐉🐉🐉🐉🐉🐉🐉🐉🐉🐉🐉🐉🐉🐉🐉🐉
```
For both cloud & dedicated:
- Create a `systemd` unit (shipped with our base image) that:
  - Monitors our GH repo `main` branch (or other target)
  - Pulls down updated config relevant to the host's `role`
  - Applies it and runs `nixos-rebuild`
- Consider tools like [`colmena`](https://github.com/zhaofengli/colmena) to assist in automating NixOS deployments
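A sketch of what that unit pair could look like (unit names, the interval, and the flake URI are assumptions; on NixOS these would themselves be declared in the base image's Nix config rather than as raw files):

```ini
# /etc/systemd/system/hachy-update.service (sketch; names are hypothetical)
[Unit]
Description=Apply latest Hachyderm NixOS configuration
Wants=network-online.target
After=network-online.target

[Service]
Type=oneshot
# nixos-rebuild fetches the flake from GitHub itself, so no separate git step.
ExecStart=/run/current-system/sw/bin/nixos-rebuild switch --flake github:hachyderm/infrastructure#some-role

# /etc/systemd/system/hachy-update.timer
[Unit]
Description=Periodically apply Hachyderm NixOS configuration

[Timer]
OnUnitActiveSec=15min
RandomizedDelaySec=2min

[Install]
WantedBy=timers.target
```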
#### After talking to Sofi
- Use flakes
- Use a pattern similar to imsofi/phenix
- Consider starting simple, e.g. just have people run this command:
```
nixos-rebuild switch --flake github:hachyderm/infrastructure#some-role
```
We will use `cloud-init` to run the deployment command at boot, and a human can run the same command manually when needed. To roll back a failed deployment:
```
nixos-rebuild switch --rollback
```
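For the `cloud-init` piece, the user-data could be as small as this (sketch; the role name is a placeholder):

```yaml
#cloud-config
# Run the deployment command once on first boot.
runcmd:
  - nixos-rebuild switch --flake github:hachyderm/infrastructure#some-role
```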