# Redesigning EESSI infrastructure management
## Goals
* Staged deployment for production
* The deployment to the staging environment is automatically tested and validated, and can only go from staging to production when tests pass and after a human approves the PR. See https://netmemo.github.io/post/tf-gha-nsxt-cicd/ for inspiration.
* Multi-* support:
* Support multiple providers
* Support multiple locations
* Support multiple Linux distributions
* Allow for scale-out and scale-up for all resources we support
* Beyond staging and production, we should have ephemeral resources (today's "dynamic" nodes) and a playground (work)space to experiment in.
* Ephemeral and playground resources never interact with staging/production in any way. However, they may share configuration resources (node types, etc.) and can/should/will share Terraform modules to create said resources.
* Different permission boundaries for production vs playground vs ephemeral
* Ideally also for the backends (i.e. limiting who can review/merge in git is one thing, but the AWS account the playground uses should also be unable to touch production).
* Consistent configuration deployment
* All nodes, irrespective of their priority, get configuration pushed as the configuration changes in git.
## Solutions
### Provisioning
We can either use Terraform Workspaces (https://www.terraform.io/language/state/workspaces) to separate the stages, or use folders and files to separate them. Workspaces allow for some extra automation, but are repo-wide: if you push something to a branch that uses the workspace "staging", you are doing that for every resource in the repo.
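As a minimal sketch of the workspace approach (the values and names below are placeholders, not agreed settings), stage-specific values can be keyed off the active workspace:
```
# Hypothetical: one configuration, with per-stage values selected by the
# currently active workspace. One would run `terraform workspace select staging`
# (or production) before plan/apply.
locals {
  stage = terraform.workspace

  # Placeholder sizing; real values would live elsewhere.
  node_count = lookup({
    staging    = 1
    production = 4
  }, local.stage, 1)
}
```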
Either way, we end up with two different approaches to organising the infrastructure: multiple repositories or one big repository. https://www.hashicorp.com/blog/terraform-mono-repo-vs-multi-repo-the-great-debate is worth reading; adapted to our situation, it gives us the following two ideas.
### Multirepo option:
* Three infra repos:
* **infrastructure-core**: Contains the staging and production workspaces, and only ever deals with resources that are directly or indirectly part of EESSI's service deliveries. Even resources that go into staging are *planned* to go into production. There is no what-happens-if testing in this repo.
* **infrastructure-playground**: Testing repo. Good for deploying test environments and trying new things. If one wants to move something into production, the relevant bits are copied over to infrastructure-core/staging so that the automation can test the configuration and then deploy as per the given rules.
* **infrastructure-ephemeral**: Replaces `eessi-dynamic`. An inventory file is kept, in a format that details what each node is for, who has access to which node, its lifespan, and so on. It uses the same staging/prod split as infrastructure-core, so it is easy to see if something fails and to follow up on nodes.
* Because Terraform modules need to be shared between the repos, the modules *must* live in distinct git repos, ideally pushed to https://registry.terraform.io.
* Note that doing this allows us to use version-controlled modules: we can update infrastructure-core/staging to use a new version of a module that creates EESSI Azure nodes, and see if it works as well there as in the playground (see the sketch after this list).
* Permission boundaries are easy to set for both GitHub and the frontend, simply per repo.
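A minimal sketch of what that version-controlled module use could look like from infrastructure-core/staging; the repo name and version tags below are hypothetical:
```
# Hypothetical: staging pins a newer release of the node module than production,
# so the new version is exercised in staging before it is promoted.
module "eessi_azure_nodes" {
  source = "github.com/eessi/terraform-azure-eessi-node?ref=v0.2.0"
  # production would stay on e.g. ?ref=v0.1.3 until v0.2.0 has passed staging

  name = "eessi-azure-staging"
}
```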
Do we have a uniform configuration repo or tie configuration to each infra repo?
### All in one repo
We will have subfolders (core/ephemeral/playground) instead of repos.
* Having everything in one repo allows us to clone file structures between the different stages of the different parts of the repo, rather than using workspaces.
* If we wish to use workspaces, we will not have distinct workspaces (staging/production) per subfolder (they are one repo, and workspaces are repo-wide). This means we *either* need different staging branches for the different subfolders, *or* a single staging branch that applies to every subfolder equally.
* We may keep Terraform modules in the same repo, which probably speeds up development.
* But this means we cannot use or test different versions of a module with any semblance of ease (see the sketch after this list).
* Permission boundaries... will require serious fiddling with GitHub Actions and possibly the deployment tools (Terraform Cloud/Atlantis).
* Configuration is either stored in-repo or in a separate configuration repo.
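For contrast, a sketch of the in-repo module case referenced above: a relative-path source always tracks the current branch, so there is no version pin to vary between stages (path and names are hypothetical):
```
# Hypothetical in-repo module call: every caller in the repo gets the same
# (current-branch) copy of the module, with no version to pin or vary.
module "eessi_azure_nodes" {
  source = "../../modules/eessi-node"

  name = "eessi-azure-staging"
}
```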
### Thoughts about terraform module repositories
* All node-generation modules (ephemeral-node, core-node, test-node, aws-node, azure-node, etc.) will be highly interdependent, at least until they become very stable. It could make sense to keep them all in one repo, but this is not *usually* what Terraform wants. So we may need a lot of repos, even though we only source one when using a module... However, look at https://github.com/terraform-aws-modules/terraform-aws-security-group/tree/master/modules/ssh as an example of how to keep submodules within a single tree. This is very similar to what we want (see the sketch after this list).
* The above applies to networking and ACLs as well.
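A sketch of how a single node-module repo with submodules could be addressed, following the pattern linked above; the repo, submodule, and tag names are made up:
```
# Hypothetical: one eessi node-module repo, with per-flavour submodules selected
# via the "//" path separator and a release tag pinned with ?ref=.
module "ephemeral_node" {
  source = "github.com/eessi/terraform-cloud-eessi-node//modules/ephemeral?ref=v0.1.0"
  # the submodule's inputs (name, size, access, ...) would go here
}
```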
As a sideline, the requirements for publishing to the Terraform Registry are:
> The list below contains all the requirements for publishing a module:
>
> GitHub. The module must be on GitHub and must be a public repo. This is only a requirement for the public registry. If you're using a private registry, you may ignore this requirement.
>
> Named `terraform-<PROVIDER>-<NAME>`. Module repositories must use this three-part name format, where `<NAME>` reflects the type of infrastructure the module manages and `<PROVIDER>` is the main provider where it creates that infrastructure. The `<NAME>` segment can contain additional hyphens. Examples: terraform-google-vault or terraform-aws-ec2-instance.
>
> Repository description. The GitHub repository description is used to populate the short description of the module. This should be a simple one sentence description of the module.
>
> Standard module structure. The module must adhere to the standard module structure. This allows the registry to inspect your module and generate documentation, track resource usage, parse submodules and examples, and more.
>
> x.y.z tags for releases. The registry uses tags to identify module versions. Release tag names must be a semantic version, which can optionally be prefixed with a v. For example, v1.0.4 and 0.9.2. To publish a module initially, at least one release tag must be present. Tags that don't look like version numbers are ignored.
So can we not have provider-agnostic modules in the Terraform registry? See https://registry.terraform.io/modules/DOboznyi/agnostic/cloud (https://github.com/DOboznyi/terraform-cloud-agnostic/): they use "cloud" as the "provider" label.
### Module call design
```
module "terjekv-demo-nodes" {
source = "github.com/eessi/terraform-cloud-eessi-ephemeral-node.git?ref=v0.0.1"
# optionally, if we have one node-repo and push it to terraform registry, we can do:
# source = "terraform-cloud-eessi/eessi-nodes/cloud//modules/ephemeral"
# version = "~> 4.0"
name = "terjekv-demo-nodes"
size = "medium"
# Using count will create "count" nodes and suffix their names with "-XXX".
count = 4
access = "[user1, user2, user3, user4]"
# access_type="personal" gives each user access to a personal node.
# access_type="shared" gives all users access to all nodes
# "personal" might be the default if size(access) == count
# "personal" allocates what it can, failing if size(access) > count.
access_type = "personal"
}
```
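To make the `node_count`, `access` and `access_type` inputs above concrete, a hypothetical sketch of the variable declarations the module itself could carry (this is illustrative, not an agreed interface):
```
# Hypothetical variables.tf for the ephemeral-node module.
variable "name" {
  type = string
}

variable "size" {
  type    = string
  default = "medium"
}

variable "node_count" {
  # "count" is a reserved meta-argument on module blocks, hence the longer name.
  type    = number
  default = 1
}

variable "access" {
  type    = list(string)
  default = []
}

variable "access_type" {
  # "personal": one user per node; "shared": all users on all nodes.
  type    = string
  default = "shared"

  validation {
    condition     = contains(["personal", "shared"], var.access_type)
    error_message = "The access_type value must be either \"personal\" or \"shared\"."
  }
}
```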