With this document, I'm proposing a plan for [this](https://github.com/rust-lang/simpleinfra/issues/458) issue.
This document uses the RFC [template](https://github.com/rust-lang/rfcs) to structure the proposal.
I'm using this structure just to organize my thoughts. There's no need to create an RFC for this change. 👍
> [!Note]
> I joined the infra team 3 weeks ago, so I don't have much context around crates-io. Please read this carefully and check that this plan truly achieves zero downtime (think about Route 53 DNS, CDNs, etc.).
# Summary
[summary]: #summary
Move the crates-io infrastructure from the legacy account to the `crates-io-prod` and `crates-io-staging` accounts with zero downtime.
# Motivation
[motivation]: #motivation
<!-- Why are we doing this? What use cases does it support? What is the expected outcome? -->
The infra team started the process of moving away from a "single account for everything" approach to a more fine-grained approach where different parts of the infrastructure are managed by different accounts.
This task is a continuation of that effort.
# Guide-level explanation
[guide-level-explanation]: #guide-level-explanation
<!-- Explain the proposal as if it was already included in the language and you were teaching it to another Rust programmer. That generally means:
- Introducing new named concepts.
- Explaining the feature largely in terms of examples.
- Explaining how Rust programmers should *think* about the feature, and how it should impact the way they use Rust. It should explain the impact as concretely as possible.
- If applicable, provide sample error messages, deprecation warnings, or migration guidance.
- If applicable, describe the differences between teaching this to existing Rust programmers and new Rust programmers.
- Discuss how this impacts the ability to read, understand, and maintain Rust code. Code is read and modified far more often than written; will the proposed feature make code easier to maintain?
For implementation-oriented RFCs (e.g. for compiler internals), this section should focus on how compiler contributors should think about the change, and give examples of its concrete impact. For policy RFCs, this section should provide an example-driven introduction to the policy, and explain its impact in concrete terms. -->
The following resources are moved from the [legacy](https://github.com/rust-lang/simpleinfra/tree/master/terragrunt/accounts/legacy) account to the [crates-io-prod](https://github.com/rust-lang/simpleinfra/tree/master/terragrunt/accounts/crates-io-prod) and [crates-io-staging](https://github.com/rust-lang/simpleinfra/tree/master/terragrunt/accounts/crates-io-staging) accounts.
Buckets moved to the `crates-io-prod` account:
- [crates-io](https://github.com/rust-lang/simpleinfra/blob/master/terragrunt/modules/crates-io/s3-static.tf)
- [crates-io-index](https://github.com/rust-lang/simpleinfra/blob/master/terragrunt/modules/crates-io/s3-index.tf)
- [rust-crates-io-logs](https://github.com/rust-lang/simpleinfra/blob/master/terragrunt/modules/crates-io/s3-logs.tf)
Buckets moved to the `crates-io-staging` account:
- `staging-crates-io`
- `staging-crates-io-index`
- `rust-staging-crates-io-logs`
We don't need to replicate the `crates-io-fallback` and `staging-crates-io-fallback` buckets because they will [repopulate](https://github.com/rust-lang/simpleinfra/blob/c0095970e96253f220a9e022c36fe24f3cca691e/terragrunt/modules/crates-io/s3-static.tf#L53) themselves.
> [!Note]
> For the other resources I won't specify whether they go in the prod or staging account. You get the idea.
The following hosted zones, with their records:
- [index](https://github.com/rust-lang/simpleinfra/blob/c0095970e96253f220a9e022c36fe24f3cca691e/terragrunt/modules/crates-io/cloudfront-index.tf#L73)
- [static](https://github.com/rust-lang/simpleinfra/blob/c0095970e96253f220a9e022c36fe24f3cca691e/terragrunt/modules/crates-io/cloudfront-static.tf#L105)
- [webapp](https://github.com/rust-lang/simpleinfra/blob/c0095970e96253f220a9e022c36fe24f3cca691e/terragrunt/modules/crates-io/cloudfront-webapp.tf#L97)
Other resources, including the ones in the same file:
- [cloudfront-index](https://github.com/rust-lang/simpleinfra/blob/c0095970e96253f220a9e022c36fe24f3cca691e/terragrunt/modules/crates-io/cloudfront-index.tf#L7)
- [cloudfront-static](https://github.com/rust-lang/simpleinfra/blob/c0095970e96253f220a9e022c36fe24f3cca691e/terragrunt/modules/crates-io/cloudfront-static.tf#L7)
- [cloudfront-webapp](https://github.com/rust-lang/simpleinfra/blob/c0095970e96253f220a9e022c36fe24f3cca691e/terragrunt/modules/crates-io/cloudfront-webapp.tf#L3)
- [fastly-iam-role](https://github.com/rust-lang/simpleinfra/blob/c0095970e96253f220a9e022c36fe24f3cca691e/terragrunt/modules/crates-io/fastly-iam-role.tf#L5)
- [fastly-static](https://github.com/rust-lang/simpleinfra/blob/c0095970e96253f220a9e022c36fe24f3cca691e/terragrunt/modules/crates-io/fastly-static.tf#L26)
- heroku [iam](https://github.com/rust-lang/simpleinfra/blob/c0095970e96253f220a9e022c36fe24f3cca691e/terragrunt/modules/crates-io/iam.tf#L7)
# Reference-level explanation
[reference-level-explanation]: #reference-level-explanation
<!-- This is the technical portion of the RFC. Explain the design in sufficient detail that:
- Its interaction with other features is clear.
- It is reasonably clear how the feature would be implemented.
- Corner cases are dissected by example.
The section should return to the examples given in the previous section, and explain more fully how the detailed proposal makes those examples work. -->
Here's the plan to move the resources with zero downtime.
> [!Warning]
> Do the following steps for *staging first*. Once you confirm the migration happened with zero downtime, do the same for prod.
> You can monitor the up-time with [infra-smoke-tests](https://github.com/jdno/infra-smoke-tests).
> [!Warning]
> This is a "high level plan". When dealing with terraform + aws there's always a level of uncertainty. Hopefully there won't be too many roadblocks.
## Step 1: Copy the Route 53 hosted zones to the new account
Follow [this](https://docs.aws.amazon.com/Route53/latest/DeveloperGuide/hosted-zones-migrating.html) guide to copy the following Route 53 hosted zones to the new account:
- [index](https://github.com/rust-lang/simpleinfra/blob/c0095970e96253f220a9e022c36fe24f3cca691e/terragrunt/modules/crates-io/cloudfront-index.tf#L73)
- [static](https://github.com/rust-lang/simpleinfra/blob/c0095970e96253f220a9e022c36fe24f3cca691e/terragrunt/modules/crates-io/cloudfront-static.tf#L105)
- [webapp](https://github.com/rust-lang/simpleinfra/blob/c0095970e96253f220a9e022c36fe24f3cca691e/terragrunt/modules/crates-io/cloudfront-webapp.tf#L97)
> [!Warning]
> Don't delete the old hosted zones yet (step 10 of the [aws guide](https://docs.aws.amazon.com/Route53/latest/DeveloperGuide/hosted-zones-migrating.html)).
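
The AWS guide linked above has you export the records of the old hosted zone, drop the apex `NS` and `SOA` records (the new zone gets its own), and re-create everything else in the new zone. Here is a minimal sketch of that filtering step. It assumes the input is the `ResourceRecordSets` list from `aws route53 list-resource-record-sets`; the function name `build_change_batch` and the `example.test` records are hypothetical and only there for illustration.

```python
import json


def build_change_batch(record_sets, zone_name):
    """Turn exported record sets into a CREATE change batch for the new zone.

    The apex NS and SOA records are skipped because Route 53 creates its own
    for the new hosted zone.
    """
    apex = zone_name.rstrip(".").lower() + "."
    changes = []
    for record in record_sets:
        is_apex = record["Name"].lower() == apex
        if is_apex and record["Type"] in ("NS", "SOA"):
            continue
        changes.append({"Action": "CREATE", "ResourceRecordSet": record})
    return {"Comment": f"Migrate {apex} from the legacy account", "Changes": changes}


# Hypothetical export, not real crates.io records.
exported = [
    {"Name": "example.test.", "Type": "NS", "TTL": 172800,
     "ResourceRecords": [{"Value": "ns-1.awsdns-00.com."}]},
    {"Name": "example.test.", "Type": "SOA", "TTL": 900,
     "ResourceRecords": [{"Value": "ns-1.awsdns-00.com. admin.example.test. 1 7200 900 1209600 86400"}]},
    {"Name": "www.example.test.", "Type": "CNAME", "TTL": 300,
     "ResourceRecords": [{"Value": "d111111abcdef8.cloudfront.net."}]},
]

batch = build_change_batch(exported, "example.test")
print(json.dumps(batch, indent=2))
```

The resulting JSON has the shape that `aws route53 change-resource-record-sets --change-batch` expects. Alias records that point at other records in the same hosted zone carry the old hosted zone ID and would need extra handling that this sketch leaves out.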
## Step 2: Create iam and buckets in the new accounts
iam:
- heroku [iam](https://github.com/rust-lang/simpleinfra/blob/master/terragrunt/modules/crates-io/iam.tf)
- [fastly-iam-role](https://github.com/rust-lang/simpleinfra/blob/c0095970e96253f220a9e022c36fe24f3cca691e/terragrunt/modules/crates-io/fastly-iam-role.tf#L5)
buckets:
- [s3-static](https://github.com/rust-lang/simpleinfra/blob/master/terragrunt/modules/crates-io/s3-static.tf) (crates-io)
- [s3-index](https://github.com/rust-lang/simpleinfra/blob/master/terragrunt/modules/crates-io/s3-index.tf) (crates-io-index)
- [s3-logs](https://github.com/rust-lang/simpleinfra/blob/master/terragrunt/modules/crates-io/s3-logs.tf) (rust-crates-io-logs)
## Step 3: Replicate the buckets
1. Create a [live replication](https://docs.aws.amazon.com/AmazonS3/latest/userguide/replication-walkthrough-2.html) to automatically replicate buckets from the `legacy` account to the new accounts.
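2. Wait for the replication mechanism to copy all the files from the old buckets to the new buckets. Live replication only applies to objects written after the rule is enabled, so the existing objects need to be copied separately, for example with S3 Batch Replication.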
3. Leave the mechanism running to keep copying new files.
4. Check the delay between a new package being published and it showing up in the new account, and decide if it's reasonable. After we switch CDNs, this delay is how long people have to wait after a `cargo publish` before `cargo install` or `cargo add` can see the new version.
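
To make the first item of the list above more concrete, here is a sketch of what a cross-account replication rule could look like. It only builds the configuration dictionary; the role ARN, account ID, and destination bucket name are placeholders, and the choice to replicate delete markers is an assumption, not something this plan has decided.

```python
import json


def build_replication_config(role_arn, dest_bucket, dest_account_id):
    """Build a replication configuration for a cross-account destination."""
    return {
        "Role": role_arn,
        "Rules": [
            {
                "ID": f"migrate-to-{dest_bucket}",
                "Priority": 1,
                "Status": "Enabled",
                # An empty prefix matches every object in the bucket.
                "Filter": {"Prefix": ""},
                # Assumption: deletions should follow to the new bucket too.
                "DeleteMarkerReplication": {"Status": "Enabled"},
                "Destination": {
                    "Bucket": f"arn:aws:s3:::{dest_bucket}",
                    "Account": dest_account_id,
                    # Make the destination account the owner of the replicas.
                    "AccessControlTranslation": {"Owner": "Destination"},
                },
            }
        ],
    }


# All identifiers below are placeholders.
config = build_replication_config(
    role_arn="arn:aws:iam::111111111111:role/hypothetical-replication-role",
    dest_bucket="hypothetical-new-crates-io",
    dest_account_id="222222222222",
)
print(json.dumps(config, indent=2))
```

This dictionary matches the shape of the `ReplicationConfiguration` argument of boto3's `put_bucket_replication`. Versioning has to be enabled on both the source and the destination bucket before the rule is accepted. The destination name is a placeholder on purpose: S3 bucket names are globally unique, so a new bucket cannot share the old bucket's name while the old one still exists.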
## Step 4: Switch CDNs
Switch Cloudfront and Fastly CDNs to point to the new buckets:
1. Switch Cloudfront [index](https://github.com/rust-lang/simpleinfra/blob/master/terragrunt/modules/crates-io/cloudfront-index.tf) to point to the new buckets (starting with Cloudfront is less risky because it has less traffic than Fastly).
2. Wait for one day to make sure everything is working as expected.
3. Switch Fastly to point to the new buckets.
4. Wait for one day to make sure everything is working as expected.
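
Before each switch in the list above, it may help to confirm that the new bucket actually contains everything the old one does. The following is a rough sketch of such a check, assuming you already have a key-to-ETag listing of both buckets (for example from paginated `list-objects-v2` output). The function name and the sample keys are illustrative.

```python
def compare_listings(old, new):
    """Compare two {key: etag} listings.

    Returns (missing, mismatched): keys absent from the new bucket, and keys
    present in both whose ETags differ.
    """
    missing = sorted(key for key in old if key not in new)
    mismatched = sorted(
        key for key in old if key in new and old[key] != new[key]
    )
    return missing, mismatched


# Illustrative listings, not real bucket contents.
old_listing = {
    "crates/foo/foo-1.0.0.crate": "etag-a",
    "crates/foo/foo-1.0.1.crate": "etag-b",
    "crates/bar/bar-0.1.0.crate": "etag-c",
}
new_listing = {
    "crates/foo/foo-1.0.0.crate": "etag-a",
    "crates/foo/foo-1.0.1.crate": "etag-x",
}

missing, mismatched = compare_listings(old_listing, new_listing)
print("missing:", missing)
print("mismatched:", mismatched)
```

ETags are only a cheap proxy here. They may not be directly comparable for every object, so a mismatch is a prompt to investigate rather than proof of corruption.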
## Step 5: Migrate Crates.io logs
- Change the Crates.io app to point to the new buckets of `rust-crates-io-logs` and `rust-staging-crates-io-logs`.
- Edit monitoring platforms (e.g. Datadog) to point to the new buckets.
## Step 6: Migrate Crates.io app
Change the Crates.io app to use the new `crates-io` and `crates-io-index` buckets.
## Step 7: Delete the buckets replication mechanism
Delete the replication you created in the [replication](#Step-3-Replicate-the-buckets) step.
## Step 8: Delete the old buckets
If you are migrating the staging environment:
- In the [crates-io-staging](https://github.com/rust-lang/simpleinfra/tree/master/terragrunt/accounts/legacy/crates-io-staging/crates-io) terragrunt state, delete the buckets we moved earlier:
  - `staging-crates-io`
  - `staging-crates-io-index`
  - `rust-staging-crates-io-logs`
If you are migrating the production environment:
- In the [crates-io-prod](https://github.com/rust-lang/simpleinfra/tree/master/terragrunt/accounts/legacy/crates-io-prod) terragrunt state, delete the buckets we moved earlier:
  - `crates-io`
  - `crates-io-index`
  - `rust-crates-io-logs`
## Step 9: Move cloudfront CDN to the new AWS account
1. Move CDN balance to 100% Fastly. I.e. Fastly serves all the traffic.
2. Delete the cloudfront distribution from the legacy account.
3. Create a new cloudfront distribution in the new account. Test it.
4. Restore the previous CDN balance.
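
The "CDN balance" in the steps above can be reasoned about with a small calculation. The sketch below assumes the balance is implemented with Route 53 weighted records, which this document does not confirm; the weights shown are made up. With weighted routing, each record receives a share of traffic equal to its weight divided by the sum of all weights.

```python
def traffic_share(weights):
    """Return the fraction of DNS answers each weighted record receives."""
    total = sum(weights.values())
    if total == 0:
        # Route 53 answers with equal probability when every weight is 0.
        return {name: 1 / len(weights) for name in weights}
    return {name: weight / total for name, weight in weights.items()}


# Hypothetical weights before the move, and during it (all traffic on Fastly).
before = traffic_share({"cloudfront": 20, "fastly": 80})
during = traffic_share({"cloudfront": 0, "fastly": 100})

print("before:", before)
print("during:", during)
```

Under that assumption, setting the Cloudfront weight to 0 only affects new DNS answers. Resolvers that cached an earlier answer keep using it until the record's TTL expires, so it makes sense to wait at least that long before deleting the distribution.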
## Step 10: Move the Fastly CDN to the new AWS account
1. Delete the Fastly CDN resources from the legacy account terraform state (only the state; we don't want to delete the actual resources).
2. [Import](https://developer.hashicorp.com/terraform/language/import) the Fastly CDN resources to the new account.
> [!NOTE]
> By editing the terraform state only, we don't need to edit anything in Fastly, ensuring a smooth transition.
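
As a rough illustration of the state-only move described above, the sketch below generates the `terraform state rm` and `terraform import` commands for a set of resources. The resource address and the import ID are placeholders, and it uses the CLI `import` command rather than the `import` block linked in this step; with terragrunt, the same subcommands would be run through `terragrunt` in the respective account directories.

```python
import shlex


def state_move_commands(resources):
    """Build the commands to move resources between two terraform states.

    `resources` maps a terraform resource address to the ID used to import it.
    Returns (remove, imports): commands to run against the legacy state and
    against the new account's state, in that order.
    """
    remove = [["terraform", "state", "rm", address] for address in resources]
    imports = [
        ["terraform", "import", address, import_id]
        for address, import_id in resources.items()
    ]
    return remove, imports


# Placeholder address and ID, not the real crates.io Fastly service.
remove, imports = state_move_commands(
    {"fastly_service_vcl.static": "SERVICE_ID_PLACEHOLDER"}
)

print("# run in the legacy account:")
for command in remove:
    print(shlex.join(command))
print("# run in the new account:")
for command in imports:
    print(shlex.join(command))
```

After the import, a plan in the new account that shows no changes is a good sign that the imported state matches the configuration.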
## Step 11: Delete the old hosted zones
Follow step 10 of the [aws guide](https://docs.aws.amazon.com/Route53/latest/DeveloperGuide/hosted-zones-migrating.html#hosted-zones-migrating-delete-old-hosted-zone).
> [!Warning]
> Wait 48 hours for the new hosted zones to propagate before deleting the old ones.
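
The deletion condition above can be written down as a small check. This is only a sketch: it assumes you have looked up the name servers that the domain is currently delegated to and the name servers of the new hosted zone, and all names and timestamps below are made up.

```python
from datetime import datetime, timedelta, timezone


def normalize(name_servers):
    """Lowercase name servers and drop trailing dots so they compare cleanly."""
    return {ns.rstrip(".").lower() for ns in name_servers}


def safe_to_delete_old_zone(delegated_ns, new_zone_ns, switched_at, now,
                            wait=timedelta(hours=48)):
    """True if the delegation points at the new zone and the wait has passed."""
    delegation_ok = normalize(delegated_ns) == normalize(new_zone_ns)
    waited_long_enough = now - switched_at >= wait
    return delegation_ok and waited_long_enough


# Made-up values for illustration.
new_zone_ns = ["ns-1.awsdns-00.com.", "ns-2.awsdns-00.net."]
delegated_ns = ["NS-2.awsdns-00.net", "ns-1.awsdns-00.com"]
switched_at = datetime(2024, 1, 1, 12, 0, tzinfo=timezone.utc)

too_early = safe_to_delete_old_zone(
    delegated_ns, new_zone_ns, switched_at, switched_at + timedelta(hours=24)
)
late_enough = safe_to_delete_old_zone(
    delegated_ns, new_zone_ns, switched_at, switched_at + timedelta(hours=48)
)
print(too_early, late_enough)
```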
# Drawbacks
[drawbacks]: #drawbacks
<!-- Why should we *not* do this? -->
- *Time*: It will take weeks to get this done.
- *Risk*: It's a risky operation. If something goes wrong, we could have downtime in crates.io. We need to test staging before doing the prod migration.
- *Opportunity cost*: We could work on other things instead, such as updating our servers or terraform providers.
# Rationale and alternatives
[rationale-and-alternatives]: #rationale-and-alternatives
<!-- - Why is this design the best in the space of possible designs?
- What other designs have been considered and what is the rationale for not choosing them?
- What is the impact of not doing this?
- If this is a language proposal, could this be done in a library or macro instead? Does the proposed change make Rust code easier or harder to read, understand, and maintain? -->
- There might be an easier way to do this with scheduled downtime, but we don't want crates.io to stop working.
- If we don't do this task, we won't remove the "legacy account" tech debt.
# Prior art
[prior-art]: #prior-art
Blank.
<!-- Discuss prior art, both the good and the bad, in relation to this pr
A few examples of what this can include are:
- For language, library, cargo, tools, and compiler proposals: Does this feature exist in other programming languages and what experience have their community had?
- For community proposals: Is this done by some other community and what were their experiences with it?
- For other teams: What lessons can we learn from what other communities have done here?
- Papers: Are there any published papers or great posts that discuss this? If you have some relevant papers to refer to, this can serve as a more detailed theoretical background.
This section is intended to encourage you as an author to think about the lessons from other languages, provide readers of your RFC with a fuller picture.
If there is no prior art, that is fine - your ideas are interesting to us whether they are brand new or if it is an adaptation from other languages.
Note that while precedent set by other languages is some motivation, it does not on its own motivate an RFC.
Please also take into consideration that rust sometimes intentionally diverges from common language features. -->
# Unresolved questions
[unresolved-questions]: #unresolved-questions
<!-- - What parts of the design do you expect to resolve through the RFC process before this gets merged?
- What parts of the design do you expect to resolve through the implementation of this feature before stabilization?
- What related issues do you consider out of scope for this RFC that could be addressed in the future independently of the solution that comes out of this RFC? -->
- What monitoring platforms do we need to migrate?
- Is there a smarter way to migrate Cloudfront to the new account? I imagine that deleting and recreating it isn't great from a caching point of view.
- Do we have friends at AWS who are familiar with our stack and can review this plan to confirm that it achieves zero downtime?
# Future possibilities
[future-possibilities]: #future-possibilities
<!-- Think about what the natural extension and evolution of your proposal would
be and how it would affect the language and project as a whole in a holistic
way. Try to use this section as a tool to more fully consider all possible
interactions with the project and language in your proposal.
Also consider how this all fits into the roadmap for the project
and of the relevant sub-team.
This is also a good place to "dump ideas", if they are out of scope for the
RFC you are writing but otherwise related.
If you have tried and cannot think of any future possibilities,
you may simply state that you cannot think of anything.
Note that having something written down in the future-possibilities section
is not a reason to accept the current or a future RFC; such notes should be
in the section on motivation or rationale in this or subsequent RFCs.
The section merely provides additional information. -->
In the future we want to move other parts of the legacy account to different accounts as well.