# carpark upgrade plan
## Motivation
> Why do this?
The main goal is to create `carpark-prod-1` in R2, so that we have a new bucket dedicated to w3up uploads.
We can also leverage this change to move us towards the future we are looking for with w3up:
- writing to R2 directly
- fully decentralize the location of CAR files at rest
## `carpark-prod-0` bindings
> How is the current bucket in use?
### Write pipelines
- `https://api.web3.storage` for writes
- `https://api.nft.storage` for writes
- `w3infra replicator` for writes happening via `w3up`
### Read pipelines
- `https://freeway.dag.haus` for reads
- `https://dag.w3s.link` for reads (via `https://freeway.web3.storage`)
- `https://hoverboard.dag.haus`
- `w3infra roundabout` for reads, as a redirect to presigned URLs
## w3s pipelines zoom in
> Dive deep into how each binding is used
### Writes
Given that the core motivation of these changes is to segment uploads from the new API, we can assume that the old API deployments will continue to write to `carpark-prod-0`.
- `https://api.web3.storage` for writes to old bucket ✅
- `https://api.nft.storage` for writes to old bucket ✅
Therefore, only the replicator needs to be considered (in case we can't write directly to R2).
### Reads
While the write pipelines are quite straightforward to update, the read pipelines are highly coupled with `carpark-prod-0`. The indexes available to these read pipelines either do not specify the bucket where content lives or assume it by default (except for Hoverboard, which has some more information).
#### Roundabout
Roundabout relies on `carpark-prod-0` as the source bucket to redirect requests to, although it also allows requesting content from a custom bucket via a query parameter. However, the end user MAY not know where the content lives.
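For illustration, here is a minimal sketch of what such a redirect handler can look like against the S3-compatible R2 endpoint. The `bucket` query parameter name, the `<carCid>/<carCid>.car` key shape, and the endpoint/credential placeholders are assumptions, not the actual Roundabout code:

```typescript
import { S3Client, GetObjectCommand } from '@aws-sdk/client-s3'
import { getSignedUrl } from '@aws-sdk/s3-request-presigner'

// Sketch of the current behaviour: presign a GET for the requested CAR,
// defaulting to carpark-prod-0 unless a custom bucket is passed via query
// parameter. Endpoint, credentials, parameter name and key shape are assumptions.
const s3 = new S3Client({
  region: 'auto',
  endpoint: 'https://<ACCOUNT_ID>.r2.cloudflarestorage.com',
  credentials: { accessKeyId: '<ACCESS_KEY_ID>', secretAccessKey: '<SECRET_ACCESS_KEY>' }
})

export async function handleCarRequest (request: Request): Promise<Response> {
  const url = new URL(request.url)
  const carCid = url.pathname.split('/').pop() ?? ''
  // The end user MAY not know which bucket holds the CAR, so the default matters.
  const bucket = url.searchParams.get('bucket') ?? 'carpark-prod-0'
  const signedUrl = await getSignedUrl(
    s3,
    new GetObjectCommand({ Bucket: bucket, Key: `${carCid}/${carCid}.car` }),
    { expiresIn: 3600 }
  )
  return Response.redirect(signedUrl, 302)
}
```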
#### Freeway
Freeway is highly coupled with `carpark-prod-0`, given it relies on the side indexes in `dudewhere-prod-0` and `satnav-prod-0` to know how to build a response for the client. Unfortunately, when we designed this solution we did not create `dudewhere-prod-0` and `satnav-prod-0` with keys prefixed by bucket name, and `dudewhere-prod-0` only contains the name of the key within the bucket.
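A rough sketch of the lookup this coupling implies, assuming a Worker R2 binding named `DUDEWHERE` (types from `@cloudflare/workers-types`) and a `<rootCid>/<carCid>` key shape; the binding name and key shape are assumptions for illustration:

```typescript
// Illustrative sketch only: binding names and key shapes are assumptions.
// Types such as R2Bucket come from @cloudflare/workers-types.
interface Env {
  DUDEWHERE: R2Bucket // root CID → CAR CIDs side index
  CARPARK: R2Bucket   // CAR files (carpark-prod-0); satnav holds per-CAR block indexes
}

// dudewhere keys only name the object key *within* the bucket (no bucket
// prefix), which is what couples this index to carpark-prod-0.
async function carsForRoot (env: Env, rootCid: string): Promise<string[]> {
  const listing = await env.DUDEWHERE.list({ prefix: `${rootCid}/` })
  // Assumed key shape `<rootCid>/<carCid>`; strip the prefix to get CAR CIDs.
  return listing.objects.map(o => o.key.slice(rootCid.length + 1))
}
```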
#### Hoverboard
Hoverboard was built as a replacement for E-IPFS, running in CF infrastructure instead of AWS. It relies on the same block indexes that `E-IPFS` uses. Currently the indexes tell us where blocks live in S3, and we just optimistically try R2 first, falling back to S3 if not found in R2.
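A minimal sketch of that optimistic read, assuming a block index record carrying the CAR key plus byte offset/length and a `CARPARK` R2 binding; the record shape, binding name, and S3 base URL are assumptions:

```typescript
// Sketch of the R2-first, S3-fallback read. The index record shape and the
// CARPARK binding name are assumptions; types come from @cloudflare/workers-types.
interface BlockIndexRecord {
  key: string    // object key of the CAR containing the block
  offset: number // byte offset of the block within the CAR
  length: number // byte length of the block
}

async function readBlock (
  env: { CARPARK: R2Bucket },
  s3BaseUrl: string,
  rec: BlockIndexRecord
): Promise<Uint8Array> {
  // Optimistically try R2 first...
  const fromR2 = await env.CARPARK.get(rec.key, {
    range: { offset: rec.offset, length: rec.length }
  })
  if (fromR2) return new Uint8Array(await fromR2.arrayBuffer())

  // ...and fall back to S3 when the CAR is not found in R2.
  const res = await fetch(`${s3BaseUrl}/${rec.key}`, {
    headers: { range: `bytes=${rec.offset}-${rec.offset + rec.length - 1}` }
  })
  if (!res.ok) throw new Error(`block not found: ${rec.key}@${rec.offset}`)
  return new Uint8Array(await res.arrayBuffer())
}
```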
## Proposal
> What should we do and how?
We have built indexes in multiple ways, as we can see in our read pipelines. With all these learnings, we built the [Content Claims Protocol](https://hackmd.io/@gozala/content-claims), which we should start to use heavily.
The core of this proposal is to rely on the Content Claims Protocol [Location Claims](https://hackmd.io/@gozala/content-claims#Location-Claims) to keep track of where a CAR file is at rest. This SHOULD be the way forward for new content uploaded via w3up.
### Writes
On the `w3infra replicator` side of things, a new `w3infra` deployment swapping `R2_CARPARK_BUCKET_NAME` to the new name should be enough. An event should be triggered on write into `carpark-prod-1` to invoke `assert/location` and let the `content-claims` service know that this CAR file is now at rest in `carpark-prod-1` (in CF). We also write the `dudewhere` index on `upload/add`, which should be updated to use [Partition Claims](https://hackmd.io/@gozala/content-claims#Partition-Claims) so that we can know which CAR files at rest hold the DAG for a given Root CID.
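A hedged sketch of both claim issuances, with a hypothetical `issueClaim` helper standing in for the real UCAN invocation against the `content-claims` service (its signature, the caveat field names, and the location URL shape are assumptions):

```typescript
// Hypothetical sketch: `issueClaim` stands in for the actual UCAN invocation
// against the content-claims service; its signature and caveat field names
// are assumptions, not the real client API.
declare function issueClaim (claim: {
  can: 'assert/location' | 'assert/partition'
  nb: Record<string, unknown>
}): Promise<void>

// Called after the replicator finishes copying a CAR into carpark-prod-1.
export async function onCarReplicated (carCid: string): Promise<void> {
  await issueClaim({
    can: 'assert/location',
    nb: {
      content: carCid,
      // Where the CAR is now at rest (in CF / R2); URL shape is an assumption.
      location: [`https://carpark-prod-1.example.r2.cloudflarestorage.com/${carCid}/${carCid}.car`]
    }
  })
}

// Called from the upload/add handler, replacing the dudewhere index write.
export async function onUploadAdd (rootCid: string, shardCids: string[]): Promise<void> {
  await issueClaim({
    can: 'assert/partition',
    nb: { content: rootCid, parts: shardCids }
  })
}
```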
Currently, we rely on the [E-IPFS indexer topic queue lambda](https://github.com/web3-storage/w3infra/blob/main/stacks/carpark-stack.js#L36) to create block-level indexes. This is also where we verify that block hashes are correct. We MAY consider moving to a flow where a "triage" bucket validates the CAR blocks before they are copied into a destination bucket. This would also allow us to stop relying on old E-IPFS infrastructure code, and to issue, as part of the validation step, Location claims for where each block lives in a given CAR file.
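Below is a sketch of what the triage step could look like, assuming `TRIAGE` and `CARPARK` R2 bindings in a Worker; binding names and the key shape are assumptions, and per-block claim issuance is only indicated by a comment:

```typescript
import { CarBlockIterator } from '@ipld/car'
import { sha256 } from 'multiformats/hashes/sha2'
import { equals } from 'multiformats/bytes'

// Sketch only: binding names and key shape are assumptions; R2Bucket comes
// from @cloudflare/workers-types.
export async function triageCar (
  env: { TRIAGE: R2Bucket, CARPARK: R2Bucket },
  key: string
): Promise<void> {
  const object = await env.TRIAGE.get(key)
  if (!object) throw new Error(`missing CAR in triage bucket: ${key}`)
  const bytes = new Uint8Array(await object.arrayBuffer())

  // Verify every block hash before the CAR is accepted into the destination bucket.
  const blocks = await CarBlockIterator.fromBytes(bytes)
  for await (const block of blocks) {
    if (block.cid.multihash.code === sha256.code) {
      const digest = await sha256.digest(block.bytes)
      if (!equals(digest.digest, block.cid.multihash.digest)) {
        throw new Error(`hash mismatch for block ${block.cid} in ${key}`)
      }
    }
    // This is also where per-block Location claims (CAR key + byte range)
    // would be issued as part of the validation step.
  }

  // Validation passed: copy the CAR into the destination bucket.
  await env.CARPARK.put(key, bytes)
}
```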
Note that when we migrate to writing directly into R2, we MAY just need to modify the place where `assert/location` is invoked, depending on whether `assert/location` is invoked in the replicator scope or R2 bucket events are supported.
### Reads
Moving into a world where Location claims are issued in the write pipeline, all our read pipelines SHOULD start to rely on content claims to learn the location of requested content, instead of the current indexes.
Note that we MAY live in a world where multiple claims exist for a given CID. At the moment of writing, the `content-claims` service can only be used internally and we can trust the claims. However, in the future we MAY need to build a trust layer.
#### Roundabout
Roundabout serves only CAR files. Therefore, it only needs to rely on **Location Claims** to learn the locations of a given CAR file.
#### Freeway
Freeway serves content by Root CID. The DAG represented by a given Root CID may span multiple CAR files. As a result, Freeway will need to rely on **Partition Claims** to learn the set of CARs where the DAG is at rest. Once the CAR CIDs are known, Freeway needs to rely on **Location Claims** to learn the locations of each CAR file.
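A hedged sketch of this two-step resolution, with a hypothetical `readClaims` function standing in for the content-claims read client (the claim shapes below are simplified assumptions):

```typescript
// Hypothetical sketch: `readClaims(cid)` stands in for querying the
// content-claims service; the claim shapes are simplified assumptions.
type Claim =
  | { type: 'assert/partition', parts: string[] }
  | { type: 'assert/location', location: string[] }

declare function readClaims (cid: string): Promise<Claim[]>

// Resolve a Root CID to the URLs of the CARs holding its DAG.
export async function resolveRoot (rootCid: string): Promise<string[]> {
  const rootClaims = await readClaims(rootCid)
  const partition = rootClaims.find(c => c.type === 'assert/partition')
  if (!partition || partition.type !== 'assert/partition') {
    throw new Error(`no partition claim for ${rootCid}`)
  }
  const locations: string[] = []
  for (const carCid of partition.parts) {
    for (const claim of await readClaims(carCid)) {
      if (claim.type === 'assert/location') locations.push(...claim.location)
    }
  }
  return locations
}
```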
#### Hoverboard
Hoverboard receives requests at the block level. It currently relies on indexes generated by the E-IPFS indexer topic queue. Once we write **Location Claims** mapping block CIDs to CAR locations and byte ranges, Hoverboard can rely on these, instead of the indexes currently in use, to learn where a requested block lives.
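A sketch of a block read once such claims exist, again with a hypothetical `readClaims` helper and a simplified claim shape carrying the CAR URL and byte range:

```typescript
// Hypothetical sketch: claim shape and `readClaims` helper are assumptions.
// A block-level location claim points at a CAR URL plus the byte range of
// the block inside that CAR.
interface BlockLocationClaim {
  type: 'assert/location'
  location: string[]
  range: { offset: number, length: number }
}

declare function readClaims (blockCid: string): Promise<BlockLocationClaim[]>

export async function readBlockViaClaims (blockCid: string): Promise<Uint8Array> {
  const [claim] = await readClaims(blockCid)
  if (!claim) throw new Error(`no location claim for block ${blockCid}`)
  const { offset, length } = claim.range
  const res = await fetch(claim.location[0], {
    headers: { range: `bytes=${offset}-${offset + length - 1}` }
  })
  if (!res.ok) throw new Error(`failed to read block ${blockCid}`)
  return new Uint8Array(await res.arrayBuffer())
}
```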
### Historical content plan
> Should we re-index all old stored data to the new world, or fallback to current behaviour?
Assuming that older data is accessed less and less, we can decide whether we should bear the development and operational cost of performing this migration, or rely on keeping the old code as a fallback when claims are not available.
In this proposal, I suggest we start by falling back to the old code so that we can ship things iteratively. Moreover, we will need to compute Piece CIDs for old content in order to put this content in deals. Therefore, it might be a good idea to merge both efforts: since we will need to read and process the CARs anyway, we can issue claims for them at the same time.
## Milestones
1. Make read pipelines ready to rely on the Content Claims Protocol
1.1. Add support for roundabout
1.2. Add support for freeway
1.3. Add support for hoverboard
2. Make write pipelines issue content claims
2.1. Issue `assert/location` on replicator / R2 Bucket event if available
2.2. Issue `assert/partition` on `upload/add` handler
2.3. TBD: Create triage bucket OR refactor E-IPFS indexing lambda to write R2 specific indexes
3. Create new Bucket and wire it with replicator
4. Handle historical content
5. Write to R2 directly
## Other notes
We rely on redirects to presigned URLs in Roundabout. Depending on the APIs in play, we could consider doing the same elsewhere. `dag.w3s.link` would probably be a good place to start saving some egress.