Getting off web3 cluster

The web3.storage ipfs cluster contains ~320TiB of:

  • All DAGs pinned via the pinning service
  • All DAGs for direct uploads from day 1 until 2022-09-21 (when we switched to writing them to R2)

Some care must be taken to get the data off cluster as it is the only thing storing:

  • DAGs pinned via the pinning service
  • direct uploads created before 2021-09-27 (after which we started writing uploads to S3 as well as cluster).

The postgres db in heroku is the source of truth for all the root CIDs that users have added to the system.

Goals

  1. Every DAG for every root CID that has been added to cluster by a web3.storage user is copied to an S3 bucket where it can be indexed and announced by E-IPFS.
  2. The existing api.web3.storage api correctly reports that every root CID is "pinned" on E-IPFS rather than a specific cluster node. (either the db is updated or the api code is changed to hardcode e-ipfs in the response.)

Risks

  • time to copy ~320TiB off of a set of ipfs nodes to S3 could be greater than 3 months!
    • We're running a backfill job for the set of uploads added before we started writing to S3 in parallel to cluster, and it's been going for a year! It's not been optimised, but it points to a risk we need to manage.
  • data corruption
    • when we write a CAR to AWS we must send a checksum to ensure the write fails if a bit is flipped in transit. This has happened to us. (See the sketch after this list.)
    • when we export a CAR from the ipfs daemon on a cluster node, we have to verify that the block bytes we receive match the block CIDs, as bits can get flipped in transit.
      • IPFS won't give us a CAR CID for an export, so we have to verify each block.
      • Alternatively we can use dagula or an ipfs impl on the receiving end, which would verify each block as it is received.
  • cost
    • we will need to ingest a large number of blocks into E-IPFS very quickly, and last time someone tried that it caused a noticeable spike in our AWS bill.
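
As a mitigation sketch for the corruption risk above: the AWS SDK v3 lets us send a SHA-256 checksum with a PutObject so S3 fails the write if the bytes don't match. This is a minimal sketch, not the actual backup code; the bucket and key are placeholders, and the multipart lib-storage path (see Gotchas) has its own checksum options.

```ts
// Minimal sketch: upload a CAR to S3 with a SHA-256 checksum so S3 rejects
// the write if a bit was flipped in transit. Bucket/key are placeholders.
import { S3Client, PutObjectCommand } from '@aws-sdk/client-s3'
import { createHash } from 'node:crypto'

async function putCarWithChecksum (s3: S3Client, bucket: string, key: string, carBytes: Uint8Array) {
  // S3 expects the checksum as the base64-encoded SHA-256 of the body.
  const checksum = createHash('sha256').update(carBytes).digest('base64')
  await s3.send(new PutObjectCommand({
    Bucket: bucket,
    Key: key,
    Body: carBytes,
    // S3 recomputes this server-side and fails the PUT on mismatch.
    ChecksumSHA256: checksum
  }))
}
```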

Prior art

DAG deduplication rate

As of 2023-01-10T15:13, the sum of used space in bytes across the web3.storage cluster nodes is 1169851708526592, the sum of our dag_sizes is 398108030210654, and our replication factor in cluster is 3, so our deduplication rate is only about 2%:

(totalDagSizeBytes × replicationFactor) / totalBytesStored = (398108030210654 × 3) / 1169851708526592 = 1194324090631962 / 1169851708526592 ≈ 1.02

see: https://observablehq.com/d/d9f9d01c9e2c51f6

source: our node_exporter metrics for the ipfs nodes and our web3storage_content_bytes_total from: https://daghouse.grafana.net/goto/fwW0XA24z?orgId=1

Proposed Steps

This is TBC and intended to provoke better ideas and flush out concerns at this stage.

the first 3 headings can happen in parallel

Stop writing to cluster

  1. (nearform,oli) Land batch CID status query handler PR on pickup
  2. (oli) Send all pinning service requests to pickup, proxied to cluster (web3.storage and nft.storage)
  3. (nearform) Send some traffic to pickup, monitor and optimise.
  4. (nearform) pickup handles 100% of pin requests.

Find out the total dag_size we have to migrate

  1. (nearform alanshaw) Query the web3.storage & nft.storage postgres DBs to calculate the total dag_size we need to migrate.
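
A hedged sketch of that query for web3.storage, using the tables and columns listed in the "DB schema" section below (upload, content, backup) and applying the "check for the existence of either" note there, so we only count DAGs that still need copying. The column names and the backup_urls array type are assumptions to verify against the live schema; nft.storage is similar, and psa_pin_requests would need their own query (none are backed up yet).

```ts
// Sketch only: total dag_size still to migrate for web3.storage uploads.
// Assumes upload.backup_urls is an array column and backup.upload_id exists.
import { Client } from 'pg'

async function dagSizeToMigrate (connectionString: string): Promise<bigint> {
  const client = new Client({ connectionString })
  await client.connect()
  try {
    const { rows } = await client.query(`
      SELECT COALESCE(SUM(c.dag_size), 0) AS total
        FROM upload u
        JOIN content c ON c.cid = u.content_cid
       WHERE (u.backup_urls IS NULL OR array_length(u.backup_urls, 1) IS NULL) -- no direct-upload copy
         AND NOT EXISTS (SELECT 1 FROM backup b WHERE b.upload_id = u.id)      -- no backup-script copy
    `)
    return BigInt(rows[0].total) // pg returns SUM as a string
  } finally {
    await client.end()
  }
}
```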

Run backup test run

  1. (alanshaw) Update https://github.com/nftstorage/backup to work with the current environment. Write to the dotstorage-prod-1 bucket and check uploads.backup_urls
  2. (nearform) Kick off backup process, and let it run for 24hrs.
  3. (nearform) Find out what dag_size it has copied so far and project how long the total will take (see the sketch after this list).
  4. If too long, determine whether optimising and running multiple backup instances would be sufficient, or flag it so we can come up with another plan.
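
A back-of-envelope sketch of the projection in step 3, with placeholder numbers rather than measurements:

```ts
// If the 24h test run copied `bytesIn24h`, how long would ~320TiB take?
const TiB = 1024 ** 4
const totalBytes = 320 * TiB   // from the cluster size estimate above
const bytesIn24h = 2 * TiB     // hypothetical test-run throughput
const projectedDays = totalBytes / bytesIn24h
console.log(`~${Math.ceil(projectedDays)} days at the observed rate`)
```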

Run backup for real

  1. (nearform) Kick off backup process, and let it run to completion for both nft.storage and web3.storage
  2. (nearform) Report on progress each week while tracking as expected. Flag early if falling behind schedule.

Update web3.storage and nft.storage

Update all the things that query cluster, or cache pin location info.

this list is incomplete

  • ensure that api returns the right info for pin status
    • either update the pins table so each row points to a pin_location for e-ipfs rather than n cluster nodes.
    • or hardcode e-ipfs in the response from the api (see the sketch after this list).
  • ensure that api returns the right info for pinning service get requests
  • update cron-pin-status to check status on pickup
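
A hypothetical sketch of the "hardcode e-ipfs" option: regardless of which pin_location rows the db still holds, the API reports a single Elastic IPFS peer. The peer id, name and response shape here are placeholders, not the real web3.storage API types.

```ts
// Hypothetical: collapse per-cluster-node pin rows into one E-IPFS entry.
type PinStatus = { status: string, peerId: string, peerName: string, region: string | null }

// Placeholder values; the real E-IPFS peer id would go here.
const E_IPFS = { peerId: '12D3Koo...', peerName: 'elastic-ipfs', region: null }

function toPinsResponse (dbStatus: string): PinStatus[] {
  return [{ status: dbStatus, ...E_IPFS }]
}
```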

Out of scope

Things that were considered but currently won't be tackled

Replicating to R2

This is currently out of scope

To enable fast retrieval via freeway we may also want to:

  • add <pickup bucket> to w3infra CAR indexing and replicator config to copy CARs and indexes to R2 for fast retrieval via freeway.
  • add dotstorage-prod-1 to w3infra CAR indexing and replicator config to copy CARs and indexes to R2 for fast retrieval via freeway.
    • This would duplicate work that the CF worker api is doing currently, so uploads via it would be written twice unless we add checks or delete that code from the api.

DB schema

Where to look to find how much DAG data we need to move!

Note!

  • We now write all uploads to s3 and record that url in upload.backup_urls
  • Our old backup job has been copying older uploads to S3 for a while. It writes to backup.url so the query will need to check for the existence of either.

Also note!

  • nft.storage tracks pinning service pins in its uploads table and uses type === Remote to discriminate. As such, some of them may already be backed up to S3 by the backup job.
  • web3.storage tracks pinning service pins in a psa_pin_requests table, separate from uploads. None of these have been backed up yet. There is no backup_urls column on that table.

web3.storage

  • upload.backup_urls - where we set s3 url for direct uploads via api
  • backup.url - where the backup script has been tracking backups. Join on backup.upload_id
  • psa_pin_request - where we record pinning service pins. None backed up yet.
  • content.dag_size - join upload.content_cid on content.cid to get the dag_size.

nft.storage

  • upload.backup_urls - where we set s3 url for direct uploads via api
  • backup.url - where the backup script has been tracking backups. Join on backup.upload_id
  • upload.type === Remote - where we record pinning service pins. some may be backed up already.
  • content.dag_size - join upload.content_cid on content.cid to get the dag_size.

Notes

We could use the postgres db to define the list of pins to copy, or we could aim to copy every pin that cluster has.

It is likely that cluster has some pins on it that were added from dev or testing that are not in the db and we don't care about.

The db is the source of truth for the set of pins we really care about. However we tackle it we must verify that we have a copy of every pin that is in the DB to ensure no user data is lost.

we can

  • capture the list of root cids to fetch from the db. Use as static input to process.
  • fetch each DAG as a CAR from cluster.
    • what happens if we have a partial DAG? Should we exclude those from this process?
    • we need to send the request to the node that has the pin; we have this info in the db.
  • verify the blocks in the CAR, then calculate the CAR CID.
  • write the CAR to an S3 bucket that E-IPFS will index, e.g. dotstorage-prod-1 (see the sketch below).

(what bucket to write to?)
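
A sketch of one unit of that first option, assuming Node 18+ and the @ipld/car and multiformats libraries: export the DAG as a CAR via the daemon's HTTP API, verify that every block hashes to its CID, and compute the CAR CID. It buffers the whole CAR and only handles sha2-256 hashes, so it is an illustration rather than production code; the resulting bytes would then be written to S3 with a checksum as in the earlier sketch.

```ts
// Sketch only: export a DAG as a CAR from a cluster node's ipfs daemon,
// verify each block, and compute the CAR CID.
import { CarReader } from '@ipld/car'
import { CID } from 'multiformats/cid'
import { sha256 } from 'multiformats/hashes/sha2'
import { equals } from 'multiformats/bytes'

const CAR_CODEC = 0x0202 // multicodec code for CAR

async function exportAndVerify (ipfsApiUrl: string, rootCid: string) {
  // `ipfs dag export` over the HTTP API streams the DAG as a CAR.
  const res = await fetch(`${ipfsApiUrl}/api/v0/dag/export?arg=${rootCid}`, { method: 'POST' })
  if (!res.ok) throw new Error(`dag export failed: HTTP ${res.status}`)
  const carBytes = new Uint8Array(await res.arrayBuffer()) // sketch: buffers the whole DAG

  // Verify each block: the bytes we received must hash to the CID in the CAR.
  const reader = await CarReader.fromBytes(carBytes)
  for await (const { cid, bytes } of reader.blocks()) {
    if (cid.multihash.code !== sha256.code) throw new Error(`unexpected hash fn in ${cid}`)
    const digest = await sha256.digest(bytes)
    if (!equals(digest.bytes, cid.multihash.bytes)) throw new Error(`bad block bytes for ${cid}`)
  }

  // The CAR CID is a CIDv1 over the CAR file bytes using the CAR codec.
  const carCid = CID.createV1(CAR_CODEC, await sha256.digest(carBytes))
  return { carCid, carBytes } // write carBytes to S3 with a checksum, as sketched earlier
}
```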

or

  • modify the backup tool to go faster (see the concurrency sketch after this list)
  • use predefined list of CIDs to fetch.
  • skip updating the db as we go? Can do it in one shot once the migration is done.
  • tweak ipfs config to set sync:false
    • turn off dht publishing etc.
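
A sketch of the "go faster" idea: drive a predefined CID list with bounded concurrency and leave the db update to a single pass afterwards. `migrateOne` stands in for whatever does the per-CID export/verify/upload, and the concurrency number is a placeholder to tune against cluster and S3 limits.

```ts
// Sketch: N workers pull CIDs from a shared queue until it's drained.
const CONCURRENCY = 8 // placeholder; tune against cluster/S3 limits

async function migrateAll (cids: string[], migrateOne: (cid: string) => Promise<void>): Promise<string[]> {
  const queue = [...cids]
  const failed: string[] = []
  await Promise.all(Array.from({ length: CONCURRENCY }, async () => {
    while (queue.length > 0) {
      const cid = queue.shift() as string
      try {
        await migrateOne(cid)
      } catch {
        failed.push(cid) // record and move on; retry or report in a later pass
      }
    }
  }))
  return failed // nothing written to the db during the run
}
```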

or

  • Set up 1 or more cluster nodes in AWS
  • update cluster config so it joins as a node to our ipfs cluster
  • set a replication rule such that every pin is added to that cluster
  • shut down the other cluster nodes once it's synced
  • then copy things from our AWS cluster node to S3 at our leisure.

Gotchas

  • the same CID can be expressed in CIDv0 and CIDv1 formats.
  • Cluster should have received all CIDs as the source CID (the CID as the user uploaded it). However, there was a period where we added normalized (CIDv1 base32) CIDs to cluster (see the normalization sketch after this list).
  • nft.storage doesn't track which machine a pin is on.
  • we don't know how long the backup script has been running for.
  • double-check that aws lib-storage does verification
  • backup is writing to the backup table! So querying is going to be hard.
  • backup is writing to the dotstorage-prod-0/complete folder
  • psa pins go into a separate table for web3.storage but go into the uploads table in nft.storage with type Remote
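
For the CIDv0/CIDv1 gotcha above, any comparison between CIDs from the db, cluster and S3 keys should go through a normalization step. A minimal sketch using multiformats:

```ts
import { CID } from 'multiformats/cid'

// Normalize to CIDv1; toString() on a v1 CID defaults to base32, which matches
// the normalized form mentioned in the gotcha above.
function normalizeCid (cidStr: string): string {
  return CID.parse(cidStr).toV1().toString()
}

// A v0 (Qm...) string and its v1 (bafy..., base32) form normalize to the same
// string, so comparisons should always be done on the normalized value.
```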