Getting off web3 cluster

The web3.storage ipfs cluster contains ~320TiB of:

  • All DAGs pinned via the pinning service
  • All DAGs for direct uploads from day 1 until 2022-09-21 (when we switched to writing them to R2)

Some care must be taken to get the data off cluster as it is the only thing storing:

  • DAGs pinned via the pinning service
  • direct uploads created before 2021-09-27 (after which we started writing uploads to S3 as well as cluster).

The postgres db in heroku is the source of truth for all the root CIDs that users have added to the system.

Goals

  1. Every DAG for every root CID that has been added to cluster by a web3.storage user is copied to an S3 bucket where it can be indexed and announced by E-IPFS.
  2. The existing api.web3.storage api correctly reports that every root CID is "pinned" on E-IPFS rather than a specific cluster node. (either the db is updated or the api code is changed to hardcode e-ipfs in the response.)

Risks

  • time to copy ~320TiB off of a set of ipfs nodes to S3 could be greater than 3 months!
    • We're running a backfill job for the set of uploads added before we started writing to S3 in parallel to cluster, and it's been going for a year! It's not been optimised, but it points to a risk we need to manage.
  • data corruption
    • when we write a CAR to AWS we must send a checksum to ensure the write fails if a bit is flipped in transit. This has happened to us. (See the sketch after this list.)
    • when we export a CAR from the ipfs daemon on a cluster node, we have to verify that the block bytes we receive match the block CIDs, as bits can get flipped in transit.
      • IPFS won't give us a CAR CID for an export, so we have to verify each block.
      • Alternatively we can use dagula or an ipfs impl on the receiving end, which would verify each block as it is received.
  • cost
    • we will need to ingest a large number of blocks into E-IPFS very quickly, and last time someone tried that it caused a noticeable spike in our AWS bill.
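
As a mitigation sketch for the corruption risk above: the AWS SDK v3 lets us send a SHA-256 checksum with a PutObject so S3 fails the write if the bytes don't match. This is a minimal sketch, not the actual backup code; the bucket and key are placeholders, and the multipart lib-storage path (see Gotchas) has its own checksum options.

```ts
// Minimal sketch: upload a CAR to S3 with a SHA-256 checksum so S3 rejects
// the write if a bit was flipped in transit. Bucket/key are placeholders.
import { S3Client, PutObjectCommand } from '@aws-sdk/client-s3'
import { createHash } from 'node:crypto'

async function putCarWithChecksum (s3: S3Client, bucket: string, key: string, carBytes: Uint8Array) {
  // S3 expects the checksum as the base64-encoded SHA-256 of the body.
  const checksum = createHash('sha256').update(carBytes).digest('base64')
  await s3.send(new PutObjectCommand({
    Bucket: bucket,
    Key: key,
    Body: carBytes,
    // S3 recomputes this server-side and fails the PUT on mismatch.
    ChecksumSHA256: checksum
  }))
}
```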

Prior art

DAG deduplication rate

As of 2023-01-10T15:13, the sum of used space in bytes across the web3.storage cluster nodes is 1169851708526592, the sum of our dag_sizes is 398108030210654, and our replication factor in cluster is 3, so our deduplication rate is only about 2%:

(totalDagSizeBytes × replicationFactor) / totalBytesStored = (398108030210654 × 3) / 1169851708526592 = 1194324090631962 / 1169851708526592 ≈ 1.02

see: https://observablehq.com/d/d9f9d01c9e2c51f6

source: our node_exporter metrics for the ipfs nodes and our web3storage_content_bytes_total from: https://daghouse.grafana.net/goto/fwW0XA24z?orgId=1

Proposed Steps

This is TBC and intended to provoke better ideas and flush out concerns at this stage.

the first 3 headings can happen in parallel

Stop writing to cluster

  1. (nearform,oli) Land batch CID status query handler PR on pickup
  2. (oli) Send all pinning service requests to pickup, proxied to cluster (web3.storage and nft.storage)
  3. (nearform) Send some traffic to pickup, monitor and optimise.
  4. (nearform) pickup handles 100% of pin requests.

Find out the total dag_size we have to migrate

  1. (nearform alanshaw) Query the web3.storage & nft.storage postgres DBs to calculate the total dag_size we need to migrate.
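
A hedged sketch of that query for web3.storage, using the tables and columns listed in the "DB schema" section below (upload, content, backup) and applying the "check for the existence of either" note there, so we only count DAGs that still need copying. The column names and the backup_urls array type are assumptions to verify against the live schema; nft.storage is similar, and psa_pin_requests would need their own query (none are backed up yet).

```ts
// Sketch only: total dag_size still to migrate for web3.storage uploads.
// Assumes upload.backup_urls is an array column and backup.upload_id exists.
import { Client } from 'pg'

async function dagSizeToMigrate (connectionString: string): Promise<bigint> {
  const client = new Client({ connectionString })
  await client.connect()
  try {
    const { rows } = await client.query(`
      SELECT COALESCE(SUM(c.dag_size), 0) AS total
        FROM upload u
        JOIN content c ON c.cid = u.content_cid
       WHERE (u.backup_urls IS NULL OR array_length(u.backup_urls, 1) IS NULL) -- no direct-upload copy
         AND NOT EXISTS (SELECT 1 FROM backup b WHERE b.upload_id = u.id)      -- no backup-script copy
    `)
    return BigInt(rows[0].total) // pg returns SUM as a string
  } finally {
    await client.end()
  }
}
```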

Run backup test run

  1. (alanshaw) Update https://github.com/nftstorage/backup to work with the current environment. Write to the dotstorage-prod-1 bucket and check uploads.backup_urls
  2. (nearform) Kick off backup process, and let it run for 24hrs.
  3. (nearform) Find out what dag_size it has copied so far and project how long the total will take (see the sketch after this list).
  4. If too long, determine whether optimising and running multiple backup instances would be sufficient, or flag it so we can come up with another plan.
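
A back-of-envelope sketch of the projection in step 3, with placeholder numbers rather than measurements:

```ts
// If the 24h test run copied `bytesIn24h`, how long would ~320TiB take?
const TiB = 1024 ** 4
const totalBytes = 320 * TiB   // from the cluster size estimate above
const bytesIn24h = 2 * TiB     // hypothetical test-run throughput
const projectedDays = totalBytes / bytesIn24h
console.log(`~${Math.ceil(projectedDays)} days at the observed rate`)
```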

Run backup for real

  1. (nearform) Kick off backup process, and let it run to completion for both nft.storage and web3.storage
  2. (nearform) Report on progress each week while tracking as expected. Flag early if falling behind schedule.

Update web3.storage and nft.storage

Update all the things that query cluster, or cache pin location info.

this list is incomplete

  • ensure that api returns the right info for pin status
    • either update the pins table so each row points to a pin_location for e-ipfs rather than n cluster nodes.
    • or hardcode e-ipfs in the response from the api (see the sketch after this list).
  • ensure that api returns the right info for pinning service get requests
  • update cron-pin-status to check status on pickup
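
A hypothetical sketch of the "hardcode e-ipfs" option: regardless of which pin_location rows the db still holds, the API reports a single Elastic IPFS peer. The peer id, name and response shape here are placeholders, not the real web3.storage API types.

```ts
// Hypothetical: collapse per-cluster-node pin rows into one E-IPFS entry.
type PinStatus = { status: string, peerId: string, peerName: string, region: string | null }

// Placeholder values; the real E-IPFS peer id would go here.
const E_IPFS = { peerId: '12D3Koo...', peerName: 'elastic-ipfs', region: null }

function toPinsResponse (dbStatus: string): PinStatus[] {
  return [{ status: dbStatus, ...E_IPFS }]
}
```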

Out of scope

Things that were considered but currently won't be tackled

Replicating to R2

This is currently out of scope

To enable fast retrieval via freeway we may also want to:

  • add <pickup bucket> to w3infra CAR indexing and replicator config to copy CARs and indexes to R2 for fast retrieval via freeway.
  • add dotstorage-prod-1 to w3infra CAR indexing and replicator config to copy CARs and indexes to R2 for fast retrieval via freeway.
    • This would duplicate work that the CF worker api is doing currently, so uploads via it would be written twice unless we add checks or delete that code from the api.

DB schema

Where to look to find how much DAG data we need to move!

Note!

  • We now write all uploads to s3 and record that url in upload.backup_urls
  • Our old backup job has been copying older uploads to S3 for a while. It writes to backup.url so the query will need to check for the existence of either.

Also note!

  • nft.storage tracks pinning service pins in its uploads table and uses type === Remote to discriminate. As such, some of them may already be backed up to S3 by the backup job.
  • web3.storage tracks pinning service pins in a psa_pin_requests table, separate from uploads. None of these have been backed up yet. There is no backup_urls column on that table.

web3.storage

  • upload.backup_urls - where we set s3 url for direct uploads via api
  • backup.url - where the backup script has been tracking backups. Join on backup.upload_id
  • psa_pin_request - where we record pinning service pins. None backed up yet.
  • content.dag_size - join upload.content_cid on content.cid to get the dag_size.

nft.storage

  • upload.backup_urls - where we set s3 url for direct uploads via api
  • backup.url - where the backup script has been tracking backups. Join on backup.upload_id
  • upload.type === Remote - where we record pinning service pins. some may be backed up already.
  • content.dag_size - join upload.content_cid on content.cid to get the dag_size.

Notes

We could use the postgres db to define the list of pins to copy, or we could aim to copy every pin that cluster has.

It is likely that cluster has some pins on it that were added from dev or testing that are not in the db and we don't care about.

The db is the source of truth for the set of pins we really care about. However we tackle it we must verify that we have a copy of every pin that is in the DB to ensure no user data is lost.

we can

  • capture the list of root cids to fetch from the db. Use as static input to process.
  • fetch each DAG as a CAR from cluster.
    • what happens if we have a partial DAG? Should we exclude those from this process?
    • we need to send the request to the node that has the pin; we have this info in the db.
  • verify the blocks in the CAR, then calculate the CAR CID.
  • write the CAR to an S3 bucket that E-IPFS will index, e.g. dotstorage-prod-1 (see the sketch below).

(what bucket to write to?)
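
A sketch of one unit of that first option, assuming Node 18+ and the @ipld/car and multiformats libraries: export the DAG as a CAR via the daemon's HTTP API, verify that every block hashes to its CID, and compute the CAR CID. It buffers the whole CAR and only handles sha2-256 hashes, so it is an illustration rather than production code; the resulting bytes would then be written to S3 with a checksum as in the earlier sketch.

```ts
// Sketch only: export a DAG as a CAR from a cluster node's ipfs daemon,
// verify each block, and compute the CAR CID.
import { CarReader } from '@ipld/car'
import { CID } from 'multiformats/cid'
import { sha256 } from 'multiformats/hashes/sha2'
import { equals } from 'multiformats/bytes'

const CAR_CODEC = 0x0202 // multicodec code for CAR

async function exportAndVerify (ipfsApiUrl: string, rootCid: string) {
  // `ipfs dag export` over the HTTP API streams the DAG as a CAR.
  const res = await fetch(`${ipfsApiUrl}/api/v0/dag/export?arg=${rootCid}`, { method: 'POST' })
  if (!res.ok) throw new Error(`dag export failed: HTTP ${res.status}`)
  const carBytes = new Uint8Array(await res.arrayBuffer()) // sketch: buffers the whole DAG

  // Verify each block: the bytes we received must hash to the CID in the CAR.
  const reader = await CarReader.fromBytes(carBytes)
  for await (const { cid, bytes } of reader.blocks()) {
    if (cid.multihash.code !== sha256.code) throw new Error(`unexpected hash fn in ${cid}`)
    const digest = await sha256.digest(bytes)
    if (!equals(digest.bytes, cid.multihash.bytes)) throw new Error(`bad block bytes for ${cid}`)
  }

  // The CAR CID is a CIDv1 over the CAR file bytes using the CAR codec.
  const carCid = CID.createV1(CAR_CODEC, await sha256.digest(carBytes))
  return { carCid, carBytes } // write carBytes to S3 with a checksum, as sketched earlier
}
```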

or

  • modify the backup tool to go faster (see the concurrency sketch after this list)
  • use predefined list of CIDs to fetch.
  • skip updating the db as we go? Can do it in one shot once the migration is done.
  • tweak ipfs config to set sync:false
    • turn off dht publishing etc.
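
A sketch of the "go faster" idea: drive a predefined CID list with bounded concurrency and leave the db update to a single pass afterwards. `migrateOne` stands in for whatever does the per-CID export/verify/upload, and the concurrency number is a placeholder to tune against cluster and S3 limits.

```ts
// Sketch: N workers pull CIDs from a shared queue until it's drained.
const CONCURRENCY = 8 // placeholder; tune against cluster/S3 limits

async function migrateAll (cids: string[], migrateOne: (cid: string) => Promise<void>): Promise<string[]> {
  const queue = [...cids]
  const failed: string[] = []
  await Promise.all(Array.from({ length: CONCURRENCY }, async () => {
    while (queue.length > 0) {
      const cid = queue.shift() as string
      try {
        await migrateOne(cid)
      } catch {
        failed.push(cid) // record and move on; retry or report in a later pass
      }
    }
  }))
  return failed // nothing written to the db during the run
}
```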

or

  • Set up 1 or more cluster nodes in AWS
  • update cluster config so it joins as a node to our ipfs cluster
  • set a replication rule such that every pin is added to that cluster
  • shut down the other cluster nodes once it's synced
  • then copy things from our AWS cluster node to S3 at our leisure.

Gotchas

  • the same CID can be expressed in CIDv0 and CIDv1 formats.
  • Cluster should have received all CIDs as the source CID (the CID as the user uploaded it). However, there was a period where we added normalized (CIDv1 base32) CIDs to cluster (see the normalization sketch after this list).
  • nft.storage doesn't track which machine a pin is on.
  • we don't know how long the backup script has been running for.
  • double-check that aws lib-storage does verification
  • backup is writing to the backup table! So querying is going to be hard.
  • backup is writing to the dotstorage-prod-0/complete folder
  • psa pins go into a separate table for web3.storage but go into the uploads table in nft.storage with type Remote
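
For the CIDv0/CIDv1 gotcha above, any comparison between CIDs from the db, cluster and S3 keys should go through a normalization step. A minimal sketch using multiformats:

```ts
import { CID } from 'multiformats/cid'

// Normalize to CIDv1; toString() on a v1 CID defaults to base32, which matches
// the normalized form mentioned in the gotcha above.
function normalizeCid (cidStr: string): string {
  return CID.parse(cidStr).toV1().toString()
}

// A v0 (Qm...) string and its v1 (bafy..., base32) form normalize to the same
// string, so comparisons should always be done on the normalized value.
```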