oof - we should get this started ASAP, but hopefully turning off new writes to cluster will also help (so there's less contention on disk operations)
We're running a backfill job for the set of uploads added before we started writing to S3 in parallel to cluster, and it's been going for a year! It's not been optimised, but it points to a risk we need to manage.
-
we should check postgres to see how much of this job is left
we could probably use some version of this job as a base and look into optimizing it, right?
correct. It's included here as it would be relatively simple to configure it (<1 day)... and we'd likely want the replication to R2 for items pinned via pickup going forwards
alan just checked and it has effectively completed, as it is now processing uploads that were done after we added s3 backups. but it did take around a year to run.
yes, it can be used as a base to guide this work. I will add a link to it in the doc
A large portion has been repackaged into mega CAR files by cargo. It won't be 100%, but 99.99% is going to be there and available in the tables you normally query.
ah good catch. for a ballpark estimate - 320TiB is just for web3.storage's cluster, but not all of that will have to be copied over (at this point, just uploads via PSA). but then there's probably a similar amount of data in the NFT.Storage cluster, and likely more DAGs and data (since the PSA is more popular with NFT.Storage)
we started doing S3 backups really early on. it's actually a bit concerning that it wasn't a ton of data that needed to be moved from cluster to S3, yet that job is only just finishing up
for web3.storage, it looks like the uploads table has a backup_urls field, but the pin table does not. it looks like there is some additional backfilling to do for direct uploads beyond just those from before we started writing uploads to S3 on 2021-09-27 (though I don't really understand the pattern of what did / didn't get backed up)
does the backfill include PSA uploads from before S3 backups?
no, the backfill was only for direct uploads
when we write a CAR to aws we must send a checksum
we can, but we have to ensure that it is enabled. this wasn't the case when we started using the v3 aws s3 lib but that changed over the last 12 months. ideally we'd use the sha256 hash from the CAR CID as the checksum, but any checksum will do, as long as we're happy it's being validated.
Getting off web3 cluster
The web3.storage ipfs cluster contains ~320TiB of:
All DAGs pinned via the pinning service
All DAGs for direct uploads from day 1 until 2022-09-21 (when we switched to writing them to R2)
Some care must be taken to get the data off cluster as it is the only thing storing:
DAGs pinned via the pinning service
direct uploads created before 2021-09-27 (after which we started writing uploads to S3 as well as cluster).
The postgres db in heroku is the source of truth for all the root CIDs that users have added to the system.
Goals
Every DAG for every root CID that has been added to cluster by a web3.storage user is copied to an S3 bucket where it can be indexed and announced by E-IPFS.
The existing api.web3.storage API correctly reports that every root CID is "pinned" on E-IPFS rather than a specific cluster node. (either the db is updated or the api code is changed to hardcode e-ipfs in the response.)
Risks
time to copy ~320TiB off of a set of ipfs nodes to S3 could be greater than 3 months!
We're running a backfill job for the set of uploads added before we started writing to S3 in parallel to cluster, and it's been going for a year! It's not been optimised, but it points to a risk we need to manage.
data corruption
when we write a CAR to aws we must send a checksum to ensure the write fails if a bit is flipped in transit. This has happened to us. (a sketch of sending a checksum follows this list)
when we export a CAR from the ipfs daemon on a cluster node, we have to verify that the block bytes we receive match the block CIDs, as bits can get flipped in transit.
IPFS won't give us a CAR CID for an export, so we have to verify each block.
Alternatively we can use dagula or an ipfs implementation on the receiving end, which would verify each block as it is received.
cost
we will need to ingest a large number of blocks into E-IPFS very quickly, and last time someone tried that it caused a noticeable spike in our aws bill.
As of 2023-01-10T15:13 the sum of the used space in bytes across web3.storage nodes is 1169851708526592 and the sum of our dag_sizes is 398108030210654. Our replication factor in cluster is 3, so our deduplication rate is only about 2% (1169851708526592 / (398108030210654 × 3) ≈ 0.98).
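A minimal sketch of the checksum-on-write risk above, assuming the @aws-sdk/client-s3 v3 client, that the CAR CID uses sha2-256 (so its digest can be reused as the S3 checksum), and that the bucket and key layout shown are placeholders:

```ts
import { S3Client, PutObjectCommand } from '@aws-sdk/client-s3'
import { CID } from 'multiformats/cid'

const SHA2_256 = 0x12 // multihash code for sha2-256

// Upload a CAR and have S3 verify the bytes against the sha2-256 digest we
// already have from the CAR CID. If a bit flips in transit the PUT fails
// instead of silently storing corrupt data.
async function putCarWithChecksum (s3: S3Client, bucket: string, carCid: CID, body: Uint8Array) {
  if (carCid.multihash.code !== SHA2_256) {
    throw new Error('expected a sha2-256 CAR CID to reuse as the S3 checksum')
  }
  await s3.send(new PutObjectCommand({
    Bucket: bucket,
    Key: `${carCid}/${carCid}.car`, // key layout is a placeholder
    Body: body,
    ChecksumAlgorithm: 'SHA256',
    // S3 expects the raw sha-256 digest, base64 encoded
    ChecksumSHA256: Buffer.from(carCid.multihash.digest).toString('base64')
  }))
}
```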
This is TBC and intended to provoke better ideas and flush out concerns at this stage.
the first 3 headings can happen in parallel
Stop writing to cluster
(nearform, oli) Land batch CID status query handler PR on pickup
(oli) Send all pinning service requests to pickup, proxied to cluster (web3.storage and nft.storage)
(nearform) Send some traffic to pickup, monitor and optimise.
(nearform) pickup handles 100% of pin requests.
Find out the total dag_size we have to migrate
(nearform, alanshaw) Query the web3.storage & nft.storage postgres DBs to calculate the total dag_size we need to migrate.
Run backup test
(alanshaw) Update https://github.com/nftstorage/backup to work with the current environment. Write to the dotstorage-prod-1 bucket and check uploads.backup_urls
(nearform) Kick off the backup process and let it run for 24hrs.
(nearform) Find out what dag_size it has copied so far and project how long it will take to do the total. (a rough projection sketch follows this list)
if too long… determine whether optimising and running multiple backup instances would be sufficient, or flag it and we come up with another plan.
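To make the "if too long" call concrete, a back-of-envelope projection. The throughput and instance count below are placeholders to be replaced with the test-run measurements; the total is the web3.storage dag_size figure from the Risks section (nft.storage would be added on top):

```ts
// Rough projection from the 24h test run. Inputs are placeholders.
const bytesCopiedIn24h = 1 * 1024 ** 4          // placeholder: 1 TiB/day measured in the test run
const totalBytesToMigrate = 398_108_030_210_654 // web3.storage dag_size sum from the Risks section
const instances = 4                             // placeholder: parallel backup processes

const days = totalBytesToMigrate / (bytesCopiedIn24h * instances)
console.log(`projected migration time: ~${Math.ceil(days)} days`) // ~91 days at these example rates
```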
Run backup for real
(nearform) Kick off the backup process and let it run to completion for both nft.storage and web3.storage
(nearform) Report progress each week and whether it is tracking as expected. Flag early if falling behind schedule.
Update web3.storage and nft.storage
Update all the things that query cluster, or cache pin location info.
this list is incomplete
ensure that api returns the right info for pin status
either update the pins table so each row points to a pin_location for e-ipfs rather than n cluster nodes.
or hardcode e-ipfs in the response from the api. (a sketch of this option follows this list)
ensure that api returns the right info for pinning service get requests
update cron-pin-status to check status on pickup
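A minimal sketch of the "hardcode e-ipfs" option above. The response shape (a pins array with peerId/peerName/region/status) and the peer id value are assumptions; the real types live in the web3.storage api package:

```ts
// Assumed pin location shape; the real type lives in the api package.
interface PinLocation {
  peerId: string
  peerName: string
  region: string | null
  status: 'PinQueued' | 'Pinning' | 'Pinned' | 'PinError'
}

// Placeholder peer id; the real value is E-IPFS's libp2p peer id.
const E_IPFS = { peerId: '<e-ipfs-peer-id>', peerName: 'elastic-ipfs', region: null }

// Collapse whatever cluster-era pin rows the db returns into a single
// E-IPFS location, keeping the best status we have.
function asEipfsPins (pins: PinLocation[]): PinLocation[] {
  const status = pins.some(p => p.status === 'Pinned')
    ? 'Pinned'
    : pins[0]?.status ?? 'PinQueued'
  return [{ ...E_IPFS, status }]
}
```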
Out of scope
Things that were considered but currently won't be tackled
Replicating to R2
This is currently out of scope
To enable fast retrieval via freeway we may also want to:
add <pickup bucket> to w3infra CAR indexing and replicator config to copy CARs and indexes to R2 for fast retrieval via freeway.
add dotstorage-prod-1 to w3infra CAR indexing and replicator config to copy CARs and indexes to R2 for fast retrieval via freeway.
This would duplicate work that the CF worker api is doing currently, so uploads via that would be written twice unless we add checks or delete that code from the api.
DB schema
Where to look to find how much DAG data we need to move! (a sketch query is included at the end of this section)
Note!
We now write all uploads to s3 and record that url in upload.backup_urls
Our old backup job has been copying older uploads to S3 for a while. It writes to backup.url, so the query will need to check for the existence of either.
Also note!
nft.storage tracks pinning service pins in its uploads table and uses type === Remote to discriminate. As such some of them may already be backed up to S3 by the backup job.
web3.storage tracks pinning service pins in a psa_pin_requests table, separate from uploads. None of these have been backed up yet. there is no backup_urls column on that table.
web3.storage
upload.backup_urls - where we set s3 url for direct uploads via api
backup.url - where the backup script has been tracking backups. Join on backup.upload_id
psa_pin_request - where we record pinning service pins… none backed up yet.
content.dag_size… join upload.content_cid on content.cid to get the dag_size.
nft.storage
upload.backup_urls - where we set s3 url for direct uploads via api
backup.url - where the backup script has been tracking backups. Join on backup.upload_id
upload.type === Remote - where we record pinning service pins. some may be backed up already.
content.dag_size… join upload.content_cid on content.cid to get the dag_size.
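A sketch of the kind of query described above, for the web3.storage side. Table and column names follow the schema notes; `upload.id` being the key that `backup.upload_id` joins on and `backup_urls` being an array column are assumptions, and pinning service pins (psa_pin_request) would need a similar sum with no backup filter at all:

```ts
import pg from 'pg'

// Total dag_size of direct uploads with no S3 backup recorded in either
// upload.backup_urls or the backup table.
const UNBACKED_BYTES = `
  SELECT COALESCE(SUM(c.dag_size), 0) AS total_bytes
    FROM upload u
    JOIN content c ON c.cid = u.content_cid
    LEFT JOIN backup b ON b.upload_id = u.id
   WHERE (u.backup_urls IS NULL OR cardinality(u.backup_urls) = 0)
     AND b.url IS NULL
`

async function main () {
  // DATABASE_URL is a placeholder; real credentials come from the heroku config.
  const client = new pg.Client({ connectionString: process.env.DATABASE_URL })
  await client.connect()
  const { rows } = await client.query(UNBACKED_BYTES)
  console.log('dag bytes with no S3 backup:', rows[0].total_bytes)
  await client.end()
}

main()
```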
Notes
We could use the postgres db to define the list of pins to copy, or we could aim to copy every pin that cluster has.
It is likely that cluster has some pins on it that were added from dev or testing, that are not in the db, and that we don't care about.
The db is the source of truth for the set of pins we really care about. However we tackle it, we must verify that we have a copy of every pin that is in the DB to ensure no user data is lost.
we can
capture the list of root cids to fetch from the db. Use as static input to process.
fetch each DAG as a CAR from cluster.
what happens if we have a partial DAG? should we exclude those from this?
we need to send the request to the node that has the pin… we have this info in the db.
verify the blocks in the CAR then calculate the CAR CID. (a sketch of this fetch-and-verify step follows this list)
write the CAR to an s3 bucket that E-IPFS will index, e.g. dotstorage-prod-1
(what bucket to write to?)
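A rough sketch of the fetch-and-verify steps above. Assumptions: the go-ipfs RPC API on the cluster node that holds the pin is reachable at `ipfsApiUrl`, global fetch (Node 18+) is available, the whole CAR fits in memory (a real run would stream), and the CAR CID uses the CAR multicodec (0x0202) over a sha2-256 of the file bytes. The S3 write itself could reuse the checksum sketch from the Risks section:

```ts
import { CarBlockIterator } from '@ipld/car'
import { CID } from 'multiformats/cid'
import { sha256 } from 'multiformats/hashes/sha2'

const CAR_CODE = 0x0202 // multicodec code for CAR

const sameBytes = (a: Uint8Array, b: Uint8Array) =>
  a.length === b.length && a.every((v, i) => v === b[i])

// Fetch a DAG as a CAR from the ipfs daemon on a cluster node, verify every
// block hashes to its CID, and compute a CAR CID for the whole file.
async function fetchAndVerify (ipfsApiUrl: string, rootCid: string) {
  const res = await fetch(`${ipfsApiUrl}/api/v0/dag/export?arg=${rootCid}`, { method: 'POST' })
  if (!res.ok) throw new Error(`export failed: ${res.status}`)
  const carBytes = new Uint8Array(await res.arrayBuffer())

  for await (const { cid, bytes } of await CarBlockIterator.fromBytes(carBytes)) {
    if (cid.multihash.code !== sha256.code) {
      // identity CIDs etc. would need their own handling
      throw new Error(`unexpected hash function for ${cid}`)
    }
    const digest = await sha256.digest(bytes)
    if (!sameBytes(digest.digest, cid.multihash.digest)) {
      throw new Error(`bad block bytes for ${cid}`)
    }
  }

  const carCid = CID.createV1(CAR_CODE, await sha256.digest(carBytes))
  return { carCid, carBytes }
}
```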
or
modify the backup tool to go faster
use predefined list of CIDs to fetch.
skip updating db as we go? can do in one shot once migration is done.
tweak ipfs config to set sync:false
turn off dht publishing etc.
or
Set up 1 or more cluster nodes in aws
update cluster config so it joins as a node to our ipfs cluster
set a replication rule such that every pin is added to that cluster
shut down other cluster nodes once it's sync'd
… then copy things from our aws cluster node to s3 at our leisure.
Gotchas
the same CID can be expressed in CIDv0 and CIDv1 formats.
Cluster should have received all CIDs as the source CID (the CID as the user uploaded it). However there was a period where we added normalized (CIDv1 base32) CIDs to cluster. (a sketch for handling both forms is at the end of this section)
nft.storage doesn't track which machine a pin is on.
we don't know how long the backup script has been running for.
double-check that aws lib-storage does verification
backup is writing to the backup table! so querying is gonna be hard.
backup is writing to the dotstorage-prod-0/complete folder
psa pins go into a separate table for web3.storage but go into the uploads table in nft.storage with type Remote
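For the CIDv0/CIDv1 gotcha above, a minimal sketch of matching a pin under either form, using the multiformats library:

```ts
import { CID } from 'multiformats/cid'

// When looking a pin up in cluster (or matching cluster's pin set against the
// db), try both the CID exactly as the user supplied it and its normalized
// CIDv1 form, since pins were added under both at different times.
function cidVariants (source: string): string[] {
  const cid = CID.parse(source)
  const v1 = cid.toV1().toString() // base32 is the default string encoding for CIDv1
  return [...new Set([source, v1])]
}
```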