# April 12th Riba<>Ian Filecoin Indexing
* Review https://github.com/vulcanize/filecoin-indexing/pull/5
* Some key differences:
* Trimmed away a lot; need help deciding what can be left out and what should be added back
* ipld.blocks internal blockstore
* Should this target the v0 or v1 blockstore format (CID keys vs. multihash keys)?
* IPFS uses v1
* Filecoin deal CARs use v0
* Answer: use a multihash-keyed blockstore (see the sketch after this list)
* This might be unrealistic; we need to think about how we could prune or compress it
* Maybe we need to keep this external (use IPFS)
* Added some missing CID references to the header
* Consolidated tables where possible
* Questions
* Do we want to process pending blocks/messages?
* I.e., if we only process confirmed blocks, we can reference by block_cid everywhere instead of referencing parent_state_root_cid, which is cleaner in that the natural PK scheme is denormalized by the same columns all the way from top to bottom
* Answer: no, we will not process mempool messages
* Note: the deepest reorg seen so far is ~11 epochs
* Note: reorgs are usually 1-4 epochs deep
* Can we consolidate some of the messages tables into fewer tables?
* There is a lot of redundancy, and as I understand it they all reference the same objects (e.g. a "parse_messages" entry will always have a corresponding entry in "messages")
* VM messages vs. internal messages?
* Lily's "actor_states" table is almost exactly the same as the "actors" table. Can we combine these (letting the "state" column be nullable if necessary)?
* Review [progress.md](https://github.com/vulcanize/filecoin-indexing/blob/97639b6345e43279e4ad22e35d1a8f803136f406/db/migrations/progress.md)
* Questions:
* Which summaries are required and which could be derived at a secondary layer?
* Chain economics
* Chain powers
* Chain rewards
* Message gas economies
* Which remaining miner info is required and which could be derived at a secondary layer?
* Current deadline infos
* Miner debts
* Miner locked funds
* Miner pre-commit infos
* Answer: Anything we can derive at secondary layers we should derive
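Back to the blockstore question above: a minimal sketch, using go-cid and go-multihash directly (not the indexer's actual code), of what multihash keying buys us. A v0 and a v1 CID wrapping the same hash collapse to a single record:

```go
package main

import (
	"bytes"
	"fmt"

	"github.com/ipfs/go-cid"
	"github.com/multiformats/go-multihash"
)

// blockstoreKey keys the store by multihash rather than by full CID, so a v0
// and a v1 CID wrapping the same hash land on the same record.
func blockstoreKey(c cid.Cid) []byte {
	return []byte(c.Hash()) // multihash bytes; version/codec prefix dropped
}

func main() {
	// Hash some content once, then wrap the same multihash both ways.
	h, _ := multihash.Sum([]byte("example block"), multihash.SHA2_256, -1)
	v0 := cid.NewCidV0(h)                  // implicit dag-pb, "Qm..." form
	v1 := cid.NewCidV1(cid.DagProtobuf, h) // explicit codec, "bafy..." form

	fmt.Println(v0.Equals(v1))                                     // false: distinct CIDs
	fmt.Println(bytes.Equal(blockstoreKey(v0), blockstoreKey(v1))) // true: one blockstore entry
}
```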
## Notes
Key issues with Lily:
1. Dataset is too large
2. Dataset is too deep/linked
Users want:
1. Inclusion proofs: they have something and want a proof that it is included
2. Aggregation: to perform some aggregation function over the data
## Suggestions and comments from Riba
* Do not use CIDs anywhere except in a single table that maps CIDs to uint64s
* Use the integer IDs in the PKs of all other tables
* Use a distributed process that is capable of communicating these mappings between instances
* We will have duplicate CIDs across separate instances
* Test this with some random fuzzed CID=>content data
* Do this now!
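A rough sketch of that mapping and the fuzz test, with a hypothetical in-memory dictionary standing in for the real table (the DDL in the comment is likewise only an assumption about naming):

```go
package main

import (
	"crypto/rand"
	"fmt"

	"github.com/ipfs/go-cid"
	"github.com/multiformats/go-multihash"
)

// In Postgres this would be roughly (names are assumptions):
//   CREATE TABLE cids (id BIGSERIAL PRIMARY KEY, cid BYTEA UNIQUE NOT NULL);
// with every other table carrying BIGINT foreign keys instead of CID columns.
// cidDictionary is an in-memory stand-in for fuzzing the idea.
type cidDictionary struct {
	ids  map[cid.Cid]uint64
	next uint64
}

// intern returns the existing id for c, or assigns the next one.
func (d *cidDictionary) intern(c cid.Cid) uint64 {
	if id, ok := d.ids[c]; ok {
		return id
	}
	d.next++
	d.ids[c] = d.next
	return d.next
}

func main() {
	dict := &cidDictionary{ids: make(map[cid.Cid]uint64)}
	for i := 0; i < 1000; i++ {
		content := make([]byte, 64) // random fuzzed content
		rand.Read(content)
		h, _ := multihash.Sum(content, multihash.SHA2_256, -1)
		dict.intern(cid.NewCidV1(cid.Raw, h))
	}
	fmt.Println("interned", len(dict.ids), "fuzzed CIDs")
}
```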
* In actor tables (for example; and in others), embed an array of selector steps
* E.g. the steps 8/1/8/1/0/1/0 in https://ipfs.io/ipfs/bafy2bzacecnamqgqmifpluoeldx7zzglxcljo6oja4vrmtj7432rphldpdmm2/8/1/8/1/0/1/0
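A toy sketch of why storing the steps is enough: a hypothetical helper can rebuild the full gateway path for an inclusion proof from just the root CID and the step array.

```go
package main

import (
	"fmt"
	"path"
	"strconv"
)

// proofPath is a hypothetical helper: it rebuilds the gateway path for an
// inclusion proof from a root CID and the selector steps stored on a row.
func proofPath(root string, steps []uint64) string {
	parts := make([]string, 0, len(steps)+2)
	parts = append(parts, "/ipfs", root)
	for _, s := range steps {
		parts = append(parts, strconv.FormatUint(s, 10))
	}
	return path.Join(parts...)
}

func main() {
	// Reproduces the example path above.
	steps := []uint64{8, 1, 8, 1, 0, 1, 0}
	fmt.Println(proofPath("bafy2bzacecnamqgqmifpluoeldx7zzglxcljo6oja4vrmtj7432rphldpdmm2", steps))
	// -> /ipfs/bafy2bzacecnamqgqmifpluoeldx7zzglxcljo6oja4vrmtj7432rphldpdmm2/8/1/8/1/0/1/0
}
```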
* Consider using tipset keys instead of heights everywhere (with a mapping of tipset keys to heights/block_cids)
* Tipsets are also referenced by a CID; it is not native to the chain, but it can be derived
* When multiple workers are updating tipset mapping tables (possibly the same ones?), we merge by simply picking one table as the source of truth; all the other tables adjust and propagate their changes using ON UPDATE CASCADE
* We need another strategy for merging the CID mappings table (or a way to avoid needing to merge it), since it is so much larger
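A minimal sketch of the cascade mechanics, with hypothetical table and column names (and a placeholder DSN):

```go
package main

import (
	"database/sql"
	"log"

	_ "github.com/lib/pq" // Postgres driver
)

// Hypothetical tables illustrating the merge strategy: child tables reference
// the tipset mapping with ON UPDATE CASCADE, so repointing a key in the
// source-of-truth table propagates to every referencing row automatically.
const ddl = `
CREATE TABLE tipsets (
    id         BIGINT PRIMARY KEY, -- surrogate key assigned per worker
    tipset_key BYTEA  NOT NULL UNIQUE
);

CREATE TABLE block_headers (
    tipset_id BIGINT NOT NULL REFERENCES tipsets (id) ON UPDATE CASCADE,
    block_cid BYTEA  NOT NULL,
    PRIMARY KEY (tipset_id, block_cid)
);`

func main() {
	db, err := sql.Open("postgres", "postgres://localhost/filecoin_indexing?sslmode=disable") // placeholder DSN
	if err != nil {
		log.Fatal(err)
	}
	defer db.Close()
	if _, err := db.Exec(ddl); err != nil {
		log.Fatal(err)
	}
	// During a merge, rewriting an id in tipsets cascades into block_headers:
	//   UPDATE tipsets SET id = <canonical id> WHERE id = <duplicate id>;
}
```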
* If we store all CIDs of all actor states and all actor state head CIDs, we can prune entire subtries from what we need to process and store (statediffing, but used here to prune the DB and to help manage reorgs, which usually contain a lot of duplicate data, ~80%)
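A toy sketch of the set difference at the heart of that idea; the mockCID helper and the two single-epoch maps are stand-ins, not real state data:

```go
package main

import (
	"fmt"

	"github.com/ipfs/go-cid"
	"github.com/multiformats/go-multihash"
)

// newSubtrieCIDs returns the CIDs present in the child state trie but not in
// the parent's: only these subtries need to be fetched and stored.
func newSubtrieCIDs(parent, child map[cid.Cid]struct{}) []cid.Cid {
	fresh := []cid.Cid{}
	for c := range child {
		if _, seen := parent[c]; !seen {
			fresh = append(fresh, c)
		}
	}
	return fresh
}

// mockCID derives a deterministic stand-in CID from a string.
func mockCID(s string) cid.Cid {
	h, _ := multihash.Sum([]byte(s), multihash.SHA2_256, -1)
	return cid.NewCidV1(cid.DagCBOR, h)
}

func main() {
	parent := map[cid.Cid]struct{}{mockCID("a"): {}, mockCID("b"): {}}
	child := map[cid.Cid]struct{}{mockCID("b"): {}, mockCID("c"): {}}
	fmt.Println(len(newSubtrieCIDs(parent, child)), "new subtrie root(s)") // 1
}
```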
* Anything we can derive at secondary layers, we should derive.
* CAR files use CIDs, but the storage provider doesn't care; it is just going to persist the records as multihash=>content mappings. So if you don't know the right content prefix when generating a CAR, you can just append whatever prefix you like, e.g. raw
* The storage market might benefit from correct content prefixes? If we don't know the content type, we can just try codecs until one works
* Note: https://github.com/multiformats/multicodec#reserved-code-ranges
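A small sketch of the re-prefixing point using go-cid: the multihash is fixed by the content, while the codec prefix is a free choice.

```go
package main

import (
	"bytes"
	"fmt"

	"github.com/ipfs/go-cid"
	"github.com/multiformats/go-multihash"
)

func main() {
	// The provider persists multihash=>content; the codec prefix is ours to pick.
	content := []byte("some deal payload")
	h, _ := multihash.Sum(content, multihash.SHA2_256, -1)

	asRaw := cid.NewCidV1(cid.Raw, h)      // "whatever prefix you like", e.g. raw
	asCbor := cid.NewCidV1(cid.DagCBOR, h) // the same bytes claimed as dag-cbor

	fmt.Println(bytes.Equal(asRaw.Hash(), asCbor.Hash())) // true: identical multihash
	fmt.Println(asRaw.String() != asCbor.String())        // true: two different CIDs
}
```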
## Main takeaways/actionables
* Use a single table for mapping CIDs to uint64s
* Need to consider how this affects merging historical segments processed in parallel
* Embed selector suffixes in actor tables and other tables that require generating inclusion proofs
* We should consider doing something similar for Ethereum...
* Consider/evaluate the feasibility of an in-Postgres blockstore with the full, not-compacted/pruned, state
* Work on Ethereum suggests this should be possible (it is on a similar order of magnitude in terms of number of records), and we can test it using immense mock data sets (the data doesn't need to form a DAG); see the bulk-load sketch after this list
* Consider how to best handle reorgs
* Finality cutoff is further away from head than in Ethereum
* In-place canonicity evaluation (the chain weighting algo) is more complicated than it is in Ethereum, but it may still be possible to do inside the DB
* Finish determining which things can be left out (which things can conclusively be derived at secondary layers)
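A sketch of such a feasibility test, bulk-loading random multihash=>content rows via lib/pq's COPY support; the ipld.blocks schema/table, column names, and DSN are assumptions:

```go
package main

import (
	"crypto/rand"
	"database/sql"
	"log"

	"github.com/lib/pq"
	"github.com/multiformats/go-multihash"
)

func main() {
	// Placeholder DSN; assumes a table like ipld.blocks (key BYTEA, data BYTEA).
	db, err := sql.Open("postgres", "postgres://localhost/filecoin_indexing?sslmode=disable")
	if err != nil {
		log.Fatal(err)
	}
	defer db.Close()

	txn, err := db.Begin()
	if err != nil {
		log.Fatal(err)
	}
	// COPY is much faster than row-at-a-time INSERTs for a data set this size.
	stmt, err := txn.Prepare(pq.CopyInSchema("ipld", "blocks", "key", "data"))
	if err != nil {
		log.Fatal(err)
	}
	for i := 0; i < 1_000_000; i++ { // scale up as needed; rows are independent, no DAG required
		data := make([]byte, 256)
		rand.Read(data)
		h, _ := multihash.Sum(data, multihash.SHA2_256, -1)
		if _, err := stmt.Exec([]byte(h), data); err != nil {
			log.Fatal(err)
		}
	}
	if _, err := stmt.Exec(); err != nil { // flush the COPY buffer
		log.Fatal(err)
	}
	if err := stmt.Close(); err != nil {
		log.Fatal(err)
	}
	if err := txn.Commit(); err != nil {
		log.Fatal(err)
	}
}
```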