# Filecoin Content Indexing - Research Document

## Introduction

This document is the deliverable for the first milestone of the engagement with Filecoin in the context of content indexing. The project was officially kicked off on `10.01.2022`, and the first milestone covers the *research phase*, in which we looked into different projects and techniques to find the solution that best fits the requirements of the first milestone.

## Objective

In the current architecture of Filecoin ([Docs](https://docs.filecoin.io/)), there are no content indexers, i.e. nothing indexes the content of the files that are stored in the Filecoin network. Our aim was to find the best possible way to do that. There are several indexers working on the network that are available [here](https://cid.contact/), but they do not index file content. The [storetheindex](https://github.com/filecoin-project/storetheindex/) repository, for example, is responsible for indexing the mapping between CIDs and Storage Providers.

Primary objectives for this engagement were to:

- Research options for getting access to the file contents and metadata using retrieval deals or some better way.
- Research how the indexing can happen: either after retrieving the file content, or when the content gets created. The latter would involve changes to how files are pushed to the IPFS/Filecoin network.

The other objectives mentioned in the Statement of Work (SoW) are heavily dependent on the outcome of these two objectives and are left out for now. The following section describes the projects and techniques we looked into and whether they would be a possible solution.

## Findings

To get a better understanding of the current implementation of file retrievals in Filecoin, we looked into the following projects to get an idea of their implementations.

#### Estuary

[Estuary](https://docs.estuary.tech/) uses IPFS as the hot storage when a client is trying to store data using their service.
Later, the Estuary nodes initiate storage deals with storage providers in Filecoin, and the data is stored in cold storage for immutability and availability. The caveat is that Estuary depends on the [Filecoin Plus](https://docs.filecoin.io/store/filecoin-plus/#concepts) program, which is a community-driven program to encourage storage on the Filecoin network. It does not inherently use any protocol-specific rules to ensure data availability or retrieval.

#### The Graph Protocol

The Graph protocol indexes on-chain data. In the case of Filecoin, the on-chain data does not contain the content of the files itself. The actual data is spread across several storage providers, and it is difficult to get unless you know the CIDs. And considering that a client can choose a storage provider, running our own Filecoin storage provider would have no significant benefit, because the indexer would still need to query other providers to get the data for indexing. One of our sub-projects in ChainSafe tried looking into this exact problem. Unfortunately, the conclusion from their side too was that this was not possible directly.

#### IPFS-Search

The [IPFS-Search](https://ipfs-search.com/#/) project does for IPFS what we want to do for Filecoin. IPFS-Search works by:

- Sniffing on the gossip between IPFS nodes, looking for new hashes to index.
- Adding all new hashes to a queue, from which the *crawler* picks them up. If the hash refers to a *file*, the crawler fetches the content. If the hash refers to a *directory*, the crawler recursively fetches the file hashes and places them in the queue.
- Extracting the document content and metadata using a metadata extractor like [Apache Tika](https://tika.apache.org/) ([Go client](https://github.com/google/go-tika) for accessing the Tika server API).
- Using [Elasticsearch](https://www.elastic.co/guide/index.html) for searching and indexing.
- Providing an API for querying Elasticsearch and serving the results.

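
The crawl loop in the list above can be sketched as follows. This is a minimal Rust sketch and not IPFS-Search's actual code: the `Node` enum, the `crawl` function, and the in-memory `network` map (standing in for fetching from IPFS) are hypothetical names chosen for illustration. It shows only the queueing logic, where files are handed on for extraction and indexing and directories are expanded back into the queue.

```rust
use std::collections::{HashMap, HashSet, VecDeque};

/// What a hash resolves to. Hypothetical stand-in for an IPFS lookup result.
#[derive(Debug, Clone)]
pub enum Node {
    /// A file, carrying its (already fetched) text content.
    File(String),
    /// A directory, carrying the hashes of its entries.
    Directory(Vec<String>),
}

/// Processes sniffed hashes in queue order and returns `(hash, content)`
/// pairs for every file found, ready to be passed to an extractor/indexer.
pub fn crawl(network: &HashMap<String, Node>, sniffed: &[&str]) -> Vec<(String, String)> {
    let mut queue: VecDeque<String> = sniffed.iter().map(|h| h.to_string()).collect();
    let mut seen: HashSet<String> = HashSet::new();
    let mut to_index = Vec::new();

    while let Some(hash) = queue.pop_front() {
        // Skip hashes that were already processed (gossip repeats itself).
        if !seen.insert(hash.clone()) {
            continue;
        }
        match network.get(&hash) {
            // Files are fetched and handed over for indexing.
            Some(Node::File(content)) => to_index.push((hash, content.clone())),
            // Directories are expanded: their entries go back into the queue.
            Some(Node::Directory(children)) => queue.extend(children.iter().cloned()),
            // Unresolvable hashes are dropped in this sketch.
            None => {}
        }
    }
    to_index
}
```

Calling `crawl(&network, &["QmDir"])` on a map in which `QmDir` is a directory returns the files below it, each exactly once. A real crawler would additionally need timeouts, retries, and a persistent queue.
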

Filecoin nodes, by default, do not join the public IPFS DHT. Filecoin has an additional incentive layer for the storage providers. This means that discovery and retrieval of data is not as trivial as in IPFS. For data retrieval, for example, one needs to know which storage provider is storing the data. If you stored the data yourself, that is easy, but for data discovery, additional tools like `storetheindex` need to be used. On top of that, a certain amount of FIL needs to be paid for retrieval. Merely sniffing on the gossip of Filecoin nodes won't help with data discovery. Therefore, adapting IPFS-Search directly is not possible, but it is a great starting point for looking into search in a distributed storage setting.

#### Research Papers

[Some](https://storage.googleapis.com/pub-tools-public-publication-data/pdf/34731.pdf) of the [research](https://arxiv.org/abs/1809.00939) papers that we looked into discuss query models for distributed nodes in a decentralized setting. The first paper, for example, introduces a novel Social Query Model (SQM) for decentralized search, which factors in realistic elements such as expertise levels and response rates of nodes, and has the PageRank model and certain Markov Decision Processes as special cases. These query models are based on the assumption of a distributed network where the data is readily available. That assumption does not hold for Filecoin because of its retrieval style. We can definitely take inspiration and maybe even adopt the model once we have a way to fetch the content from the network. Filecoin has a specific issue when it comes to retrieving the data, with its retrieval deals and the way data is stored with the storage provider.

To tackle the problem at hand, we divided it into 3 different approaches with respect to when the content gets indexed. Some suggestions for solutions from our side, with their downsides, are mentioned below.

### 1. When the client creates the content

#### Tagging the content

One simple idea is to run keyword extraction when the content is created and being pushed to the Filecoin network, using an NLP library like [NLTK](https://www.nltk.org/). Those keywords then act as index terms in an [inverted index](https://en.wikipedia.org/wiki/Inverted_index) list in the indexer node. The indexer node will act as an intermediary between the client and the storage provider. For privacy preservation, we can let the user choose whether they want to have the content indexed or not.

The client would generally have a better understanding of the data and the semantics of the metadata they are trying to store. We could let the client influence the keyword selection, giving them more flexibility, but that requires the protocol to trust the client with their data. Malicious clients can lie and throw the search off. There are several other attack vectors where reliance on clients can lead them to exploit this system and render it unusable.

We do rely extensively on the accuracy of the NLP library for keyword extraction, but we can choose one that fits our needs and has a good accuracy record in keyword extraction. A comparison can be found in [this](https://ieeexplore.ieee.org/abstract/document/7962368) paper.

### 2. When the content is processed by the storage provider

**Summary**: We want a way to label files with some metadata information to make them searchable or indexable. Miners have an easy way to do this on-chain.

---

Terms:

- `Payload CID`: hash that represents the root node of the IPLD DAG
- `Piece CID`: hash that represents the root of the hash of the padded CAR file

---

When a client wants to store a file on Filecoin, the file is transferred to the Storage Provider through Graphsync. It is the responsibility of the Storage Provider to post the deal on-chain in the form of a `PublishStorageDeals` transaction.
`PublishStorageDeals` takes a list of `DealProposal` as its parameters.

```
pub struct DealProposal {
    pub piece_cid: Cid,
    pub piece_size: PaddedPieceSize,
    pub verified_deal: bool,
    pub client: Address,
    pub provider: Address,
    /// Arbitrary client chosen label to apply to the deal
    pub label: String,
    pub start_epoch: ChainEpoch,
    pub end_epoch: ChainEpoch,
    pub storage_price_per_epoch: TokenAmount,
    pub provider_collateral: TokenAmount,
    pub client_collateral: TokenAmount,
}
```

The field of interest is the `label` field. This field allows any arbitrary String to be written and is not used for any core Filecoin protocol logic. Currently, this field is most commonly used to store the Payload CID of the data. Note that this Payload CID is not necessarily true or verifiable unless the file itself is retrieved and checked.

Since the Payload CID commonly stored in the `label` field is already untrusted and used essentially as a metadata field, we can create a standard metadata format and store that in there instead. The content detection and metadata would be optional, in case some people don't want a particular file to be indexed.

**Update**: Upon further research, this approach is probably not scalable. A piece can contain many files (like a UnixFS DAG). This means that a lot of metadata would have to be stuffed into this label field. One potential approach would be to leverage `storetheindex` to also store metadata related to files. It is currently unclear whether this is scalable, since `storetheindex` could be optimized for storing specific data types (https://github.com/filecoin-project/index-provider/blob/main/metadata/metadata.go#L70).

### 3. When the content is already stored in the network

Unfortunately, this is by far the most difficult case to create an indexer for. It would essentially require querying for as much data as possible using the CIDs and retrieval deals. The approach is not feasible, as it immediately raises the cost of retrieval of the files.
Another factor limiting the use of this approach is the need to know which CIDs refer to publicly available data. We weren't able to come up with a good suggestion for how to tackle this problem.

## Re-scoping

The retrieval deals turned out to be too complex to fetch the content already on the network. Therefore, after aligning with Protocol Labs, we decided to pivot our focus to the indexing mechanism rather than the fetching of the content.

### Reduced Scope(s)

- Instead of the questions of **when** to index the content and **how** the crypto-economics work for indexer nodes, focus more on **how** to index and **where** to store the indexes.
- We can assume we have the data or CIDs for the publicly available content that can be retrieved freely. Use the dataset and the metadata to focus on how to create the index and the architecture surrounding it.
- We do not worry about the retrieval costs for now.

## Proposed solution

One assumption we made for the proposed solution is that we already have the CIDs for which we want to fetch the content. The CIDs provided by Protocol Labs to kick-start our solution referred to IPFS, and we assume that an equivalent list would be available for Filecoin. Another assumption we made was to ignore the retrieval costs for the files in the Filecoin network.

An overview of the data flow follows:

- **Get content from CIDs:** From the CIDs available, we iteratively *get* the content for each entry. The CIDs we had referred to static websites on the IPFS network.
- **Text extraction:** We extract the actual data, which was mostly contained in the `index.html` file. We can recursively scrape through all files in the directory and fetch plain text content from the HTML files.
- **Indexer:** The fetched content will then be used as an input to [Elasticsearch](https://www.elastic.co/guide/index.html).
  We also looked into Apache Solr for indexing and searching of plain text data, but as there were no significant advantages of using it over Elasticsearch, we decided to go with Elasticsearch. Elasticsearch also natively supports creating a **minhash** for the content and the option to search using the minhash.
- **Mapping:** We need to have a mapping of `Indexes -> CIDs`, so that when a user searches, the returned results also contain the document or at least the address of the document. Then, using the [storetheindex](https://github.com/filecoin-project/storetheindex/) project, which maintains a mapping of `CIDs -> Storage Provider`, we fetch the entire document and return the detailed result.
- **Querying:** When querying, Elasticsearch can also take care of routing the query to the appropriate shard where the index is stored. This should then retrieve the mapping mentioned in the previous point, to be able to fetch the most relevant CIDs and consequently the content.

### Building the Indexer

For building the indexer, the following components will be required:

- **A queue:** If we already know the CIDs and that is what we want to work with, we *do not need* a queue. But if the content to be indexed is going to change dynamically, with the list being updated by some sort of watcher, then we would need a queuing mechanism in place so that new requests don't get lost. A good project that we can use in this case would be [RabbitMQ](https://www.rabbitmq.com/).
- **Metadata and Content Extractor:** In the current state, as we know that the given CIDs are of static websites, we can simply write an extractor ourselves. But in case we are not aware of the file content and want to extract the content for various file types, we would need a proper extractor. [Apache Tika](https://tika.apache.org/) ([Go client](https://github.com/google/go-tika)) works quite well for this case.
- **Indexing:** For actually indexing the content, we would need to run an instance of [Elasticsearch](https://www.elastic.co/guide/index.html). Elasticsearch is well maintained, has native support for sharding and routing, is open source, and is readily available for use with large data sets.
- **API to serve queries:** The last component would be an API for serving search requests against the content indexes we have built and stored.

The following diagram gives an overview of how all components will interact with each other:

![](https://i.imgur.com/CnnNE4w.jpg)

### New Content

For new content that gets pushed to the Filecoin network, our indexer node can work as an intermediary between the storage client and the storage provider, as explained in the storage [Deal Flow](https://spec.filecoin.io/systems/filecoin_markets/storage_market/#section-systems.filecoin_markets.storage_market.deal-flow) of Filecoin.

To give the user the option of whether to index their content or not, the **storage client** will have to be **modified** to include a confirmation and input from the user (this will probably need to be recorded in metadata). The storage client will also have to redirect the request through our indexer node(s) if the user agrees to index the content. This part also needs modification in the storage client.

### Existing Content

Even though we already mentioned that fetching and indexing the already stored content is difficult, we might try a strategy that could work for some content, if not all. We can use the `storetheindex` project to iterate through all CIDs. For each `CID -> Storage provider` mapping, we can initiate a retrieval deal if the price is zero. This will at least allow us to get as many files as possible, or at least the files that are meant for free access.

Alternatively, if the client still has the data available, they can retroactively have their content indexed by skipping the rest of the deal flow and interacting with this indexer only.
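
The zero-price strategy for existing content can be sketched as a simple filter. This is a minimal Rust sketch under stated assumptions: `RetrievalOffer` and `plan_free_retrievals` are hypothetical names, and we assume each offer already combines a `CID -> Storage provider` mapping with the retrieval price quoted by that provider, which in practice would have to be queried from the provider separately.

```rust
use std::collections::BTreeMap;

/// Hypothetical record: one provider's quoted retrieval price for one CID.
#[derive(Debug, Clone, PartialEq)]
pub struct RetrievalOffer {
    pub cid: String,
    pub provider: String,
    /// Quoted price per byte in the smallest FIL unit (assumed field).
    pub price_per_byte: u64,
}

/// Returns, for every CID that at least one provider serves for free,
/// the first such provider. CIDs without a free offer are left out.
pub fn plan_free_retrievals(offers: &[RetrievalOffer]) -> BTreeMap<String, String> {
    let mut plan = BTreeMap::new();
    for offer in offers {
        if offer.price_per_byte == 0 {
            // Keep the first free provider seen for this CID.
            plan.entry(offer.cid.clone())
                .or_insert_with(|| offer.provider.clone());
        }
    }
    plan
}
```

The resulting map is the work list for the retrieval step. Everything that is missing from it would require payment and stays out of scope under the assumptions of this document.
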

### Locality Sensitive Hashing - Optional

Given the size of the Filecoin network, it is also important to look at how the indexers would scale. One approach would be to create clusters of CIDs that are similar with respect to their content. Similar documents can be identified by using the [Jaccard Coefficient](https://en.wikipedia.org/wiki/Jaccard_index) on minhashes created from the document content. The clusters can also be internally ranked based on their similarity scores. For queries that match indexes from one document, we can directly return the top results from that document's cluster without having to match through all indexes of all documents.

## Conclusion

For the first milestone, we wanted to have a research document outlining what is possible with respect to indexing the content on the Filecoin network. After some hindrances and re-scoping, we proposed a solution that works under the assumption of having CIDs to index. We also briefly provided an idea of how it might work for existing content. If the Protocol Labs team feels satisfied with the work, we can move to the next milestone of writing the specification for the solution we proposed. We do agree that the solution is very close to how ipfs-search works.
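
For reference, the similarity estimate behind the optional Locality Sensitive Hashing step can be sketched as follows. This is a minimal Rust sketch and not how Elasticsearch implements its minhash support: the function names, the word-shingle size, and the use of the standard library's `DefaultHasher` with a per-slot seed are all assumptions made for illustration. It computes the exact Jaccard coefficient of two shingle sets and the minhash-based estimate of that coefficient.

```rust
use std::collections::hash_map::DefaultHasher;
use std::collections::HashSet;
use std::hash::{Hash, Hasher};

/// Splits text into lowercase word shingles of length `k` (at least 1).
pub fn shingles(text: &str, k: usize) -> HashSet<String> {
    let words: Vec<String> = text.split_whitespace().map(|w| w.to_lowercase()).collect();
    words.windows(k.max(1)).map(|w| w.join(" ")).collect()
}

/// Exact Jaccard coefficient |A ∩ B| / |A ∪ B|; 0.0 if both sets are empty.
pub fn jaccard(a: &HashSet<String>, b: &HashSet<String>) -> f64 {
    let union = a.union(b).count();
    if union == 0 {
        return 0.0;
    }
    a.intersection(b).count() as f64 / union as f64
}

/// Minhash signature: for each seed, the smallest seeded hash over the set.
pub fn minhash(set: &HashSet<String>, num_hashes: usize) -> Vec<u64> {
    (0..num_hashes as u64)
        .map(|seed| {
            set.iter()
                .map(|s| {
                    let mut h = DefaultHasher::new();
                    seed.hash(&mut h); // the seed selects the hash function
                    s.hash(&mut h);
                    h.finish()
                })
                .min()
                .unwrap_or(u64::MAX) // empty set: sentinel value
        })
        .collect()
}

/// Fraction of signature slots that agree; this estimates the Jaccard coefficient.
pub fn estimate_similarity(a: &[u64], b: &[u64]) -> f64 {
    let n = a.len().min(b.len());
    if n == 0 {
        return 0.0;
    }
    let matches = a.iter().zip(b).filter(|(x, y)| x == y).count();
    matches as f64 / n as f64
}
```

Only the fixed-size signatures need to be stored per CID, so documents can be compared and clustered without keeping their full content around. More hash functions give a tighter estimate at the cost of larger signatures.
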