# File data sources in Graph Node
---
GIP: <The number of this GIP. To be assigned by editors.>
Title: File data sources
Authors: Leo Schwarzstein (leo@edgeandnode.com), Adam Fuller (adam@edgeandnode.com), Zachary Burns
Created: 2021-10-01
Updated: 2021-11-10
Stage: Draft
---
# Abstract
The current mapping API for IPFS makes the overall indexing output depend on whether the indexing node was able to find a file at the moment `ipfs.cat` was called. This is not deterministic and therefore not acceptable for the decentralized network. It is also neither performant nor reliable, as handler processing blocks until the file is found. Finally, it creates a data dependency between the data sourced from the chain and the data sourced from IPFS, which means any change in file availability would require a full subgraph re-indexing from the block at which the file was first required.
The proposed solution consists of:
- An **Availability Chain**, which tracks the availability over time of files referenced by Subgraphs.
- Dynamic **"file" data sources** which correspond to content-addressed data (e.g. from IPFS). These are created by blockchain-based data sources, and will respect the availability definition of the Availability Chain. They are isolated, both in terms of being processed asynchronously from the main chain-based process, and in updating a discrete set of entities.
This GIP describes the implementation of file data sources in Graph Node, starting with IPFS as the first type of file data source.
# Motivation
As on-chain transaction costs have gone up, and dApp use-cases have expanded from DeFi into NFTs & other areas, the importance of indexing off-chain data has increased. Graph Node currently supports accessing data from IPFS, but that implementation is not deterministic, performant, or reliable. Meanwhile other data storage solutions, such as Arweave and Filecoin, are not supported at all.
In order to be useful to these emerging projects, Graph Node needs to be able to reliably, performantly and deterministically index subgraphs which depend on these off-chain data storage protocols.
# Prior Art
[Prior IPFS Datasource RFP](https://github.com/graphprotocol/network-rfcs/blob/master/rfcs/0003-data-availability-and-IPFS-data-sources.md)
[IPFS data sources graph-node issues](https://github.com/graphprotocol/graph-node/issues/1017)
[Existing IPFS documentation](https://thegraph.com/docs/developer/assemblyscript-api#ipfs-api)
# High Level Description
A new "kind" of data source will be introduced: `file/ipfs`. These data sources can be statically defined, based on a content addressable hash, or they can be created dynamically by chain-based mappings.
Upon instantiation, Graph Node will try to fetch the corresponding data from IPFS, respecting the Availability Chain's stated availability for the file if an Availability Chain is configured. If the file is retrieved, a designated handler will be run. That handler will only be able to save the entities specified for the data source in the manifest, and those entities will not be editable by main chain handlers. The availability of those entities for querying will be determined by the Availability Chain, if configured.
# Detailed Specification
Our running example will be a subgraph with the following user information:
```graphql
type User @entity {
  id: ID!
  name: String!
  email: String
  bio: String
}
```
In this case the `bio` and the `email` are found in an IPFS file, while the `id` and `name` are sourced from Ethereum. This separation, with merging at query time, can be defined in the subgraph `schema.graphql` as follows:
```graphql
type UserMetadata @entity {
  id: ID!
  email: String
  bio: String
}

type User @entity {
  id: ID!
  name: String!
  metadata: UserMetadata
}
```
> We now have an entity which is _only_ dependent on IPFS (`UserMetadata`). This can therefore be considered separately for Proof-of-indexing.
A static file data source is then declared as:
```yaml
- name: UserBioIPFS
  kind: file/ipfs
  source: # Omitted in templates
    file: Qm...
  mapping:
    apiVersion: 0.0.6
    language: wasm/assemblyscript
    file: ./src/mappings/Mapping.ts
    handler: handleFile
    entities:
      - UserMetadata
```
- In this case `ipfs` is the type of `file`, indicating that this file is to be found on the IPFS network. This pattern could be extended to `file/arweave` and `file/filecoin` in the future.
- Beyond `file/ipfs`, we can imagine kinds such as `fileLines/ipfs` to replace `ipfs.map` and handle files line by line, or `directory/ipfs` to stream files in a directory; however, those precise definitions are out of scope for this document.
- The `entities` specified under the mapping are important in guaranteeing isolation: entities specified under a file data source should not be accessible by other data sources, and the file data source itself should only create those entities. This should be checked at compile time, and should also fail at runtime if the `store` API is used to write entities outside the declared set.
- There can only be one handler per file data source.
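The compile-time isolation check described above could be sketched as a manifest validation pass. The simplified manifest shape below is hypothetical, for illustration only:

```typescript
// Hypothetical simplified view of a manifest data source declaration.
interface DataSourceDecl {
  name: string;
  kind: string;       // e.g. "ethereum/contract" or "file/ipfs"
  entities: string[]; // entities the mapping may write
}

// Return the names of entities claimed by more than one data source.
// Under the proposed rules this set must be empty: an entity written by a
// file data source may not be written by any other data source.
function entityIsolationViolations(dataSources: DataSourceDecl[]): string[] {
  const owners = new Map<string, string[]>();
  for (const ds of dataSources) {
    for (const entity of ds.entities) {
      const list = owners.get(entity) ?? [];
      list.push(ds.name);
      owners.set(entity, list);
    }
  }
  return Array.from(owners.entries())
    .filter(([, names]) => names.length > 1)
    .map(([entity]) => entity);
}
```

A runtime check would enforce the same invariant inside the `store` host function, rejecting writes to entities outside the data source's declared set.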
In our example, the data source would not be static but a template, in which case the `source` field would be omitted. To instantiate the template the current `create` API can be used as:
```typescript
// Set the `User` entity.
let user = new User(newUserEvent.userId);
user.name = newUserEvent.username;
user.metadata = newUserEvent.userDataHash;
user.save();
// Get the bio from IPFS.
DataSourceTemplate.create("UserMetadata", newUserEvent.userDataHash);
```
The file handler would look like:
```typescript
export function handleFile(file: Bytes, content: Bytes): void {
  let userMetadata = new UserMetadata(file.toHexString());
  userMetadata.bio = content.toString();
  userMetadata.save();
}
```
Note that in this case, the mapping of the Ethereum-based entity to the IPFS-based entity takes place entirely in the Ethereum data mapping, which allows for multiple Ethereum entities to reference the same file-based entity.
It is possible for an identical file data source to be created more than once (e.g. if multiple ERC721s share the same tokenURI on-chain). In this case the corresponding file handler should only be run once, and it will be up to the subgraph author to handle the many-to-one relationship (as in the example), or a scenario where an older file is re-used, by handling the reference in the main chain mapping.
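Because file data sources are identified by content hash, the run-once guarantee can be implemented by de-duplicating on the CID. A minimal sketch, with hypothetical names:

```typescript
// Hypothetical dedup layer: the handler for a given CID runs at most once,
// no matter how many on-chain events create a data source for it.
class FileDataSourceRegistry {
  private processed = new Set<string>();

  // Returns true if the handler should run for this CID, false if it has
  // already run (or is already scheduled to run).
  shouldProcess(cid: string): boolean {
    if (this.processed.has(cid)) return false;
    this.processed.add(cid);
    return true;
  }
}
```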
## Indexing IPFS data sources
Indexing behaviour will be dependent on whether Graph Node has been configured with an Availability Chain.
### Without an Availability Chain
In the absence of an Availability Chain, when a file data source is created, Graph Node tries to find that file from the node's configured IPFS gateway. If Graph Node is unable to find the file, it should retry several times, backing off over time. On finding a file, Graph Node will execute the associated handler, updating the store. The associated entity updates will **not** be part of the subgraph PoI.
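The retry-with-backoff behaviour could look like the following sketch, where `fetchFromIpfs` is a hypothetical stand-in for the node's IPFS gateway request:

```typescript
// Sketch of bounded retries with exponential backoff. `fetchFromIpfs` is a
// stand-in for the node's IPFS gateway call; it resolves to the file content
// or null if the file was not found.
async function fetchWithBackoff(
  cid: string,
  fetchFromIpfs: (cid: string) => Promise<Uint8Array | null>,
  maxAttempts = 5,
  baseDelayMs = 100,
): Promise<Uint8Array | null> {
  for (let attempt = 0; attempt < maxAttempts; attempt++) {
    const content = await fetchFromIpfs(cid);
    if (content !== null) return content;
    // Back off: 100ms, 200ms, 400ms, ... before the next attempt.
    await new Promise<void>((resolve) => setTimeout(resolve, baseDelayMs * 2 ** attempt));
  }
  return null; // Give up for now; the caller can reschedule a later retry.
}
```

In practice the giving-up path would re-queue the data source for a much later retry rather than abandoning it.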
### With an Availability Chain
> The initial implementation in Graph Node will not include an Availability Chain
If Graph Node has an Availability Chain configured, when a file data source is created, Graph Node should check the availability of the file in the Availability Chain, and in its configured IPFS Gateway.
If the file is marked as Available by the Availability Chain per the latest block, and the Graph Node is able to find that file, it should process the corresponding handler and update the PoI for updated entities with that latest Availability Chain block. It should then listen for updates to that file's availability, and if the file is marked as unavailable, the resulting entities' availability block range should be closed as of the latest block, and the PoI updated.
If the file is marked as Available by the Availability Chain per the latest block, and the Graph Node is not able to find that file, it should indicate to the Availability Chain that it is not able to find the file, and try again to find the file periodically.
If the file is not tracked by the Availability Chain, or marked as Unavailable, and the Graph Node is able to find the file, it should indicate to the Availability Chain that the file is available. Notably, Graph Node should not proceed with the corresponding handler until the Availability Chain marks the file as Available.
If the file is not tracked by the Availability Chain, or marked as Unavailable, and the Graph Node is not able to find the file, it should indicate to the Availability Chain that the file is not available, and try again to find the file periodically.
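The four cases above can be summarised as a small decision function. This is a sketch only; the actual interaction with the Availability Chain is left to the separate GIP:

```typescript
// Each action corresponds to one of the four cases in the text.
type Action =
  | "run_handler"        // marked Available and found: process the handler
  | "report_unavailable" // marked Available but not found: report, retry later
  | "report_available"   // not marked Available but found: report, then wait
  | "retry_later";       // neither marked Available nor found: report, retry later

// markedAvailable: what the Availability Chain says at its latest block
// (untracked files count as not Available).
// foundLocally: whether the node's IPFS gateway returned the file.
function decide(markedAvailable: boolean, foundLocally: boolean): Action {
  if (markedAvailable && foundLocally) return "run_handler";
  if (markedAvailable && !foundLocally) return "report_unavailable";
  if (foundLocally) return "report_available";
  return "retry_later";
}
```

Note that in the `report_available` case the handler still does not run until the Availability Chain itself marks the file as Available.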
### Entities and block ranges
Entities which are created, updated and deleted by blockchain data sources are only limited by their main chain block range (`chain_range`) - i.e. no change is required for those entities.
Entities created by file data sources also have a `chain_range`, which is `[startBlock,]`, where `startBlock` is the block when the underlying data source was instantiated on the source chain. There is no upper bound to the `chain_range`, unless an entity is updated by another file data source, in which case the prior entity's `chain_range` will be closed. If two file data sources are created in the same block, entities created by the second data source will take precedence. Note that this over-writing will not be possible in the initial implementation.
Entities created by file data sources also have an `availability_range`, which represents the range of Availability Chain blocks over which their source data is available. If Graph Node has no Availability Chain configured, the `availability_range` will be `[0,]` (i.e. the data is always assumed available). If Graph Node has an Availability Chain, then the range upon new entity creation will be `[availabilityBlock,]`, where `availabilityBlock` is the block at which Graph Node verified availability.
Entities created by file data sources will necessarily need to track the data source that created them. This is because if a previously Available file is marked as Unavailable, all of that file data source's entities will need to have their `availability_range` updated to `[availabilityBlock, unavailabilityBlock]`, where `unavailabilityBlock` is the block at which Graph Node was notified of unavailability. The Proof of Indexing should also be updated for this new Availability Chain block.
If a file is then re-marked as Available, Graph Node should re-process the handler, which should result in the creation of a new entity with `availability_range == [newAvailabilityBlock,]`, and the creation of a new Proof of Indexing.
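The `availability_range` lifecycle described above can be modelled as follows. This is a sketch; field names mirror the text but the actual representation (Postgres block ranges) is an implementation detail:

```typescript
// Sketch: an entity version with a half-open availability range.
// `availabilityEnd === null` means "still available" (i.e. `[start,]`).
interface EntityVersion {
  id: string;
  availabilityStart: number;
  availabilityEnd: number | null;
}

// A file marked Unavailable closes the range at the notification block.
function markUnavailable(v: EntityVersion, unavailabilityBlock: number): EntityVersion {
  return { ...v, availabilityEnd: unavailabilityBlock };
}

// Re-availability re-runs the handler, producing a fresh version with a new
// open range rather than reopening the old one.
function markReavailable(v: EntityVersion, newAvailabilityBlock: number): EntityVersion {
  return { id: v.id, availabilityStart: newAvailabilityBlock, availabilityEnd: null };
}
```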
### Interacting with the store
- File data source mappings can only load entities from chain data source entities, up to the file data source's chain creation block.
- Chain data source entities cannot interact with file data source entities in any way.
### Handling entities with the same ID
There is a use-case for different file data sources to update the same entity (i.e. an entity with the same ID).
> The initial implementation will not allow creation of entities with the same ID. Developers should be able to generate unique entities for each File data source, and should be able to specify the required pointers in the mappings for the main chain.
In the future, Graph Node will need to carefully handle the ordering of entity updates to enable this case. File data sources are processed asynchronously, and entity updates may not take place in a predictable order, depending on data availability.
The clock in this case is provided by the main chain - if file data source A is generated at block 10 of the main chain, and file data source B is updated at block 15, and they both update the same entity, then data source B should over-write any data from data source A, even if B is processed before A. This also applies within blocks - if there are two file data sources created in block 15, the data source created later in the block should take precedence.
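Using the main chain as the clock, precedence between two updates to the same entity can be decided by comparing `(block, order-in-block)` keys, regardless of which file was actually fetched first. A sketch:

```typescript
// Sketch: an update is keyed by where its data source was created on the
// main chain, not by when the file itself was fetched and processed.
interface UpdateKey {
  block: number;        // main-chain block of data source creation
  orderInBlock: number; // creation order within that block
}

// Returns true if update `a` should take precedence over (i.e. overwrite)
// update `b` to the same entity.
function takesPrecedence(a: UpdateKey, b: UpdateKey): boolean {
  if (a.block !== b.block) return a.block > b.block;
  return a.orderInBlock > b.orderInBlock;
}
```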
The current behaviour in Graph Node is that saving a new version of an entity is an upsert at the database level. This implementation introduces complications for file data sources. Assuming the following file data sources, created in the following order, with an upsert pattern:
```
1: A "userProfile { hat: red, pants: blue }" => entity (hat: red, pants: blue)
2: B "userProfile { pants: yellow }" => entity (hat: red, pants: yellow)
3: C "userProfile { pants: green }" => entity (hat: red, pants: green)
```
If the data sources are processed in the order they are created, then the existing approach works. However if, for whatever reason, A is processed after B and C, then it will require not one but three entity updates in the database (to populate the `hat: red` for all relevant entities). Similarly, if the file corresponding to A is marked as unavailable by the Availability Chain, all three entities are invalidated, though actually their `pants` fields remain valid. The existing pattern therefore creates the possibility of unbounded entity updates & reprocessing requirements.
This suggests that entities created by file data sources should not be upserted at the database level, but to remove this functionality would be quite a significant and counter-intuitive constraint on subgraph developers. Therefore an alternative pattern is required, where the file data source entity updates are discrete at the database level (and therefore simple to create and roll back), but are merged together when queried. This merging could take place at query time (which may have performance constraints), or there could be an additional "reduce" step during indexing.
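A query-time merge over the discrete, per-data-source entity versions could look like this sketch, applied in main-chain creation order so that later data sources win field by field:

```typescript
// Sketch: each file data source contributes a discrete partial record;
// merging them in creation order reproduces the upsert semantics at read
// time, without ever rewriting earlier versions in the database.
type PartialEntity = Record<string, string | undefined>;

function mergeVersions(versionsInCreationOrder: PartialEntity[]): PartialEntity {
  const merged: PartialEntity = {};
  for (const version of versionsInCreationOrder) {
    for (const [field, value] of Object.entries(version)) {
      if (value !== undefined) merged[field] = value; // later versions win
    }
  }
  return merged;
}
```

With this pattern, marking file A unavailable only removes A's discrete version from the merge input; B's and C's `pants` values remain valid without any database rewrites.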
### Proof of indexing
_[Placeholder: the proof of indexing will need to take into account both the main chain, and the Availability Chain]_
### Indexing status
The indexing status API will need to be updated to additionally report the Availability Chain syncing status, if an Availability Chain is configured. This can be passed as an additional `ChainIndexingStatus` on the `SubgraphIndexingStatus` under `chains`. `ChainIndexingStatus` can be augmented to indicate the chain "type":
```graphql
type ChainIndexingStatus {
  type: ChainType # New field
  network: String!
  chainHeadBlock: Block
  earliestBlock: Block
  latestBlock: Block
  lastHealthyBlock: Block # This may not be applicable to the Availability Chain
}

enum ChainType {
  AVAILABILITY
  BASE # Terminology TBD
}
```
## Querying subgraphs using file-based data sources
At query time, Graph Node should be aware of whether the query requests data that includes file data source entities.
In our example:
```graphql
# No file-based data source entities:
{
  User {
    id
    name
  }
}

# Including data from file-based entities:
{
  User {
    id
    name
    metadata {
      bio
    }
  }
}
```
- If a query requires data from a file data source, and Graph Node has an Availability Chain configured, the query will need to provide an Availability Chain block hash in addition to the Ethereum block hash.
- If no Availability Chain block hash is provided, the default will be the latest that Graph Node is aware of, similar to current treatment of Ethereum block hashes in Graph Node. Note that such a query is not deterministic.
- Graph Node may not be able to support all Availability Chain blocks, based on when it initiated indexing, and should refuse to serve queries if the stated Availability Chain is out of range.
## Monitoring
For monitoring purposes, Graph Node will want to track a subgraph's file data sources, including but not limited to:
- Number of file data sources
- Number of file data sources where the file has been found and processed
## Restrictions and extensions
- Chain-based data sources cannot load entities created by file data sources.
- File data sources cannot load entities not created by themselves. This might be relaxed in the future, since in principle a file data source could load entities in their state at the moment of the data source's creation, or it could even be implicitly re-created at each chain block on which the data it depends on changes; but for now we use the most restrictive semantics while we learn about real-world use cases.
- It seems useful to allow the originating data source to remove a file data source, stopping it and closing the `chain_range` on all its entities. This could be done as `dataSource.remove("dsName")`, which would also require passing a unique name when creating the data source. This is a future extension.
- File data sources cannot create other data sources, of any type. We plan on allowing this but it is left to a future GIP.
- Other kinds of file data sources such as `lines` or `directory` are future extensions.
- The use case of overwriting data from a new file is covered by the current design by simply spawning a new file data source which updates the resulting entities. However, some use cases may require reading from the previous file to update the data, in which case a file data source will need to handle more than one file. This is left as a future extension.
- Establish a simple declarative way to merge entities by ID, to improve developer experience when merging entities created by file data sources, and chain-based data sources.
- Allow file data sources to create further file data sources, to facilitate indexing of recursive files or directories.
# Backwards Compatibility
This is a breaking change: we intend to remove the current `ipfs` API in the `apiVersion` in which this functionality is released. Existing subgraphs will be able to use the existing implementation under prior `apiVersion`s, but subgraph authors will be encouraged to upgrade, as they may otherwise lose access to other new functionality. (In any case, this implementation is also expected to be more reliable and more performant, so subgraph authors will want to upgrade.)
# Dependencies
- Full implementation of the functionality described in this GIP depends on a separate GIP describing the Availability Chain, and its interaction with and implementation in Graph Node.
- There is also a shared syntax requirement with the Schema Composition GIP: a declarative way to merge entities by ID.
# Risks and Security Considerations
There are considerable drawbacks and risks with this proposal, such as:
- Users finding it difficult to understand the API, or finding that it does not match their expectations or does not fit the data model for their dApp. This design is based on what should be sufficient to migrate our current IPFS subgraphs, but it is hard to predict whether it is a good fit for future patterns that may emerge in dApps. The extensibility of the proposal mitigates this risk. Good documentation will also be important.
- Data sources concurrently modifying the store is a significant change to the architecture. The implementation risks of introducing concurrency need to be evaluated in the engineering plan.
- We are assuming that the pattern described will work for other data storage solutions, such as Arweave and Filecoin.
- The Availability Chain means we need to run and maintain our own chain. Even if it is a very simple one, it may turn out to be a bigger operational and development overhead than anticipated.
# Rationale and Alternatives
- Focus on supporting off-chain data use-cases without targeting deterministic indexing. This is undesirable, as it would mean this functionality is not available on the decentralized network.
- An alternative design is to not treat Ethereum and Availability Chain time as irreconcilable, but instead to sync them via some correspondence between blocks, or even to emit Availability Chain blocks with an associated Ethereum block hash and drop the concept of Availability Chain time inside Graph Node entirely. This would provide more power to mix data from both chains. However, it is hard to generalize to other chains, since we do not want to privilege Ethereum in the protocol, and it effectively forces the Availability Chain to be as slow as Ethereum's block times. In the future we could explore supporting sources of synchronization among blockchains, but we do not assume it from the start, since this alternative has major tradeoffs.
# Copyright Waiver
Copyright and related rights waived via [CC0](https://creativecommons.org/publicdomain/zero/1.0/).