GIP: <The number of this GIP. To be assigned by editors.>
Title: File data sources
Authors: Leo Schwarzstein (leo@edgeandnode.com), Adam Fuller (adam@edgeandnode.com), Zachary Burns
Created: 2021-10-01
Updated: 2021-11-10
Stage: Draft
The current mapping API for IPFS makes the overall indexing output depend on whether or not the index node was able to find a file at the moment ipfs.cat was called. This is not deterministic and is therefore not acceptable for the decentralized network. It is also neither performant nor reliable, as handler processing waits for the file to be found before proceeding. Finally, it creates a data dependency between the data sourced from the chain and the data sourced from IPFS, which means any change in file availability would require a full subgraph re-indexing from the block at which the file was first required.
The proposed solution consists of:
This GIP describes the implementation of file data sources in Graph Node, starting with IPFS as the first type of file data source.
As on-chain transaction costs have gone up, and dApp use-cases have expanded from DeFi into NFTs & other areas, the importance of indexing off-chain data has increased. Graph Node currently supports accessing data from IPFS, but that implementation is not deterministic, performant, or reliable. Meanwhile other data storage solutions, such as Arweave and Filecoin, are not supported at all.
In order to be useful to these emerging projects, Graph Node needs to be able to reliably, performantly and deterministically index subgraphs which depend on these off-chain data storage protocols.
Prior IPFS Datasource RFP
IPFS data sources graph-node issues
Existing IPFS documentation
A new "kind" of data source will be introduced: file/ipfs. These data sources can be statically defined, based on a content addressable hash, or they can be created dynamically by chain-based mappings.
Upon instantiation, Graph Node will try to fetch the corresponding data from IPFS, based on the Availability Chain's stated availability for the file, if it is configured. If the file is retrieved, a designated handler will be run. That handler will only be able to save entities specified for the data source on the manifest, and those entities will not be editable by main chain handlers. The availability of those entities to query will be determined by the Availability Chain, if configured.
Our running example will be a subgraph with the following user information:
type User @entity {
  id: ID!
  name: String!
  email: String
  bio: String
}
In this case the bio and the email are found in an IPFS file, while the ID and name are sourced from Ethereum. This separation & merging at query time can be defined in the subgraph schema.graphql as follows:
type UserMetadata @entity {
  id: ID!
  email: String
  bio: String
}

type User @entity {
  id: ID!
  name: String!
  metadata: UserMetadata
}
We now have an entity which is only dependent on IPFS (UserMetadata). This can therefore be considered separately for Proof-of-indexing.
A static file data source is then declared as:
- name: UserBioIPFS
  kind: file/ipfs
  source: # Omitted in templates
    file: Qm...
  mapping:
    apiVersion: 0.0.6
    language: wasm/assemblyscript
    file: ./src/mappings/Mapping.ts
    handler: handleFile
    entities:
      - UserMetadata
- ipfs is the type of file, indicating that this file is to be found on the IPFS network. This pattern could be extended to file/arweave and file/filecoin in the future.
- Beyond file/ipfs, we can imagine kinds such as fileLines/ipfs (to replace ipfs.map and handle files line by line) or directory/ipfs (to stream files in a directory); however, those precise definitions are out of scope for this document.
- The entities specified under the mapping are important in guaranteeing isolation: entities specified under file data sources should not be accessible by other data sources (and the file data source itself should only create those entities). This should be checked at compile time, and should also fail at run time if the store API is used to update a file data source's entities.

In our example, the data source would not be static but a template, in which case the source field would be omitted. To instantiate the template, the current create API can be used as:
// Set the `User` entity.
let user = new User(newUserEvent.userId);
user.name = newUserEvent.username;
user.metadata = newUserEvent.userDataHash;
user.save();
// Get the bio from IPFS.
DataSourceTemplate.create("UserMetadata", newUserEvent.userDataHash);
The file handler would look like:
export function handleFile(file: Bytes, content: Bytes): void {
  // The entity ID is derived from the file identifier.
  let userMetadata = new UserMetadata(file.toHexString());
  userMetadata.bio = content.toString();
  userMetadata.save();
}
Note that in this case, the mapping of the Ethereum-based entity to the IPFS-based entity takes place entirely in the Ethereum data mapping, which allows for multiple Ethereum entities to reference the same file-based entity.
It is possible for an identical file data source to be created more than once (e.g. if multiple ERC721s share the same tokenURI on-chain). In this case the corresponding file handler should only be run once, and it will be up to the subgraph author to handle the many to one relationship (as in the example), or a scenario where an older file is re-used by handling the reference on the main chain mapping.
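The "run once" requirement above can be illustrated with a small sketch. The `FileDataSourceRegistry` class and the placeholder file identifiers are hypothetical and are not part of graph-ts or Graph Node; the sketch only shows one way duplicate creations of the same file data source could be collapsed so that the handler is scheduled a single time.

```typescript
// Hypothetical registry (an assumption for illustration, not a Graph Node API).
class FileDataSourceRegistry {
  private readonly seen = new Set<string>();

  // Returns true the first time a (kind, file) pair is registered, meaning the
  // handler should be scheduled; returns false for every later duplicate.
  register(kind: string, file: string): boolean {
    const key = kind + ":" + file;
    if (this.seen.has(key)) {
      return false;
    }
    this.seen.add(key);
    return true;
  }
}

// "QmExampleHashA" and "QmExampleHashB" are placeholders, not real CIDs.
const registry = new FileDataSourceRegistry();
const first = registry.register("file/ipfs", "QmExampleHashA");
const duplicate = registry.register("file/ipfs", "QmExampleHashA");
const other = registry.register("file/ipfs", "QmExampleHashB");
```

In this sketch two chain entities that reference the same file still both point at the single file-based entity; only the handler execution is deduplicated.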
Indexing behaviour will be dependent on whether Graph Node has been configured with an Availability Chain.
In the absence of an Availability Chain, when a file data source is created, Graph Node tries to find that file from the node's configured IPFS gateway. If Graph Node is unable to find the file, it should retry several times, backing off over time. On finding a file, Graph Node will execute the associated handler, updating the store. The associated entity updates will not be part of the subgraph PoI.
The initial implementation in Graph Node will not include an Availability Chain.
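As a rough illustration of "retry several times, backing off over time", the sketch below computes an exponential backoff schedule. The function names, the one second base delay, and the sixty second cap are all assumptions chosen for the example; the proposal does not specify a retry policy.

```typescript
// Hypothetical helpers: the names and default values are assumptions.
// Returns the delay before retry number `attempt` (0-based). The delay doubles
// on every attempt, starting at `baseMs`, and never exceeds `capMs`.
function retryDelayMs(attempt: number, baseMs: number = 1000, capMs: number = 60000): number {
  if (attempt < 0) {
    throw new RangeError("attempt must be >= 0");
  }
  return Math.min(capMs, baseMs * Math.pow(2, attempt));
}

// Returns the delays for the first `attempts` retries.
function retrySchedule(attempts: number, baseMs: number = 1000, capMs: number = 60000): number[] {
  const delays: number[] = [];
  for (let i = 0; i < attempts; i++) {
    delays.push(retryDelayMs(i, baseMs, capMs));
  }
  return delays;
}

const schedule = retrySchedule(8);
```

A cap keeps a long-missing file from pushing the retry interval out indefinitely, which matters because a file may become available at any later time.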
If Graph Node has an Availability Chain configured, when a file data source is created, Graph Node should check the availability of the file in the Availability Chain, and in its configured IPFS Gateway.
If the file is marked as Available by the Availability Chain per the latest block, and the Graph Node is able to find that file, it should process the corresponding handler and update the PoI for updated entities with that latest Availability Chain block. It should then listen for updates to that file's availability, and if the file is marked as unavailable, the resulting entities' availability block range should be closed as of the latest block, and the PoI updated.
If the file is marked as Available by the Availability Chain per the latest block, and the Graph Node is not able to find that file, it should indicate to the Availability Chain that it is not able to find the file, and try again to find the file periodically.
If the file is not tracked by the Availability Chain, or marked as Unavailable, and the Graph Node is able to find the file, it should indicate to the Availability Chain that the file is available. Notably, Graph Node should not proceed with the corresponding handler until the Availability Chain marks the file as Available.
If the file is not tracked by the Availability Chain, or marked as Unavailable, and the Graph Node is not able to find the file, it should indicate to the Availability Chain that the file is not available, and try again to find the file periodically.
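The four cases above form a small decision table, sketched below. The type names and action labels are hypothetical and exist only to make the table explicit; they are not identifiers from Graph Node.

```typescript
// Hypothetical types: the status and action names are assumptions.
type ChainStatus = "available" | "unavailable" | "untracked";
type Action =
  | "run_handler_and_watch"       // process the handler, update the PoI, listen for changes
  | "report_found_and_wait"       // tell the Availability Chain the file exists, do not run the handler yet
  | "report_not_found_and_retry"; // tell the Availability Chain the file is missing, retry periodically

function decideAction(status: ChainStatus, foundOnGateway: boolean): Action {
  if (status === "available") {
    return foundOnGateway ? "run_handler_and_watch" : "report_not_found_and_retry";
  }
  // "untracked" and "unavailable" are handled identically in the text above.
  return foundOnGateway ? "report_found_and_wait" : "report_not_found_and_retry";
}
```

Note that the handler only runs in one of the four cases: the node finding a file locally is never sufficient on its own when an Availability Chain is configured.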
Entities which are created, updated and deleted by blockchain data sources are only limited by their main chain block range (chain_range) - i.e. no change is required for those entities.
Entities created by File data sources also have a chain_range, which is: [startBlock,], where startBlock is the block when the underlying data source was instantiated on the source chain. There is no upper bound to the chain_range, unless an entity is updated by another File datasource, in which case the prior entity's chain_range will be closed. If two file data sources are created in the same block, entities created by the second data source will take precedence. Note that this over-writing will not be possible in the initial implementation.
Entities created by File data sources also have an availability_range, which represents the combination of Availability Chain blocks where their source data exists. If Graph Node has no Availability Chain configured, the availability_range will be [0,] (i.e. always assumed available). If Graph Node has an Availability Chain, then the range upon new Entity creation will be [availabilityBlock,], where availabilityBlock is the block at which Graph Node verified availability.
Entities created by File data sources will necessarily need to track the data source that created them. This is because if a previously Available file is marked as Unavailable, all that file data source's entities will need to have their availability_range updated to [availabilityBlock, unavailabilityBlock], where unavailabilityBlock is the block at which Graph Node was notified of unavailability. The Proof of Indexing should also be updated for this new Availability Chain block.
If a file is then re-marked as Available, then Graph Node should re-process the handler, which should result in the creation of a new entity with availability_range == [newAvailabilityBlock,], and the creation of a new Proof of Indexing.
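The availability_range lifecycle described above can be sketched as follows. The `BlockRange` shape and helper names are hypothetical, the block numbers are invented for the example, and treating the upper bound as exclusive is an assumption, since the text does not say whether the closing block is included.

```typescript
// Hypothetical representation: `end === null` means the range is still open.
interface BlockRange {
  start: number;
  end: number | null;
}

// Closes an open range at `end`, as when a file is marked Unavailable.
function closeRange(range: BlockRange, end: number): BlockRange {
  if (range.end !== null) {
    throw new Error("range is already closed");
  }
  if (end < range.start) {
    throw new Error("end must not precede start");
  }
  return { start: range.start, end: end };
}

// Assumption: the upper bound is exclusive.
function isAvailableAt(range: BlockRange, block: number): boolean {
  return block >= range.start && (range.end === null || block < range.end);
}

// File verified available at block 100, marked unavailable at block 250,
// then re-marked available at block 400 (which creates a new entity version).
const firstVersion = closeRange({ start: 100, end: null }, 250);
const secondVersion: BlockRange = { start: 400, end: null };
```

Under these assumptions a query pinned to Availability Chain block 300 would see neither version, while one at block 400 or later would see only the second.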
There is a use-case for different file data sources to update the same entity (i.e. an entity with the same ID).
The initial implementation will not allow creation of entities with the same ID. Developers should be able to generate unique entities for each File data source, and should be able to specify the required pointers in the mappings for the main chain.
In future, Graph Node will need to carefully handle ordering of entity updates to enable this case. File data sources are processed asynchronously, and entity updates may not take place in a predictable order, based on data availability.
The clock in this case is provided by the main chain - if file data source A is generated at block 10 of the main chain, and file data source B is generated at block 15, and they both update the same entity, then data source B should over-write any data from data source A, even if B is processed before A. This also applies within blocks - if there are two file data sources created in block 15, the data source created later in the block should take precedence.
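A minimal sketch of this precedence rule is shown below. The `CreatedDataSource` shape and the `indexInBlock` field are assumptions used to express "created later in the block"; the proposal does not define how intra-block order is recorded.

```typescript
// Hypothetical record of where on the main chain a file data source was created.
interface CreatedDataSource {
  name: string;
  block: number;        // main chain block in which the data source was created
  indexInBlock: number; // assumed creation order within that block
}

// True if `a` was created later on the main chain than `b`.
function createdLater(a: CreatedDataSource, b: CreatedDataSource): boolean {
  if (a.block !== b.block) {
    return a.block > b.block;
  }
  return a.indexInBlock > b.indexInBlock;
}

// Picks the data source whose write should win, regardless of the order in
// which the data sources happened to be processed.
function winningSource(processed: CreatedDataSource[]): CreatedDataSource {
  if (processed.length === 0) {
    throw new Error("no data sources");
  }
  return processed.reduce((best, next) => (createdLater(next, best) ? next : best));
}

const sourceA: CreatedDataSource = { name: "A", block: 10, indexInBlock: 0 };
const sourceB: CreatedDataSource = { name: "B", block: 15, indexInBlock: 0 };
const sourceC: CreatedDataSource = { name: "C", block: 15, indexInBlock: 1 };
```

Because the comparison uses only main chain position, processing B before A or A before B gives the same winner.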
The current behaviour in Graph Node is that saving a new version of an entity is an upsert at the database level. This implementation introduces complications for file data sources. Assuming the following file data sources, created in the following order, with an upsert pattern:
1: A "userProfile { hat: red, pants: blue }" => entity (hat: red, pants: blue)
2: B "userProfile { pants: yellow }" => entity (hat: red, pants: yellow)
3: C "userProfile { pants: green }" => entity (hat: red, pants: green)
If the data sources are processed in the order they are created, then the existing approach works. However if, for whatever reason, A is processed after B and C, then it will require not one but three entity updates in the database (to populate the hat: red for all relevant entities). Similarly, if the file corresponding to A is marked as unavailable by the Availability Chain, all three entities are invalidated, though actually their pants fields remain valid. The existing pattern therefore creates the possibility of unbounded entity updates & reprocessing requirements.
This suggests that entities created by file data sources should not be upserted at the database level, but to remove this functionality would be quite a significant and counter-intuitive constraint on subgraph developers. Therefore an alternative pattern is required, where the file data source entity updates are discrete at the database level (and therefore simple to create and roll back), but are merged together when queried. This merging could take place at query time (which may have performance constraints), or there could be an additional "reduce" step during indexing.
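The alternative pattern can be sketched with the hat and pants example. Each file data source stores its own discrete fragment, and the fragments are merged in creation order when read. The `Fragment` shape, the `createdOrder` field, and the `mergeFragments` function are hypothetical; this is one possible reading of the "merge" step, not a description of Graph Node's design.

```typescript
type Fields = Record<string, string>;

// Hypothetical discrete update written by one file data source.
interface Fragment {
  source: string;
  createdOrder: number; // creation order on the main chain, not processing order
  available: boolean;   // whether the source file is currently available
  fields: Fields;
}

// Merges available fragments in creation order; later fragments override
// earlier ones field by field.
function mergeFragments(fragments: Fragment[]): Fields {
  const ordered = fragments
    .filter((f) => f.available)
    .sort((a, b) => a.createdOrder - b.createdOrder);
  const merged: Fields = {};
  for (const fragment of ordered) {
    Object.assign(merged, fragment.fields);
  }
  return merged;
}

// Fragments listed in processing order: A arrives after B and C.
const stored: Fragment[] = [
  { source: "B", createdOrder: 2, available: true, fields: { pants: "yellow" } },
  { source: "C", createdOrder: 3, available: true, fields: { pants: "green" } },
  { source: "A", createdOrder: 1, available: true, fields: { hat: "red", pants: "blue" } },
];
const merged = mergeFragments(stored);

// If the file behind A becomes unavailable, only A's fragment is excluded.
const withoutA = mergeFragments(
  stored.map((f) => (f.source === "A" ? { ...f, available: false } : f))
);
```

In this sketch the late arrival of A costs one write rather than three, and invalidating A leaves the pants value from C intact.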
[Placeholder: the proof of indexing will need to take into account both the main chain, and the Availability Chain]
The indexing status API will need to be updated to additionally report the Availability Chain syncing status, if an Availability Chain is configured. This can be passed as an additional ChainIndexingStatus on the SubgraphIndexingStatus under chains. ChainIndexingStatus can be augmented to indicate the chain "type":
type ChainIndexingStatus {
  type: ChainType ## New field
  network: String!
  chainHeadBlock: Block
  earliestBlock: Block
  latestBlock: Block
  lastHealthyBlock: Block ## This may not be applicable to the Availability Chain
}

enum ChainType {
  AVAILABILITY
  BASE ## Terminology TBD
}
At Query time, Graph Node should be aware of whether the query is requesting data which includes file data source entities.
In our example:
## No file-based data source entities:
{
  User {
    id
    name
  }
}

## Including data from file-based entities:
{
  User {
    id
    name
    metadata {
      bio
    }
  }
}
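One way Graph Node could tell the two queries apart is to walk the selection set against the schema, as sketched below. The `Selection` shape, the `fieldTypes` map, and the function name are hypothetical simplifications; a real implementation would work on a parsed GraphQL document rather than a hand-built tree.

```typescript
// Hypothetical, simplified selection tree.
interface Selection {
  field: string;
  children?: Selection[];
}

// True if the query rooted at `rootType` selects any entity created by a file
// data source. `fieldTypes` maps "Parent.field" to the entity type it returns.
function touchesFileEntities(
  rootType: string,
  selections: Selection[],
  fieldTypes: Record<string, string>,
  fileEntities: Set<string>
): boolean {
  if (fileEntities.has(rootType)) {
    return true;
  }
  for (const selection of selections) {
    const childType = fieldTypes[rootType + "." + selection.field];
    if (
      childType !== undefined &&
      touchesFileEntities(childType, selection.children ?? [], fieldTypes, fileEntities)
    ) {
      return true;
    }
  }
  return false;
}

// Schema facts taken from the running example.
const fieldTypes: Record<string, string> = { "User.metadata": "UserMetadata" };
const fileEntities = new Set<string>(["UserMetadata"]);

const chainOnly: Selection[] = [{ field: "id" }, { field: "name" }];
const withMetadata: Selection[] = [
  { field: "id" },
  { field: "name" },
  { field: "metadata", children: [{ field: "bio" }] },
];
```

Only the second selection would need to take the Availability Chain into account under this approach.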
For monitoring purposes, the subgraph will want to keep track of its file data sources, including but not limited to:
chain_range on all entities. This could be done as dataSource.remove("dsName"), which would also require passing a unique name when creating the data source. This is a future extension.
lines or directory are future extensions.
This is a breaking change as we intend to remove the current ipfs API in the apiVersion in which this is released. Existing subgraphs will be able to use the existing implementation under prior apiVersions, but subgraph authors will be encouraged to upgrade their implementation, as they may lose access to other new functionality if they do not do so. (In any case, this implementation is also expected to be more reliable and more performant, so subgraph authors will want to upgrade.)
There are considerable drawbacks and risks with this proposal, such as:
Copyright and related rights waived via CC0.