Adam Fuller
# File data sources in Graph Node

---
GIP: <The number of this GIP. To be assigned by editors.>
Title: File data sources
Authors: Leo Schwarzstein (leo@edgeandnode.com), Adam Fuller (adam@edgeandnode.com), Zachary Burns
Created: 2021-10-01
Updated: 2021-11-10
Stage: Draft
---

# Abstract

The current mapping API for IPFS makes the overall indexing output depend on whether or not the index node was able to find a file at the moment `ipfs.cat` was called. This is not deterministic and therefore not acceptable for the decentralized network. It is also neither performant nor reliable, as handler processing waits for the file to be found before proceeding. Finally, it creates a data dependency between the data sourced from the chain and the data sourced from IPFS, which means any change in file availability would require a full subgraph re-indexing from the block at which the file was first required.

The proposed solution consists of:

- An **Availability Chain**, which tracks the availability over time of files referenced by Subgraphs.
- Dynamic **"file" data sources** which correspond to content-addressed data (e.g. from IPFS). These are created by blockchain-based data sources, and will respect the availability definition of the Availability Chain. They are isolated, both in terms of being processed asynchronously from the main chain-based process, and in updating a discrete set of entities.

This GIP describes the implementation of file data sources in Graph Node, starting with IPFS as the first type of file data source.

# Motivation

As on-chain transaction costs have gone up, and dApp use-cases have expanded from DeFi into NFTs & other areas, the importance of indexing off-chain data has increased. Graph Node currently supports accessing data from IPFS, but that implementation is not deterministic, performant, or reliable. Meanwhile other data storage solutions, such as Arweave and Filecoin, are not supported at all. In order to be useful to these emerging projects, Graph Node needs to be able to reliably, performantly and deterministically index subgraphs which depend on these off-chain data storage protocols.

# Prior Art

- [Prior IPFS Datasource RFP](https://github.com/graphprotocol/network-rfcs/blob/master/rfcs/0003-data-availability-and-IPFS-data-sources.md)
- [IPFS data sources graph-node issues](https://github.com/graphprotocol/graph-node/issues/1017)
- [Existing IPFS documentation](https://thegraph.com/docs/developer/assemblyscript-api#ipfs-api)

# High Level Description

A new "kind" of data source will be introduced: `file/ipfs`. These data sources can be statically defined, based on a content-addressable hash, or they can be created dynamically by chain-based mappings.

Upon instantiation, Graph Node will try to fetch the corresponding data from IPFS, based on the Availability Chain's stated availability for the file, if it is configured. If the file is retrieved, a designated handler will be run. That handler will only be able to save entities specified for the data source on the manifest, and those entities will not be editable by main chain handlers. The availability of those entities to query will be determined by the Availability Chain, if configured.

# Detailed Specification

Our running example will be a subgraph with the following user information:

```graphql
type User @entity {
  id: ID!
  name: String!
  email: String
  bio: String
}
```

In this case the `bio` and the `email` are found in an IPFS file, while the ID and name are sourced from Ethereum. This separation & merging at query time can be defined in the subgraph `schema.graphql` as follows:

```graphql
type UserMetadata @entity {
  id: ID!
  email: String
  bio: String
}

type User @entity {
  id: ID!
  name: String!
  metadata: UserMetadata
}
```

> We now have an entity which is _only_ dependent on IPFS (`UserMetadata`). This can therefore be considered separately for Proof-of-indexing.

A static file data source is then declared as:

```yaml
- name: UserBioIPFS
  kind: file/ipfs
  source: # Omitted in templates
    file: Qm...
  mapping:
    apiVersion: 0.0.6
    language: wasm/assemblyscript
    file: ./src/mappings/Mapping.ts
    handler: handleFile
    entities:
      - UserMetadata
```

- In this case `ipfs` is the type of `file`, indicating that this file is to be found on the IPFS network. This pattern could be extended to `file/arweave` and `file/filecoin` in the future.
- Other than `file/ipfs` we can imagine kinds such as `fileLines/ipfs` to replace `ipfs.map` and handle files line by line, or `directory/ipfs` to stream files in a directory. Those precise definitions are out of scope for this document.
- The `entities` specified under the mapping are important in guaranteeing isolation - entities specified under file data sources should not be accessible by other data sources (and the file data source itself should only create those entities). This should be checked at compile time, and should also fail at run time if the `store` API is used to update an entity belonging to a file data source.
- There can only be one handler per file data source.

In our example, the data source would not be static but a template, in which case the `source` field would be omitted. To instantiate the template the current `create` API can be used as:

```typescript
// Set the `User` entity from the on-chain event.
let user = new User(newUserEvent.userId);
user.name = newUserEvent.username;
// Reference the file-based entity by the hash of the file.
user.metadata = newUserEvent.userDataHash;
user.save();

// Get the bio from IPFS by spawning a file data source for the hash.
DataSourceTemplate.create("UserMetadata", newUserEvent.userDataHash);
```

The file handler would look like:

```typescript
// `file` identifies the file; `content` holds its raw bytes.
export function handleFile(file: Bytes, content: Bytes): void {
  let userMetadata = new UserMetadata(file.toHexString());
  userMetadata.bio = content.toString();
  userMetadata.save();
}
```

Note that in this case, the mapping of the Ethereum-based entity to the IPFS-based entity takes place entirely in the Ethereum data mapping, which allows for multiple Ethereum entities to reference the same file-based entity.

It is possible for an identical file data source to be created more than once (e.g. if multiple ERC721s share the same tokenURI on-chain). In this case the corresponding file handler should only be run once, and it will be up to the subgraph author to handle the many-to-one relationship (as in the example), or a scenario where an older file is re-used, by handling the reference in the main chain mapping.

## Indexing IPFS data sources

Indexing behaviour will depend on whether Graph Node has been configured with an Availability Chain.

### Without an Availability Chain

In the absence of an Availability Chain, when a file data source is created, Graph Node tries to find that file from the node's configured IPFS gateway. If Graph Node is unable to find the file, it should retry several times, backing off over time. On finding a file, Graph Node will execute the associated handler, updating the store. The associated entity updates will **not** be part of the subgraph PoI.

### With an Availability Chain

> The initial implementation in Graph Node will not include an Availability Chain

If Graph Node has an Availability Chain configured, when a file data source is created, Graph Node should check the availability of the file in the Availability Chain, and in its configured IPFS Gateway.

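The behaviour in this section depends on two inputs: the file's status on the Availability Chain and whether Graph Node can find the file through its own gateway. As a minimal sketch, that combination can be written as a pure decision function. The names `ChainStatus`, `Decision` and `decide` are hypothetical and chosen for illustration only; they are not Graph Node APIs.

```typescript
// Hypothetical types for illustration; none of these exist in Graph Node.
type ChainStatus = "Available" | "Unavailable" | "Untracked";

interface Decision {
  // Run the file handler and update the PoI.
  runHandler: boolean;
  // Signal sent to the Availability Chain, if any.
  report: "available" | "unavailable" | null;
  // Try to find the file again periodically.
  retryFetch: boolean;
}

function decide(status: ChainStatus, foundLocally: boolean): Decision {
  if (status === "Available") {
    return foundLocally
      ? { runHandler: true, report: null, retryFetch: false }
      : { runHandler: false, report: "unavailable", retryFetch: true };
  }
  // Untracked or Unavailable: never run the handler until the
  // Availability Chain itself marks the file as Available.
  return foundLocally
    ? { runHandler: false, report: "available", retryFetch: false }
    : { runHandler: false, report: "unavailable", retryFetch: true };
}
```

Keeping the decision free of side effects means the handler only ever runs in one branch, which is the branch where the Availability Chain and the local node agree that the file exists.
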
If the file is marked as Available by the Availability Chain per the latest block, and the Graph Node is able to find that file, it should process the corresponding handler and update the PoI for updated entities with that latest Availability Chain block. It should then listen for updates to that file's availability, and if the file is marked as unavailable, the resulting entities' availability block range should be closed as of the latest block, and the PoI updated.

If the file is marked as Available by the Availability Chain per the latest block, and the Graph Node is not able to find that file, it should indicate to the Availability Chain that it is not able to find the file, and try again to find the file periodically.

If the file is not tracked by the Availability Chain, or marked as Unavailable, and the Graph Node is able to find the file, it should indicate to the Availability Chain that the file is available. Notably, Graph Node should not proceed with the corresponding handler until the Availability Chain marks the file as Available.

If the file is not tracked by the Availability Chain, or marked as Unavailable, and the Graph Node is not able to find the file, it should indicate to the Availability Chain that the file is not available, and try again to find the file periodically.

### Entities and block ranges

Entities which are created, updated and deleted by blockchain data sources are only limited by their main chain block range (`chain_range`) - i.e. no change is required for those entities.

Entities created by File data sources also have a `chain_range`, which is: `[startBlock,]`, where `startBlock` is the block when the underlying data source was instantiated on the source chain. There is no upper bound to the `chain_range`, unless an entity is updated by another File data source, in which case the prior entity's `chain_range` will be closed.
If two file data sources are created in the same block, entities created by the second data source will take precedence. Note that this over-writing will not be possible in the initial implementation.

Entities created by File data sources also have an `availability_range`, which represents the combination of Availability Chain blocks where their source data exists. If Graph Node has no Availability Chain configured, the `availability_range` will be `[0,]` (i.e. always assumed available). If Graph Node has an Availability Chain, then the range upon new Entity creation will be `[availabilityBlock,]`, where `availabilityBlock` is the block at which Graph Node verified availability.

Entities created by File data sources will necessarily need to track the data source that created them. This is because if a previously Available file is marked as Unavailable, all that file data source's entities will need to have their `availability_range` updated to `[availabilityBlock, unavailabilityBlock]`, where `unavailabilityBlock` is the block at which Graph Node was notified of unavailability. The Proof of Indexing should also be updated for this new Availability Chain block.

If a file is then re-marked as Available, then Graph Node should re-process the handler, which should result in the creation of a new entity with `availability_range == [newAvailabilityBlock,]`, and the creation of a new Proof of Indexing.

### Interacting with the store

- File data source mappings can only load entities from chain data source entities, up to the file data source's chain create block.
- Chain data source entities cannot interact with file data source entities in any way.

### Handling entities with the same ID

There is a use-case for different file data sources to update the same entity (i.e. an entity with the same ID).

> The initial implementation will not allow creation of entities with the same ID.

Developers should be able to generate unique entities for each File data source, and should be able to specify the required pointers in the mappings for the main chain.

In future, Graph Node will need to carefully handle ordering of entity updates to enable this case. File data sources are processed asynchronously, and entity updates may not take place in a predictable order, based on data availability. The clock in this case is provided by the main chain - if file data source A is generated at block 10 of the main chain, and file data source B is updated at block 15, and they both update the same entity, then data source B should over-write any data from data source A, even if B is processed before A. This also applies within blocks - if there are two file data sources created in block 15, the data source created later in the block should take precedence.

The current behaviour in Graph Node is that saving a new version of an entity is an upsert at the database level. This implementation introduces complications for file data sources. Assuming the following file data sources, created in the following order, with an upsert pattern:

```
1: A "userProfile { hat: red, pants: blue }" => entity (hat: red, pants: blue)
2: B "userProfile { pants: yellow }" => entity (hat: red, pants: yellow)
3: C "userProfile { pants: green }" => entity (hat: red, pants: green)
```

If the data sources are processed in the order they are created, then the existing approach works. However if, for whatever reason, A is processed after B and C, then it will require not one but three entity updates in the database (to populate the `hat: red` for all relevant entities). Similarly, if the file corresponding to A is marked as unavailable by the Availability Chain, all three entities are invalidated, though actually their `pants` fields remain valid. The existing pattern therefore creates the possibility of unbounded entity updates & reprocessing requirements.

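To illustrate why the main chain clock gives a stable result, each data source's update can be kept as a discrete record and folded in creation order, so the outcome does not depend on the order in which files happen to be processed. This is a minimal sketch and not Graph Node's storage model: the `FileUpdate` type, its `position` field (an assumed tie-breaker for creation order within a block) and `mergeUpdates` are hypothetical names.

```typescript
// Hypothetical types for illustration; not Graph Node's storage model.
type Fields = Record<string, string>;

interface FileUpdate {
  block: number;    // main chain block where the file data source was created
  position: number; // assumed creation order within that block
  fields: Fields;   // fields written by that data source's handler
}

// Fold discrete updates in main chain creation order; later creations win.
function mergeUpdates(updates: FileUpdate[]): Fields {
  const ordered = [...updates].sort(
    (x, y) => x.block - y.block || x.position - y.position
  );
  const merged: Fields = {};
  for (const update of ordered) {
    Object.assign(merged, update.fields);
  }
  return merged;
}

// The A, B, C example from the text, with block numbers chosen for illustration.
const a: FileUpdate = { block: 10, position: 0, fields: { hat: "red", pants: "blue" } };
const b: FileUpdate = { block: 15, position: 0, fields: { pants: "yellow" } };
const c: FileUpdate = { block: 15, position: 1, fields: { pants: "green" } };

// A is "processed" last here, yet the merged entity is the same as in creation order.
const lateA = mergeUpdates([b, c, a]);
const inOrder = mergeUpdates([a, b, c]);
```

Because each update stays a separate record, dropping one (for example when its file becomes unavailable) only requires re-running the fold over the remaining records.
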
This suggests that entities created by file data sources should not be upserted at the database level, but to remove this functionality would be quite a significant and counter-intuitive constraint on subgraph developers. Therefore an alternative pattern is required, where the file data source entity updates are discrete at the database level (and therefore simple to create and roll back), but are merged together when queried. This merging could take place at query time (which may have performance constraints), or there could be an additional "reduce" step during indexing.

### Proof of indexing

_[Placeholder: the proof of indexing will need to take into account both the main chain, and the Availability Chain]_

### Indexing status

The indexing status API will need to be updated to additionally report the Availability Chain syncing status, if an Availability Chain is configured. This can be passed as an additional `ChainIndexingStatus` on the `SubgraphIndexingStatus` under `chains`. `ChainIndexingStatus` can be augmented to indicate the chain "type":

```
ChainIndexingStatus {
  type: ChainType ## New field
  network: String!
  chainHeadBlock: Block
  earliestBlock: Block
  latestBlock: Block
  lastHealthyBlock: Block ## This may not be applicable to the Availability Chain
}

ChainType {
  AVAILABILITY
  BASE ## Terminology TBD
}
```

## Querying subgraphs using file-based data sources

At query time, Graph Node should be aware of whether the query is requesting data which includes file data source entities. In our example:

```graphql
## No file-based data source entities:
{
  User {
    id
    name
  }
}

## Including data from file-based entities:
{
  User {
    id
    name
    metadata {
      bio
    }
  }
}
```

- If a query requires data from a file data source, and Graph Node has an Availability Chain configured, the query will need to provide an Availability Chain block hash in addition to the Ethereum block hash.
- If no Availability Chain block hash is provided, the default will be the latest that Graph Node is aware of, similar to current treatment of Ethereum block hashes in Graph Node. Note that such a query is not deterministic.
- Graph Node may not be able to support all Availability Chain blocks, based on when it initiated indexing, and should refuse to serve queries if the stated Availability Chain block is out of range.

## Monitoring

For monitoring purposes, the subgraph will want to keep track of its file data sources, including but not limited to:

- Number of file data sources
- Number of file data sources where the file has been found and processed

## Restrictions and extensions

- Chain-based data sources cannot load entities created by file data sources.
- A file data source cannot load entities that it did not create itself. This might be relaxed in the future, since in principle it could load entities in their state at the moment of the data source creation, or it could even be implicitly re-created at each chain block on which the data it depends on changes, but for now we use the most restrictive semantics while we learn about the real-world use cases.
- It seems useful to allow the originating data source to remove a file data source, stopping it and closing the `chain_range` on all entities. This could be done as `dataSource.remove("dsName")`, which would also require passing a unique name when creating the data source. This is a future extension.
- File data sources cannot create other data sources, of any type. We plan on allowing this but it is left to a future GIP.
- Other kinds of file data sources such as `lines` or `directory` are future extensions.
- The use case of overwriting data from a new file is covered by the current design by simply spawning a new file data source which updates the resultant entities.
However some use cases may require reading from the previous file to update the data, in which case a file data source will need to handle more than one file. This is left as a future extension.
- Establish a simple declarative way to merge entities by ID, to improve developer experience when merging entities created by file data sources, and chain-based data sources.
- Allow file data sources to create further file data sources, to facilitate indexing of recursive files or directories.

# Backwards Compatibility

This is a breaking change as we intend to remove the current ipfs API in the `apiVersion` in which this is released. Existing subgraphs will be able to use the existing implementation under prior `apiVersions`, but subgraph authors will be encouraged to upgrade their implementation, as they may lose access to other new functionality if they do not do so. (In any case, this implementation is also expected to be more reliable, and more performant, so subgraph authors will want to upgrade).

# Dependencies

- Full implementation of the functionality described in this GIP depends on a separate GIP describing the Availability Chain, and its interaction with and implementation in Graph Node.
- There is also a shared syntax requirement with the Schema composition GIP: a declarative way to merge entities by ID.

# Risks and Security Considerations

There are considerable drawbacks and risks with this proposal, such as:

- Users finding it difficult to understand the API, or finding that it does not match their expectations or does not fit the data model for their dApp. This design is based on what should be sufficient to migrate our current IPFS subgraphs, but it is hard to predict whether it is a good fit for future patterns that may emerge in dApps. The extensibility of the proposal mitigates this risk. Good documentation will also be important.
- Data sources concurrently modifying the store is a significant change to the architecture.
The implementation risks of introducing concurrency need to be evaluated in the engineering plan.
- We are assuming that the pattern described will work for other data storage solutions, such as Arweave and Filecoin.
- The Availability Chain means we need to run and maintain our own chain. Even if it is a very simple one, it may turn out to be a bigger operational and development overhead than anticipated.

# Rationale and Alternatives

- Focus on supporting off-chain data use-cases, without targeting deterministic indexing. This is undesirable as it will mean this functionality will not be available on the decentralised network.
- An alternative design is to not consider Ethereum and Availability Chain times as irreconcilable, but use some sort of clock to sync them based on a correspondence between blocks, or even simply emit Availability Chain blocks with an associated Ethereum block hash and get rid of the concept of Availability Chain time entirely inside the Graph Node. This would provide more power to mix data from both chains. However, it is hard to generalize to other chains, since we don't want to privilege Ethereum in the protocol, and it requires the Availability Chain's block times to be effectively as slow as Ethereum's. In the future we could explore supporting sources of synchronization among blockchains, but we should not assume it from the start since this alternative has major tradeoffs.

# Copyright Waiver

Copyright and related rights waived via [CC0](https://creativecommons.org/publicdomain/zero/1.0/).
