Arweave metadata tutorial

## Querying metadata on Arweave with GraphQL

This document describes how to retrieve files from Arweave using the [GraphQL gateway](https://g8way.io/graphql).

First, based on the [specification](https://hackmd.io/w14DtWCKTme8xqLV0wAqVg), we generated metadata documents for more than 100,000 files on Arweave. Each metadata file [(example)](https://viewblock.io/en/arweave/tx/jFWRxCPLMwORazz7mowrvd9vrncYfnrGEgtXc7ObJvI) carries a set of tags, for example:

```
App-Name: dataos
Major-Language: en
Org-Entities: sushiswap
Org-Entities: polkadot-based
Org-Entities: defi
Org-Entities: curve
Org-Entities: dexes
Org-Entities: ethereum
Org-Entities: polkadot
Org-Entities: balancer
Org-Entities: layman
Org-Entities: uniswap
Org-Entities: omni
Org-Entities: hydradx
Org-Entities: bancor
Org-Entities: dot
Org-Entities: automated market makers
Gpe-Entities: defi
Product-Entities: substrate
Product-Entities: curve
Product-Entities: balancer
Product-Entities: polkadot
Product-Entities: hydradx
Date-Entities: early days
Date-Entities: the early days
Date-Entities: the end of 2021
Money-Entities: 10
Keywords: polkadot interoperable
Keywords: substrate hydradx
Keywords: polkadot network
Keywords: defi ethereum
Keywords: substrate basically
Categories: sports
Content-Hash: 2659049277
Alpha-Char-Count: 6065
Total-Char-Count: 7596
Non-Alpha-Char-Count: 1531
Non-Alpha-Ratio: 0.20155344918378093
Alpha-Ratio: 0.7984465508162191
Related-To: S8aCGjC_9hq6Bpwh8MXeBdCFcMyF4n1IB0BqAkza_nU
Created-Ts: 2024-01-10T20:21:18.169895
Alpha-Token-Count: more than 500
License: yRj4a5KMctX_uOmKWCFJIjmY8DeJcusVk6-HzLiM_t8
License-Fee: One-Time-1
```

Based on these tags, one can use the GraphQL API for content discovery.

## Examples of GraphQL queries

There are several ways to query files by their metadata through the [GraphQL](https://g8way.io/graphql) gateway.
Each metadata file has an `App-Name: dataos` tag and a `Related-To` tag that points to the file from which the metadata was generated. Below are a few example queries:

- Long text documents that mention the person "Elon Musk"

```graphql
query {
  transactions(
    first: 100,
    tags: [
      { name: "App-Name", values: ["dataos"] },
      { name: "Person-Entities", values: ["elon musk"] },
      { name: "Alpha-Token-Count", values: ["100 to 300", "300 to 500", "more than 500"] },
    ]
  ) {
    edges {
      node {
        id
        owner { address }
        tags { name value }
      }
    }
  }
}
```

- Long text documents in the category "sports"

```graphql
query {
  transactions(
    first: 100,
    tags: [
      { name: "App-Name", values: ["dataos"] },
      { name: "Categories", values: ["sports"] },
      { name: "Alpha-Token-Count", values: ["100 to 300", "300 to 500", "more than 500"] },
    ]
  ) {
    edges {
      node {
        id
        owner { address }
        tags { name value }
      }
    }
  }
}
```

- Large documents that mention the organisation "Microsoft"

```graphql
query {
  transactions(
    first: 100,
    tags: [
      { name: "App-Name", values: ["dataos"] },
      { name: "Org-Entities", values: ["microsoft"] },
      { name: "Alpha-Token-Count", values: ["more than 500"] },
    ]
  ) {
    edges {
      node {
        id
        owner { address }
        tags { name value }
      }
    }
  }
}
```

Any tag from the [specification](https://hackmd.io/w14DtWCKTme8xqLV0wAqVg) can be used in queries.

## Example of usage for recommendation

Using tags, it is possible to find relevant documents on Arweave given one file of interest.
```python
# pip install "gql[all]"
import json

from gql import gql, Client
from gql.transport.aiohttp import AIOHTTPTransport

transport = AIOHTTPTransport(url="https://g8way.io/graphql/")
client = Client(transport=transport, fetch_schema_from_transport=True)

# Query for a document with a specific ID
doc_query = gql("""
query {
  transactions(
    ids: ["jFWRxCPLMwORazz7mowrvd9vrncYfnrGEgtXc7ObJvI"]  # example document
  ) {
    edges {
      node {
        id
        owner { address }
        tags { name value }
      }
    }
  }
}
""")
result = client.execute(doc_query)

# Keep only the tags that are used to find related documents
tags = result['transactions']['edges'][0]['node']['tags']
tags = [t for t in tags if t['name'] in ['Org-Entities', 'Product-Entities', 'Keywords']]

# Group tag values by tag name (the first value of each tag is kept as well)
tags_dict = {}
for t in tags:
    tags_dict.setdefault(t['name'], []).append(t['value'])

# Render the grouped tags as a GraphQL tag filter
tags_list = []
for k, v in tags_dict.items():
    tags_list.append(f'{{name: "{k}", values: {json.dumps(v)}}}')
tags_str = f"[{','.join(tags_list)}]"
print(tags_str)
# For the example document this prints something like:
# [{name: "Org-Entities", values: ["sushiswap", "polkadot-based", "defi", "curve", "dexes", "ethereum", "polkadot", "balancer", "layman", "uniswap", "omni", "hydradx", "bancor", "dot", "automated market makers"]},{name: "Product-Entities", values: ["substrate", "curve", "balancer", "polkadot", "hydradx"]},{name: "Keywords", values: ["polkadot interoperable", "substrate hydradx", "polkadot network", "defi ethereum", "substrate basically"]}]

# Query to select similar records
tags_query = gql(
    """
query {
  transactions(
    first: 100,
    tags: %s
  ) {
    edges {
      node {
        id
        owner { address }
        tags { name value }
      }
    }
  }
}
"""
    % tags_str
)
related_docs = client.execute(tags_query)
print(related_docs)
```

## To sum it up

In general, the GraphQL gateway together with metadata files on Arweave can be used for content discovery in large data lakes. In the future, the metadata format will be reviewed and improved.
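
As a follow-up to the recommendation example above: the response contains matching documents and their tags but no relevance score, so one possible next step is to order the matches by how many tag values they share with the source document. The sketch below is only an illustration under that assumption. `tag_pairs` and `rank_by_overlap` are hypothetical helper names, not part of `gql` or the gateway, and the input is assumed to have the same shape as the responses in the example.

```python
# Tag names compared between documents (the same ones used in the example above).
COMPARED_TAGS = ("Org-Entities", "Product-Entities", "Keywords")


def tag_pairs(node, names=COMPARED_TAGS):
    """Return the set of (name, value) pairs of a node for the given tag names."""
    return {(t["name"], t["value"]) for t in node["tags"] if t["name"] in names}


def rank_by_overlap(source_node, result, names=COMPARED_TAGS):
    """Hypothetical helper: score each returned node by the number of
    (name, value) tag pairs it shares with source_node, highest first."""
    source = tag_pairs(source_node, names)
    scored = []
    for edge in result["transactions"]["edges"]:
        node = edge["node"]
        if node["id"] == source_node["id"]:
            continue  # skip the source document itself
        scored.append((len(source & tag_pairs(node, names)), node["id"]))
    # Highest overlap first; ties are broken by id to keep the order stable.
    return sorted(scored, key=lambda pair: (-pair[0], pair[1]))
```

With the variables from the example, this could be called as `rank_by_overlap(result['transactions']['edges'][0]['node'], related_docs)`, which returns a list of `(shared_tag_count, transaction_id)` tuples.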