## Querying metadata on Arweave with GraphQL
This document describes how to retrieve files from Arweave through a [GraphQL gateway](https://g8way.io/graphql).
First, based on the [specification](https://hackmd.io/w14DtWCKTme8xqLV0wAqVg), we generated metadata documents for more than 100k files on Arweave. Each metadata file [(example)](https://viewblock.io/en/arweave/tx/jFWRxCPLMwORazz7mowrvd9vrncYfnrGEgtXc7ObJvI) carries a set of tags, for example:
```
App-Name: dataos
Major-Language: en
Org-Entities: sushiswap
Org-Entities: polkadot-based
Org-Entities: defi
Org-Entities: curve
Org-Entities: dexes
Org-Entities: ethereum
Org-Entities: polkadot
Org-Entities: balancer
Org-Entities: layman
Org-Entities: uniswap
Org-Entities: omni
Org-Entities: hydradx
Org-Entities: bancor
Org-Entities: dot
Org-Entities: automated market makers
Gpe-Entities: defi
Product-Entities: substrate
Product-Entities: curve
Product-Entities: balancer
Product-Entities: polkadot
Product-Entities: hydradx
Date-Entities: early days
Date-Entities: the early days
Date-Entities: the end of 2021
Money-Entities: 10
Keywords: polkadot interoperable
Keywords: substrate hydradx
Keywords: polkadot network
Keywords: defi ethereum
Keywords: substrate basically
Categories: sports
Content-Hash: 2659049277
Alpha-Char-Count: 6065
Total-Char-Count: 7596
Non-Alpha-Char-Count: 1531
Non-Alpha-Ratio: 0.20155344918378093
Alpha-Ratio: 0.7984465508162191
Related-To: S8aCGjC_9hq6Bpwh8MXeBdCFcMyF4n1IB0BqAkza_nU
Created-Ts: 2024-01-10T20:21:18.169895
Alpha-Token-Count: more than 500
License: yRj4a5KMctX_uOmKWCFJIjmY8DeJcusVk6-HzLiM_t8
License-Fee: One-Time-1
```
Based on these tags, one can use the GraphQL API for content discovery.
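
As a minimal sketch of how such a tag list can be handled in code, the snippet below groups repeated tag names into lists of values. It assumes tags arrive as a list of `name`/`value` pairs, which is the shape requested by the queries later in this document. The helper name `group_tags` and the short sample list are illustrative only.

```python
from collections import defaultdict


def group_tags(tags):
    """Group a list of {"name", "value"} tag pairs into {name: [values]}."""
    grouped = defaultdict(list)
    for tag in tags:
        grouped[tag["name"]].append(tag["value"])
    return dict(grouped)


# A short excerpt of the tag listing above, written as name/value pairs
sample_tags = [
    {"name": "App-Name", "value": "dataos"},
    {"name": "Product-Entities", "value": "substrate"},
    {"name": "Product-Entities", "value": "curve"},
    {"name": "Categories", "value": "sports"},
]

grouped = group_tags(sample_tags)
print(grouped["Product-Entities"])  # ['substrate', 'curve']
```

Grouping by name keeps multi-valued tags such as `Org-Entities` or `Keywords` together, which is convenient when they are later turned into query filters.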
## Examples of GraphQL queries
There are several ways to query files by their metadata on the [GraphQL](https://g8way.io/graphql) gateway. Each metadata file has an `App-Name: dataos` tag and a `Related-To` tag that points to the file from which the metadata was generated. A few example queries follow:
- Long text documents with the person "Elon Musk" mentioned
```graphql
query {
  transactions(
    first: 100,
    tags: [
      { name: "App-Name", values: ["dataos"] },
      { name: "Person-Entities", values: ["elon musk"] },
      { name: "Alpha-Token-Count", values: [
        "100 to 300",
        "300 to 500",
        "more than 500"
      ] }
    ]
  ) {
    edges {
      node {
        id
        owner {
          address
        }
        tags {
          name
          value
        }
      }
    }
  }
}
```
- Long text documents in the category "sports"
```graphql
query {
  transactions(
    first: 100,
    tags: [
      { name: "App-Name", values: ["dataos"] },
      { name: "Categories", values: ["sports"] },
      { name: "Alpha-Token-Count", values: [
        "100 to 300",
        "300 to 500",
        "more than 500"
      ] }
    ]
  ) {
    edges {
      node {
        id
        owner {
          address
        }
        tags {
          name
          value
        }
      }
    }
  }
}
```
- Large documents that mention the organisation "Microsoft"
```graphql
query {
  transactions(
    first: 100,
    tags: [
      { name: "App-Name", values: ["dataos"] },
      { name: "Org-Entities", values: ["microsoft"] },
      { name: "Alpha-Token-Count", values: ["more than 500"] }
    ]
  ) {
    edges {
      node {
        id
        owner {
          address
        }
        tags {
          name
          value
        }
      }
    }
  }
}
```
Any tag from the [specification](https://hackmd.io/w14DtWCKTme8xqLV0wAqVg) can be used in queries.
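
The example queries above differ only in their tag filters, so the query text can be generated from those filters. The following is a sketch and not part of the original tooling: `build_tag_query` and `QUERY_TEMPLATE` are hypothetical names, and the helper relies on `json.dumps` to produce double-quoted strings, which is assumed to be adequate for plain text values like the ones shown here.

```python
import json

# Same query shape as the examples above, with placeholders for the filters
QUERY_TEMPLATE = """
query {
  transactions(
    first: %(first)d,
    tags: %(tags)s
  ) {
    edges {
      node {
        id
        owner { address }
        tags { name value }
      }
    }
  }
}
"""


def build_tag_query(filters, first=100):
    """Render {tag name: [accepted values]} as a transactions query string."""
    tag_filters = ", ".join(
        "{ name: %s, values: %s }" % (json.dumps(name), json.dumps(values))
        for name, values in filters.items()
    )
    return QUERY_TEMPLATE % {"first": first, "tags": "[" + tag_filters + "]"}


query_text = build_tag_query({
    "App-Name": ["dataos"],
    "Org-Entities": ["microsoft"],
    "Alpha-Token-Count": ["more than 500"],
})
print(query_text)
```

The resulting string can be sent to the gateway with any GraphQL client. Listing several values under one tag name, as the `Alpha-Token-Count` filters in the examples do, widens the match for that tag.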
## Example of usage for recommendation
Using tags, it is possible to find documents on Arweave that are relevant to a given file of interest.
```python
# pip install "gql[all]"
import json

from gql import gql, Client
from gql.transport.aiohttp import AIOHTTPTransport

transport = AIOHTTPTransport(url="https://g8way.io/graphql/")
client = Client(transport=transport, fetch_schema_from_transport=True)

# Query for a document with a specific ID (here, the example metadata file)
doc_query = gql("""
query {
  transactions(
    ids: ["jFWRxCPLMwORazz7mowrvd9vrncYfnrGEgtXc7ObJvI"]
  ) {
    edges {
      node {
        id
        owner {
          address
        }
        tags {
          name
          value
        }
      }
    }
  }
}
""")
result = client.execute(doc_query)

# Keep only the tags that are used to find related documents
tags = result['transactions']['edges'][0]['node']['tags']
tags = [t for t in tags if t['name'] in ['Org-Entities', 'Product-Entities', 'Keywords']]

# Group the values by tag name (every value is kept, including the first one)
tags_dict = {}
for t in tags:
    tags_dict.setdefault(t['name'], []).append(t['value'])

# Render the groups as a GraphQL list of tag filters
tags_list = []
for k, v in tags_dict.items():
    tags_list.append(f'{{name: "{k}", values: {json.dumps(v)}}}')
tags_str = f"[{','.join(tags_list)}]"
print(tags_str)
# '[{name: "Org-Entities", values: ["sushiswap", "polkadot-based", "defi", "curve", "dexes", "ethereum", "polkadot", "balancer", "layman", "uniswap", "omni", "hydradx", "bancor", "dot", "automated market makers"]},{name: "Product-Entities", values: ["substrate", "curve", "balancer", "polkadot", "hydradx"]},{name: "Keywords", values: ["polkadot interoperable", "substrate hydradx", "polkadot network", "defi ethereum", "substrate basically"]}]'

# Build a query to select similar records
tags_query = gql(
    """
query {
  transactions(
    first: 100,
    tags: %s
  ) {
    edges {
      node {
        id
        owner {
          address
        }
        tags {
          name
          value
        }
      }
    }
  }
}
""" % tags_str
)
related_docs = client.execute(tags_query)
print(related_docs)
```
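
The query above returns matching documents without ordering them by similarity. One possible follow-up, sketched here as an assumption rather than as part of the original workflow, is to rank the results by how many name/value tag pairs they share with the source document. `rank_by_tag_overlap`, `tag_pairs`, and `MATCH_TAGS` are hypothetical names, and the sample response only mimics the `edges`/`node`/`tags` shape used above; its IDs and tags are made up for illustration.

```python
MATCH_TAGS = ("Org-Entities", "Product-Entities", "Keywords")


def tag_pairs(node):
    """Return the set of (name, value) pairs of a node for the matched tags."""
    return {(t["name"], t["value"]) for t in node["tags"] if t["name"] in MATCH_TAGS}


def rank_by_tag_overlap(source_node, response):
    """Sort candidate nodes by the number of tag pairs shared with the source."""
    source_pairs = tag_pairs(source_node)
    scored = []
    for edge in response["transactions"]["edges"]:
        node = edge["node"]
        if node["id"] == source_node["id"]:
            continue  # skip the source document itself
        scored.append((len(source_pairs & tag_pairs(node)), node["id"]))
    # Highest overlap first; ties are ordered by ID for stable output
    return sorted(scored, key=lambda item: (-item[0], item[1]))


# Made-up sample data, for illustration only
source = {"id": "SRC", "tags": [
    {"name": "Org-Entities", "value": "polkadot"},
    {"name": "Keywords", "value": "polkadot network"},
    {"name": "Product-Entities", "value": "substrate"},
]}
sample_response = {"transactions": {"edges": [
    {"node": {"id": "SRC", "tags": source["tags"]}},
    {"node": {"id": "DOC-A", "tags": [
        {"name": "Org-Entities", "value": "polkadot"},
        {"name": "Categories", "value": "sports"},
    ]}},
    {"node": {"id": "DOC-B", "tags": [
        {"name": "Org-Entities", "value": "polkadot"},
        {"name": "Product-Entities", "value": "substrate"},
    ]}},
]}}

ranking = rank_by_tag_overlap(source, sample_response)
print(ranking)  # [(2, 'DOC-B'), (1, 'DOC-A')]
```

With real data, the source node would be `result['transactions']['edges'][0]['node']` and the response would be `related_docs` from the example above.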
## To sum it up
In general, the GraphQL gateway, together with the metadata files on Arweave, can be used for content discovery in large data lakes. In the future, the metadata format will be reviewed and improved.