# Treekipedia Data Integration with Supabase, Blazegraph, DVC, and Git
This document outlines a workflow for integrating a large (100GB+) tree occurrences database into a graph database (Blazegraph/GraphDB), using **Supabase for staging and real-time data collection**, **DVC for version control**, and **Git for metadata tracking**. The pipeline ensures scalability, reproducibility, and flexibility.
---
## 1. Data Storage & Version Control
### 1.1 Git vs. Data Version Control
For a dataset of 100GB+, using **plain Git** is impractical. Instead, use one of:
- **Git LFS** (Large File Storage) for handling large files in Git.
- **DVC** (Data Version Control), which integrates with Git but stores large files remotely (e.g., DigitalOcean Spaces). The rest of this workflow uses DVC.
### 1.2 Setting Up DVC With DigitalOcean
1. **Create a DigitalOcean Space** for large files.
2. **Install DVC** and initialize it in your Git repository.
```bash
pip install "dvc[s3]"   # the S3 backend also covers DigitalOcean Spaces
git init
dvc init
```
3. Add a remote (DigitalOcean Spaces) to DVC. The endpoint and credentials are set with `dvc remote modify`; use `--local` so the keys stay out of Git:
```bash
dvc remote add -d myremote s3://<your_space_name>/<bucket_path>
dvc remote modify myremote endpointurl https://<region>.digitaloceanspaces.com
dvc remote modify --local myremote access_key_id <DO_ACCESS_KEY>
dvc remote modify --local myremote secret_access_key <DO_SECRET_KEY>
```
4. Add CSVs to DVC:
```bash
dvc add data/*.csv
git add data/.gitignore data/*.dvc
git commit -m "Add CSV data to DVC"
dvc push
```
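With the remote configured, collaborators can reproduce the dataset from a fresh clone (the repository URL below is a placeholder):
```bash
git clone https://github.com/<org>/treekipedia.git && cd treekipedia
dvc pull   # fetch the DVC-tracked CSVs from DigitalOcean Spaces
```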
---
## 2. Ingestion & Graph Database Setup
### 2.1 Graph Model for Treekipedia
Your RDF-based structure (Blazegraph/GraphDB) should model:
- **Taxonomy**: species, genus, family.
- **Location**: latitude, longitude, country.
- **Observations**: timestamp, observer, source.
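For reference, a single occurrence rendered in Turtle might look like this (the `ex:` namespace matches the conversion script below; the extra properties and values are illustrative):
```turtle
@prefix ex: <http://example.org/tree/> .
@prefix xsd: <http://www.w3.org/2001/XMLSchema#> .

ex:tree_1842 a ex:Tree ;
    ex:species "Quercus robur" ;
    ex:genus "Quercus" ;
    ex:family "Fagaceae" ;
    ex:latitude 45.0 ;
    ex:longitude -1.2 ;
    ex:observedAt "2024-06-01T10:00:00Z"^^xsd:dateTime .
```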
### 2.2 Transforming CSV to RDF
Use Python (`rdflib`) to convert CSVs into RDF triples:
```python
import csv

from rdflib import Graph, Literal, Namespace, URIRef
from rdflib.namespace import RDF

def csv_to_rdf(input_csv, output_ttl):
    g = Graph()
    EX = Namespace("http://example.org/tree/")
    g.bind("ex", EX)
    with open(input_csv, newline="") as f:
        # DictReader yields one dict per row, keyed by the CSV header
        for row in csv.DictReader(f):
            tree_uri = URIRef(EX + "tree_" + row["id"])
            g.add((tree_uri, RDF.type, EX.Tree))
            g.add((tree_uri, EX.species, Literal(row["species"])))
    g.serialize(destination=output_ttl, format="turtle")
```
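For example, to convert the export produced in section 3.2:
```python
csv_to_rdf("data/exported/tree_data.csv", "data/processed/tree_data.ttl")
```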
### 2.3 Bulk Loading RDF into Blazegraph
Load RDF with Blazegraph's bundled `DataLoader`. It is launched with `-cp` (the jar's default main class is the SPARQL server, not the loader) and expects a journal properties file followed by the files or directories to load; exact flags vary slightly between Blazegraph versions:
```bash
java -server -Xmx4g -cp blazegraph.jar \
  com.bigdata.rdf.store.DataLoader \
  -defaultGraph http://example.org \
  fastload.properties data/processed/
```
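A minimal `fastload.properties` might look like the following sketch (property names are taken from the stock Blazegraph samples; `quads=true` enables the named graphs used in section 4.2):
```properties
com.bigdata.journal.AbstractJournal.file=blazegraph.jnl
com.bigdata.rdf.store.AbstractTripleStore.quads=true
com.bigdata.rdf.store.AbstractTripleStore.textIndex=false
```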
---
## 3. Using Supabase in the Workflow
Supabase can serve multiple roles:
1. **Real-time data collection**: use Supabase as an API backend for collecting new tree observations from users.
2. **Staging database**: store new data temporarily, clean it, and export it periodically before Blazegraph ingestion.
3. **Subset management**: host partial datasets for easier querying before full integration into Blazegraph.
### 3.1 Storing Crowdsourced Data in Supabase
1. Set up a Supabase table (Postgres-based):
```sql
CREATE TABLE tree_occurrences (
id SERIAL PRIMARY KEY,
species TEXT,
latitude FLOAT,
longitude FLOAT,
observed_at TIMESTAMP DEFAULT now()
);
```
2. Insert new records via the REST API (a required access policy and a Python equivalent are sketched after this step):
```bash
curl -X POST "https://your-supabase-url/rest/v1/tree_occurrences" \
-H "apikey: YOUR_SUPABASE_KEY" \
-H "Authorization: Bearer YOUR_ACCESS_TOKEN" \
-H "Content-Type: application/json" \
-d '{
"species": "Quercus robur",
"latitude": 45.0,
"longitude": -1.2
}'
```
### 3.2 Exporting Data from Supabase for Versioning
1. Export as CSV for DVC tracking. On hosted Supabase the database server's filesystem is not accessible, so use `psql`'s client-side `\copy` rather than a server-side `COPY ... TO`:
```bash
psql "$DATABASE_URL" -c "\copy tree_occurrences TO 'data/exported/tree_data.csv' WITH CSV HEADER"
```
2. Add the exported CSV to DVC:
```bash
dvc add data/exported/tree_data.csv
dvc push
```
3. Transform the exported CSV to RDF (same as in section 2.2).
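The pipeline in section 5 runs this export as `src/export_supabase.py`. A minimal sketch using `psycopg2`, assuming a `DATABASE_URL` environment variable holding the Supabase Postgres connection string:
```python
import os
import sys

import psycopg2

def export_csv(output_path):
    # Connect to the Supabase-hosted Postgres instance
    conn = psycopg2.connect(os.environ["DATABASE_URL"])
    with conn, conn.cursor() as cur, open(output_path, "w") as f:
        # copy_expert streams the table to the client, like psql's \copy
        cur.copy_expert("COPY tree_occurrences TO STDOUT WITH CSV HEADER", f)
    conn.close()

if __name__ == "__main__":
    export_csv(sys.argv[1])
```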
---
## 4. Versioning Blazegraph Data with Git & DVC
### 4.1 Storing RDF versions in DVC
Since Blazegraph does not have built-in versioning, store RDF snapshots in DVC:
```bash
dvc add data/processed/*.ttl
git add data/processed/.gitignore data/processed/*.dvc
git commit -m "Versioned RDF data"
dvc push
```
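To roll back to an earlier snapshot, check out the corresponding `.dvc` files from Git and let DVC restore the matching data (`<rev>` is any Git revision):
```bash
git checkout <rev> -- data/processed/
dvc checkout data/processed/
```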
### 4.2 Incremental Graph Updates with Named Graphs
Instead of reloading the entire dataset, load each version into a separate named graph in Blazegraph (here `blazegraph-load` stands in for the `DataLoader` invocation from section 2.3, with `-defaultGraph` set per version):
```bash
blazegraph-load --graph http://example.org/v1 data/processed/v1.ttl
blazegraph-load --graph http://example.org/v2 data/processed/v2.ttl
```
To compare versions (for example, to find trees whose species assignment changed), bind a separate variable per graph and filter on the pair:
```sparql
SELECT ?tree ?species_v1 ?species_v2
WHERE {
  GRAPH <http://example.org/v1> {
    ?tree a <http://example.org/tree/Tree> ;
          <http://example.org/tree/species> ?species_v1 .
  }
  GRAPH <http://example.org/v2> {
    ?tree a <http://example.org/tree/Tree> ;
          <http://example.org/tree/species> ?species_v2 .
  }
  FILTER(?species_v1 != ?species_v2)
}
```
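Occurrences added in v2 can be listed the same way with `FILTER NOT EXISTS`:
```sparql
SELECT ?tree
WHERE {
  GRAPH <http://example.org/v2> { ?tree a <http://example.org/tree/Tree> . }
  FILTER NOT EXISTS {
    GRAPH <http://example.org/v1> { ?tree a <http://example.org/tree/Tree> . }
  }
}
```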
---
## 5. Automating the Pipeline
### 5.1 Using dvc.yaml to Automate Updates
Define a pipeline in `dvc.yaml`:
```yaml
stages:
  export_from_supabase:
    cmd: python src/export_supabase.py data/exported/tree_data.csv
    deps:
      - src/export_supabase.py
    outs:
      - data/exported/tree_data.csv
  transform_to_rdf:
    cmd: python src/convert_to_rdf.py data/exported/tree_data.csv data/processed/tree_data.ttl
    deps:
      - src/convert_to_rdf.py
      - data/exported/tree_data.csv
    outs:
      - data/processed/tree_data.ttl
  load_into_blazegraph:
    cmd: blazegraph-load --graph http://example.org/latest data/processed/tree_data.ttl
    deps:
      - data/processed/tree_data.ttl
```
Then execute:
```bash
dvc repro
```
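After a successful run, commit the updated lock file and push the regenerated outputs so the new version is reproducible elsewhere:
```bash
git add dvc.yaml dvc.lock
git commit -m "Refresh pipeline outputs"
dvc push   # upload new outputs to DigitalOcean Spaces
```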
---
## 6. Summary
**Key takeaways:**
- **Supabase** is used as a real-time collection platform and API.
- **DVC** ensures reproducibility and version control.
- **Blazegraph** provides queryable RDF storage.
- **Automated DVC pipelines** keep the data consistent.

This workflow enables scalable, versioned, and queryable tree biodiversity data for Treekipedia, integrating real-time data collection, knowledge graphs, and blockchain.