# Treekipedia Data Integration with Supabase, Blazegraph, DVC, and Git

This document outlines a workflow for integrating a large (100GB+) tree occurrences database into a graph database (Blazegraph/GraphDB), using **Supabase for staging and real-time data collection**, **DVC for version control**, and **Git for metadata tracking**. The pipeline is designed for scalability, reproducibility, and flexibility.

---

## 1. Data Storage & Version Control

### 1.1 Git vs. Data Version Control

For a dataset of 100GB+, using **plain Git** is impractical. Instead, use:

- **Git LFS** (Large File Storage) for handling large files in Git.
- **DVC** (Data Version Control), which integrates with Git but stores large files remotely (e.g., in DigitalOcean Spaces).

### 1.2 Setting Up DVC with DigitalOcean

1. **Create a DigitalOcean Space** for large files.

2. **Install DVC** and initialize it in your Git repository:

   ```bash
   pip install "dvc[s3]"   # the s3 extra is needed for DigitalOcean Spaces
   git init
   dvc init
   ```

3. Add a remote (DigitalOcean Spaces) to DVC. Endpoint and credentials are set with `dvc remote modify` rather than flags on `dvc remote add`:

   ```bash
   dvc remote add -d myremote s3://<your_space_name>/<bucket_path>
   dvc remote modify myremote endpointurl https://<region>.digitaloceanspaces.com
   # Keep credentials in the local (untracked) config so they stay out of Git
   dvc remote modify --local myremote access_key_id <DO_ACCESS_KEY>
   dvc remote modify --local myremote secret_access_key <DO_SECRET_KEY>
   ```

4. Add CSVs to DVC:

   ```bash
   dvc add data/*.csv
   git add data/.gitignore data/*.dvc
   git commit -m "Add CSV data to DVC"
   dvc push
   ```

---

## 2. Ingestion & Graph Database Setup

### 2.1 Graph Model for Treekipedia

Your RDF-based structure (Blazegraph/GraphDB) should model:

- **Taxonomy**: species, genus, family.
- **Location**: latitude, longitude, country.
- **Observations**: timestamp, observer, source.

### 2.2 Transforming CSV to RDF

Use Python (rdflib) to convert CSVs into RDF triples:

```python
import csv

from rdflib import Graph, Literal, Namespace, URIRef
from rdflib.namespace import RDF

def csv_to_rdf(input_csv, output_ttl):
    g = Graph()
    EX = Namespace("http://example.org/tree/")
    g.bind("ex", EX)
    with open(input_csv, newline="") as f:
        # DictReader yields one dict per row, keyed by the CSV header
        for row in csv.DictReader(f):
            tree_uri = URIRef(EX + "tree_" + row["id"])
            g.add((tree_uri, RDF.type, EX.Tree))
            g.add((tree_uri, EX.species, Literal(row["species"])))
    g.serialize(destination=output_ttl, format="turtle")
```

### 2.3 Bulk Loading RDF into Blazegraph

Load RDF into Blazegraph with the bundled `DataLoader`. Note that it is invoked via `-cp` rather than `-jar`, and expects a journal properties file (such as the `RWStore.properties` shipped with Blazegraph) before the files to load:

```bash
java -server -Xmx4g -cp blazegraph.jar \
  com.bigdata.rdf.store.DataLoader \
  -defaultGraph http://example.org \
  RWStore.properties data/processed/*.ttl
```

---

## 3. Using Supabase in the Workflow

Supabase can serve multiple roles:

1. **Real-time data collection.** Use Supabase as an API backend for collecting new tree observations from users.
2. **Staging database before Blazegraph ingestion.** Store new data temporarily, clean it, and export it periodically.
3. **Subset management.** Host partial datasets for easier querying before full integration into Blazegraph.

### 3.1 Storing Crowdsourced Data in Supabase

1. Set up a Supabase table (Postgres-based):

   ```sql
   CREATE TABLE tree_occurrences (
       id SERIAL PRIMARY KEY,
       species TEXT,
       latitude FLOAT,
       longitude FLOAT,
       observed_at TIMESTAMP DEFAULT now()
   );
   ```

2. Insert new records via the REST API:

   ```bash
   curl -X POST "https://your-supabase-url/rest/v1/tree_occurrences" \
     -H "apikey: YOUR_SUPABASE_KEY" \
     -H "Authorization: Bearer YOUR_ACCESS_TOKEN" \
     -H "Content-Type: application/json" \
     -d '{ "species": "Quercus robur", "latitude": 45.0, "longitude": -1.2 }'
   ```

### 3.2 Exporting Data from Supabase for Versioning

1. Export as CSV for DVC tracking:

   ```sql
   COPY tree_occurrences TO '/tmp/tree_data.csv' WITH CSV HEADER;
   ```

   Note that `COPY ... TO '<file>'` writes on the database server, which is not accessible on hosted Supabase; run the export client-side instead (for example with psql's `\copy`, or a script like the one sketched at the end of this section).

2. Add the exported CSV to DVC:

   ```bash
   dvc add data/exported/tree_data.csv
   dvc push
   ```

3. Transform the exported CSV to RDF (same as in Section 2.2).
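The pipeline in Section 5 invokes `src/export_supabase.py` for this export step. Below is a minimal sketch of what that script could look like, assuming you connect with the project's direct Postgres connection string (read here from a hypothetical `SUPABASE_DB_URL` environment variable) and use psycopg2's client-side `COPY`:

```python
# src/export_supabase.py -- minimal sketch of a client-side CSV export.
# Assumes the Supabase project's direct Postgres connection string is
# available in the (hypothetical) SUPABASE_DB_URL environment variable.
import os
import sys

import psycopg2

def export_table(output_csv):
    conn = psycopg2.connect(os.environ["SUPABASE_DB_URL"])
    try:
        with conn.cursor() as cur, open(output_csv, "w") as f:
            # Client-side COPY: streams the table into a local file, which
            # works even though the hosted server's filesystem is not
            # accessible.
            cur.copy_expert("COPY tree_occurrences TO STDOUT WITH CSV HEADER", f)
    finally:
        conn.close()

if __name__ == "__main__":
    export_table(sys.argv[1])  # e.g. data/exported/tree_data.csv
```

From there, `dvc add` and `dvc push` version the export exactly as in step 2 above.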
---

## 4. Versioning Blazegraph Data with Git & DVC

### 4.1 Storing RDF Versions in DVC

Since Blazegraph does not have built-in versioning, store RDF snapshots in DVC:

```bash
dvc add data/processed/*.ttl
git add data/processed/.gitignore data/processed/*.dvc
git commit -m "Versioned RDF data"
dvc push
```

### 4.2 Incremental Graph Updates with Named Graphs

Instead of reloading the entire dataset, load each dataset version into a separate named graph in Blazegraph:

```bash
# "blazegraph-load" stands in for your loader wrapper, e.g. a script around
# the DataLoader invocation from Section 2.3 pointed at the target named graph
blazegraph-load --graph http://example.org/v1 data/processed/v1.ttl
blazegraph-load --graph http://example.org/v2 data/processed/v2.ttl
```

To compare two versions, bind the species from each named graph to its own variable and filter for differences:

```sparql
SELECT ?tree ?species_v1 ?species_v2
WHERE {
  GRAPH <http://example.org/v1> {
    ?tree a <http://example.org/tree/Tree> ;
          <http://example.org/tree/species> ?species_v1 .
  }
  GRAPH <http://example.org/v2> {
    ?tree a <http://example.org/tree/Tree> ;
          <http://example.org/tree/species> ?species_v2 .
  }
  FILTER(?species_v1 != ?species_v2)
}
```

---

## 5. Automating the Pipeline

### 5.1 Using dvc.yaml to Automate Updates

Define a pipeline in `dvc.yaml`:

```yaml
stages:
  export_from_supabase:
    cmd: python src/export_supabase.py data/exported/tree_data.csv
    deps:
      - src/export_supabase.py
    outs:
      - data/exported/tree_data.csv
  transform_to_rdf:
    cmd: python src/convert_to_rdf.py data/exported/tree_data.csv data/processed/tree_data.ttl
    deps:
      - src/convert_to_rdf.py
      - data/exported/tree_data.csv
    outs:
      - data/processed/tree_data.ttl
  load_into_blazegraph:
    cmd: blazegraph-load --graph http://example.org/latest data/processed/tree_data.ttl
    deps:
      - data/processed/tree_data.ttl
```

Then execute:

```bash
dvc repro
```

---

## 6. Summary

| Key Component | Main Takeaway |
| --- | --- |
| Supabase | Real-time collection platform and API |
| DVC | Reproducibility and version control |
| Blazegraph | Queryable RDF storage |
| Automated workflows | Data consistency |

This workflow enables scalable, versioned, and queryable tree biodiversity data for Treekipedia, integrating real-time data collection with a versioned knowledge graph.
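Once data is loaded, downstream consumers can query it over Blazegraph's SPARQL HTTP endpoint using the standard SPARQL 1.1 protocol. A minimal sketch, assuming Blazegraph's default `kb` namespace on localhost and the `http://example.org/latest` graph produced by the pipeline above:

```python
# query_blazegraph.py -- minimal sketch of querying the loaded data via the
# SPARQL 1.1 protocol. The endpoint URL assumes Blazegraph's default "kb"
# namespace on localhost; adjust to your deployment.
import requests

ENDPOINT = "http://localhost:9999/blazegraph/namespace/kb/sparql"

QUERY = """
PREFIX ex: <http://example.org/tree/>
SELECT ?tree ?species
WHERE {
  GRAPH <http://example.org/latest> {
    ?tree a ex:Tree ;
          ex:species ?species .
  }
}
LIMIT 10
"""

def run_query(query):
    # POST the query and request JSON results
    resp = requests.post(
        ENDPOINT,
        data={"query": query},
        headers={"Accept": "application/sparql-results+json"},
    )
    resp.raise_for_status()
    return resp.json()["results"]["bindings"]

if __name__ == "__main__":
    for row in run_query(QUERY):
        print(row["tree"]["value"], row["species"]["value"])
```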