# Understanding Dagster + DLT + DuckDB in OSO
This note summarizes what I learned while debugging and exploring how Dagster materializes data assets locally using DLT and DuckDB.
---
## 1. How Dagster Materializes Assets
Each Dagster asset (for example, `giveth_qf`) launches a **DLT pipeline**.
DLT fetches data from APIs, normalizes it, and writes it into the destination defined in `.env`.
```bash
DLT_DEFAULT_DESTINATION__TYPE=duckdb
DLT_DEFAULT_DESTINATION__CREDENTIALS=duckdb:///tmp/oso.duckdb
```
Every run creates a folder under `~/.dlt/pipelines/<pipeline_id>` containing:
* `/load/loaded/.../completed_jobs/` – compressed SQL inserts (`.insert_values.gz`)
* `/load/duckdb.duckdb` – a temporary database used by DLT
* `schema.json`, `state.json` – metadata about the run
DLT's load step then executes those batches against the destination, so the finished tables end up in `/tmp/oso.duckdb`, which Dagster treats as the local data lake.
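Conceptually, the asset itself is a thin wrapper around `dlt.pipeline(...).run(...)`. A minimal sketch of that pattern, assuming a hypothetical `qf_rounds` resource and sample row (this is not OSO's actual asset code):
```python
import dlt
from dagster import asset


@dlt.resource(name="qf_rounds", write_disposition="append")
def qf_rounds():
    # Hypothetical fetch step: in OSO this would pull rows from an API.
    yield {"round_id": 1, "matching_pool": 100_000}


@asset
def giveth_qf() -> None:
    # Destination and credentials come from the DLT_DEFAULT_DESTINATION__*
    # environment variables shown above.
    pipeline = dlt.pipeline(
        pipeline_name="giveth",
        destination="duckdb",
        dataset_name="giveth",
    )
    # Fetch -> normalize -> load into the "giveth" schema of the DuckDB file.
    load_info = pipeline.run(qf_rounds())
    print(load_info)
```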
---
## 2. About `.insert_values.gz` Files
The `.insert_values.gz` files are compressed SQL batches generated by DLT.
They are intermediate artifacts, not actual database tables.
If the final load step fails or is skipped, these remain as the only copy of the data.
They can be decompressed and imported manually, but under normal circumstances the pipeline's load step performs that import automatically.
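If you ever need to look inside one of these batches, they are just gzip-compressed SQL text. A sketch (the load path below is a placeholder you would replace with a real run's directory):
```python
import gzip
from pathlib import Path

# Placeholder path: substitute the actual pipeline and load_id directories.
job_file = (
    Path.home()
    / ".dlt/pipelines/giveth/load/loaded/<load_id>/completed_jobs/qf_rounds.insert_values.gz"
)

# Decompress and preview the batched INSERT statements.
with gzip.open(job_file, "rt", encoding="utf-8") as f:
    print(f.read()[:500])
```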
---
## 3. How DLT Uses DuckDB
DLT writes to the file specified in `DLT_DEFAULT_DESTINATION__CREDENTIALS`.
Each pipeline uses its `dataset_name` to create a **schema** in DuckDB:
```yaml
dataset_name: giveth
```
This results in:
```
Schema: giveth
Table: qf_rounds
```
Therefore, the full table name is `giveth.qf_rounds`.
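Assuming the asset has already run, that fully qualified name is what you query, for example from Python:
```python
import duckdb

# Open the local data lake file produced by the pipeline.
con = duckdb.connect("/tmp/oso.duckdb")

# dataset_name became the schema, so the table must be schema-qualified.
print(con.sql("SELECT * FROM giveth.qf_rounds LIMIT 5"))
```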
---
## 4. Viewing Tables and Schemas
* `SHOW TABLES;` lists tables in the current schema only.
* To view everything:
```sql
SELECT table_schema, table_name FROM information_schema.tables;
```
* To switch schemas:
```sql
SET schema = 'giveth';
SHOW TABLES;
```
* To copy a table into the default schema for experimentation:
```sql
CREATE TABLE main.qf_rounds AS SELECT * FROM giveth.qf_rounds;
```
---
## 5. Dagster Instance Types
Dagster uses an **instance** to store metadata such as run history, logs, and events.
* `DAGSTER_INSTANCE_TYPE=ephemeral`
Stores all metadata in `/tmp`. Everything resets when the session ends.
* `DAGSTER_INSTANCE_TYPE=postgres`
Stores metadata persistently in a PostgreSQL database. Recommended for team or production setups.
To enable Postgres:
```bash
DAGSTER_INSTANCE_TYPE=postgres
DAGSTER_POSTGRES_HOST=localhost
DAGSTER_POSTGRES_DB=dagster
DAGSTER_POSTGRES_USER=youruser
DAGSTER_POSTGRES_PASSWORD=yourpass
```
Run a Postgres container or service; Dagster will migrate and store all metadata automatically.
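OSO's own wiring may differ, but a switch on `DAGSTER_INSTANCE_TYPE` could be consumed roughly like this hypothetical helper (`get_instance` is not a real OSO function):
```python
import os

from dagster import DagsterInstance


def get_instance() -> DagsterInstance:
    """Hypothetical helper: choose a Dagster instance from an env var."""
    if os.getenv("DAGSTER_INSTANCE_TYPE", "ephemeral") == "ephemeral":
        # Temporary instance: run history and logs vanish with the process.
        return DagsterInstance.ephemeral()
    # Otherwise load the persistent instance configured under DAGSTER_HOME
    # (e.g. a dagster.yaml that points at the Postgres settings above).
    return DagsterInstance.get()
```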
---
### 5.1 Custom Environment Variables for Local Configuration
The following environment variables are **not present in the default `env.example` file** in OSO,
but can be safely added for local customization or debugging:
```bash
# Controls Dagster’s instance type and persistence
DAGSTER_INSTANCE_TYPE=ephemeral
DAGSTER_POSTGRES_HOST=
DAGSTER_POSTGRES_DB=
DAGSTER_POSTGRES_USER=
DAGSTER_POSTGRES_PASSWORD=
# Local data lake path for DuckDB
DAGSTER_LOCAL_DUCKDB_PATH=/tmp/oso.duckdb
# DLT configuration
DLT_DEFAULT_DESTINATION__TYPE=duckdb
DLT_DEFAULT_DESTINATION__CREDENTIALS=duckdb:///tmp/oso.duckdb
```
Adding these allows you to explicitly control where Dagster and DLT write data locally,
and switch to PostgreSQL or another backend later without changing the code.
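For example, code that needs the same file DLT writes to could resolve it with a small (hypothetical) helper and keep the DLT variables in sync:
```python
import os


def local_duckdb_path() -> str:
    """Hypothetical helper: where the local DuckDB data lake lives."""
    return os.getenv("DAGSTER_LOCAL_DUCKDB_PATH", "/tmp/oso.duckdb")


# Point DLT at the same file if the variables are not already set.
os.environ.setdefault("DLT_DEFAULT_DESTINATION__TYPE", "duckdb")
os.environ.setdefault(
    "DLT_DEFAULT_DESTINATION__CREDENTIALS", f"duckdb://{local_duckdb_path()}"
)
```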
---
## 6. DLT System Tables
DLT creates a few internal tables in every schema:
* `_dlt_loads` – tracks each pipeline load and its status
* `_dlt_version` – tracks schema evolution and version hashes
These are safe to ignore but useful for debugging and lineage tracing.
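To peek at them, a plain query works; `SELECT *` is used here because the exact column set is a DLT internal and can change between versions:
```python
import duckdb

con = duckdb.connect("/tmp/oso.duckdb")

# One row per pipeline load; useful for checking whether a load completed.
print(con.sql("SELECT * FROM giveth._dlt_loads"))
```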
---
## 7. Data Persistence and Overwrites
By default DLT appends new data; it does **not** delete existing tables.
However, it can overwrite them if one of the following is set:
* `full_refresh=True`
* `write_disposition="replace"`
To guarantee append-only behavior:
```python
import dlt

# `data` is whatever resource/source the asset yields (placeholder here).
pipeline = dlt.pipeline(
    pipeline_name="giveth",
    destination="duckdb",
    dataset_name="giveth",
    full_refresh=False,  # avoid dropping and recreating the dataset
)
pipeline.run(data, write_disposition="append")
```
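For contrast, keeping the same (placeholder) `pipeline` and `data`, a replace-style load rebuilds the table on every run:
```python
# Replace mode: each run overwrites the table's contents instead of appending.
pipeline.run(data, write_disposition="replace")
```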
---
## 8. Typical Verification Workflow
1. Materialize an asset from Dagster.
2. Locate `.insert_values.gz` files in `~/.dlt/pipelines/.../completed_jobs`.
3. Confirm the pipeline writes to `/tmp/oso.duckdb`.
4. Inspect the tables in DuckDB:
```bash
duckdb /tmp/oso.duckdb
SHOW TABLES;
```
5. Verify row counts and schema (a quick check from Python is sketched after this list).
6. Optionally back up a table before re-materializing:
```sql
ALTER TABLE giveth.qf_rounds RENAME TO qf_rounds_old;  -- the renamed table stays in the giveth schema
```
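For step 5, a quick row-count and schema check from Python (table name taken from the earlier examples):
```python
import duckdb

con = duckdb.connect("/tmp/oso.duckdb")

# Row count of the freshly loaded table.
print(con.sql("SELECT COUNT(*) AS row_count FROM giveth.qf_rounds").fetchone())

# Column names and types, to confirm the schema DLT inferred.
print(con.sql("DESCRIBE giveth.qf_rounds"))
```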
---
## 9. Key Takeaways
| Concept | Description | Location |
| ------------------- | ------------------------------------- | ------------------------------- |
| Dagster Asset | Orchestrates data ingestion | `warehouse/oso_dagster/assets/` |
| DLT Pipeline | Handles fetch → normalize → load | `~/.dlt/pipelines/` |
| DuckDB | Local data lake | `/tmp/oso.duckdb` |
| Instance Type | Controls Dagster metadata store | `ephemeral` or `postgres` |
| Schema | Namespace created from `dataset_name` | Inside DuckDB |
| `.insert_values.gz` | Intermediate SQL inserts | DLT load folder |
| `_dlt_*` tables | Metadata about loads and versions | Inside schema |
---
### Summary
Dagster triggers DLT to fetch and normalize data, DLT's load step writes the results into a DuckDB file organized by schema, and Dagster treats that file as the local data lake.
Understanding where data lives (`~/.dlt/pipelines` for raw batches, `/tmp/oso.duckdb` for final tables) and how instance types affect persistence allows developers to debug, customize, and extend the OSO data pipelines confidently.