# Understanding Dagster + DLT + DuckDB in OSO

This note summarizes what I learned while debugging and exploring how Dagster materializes data assets locally using DLT and DuckDB.

---

## 1. How Dagster Materializes Assets

Each Dagster asset (for example, `giveth_qf`) launches a **DLT pipeline**. DLT fetches data from APIs, normalizes it, and writes it into the destination defined in `.env`:

```bash
DLT_DEFAULT_DESTINATION__TYPE=duckdb
DLT_DEFAULT_DESTINATION__CREDENTIALS=duckdb:///tmp/oso.duckdb
```

Every run creates a folder under `~/.dlt/pipelines/<pipeline_id>` containing:

* `/load/loaded/.../completed_jobs/` – compressed SQL inserts (`.insert_values.gz`)
* `/load/duckdb.duckdb` – a temporary database used by DLT
* `schema.json`, `state.json` – metadata about the run

Dagster then reads the data produced by DLT and writes the final version into `/tmp/oso.duckdb`.

---

## 2. About `.insert_values.gz` Files

The `.insert_values.gz` files are compressed SQL batches generated by DLT. They are intermediate artifacts, not actual database tables. If the final load step fails or is skipped, these remain as the only copy of the data. They can be decompressed and imported manually, but under normal circumstances Dagster performs that step automatically.

---

## 3. How DLT Uses DuckDB

DLT writes to the file specified in `DLT_DEFAULT_DESTINATION__CREDENTIALS`. Each pipeline uses its `dataset_name` to create a **schema** in DuckDB:

```yaml
dataset_name: giveth
```

This results in:

```
Schema: giveth
Table:  qf_rounds
```

Therefore, the full table name is `giveth.qf_rounds`.

---

## 4. Viewing Tables and Schemas

* `SHOW TABLES;` lists tables in the current schema only.
* To view everything:

  ```sql
  SELECT table_schema, table_name FROM information_schema.tables;
  ```

* To switch schemas:

  ```sql
  SET schema = 'giveth';
  SHOW TABLES;
  ```

* To copy a table into the default schema for experimentation:

  ```sql
  CREATE TABLE main.qf_rounds AS SELECT * FROM giveth.qf_rounds;
  ```

---

## 5. Dagster Instance Types

Dagster uses an **instance** to store metadata such as run history, logs, and events.

* `DAGSTER_INSTANCE_TYPE=ephemeral`
  Stores all metadata in `/tmp`. Everything resets when the session ends.
* `DAGSTER_INSTANCE_TYPE=postgres`
  Stores metadata persistently in a PostgreSQL database. Recommended for team or production setups.

To enable Postgres:

```bash
DAGSTER_INSTANCE_TYPE=postgres
DAGSTER_POSTGRES_HOST=localhost
DAGSTER_POSTGRES_DB=dagster
DAGSTER_POSTGRES_USER=youruser
DAGSTER_POSTGRES_PASSWORD=yourpass
```

Run a Postgres container or service; Dagster will migrate and store all metadata automatically.

---

### 5.1 Custom Environment Variables for Local Configuration

The following environment variables are **not present in the default `env.example` file** in OSO, but can be safely added for local customization or debugging:

```bash
# Controls Dagster's instance type and persistence
DAGSTER_INSTANCE_TYPE=ephemeral
DAGSTER_POSTGRES_HOST=
DAGSTER_POSTGRES_DB=
DAGSTER_POSTGRES_USER=
DAGSTER_POSTGRES_PASSWORD=

# Local data lake path for DuckDB
DAGSTER_LOCAL_DUCKDB_PATH=/tmp/oso.duckdb

# DLT configuration
DLT_DEFAULT_DESTINATION__TYPE=duckdb
DLT_DEFAULT_DESTINATION__CREDENTIALS=duckdb:///tmp/oso.duckdb
```

Adding these allows you to explicitly control where Dagster and DLT write data locally, and to switch to PostgreSQL or another backend later without changing the code.
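As a rough sketch of how these variables could be consumed locally (this is not OSO's actual wiring; `DAGSTER_LOCAL_DUCKDB_PATH` is the custom variable introduced above, and the sample rows are placeholders), a small standalone dlt pipeline driven by the same path might look like this:

```python
import os

import dlt

# Resolve the local DuckDB path from the custom variable above,
# falling back to the default path used throughout this note.
duckdb_path = os.environ.get("DAGSTER_LOCAL_DUCKDB_PATH", "/tmp/oso.duckdb")

# dataset_name becomes the DuckDB schema (see section 3).
pipeline = dlt.pipeline(
    pipeline_name="giveth",
    destination=dlt.destinations.duckdb(duckdb_path),
    dataset_name="giveth",
)

# Placeholder rows standing in for whatever the API source actually yields.
data = [{"round_id": 1, "name": "example round"}]

load_info = pipeline.run(data, table_name="qf_rounds", write_disposition="append")
print(load_info)
```

Running this writes `giveth.qf_rounds` into whichever file the environment points at, which makes it easy to confirm that the variables are being picked up before involving Dagster.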
---

## 6. DLT System Tables

DLT creates a few internal tables in every schema:

* `_dlt_loads` – tracks each pipeline load and its status
* `_dlt_version` – tracks schema evolution and version hashes

These are safe to ignore but useful for debugging and lineage tracing.

---

## 7. Data Persistence and Overwrites

By default DLT appends new data; it does **not** delete existing tables. However, it can overwrite them if one of the following is set:

* `full_refresh=True`
* `write_disposition="replace"`

To guarantee append-only behavior:

```python
import dlt

# full_refresh=False (the default) keeps existing tables between runs.
pipeline = dlt.pipeline(
    pipeline_name="giveth",
    destination="duckdb",
    dataset_name="giveth",
    full_refresh=False,
)

# write_disposition="append" adds new rows instead of replacing the table;
# `data` stands in for whatever the source yields.
pipeline.run(data, write_disposition="append")
```

---

## 8. Typical Verification Workflow

1. Materialize an asset from Dagster.
2. Locate `.insert_values.gz` files in `~/.dlt/pipelines/.../completed_jobs`.
3. Confirm the pipeline writes to `/tmp/oso.duckdb`.
4. Inspect the tables in DuckDB:

   ```bash
   duckdb /tmp/oso.duckdb
   ```

   ```sql
   SHOW ALL TABLES;
   ```

5. Verify row counts and schema.
6. Optionally back up a table before re-materializing:

   ```sql
   ALTER TABLE giveth.qf_rounds RENAME TO qf_rounds_old;
   ```

---

## 9. Key Takeaways

| Concept             | Description                           | Location                        |
| ------------------- | ------------------------------------- | ------------------------------- |
| Dagster Asset       | Orchestrates data ingestion           | `warehouse/oso_dagster/assets/` |
| DLT Pipeline        | Handles fetch → normalize → load      | `~/.dlt/pipelines/`             |
| DuckDB              | Local data lake                       | `/tmp/oso.duckdb`               |
| Instance Type       | Controls Dagster metadata store       | `ephemeral` or `postgres`       |
| Schema              | Namespace created from `dataset_name` | Inside DuckDB                   |
| `.insert_values.gz` | Intermediate SQL inserts              | DLT load folder                 |
| `_dlt_*` tables     | Metadata about loads and versions     | Inside schema                   |

---

### Summary

Dagster triggers DLT to fetch and normalize data, DLT writes the results into a DuckDB file organized by schema, and Dagster treats that file as the local data lake. Understanding where data lives (`~/.dlt/pipelines` for raw batches, `/tmp/oso.duckdb` for final tables) and how instance types affect persistence lets developers debug, customize, and extend the OSO data pipelines with confidence.
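---

### Appendix: Scripted Verification (Sketch)

The verification workflow in section 8 can also be scripted. Below is a minimal sketch using the `duckdb` Python client; the table names assume the `giveth` example above, and the `_dlt_loads` columns shown are the usual DLT ones, so confirm them against your local schema.

```python
import duckdb

# Open the local data lake written by the pipelines above.
con = duckdb.connect("/tmp/oso.duckdb")

# Mirror the information_schema query from section 4: list every schema/table pair.
print(con.execute(
    "SELECT table_schema, table_name FROM information_schema.tables ORDER BY 1, 2"
).fetchall())

# Spot-check the example table and DLT's load ledger.
print(con.execute("SELECT COUNT(*) FROM giveth.qf_rounds").fetchone())
print(con.execute("SELECT load_id, status FROM giveth._dlt_loads ORDER BY load_id").fetchall())

con.close()
```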