# Practices
https://github.com/modern-de/dbt_project_1
https://github.com/modern-de/jaffle_shop_duckdb
https://learn.getdbt.com/courses/dbt-fundamentals
# What is dbt?
A transformation workflow: modularize and centralize your analytics code, collaborate on data models and version them, test and document your queries, and deploy to production with monitoring and visibility.
dbt compiles and runs your analytics code against your data platform.

# The power of dbt
As a dbt user, your main focus will be on writing **models** (***select queries***) that reflect core business logic – there’s no need to write boilerplate code to create tables and views, or to define the order of execution of your models. Instead, dbt handles turning these models into objects in your warehouse for you.
**Handle boilerplate code to materialize queries as relations:** For each model you create, you can easily configure a ***materialization***. A materialization represents a build strategy for your select query – the code behind a materialization is robust, boilerplate SQL that wraps your select query in a statement to ***create a new, or update an existing, relation***.
**Use a code compiler:** SQL files can contain ***Jinja***, a lightweight templating language. Using Jinja in SQL provides a way to use ***control structures*** (for example, if statements and for loops) in your queries. It also enables *repeated SQL* to be shared through ***macros***.
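For example (model, macro, and column names here are illustrative, not from this project), a Jinja for loop can generate one column per payment method, and a macro lets repeated SQL live in one place:
```
-- models/order_payments.sql: pivot payment amounts with a Jinja for loop
{% set payment_methods = ['credit_card', 'coupon', 'bank_transfer', 'gift_card'] %}

select
    order_id,
    {% for method in payment_methods %}
    sum(case when payment_method = '{{ method }}' then amount else 0 end) as {{ method }}_amount
    {%- if not loop.last %},{% endif %}
    {% endfor %}
from {{ ref('stg_payments') }}
group by 1
```
```
-- macros/cents_to_dollars.sql: repeated SQL shared as a macro
{% macro cents_to_dollars(column_name) %}
    ({{ column_name }} / 100)::numeric(16, 2)
{% endmacro %}
```
A model can then call `{{ cents_to_dollars('amount') }}` wherever that conversion is needed.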
**Determine the order of model execution:** Often, when transforming data, it makes sense to do so in a ***staged approach***. dbt provides a mechanism to implement transformations in stages through the ***ref*** function. Rather than selecting from existing tables and views in your warehouse, you can select from another model.
**Document your dbt project:** dbt provides a mechanism to write, version-control, and share documentation for your dbt models. You can ***write descriptions*** (in plain text or markdown) for each ***model and field***.
**Test your models:** Tests provide a way to improve the integrity of the SQL in each model by making ***assertions*** about the ***results*** generated by a model.
**Manage packages:** dbt ships with a package manager, which allows analysts to use and publish both public and private repositories of dbt code which can then be referenced by others.
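A minimal sketch: packages are declared in a `packages.yml` at the project root and installed with `dbt deps` (the package and version shown are illustrative; pin whichever release is current):
```
# packages.yml
packages:
  - package: dbt-labs/dbt_utils
    version: 1.1.1
```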
**Load seed files:** Often in analytics, raw values need to be mapped to a more readable value (for example, converting a country code to a country name) or enriched with static or infrequently changing data. These data sources, known as seeds, can be saved as CSV files in your project and loaded into your data warehouse with the `dbt seed` command.
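A minimal sketch (file name and values are illustrative), seeds/country_codes.csv:
```
country_code,country_name
US,United States
FR,France
```
After running `dbt seed`, other models can reference the seed like any model, e.g. `select * from {{ ref('country_codes') }}`.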
**Snapshot data:** Often, records in a data source are mutable, in that they change over time. This can be difficult to handle in analytics if you want to reconstruct historic values. dbt provides a mechanism to snapshot raw data for a point in time, through use of snapshots.
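A minimal snapshot sketch (table, column, and target schema names are illustrative; recent dbt versions also let you define snapshots in YAML):
```
-- snapshots/orders_snapshot.sql
{% snapshot orders_snapshot %}

{{
    config(
        target_schema='snapshots',
        unique_key='order_id',
        strategy='timestamp',
        updated_at='updated_at'
    )
}}

select * from jaffle_shop.orders

{% endsnapshot %}
```
Running `dbt snapshot` records how each row looked at that point in time, so historic values can be reconstructed later.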
# dbt Cloud vs dbt Core
* dbt Cloud: cloud managed service (ref https://docs.getdbt.com/docs/cloud/about-cloud/dbt-cloud-features)
* dbt Core: free, open-source command-line tool that can be installed locally in your environment.
# Setup with dbt Core
Project structure:
```
my_dbt_project/
├── dbt_project.yml
├── profiles.yml
├── models/
│   └── my_model.sql
└── seeds/
    └── my_seed.csv
```
dbt_project.yml:
```
name: my_dbt_project
version: '1.0'
config-version: 2
profile: my_redshift_profile

model-paths: ["models"]   # called "source-paths" in older dbt versions
seed-paths: ["seeds"]
analysis-paths: ["analysis"]
test-paths: ["tests"]
macro-paths: ["macros"]

target-path: "target"
clean-targets:
  - "target"
  - "dbt_packages"

models:
  my_dbt_project:
    +materialized: view  # or table, incremental, etc.
```

Ref: https://docs.getdbt.com/docs/build/projects
profiles.yml: connection details for your target databases (targets).
targets: separate environments (development, staging, production, etc.), each pointing to its own cluster, database, or schema.
```
my_redshift_profile:
  target: dev
  outputs:
    dev:
      type: redshift
      host: your-redshift-cluster.amazonaws.com
      user: username
      password: password1
      dbname: analytics
      schema: analytics
      port: 5439
      # Optional Redshift configs:
      sslmode: prefer
      role: null
      ra3_node: true
      autocommit: true
      threads: 4
      connect_timeout: null
```
Install dbt Core (recommended inside a Python virtual environment):
```
# create and activate a virtual environment
python3 -m venv dbt-env
source dbt-env/bin/activate

# dbt core and adapters
python3 -m pip install dbt-core dbt-postgres dbt-redshift

# check
dbt --version
```
# Build your DAG
dbt handles the transformation step, the 'T' in ELT. When you execute `dbt run`, you are running models that transform your data ***without that data ever leaving your warehouse.***

Models are written as ***select*** statements and saved as ***.sql*** files. Each model is a single file containing one final select statement; a project can have **multiple models**, and models can ***reference each other***.
## SQL models
A SQL model is a select statement. Models are defined in .sql files (typically in your models directory):
* Each .sql file contains one model / select statement
* The model name is inherited from the filename.
* We strongly recommend using underscores for model names, not dots. For example, use models/my_model.sql instead of models/my.model.sql.
* Models can be nested in subdirectories within the models directory.
Example model `models/customers.sql`:
```
with customer_orders as (
    select
        customer_id,
        min(order_date) as first_order_date,
        max(order_date) as most_recent_order_date,
        count(order_id) as number_of_orders
    from jaffle_shop.orders
    group by 1
)

select
    customers.customer_id,
    customers.first_name,
    customers.last_name,
    customer_orders.first_order_date,
    customer_orders.most_recent_order_date,
    coalesce(customer_orders.number_of_orders, 0) as number_of_orders
from jaffle_shop.customers
left join customer_orders using (customer_id)
```
When you execute `dbt run`, dbt will build this as a view named customers in your target schema:
```
create view dbt_alice.customers as (
    with customer_orders as (
        select
            customer_id,
            min(order_date) as first_order_date,
            max(order_date) as most_recent_order_date,
            count(order_id) as number_of_orders
        from jaffle_shop.orders
        group by 1
    )

    select
        customers.customer_id,
        customers.first_name,
        customers.last_name,
        customer_orders.first_order_date,
        customer_orders.most_recent_order_date,
        coalesce(customer_orders.number_of_orders, 0) as number_of_orders
    from jaffle_shop.customers
    left join customer_orders using (customer_id)
)
```
By default dbt will:
* Create models as views
* Build models in a target schema you define
* Use your file name as the view or table name in the database
## Configuring models
Configurations are "model settings" that can be set in your dbt_project.yml file, and in your model file using a config block.
dbt_project.yml:
```
name: jaffle_shop
config-version: 2
...

models:
  jaffle_shop: # this matches the `name:` config
    +materialized: view # this applies to all models in the current project
    marts:
      +materialized: table # this applies to all models in the `marts/` directory
      marketing:
        +schema: marketing # this applies to all models in the `marts/marketing/` directory
```
models/customers.sql:
```
{{ config(
    materialized="view",
    schema="marketing"
) }}

with customer_orders as ...
```
## Dependencies between models
You can build dependencies between models by using the `ref` function in place of table names in a query. Use the name of another model as the argument for ref.
```
with customers as (
    select * from {{ ref('stg_customers') }}
),

orders as (
    select * from {{ ref('stg_orders') }}
),
...
```
Compiled code:
```
create view dbt_alice.customers as (
    with customers as (
        select * from dbt_alice.stg_customers
    ),

    orders as (
        select * from dbt_alice.stg_orders
    ),
    ...
)
```
dbt determines the order in which to run the models by building a directed acyclic graph (DAG) from these `ref` dependencies.
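Node selection lets you run just a slice of that DAG (the model name is illustrative; `--select` and the `+` graph operators are standard dbt CLI syntax):
```
# run all models in dependency order
dbt run

# run customers plus everything downstream of it
dbt run --select customers+

# run customers plus all of its upstream dependencies
dbt run --select +customers
```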

# Enhancements
## Materializations
Materializations are ***strategies for persisting dbt models in a warehouse***. There are five types of materializations built into dbt.
* view: (default) your model is rebuilt as a view on each run, via a `create view as` statement.
* table: your model is rebuilt as a table on each run, via a `create table as` statement.
* incremental: allow dbt to `insert or update records` into a table since the last time that model was run.
* ephemeral: models are not directly built into the database. Instead, dbt will interpolate the code from an ephemeral model into its dependent models using a `common table expression (CTE)`
* materialized view: allows the creation and maintenance of materialized views in the target database (advanced, consult your data platform's docs for details)

***staging*** layer: views
***mart*** layer: table, incremental
**Golden Rule of Materializations:** start with models as views; when they take too long to query, make them tables; when the tables take too long to build, make them incremental.
https://docs.getdbt.com/docs/build/materializations
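A sketch of the ephemeral behavior (model and source names are illustrative; the exact CTE alias dbt generates varies by version): an ephemeral model never becomes a relation in the database, its SQL is inlined into whatever selects from it.
```
-- models/staging/stg_payments.sql
{{ config(materialized='ephemeral') }}

select
    id as payment_id,
    order_id,
    amount
from raw.payments
```
A downstream `select * from {{ ref('stg_payments') }}` compiles with that query injected as a CTE rather than a reference to a view or table.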
## Incremental models
A materialization strategy designed to efficiently update your data warehouse tables by `only transforming` and `loading new or changed data` since the last run.
Incremental models are built as tables in your data warehouse. The first time a model is run, the table is built by transforming all rows of source data. On subsequent runs, dbt transforms only the rows in your source data that you tell dbt to filter for, inserting them into the target table which is the table that has already been built.
```
{{
    config(
        materialized='incremental'
    )
}}

select ...
```
you also need to tell dbt:
* How to filter the rows on an incremental run
* The unique key of the model (if any)
```
{{
    config(
        materialized='incremental',
        unique_key='order_id'
    )
}}

select * from orders

{% if is_incremental() %}
-- this filter is only applied on incremental runs
where updated_at > (select max(updated_at) from {{ this }})
{% endif %}
```
**unique key (optional)**
A unique_key enables `updating existing rows instead of just appending new rows`. If new information arrives for an existing unique_key, that new information can replace the current information instead of being appended to the table. If a duplicate row arrives, it can be ignored.
Not specifying a unique_key will result in `append-only` behavior, which means dbt inserts all rows returned by the model's SQL into the preexisting target table without regard for whether the rows represent duplicates.
The incremental strategy (e.g. append, merge, insert_overwrite; availability depends on your adapter) can also be configured:
```
models:
  +incremental_strategy: "insert_overwrite"
```

https://docs.getdbt.com/docs/build/incremental-models-overview
## Code
https://docs.getdbt.com/docs/build/enhance-your-code
https://docs.getdbt.com/docs/build/project-variables
https://docs.getdbt.com/docs/build/hooks-operations
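A small sketch of the two linked features (variable name, value, and grant statement are illustrative): a project variable defined in dbt_project.yml and read with `var()`, plus a post-hook that runs after each model is built:
```
# dbt_project.yml
vars:
  start_date: '2024-01-01'

models:
  my_dbt_project:
    +post-hook: "grant select on {{ this }} to group reporting"
```
Inside a model the variable is read with Jinja, e.g. `where order_date >= '{{ var("start_date") }}'`.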
# Deploy your models
**dbt & Airflow:** a common pattern is to orchestrate dbt runs from Airflow, e.g. with Astronomer Cosmos:

https://github.com/astronomer/astronomer-cosmos?tab=readme-ov-file
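Whatever the orchestrator (Airflow + Cosmos, plain cron, etc.), a scheduled production run usually reduces to a short command sequence, sketched here assuming a `prod` target exists in profiles.yml:
```
dbt deps                  # install packages
dbt build --target prod   # run and test models, seeds, and snapshots in DAG order
```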
# Data tests
Data tests are assertions you make about your models and other resources in your dbt project (e.g. sources, seeds and snapshots).
https://docs.getdbt.com/docs/build/data-tests
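A minimal sketch using dbt's built-in generic tests (model and column names are illustrative; older dbt versions use the key `tests:` instead of `data_tests:`):
```
# models/schema.yml
version: 2

models:
  - name: customers
    columns:
      - name: customer_id
        data_tests:
          - unique
          - not_null
  - name: orders
    columns:
      - name: status
        data_tests:
          - accepted_values:
              values: ['placed', 'shipped', 'completed', 'returned']
      - name: customer_id
        data_tests:
          - relationships:
              to: ref('customers')
              field: customer_id
```
Run them with `dbt test`, or as part of `dbt build`.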
# Ref
https://docs.getdbt.com/docs/core/about-core-setup
https://popsql.com/learn-dbt/dbt-build
https://github.com/dbt-labs/jaffle_shop_duckdb
https://docs.getdbt.com/best-practices