# DBT

## What is dbt?

dbt is a transformation workflow that helps you get more work done while producing higher-quality results.

## Integration with DataChannel

DataChannel provides a runtime dbt environment in which a dbt project can run. Why? Because this integration lets a user schedule their projects/models in sync with their configured pipelines. A user can also use a DC-managed warehouse to run their transformations.

## General Frontend Flow

- Select dbt from the transformations page.
- A project list view will open; this is the main view for editing existing projects or creating a new one.

### Creating or editing a project will open a steps window

#### Step 1: Add a warehouse

- Select a warehouse from the given cards and click the add button.

#### Step 2: Add credentials

Two options to add a credential:

1. GitHub OAuth, using the **repo** scope
2. HTTPS (supported for GitHub, Bitbucket, GitLab): we ask for the user's **personal access token**, which effectively acts as a password

#### Step 3: Project Configuration

- dbt repository settings:
    - Select repository
    - Branch
    - dbt core version
- Advanced settings:
    - Project path: path to the *dbt_project.yml* file in your repo
    - Packages path: path to the *requirements.txt* or *packages.yml* file in your repo (TBD)
    - Parameters for the *args* list (TBD)

#### Step 4: Scheduling

Scheduling will be done at the project and model level:

- All models
- A single/specific model
- Multiple models / a collection of models
- Part of a collection

Click SCHEDULE to set up the schedule. This creates a schedule that runs all the models in your dbt project. You can also create a schedule for an individual model in your dbt project by selecting Set Schedule next to the model name in step 4 above.

Note: Setting a schedule at the project level overrides any previously created schedules for individual models.
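The project/model-level scheduling above has to translate into a concrete dbt invocation at run time. A minimal sketch of that translation, assuming the backend maps the `e_dbt_model_types` values ("All Models" / "Selected Models") onto dbt's standard `--select` node-selection flag — the function name and argument shapes are illustrative, not part of the spec:

```python
def build_dbt_args(model_type, models=None):
    """Translate a project's model_type into an argument list for `dbt run`.

    model_type -- "All Models" or "Selected Models" (the e_dbt_model_types values)
    models     -- names of the configured models, required for "Selected Models"
    """
    args = ["run"]
    if model_type == "Selected Models":
        if not models:
            raise ValueError("Selected Models requires at least one model name")
        # dbt's node selection flag; one argument per selected model
        args += ["--select"] + list(models)
    return args


# "All Models" runs the whole project -> ["run"]
# "Selected Models" narrows the run   -> ["run", "--select", "stg_orders", "fct_sales"]
```

A project-level schedule would call `build_dbt_args("All Models")`, while a single-model schedule passes just that model's name; how a *collection* maps onto this list is still the open question noted below.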
For a collection of models, we still need to work out how multiple models would share the same schedule, and how removing a schedule for one collection or a single model would behave when a project-level schedule exists.

#### Frontend (Screens)

![](https://hackmd.io/_uploads/S1tbNNGS2.jpg)

## General Backend Flow

1. Check out the configured branch and pull the latest changes.
2. Check whether a *virtual environment* already exists; create it if not, then activate it and install all the dependencies.
3. Combine the *args* list passed from the frontend with the backend-driven list and construct the final one.
4. Create the *profiles.yml* file based on the warehouse details. (Need to figure out the various configurations for supported warehouses.)
5. Invoke a Python script that runs the dbt project and captures the logs to be sent to Amazon S3.

### Intricacies

- Should we check which dependencies are already installed locally before installing new ones? The virtual environment's name should be unique and constructed from *account_name*, *repo_name*, *project_name*, and a *UUID* (TBD).
- The Python invocation script should be kept outside the project and used to invoke any project via a path argument.
- Logs should be sent to Amazon S3 to show on the frontend, or streamed in real time (a service to be written).
- Detailed architecture to be discussed.

### DB Design

#### dc_admin_staging Design

![](https://hackmd.io/_uploads/BkgOMybdh.png)

#### dc_logs_staging Design

![](https://hackmd.io/_uploads/HyRBfyZOn.png)

## Table Definition

### dc_admin_staging

- **e_dbt_model_types** table

```sql
CREATE TABLE e_dbt_model_types (
    name VARCHAR(255) PRIMARY KEY
);
```

**Model types:** All Models, Selected Models

- **e_dbt_project_notification_levels** table

```sql
CREATE TABLE e_dbt_project_notification_levels (
    name VARCHAR(255) PRIMARY KEY
);
```

**Notification levels:** SUCCESS, ERROR

- **e_dbt_project_statuses** table

```sql
CREATE TABLE e_dbt_project_statuses (
    name VARCHAR(255) PRIMARY KEY
);
```

**Statuses are:** ACTIVE, PAUSED, DELETED

- **dbt_projects** table

```sql
CREATE TABLE dbt_projects (
    id INT PRIMARY KEY AUTO_INCREMENT,
    account_id INT UNSIGNED NOT NULL,
    data_warehouse_id INT UNSIGNED NOT NULL,
    dbt_credential_id INT NOT NULL,
    created_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP,
    updated_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP ON UPDATE CURRENT_TIMESTAMP,
    model_type VARCHAR(255),
    status VARCHAR(255),
    notification_level VARCHAR(255),
    name VARCHAR(255),
    configuration_json JSON,
    FOREIGN KEY (account_id) REFERENCES accounts(id),
    FOREIGN KEY (data_warehouse_id) REFERENCES data_warehouses(id),
    FOREIGN KEY (dbt_credential_id) REFERENCES dbt_credentials(id),
    FOREIGN KEY (status) REFERENCES e_dbt_project_statuses(name),
    FOREIGN KEY (model_type) REFERENCES e_dbt_model_types(name),
    FOREIGN KEY (notification_level) REFERENCES e_dbt_project_notification_levels(name)
);
```

- **dbt_project_change_history** table

```sql
CREATE TABLE dbt_project_change_history (
    id INT PRIMARY KEY AUTO_INCREMENT,
    dbt_project_id INT NOT NULL,
    created_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP,
    updated_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP ON UPDATE CURRENT_TIMESTAMP,
    property_name VARCHAR(255),
    prev_value TEXT,
    new_value TEXT,
    updated_by INT UNSIGNED NOT NULL,
    FOREIGN KEY (dbt_project_id) REFERENCES dbt_projects(id),
    FOREIGN KEY (updated_by) REFERENCES users(id)
);
```

- **configured_dbt_models** table

```sql
CREATE TABLE configured_dbt_models (
    id INT PRIMARY KEY AUTO_INCREMENT,
    dbt_project_id INT NOT NULL,
    created_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP,
    updated_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP ON UPDATE CURRENT_TIMESTAMP,
    name VARCHAR(255),
    path VARCHAR(2000),
    model_json JSON,
    FOREIGN KEY (dbt_project_id) REFERENCES dbt_projects(id)
);
```

- **dbt_project_schedules** table

```sql
CREATE TABLE dbt_project_schedules (
    id INT PRIMARY KEY AUTO_INCREMENT,
    dbt_project_id INT NOT NULL,
    schedule_option_id INT,
    hour INT DEFAULT NULL,
    day INT DEFAULT NULL,
    week_day VARCHAR(255) DEFAULT NULL,
    schedule_interval INT DEFAULT NULL,
    start_hour INT DEFAULT NULL,
    start_min INT DEFAULT NULL,
    months VARCHAR(255) DEFAULT NULL,
    days VARCHAR(255) DEFAULT NULL,
    created_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP,
    updated_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP ON UPDATE CURRENT_TIMESTAMP,
    hours TEXT,
    cron_string VARCHAR(255),
    FOREIGN KEY (dbt_project_id) REFERENCES dbt_projects(id),
    FOREIGN KEY (schedule_option_id) REFERENCES schedule_options(id)
);
```

- **dbt_credentials** table

```sql
CREATE TABLE dbt_credentials (
    id INT PRIMARY KEY AUTO_INCREMENT,
    account_id INT UNSIGNED NOT NULL,
    created_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP,
    updated_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP ON UPDATE CURRENT_TIMESTAMP,
    name VARCHAR(255),
    credential_json JSON,
    FOREIGN KEY (account_id) REFERENCES accounts(id)
);
```

- **dbt_project_details** table

```sql
CREATE TABLE dbt_project_details (
    id INT PRIMARY KEY AUTO_INCREMENT,
    dbt_credential_id INT NOT NULL,
    repository_id INT NOT NULL,
    branch_name VARCHAR(255) NOT NULL,
    created_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP,
    updated_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP ON UPDATE CURRENT_TIMESTAMP,
    manifest_json JSON,
    FOREIGN KEY (dbt_credential_id) REFERENCES dbt_credentials(id)
);
```

### dc_logs_staging

- **e_log_dbt_project_run_statuses** table

```sql
CREATE TABLE e_log_dbt_project_run_statuses (
    name VARCHAR(255) PRIMARY KEY
);
```

**Statuses are:** RUNNING, SUCCESSFUL, PARTIAL SUCCESS, FAILED, QUEUED

- **e_log_dbt_project_run_error_types** table

```sql
CREATE TABLE e_log_dbt_project_run_error_types (
    name VARCHAR(255) PRIMARY KEY
);
```

- **e_log_dbt_project_run_types** table

```sql
CREATE TABLE e_log_dbt_project_run_types (
    name VARCHAR(255) PRIMARY KEY
);
```

**Types are:** MANUAL, SCHEDULED

- **log_dbt_project_run** table

```sql
CREATE TABLE log_dbt_project_run (
    id INT PRIMARY KEY AUTO_INCREMENT,
    dbt_project_id INT NOT NULL,
    model_type VARCHAR(255),
    run_type VARCHAR(255),
    created_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP,
    updated_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP ON UPDATE CURRENT_TIMESTAMP,
    status VARCHAR(255),
    log_file_path TEXT,
    FOREIGN KEY (run_type) REFERENCES e_log_dbt_project_run_types(name),
    FOREIGN KEY (status) REFERENCES e_log_dbt_project_run_statuses(name)
);
```

- **log_dbt_project_run_errors** table

```sql
CREATE TABLE log_dbt_project_run_errors (
    id INT PRIMARY KEY AUTO_INCREMENT,
    log_dbt_project_run_id INT NOT NULL,
    err_type VARCHAR(255),
    created_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP,
    updated_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP ON UPDATE CURRENT_TIMESTAMP,
    traceback TEXT,
    FOREIGN KEY (log_dbt_project_run_id) REFERENCES log_dbt_project_run(id),
    FOREIGN KEY (err_type) REFERENCES e_log_dbt_project_run_error_types(name)
);
```

## API Development

### FrontendAPI

#### Requirements

- Endpoints for ADD, UPDATE or PATCH, and DELETE operations on all dbt-related tables.
- An async job that runs while configuring a project or when a user triggers the refresh button:
    - Clone the repository
    - Check out the specified branch
    - Execute the **dbt compile** command (this command will be executed in a common environment where we just need access to the dbt commands).
This is done to generate a **manifest.json** file.

- Save the generated manifest.json to the **dbt_project_details** table.
- Return a list of models.

### MetaAPI

Required for creating the logging endpoints:

- An endpoint to add an entry in the database and update its status based on a run's outcome.
- Insert a new row in case of any error during each run.
- An endpoint to upload a log file to Amazon S3 and return its URL. Alternatively, this could be written in the backend code. TBD.
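To tie the backend flow together, here is a minimal sketch of the run orchestration pieces the document describes: the unique virtual-environment name from the "Intricacies" section, the ordered commands for a single run, and the S3 key that would be stored in `log_dbt_project_run.log_file_path`. The function names, path layout, and exact command shapes are assumptions for illustration (only `--project-dir` is a standard dbt flag); nothing here is the final design.

```python
import uuid
from datetime import datetime, timezone


def venv_name(account_name, repo_name, project_name):
    # Unique virtual-environment name built from account_name, repo_name,
    # project_name, and a UUID, as proposed in the "Intricacies" section.
    return f"{account_name}_{repo_name}_{project_name}_{uuid.uuid4().hex}"


def run_commands(branch, venv_path, project_path, extra_args=()):
    # Ordered commands for one project run: check out and pull the branch,
    # install dependencies into the project's virtual environment, then run
    # dbt with the combined frontend/backend args list.
    return [
        ["git", "checkout", branch],
        ["git", "pull", "origin", branch],
        [f"{venv_path}/bin/pip", "install", "-r", "requirements.txt"],
        [f"{venv_path}/bin/dbt", "run", "--project-dir", project_path, *extra_args],
    ]


def log_file_key(account_id, project_id, run_id):
    # Hypothetical S3 object key for the captured run logs; one log object
    # per run, addressable from the log_dbt_project_run row.
    ts = datetime.now(timezone.utc).strftime("%Y%m%dT%H%M%SZ")
    return f"dbt-logs/{account_id}/{project_id}/{run_id}/{ts}.log"
```

The invocation script kept "outside" the projects would loop over `run_commands(...)` with `subprocess.run`, streaming stdout/stderr into the file that is then uploaded under `log_file_key(...)`; whether the upload lives in the MetaAPI or the backend code is the open TBD above.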