# dbt
## What is dbt?
dbt is a transformation workflow tool: it lets teams define transformations as version-controlled SQL models, helping them get more work done while producing higher-quality results.
## Integration with DataChannel
DataChannel provides a runtime dbt environment in which a dbt project can run.
Why?
DataChannel's integration with dbt lets a user schedule their projects/models in sync with their configured pipelines.
A user can also use a DC-managed warehouse to run their transformations.
## General Frontend Flow
- Select dbt from the transformations page
- A project list view opens; this is the main view for editing existing projects or creating a new one.
### Creating or editing a project opens a step-by-step window
#### Step 1: Add a warehouse
- Select a warehouse from the given cards and click the Add button.
#### Step 2: Add credentials
There are two ways to add a credential:
1. GitHub OAuth
   - Uses the **repo** scope
2. HTTPS (supported for GitHub, Bitbucket, GitLab)
   - We ask for the user's **personal access token**, which acts as the password (see the sketch below)
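For the HTTPS option, the token simply takes the place of the password in the clone URL. A minimal sketch, assuming nothing about DataChannel's actual code (the host and helper name are illustrative):

```python
def https_clone_url(host: str, org: str, repo: str,
                    username: str, token: str) -> str:
    # The personal access token is used as the HTTPS password,
    # e.g. https://user:token@github.com/org/repo.git
    return f"https://{username}:{token}@{host}/{org}/{repo}.git"
```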
#### Step 3: Project Configuration
- dbt repository settings:
  - Select repository
  - Branch
  - dbt core version
- Advanced settings
  - Project Path: path to the *dbt_project.yml* file in your repo
  - Packages Path: path to the *requirements.txt* or *packages.yml* file in your repo (TBD)
  - Parameters for the *args* list (TBD); an example payload follows
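For reference, one plausible shape for the `configuration_json` stored per project (see the `dbt_projects` table below). The keys mirror the fields in this step but are assumptions, not a finalized schema:

```python
# Hypothetical configuration_json payload for a dbt project.
project_configuration = {
    "repository": "https://github.com/acme/analytics.git",  # example repo
    "branch": "main",
    "dbt_core_version": "1.5.0",
    "project_path": "dbt/dbt_project.yml",
    "packages_path": "dbt/packages.yml",  # TBD in this spec
    "args": ["--fail-fast"],              # TBD in this spec
}
```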
#### Step 4: Scheduling
Scheduling will be done at both the project and the model level:
- All models
- A single/specific model
- Multiple models (a collection of models)

Click SCHEDULE to create a schedule that runs all the models in your dbt project. You can also schedule an individual model by selecting Set Schedule next to the model name in this step.
Note: Setting a schedule at the project level overrides any previously created schedules for individual models.
For a collection of models, we still need to decide how multiple models share the same schedule, and how removing a schedule for a collection or a single model should behave when a project-level schedule exists. (A selection sketch follows.)
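Whatever the final answer, each scheduling scope ultimately has to map to a dbt model selection when the run is triggered. A small sketch of that mapping (the function name is hypothetical; `--select` is standard dbt CLI syntax):

```python
def selection_args(model_type: str, models: list) -> list:
    # "All Models" and "Selected Models" mirror the e_dbt_model_types values.
    # An empty list means dbt runs the whole project; otherwise we narrow the
    # run with --select, e.g. `dbt run --select model_a model_b`.
    if model_type == "All Models":
        return []
    return ["--select"] + models
```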
#### Frontend (Screens)

## General Backend Flow
1. Check out and pull the specified branch.
2. Check whether a *virtual environment* already exists (create it if not), then activate it and install all the dependencies.
3. Combine the *args* list passed from the frontend with the backend-driven list to construct the final argument list.
4. Create a *profiles.yml* file from the warehouse details (the exact configurations for each supported warehouse still need to be worked out).
5. Invoke a Python script to run the dbt project and capture the logs to be sent to Amazon S3.
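A minimal sketch of steps 1-5 above, assuming a POSIX venv layout; `run_dbt_project`, `warehouse_profile`, and `extra_args` are illustrative names rather than actual DataChannel code:

```python
import subprocess
from pathlib import Path

import yaml  # PyYAML, used to render profiles.yml


def run_dbt_project(repo_dir: Path, branch: str, venv_dir: Path,
                    warehouse_profile: dict, extra_args: list) -> str:
    # 1. Check out and pull the requested branch.
    subprocess.run(["git", "-C", str(repo_dir), "checkout", branch], check=True)
    subprocess.run(["git", "-C", str(repo_dir), "pull"], check=True)

    # 2. Create the venv if missing, then install the project dependencies.
    #    Calling the venv's own binaries is equivalent to "activating" it.
    if not venv_dir.exists():
        subprocess.run(["python", "-m", "venv", str(venv_dir)], check=True)
    pip = venv_dir / "bin" / "pip"  # POSIX layout assumed
    subprocess.run([str(pip), "install", "-r",
                    str(repo_dir / "requirements.txt")], check=True)

    # 3. Merge the frontend-supplied args with backend-driven defaults.
    args = ["run", "--project-dir", str(repo_dir),
            "--profiles-dir", str(repo_dir)] + extra_args

    # 4. Render profiles.yml from the stored warehouse details.
    (repo_dir / "profiles.yml").write_text(yaml.safe_dump(warehouse_profile))

    # 5. Run dbt from the venv and return the captured logs for S3 upload.
    dbt = venv_dir / "bin" / "dbt"
    result = subprocess.run([str(dbt)] + args, capture_output=True, text=True)
    return result.stdout + result.stderr
```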
### Intricacies
- Do we check which dependencies are already installed locally before installing new ones? The virtual environment's name should be unique, constructed from *account_name*, *repo_name*, *project_name*, and a *UUID* (TBD); a naming sketch follows this list.
- The Python invocation script should be kept outside the project and will be used to invoke a project with a path argument.
- Logs should be sent to Amazon S3 so they can be shown on the frontend, or streamed in real time (a service to be written). Detailed architecture to be discussed.
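If the unique-name rule above is adopted, the construction could be as simple as the following sketch (the separator is an assumption):

```python
import uuid


def venv_name(account_name: str, repo_name: str, project_name: str) -> str:
    # Unique per project: account + repo + project + a random UUID suffix.
    return f"{account_name}-{repo_name}-{project_name}-{uuid.uuid4().hex}"
```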
### DB Design
#### dc_admin_staging Design

#### dc_logs_staging Design

## Table Definition
### dc_admin_staging
- **e_dbt_model_types** table
```sql
CREATE TABLE e_dbt_model_types (
  name VARCHAR(255) PRIMARY KEY
);
```
**Model types:** All Models, Selected Models
- **e_dbt_project_notification_levels** table
```sql
CREATE TABLE e_dbt_project_notification_levels (
  name VARCHAR(255) PRIMARY KEY
);
```
**Notification levels:** SUCCESS, ERROR
- **e_dbt_project_statuses** table
```sql
CREATE TABLE e_dbt_project_statuses (
  name VARCHAR(255) PRIMARY KEY
);
```
**Statuses are:** ACTIVE, PAUSED, DELETED
- **dbt_projects** table
```sql
CREATE TABLE dbt_projects (
  id INT PRIMARY KEY AUTO_INCREMENT,
  account_id INT UNSIGNED NOT NULL,
  data_warehouse_id INT UNSIGNED NOT NULL,
  dbt_credential_id INT NOT NULL,
  created_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP,
  updated_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP ON UPDATE CURRENT_TIMESTAMP,
  model_type VARCHAR(255),
  status VARCHAR(255),
  notification_level VARCHAR(255),
  name VARCHAR(255),
  configuration_json JSON,
  FOREIGN KEY (account_id) REFERENCES accounts(id),
  FOREIGN KEY (data_warehouse_id) REFERENCES data_warehouses(id),
  FOREIGN KEY (dbt_credential_id) REFERENCES dbt_credentials(id),
  FOREIGN KEY (status) REFERENCES e_dbt_project_statuses(name),
  FOREIGN KEY (model_type) REFERENCES e_dbt_model_types(name),
  FOREIGN KEY (notification_level) REFERENCES e_dbt_project_notification_levels(name)
);
```
- **dbt_project_change_history** table
```sql
CREATE TABLE dbt_project_change_history (
  id INT PRIMARY KEY AUTO_INCREMENT,
  dbt_project_id INT NOT NULL,
  created_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP,
  updated_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP ON UPDATE CURRENT_TIMESTAMP,
  property_name VARCHAR(255),
  prev_value TEXT,
  new_value TEXT,
  updated_by INT UNSIGNED NOT NULL,
  FOREIGN KEY (dbt_project_id) REFERENCES dbt_projects(id),
  FOREIGN KEY (updated_by) REFERENCES users(id)
);
```
- **configured_dbt_models** table
```sql
CREATE TABLE configured_dbt_models (
  id INT PRIMARY KEY AUTO_INCREMENT,
  dbt_project_id INT NOT NULL,
  created_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP,
  updated_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP ON UPDATE CURRENT_TIMESTAMP,
  name VARCHAR(255),
  path VARCHAR(2000),
  model_json JSON,
  FOREIGN KEY (dbt_project_id) REFERENCES dbt_projects(id)
);
```
- **dbt_project_schedules** table
```sql
CREATE TABLE dbt_project_schedules (
  id INT PRIMARY KEY AUTO_INCREMENT,
  dbt_project_id INT NOT NULL,
  schedule_option_id INT,
  hour INT DEFAULT NULL,
  day INT DEFAULT NULL,
  week_day VARCHAR(255) DEFAULT NULL,
  schedule_interval INT DEFAULT NULL,
  start_hour INT DEFAULT NULL,
  start_min INT DEFAULT NULL,
  months VARCHAR(255) DEFAULT NULL,
  days VARCHAR(255) DEFAULT NULL,
  created_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP,
  updated_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP ON UPDATE CURRENT_TIMESTAMP,
  hours TEXT,
  cron_string VARCHAR(255),
  FOREIGN KEY (dbt_project_id) REFERENCES dbt_projects(id),
  FOREIGN KEY (schedule_option_id) REFERENCES schedule_options(id)
);
```
- **dbt_credentials** table
```sql
CREATE TABLE dbt_credentials (
  id INT PRIMARY KEY AUTO_INCREMENT,
  account_id INT UNSIGNED NOT NULL,
  created_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP,
  updated_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP ON UPDATE CURRENT_TIMESTAMP,
  name VARCHAR(255),
  credential_json JSON,
  FOREIGN KEY (account_id) REFERENCES accounts(id)
);
```
- **dbt_project_details** table
```sql
CREATE TABLE dbt_project_details (
  id INT PRIMARY KEY AUTO_INCREMENT,
  dbt_credential_id INT NOT NULL,
  repository_id INT NOT NULL,
  branch_name VARCHAR(255) NOT NULL,
  created_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP,
  updated_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP ON UPDATE CURRENT_TIMESTAMP,
  manifest_json JSON,
  FOREIGN KEY (dbt_credential_id) REFERENCES dbt_credentials(id)
);
```
### dc_logs_staging
- **e_log_dbt_project_run_statuses** table
```sql
CREATE TABLE e_log_dbt_project_run_statuses (
  name VARCHAR(255) PRIMARY KEY
);
```
**Statuses are:** RUNNING, SUCCESSFUL, PARTIAL SUCCESS, FAILED, QUEUED
- **e_log_dbt_project_run_error_types** table
```sql
CREATE TABLE e_log_dbt_project_run_error_types (
  name VARCHAR(255) PRIMARY KEY
);
```
- **e_log_dbt_project_run_types** table
```sql
CREATE TABLE e_log_dbt_project_run_types (
  name VARCHAR(255) PRIMARY KEY
);
```
**Types are:** MANUAL, SCHEDULED
- **log_dbt_project_run** table
```sql
CREATE TABLE log_dbt_project_run (
  id INT PRIMARY KEY AUTO_INCREMENT,
  dbt_project_id INT NOT NULL,
  model_type VARCHAR(255),
  run_type VARCHAR(255),
  created_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP,
  updated_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP ON UPDATE CURRENT_TIMESTAMP,
  status VARCHAR(255),
  log_file_path TEXT,
  FOREIGN KEY (run_type) REFERENCES e_log_dbt_project_run_types(name),
  FOREIGN KEY (status) REFERENCES e_log_dbt_project_run_statuses(name)
);
```
- **log_dbt_project_run_errors** table
```sql
CREATE TABLE log_dbt_project_run_errors (
  id INT PRIMARY KEY AUTO_INCREMENT,
  log_dbt_project_run_id INT NOT NULL,
  err_type VARCHAR(255),
  created_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP,
  updated_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP ON UPDATE CURRENT_TIMESTAMP,
  traceback TEXT,
  FOREIGN KEY (log_dbt_project_run_id) REFERENCES log_dbt_project_run(id),
  FOREIGN KEY (err_type) REFERENCES e_log_dbt_project_run_error_types(name)
);
```
## API Development
### FrontendAPI
#### Requirements
- Endpoints for ADD, UPDATE/PATCH, and DELETE operations on all dbt-related tables.
- An async job that runs while a project is being configured, or when the user triggers the refresh button (sketched after this list):
  - Clone the repository
  - Check out the specified branch
  - Execute the **dbt compile** command (run in a common environment that only needs access to the dbt commands) to generate a **manifest.json** file
  - Save the generated manifest.json to the **dbt_project_details** table
  - Return the list of models
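A sketch of that async job, assuming the common environment exposes `git` and `dbt` on PATH and that persisting manifest_json to `dbt_project_details` happens elsewhere; all names are illustrative:

```python
import json
import subprocess
from pathlib import Path


def refresh_models(clone_url: str, branch: str, workdir: Path) -> list:
    # Clone the repository at the requested branch.
    subprocess.run(["git", "clone", "--branch", branch, clone_url,
                    str(workdir)], check=True)

    # dbt compile writes target/manifest.json inside the project directory.
    subprocess.run(["dbt", "compile", "--project-dir", str(workdir),
                    "--profiles-dir", str(workdir)], check=True)

    # Parse the manifest; the full JSON would be saved to
    # dbt_project_details.manifest_json, and the model names returned.
    manifest = json.loads((workdir / "target" / "manifest.json").read_text())
    return [node["name"] for node in manifest["nodes"].values()
            if node["resource_type"] == "model"]
```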
### MetaAPI
Required for creating the logging endpoints:
- An endpoint to add a run entry in the database and update its status based on the run's outcome.
- Insert a new row for each error encountered during a run.
- An endpoint to upload a log file to Amazon S3 and return its URL; alternatively, this could live in the backend code (TBD). A sketch follows.
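If the upload ends up in the MetaAPI, a minimal boto3 sketch could look like the following (the bucket name and key scheme are assumptions):

```python
import boto3


def upload_log(local_path: str, run_id: int,
               bucket: str = "dc-dbt-logs") -> str:  # bucket name assumed
    s3 = boto3.client("s3")
    key = f"dbt-runs/{run_id}/run.log"  # key scheme assumed
    s3.upload_file(local_path, bucket, key)
    # Return a presigned URL so the frontend can fetch the log directly.
    return s3.generate_presigned_url(
        "get_object", Params={"Bucket": bucket, "Key": key}, ExpiresIn=3600
    )
```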