# M&A Pipeline Project (Airflow + Streamlit): Doom Reset & Fresh Run
> How to recover a broken setup (start at Step 0), get the pipeline running from scratch (start at Step 1), or just rerun an existing setup (jump to Step 8).
This README explains:
* **Step 0 – Doom reset**: delete old environments, clear ports, and restart Docker.
* **Step 1 – Fresh Airflow environment**: pick the right Python, create the venv, install Airflow with constraints.
* **Step 2 – Start a clean Docker stack** (app + Postgres containers).
* **Step 3 – Configure and initialize Airflow** (metadata DB, admin user, DAG check).
* **Step 4 – Run Airflow (scheduler + webserver)**.
* **Step 5 – Trigger the DAG** and watch Bronze → Silver → Gold.
* **Step 6 – Verify the Gold layer in Postgres**.
* **Step 7 – Launch the Streamlit dashboard.**
* **Step 8 – Daily short start** for a machine that is already set up.
All commands are shown so you can copy-paste them into your terminal.
GitHub repo: https://github.com/excecutors/wrds-ma-impact-pipeline/tree/main
---
## Step 0 – Doom Reset (Clean Slate)
Use this if things are in a weird state or you just pulled the repo on a new machine and want to be sure everything is clean.
### 0.1. Kill anything on Airflow ports (8080, 8793)
```bash
lsof -i :8080
kill -9 <PID>
lsof -i :8793
kill -9 <PID>
```
If you can’t kill what’s on 8080 (or Docker is using it), we’ll just use **8081** for Airflow later.
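If you'd rather not copy PIDs by hand, a short loop does the same thing (a sketch; adjust the port list as needed):
```bash
# Same idea as above, just automated: kill whatever holds each port (no-op if free).
for port in 8080 8793; do
  pids=$(lsof -ti :"$port")
  [ -n "$pids" ] && kill -9 $pids
done
```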
### 0.2. Make sure Docker is running
* Open **Docker Desktop**.
* Wait until it says **Running**.
Then check:
```bash
docker ps
```
If you see errors like:
```text
Cannot connect to the Docker daemon ... Is the docker daemon running?
```
→ Docker Desktop is not fully up yet.
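If you script your setup, you can poll the daemon instead of re-running `docker ps` by hand (a small sketch; adjust the timeout to taste):
```bash
# Poll until the Docker daemon responds (give up after ~60 seconds).
for i in $(seq 1 30); do
  docker info >/dev/null 2>&1 && break
  echo "Waiting for Docker daemon... ($i/30)"
  sleep 2
done
```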
### 0.3. Stop any old project containers
From the repo root (`wrds-ma-impact-pipeline`); the `cd` below is only needed if you aren't there yet:
```bash
cd wrds-ma-impact-pipeline
# Stop and remove old containers
docker-compose -f .devcontainer/docker-compose.yml down
```
Optional hard reset (if you really want to clean Docker):
```bash
docker system prune -f
```
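For an even deeper clean, the broader prune flags are available. Note this touches Docker resources machine-wide, not just this project's, so use it with care:
```bash
# WARNING: removes all stopped containers, unused networks, unused images,
# build cache, and (with --volumes) unused volumes -- machine-wide, not project-scoped.
docker system prune -af --volumes
```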
### 0.4. Delete any broken Airflow venv
If you previously installed Airflow with the wrong Python version, clean it:
```bash
deactivate # if a venv is active
rm -rf .airflow-venv
```
At this point: **no containers from this project are running and no Airflow venv exists.**
---
## Step 1 – Create a Fresh Airflow Environment
Airflow 2.9.1 runs most reliably on Python **3.10** or **3.11** (newer interpreters can hit dependency issues).
We'll set up a new venv with Python 3.11 and install Airflow using the official constraints file.
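If you're not sure Python 3.11 is installed, check first. The install commands in the comments are just examples for macOS/Homebrew and Debian/Ubuntu; use whatever fits your system:
```bash
# Confirm a python3.11 interpreter is on PATH.
python3.11 --version || echo "python3.11 not found"

# Example installs (pick the one for your OS):
#   macOS (Homebrew):    brew install python@3.11
#   Debian/Ubuntu (apt): sudo apt-get install python3.11 python3.11-venv
```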
### 1.1. Create and activate the venv
From the project root:
```bash
python3.11 -m venv .airflow-venv
source .airflow-venv/bin/activate
pip install --upgrade pip
```
### 1.2. Install Airflow 2.9.1 with constraints
```bash
export AIRFLOW_VERSION=2.9.1
export PYTHON_VERSION=3.11
export CONSTRAINT_URL="https://raw.githubusercontent.com/apache/airflow/constraints-${AIRFLOW_VERSION}/constraints-${PYTHON_VERSION}.txt"
pip install "apache-airflow==${AIRFLOW_VERSION}" --constraint "${CONSTRAINT_URL}"
```
Verify:
```bash
airflow version
# should print: 2.9.1
```
---
## Step 2 – Start a Clean Docker Stack
Now bring up the app + Postgres containers.
From the repo root (the `cd` below is only needed if you aren't there yet):
```bash
cd wrds-ma-impact-pipeline
docker-compose -f .devcontainer/docker-compose.yml up -d --build
```
Check containers:
```bash
docker ps
```
You should see something like:
* `ma_project_app` (exposes 8080, 8501)
* `ma_project_db` (Postgres on 5432)
If you get a daemon error, reopen Docker Desktop and run the `docker-compose ... up` command again.
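A compact way to check just the container names and port mappings (optional):
```bash
# Show only names and published ports for running containers.
docker ps --format 'table {{.Names}}\t{{.Ports}}'
```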
---
## Step 3 – Configure and Initialize Airflow
Now we wire up Airflow to the local `airflow` folder, initialize the metadata DB, and create an admin user.
### 3.1. Activate venv and set `AIRFLOW_HOME`
```bash
source .airflow-venv/bin/activate
export AIRFLOW_HOME=$(pwd)/airflow
```
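Optionally, you can also tell Airflow to skip the bundled example DAGs so the UI shows only this project's DAG. This is a standard Airflow setting; export it before `db init` and before starting the scheduler/webserver:
```bash
# Optional: hide Airflow's example DAGs from the UI and scheduler.
export AIRFLOW__CORE__LOAD_EXAMPLES=False
```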
### 3.2. Initialize the Airflow metadata database
Run once per clean reset:
```bash
airflow db init
```
> If you see an error like:
>
> ```
> Can't locate revision identified by '1949afb29106'
> ```
>
> then delete the old DB and re‑init:
>
> ```bash
> rm -f airflow/airflow.db
> airflow db init
> ```
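Equivalently, `airflow db reset` wipes and re-creates the metadata DB in one step (it prompts for confirmation unless you pass `-y`):
```bash
# Alternative to deleting airflow.db by hand: wipe and re-initialize the metadata DB.
airflow db reset -y
```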
### 3.3. Create the Airflow admin user
Do this once per DB reset so the UI is immediately usable:
```bash
airflow users create \
--username admin \
--firstname Admin \
--lastname User \
--role Admin \
--email admin@example.com \
--password admin
```
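To confirm the admin user was created:
```bash
# Lists all Airflow users and their roles.
airflow users list
```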
### 3.4. Confirm the DAG is registered
```bash
airflow dags list | grep ma_value_impact_pipeline
```
You should see a row pointing to:
```text
airflow/dags/ma_pipeline_dag.py
```
If not:
* Make sure the file exists at `airflow/dags/ma_pipeline_dag.py`.
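If the file exists but the DAG still doesn't show up, Airflow can report the import error directly:
```bash
# Shows tracebacks for DAG files that failed to import (missing deps, syntax errors, etc.).
airflow dags list-import-errors
```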
---
## Step 4 – Run Airflow (Scheduler + Webserver)
You’ll use **two terminals** for this.
### 4.1. Terminal A – Run the scheduler
From project root:
```bash
source .airflow-venv/bin/activate
export AIRFLOW_HOME=$(pwd)/airflow
airflow scheduler
```
Leave this running.
### 4.2. Terminal B – Run the webserver (use 8081 to avoid conflicts)
Open a new terminal, go to the project root, then:
```bash
source .airflow-venv/bin/activate
export AIRFLOW_HOME=$(pwd)/airflow
airflow webserver -p 8081
```
Now open the UI:
```text
http://localhost:8081
```
Log in with:
* **Username:** `admin`
* **Password:** `admin`
---
## Step 5 – Trigger the DAG (Bronze → Silver → Gold)
In the Airflow UI:
1. Find DAG **`ma_value_impact_pipeline`**.
2. Toggle it **ON** (unpause).
3. Click **▶ Trigger DAG** (top right).
4. Watch **Graph View** to see:
* `load_bronze`
* `build_silver`
* `build_gold`
If tasks get stuck:
* Check the scheduler logs (Terminal A).
* Make sure Docker containers `ma_project_app` and `ma_project_db` are healthy:
```bash
docker ps
```
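If you prefer the CLI over the UI, unpausing and triggering work from the venv terminal too (same DAG id; run with `AIRFLOW_HOME` exported as in Step 4):
```bash
# Unpause and trigger the DAG from the command line.
airflow dags unpause ma_value_impact_pipeline
airflow dags trigger ma_value_impact_pipeline

# Follow the run's state.
airflow dags list-runs -d ma_value_impact_pipeline
```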
---
## Step 6 – Verify Gold Layer in Postgres
You can quickly check inside the app container that `gold.final_data` is populated.
```bash
docker exec -it ma_project_app bash
```
Inside the container:
```bash
python - << 'EOF'
from src.utils.db import get_postgres_engine
import pandas as pd
engine = get_postgres_engine()
df = pd.read_sql("SELECT COUNT(*) AS rows FROM gold.final_data", engine)
print(df)
EOF
```
You should see a non‑zero row count.
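Alternatively, you can query Postgres directly in the db container. The user and database names below are placeholders, so match them to whatever `.devcontainer/docker-compose.yml` defines:
```bash
# NOTE: replace <user> and <db> with the values from docker-compose.yml.
docker exec -it ma_project_db psql -U <user> -d <db> \
  -c "SELECT COUNT(*) FROM gold.final_data;"
```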
---
## Step 7 – Launch the Streamlit Dashboard
From your host machine:
```bash
# On the host: open a shell inside the app container
docker exec -it ma_project_app bash

# Inside the container:
cd /workspace
streamlit run streamlit_app/app.py \
  --server.address=0.0.0.0 \
  --server.port=8501
```
Then open:
```text
http://localhost:8501
```
You should see the M&A Value Impact dashboard using the **Gold** table.
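If you'd rather not keep an interactive shell open, the same thing can be started in one shot (a sketch; `-w /workspace` assumes the app lives under `/workspace` as in the interactive steps above):
```bash
# Run Streamlit detached inside the app container.
docker exec -d -w /workspace ma_project_app \
  streamlit run streamlit_app/app.py --server.address=0.0.0.0 --server.port=8501
```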
---
## Step 8 – Daily Short Start (No Doom)
If everything is already set up and you just want to run things again on the same machine:
1. **Start Docker stack**
```bash
docker-compose -f .devcontainer/docker-compose.yml up -d
```
2. **Start Airflow scheduler (Terminal A)**
```bash
source .airflow-venv/bin/activate
export AIRFLOW_HOME=$(pwd)/airflow
airflow scheduler
```
3. **Start Airflow webserver (Terminal B)**
```bash
source .airflow-venv/bin/activate
export AIRFLOW_HOME=$(pwd)/airflow
airflow webserver -p 8081
```
4. **Trigger DAG in UI** → `ma_value_impact_pipeline`.
5. **Start Streamlit**
```bash
docker exec -it ma_project_app bash
cd /workspace
streamlit run streamlit_app/app.py \
--server.address=0.0.0.0 \
--server.port=8501
```