# Notes on data engineering
###### tags: `data engineering`
## Prerequisites
1. Make sure your Python version is 3.9
```bash=
$ python -V
$ conda create -n de-zoomcamp python=3.9
$ conda activate de-zoomcamp
```
Install the Postgres library for Python. (`psycopg2-binary` fails to install on older Python versions such as 3.7.3, which is why we pinned Python 3.9 above.)
```bash=
$ pip install psycopg2-binary
```
Install pgcli, a command-line interface for Postgres, using conda:
```bash=
$ conda install -c conda-forge pgcli
```
2. We ran a postgres image using the command below to create a container where we can work with PostgreSQL. (You need to create a new folder named `ny_taxi_postgres_data` in the working directory first.)
You can see below that we're passing a number of options (`-e`, `-v`, `-p`, etc.). The image we've asked for is `postgres:13`, which will be pulled from the registry if it isn't already available locally.
```
docker run -it \
-e POSTGRES_USER="root" \
-e POSTGRES_PASSWORD="root" \
-e POSTGRES_DB="ny_taxi" \
-v $(pwd)/ny_taxi_postgres_data:/var/lib/postgresql/data \
-p 5431:5432 \
postgres:13
```
* -e POSTGRES_USER : The name of our PostgreSQL user
* -e POSTGRES_PASSWORD : The password for our user
* -e POSTGRES_DB : The name we want for our database
* -v : This specifies a volume mount. By default, postgres creates its database files in a folder inside the container, which disappears when we kill the container. Normally that's fine, since it lets us generate as many identical containers as needed. In our case, however, we mounted a volume: the internal folder in the container is mapped to the external, local folder we specified, so the data persists and the container can access it when it's recreated. This is how we handle persistent data with stateless containers.
* -p 5431:5432 : This maps port 5431 on our host machine to port 5432 inside the container. Port 5432 seemed to be already in use on my machine, so I used 5431 on the host side.
3. Next up, I connected to my new database using pgcli, the postgres command-line tool we installed earlier. I specified the postgres port, database name, user, and `localhost` as the host, just to check that things are working and that we can interact with our currently empty database.
```
$ pgcli -h localhost -p 5431 -u root -d ny_taxi
```
4. Create a Jupyter notebook and download the data
```
$ pip install jupyter
$ jupyter notebook
```
If `wget` is not recognized as an internal or external command: on macOS, the easiest way to install it is with Homebrew:
```
$ brew install wget
```
Alternatively, you can use the Python `wget` library, but instead of simply running `wget` you'll need to use:
```
$ pip install wget
$ python -m wget
```
Use the backup data source:
```
$ wget https://github.com/DataTalksClub/nyc-tlc-data/releases/download/yellow/yellow_tripdata_2021-01.csv.gz
$ gzip -d yellow_tripdata_2021-01.csv.gz
```
5. In this step, we wrote our `upload-data` Jupyter notebook. It connects to our database, reads the CSV file into a pandas dataframe chunk by chunk, creates a table within our database, and appends each chunk of CSV data to that table (see the sketch below).
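Here's a minimal sketch of that notebook logic, assuming the CSV from the backup source above and the table name we use later; file and variable names are illustrative:
```python
import pandas as pd
from sqlalchemy import create_engine

# Connect to the postgres container we started earlier (note the mapped port 5431).
engine = create_engine("postgresql://root:root@localhost:5431/ny_taxi")

# Read the CSV in chunks so we never hold the full file in memory.
df_iter = pd.read_csv("yellow_tripdata_2021-01.csv", iterator=True, chunksize=100000)

first_chunk = next(df_iter)

# Create the table from the dataframe's schema, without inserting any rows yet.
first_chunk.head(0).to_sql("yellow_taxi_data", con=engine, if_exists="replace")

# Insert the first chunk, then append the remaining chunks.
first_chunk.to_sql("yellow_taxi_data", con=engine, if_exists="append")
for chunk in df_iter:
    chunk.to_sql("yellow_taxi_data", con=engine, if_exists="append")
```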
6. Next, I set up `pgadmin`. This gives us a more user-friendly, web-based graphical interface for interacting with our database, and we can run it as a docker container too. We just need to make sure our postgres and pgadmin containers are on the same network, and give both of them a name on that network, so that we can connect to our postgres engine, and thus our database, from pgadmin.
Here's how we amend our `docker run` command for postgres to add it to a network and give it a name. The name we give pgadmin isn't too important, but it's important that we name postgres so we can connect to it from pgadmin. We first create the network:
```
docker network create pg-network
```
Then run postgres container:
```
docker run -it \
-e POSTGRES_USER="root" \
-e POSTGRES_PASSWORD="root" \
-e POSTGRES_DB="ny_taxi" \
-v $(pwd)/ny_taxi_postgres_data:/var/lib/postgresql/data \
-p 5431:5432 \
--network=pg-network \
--name pg-database-2 \
postgres:13
```
Here is how we set up pgadmin. Notice that both containers are attached to the same network, `pg-network`:
```
docker run -it \
-e PGADMIN_DEFAULT_EMAIL="admin@admin.com" \
-e PGADMIN_DEFAULT_PASSWORD="root" \
-p 8080:80 \
--network=pg-network \
--name pgadmin-2 \
dpage/pgadmin4
```
* -e PGADMIN_DEFAULT_EMAIL = email we'll use to sign into pgadmin
* -e PGADMIN_DEFAULT_PASSWORD = password we'll need to sign into pgadmin
* -p 8080:80 = as with postgres, this maps a port on our host machine (8080) to the port pgadmin listens on inside the container (80).
Once we run both of these in separate terminals, we have created two containers.
To access pgadmin, we can go to `localhost:8080` in our browser and log in with the pgadmin credentials. When registering the server, we should make sure to specify `pg-database-2` as the host name we want to connect to.
7. Next up, I completed the data ingestion pipeline, now named `ingest-data`, and converted it from a Jupyter notebook into a Python script (a sketch of its command-line interface is shown below).
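Here's a rough sketch of how the script might expose its parameters with `argparse`; the flag names mirror the ones we pass to `docker run` in the next step, and the actual ingestion logic is the same pandas code as in the notebook:
```python
# ingest_data.py -- hypothetical skeleton of the ingestion script's CLI
import argparse

def main(params):
    # params.user, params.password, params.host, params.port, params.db,
    # params.table_name and params.url feed the same download + chunked
    # to_sql logic as the notebook version.
    ...

if __name__ == "__main__":
    parser = argparse.ArgumentParser(description="Ingest CSV data into Postgres")
    parser.add_argument("--user", help="postgres user name")
    parser.add_argument("--password", help="postgres password")
    parser.add_argument("--host", help="postgres host")
    parser.add_argument("--port", help="postgres port")
    parser.add_argument("--db", help="database name")
    parser.add_argument("--table_name", help="destination table name")
    parser.add_argument("--url", help="URL of the CSV file to download")
    main(parser.parse_args())
```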
8. Dockerizing the ingestion script: we can now create an image using:
```
docker build -t taxi_ingest:v001 .
```
This builds an image from our Dockerfile. We put a `.` at the end to specify the build context: the current working directory (where our Dockerfile is located). The `-t` flag specifies the name and tag of our image, in this case `taxi_ingest:v001`.
Once we build this, we have a blueprint for our containers.
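For reference, the Dockerfile being built might look roughly like this (a sketch; `ingest_data.py` is my stand-in name for the `ingest-data` script above):
```dockerfile
FROM python:3.9

# Libraries the ingestion script needs
RUN pip install pandas sqlalchemy psycopg2-binary

WORKDIR /app
COPY ingest_data.py ingest_data.py

# Arguments passed to `docker run` after the image name are forwarded to this script
ENTRYPOINT ["python", "ingest_data.py"]
```
With the image built, we can run the ingestion container on the same network as postgres, passing the script's arguments after the image name: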
```
docker run -it \
--network=pg-network \
taxi_ingest:v001 \
--user=root \
--password=root \
--host=pg-database-2 \
--port=5432 \
--db=ny_taxi \
--table_name=yellow_taxi_data \
--url="https://s3.amazonaws.com/nyc-tlc/trip+data/yellow_tripdata_2021-01.csv"
```
9. Let's look into simplifying the process with `docker-compose`, which comes as part of Docker Desktop (already installed). We'll first create a YAML file describing the containers we want to run (YAML is a language typically used for configuration files).
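Here's a sketch of what the `docker-compose.yaml` might look like, assuming it describes the same postgres and pgadmin containers we ran by hand above (the service names `pgdatabase` and `pgadmin` are my own). Compose automatically puts all services in the file on a shared network, so we no longer need the explicit `pg-network` or `--name` flags; pgadmin can reach postgres via its service name:
```yaml
services:
  pgdatabase:
    image: postgres:13
    environment:
      - POSTGRES_USER=root
      - POSTGRES_PASSWORD=root
      - POSTGRES_DB=ny_taxi
    volumes:
      - "./ny_taxi_postgres_data:/var/lib/postgresql/data:rw"
    ports:
      - "5431:5432"
  pgadmin:
    image: dpage/pgadmin4
    environment:
      - PGADMIN_DEFAULT_EMAIL=admin@admin.com
      - PGADMIN_DEFAULT_PASSWORD=root
    ports:
      - "8080:80"
```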
To actually run this file, we use:
```
docker-compose up
```
To shut down the containers, we run the following from the working directory where we started the previous command:
```
docker-compose down
```