# Mall-Data K8s P.O.C.
Created: February 9, 2022 10:50 AM
## What has been done:
- Mall-Data:
    - Successfully dockerized, with the Catalog Platform repository included as a git submodule
    - Images pushed to ECR can be pulled by the agent to run jobs
- Inside the Dev cluster:
    - A Kubernetes Agent is actively waiting to receive flow runs from Prefect Cloud
    - A MongoDB service is running on `mongo:27017`
- Flows can be deployed using the normal deployment commands:
    - Flows are registered and their pickles are stored in S3. Flows only re-register if their structure changes (the version is bumped on Prefect)
- Flows can be run from Prefect Cloud (UI/API):
    - The agent successfully spins up a new Kubernetes job for each run
    - Multiple concurrent jobs can run in the cluster simultaneously
    - Pods are created on demand by the agent when triggered by Prefect Cloud (see the quick check after this list)
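A quick way to verify the above from outside the cluster. A minimal check, assuming the `mall-data-test` namespace used in the port-forwarding example further down:

```bash
# the agent pod and MongoDB service should be listed; one Job appears per flow run
kubectl get pods,svc,jobs -n mall-data-test
```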
## Deploying a Kubernetes Prefect Agent
```bash
kubectl apply -f prefect_agent.yaml --namespace=<namespace>
```
Modify `prefect_agent.yaml` in the root directory to add additional tags and configurations.
- We can have just one agent configured with all the tags (according to Prefect, that should be enough)
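To confirm the agent came up and is polling Prefect Cloud, checking its logs is usually enough. A minimal sketch, assuming `prefect_agent.yaml` creates a Deployment named `prefect-agent`:

```bash
# the agent pod should be Running
kubectl get pods -n <namespace>

# the logs should show the agent registering with Prefect Cloud and waiting for flow runs
kubectl logs deployment/prefect-agent -n <namespace> --tail=50
```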
## Building and pushing the docker image
```bash
# run from mall-data base directory (don't forget to make it executable)
./build_docker.sh
# tag
docker tag mall-data:latest 160712987787.dkr.ecr.us-west-2.amazonaws.com/mall-data:latest
# push (takes a while)
docker push 160712987787.dkr.ecr.us-west-2.amazonaws.com/mall-data:latest
```
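If the push is rejected with an authentication error, the local Docker client needs to log in to ECR first. A sketch assuming AWS CLI v2 and the account/region used above:

```bash
# authenticate Docker against the ECR registry (the token is valid for 12 hours)
aws ecr get-login-password --region us-west-2 \
  | docker login --username AWS --password-stdin 160712987787.dkr.ecr.us-west-2.amazonaws.com
```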
- This is the biggest bottleneck right now since we are building the image locally.
- Building the Docker image for the first time takes ~10+ minutes, but subsequent builds are dramatically shorter because of layer caching.
- Pushing to ECR takes a really long time since the image is almost 1 GB in size.
    - This makes sense, as we are including BOTH Catalog Platform and Mall-Data in this image.
    - We might need to take some time and look into pruning the dependencies.
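As a starting point for pruning, the per-layer sizes show where most of the ~1 GB comes from (standard Docker commands, using the tag from the build step above):

```bash
# overall image size
docker image ls mall-data:latest

# size contributed by each layer / Dockerfile step
docker history mall-data:latest
```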
## Port Forwarding (needed for migrations)
```bash
kubectl port-forward svc/mongo 8080:27017 -n mall-data-test
# MongoDB can now be accessed on localhost:8080
```
- You can use any valid local port; I just picked something other than 27017 since my local Mongo is already running on that.
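With the tunnel open, the database is reachable through the forwarded port, e.g. with the Mongo shell:

```bash
# connect through the forwarded port (use the legacy `mongo` shell if mongosh is not installed)
mongosh "mongodb://localhost:8080"
```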
## Notes
### Migrations
- Since the MongoDB in the Dev cluster is brand new, migrations need to be run so that new flows have data to work with
- You have to run the `pop-col` command and the migration command
- **Make sure the config YAML files point to the correct port, or pass in the correct port if that is supported**
### Making Changes in Catalog Platform from Mall-Data
- Catalog Platform (CP) is a git submodule — it is pointing to a specific commit in the CP repo’s history. When changes are made in CP, the submodule needs to be updated to point to the latest commit containing the changes.
- Changes to CP can be made directly from within Mall-Data; you just have to make sure those changes are also committed to the actual CP repo if they should be persisted.
- Mall-Data does not use the code in the CP submodule directory directly; it treats it like a package, so you have to `pip install` it for it to be used.
- Because it is a package, any updates to CP will not show up in Mall-Data until you reinstall it:
```bash
# e.g.
pip install Catalog-Platform/pipeline
```
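For reference, a sketch of the full update loop when a change originates from inside the Mall-Data working tree (branch name and commit messages are illustrative):

```bash
# inside the submodule: commit the change on a branch and push it to the actual CP repo
cd Catalog-Platform
git checkout -b my-cp-change        # submodules check out in detached HEAD, so create a branch first
git add .
git commit -m "Describe the CP change"
git push origin my-cp-change

# back in Mall-Data: record the new submodule commit
cd ..
git add Catalog-Platform
git commit -m "Bump Catalog-Platform submodule"

# reinstall the package so Mall-Data picks up the change
pip install Catalog-Platform/pipeline
```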
### Load
- 40+ flows can be run on the K8s cluster; where it fails is the other services we connect to (like CU)
- Load is typically not an issue: when flows run on schedule, the timing is spaced out enough that we do not crash other services
## Use-cases
1. Normal Operations: permanent code changes to Mall-Data
    1. A new Docker image containing the changes needs to be built and pushed to ECR
    2. The flow needs to be registered with Prefect Cloud
    3. The Kubernetes agent will automatically pull the new image for all future flow runs
    4. Running the flow can be triggered from Prefect Cloud via the API or UI (see the CLI sketch after this list)
2. Running Locally
    1. Make use of environment variables; we need one set of Mongo and ES endpoints for local runs and a separate set for K8s
3. Special Operations
    1. Deploying a single flow run in Kubernetes on demand; this needs more research.
        - It needs to be fast, but right now pushing to ECR can take around 10 minutes
        - Maybe we can have a dedicated Prefect Agent to run these one-off jobs and use a different storage method
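For the API/UI trigger in the first use-case, the Prefect CLI has a command-line equivalent. A rough sketch; the exact flags depend on the Prefect 1.x version installed, and the flow/project names are placeholders:

```bash
# trigger a registered flow from the CLI instead of the UI
prefect run --name "<flow-name>" --project "<project-name>"
```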
## Next Steps
- Move the POC off the Dev cluster and into its own (or an existing) cluster
- Environment-based execution (right now everything is configured to run on K8s only: we are using a Kubernetes run config and all the endpoints point to Kubernetes services); see the sketch at the end of this section
- Set up an actual MongoDB
- Set up an actual ES, or refactor CP to not make ES connections on initialization
- Optional: figure out the resource requirements for each pod (flow run). We are currently using the defaults defined by Prefect. No jobs have failed due to insufficient resources yet.
- Optional: figure out the number of concurrent flow runs we can have (more a CU issue than a K8s issue)
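One possible shape for the environment-based execution item is to keep the endpoints in environment variables and let the local shell or the run config set them per environment. The variable names below are hypothetical, not ones Mall-Data currently defines:

```bash
# local development: point at local services
export MONGO_URI="mongodb://localhost:27017"
export ES_HOST="http://localhost:9200"

# K8s execution: point at the in-cluster services instead (service names assumed)
# export MONGO_URI="mongodb://mongo:27017"
# export ES_HOST="http://elasticsearch:9200"
```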