# ETL Pipeline for Brigade
This is an existing pipeline that uses TensorFlow and Keras, currently running on top of Azure ML Services. The goal is to port the pipeline to Brigade.
## Pipeline description
* preprocessing
  * one job
  * CPU bound
  * job takes ~40 minutes for the current data set on F8 machines (8 cores, 16 GB RAM)
  * reads files from the input persistent volume and writes to the _per build_ persistent volume
* training
  * 5 parallel jobs that use different folds of the input data
  * GPU bound
  * job takes ~24 hours for the current data set
  * the jobs use `nvidia-docker` and CUDA (it should be possible to configure the pods to use the NVIDIA container runtime and mount the NVIDIA drivers instead)
  * output of this step is a model file in the output persistent volume
* prediction service
  * uses the model file created in the previous step to build a container image with a web server in front
  * push to the container registry
  * `helm upgrade` for the prediction service
  * some instrumentation testing to validate the model
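The three stages above could be expressed as a `brigade.js` script. The sketch below uses a tiny `Job`/`Group` stand-in for Brigade's `brigadier` module so it runs standalone; in a real script you would `const { Job, Group } = require("brigadier")` instead. The image names, mount paths, and task commands are assumptions, not real values from the pipeline.

```javascript
// Minimal stand-in for Brigade's "brigadier" Job/Group API so this sketch
// runs standalone; a real brigade.js would require("brigadier") instead.
const log = [];

class Job {
  constructor(name, image, tasks = []) {
    this.name = name;
    this.image = image;
    this.tasks = tasks;
  }
  async run() {
    log.push(this.name); // real Brigade would schedule a pod here
  }
}

const Group = {
  runAll: (jobs) => Promise.all(jobs.map((j) => j.run())),
};

async function pipeline() {
  // 1. CPU-bound preprocessing: read from the input PV, write to the per-build PV
  const preprocess = new Job("preprocess", "example.azurecr.io/etl-preprocess", [
    "python preprocess.py --in /mnt/input --out /mnt/build",
  ]);
  await preprocess.run();

  // 2. Five GPU-bound training jobs, one per fold, run in parallel
  const folds = [0, 1, 2, 3, 4].map(
    (i) =>
      new Job(`train-fold-${i}`, "example.azurecr.io/etl-train-gpu", [
        `python train.py --fold ${i} --data /mnt/build --model /mnt/output`,
      ])
  );
  await Group.runAll(folds);

  // 3. Build and push the prediction image, then roll it out with helm
  const deploy = new Job("deploy-prediction", "example.azurecr.io/deployer", [
    "docker build -t registry/prediction:latest .",
    "docker push registry/prediction:latest",
    "helm upgrade prediction ./charts/prediction",
  ]);
  await deploy.run();
  return log;
}
```

In real Brigade the script would be wired to an event handler (e.g. `events.on(...)`) rather than called directly, and the GPU jobs would need the NVIDIA runtime configuration discussed above.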
## Requirements
* mount data persistent volumes:
  * immutable input data persistent volume, used as input across builds and not modified by jobs
  * new output data persistent volumes _per build_
* log streaming / `stdout` streaming for pods → live metric inspection used by a Python CLI plot library
* job run inspection + logs
* cluster autoscaling
* run history
* trigger pipeline by file updates in the input persistent volume
* email / Slack / Teams notifications
* stop jobs / builds on demand
* low priority nodes (+ resume job)
* current pipeline uses `nvidia-docker` and CUDA
## Notes
* for testing purposes, a smaller data set can be used to limit job times. However, running a 24-hour build on Brigade still needs investigation.
* reference architecture in image below:

## Incremental ETL
Most of the pipelines for ETL jobs are designed to be executed from start to finish in a single run.
However, there are cases where it's convenient to store the intermediate state and buffer the data before it's finalised.
Consider the following trivial example.
- An RSS feed is parsed every 30 seconds
- A module takes the input URL and fetches the full page
- An email is delivered with the full content of the article
You don't want to deliver the same article twice, so you need to keep track of which items you have already seen.
You might think this requirement is specific to RSS, so why not embed that logic there?
It's a fair point, but the RSS example isn't an isolated case.
- Imagine scanning an IMAP inbox for new messages and reacting to those
- Imagine collecting metrics from IoT devices that only occasionally publish data
- Imagine building a collection of artefacts from a monorepo
If your pipeline is not a straight line but a DAG, then having intermediate jobs makes sense since you can process and cache the intermediate result before it is further transformed downstream.
That is even more relevant when you have an agent that accumulates tasks, another that acts as a trigger, and a downstream agent that combines the two.
```
Accumulator
+--------------+
| |
| |
events +-----------> | +--------------+
| | |
| | v
| | +----+-----+
+--------------+ | |
| merge |
| +---------->
| |
Trigger +----^-----+
+--------------+ |
| | |
| | |
| every hour | |
| +--------------+
| |
| |
+--------------+
```
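The accumulator/trigger/merge pattern in the diagram could be sketched as below; the class and function names are made up for illustration.

```javascript
// Sketch of the accumulator + trigger + merge pattern: the accumulator
// buffers incoming events, and the trigger (e.g. an hourly timer) tells the
// merge step to drain the buffer and emit a combined batch.
class Accumulator {
  constructor() {
    this.buffer = [];
  }
  push(event) {
    this.buffer.push(event);
  }
  drain() {
    const batch = this.buffer;
    this.buffer = [];
    return batch;
  }
}

// Called by the trigger; combines everything accumulated so far.
function merge(accumulator) {
  const batch = accumulator.drain();
  return { count: batch.length, events: batch };
}
```

In a real deployment the trigger would be a scheduled event (the "every hour" box above) and the batch would be handed to the downstream job.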
The example I can offer for the functionality above is [huginn/huginn](https://github.com/huginn/huginn).
You can also find similarities with [Bazel](https://bazel.build/).