# CASE STUDY: COMPADO DATA PLATFORM

Compado is a market leader in automated ranking and smart recommendation services. By leveraging big data technology combined with advanced machine learning algorithms, the company lists and recommends commercial products, enabling clients to receive direct and non-incentivized traffic to their brands.

* **Industry**: Advertising
* **Headquarters**: Uhlandstraße 171-172, 10719 Berlin, Germany
* **Founded Date**: 2015
* **Website**: www.compado.com
* **Ownership Status**: Privately Held (Venture Capital-Backed)
* **Compado Executive Team**: Andreas Hoogendijk (Co-Founder & CEO), Jennie Wage (COO), Emanuel Hoch (Co-Founder & CPO)
* **Employees**: 88

## Introduction

Using various data sources such as the clickstream of web pages/online stores, analytics insights from trusted services (Leadfeeder, Google Analytics, etc.) and the client's profile (company + potential purchaser), it is crucial to build a robust data platform that:

* *guarantees the privacy and security of the whole pipeline,*
* *is fault-tolerant, scalable, fast and highly available in processing structured as well as semi-/unstructured data,*
* *revolves around a combination of robust open-source stacks and managed services within a cloud provider ecosystem (AWS is the main choice for Compado),*
* *facilitates the generation of reports and real-time dashboards for the business/operations teams,*
* *provides a rich environment for the data-science team to run experiments and models and store the results,*
* *is evolutive, maintainable and cost-effective.*

`The following study focuses mainly on architecture aspects and provides solutions for some pain points within the data pipeline that present engineering challenges.`

## Architecture

The description of the technical components and the data flow in the case study suggests a Lambda paradigm with two separate serving layers. This big data architecture is designed to handle massive quantities of information by taking advantage of both batch and stream-processing methods. A high-level view of that kind of paradigm would be:

![](https://i.imgur.com/twRg7dF.png)

To translate that paradigm into the company's suggested stacks, we can decompose the data flow into three different stages:

* **Raw Data Sources**: The first step of the platform is to accurately identify the different data sources that will feed the whole process. Three main data sources can be deduced from the script (backend services, files/documents and third-party providers), and we can safely assume four data types: transactional data for the companies' metadata and the customers' profiles, unstructured data for the files, analytics data for the third-party APIs, and streaming events for the clickstream.
* **Processing Paradigms**: The classic one (batch paradigm) ingests all data except for the streaming events using an ETL stack (AWS Lake Formation and AWS Glue), then loads the processed data into an intermediate storage with low cost and high availability (AWS S3). The data is then cleaned and put into the right format within large data warehouses (AWS Redshift). Batch views aggregating similar business data (verticals, sectors, groups of products, etc.) are finally created inside data marts (AWS Redshift) to be queried for reporting purposes. Parallel to this pipeline runs a real-time processing flow that loads streaming events using a fault-tolerant/scalable message broker (AWS Kinesis Firehose). A specific event process analyses the events (AWS Kinesis Analytics) and produces results conveyed over the same message bus.
Raw events are stored within AWS S3 and processed events are stored in AWS Redshift. Both processing flows are managed by a scheduler (Apache Airflow on top of a Celery/Flower framework) that supervises all the jobs necessary for the pipelines (a minimal sketch of such a DAG is given at the end of this section).
* **Visualization & Reporting**: Three different types of reporting can be deduced from the script. Business reports (Tableau) use the results produced by the batch and real-time pipelines; these are the end recommendations/scores/rankings at the core of Compado's business. Software reports generate metrics and alarms to supervise/maintain the different technical components (AWS Prometheus, AWS Timestream and AWS Grafana). Finally, hardware maintainability is guaranteed by a separate monitoring stack (AWS Cloudwatch, AWS Timestream and AWS Grafana) that collects and crunches logs from the EC2 instances (mainly CPU, RAM, disk, bandwidth and queries).

![](https://i.imgur.com/gnT0yXJ.png)

The main technology components/stacks visible in the previous figure are listed in the table below *(n.b. the versions of each component are not present in the script, therefore we assume that the engineering team is using fairly recent versions)*:

| Stage | Step | Component |
|:---:|:---:|:---:|
| Batch Processing | Raw Data Extraction | AWS Lake Formation |
| | Raw Data Transformation | AWS Glue |
| | Intermediate Data Storage | AWS S3 |
| | Processed Data Storage (Data Warehouse) | AWS Redshift |
| | Batch Views (Data Marts) | AWS Redshift |
| | Scheduler | Apache Airflow (Celery/Flower framework) |
| | | |
| Stream Processing | Raw Events Extraction | AWS Kinesis Firehose |
| | Raw Events Storage | AWS S3 |
| | Raw Events Transformation | AWS Kinesis Analytics |
| | Processed Events Storage | AWS Redshift |
| | Speed View | AWS Redshift |
| | Scheduler | Apache Airflow (Celery/Flower framework) |
| | | |
| Visualization & Reporting | Business Reports | Tableau |
| | Software Monitoring | AWS Prometheus |
| | | AWS Timestream |
| | | AWS Grafana |
| | Hardware Monitoring | AWS Cloudwatch |
| | | AWS Timestream |
| | | AWS Grafana |
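
To make the role of the scheduler more concrete, below is a minimal sketch of how the batch part of this pipeline could be wired as an Airflow DAG. The DAG id, Glue job, bucket, schema, table and connection names are invented for the illustration (the script does not name them), and the import paths assume Airflow 2.x with the Amazon provider package installed; exact operator names vary across provider versions.

```python
from datetime import datetime

from airflow import DAG
from airflow.providers.amazon.aws.operators.glue import GlueJobOperator
from airflow.providers.amazon.aws.transfers.s3_to_redshift import S3ToRedshiftOperator

with DAG(
    dag_id="daily_batch_pipeline",          # hypothetical DAG name
    start_date=datetime(2021, 1, 1),
    schedule_interval="@daily",
    catchup=False,
) as dag:
    # Run the Glue ETL job that cleans the raw sources and lands them in S3.
    transform_raw_data = GlueJobOperator(
        task_id="transform_raw_data",
        job_name="raw_sources_etl",         # hypothetical Glue job
        aws_conn_id="aws_default",
    )

    # COPY the curated Parquet files from the intermediate bucket into Redshift.
    load_warehouse = S3ToRedshiftOperator(
        task_id="load_warehouse",
        schema="analytics",                 # hypothetical schema/table
        table="customer_profiles",
        s3_bucket="compado-intermediate",   # hypothetical bucket
        s3_key="curated/customer_profiles/",
        copy_options=["FORMAT AS PARQUET"],
        redshift_conn_id="redshift_default",
    )

    transform_raw_data >> load_warehouse
```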
## Challenges

`1. Some of the data we want to visualize/report has a high dimensionality and is stored in large denormalized tables in Redshift. These tables are too large to do extracts on and the live connection to the redshift table is not very performant for the workbooks we have. Have you encountered a similar issue in your past positions? If yes, how did you deal with them? If not, do you have any suggestions for improvements to provide fast access to this data to our end-users?`

Two issues can be deduced from the question:

a) Complexity of performing extracts due to the size of the tables in the database,
b) Poor performance of the live connection from the workbooks to the Redshift table.

* **Issue a)** From the script, AWS Redshift appears to be a data warehouse used as a single source of truth for analytics and reporting. However, in order to achieve efficient extract performance with this kind of columnar database, optimisation techniques at the storage, compute and cluster level are needed.

* **Archiving strategy for old/unused data**: Data engineers need to create a specific maintenance pipeline that unloads this kind of data from Redshift to S3 and deletes it from Redshift. Apache Airflow could easily schedule that pipeline to run during off-peak hours. A high-level flow of the pipeline would be as follows (a hedged Airflow/SQL sketch of this flow is given after this list):
  1. Select the specific target data that needs archiving.
  2. Insert information about that data (age, decommissioned services, etc.) into a separate configuration table (name, filters, etc.).
  3. Define a SQL query generated from the previous information (name, filters, etc.) and use the UNLOAD command to copy the data from Redshift to S3.
  4. Delete the old/unused/target data from AWS Redshift based on the same SQL query.
  5. Run the Redshift VACUUM command to clean up the database and reclaim storage capacity.
* **Access strategy for rarely used data**: This kind of data stresses the pipeline for few business benefits, therefore it is better to store it in a compressed, columnar Parquet format. For this job, Apache Airflow can leverage AWS Redshift Spectrum in a simple flow:
  1. Store the compressed data in AWS S3 (Parquet), with one folder per table to ease retrieval.
  2. Configure Apache Airflow to retrieve the AWS S3 folders.
  3. Before launching the Airflow job, create a data catalog using AWS Glue.
  4. Based on that catalog, enforce an external table schema in Redshift to ease the scheduled loading.
* **Sort strategy for query performance**: The data engineering team needs to assess the sort keys configured in AWS Redshift. By allowing the query planner to scan less data for CPU-demanding operations (especially joins and group-bys), a modification of the sort keys can greatly improve query performance.
* **Compression strategy for better I/O utilization**: In order to analyze fewer data blocks within AWS Redshift, compression can be very handy since it improves available storage and minimizes I/O usage. The data engineering team should check whether columns have been given a compression encoding and, if not, apply one with an ALTER statement wherever needed.
* **Loading strategy for new data**: Inserting new data without a specific sorting beforehand can become a nightmare when a huge dump is expected. This issue can greatly affect extraction performance in AWS Redshift, therefore relying on a predefined SORT key guarantees the performance of the database whenever it is queried. In fact, the VACUUM command allows enforcing a sorting strategy when new data is inserted, and Apache Airflow can schedule this command for big or frequent incoming dumps.
* **Manage query priority**: During the everyday workload, some queries made for reporting purposes can run in a short time, yet they can be blocked behind complex queries made, for example, by the data scientists (CPU- and time-consuming operators). AWS Redshift offers a workload manager (WLM) to set up query priority and specific rules for a group of users.
* **Distribution strategy**: Query performance can be greatly improved using a smart distribution style based on the data used for the company's business. The ideal scenario is rows evenly distributed across the different nodes so that they can be processed in parallel.
* **Disk type strategy**: Choosing between AWS DS2, DC2 and RA3 cluster types can be crucial for optimizing storage. Between HDD-based dense storage, SSD-based dense compute and managed storage, the architects need to be aware of the suitable cluster type depending on the use cases and the data involved.
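
Under the assumption that Redshift is reachable through a standard Postgres-style Airflow connection, the hedged sketch below shows how the archiving flow described above could be scheduled. Table, bucket, IAM role and connection names are placeholders, the retention predicate is only an example, and the operator import assumes Airflow 2.x with the postgres provider installed.

```python
from datetime import datetime

from airflow import DAG
from airflow.providers.postgres.operators.postgres import PostgresOperator

# Hypothetical retention rule: archive clickstream rows older than 18 months.
ARCHIVE_PREDICATE = "event_date < dateadd(month, -18, current_date)"

with DAG(
    dag_id="redshift_archive_clickstream",
    start_date=datetime(2021, 1, 1),
    schedule_interval="0 3 * * 0",  # off-peak: Sundays at 03:00
    catchup=False,
) as dag:
    # 1. Unload the target rows to S3 in a compressed, columnar format.
    unload_to_s3 = PostgresOperator(
        task_id="unload_to_s3",
        postgres_conn_id="redshift_default",
        sql=f"""
            UNLOAD ('SELECT * FROM clickstream_events WHERE {ARCHIVE_PREDICATE}')
            TO 's3://compado-archive/clickstream_events/'
            IAM_ROLE 'arn:aws:iam::123456789012:role/redshift-unload'
            FORMAT AS PARQUET;
        """,
    )

    # 2. Remove the archived rows from the warehouse.
    delete_archived = PostgresOperator(
        task_id="delete_archived_rows",
        postgres_conn_id="redshift_default",
        sql=f"DELETE FROM clickstream_events WHERE {ARCHIVE_PREDICATE};",
    )

    # 3. Reclaim the freed blocks and refresh the planner statistics.
    reclaim_space = PostgresOperator(
        task_id="vacuum_and_analyze",
        postgres_conn_id="redshift_default",
        autocommit=True,  # VACUUM cannot run inside a transaction block
        sql=["VACUUM clickstream_events;", "ANALYZE clickstream_events;"],
    )

    unload_to_s3 >> delete_archived >> reclaim_space
```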
To choose the right solution (or solutions, since several of the previous strategies can be combined), it is vital to create a rich and complete monitoring system dedicated to maintaining the platform and, more importantly, to finding the best ways to improve it:

* **Software Monitoring**: Based on a robust, scalable and low-latency stack, the software monitoring tools (mainly AWS Prometheus, AWS Timestream and AWS Grafana) can list and analyze the most important metrics for each component in the architecture, raise alarms and visualize them. DB connections (threshold limit), top long-running queries, data throughput, latency, etc. can all be captured precisely by the stack and allow data engineers to find the source of a problem and the most efficient solution for the software. For the main business DB (AWS Redshift), we recommend following closely the recommendations from AWS Redshift Advisor (distribution keys, compression encodings, sorting strategies, etc.).
* **Hardware Monitoring**: On the other hand, the stack used for hardware differs only by the metric-capturing tool (AWS Cloudwatch). Indeed, since the platform's architecture seems to run on Amazon EC2 instances, the best and most cost-efficient way to capture the machines' behaviour is to go with the Amazon-provided tool. A CPU usage alert is sent when CPU usage stays above a percentage threshold for a certain period of time (a minimal boto3 sketch of such an alarm closes this answer). Disk usage, CPU usage, RAM, bandwidth, etc. are all natively accessible via Cloudwatch and can be easily connected to a Grafana dashboard so that the OPS team has a real-time view.

* **Issue b)** Poor performance of the live connection to AWS Redshift for the workbooks. In this section we assume that the AWS Redshift DB is already optimized (following the previous recommendations) and that the problem comes from poor performance of Tableau or of the connection between the two components.

* **Check the Tableau Performance Recorder**: First we need to be sure that something is actually wrong on the Tableau side. For that, we can view metrics about the query (or queries) that take too much time/effort for Tableau to render. The Tableau Performance Recorder can be very handy for this purpose since it lets us visualize where the bottleneck comes from.

  ![](https://i.imgur.com/fnjIHkL.jpg)

* **Downsize by adding filters**: Using the "Add to Context" option in the Filter menu of Tableau can reduce the data set size significantly when used with multiple views.
* **Downsize by adding sets**: Viewing only the top tier of a data set can provide enough information for a business decision. Therefore, providing sets for as many views as possible can lead to faster reports.
* **Accelerate views by aggregating**: Instead of visualizing all the records of a data set, one way to ease Tableau's work on heavy data sets is to aggregate the measures.
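
To close this first challenge, the CPU-usage rule mentioned in the hardware-monitoring paragraph above can be expressed as a CloudWatch alarm. The sketch below is only an illustration: the instance id, region, SNS topic and thresholds are placeholders, not values taken from the script.

```python
import boto3

cloudwatch = boto3.client("cloudwatch", region_name="eu-central-1")

# Fire when average CPU stays above 85 % for three consecutive 5-minute periods,
# and notify the OPS team through a (hypothetical) SNS topic.
cloudwatch.put_metric_alarm(
    AlarmName="etl-node-high-cpu",
    Namespace="AWS/EC2",
    MetricName="CPUUtilization",
    Dimensions=[{"Name": "InstanceId", "Value": "i-0123456789abcdef0"}],
    Statistic="Average",
    Period=300,
    EvaluationPeriods=3,
    Threshold=85.0,
    ComparisonOperator="GreaterThanThreshold",
    AlarmActions=["arn:aws:sns:eu-central-1:123456789012:ops-alerts"],
)
```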
`2. We have around 500 domains and each of these domains have an average of 10 partners per domain. The number of domains will grow significantly and we would like to monitor the individual performance of all of our partners. Have you encountered a similar issue in your past positions? If yes, how did you deal with them? If not, do you have any suggestions on how to set up a monitoring for this?`

For this issue, we assume that a "domain" refers to a coherent business stream with its own specific rules and constraints. For instance, the insurance sector, together with the client company (or partner) and the results generated by the platform for that specific business, forms a domain. However, the monitoring rate/speed here is not clear, as the performance could be reviewed on a daily basis, a weekly basis, or, more challenging, in real time. As data engineers, we need to assume the worst-case scenario and tackle the individual performance of all the partners in real time. In order to achieve this goal, some requirements have to be met:

* The data catalog enforced in AWS needs to be reviewed to check for metadata completeness. A domain has to be named and that name needs to be attached to the data source. Therefore, upon ingesting the data from the different sources, a specific Airflow job would ensure that the domain name has been correctly propagated during each stage of the process.
* Since we want to monitor more than 500 domains (possibly many more as the business develops), we need to create a specific instance within AWS Redshift that stores the results for each event. This DB is highlighted in green within the architecture figure.
* A specific Airflow DAG needs to be created for collecting and processing data for each domain. A domain group can be created to aggregate data and ease the views within the visualization tool.
* Finally, a specific instance/cluster of Tableau (depending on the performance needed) will be deployed to create the needed reports.

![](https://i.imgur.com/dmyV7x1.png)

*N.B.: It is crucial to have the data catalog connected to both processing paradigms (batch and real-time) to ensure coherence in the dedicated Redshift domain-monitoring DB. Moreover, the recommendations stated in the previous section for both AWS Redshift and Tableau need to be followed here in order to ensure performant real-time domain monitoring.*

`3. (Bonus) As our company and our pipelines grow, the amount of dependencies for our Airflow environment increases. More and more we are running into trouble as packages are not compatible. Have you encountered a similar issue in your past positions? If yes, how did you deal with them? If not, do you have any suggestions on how to make sure we can keep scaling?`

From the script, we can infer a dependency issue within the Airflow environment related to the Python packages used by subtasks within a specific DAG. Since Compado has varied Airflow workflows ranging from data-science exploratory work to day-to-day business reporting, conflicting package dependencies can easily arise, as several teams might use vastly different libraries for their workflows:

* The modern way to address this issue is to use the KubernetesPodOperator, which allows users to launch arbitrary Kubernetes pods and configurations (runtime environments, resources, etc.). Indeed, custom Docker images guarantee that each task's environment, configuration and dependencies are completely isolated and reproducible. The figure below shows a high-level view of that decoupling process using the Kubernetes operator: using the Python client to create a request to the API server, the orchestrator launches a pod that contains the specific environment variables and dependencies for the task. For each task, Kubernetes can then collect the logs and display them in a visual tool for monitoring purposes. Please note that the operator has to be imported before creating the DAG (`from airflow.contrib.operators.kubernetes_pod_operator import KubernetesPodOperator` in Airflow 1.10; in Airflow 2.x it lives in the `cncf.kubernetes` provider package).
![](https://i.imgur.com/CEhkVgq.png)

*NB: To a certain extent, the DockerOperator and the PythonVirtualenvOperator allow doing the same, but with less flexibility and more engineering work down the road.*

Below is an example with the PythonVirtualenvOperator:

```python
def function_x(arg1, arg2, arg3, arg4):
    # the task code runs inside a dedicated virtualenv
    ...

taskX = PythonVirtualenvOperator(
    task_id='python-virtual-env-X',
    python_callable=function_x,
    op_args=[arg1, arg2, arg3, arg4],
    requirements=['pip-packageZ', 'pip-packageY', 'pip-packageW'],
    python_version='3',
    dag=dag)
```

Below is an example with the DockerOperator:

```python
task_launch = DockerOperator(
    task_id='task_1',
    image='influx/influxdb:latest',
    api_version='auto',
    auto_remove=True,
    environment={
        'PYSPARK_PYTHON': "python3",
        'SPARK_HOME': "/spark"
    },
    volumes=['/home/X/applicationY:/applicationY-folder'],
    command='/spark/bin/spark-submit --master local[*] /simple-app/applicationY.py',
    docker_url='unix://var/run/docker.sock',
    network_mode='bridge'
)
```
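
For completeness, here is a minimal sketch of the KubernetesPodOperator approach recommended above. The task id, namespace, image and command are hypothetical; the import path matches the Airflow 1.10 contrib module mentioned earlier (in Airflow 2.x the operator lives in the `cncf.kubernetes` provider package).

```python
from airflow.contrib.operators.kubernetes_pod_operator import KubernetesPodOperator

task_scoring = KubernetesPodOperator(
    task_id='partner-scoring',
    name='partner-scoring',
    namespace='airflow-workers',        # hypothetical namespace
    image='compado/scoring-job:1.4.2',  # hypothetical image with its own pinned dependencies
    cmds=['python', '-m', 'scoring.run'],
    arguments=['--date', '{{ ds }}'],   # templated execution date
    get_logs=True,                      # stream pod logs back to the Airflow UI
    is_delete_operator_pod=True,        # clean up the pod once the task finishes
    dag=dag,
)
```

Because each image pins its own dependencies, the data-science and reporting teams can evolve their libraries independently without breaking the shared Airflow environment.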