# ETLs to Argo Migration Validation: The Integrity Check
This document discusses a *proposed* solution for checking the integrity of the *ETLs* deployment on **Argo**.
### Before we start:
It is important to distinguish between two main issues we are and will be dealing with in our ETLs:
1. The lack of a robust process for the data and development parties that make changes, whether at the database level or to software pieces that directly affect the production ETLs.
2. The Argo migration project and making sure that everything is going as expected, or at least that it behaves the same as the production ETLs for now.
I mention this because, while we are deploying the ETLs on Argo, these issues might still occur and break both the production and the Argo ETLs. That is why we need to focus from the beginning on robust integrity and data validation tests, not only to make sure that both ETLs return the same data at a certain point in time (since corrupted data may not throw an error), but also to make sure that both keep behaving as expected and return the same ***correct*** data.
To achieve that, we will need to work on three separate projects that affect each other:
1. ETLs refactoring:
which should cover:
- refactoring the ETLs
- creating all database schemas
- developing unit tests (CI/CD)
- adding middle ETL stages data validation checks
2. Developing a Git and Data Collaboration Manifest
Aims to establish robust, widely communicated, and adopted best practices (dos and don'ts) for all parties that contribute to the digital stack, and should include:
- CI/CD workflows
- Data Schemas change process
- Git Repos collaboration rules and workflows
3. Argo Migration project
Aims to make sure that the migration process is lean and follows concrete validation methods, so that we avoid rework and are not affected by any data issues that would break both the production and Argo ETLs (we will already be dealing with those issues in the first project mentioned above). The goal is to focus on one thing at a time.
This document aims to discuss the third project, the "Argo Migration project", in detail.
## Argo Migration project
Long story short, I suggest that we do not need to test both ETLs against the daily updated data snapshot in order to calibrate them. We just need to run the production ETL on a sample representing its data, save the results to a database, and use this saved snapshot as simulation data (representing the usual data schemas for a certain ETL) for testing the Argo ETL. What follows is a more detailed explanation of the solution.
### The Integrity Check Process Flow
The following are some conventions that we will be using throughout this document:
- **Validation Data Snippet (Snapshot)**.
> A validation data snippet is a snippet of data captured at a certain point in the ETL: it represents the input data after being transformed by a specific ETL stage, and it is used to make sure that the same stage of that ETL on Argo returns the same data snippet.
- **Validation Stage**.
> A validation stage is a calibration unit placed in the ETL after some transformation layers; it compares the data produced by that stage on Argo against the Validation Snippet of the corresponding stage in the production ETL.
- **Validation Snippets Database**.
> A database dedicated to the integrity validation snippets.
- **ETL Input Data Sample**.
> A data sample that represents the input data of a specific ETL.
<p align="center">
<img src="https://i.imgur.com/q09EhFK.png" width="350">
</p>
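To make the conventions above concrete, the following is a minimal sketch of how the "Validation Snippets Database" could be laid out, assuming SQLite as the storage engine; the table name, columns, and helper function are illustrative assumptions, not a prescribed design.

```python
import sqlite3

# Illustrative layout for the "Validation Snippets Database" (all names are assumptions).
# Each row stores the fingerprint of the data produced by one validation stage of one
# ETL, so that the Argo run can later be compared against it.
SCHEMA = """
CREATE TABLE IF NOT EXISTS validation_snippets (
    etl_name   TEXT NOT NULL,                      -- which ETL the snippet belongs to
    stage_name TEXT NOT NULL,                      -- which validation stage produced it
    source     TEXT NOT NULL,                      -- 'production' or 'argo'
    sha256     TEXT NOT NULL,                      -- SHA-256 hash of the stage output
    summary    TEXT,                               -- serialized table summary (optional)
    created_at TEXT DEFAULT CURRENT_TIMESTAMP,
    PRIMARY KEY (etl_name, stage_name, source)
);
"""

def open_snippets_db(path: str = "validation_snippets.db") -> sqlite3.Connection:
    """Open the validation snippets database, creating the table if needed."""
    conn = sqlite3.connect(path)
    conn.executescript(SCHEMA)
    return conn
```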

## The Validation Process Steps:
1. Taking a sample that represents the input data for a specific ETL "ETL Input Data Sample".
2. Saving the data to the "Validation Snippets Database".
3. Running the production ETL against this data sample.
4. Saving the resulting data (the "Validation Snippet") after each main transformation layer ("Validation Stage") of the ETL to the "Validation Snippets Database".
5. Deploying the ETL on Argo.
6. Running the Argo ETL against the corresponding input data sample from the "Validation Snippets Database".
7. After each "Validation Stage" of the Argo ETL, comparing the resulting data against the "Validation Snippet" of the same "Validation Stage" of the production ETL (see the sketch after these steps).
8. The Argo ETLs should pass all their validation tests to be verified.
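To illustrate steps 3-8, here is a minimal sketch assuming each ETL exposes its main transformation layers as an ordered list of `(stage_name, transform)` pairs over a pandas DataFrame, and assuming the SQLite table from the earlier sketch plus a `snippet_hash` helper (sketched under "The Integrity Check Methods" below). None of these names come from the actual ETL code; they are placeholders for this proposal.

```python
import sqlite3
import pandas as pd
from integrity_checks import snippet_hash  # hypothetical module; see the hashing sketch below

def record_production_snippets(conn: sqlite3.Connection, etl_name: str,
                               input_sample: pd.DataFrame, stages) -> None:
    """Steps 3-4: run the production ETL stage by stage and store a fingerprint
    of each validation stage's output in the Validation Snippets Database."""
    df = input_sample
    for stage_name, transform in stages:      # stages: ordered (name, function) pairs
        df = transform(df)                    # one main transformation layer
        conn.execute(
            "INSERT OR REPLACE INTO validation_snippets "
            "(etl_name, stage_name, source, sha256) VALUES (?, ?, 'production', ?)",
            (etl_name, stage_name, snippet_hash(df)),
        )
    conn.commit()

def validate_argo_run(conn: sqlite3.Connection, etl_name: str,
                      input_sample: pd.DataFrame, stages) -> list:
    """Steps 6-8: run the Argo ETL against the same input sample and compare each
    stage's output against the stored production snippet; returns failing stages."""
    df = input_sample
    failures = []
    for stage_name, transform in stages:
        df = transform(df)
        row = conn.execute(
            "SELECT sha256 FROM validation_snippets "
            "WHERE etl_name = ? AND stage_name = ? AND source = 'production'",
            (etl_name, stage_name),
        ).fetchone()
        if row is None or row[0] != snippet_hash(df):
            failures.append(stage_name)
    return failures  # the Argo ETL is verified only when this list is empty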
## The Integrity Check Methods:
<p align="center">
<img src="https://i.imgur.com/GqFE2w6.png" width="300">
</p>
<p align="center">
<em>data sample</em>
</p>
- **SHA-256 Hashing** (see the sketch after this list).
> A cryptographic hash function that outputs a value that is 256 bits long; comparing the hashes of two data snippets tells us whether they are identical.
>> Output: `6a7938b427d8b24d0fd09c2c7aed8c5109d2146ca0235d2dc32967c894d1ed5e`
- **R Table Summary**.
> The generic `summary()` function in R, used to produce result summaries of various objects; here it gives a compact statistical summary of a data table that can be compared between the production and Argo outputs.
>> Output:
<img src="https://i.imgur.com/LXRqwJ3.png" width="350">
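As a rough illustration of both methods, here is a sketch in Python: the fingerprint uses `hashlib`'s SHA-256 over a canonicalized CSV serialization of the snippet, and `pandas.DataFrame.describe()` stands in for R's `summary()`. The canonicalization choices (column and row ordering, CSV, default float formatting) are assumptions and would have to match whatever the production and Argo ETLs actually emit.

```python
import hashlib
import pandas as pd

def snippet_hash(df: pd.DataFrame) -> str:
    """SHA-256 fingerprint of a data snippet.
    Columns and rows are sorted first so the hash does not depend on ordering."""
    canonical = (
        df.reindex(sorted(df.columns), axis=1)   # fixed column order
          .sort_values(by=sorted(df.columns))    # fixed row order
          .to_csv(index=False)                   # stable text serialization
    )
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()

def snippet_summary(df: pd.DataFrame) -> str:
    """Serialized table summary; a rough pandas analogue of R's summary()."""
    return df.describe(include="all").to_csv()

def compare_snippets(prod: pd.DataFrame, argo: pd.DataFrame) -> bool:
    """A validation stage passes when the fingerprints match; the summaries serve
    as a human-readable aid for diagnosing a mismatch."""
    if snippet_hash(prod) == snippet_hash(argo):
        return True
    print("Snippet mismatch; compare the summaries:")
    print(snippet_summary(prod))
    print(snippet_summary(argo))
    return False
```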
### Why do I prefer this model over the full daily database snapshots method?
Firstly, to avoid breaking the Argo ETL during the development phase with any unexpected data. If any unexpected data finds its way into the production ETL, the corresponding Argo ETL should not be exposed to that data, because the production ETL at this stage will go through its own maintenance flow after the Slack failure report. After that, if there are any changes at the database level, we will apply those changes to both the Argo ETL and the "Validation Snippets Database".
Another reason is that this is a less complex and more independent solution, as it will not be affected by any expected changes at the data schema level.