# ETL - Code Improvements ### Separation of phases Many stages are chained, but not all. Many still exist as loops. We could separate them in three main stages: - Extraction (from Excel) - Transform (from RawData + JobDefinition to ParsedData) - Load (sequelize -> bulkInsert) Ideally, each stage will have many more "pipes" and "transform" chainable functions - Each stage can contain its own validation (again, as a function) - Each function should return Either with DATA or ERRORs - On same cases, the DATA is DATA + WARNINGS (e.g.: DataTypeMismatch, DuplicateRecords, etc) **Proposal**: [Separation of phases](https://hackmd.io/@LnF_3akuRP65LmZ6lwz5lQ/r1LLd7jb2) COST: Medium ### One-Job = N-Files - Do we want to keep a single Job for multiple files? Or trigger N jobs as N JobDefinitions are detected? - Currently, the code handles them as a sequence (in a "for" loop) and cannot be extracted (e.g.: to process in parallel; or in "other processes") - We could map the job names/filenames to Lazy ETL jobs, then run them as needed: keep them in sequence or ask to externalize to another process **Proposal**: [Handle independent jobs](https://hackmd.io/@LnF_3akuRP65LmZ6lwz5lQ/HkBKAVsbn) COST: Medium ### Push side-effects handling to the outsides - Operations like READING from bucket or INSERTING data should be pushed to the boundaries of the ETL job. - This can also help if we need to implement other input sources or outputs. - Also, replace some INTERNAL "for" loops, with array.map and lazy evaluation - E.g.: "how can we retrieve totalProcessedRows or errors" before a side effect is run? COST Medium ### Job Definition Extensability - Job definitions are defined as a Dictionary, in which *ExcelFileName.xlsx => JobDefinition* - not bad, not highly scalable - Today, the etl runs all jobs, or some by specifing the filenames as cmd parameters - Suggestion: detect which files are present in bucket, process those only - JobDefinition are tieds to Excel as source. To provide support for other sources, we need a *nice* refactor: - An option is to make a "EtlSource" type, each containing method to read data. We can start with an ExcelEtlSource and define a basic interface: - Set Input Parameters (filename, JobDefiniion) - Validate Schema - Retrieve RawData - Some other soures could be: - extract from a database with a query - extract from CSV file COST: High #### Excel Job Definition - Currently, we are locating in Excel using the COLs position (A,B,C,D). Then, we validate against its header row text. - If the columns of the Excel are re-ordered, the JobDefinition needs to be updated - We could just remove the COL position and lookup by header row text. - Decouple COL order from header row text and simplify JobDefinition for Excel COST: Medium ### Repository Abstraction - Abstraction is simple, but sequelize complicates things - Issues: - no coverage on data pieces (repository/sequelize) - new jobdefinition require new sequelize entities and migrations - Alternatives: - Use "dynamic" SQL instead, but migrations will need to be rethinked - Maybe, keep sequelize *just* for migrations COST: Highest ### Writeback: - Currently supports ONE table only with some specific fields. - Lambda and Database abstractions are specific to the Master table - Not prepared to support more tables easily, or based on JobDefinition **Proposal**: [ETL - Table writeback](https://hackmd.io/@agustinmartinucci/BJc5gmjWn) COST: Medium/High (not clear based on current and future requirements) ### Improve Traceability - "console.log" is ugly, we could add a simple logger abstraction - logger.log / logger.warn / logger.error - measure long running operations - Group messages by operation (context) - logger.newContext("Job:master.xlsx") COST: Low ### Code TODOs: - Group helper functions together - excel helpers - convertion helpers - jobdefinition/strategy helpers - Add meaningfull comments to code - Remove unused properties in Sequelize models COST: Low ### Items in Radar: - DB maintainabiliy: - jobs/rows/errros will accumulate - cronjob to delete old, failed jobs and their related data? - COST: Medium