# ETL - Code Improvements
### Separation of phases
Many stages are chained, but not all. Many still exist as loops. We could separate them in three main stages:
- Extraction (from Excel)
- Transform (from RawData + JobDefinition to ParsedData)
- Load (sequelize -> bulkInsert)
Ideally, each stage will have many more "pipes" and "transform" chainable functions
- Each stage can contain its own validation (again, as a function)
- Each function should return Either with DATA or ERRORs
- On same cases, the DATA is DATA + WARNINGS (e.g.: DataTypeMismatch, DuplicateRecords, etc)
**Proposal**: [Separation of phases](https://hackmd.io/@LnF_3akuRP65LmZ6lwz5lQ/r1LLd7jb2)
COST: Medium
### One-Job = N-Files
- Do we want to keep a single Job for multiple files? Or trigger N jobs as N JobDefinitions are detected?
- Currently, the code handles them as a sequence (in a "for" loop) and cannot be extracted (e.g.: to process in parallel; or in "other processes")
- We could map the job names/filenames to Lazy ETL jobs, then run them as needed: keep them in sequence or ask to externalize to another process
**Proposal**: [Handle independent jobs](https://hackmd.io/@LnF_3akuRP65LmZ6lwz5lQ/HkBKAVsbn)
COST: Medium
### Push side-effects handling to the outsides
- Operations like READING from bucket or INSERTING data should be pushed to the boundaries of the ETL job.
- This can also help if we need to implement other input sources or outputs.
- Also, replace some INTERNAL "for" loops, with array.map and lazy evaluation
- E.g.: "how can we retrieve totalProcessedRows or errors" before a side effect is run?
COST Medium
### Job Definition Extensability
- Job definitions are defined as a Dictionary, in which *ExcelFileName.xlsx => JobDefinition*
- not bad, not highly scalable
- Today, the etl runs all jobs, or some by specifing the filenames as cmd parameters
- Suggestion: detect which files are present in bucket, process those only
- JobDefinition are tieds to Excel as source. To provide support for other sources, we need a *nice* refactor:
- An option is to make a "EtlSource" type, each containing method to read data. We can start with an ExcelEtlSource and define a basic interface:
- Set Input Parameters (filename, JobDefiniion)
- Validate Schema
- Retrieve RawData
- Some other soures could be:
- extract from a database with a query
- extract from CSV file
COST: High
#### Excel Job Definition
- Currently, we are locating in Excel using the COLs position (A,B,C,D). Then, we validate against its header row text.
- If the columns of the Excel are re-ordered, the JobDefinition needs to be updated
- We could just remove the COL position and lookup by header row text.
- Decouple COL order from header row text and simplify JobDefinition for Excel
COST: Medium
### Repository Abstraction
- Abstraction is simple, but sequelize complicates things
- Issues:
- no coverage on data pieces (repository/sequelize)
- new jobdefinition require new sequelize entities and migrations
- Alternatives:
- Use "dynamic" SQL instead, but migrations will need to be rethinked
- Maybe, keep sequelize *just* for migrations
COST: Highest
### Writeback:
- Currently supports ONE table only with some specific fields.
- Lambda and Database abstractions are specific to the Master table
- Not prepared to support more tables easily, or based on JobDefinition
**Proposal**: [ETL - Table writeback](https://hackmd.io/@agustinmartinucci/BJc5gmjWn)
COST: Medium/High (not clear based on current and future requirements)
### Improve Traceability
- "console.log" is ugly, we could add a simple logger abstraction
- logger.log / logger.warn / logger.error
- measure long running operations
- Group messages by operation (context)
- logger.newContext("Job:master.xlsx")
COST: Low
### Code TODOs:
- Group helper functions together
- excel helpers
- convertion helpers
- jobdefinition/strategy helpers
- Add meaningfull comments to code
- Remove unused properties in Sequelize models
COST: Low
### Items in Radar:
- DB maintainabiliy:
- jobs/rows/errros will accumulate
- cronjob to delete old, failed jobs and their related data?
- COST: Medium