# ETL - Error Handling ### ETL (high-level) Error Handling 1. **Read Excel file** - `loadWorkbook(buffer: Buffer) => EitherAsync<Error, Workbook>` - On Error → Persist Error & halt job 2. **Validate Excel against Job Definition** - `GetSourceDdl(workbook: Workbook, definition: MappingDefinition) => { errors: EtlError[]; ddl: DDL };` - Inside this method we need retrieve the Excel columns and validate them against the Job Definition - `GetColumnsFromMap = (headers: Row, columnsMap: ColumnMapping) => { errors: DefinitionMismatch[]; ddl: DDL };` - On errors - `BulkInsertErrors (errors)` - **Halt job** - Othewise → continue to next step 3. **Validate target tables (DB) matches the Job Definition** - Using JobDefinition and Sequelize model: - Read DDL from DB (retrieve tables + columns definition) - Compare to JobDefinition (and to Sequelize model?) - Either return _TargetMismatch\[]_ or nothing - On errors - `BulkInsertErrors (errors)` - **Halt job** 4. **Parse and validate Excel using strategies** - `ApplyStrategy(workbook: Workbook, definition: MappingDefinition) => { errors: DataTypeMismatch[], ddl: DDL, targetTable: string, data: ParsedRow[] }` - (Return errors and parsed data) - errors → `BulkInsertErrors(errors)` - data → continue to next step - _NOTE: For future support of different sources, this could be refactored to recieve RawData instead of an Excel Workbook)_ 5. **Bulk Insert** - `BulkInsertRecords (records: ParsedRow[]) => { errors: DbInsertError[] } ` - Process all rows and return SQL errors - errors → `BulkInsertErrors(errors)` - mark job success/fail depending if there were errors + last processed row number (from source) ### Identified Error Cases 1. **DefinitionMismatch** - Missing columns in source (Excel), compared to Job Definition. - E.g.: Column D does not exist - Mismatch column in source (Excel), compared to Job Definition. - E.g.: On column B, the header was expected to be "Comment", but found "comment" 2. **DataTypeMismatch** - Field cannot be parsed or coverted to target type - E.g.: At row X and column Y, expected value to be convertible to "Date". Found value is "2023-03". 3. **TargetMismatch** - Column not found in target table - E.g.: Column XYZ not found in database table - Cannot convert column from source to target - E.g.: Column ZYX is using "boolean" in database, but defined as "integer" in job definition - Target table not found - E.g.: Table "detailReviews" not found in database 4. **DbInsertError** - Failure while inserting, could be any reason ### Error Types definition ```TYPESCRIPT export type DefinitionMismatch = { type: "DefinitionMismatch", message: string, file: string, //master//master_details//.. column: string, //A//B//.. expected: string, //NameExpectedA//NameExpectedB//.. received: string, //NameFoundA//NameFoundB//.. }; export type DataTypeMismatch = { type: "DataTypeMismatch", message: string, // "At row X and column Y, expected a "Date" but found a "Boolean" file: string, //master//master_details//.. row: number, //1//150//.. column: string, //A//B//.. expected: string, //date//boolean//.. received: string, //boolean//date//.. }; export type TargetMismatch = { type: "TargetMismatch", message: string, //Column "completed" is using "boolean" in database, but defined as "date" in job definition table: string, //"detailReviews" column: string, //"completed" expected: string, //"date" received: string, //"boolean"|"null" }; export type DbInsertError = { type: "DbInsertError", message: string, table: string, }; export type EtlError = | DefinitionMismatch | DataTypeMismatch | TargetMismatch | DbInsertError; ``` ### Errors table ```SQL CREATE TABLE [dbo].[errors]( [job_id] [uniqueidentifier] NOT NULL, [type] [nvarchar](50) NOT NULL, [message] [nvarchar](max) NOT NULL, [file_name] [nvarchar](20) NULL, // For future use, we could rename it to 'source' [row] [int] NULL, // (optional - depends on error type) [column] [nvarchar](2) NULL, // (optional) A, B... AA, AB... [details] [nvarchar(MAX)] NOT NULL, // other error attributes, serialized as JSON ) ``` ### Some objetives for implementation - Separate steps from each other - Plan with a weekly incremental progress - Provide plan and refactor steps - Start with Excel validation - then Data Parsing validation - Target DDL validation (can take longer) - ... - Make them easy to (unit)test - Start with "full ETL" tests - As unit-tests - Input: Excels - Assertion: on ParsedRows output and Errors based on different error cases. E.g.: - Happy path - Excel to JobDefinition does not match (and abort early) - Excel with mismatched data types (should complete with errors) - Excel with mismatched - Target Mismatch between mocked DDL and Job Definition - Test harness to validate JobDefinition maps correctly to Sequelize Models - This can be a unit test that validates Job Definition against the Sequelize Models (checks all columns are mapped, data types mappings are correct, etc.) - This test should be part of CI/CD - Eventually, we can think of a dynamic tests that can validate ANY job; or eventually auto-generate Sequelize from Job Definition