# ETL - Error Handling
### ETL (high-level) Error Handling
1. **Read Excel file**
- `loadWorkbook(buffer: Buffer) => EitherAsync<Error, Workbook>`
- On Error → Persist Error & halt job
2. **Validate Excel against Job Definition**
- `GetSourceDdl(workbook: Workbook, definition: MappingDefinition) => { errors: EtlError[]; ddl: DDL };`
- Inside this method we need retrieve the Excel columns and validate them against the Job Definition
- `GetColumnsFromMap = (headers: Row, columnsMap: ColumnMapping) => { errors: DefinitionMismatch[]; ddl: DDL };`
- On errors
- `BulkInsertErrors (errors)`
- **Halt job**
- Othewise → continue to next step
3. **Validate target tables (DB) matches the Job Definition**
- Using JobDefinition and Sequelize model:
- Read DDL from DB (retrieve tables + columns definition)
- Compare to JobDefinition (and to Sequelize model?)
- Either return _TargetMismatch\[]_ or nothing
- On errors
- `BulkInsertErrors (errors)`
- **Halt job**
4. **Parse and validate Excel using strategies**
- `ApplyStrategy(workbook: Workbook, definition: MappingDefinition) => { errors: DataTypeMismatch[], ddl: DDL, targetTable: string, data: ParsedRow[] }`
- (Return errors and parsed data)
- errors → `BulkInsertErrors(errors)`
- data → continue to next step
- _NOTE: For future support of different sources, this could be refactored to recieve RawData instead of an Excel Workbook)_
5. **Bulk Insert**
- `BulkInsertRecords (records: ParsedRow[]) => { errors: DbInsertError[] } `
- Process all rows and return SQL errors
- errors → `BulkInsertErrors(errors)`
- mark job success/fail depending if there were errors + last processed row number (from source)
### Identified Error Cases
1. **DefinitionMismatch**
- Missing columns in source (Excel), compared to Job Definition.
- E.g.: Column D does not exist
- Mismatch column in source (Excel), compared to Job Definition.
- E.g.: On column B, the header was expected to be "Comment", but found "comment"
2. **DataTypeMismatch**
- Field cannot be parsed or coverted to target type
- E.g.: At row X and column Y, expected value to be convertible to "Date". Found value is "2023-03".
3. **TargetMismatch**
- Column not found in target table
- E.g.: Column XYZ not found in database table
- Cannot convert column from source to target
- E.g.: Column ZYX is using "boolean" in database, but defined as "integer" in job definition
- Target table not found
- E.g.: Table "detailReviews" not found in database
4. **DbInsertError**
- Failure while inserting, could be any reason
### Error Types definition
```TYPESCRIPT
export type DefinitionMismatch = {
type: "DefinitionMismatch",
message: string,
file: string, //master//master_details//..
column: string, //A//B//..
expected: string, //NameExpectedA//NameExpectedB//..
received: string, //NameFoundA//NameFoundB//..
};
export type DataTypeMismatch = {
type: "DataTypeMismatch",
message: string, // "At row X and column Y, expected a "Date" but found a "Boolean"
file: string, //master//master_details//..
row: number, //1//150//..
column: string, //A//B//..
expected: string, //date//boolean//..
received: string, //boolean//date//..
};
export type TargetMismatch = {
type: "TargetMismatch",
message: string, //Column "completed" is using "boolean" in database, but defined as "date" in job definition
table: string, //"detailReviews"
column: string, //"completed"
expected: string, //"date"
received: string, //"boolean"|"null"
};
export type DbInsertError = {
type: "DbInsertError",
message: string,
table: string,
};
export type EtlError =
| DefinitionMismatch
| DataTypeMismatch
| TargetMismatch
| DbInsertError;
```
### Errors table
```SQL
CREATE TABLE [dbo].[errors](
[job_id] [uniqueidentifier] NOT NULL,
[type] [nvarchar](50) NOT NULL,
[message] [nvarchar](max) NOT NULL,
[file_name] [nvarchar](20) NULL, // For future use, we could rename it to 'source'
[row] [int] NULL, // (optional - depends on error type)
[column] [nvarchar](2) NULL, // (optional) A, B... AA, AB...
[details] [nvarchar(MAX)] NOT NULL, // other error attributes, serialized as JSON
)
```
### Some objetives for implementation
- Separate steps from each other
- Plan with a weekly incremental progress
- Provide plan and refactor steps
- Start with Excel validation
- then Data Parsing validation
- Target DDL validation (can take longer)
- ...
- Make them easy to (unit)test
- Start with "full ETL" tests
- As unit-tests
- Input: Excels
- Assertion: on ParsedRows output and Errors based on different error cases. E.g.:
- Happy path
- Excel to JobDefinition does not match (and abort early)
- Excel with mismatched data types (should complete with errors)
- Excel with mismatched
- Target Mismatch between mocked DDL and Job Definition
- Test harness to validate JobDefinition maps correctly to Sequelize Models
- This can be a unit test that validates Job Definition against the Sequelize Models (checks all columns are mapped, data types mappings are correct, etc.)
- This test should be part of CI/CD
- Eventually, we can think of a dynamic tests that can validate ANY job; or eventually auto-generate Sequelize from Job Definition