--- title: CDC Diagnosis (working) tags: ded --- **TODO**: Ask IT - [x] Contact IT support to get info of `ACCOUNT` and `STMT_ENTRY` table (size, number of rows) - [x] Contact IT support to confirm that the data contained in current DWH is data from 2021 - [x] How to get trigger condition for: + DBLINK: `INFO_CARD` + OGG: `ACCOUNT`, `STMT_ENTRY` + Batch: `VDWHPROCESSINGGATEINAMC` - [x] How to get trigger condition for: (đã gửi mail hỏi IT) + SQL procedure: `EDW_DMT.R_TBL_LN_0016_PLN` + SQL procedure: `EDW_DMT.R_TBL_LN_0024_HR` Other - [x] Ask Duong Risk: Description of table `VDWHPROCESSINGGATEINAMC` - [x] Ask Duong Rick: Output of feature list? A table contains all features and by customerId, process_date or what? - [x] Ask Duong Risk: the timestamp needed for each daily snapshot fact table and transaction fact table (eg. Need last 6 months? or need the whole history in last 3 years?) To follow: - [ ] Follow up with Duong risk on final version of feature lists. He said end of this week will complete. **NOTE**: - All data table in `DWH` contains data from `2021-01-01`. All past data need to get from `DWH2018/DWH2019/DWH2020` - All data estimated below based on `DWH` ## `DWH.ACCOUNT`, `T24: FBNK_ACCOUNT` **Description**: Daily snapshot of customer account. **Logic ETL from source**: - ETL 1-1 from source, add `PROCESS_DATE` as ETL day. PS5 needs historical data to calculate casa balance and overdraft balance of each account. - Load all data in `DWH.ACCOUNT/V_MASK_ACCOUNT` for historical data, remember to mask salary account (`product_category=1006`) + Size/Number of rows in total: ~ 1 billion rows + Partitioning based on: `PROCESS_DATE` - Get daily data from source `T24: FBNK_ACCOUNT` + Batch loading + Estimate workload daily: ~ 8 million rows + How and When to trigger job? - S3 structure: `YEAR/MONTH/DATE/CUSTOMER_ID/ACCOUNT_NUMBER` ## `DWH.STMT_ENTRY`, `T24: FBNK_STMT_ENTRY` **Description**: Card transaction fact table. **Logic ETL from source**: - ETL 1-1 from source PS5 needs all historical data. - Load all data in `DWH.STMT_ENTRY`, remember to mask transactions from salary account (`product_category=1006`) + Size/Number of rows in total: , contains data from `2021-01-01` to `T-1` + Partitioning based on: `PROCESSING_DATE`, `CUSTOMER_ID` - Get daily data from source `T24: FBNK_STMT_ENTRY` + Batch loading (_or streaming, which one is better?_) + Estimate workload daily: ~ 16 million rows + CDC based on `BOOKING_DATE` (Get all transaction in current processing date) - S3 structure: `YEAR/MONTH/DATE/CUSTOMER_ID/ACCOUNT_NUMBER` (year, month, date extracted) ## `DWH.INFO_CARD`, `T24: FBNK_INFO_CARD` **Description**: Daily snapshot of customer card. **Logic ETL from source**: - `INFO_CARD`: ETL 1-1 from source, get lastest data. - `INFO_CARD_TSD`: ETL 1-1 from source, add three more fields: + `SURROGATE_KEY`: `INFO_CARD_TSD_SEQ.NEXTVAL` + `CURRENT_RECORD`: 1 + `CREATION_DATE`: equals `to_date('#v_gbl_gg_processing_date','YYYYMMDD')` + `END_DATE`: equals `to_date('24000101','YYYYMMDD')` PS5 needs only lastest data. - (optional for historical data) Load all data in `DWH.INFO_CARD_TSD` (SCD type 2 table) + Size/Number of rows in total: 11Gb, 19 million rows, contains data from `2017-01-03` + Batch loading, cut file based on `SURROGATE KEY` - Get daily data from source `T24: FBNK_INFO_CARD` + Batch loading + Estimate workload daily: 9Gb, ~ 11 million rows + How and When to trigger job? - S3 structure: `YEAR/MONTH/DATE/CUSTOMER_ID/ACCOUNT_NUMBER` ## `CMS: DWH_COLLECTIONCENTRALBANK` **Description**: Daily snapshot of credit cards. **Logic ETL from source**: - ETL 1-1 from source, add `PROCESS_DATE` as ETL day. PS5 needs historical data (to calculate credit card statement balance in last 6 months). - Load all data in `DWH.CMS_COLLECTIONCENTRALBANK` + Size/Number of rows in total: ~60 million rows, contains data from `2020-01-01` to `T-1` + Partitioning based on `PROCESS_DATE`, `PRODCODE` - Get daily data from source `CMS: DWH_COLLECTIONCENTRALBANK` + Batch loading + Estimate workload daily: 500k rows + Trigger: check logic in function `FNC_CHECK_CMS_COB_STATUS` and `FNC_CHECK_CMS_COB_SCH` - S3 structure: `YEAR/MONTH/DATE/CUSTOMER_ID/PRODCODE/ACCOUNT_NUMBER` ## `DMS: VDWHPROCESSINGGATEINAMC` **Description**: Type IPL (upsert), save lastest information of??? **Logic ETL from source**: - ETL 1-1 from source, add `PROCESS_DATE` as ETL day. For PS5: - No need to initial load from DWH - Directly daily load from source: + Batch loading, daily job (ETL by truncating and repopulating) + Size/Number of rows: ~2 millions rows + Trigger: ??? - S3 structure: `CUSTOMER_ID/IDCARD` or `PROCESSINGGATE/CUSTOMER_ID` Schedule ETL data at source: - Data từ DWH được đổ vào bảng Temp (1.1), rồi từ bảng Temp đổ vào các view trên DMS (1.2). Bước 1.1 thì real time; còn bước 1.2 thì đc schedule vào 7h, 12h, 20h và được chạy trong khoảng 30-60'. - Dữ liệu trên view sẽ là dữ liệu tại thời điểm bên chị kéo ## `EDW_DMT.R_TBL_LN_0016_PLN` **Description**: Daily snapshot for customer's loan contract. For PS5: - Initial load from `EDW_DMT.R_TBL_LN_0016_PLN` + Size/Number of rows in total: ~430 million rows, 82Gb, contains data from `2017-07-11` to `T-1` + Partitioning based on: `NGAY_BAO_CAO` or `BUT_TOAN` or `NGAY_BAO_CAO/CUSTOMER_ID` - Get daily data from `EDW_DMT.R_TBL_LN_0016_PLN`, where `NGAY_BAO_CAO` is the current ETL date: + Batch loading + Estimate workload daily: 500k rows + Trigger: ??? (job not run during holiday/sunday) - S3 structure: `NGAY_BAO_CAO/CUSTOMER_ID/BUT_TOAN` Schedule ETL data at source: - ## `EDW_DMT.R_TBL_LN_0024_HR` **Description**: Daily snapshot of employees of Techcombank. For PS5: - Initial load from `EDW_DMT.R_TBL_LN_0024_HR` + Size/Number of rows in total: ~6.5 mil rows, contains data from `2019-07-19` to `T-1` + Partitioning based on `PROCESS_DATE` - Get daily data from `EDW_DMT.R_TBL_LN_0024_HR`, where `PROCESS_DATE` is the current ETL date: + Batch loading + Estimate workload daily: ~14k rows + Trigger: ??? (job not run during holiday/sunday) - S3 structure: `YEAR/MONTH/DATE`