# Task Script YAML 文件 ## Task Script YAML * 以下文件以`yaml`為主,但datawriter支援Task Script以`json_array`或`jsonl`格式指派任務 * 初始必填欄位: 未填寫key&value皆會報錯,其餘必填欄位根據不同資料來源及寫入目的有所不同 ```yaml Task: Name: Owner: Reader: Source: Writer: Target: Path: ``` * 支援多task 順序執行 ```yaml Task: Name: task1 Owner: Reader: Source: Writer: Target: Path: --- Task: Name: task2 Owner: Reader: Source: Writer: Target: Path: ``` ## Task * `Name`: 任務名稱 * `Owner`: 任務負責人(可多人,以list表示) * `EmailNotification`: 填入email,運行結果通知 * `Succeeded` `Failed` 可根據成功與失敗,分別定義寄件名單(可多人,以list表示) * `ApNotification`: 填入sns arn訂閱,運行結果通知執行ap * `Succeeded` `Failed` 可根據成功與失敗,分別定義通知ap(可多人,以list表示) * `Throughput`: 每`Second`秒,寫入`Count`筆數;未設定則預設為每秒寫入500筆 ```yaml Task: Name: <string> Owner: <string> EmailNotification: Succeeded: - <email> Failed: - <email> ApNotification: Succeeded: - <sns::arn> Failed: - <sns::arn> Throughput: Second: <float> Count: <integer> ``` ## Reader * `Source`: 來源資料格式 * `csv` * `jsonl` 僅支援jsonline 格式 * `static` 直接給定讀取單(多)筆資料格式 * `Properties`: 根據不同`Source`來源,設定細節 ### csv `Properties`設定 * `Path`: **必填** s3 object路徑(ex.s3://xxx/yyy/zzz.csv) * `FieldsName:`指定header名稱,若指定則使用指定名稱,若無指定則使用第一行為header(first row as header) ```yaml FieldsName: - "col1" - "col2" - "col3" - "col4" ``` * `DefaultValues`: 當特定欄位為空字串 (`""`) 時,所做的處理 * `Name`: 欄位名稱 * `Action`: 三種處理方式 * `setToNull` 設定為null * `setValue` 設定為指定值 * `skipRow` 設定只要欄位有空字串該筆資料跳過不讀取(寫入) * `Value`: 僅給 `setValue` 使用 ```yaml DefaultValues: - Name: col1 Action: setToNull - Name: col2 Action: setValue Value: 0 - Name: col3 Action: skipRow ``` 以上設定會產生以下結果: input(read) ``` col1, col2, col3 a1, b1, c3 , b2, c3 ,, c3 a4, b4, ,, ``` output(write) ``` col1, col2, col3 a1, b1, c3 null, b2, c3 null , 0, c3 (this row was not written) (this row was not written) ``` ### jsonl `Properties`設定 * `Path`: **必填** s3 object路徑(ex.s3://xxx/yyy/zzz.json) ### static `Properties`設定 * `FieldsName`: **必填** 欄位名稱 * `Rows`: **必填** list of list 格式 ```yaml Reader: Source: static Properties: FieldsName: - "idno" - "score" - "note" - "update_time" Rows: - - 12345678 - 100 - "this is a book" - "2021/05/20 09:00:00" - - 87654321 - 60 - "this is an apple" - "2021/05/20 09:00:01" ``` ## Writer * `Target`: `mysql`, `mongo`, or `redis` * `Path`: ssm parameter store name (ex. `/idc-mariadb/sc00001_f0049_as_aws` or `/idc-mongodb/mdb0g00009_f0049_as_aws`) * `Properties`: 根據不同`Target`寫入目地,設定細節 ### MySQL DB `Properties`設定 * `Table`: **必填** 資料表名稱 * `Key`: **必選填** 下列語法需要用到的key * `update` * `upsert` * `delete` * `Mode`:**必填** query 語法 * `insert` * `update` * `upsert` * `replace` * `delete` * Option * `deleteAll`: true | false 僅在`delete`有支援 * `deleteFirst` : true | false 僅在`insert` 有支援 ### MongoDB `Properties`設定 * `Collection`: **必填** Mongo collection * `Mode`:**必填** query 語法 * `insert` * `updateOne` * `updateMany` * `replaceOne` * `deleteOne` * `deleteMany` ## Transform ### SpecifiedFields * `Source` (For reader) * `Name`:欄位名 * `Type`:欄位型別,可允許底下的種類 * `Integer` * `String` * `Float` * `UnixTimestamp`:從 1970 年 1 月 1 日 00:00:00 到當下所經過的秒數 * `Datetime`:Datetime 格式,需搭配 `Format` 來設定時間格式 * `Format`:欄位格式 (Datetime 時使用:`%Y/%m/%d %H:%M:%S`) * `Target` (For writer) * `Name`:欄位名 (當與 `Source.Name` 不同時進行重新命名) * `Type`:欄位型別,可允許底下的種類 * `Integer` * `String` * `Float` * `UnixTimestamp`:從 1970 年 1 月 1 日 00:00:00 到當下所經過的秒數 * `Datetime`:Datetime 格式,需搭配 `Format` 來設定時間格式 * `Format`:欄位格式 (Datetime 時使用:`%Y/%m/%d %H:%M:%S`) * `New` (For writer) * `Name`:欄位名(不存在於 reader 的欄位) * `Expression`: now() 可指定now()為當下寫入時間 ## Demo 以AP1429為例,執行時間為一週與一個月 * 三個task `ap1429_edu2jobcat` `ap1429_exp2jobcat` `ap1429_updateapitime` * 有email通知、不需要ap通知 * `ap1429_edu2jobcat` `ap1429_exp2jobcat` 輸入為s3 csv檔,輸出為mysql * 空字串設為`null` * 寫入 mysql,db: `m027001` table:`resume_analysis_edu_to_jobcat` `resume_analysis_expjob_to_jobcat` * 資料在insert之前會 執行全部資料先刪除 * 所有欄位由string轉換為int才寫入 * `ap1429_updateapitime` 於 前2任務結束後更新 Table: `api_update_time` `update_time` 欄位當下寫入時間 ```yaml Task: Name: ap1429_edu2jobcat Owner: - sharon.lin@104.com.tw EmailNotification: Succeeded: - sharon.lin@104.com.tw Failed: - sharon.lin@104.com.tw Throughput: Count: 500 Second: 1 Reader: Source: csv Properties: Path: s3://104dtt-dev01-code-base-ap-northeast-1/dev01_ap1429/tmp/findJobcatByEdu/20210524/findJobcatByEdu_20210524.csv DefaultValues: - Name: sex Action: setToNull - Name: year_range Action: setToNull - Name: degree_level Action: setToNull - Name: major_cat Action: setToNull - Name: job_catno Action: setToNull - Name: cnt Action: setToNull Writer: Target: mysql Path: /idc-mariadb/m027001_f0049_ai Properties: Database: m027001 Table: resume_analysis_edu_to_jobcat Mode: insert DeleteAllFirst: True Transform: SpecifiedFields: - Source: Name: sex Type: string Target: Name: sex Type: integer - Source: Name: year_range Type: string Target: Name: year_range Type: integer - Source: Name: degree_level Type: string Target: Name: degree_level Type: integer - Source: Name: major_cat Type: string Target: Name: major_cat Type: integer - Source: Name: job_catno Type: string Target: Name: job_catno Type: integer - Source: Name: cnt Type: string Target: Name: cnt Type: integer --- Task: Name: ap1429_exp2jobcat Owner: - sharon.lin@104.com.tw EmailNotification: Succeeded: - sharon.lin@104.com.tw Failed: - sharon.lin@104.com.tw Throughput: Count: 500 Second: 1 Reader: Source: csv Properties: Path: s3://104dtt-dev01-code-base-ap-northeast-1/dev01_ap1429/tmp/findJobcatByJobcat/20210524/findJobcatByJobcat_20210524.csv DefaultValues: - Name: jobcat Action: setToNull - Name: sex Action: setToNull - Name: year_range Action: setToNull - Name: area_range Action: setToNull - Name: job_catno Action: setToNull - Name: cnt Action: setToNull Writer: Target: mysql Path: /idc-mariadb/m027001_f0049_ai Properties: Database: m027001 Table: resume_analysis_expjob_to_jobcat Mode: insert DeleteAllFirst: True Transform: SpecifiedFields: - Source: Name: jobcat Type: string Target: Name: jobcat Type: integer - Source: Name: sex Type: string Target: Name: sex Type: integer - Source: Name: year_range Type: string Target: Name: year_range Type: integer - Source: Name: area_range Type: string Target: Name: area_range Type: integer - Source: Name: job_catno Type: string Target: Name: job_catno Type: integer - Source: Name: cnt Type: string Target: Name: cnt Type: integer --- Task: Name: ap1429_updateapitime Owner: - sharon.lin@104.com.tw EmailNotification: Succeeded: - sharon.lin@104.com.tw Failed: - sharon.lin@104.com.tw Throughput: Count: 500 Second: 1 Reader: Source: static Properties: FieldsName: - "id" - "api_name" Rows: - - 3 - resume_analysis_edu_to_jobcat - - 4 - resume_analysis_expjob_to_jobcat Writer: Target: mysql Path: /idc-mariadb/m027001_f0049_ai Properties: Database: m027001 Table: api_update_time Mode: update Key: - id Transform: SpecifiedFields: - New: Name: update_time Expression: now() ```