# Task Script YAML 文件
## Task Script YAML
* 以下文件以`yaml`為主,但datawriter支援Task Script以`json_array`或`jsonl`格式指派任務
* 初始必填欄位: 未填寫key&value皆會報錯,其餘必填欄位根據不同資料來源及寫入目的有所不同
```yaml
Task:
Name:
Owner:
Reader:
Source:
Writer:
Target:
Path:
```
* 支援多task 順序執行
```yaml
Task:
Name: task1
Owner:
Reader:
Source:
Writer:
Target:
Path:
---
Task:
Name: task2
Owner:
Reader:
Source:
Writer:
Target:
Path:
```
## Task
* `Name`: 任務名稱
* `Owner`: 任務負責人(可多人,以list表示)
* `EmailNotification`: 填入email,運行結果通知
* `Succeeded` `Failed` 可根據成功與失敗,分別定義寄件名單(可多人,以list表示)
* `ApNotification`: 填入sns arn訂閱,運行結果通知執行ap
* `Succeeded` `Failed` 可根據成功與失敗,分別定義通知ap(可多人,以list表示)
* `Throughput`: 每`Second`秒,寫入`Count`筆數;未設定則預設為每秒寫入500筆
```yaml
Task:
Name: <string>
Owner: <string>
EmailNotification:
Succeeded:
- <email>
Failed:
- <email>
ApNotification:
Succeeded:
- <sns::arn>
Failed:
- <sns::arn>
Throughput:
Second: <float>
Count: <integer>
```
## Reader
* `Source`: 來源資料格式
* `csv`
* `jsonl` 僅支援jsonline 格式
* `static` 直接給定讀取單(多)筆資料格式
* `Properties`: 根據不同`Source`來源,設定細節
### csv
`Properties`設定
* `Path`: **必填** s3 object路徑(ex.s3://xxx/yyy/zzz.csv)
* `FieldsName:`指定header名稱,若指定則使用指定名稱,若無指定則使用第一行為header(first row as header)
```yaml
FieldsName:
- "col1"
- "col2"
- "col3"
- "col4"
```
* `DefaultValues`: 當特定欄位為空字串 (`""`) 時,所做的處理
* `Name`: 欄位名稱
* `Action`: 三種處理方式
* `setToNull` 設定為null
* `setValue` 設定為指定值
* `skipRow` 設定只要欄位有空字串該筆資料跳過不讀取(寫入)
* `Value`: 僅給 `setValue` 使用
```yaml
DefaultValues:
- Name: col1
Action: setToNull
- Name: col2
Action: setValue
Value: 0
- Name: col3
Action: skipRow
```
以上設定會產生以下結果:
input(read)
```
col1, col2, col3
a1, b1, c3
, b2, c3
,, c3
a4, b4,
,,
```
output(write)
```
col1, col2, col3
a1, b1, c3
null, b2, c3
null , 0, c3
(this row was not written)
(this row was not written)
```
### jsonl
`Properties`設定
* `Path`: **必填** s3 object路徑(ex.s3://xxx/yyy/zzz.json)
### static
`Properties`設定
* `FieldsName`: **必填** 欄位名稱
* `Rows`: **必填** list of list 格式
```yaml
Reader:
Source: static
Properties:
FieldsName:
- "idno"
- "score"
- "note"
- "update_time"
Rows:
-
- 12345678
- 100
- "this is a book"
- "2021/05/20 09:00:00"
-
- 87654321
- 60
- "this is an apple"
- "2021/05/20 09:00:01"
```
## Writer
* `Target`: `mysql`, `mongo`, or `redis`
* `Path`: ssm parameter store name (ex. `/idc-mariadb/sc00001_f0049_as_aws` or `/idc-mongodb/mdb0g00009_f0049_as_aws`)
* `Properties`: 根據不同`Target`寫入目地,設定細節
### MySQL DB
`Properties`設定
* `Table`: **必填** 資料表名稱
* `Key`: **必選填** 下列語法需要用到的key
* `update`
* `upsert`
* `delete`
* `Mode`:**必填** query 語法
* `insert`
* `update`
* `upsert`
* `replace`
* `delete`
* Option
* `deleteAll`: true | false 僅在`delete`有支援
* `deleteFirst` : true | false 僅在`insert` 有支援
### MongoDB
`Properties`設定
* `Collection`: **必填** Mongo collection
* `Mode`:**必填** query 語法
* `insert`
* `updateOne`
* `updateMany`
* `replaceOne`
* `deleteOne`
* `deleteMany`
## Transform
### SpecifiedFields
* `Source` (For reader)
* `Name`:欄位名
* `Type`:欄位型別,可允許底下的種類
* `Integer`
* `String`
* `Float`
* `UnixTimestamp`:從 1970 年 1 月 1 日 00:00:00 到當下所經過的秒數
* `Datetime`:Datetime 格式,需搭配 `Format` 來設定時間格式
* `Format`:欄位格式 (Datetime 時使用:`%Y/%m/%d %H:%M:%S`)
* `Target` (For writer)
* `Name`:欄位名 (當與 `Source.Name` 不同時進行重新命名)
* `Type`:欄位型別,可允許底下的種類
* `Integer`
* `String`
* `Float`
* `UnixTimestamp`:從 1970 年 1 月 1 日 00:00:00 到當下所經過的秒數
* `Datetime`:Datetime 格式,需搭配 `Format` 來設定時間格式
* `Format`:欄位格式 (Datetime 時使用:`%Y/%m/%d %H:%M:%S`)
* `New` (For writer)
* `Name`:欄位名(不存在於 reader 的欄位)
* `Expression`: now() 可指定now()為當下寫入時間
## Demo
以AP1429為例,執行時間為一週與一個月
* 三個task `ap1429_edu2jobcat` `ap1429_exp2jobcat` `ap1429_updateapitime`
* 有email通知、不需要ap通知
* `ap1429_edu2jobcat` `ap1429_exp2jobcat` 輸入為s3 csv檔,輸出為mysql
* 空字串設為`null`
* 寫入 mysql,db: `m027001` table:`resume_analysis_edu_to_jobcat` `resume_analysis_expjob_to_jobcat`
* 資料在insert之前會 執行全部資料先刪除
* 所有欄位由string轉換為int才寫入
* `ap1429_updateapitime` 於 前2任務結束後更新 Table: `api_update_time` `update_time` 欄位當下寫入時間
```yaml
Task:
Name: ap1429_edu2jobcat
Owner:
- sharon.lin@104.com.tw
EmailNotification:
Succeeded:
- sharon.lin@104.com.tw
Failed:
- sharon.lin@104.com.tw
Throughput:
Count: 500
Second: 1
Reader:
Source: csv
Properties:
Path: s3://104dtt-dev01-code-base-ap-northeast-1/dev01_ap1429/tmp/findJobcatByEdu/20210524/findJobcatByEdu_20210524.csv
DefaultValues:
- Name: sex
Action: setToNull
- Name: year_range
Action: setToNull
- Name: degree_level
Action: setToNull
- Name: major_cat
Action: setToNull
- Name: job_catno
Action: setToNull
- Name: cnt
Action: setToNull
Writer:
Target: mysql
Path: /idc-mariadb/m027001_f0049_ai
Properties:
Database: m027001
Table: resume_analysis_edu_to_jobcat
Mode: insert
DeleteAllFirst: True
Transform:
SpecifiedFields:
- Source:
Name: sex
Type: string
Target:
Name: sex
Type: integer
- Source:
Name: year_range
Type: string
Target:
Name: year_range
Type: integer
- Source:
Name: degree_level
Type: string
Target:
Name: degree_level
Type: integer
- Source:
Name: major_cat
Type: string
Target:
Name: major_cat
Type: integer
- Source:
Name: job_catno
Type: string
Target:
Name: job_catno
Type: integer
- Source:
Name: cnt
Type: string
Target:
Name: cnt
Type: integer
---
Task:
Name: ap1429_exp2jobcat
Owner:
- sharon.lin@104.com.tw
EmailNotification:
Succeeded:
- sharon.lin@104.com.tw
Failed:
- sharon.lin@104.com.tw
Throughput:
Count: 500
Second: 1
Reader:
Source: csv
Properties:
Path: s3://104dtt-dev01-code-base-ap-northeast-1/dev01_ap1429/tmp/findJobcatByJobcat/20210524/findJobcatByJobcat_20210524.csv
DefaultValues:
- Name: jobcat
Action: setToNull
- Name: sex
Action: setToNull
- Name: year_range
Action: setToNull
- Name: area_range
Action: setToNull
- Name: job_catno
Action: setToNull
- Name: cnt
Action: setToNull
Writer:
Target: mysql
Path: /idc-mariadb/m027001_f0049_ai
Properties:
Database: m027001
Table: resume_analysis_expjob_to_jobcat
Mode: insert
DeleteAllFirst: True
Transform:
SpecifiedFields:
- Source:
Name: jobcat
Type: string
Target:
Name: jobcat
Type: integer
- Source:
Name: sex
Type: string
Target:
Name: sex
Type: integer
- Source:
Name: year_range
Type: string
Target:
Name: year_range
Type: integer
- Source:
Name: area_range
Type: string
Target:
Name: area_range
Type: integer
- Source:
Name: job_catno
Type: string
Target:
Name: job_catno
Type: integer
- Source:
Name: cnt
Type: string
Target:
Name: cnt
Type: integer
---
Task:
Name: ap1429_updateapitime
Owner:
- sharon.lin@104.com.tw
EmailNotification:
Succeeded:
- sharon.lin@104.com.tw
Failed:
- sharon.lin@104.com.tw
Throughput:
Count: 500
Second: 1
Reader:
Source: static
Properties:
FieldsName:
- "id"
- "api_name"
Rows:
-
- 3
- resume_analysis_edu_to_jobcat
-
- 4
- resume_analysis_expjob_to_jobcat
Writer:
Target: mysql
Path: /idc-mariadb/m027001_f0049_ai
Properties:
Database: m027001
Table: api_update_time
Mode: update
Key:
- id
Transform:
SpecifiedFields:
- New:
Name: update_time
Expression: now()
```