2020_OS_Fall_HW2: ETL process

tags: `os-2020`

作業說明

link

作業程式碼

GitHub

學生資料

學號：P76091129
姓名：王子源
系級：資工所碩一

說明文件

開發環境：

OS: Ubuntu 16.04.1
CPU: Intel® Core™ i5-8500 CPU @ 3.00GHz × 6
Ｍemory: 8GB
Programming Language(version): c++

執⾏時間

input size：1004 Mb
秒數：77.2413s
thread 數⽬： 6
lines_per_thread ： 50

程式開發與使用說明：

程式開發概念
此次作業是希望能夠透過多執⾏緒，⽬的希望將 csv 檔案轉成 json 檔案。在讀檔時，能夠⾃⾏選定要多少條 thread 同時執⾏，每條 thread 會負責處理轉換特定幾⾏字串後，再換下⼀條 thread。

在寫完後才看到作業要求裡⾯提到 Stage 最少會包含讀檔、資料處理、輸出三個階段。⾃⼰這邊的處理並不是如作業要求的⽅式，⽽是這三個階段不斷重複性地做。會這樣寫是想說試著減少執⾏時間，不希望將⼤筆資料載入後⼜從頭做資料處理，做完⼜做⼀次輸出結果。
程式架構
Image Not Showing Possible Reasons
- The image file may be corrupted
- The server hosting the image is unavailable
- The image path is incorrect
- The image format is not supported
Learn More →
在寫 HW1、HW2 的作業都有⼀個感想，就是必須去處理⼀些特殊的情況，尤其是在讀檔最後⼀ run 時。以這次作業來講，我希望每條 thread 可以負責轉換特定⾏數（不同⾏數會影響到執⾏時間，在下⾯效能分析有另外討論），並且在執⾏完所有 thread 後，依序等待每個 thread 的結果，此時就會依序將結果輸出。會在⼀開始建立需要⽤到的 thread 以及每個 thread 需要的參數
```
typedef struct{
    char** line;
    char* result;
    size_t limit_set;
} Param;

Param paramm_obj[NUM_THREADS];
pthread_t thread[NUM_THREADS];
```
接著使⽤ pthread api 產⽣新的 thread
```
pthread_create(&thread[t], &attr, threaed_task, (void*) &param_obj[t]);
```

這裡有兩個部分需要注意

若是要傳參數給 thread，只能透過(void *)的⽅式，所以若是要傳多個參數的話，需要先透過 struct 定義⼀個新的物件
這個是開發過程碰到的盲點，⼀開始並不是使⽤陣列去存取變數，⽽是多個 thread 共⽤⼀個變數，為何⼀定要使⽤陣列分開呢？這是因為如果產⽣⼀個 thread A 並且將參數 assign 之後，thread A 並沒有立刻執⾏，⽽是改變了參數後⼜ assign 給 threadB，此時改過的參數會影響到 thread A(因為是 call-by-reference)，進⽽導致資料錯誤。

透過上述 API 產出新的 thread 之後，就要去執⾏每個 thread 需要做的 task，定義如下







































void str_to_json(char** result, char* line){
    size_t ln = strlen(line) - 1;
    if (*line && line[ln] == '\n') 
        line[ln] = '\0';

    int index = 1;
    char* token = strtok(line, "|"); 
    
    while (token != NULL) { 
        if(index != 20){
            char tmp[64];
            sprintf(tmp, "\"col_%d\":\"%s\",\n", index++, token);
            strcat(*result, tmp); 
        }
        else{
            char tmp[64];
            sprintf(tmp, "\"col_%d\":\"%s\"\n", index++, token); 
            strcat(*result, tmp); 
        }
        token = strtok(NULL, "|"); 
    } 
    assert(index == 21);
}

void *threaded_task(void *param) {
    mu.lock();
    Param* p = (Param*) param;

    // 20 lines * 64 = 1 set
    p->result = (char*) malloc(p->limit_set * 20 * 64);
    for(size_t i=0;i<p->limit_set;i++){
        strcat(p->result, "{\n");
        str_to_json(&p->result, p->line[i]);
        strcat(p->result, "},");
    }
    mu.unlock();

    pthread_exit((void *)param);
}

因為希望當⼀個 thread 在處理轉換的過程時，不希望被其他的 thread ⼲擾，所以這部分需要使⽤ mutex 來產⽣⼀個 critical section，⽽ str_to_json 就只是基本字串處理。
當產的 thread 數量已經達到限制的數字時，就要將結果輸出成檔案。因為 multi-thread 的特性就是 non-synchronize，若不額外做處理的話，就會依據哪條 thread 做完就輸出哪條thread，這會導致輸出的結果沒有順序性。所以需要透過 pthread_join 的⽅式，讓所有的 thread 有順序的結束並將結果輸出，程式碼如下











for (t = 0; t < thread_in_use; t++) {
    rc = pthread_join(thread[t], &status);
    if (rc) {
        printf("ERROR; return code from pthread_join() is %d\n", rc);
        exit(-1);
    }
    fprintf(out_fp, "%s", ((Param*)status)->result);
    for(size_t j=0;j<param_obj[t].limit_set;j++){
        free(param_obj[t].line[j]);
    }
}

先做的 thread 先輸出，符合 FIFO 原則的話，這樣的結果就會是有順序性的了。重複以上的步驟直到結束時，基本上就能處理 99% 左右的資料了，剩下的就是⼀些特殊的case 需要額外做處理。以我的程式碼設計，唯有 thread 產出達到指定的數⽬時，才會執⾏ pthread_join 並且輸出，那假如在最後⼀回合沒有達到指定數⽬呢？這樣會有⼀些thread 被產⽣，但沒有將他們回收，所以需要在讀完檔離開 while 迴圈後，加了⼀段


if(thread_in_use != NUM_THREADS)

來進⾏檢查。

另外，還需要檢查另⼀個 case，每條 thread 都有負責的 lines_per_thread 列數需要處理，如果最後⼀回合沒有達成的話，那也需要在 while 迴圈後另外處理。















if(!new_obj){

    param_obj[t].limit_set = limit_ctr;
    rc = pthread_create(&thread[t], &attr, threaded_task, (void *)&param_obj[t]);
    if (rc) {
        printf("ERROR: return code from pthread_create() is %d\n", rc);
        exit(-1);
    }
    rc = pthread_join(thread[t], &status);
    if (rc) {
        printf("ERROR; return code from pthread_join() is %d\n", rc);
        exit(-1);
    }
    fprintf(out_fp, "%s", ((Param*)status)->result);
}

效能分析

測試資料

這邊測試使用了 thread 數目為 1, 5, 10，另外還有調整了 lines_per_thread 的大小為 1, 5, 10, 50, 100, 500，lines_per_thread 大小是控制每條 thread 應該處理多少條 line，為求測試速度效率，使用了約 52M 小筆資料當實驗 input file。
量測結果

上圖使用 gnuplot 量測工具幫忙繪圖，並且以測試時間當作 y 軸來探討，而做完實驗後覺得可以整理出以下 2 件觀察
1. 越多 thread 越好嗎？
  
  我想以時間的結果來看則是未必，可以看出其實最省時的是只有 1 條 thread 的情況去執行此程式，其中可能會有很多的因素，而自己認為就是因為越多的 thread 會導致 OS 管理上的負擔，使得越多 thread 反而時間花越久
2. lines_per_thread 越大越好嗎？
  
  這個 lines_per_thread 是指 thread 負責要處理多少列的資料，從圖中可以看出個有趣的部分，是當 lines_per_thread 在大約 50 時，會是時間花最少的。
  如果 lines_per_thread 越少，表示每條 thread 負責少量的資料，那將會重複更多回合在上面的步驟，產生 thread 的次數也將會變多次，所花時間越多。而 lines_per_thread 越大表示每個 thread 要負責的資料量越多，我在猜想會導致時間越多的原因，有可能是因為在 pthread_join 時要等待其他 thread 做完的時間變久了，而導致要花的時間也變多。
此外，這次還有使用一個第三方程式 KSar 來協助更詳細的監控作業系統，以下附上數據結果
- CPU usage
  
  以下這張圖是多筆測資綜合起來一起看的，圖中每個高峰的順序分別是 thread 數目為 1, 5, 10，另外還有調整了 lines_per_thread 的大小為 1, 5, 10, 50, 100, 500(每個 thread 由左到右)
從結果可以觀察出下列幾件事情：
1. 多執行緒的 CPU usage 普遍比 1 條 thread 的使用量還高
2. 當 lines_per_thread 提高時，同時也降低了不斷重複產生新的 thread 的次數，所以紅色的部分（%sys）看起來有明顯變少
  此部分也可以從 I/O 的 TPS (the number of transfers per second) 觀察出相同的結果
- process
  下圖指出了 Task creation 以及 Context switch
  
  可以看到當 lines_per_thread 變大時，兩個觀察項目的縮減相當明顯，也可以看出程式在 lines_per_thread 很小時，會發生多次的 context switch 以及 task creation ，所以才會使得上面量測結果的圖，在起初的 lines_per_thread 時間會花那麼多的原因。
  
  另外，也蠻好奇不確定是因為自己電腦是 6 核心的，所以在 10 threads 表現上並沒有比 5 threads 還高，還是其他原因沒有發現的。
  
  更多未提到的分析將會在「結果分析.pdf」呈現，裡頭尚包含了每個 CPU 的使用狀況、Kmem 以及 Swap 使用情形。

2020_OS_Fall_HW2: ETL process

tags: os-2020

作業說明

作業程式碼

學生資料

說明文件

開發環境：

執⾏時間

程式開發與使用說明：

效能分析

Read more

Untitled

深度學習系統與平行計算技術

ML 學習

第06章｜穿什麼有這麼重要？-裝飾模式（Decorator）

tags: `os-2020`