2020_OS_Fall_HW1: Benchmark Your Computer Black Box

# 2020_OS_Fall_HW1: Benchmark Your Computer Black Box ###### tags: `os-2020` ## 作業說明 - [link](https://hackmd.io/@dsclab/r1viIcBSw) ## 作業程式碼 - [GitHub](https://github.com/Jonec76/os-hw1) ## 學生資料學號：P76091129 姓名：王子源系級：資工所碩一 ## 說明文件 ### 開發環境： - OS: Ubuntu 16.04.1 - CPU: Intel® Core™ i5-8500 CPU @ 3.00GHz × 6 - Ｍemory: 8GB - Programming Language(version): c++ ### 程式執行時間： - input size: 18.122012554 G - numbers of sorted file: 100 - each sorted file size: 181.220125 MB 簡單測試了以下幾個步驟的時間： - merge stage: 2643.457031 s - external stage: 2401.927246 s - total time: 5045.384766 s ### 程式開發與使用說明： - 程式開發 - 讀取資料因為一個大的檔案，無法直接載入至記憶體裡頭，必須分批處理。經過實驗發現，若是一次讀一行值進來 (`getline`)，會導致 io 太多次而花很多時間在上面。於是使用 `fread` 的 API，根據 [man page-fread](https://man7.org/linux/man-pages/man3/fread.3.html) 描述 > size_t fread(void *ptr, size_t size, size_t nmemb, FILE *stream); > > The function fread() reads nmemb items of data, each size bytes long, from the stream pointed to by stream, storing them at the location given by ptr. 可以知道使用 `fread` 能夠一次讀 nmemb*size bytes 的資料，如此一來便能夠使讀取時，不需要頻繁的讀取 file。以這隻程式來說，是直接給定讀的 chunk size 就是 2 億 bytes ，並且另外再做切斷字元的處理，處理那些剛好切到一半的資料。補充：在讀到最後一個批次時，有可能會不滿 chunk size，此時可以透過 `fread` 的 return value 做判斷 > On success, fread() and fwrite() return the number of items read or written. 以避免發生 segment fault 的狀況。 - 將原資料切分，並排序（merge-sort）一個龐大的數據不可能一次載到記憶體，並且對他做排序。這邊的方法是，將原本的資料，分割成數個檔案，概念就是。若以 Divide & conquer 來看待大數據牌序問題，那此階段就是 "Divide" 階段。另外，也需要注意要切成多少個檔案。嘗試做 1 億個檔案，每個檔案 10 筆資料的方法，但結果是失敗的。程式裡頭使用 `FULL_FILE_SIZE` 做調整 ```c=1 FULL_FILE_SIZE = sz / 10000; ``` `sz` 是整個檔案的大小，後面的 `10000` 就是希望程式產生出來的檔案個數在 10000 之內，計算完後即可得知每個檔案應該包含多少的資料。在排序的部分使用了 merge-sort，自己寫過一次後又參考了 [geek](https://www.geeksforgeeks.org/merge-sort/) 網站。其概念就是先用遞迴方式（Divide），將一個陣列切一半變成兩個子陣列，切到當終止條件成立時，開始做排序並且合併的方式 (Conquer)。當此步驟完成後，將會產生數個 "sorted{index}.txt" 檔案，這些檔案的內容都是已經排序好的資料，接著只需要將這些資料合併起來就可以了。 - 合併 sorted 資料在這步驟需要使用到 `external sort`，在 [geek-external sort](https://www.geeksforgeeks.org/external-sorting/) 裡頭提到 > External sorting is required when the data being sorted do not fit into the main memory of a computing device (usually RAM) and instead, they must reside in the slower external memory (usually a hard drive). 整個的概念，是使用 min heap 來做存取，在程式裡頭設計了以下結構 ```c=1 struct Node { long long ID; long long int value; }; ``` 一個 Node 代表著在 heap 裡頭的一個節點，`ID` 是第幾個 sorted file 檔案，`value` 是該檔案目前讀到的值。起初創立 min-heap 時，是將所有檔案的第一個值插入 heap，做 `insert` 的同時就會透過 `heapify` 調整位置，使得最小的值在最上頭。當執行 `extrac-min` 後，會得到整個 heap 裡頭最小的那個值，此時會將這個 value 寫入 output file/tmp file 其中一個，端看此回合是否是最後一回合。接著再打開此 Node 紀錄的檔案，讀取該檔案的下一筆資料並且再次 insert 進去 heap。這邊處理步驟挺繁瑣的，在使用 `fstream.open` 時，原以為可以將所有 sorted 檔案都開起來，但是經過實驗發現最多只能開啟約 4000 筆檔案，也就是說，若程式想要開始超過此數字的檔案時，將讀不到任何資料（這部分沒有深入研究，但猜測是超過此 API 限制）。於是設計一個處理機制，若是要讀超過 4000 筆檔案時，會分批次處理，先將前面的 4000 筆檔案合併後產出一個 temp sorted 檔案，下一次執行時再將此 temp sorted 檔案加入 min-heap 中。 ## 效能分析 ## Disk io Input file 17g，使用 `iotop` 進行分析。根據 [man](http://manpages.ubuntu.com/manpages/xenial/man8/iotop.8.html) 裡頭所描述的，透過 `iotop` 可以偵測出 `actual` 以及 `total` 兩種形式的資訊 > Total DISK READ and Total DISK WRITE values represent total read and write bandwidth between processes and kernel threads on the one side and kernel block device subsystem on the other. > Actual DISK READ and Actual DISK WRITE values represent corresponding bandwidths for actual disk I/O between kernel block device subsystem and underlying hardware (HDD, SSD, etc.). `Total` 偵測出來的，是 process 到 device (kernel thread) 此段過程，`Actual` 偵測 `device` 和 `hardware` 此段。下面透過 [gnuplot](http://www.gnuplot.info/) 做出比較，x 軸代表時間（總共跑 5425 秒），而 y 軸代表了 k/s，也就是該秒傳輸了多少 k bytes 的 data - total write/read ![](https://i.imgur.com/VOPym3I.png) - actual read/write ![](https://i.imgur.com/PZZOTO7.png) - actual/total write ![](https://i.imgur.com/SSGdfk3.png) 會導致在 3000 秒之前有多次的高峰 read，主因是因為在設計程式的過程，自己是透過每次從 input.txt 讀入一個很大的 chunk size，等到這筆讀完的 chunk size 都排序完後，會再讀取下一個 chunk size 的資料進入。 total write 在 3000 秒之前，都只是輸出切分好的 sorted data，而到了 3000 秒之後，會有幾次的高峰原因應為將 sorted data 執行 external sort 後，會產生 output 或者 tmp output 檔案。 ## CPU time 這邊為了實驗快速，input size 皆使用 1G 做實驗。在這想要實驗的是，調整 sorted file 的數量，是否會影響到整體的執行時間呢？因為知道 OS 在做 I/O 時會耗許多時間，那要是多一點的 sorted file 會影響多少呢？以下是原始實驗結果 - sorted file: 10 個 merge: 197.157959 s external: 149.213715 s total: 346.371735 s - sorted file: 100 個 merge: 192.318405 external: 160.868225 total: 353.186676 - sorted file: 1000 個 merge: 190.202118 s external: 179.493271 s total: 369.695435 s ### 圖示比較 - merge stage ![](https://i.imgur.com/hgLQ1rB.png) - external stage ![](https://i.imgur.com/39YaBYZ.png) - total time ![](https://i.imgur.com/9R8DTXl.png) - Disk I/O ![](https://i.imgur.com/SizLVMq.png) 可以從實驗結果看出來，當 sorted file 的數目越多時，在 merge 階段所要花的時間就越短，因為當 sorted file 越多，表示每個 file 裡頭的資料越少，此時所花的 merge 時間就越短。而因為 sorted file 越多，在做 external sort 時所要放入 heap 的 node 就越多，且開檔讀檔的次數就會更加的頻繁（可以在 Disk I/O 看出），因為做 I/O 會佔用相當多的時間，所以在 external stage 會因為要處理的 sorted file 變多而耗時。我想這邊蠻像一個 tradeoff，但是不太平衡。因為 merge-sort 所節省得時間，遠小於 external-sort 所花得時間。若是要使得執行時間越短的話，那 sorted file 應該越少越好。 ## Memory 這部分嘗試使用 [valgrind-massif](https://valgrind.org/docs/manual/ms-manual.html) 進行檢測，但是因為此套件會使得程式執行時間變慢很多，所以這邊只使用小檔案 5.4 M 進行測試 - 5.4 M main function ![](https://i.imgur.com/h33t4ht.png) - 5.4 M heap ![](https://i.imgur.com/Pk7jma8.png) - 5.4 M ![](https://i.imgur.com/UbdSp5v.png) - 1G ![](https://i.imgur.com/lge09y2.png) 可以看到該程式所佔用的記憶體大約在 200 MB 左右，而當處理大資料（1G）時，除了原始的 main 佔用記憶體之外，還可以看到 `M` (Merge-sort) 佔用記憶體的部分。 ## 執行若干支程式這個檢測目的是希望測出，當同時開啟許多程式進行時，OS 會如何反應？目前想到的方法是觀察 context switch 的變化，於是參考了 `grep ctx /proc/{pid}/status` 的方法，來抓出每個 process 的 context switch 狀況。在抓取 ctxt 時，能夠得到 `voluntary` 以及 `involuntary` 兩種不同的 context swtich 資訊。參考 [slides](https://www2.cs.duke.edu/courses/spring01/cps110/slides/interleave/sld008.htm) 知道，`involuntary` 表示 OS 暫停了目前的程式，並且 switch 到其他的 process; `voluntary` 表示自己不需要 CPU，主動讓出給其他的 process。下面設計的實驗為，同時跑 3 個 process，以及只跑 1 個 process 的比較。在 involuntary 中 p2 會那麼高，我認為是因為 p1 p3 執行後，跟 p2 搶 CPU 資源，導致 P2 被 involuntary context switch。 - involuntary ![](https://i.imgur.com/toUhYwO.png) - voluntary ![](https://i.imgur.com/lx0D8ta.png)