2020_OS_Fall_HW3: Key-Value Stroages

# 2020_OS_Fall_HW3: Key-Value Stroages ###### tags: `os-2020` ## 作業說明 - [link](https://hackmd.io/@dsclab/rJDIP8YYv) ## 作業程式碼 - [GitHub](https://github.com/Jonec76/os-hw3) - 程式執行： 1. `$ make` 2. `$ ./main 1.input` ## 學生資料學號：P76091129 姓名：王子源系級：資工所碩一 ## 說明文件 ### 開發環境： - OS: Ubuntu 16.04.1 - CPU: Intel® Core™ i5-8500 CPU @ 3.00GHz × 6 - Ｍemory: 8GB - Programming Language(version): c++ ### 程式開發與使用說明： - 程式開發概念此次作業希望開發一隻能夠存取 key-value 資料的程式碼，自己原先想嘗試使用 b+ tree 來完成此作業，但是資料超過 RAM 的部分不知道如何處理，所以最後使用較簡單的方法，就是透過 `map` 的一塊 memory pool。當 input 進來的資料超過 pool 的上限時，就將 memory pool 的一部分存入 disk 裡面，如此一來就不會超過記憶體的上限了。而存入 `storage` 的部分，跟其他同學討論之後，是決定將原本的 key 拆成 4 個部分，然後建立前 3 個部分的資料夾，並將最後的部分當成檔名存入，此設計是希望能夠減少跟其他檔案的比較次數，以試著降低程式執行時間。 - 程式架構 ![](https://i.imgur.com/O0cOjOG.png) 主程式架構如上，其實就是依據不同的 operation 做不同的操作。下面會針對 `store_data` 以及 `get_data` 做細部分析。 - **store_data** ```c=1 void store_data(map<unsigned long long, string>* pool, unsigned long long base_capacity){ auto it = pool->begin(); unsigned long long idx = 0; while (it != pool->end() && idx < base_capacity) { unsigned long long k = it->first; string value = it->second; unsigned long long mask = MASK; string output_data_name = storage_path; for (int i = 0; i < 3; i++) { output_data_name += "/" + to_string(k & mask); const char *f = output_data_name.c_str(); DIR *dir = opendir(f); if (!dir) { if (mkdir(f, 0777) != 0) { cout << "Can't create the folder. " << endl; } } mask >>= 16; } output_data_name += "/" + to_string(k & mask) + ".txt"; ofstream output_data; output_data.open(output_data_name); output_data << value << endl; output_data.close(); it++; idx++; pool->erase(k); } } ``` 作法就是將一個 mask（0xFFFF）做右移，剛好可以將 64bit 的數字切成 4 塊，因為每一塊都會拿來創立成一個新的資料夾，所以在創立前需要先檢查該資料夾是否已經存在，此 function 是將滿了的 pool 存入 hard disk ，存到達到 `base_capacity` 就停止，這邊自己是直接給定 `limit=1000`, `base_capacity=500`。 - **get_data** ```c=1 void get_data(unsigned long long key, map<unsigned long long, string> pool, char* output_file_path) { ofstream output_result; output_result.open(output_file_path, ios::out | ios::app ); char *fn = (char *)malloc(128 * sizeof(char)); map<unsigned long long, string>::iterator it = pool.find(key); if (it == pool.end()) { if (DirectoryExists(key, &fn)) { string value; ifstream file; file.open(fn); file >> value; output_result << value << "\n"; file.close(); } else { output_result << "EMPTY\n"; } } else { output_result << it->second << endl; } free(fn); output_result.close(); } ``` 在 `get_data` 時會有 data coherence 的情況要考慮，也就是相同的 key 在 pool 裡頭的值跟 disk 裡頭的值並不同。而這邊的做法是以 memory 裡頭的值為最新的值，所以如果現在的 key 能夠在 pool 裡面找到，則回傳在 pool 裡頭所對應到的值，若不行的話再到 hard disk 裡頭找。 ### 效能分析 - 測試資料使用 1g 的檔案，因為沒有正確答案所以只有在小檔案裡確定都對，在大檔案比較注重在硬體設備的使用效能。測試自己的程式碼`limit_capacity` 以及 `base_capacity` ，前者代表 pool 的上限大小，後者代表當達到上限時，要輸出到 disk 之後所要降到的 pool size，下方是實驗結果 - 實驗 1: **limit_capacity=1000000, base_capacity=5** time ： real 2m34.056s user 0m56.999s sys 1m0.308s 1-1 ![](https://i.imgur.com/nPTpYnl.png) 1-2 ![](https://i.imgur.com/vWfqUx2.png) 1-3 ![](https://i.imgur.com/temOnlJ.png) 1-4 ![](https://i.imgur.com/aYotWux.png) - 實驗 2: **limit_capacity=1000000, base_capacity=500000** real 2m34.767s user 0m56.905s sys 1m0.469s 2-1 ![](https://i.imgur.com/AbR4uQL.png) 2-2 ![](https://i.imgur.com/ATQSAD5.png) 2-3 ![](https://i.imgur.com/Ll6kSQC.png) 2-4 ![](https://i.imgur.com/1UbgzXr.png) 從上方的實驗可以觀察出，即使 input 檔案有 1G 的 `PUT` 功能，但是在記憶體使用上維持在特定數值的水平，並沒有因為不斷的 `PUT` 而增加。另外，因為實驗 1 的 memory size 變化會很大，當達到 limit 時，會將整個 pool 幾乎清空（清到只剩下 5 筆）。畫紅虛線的部分就是已經達到 pool limit，正在存到 hard disk。可以比較 1-4 與 2-4 的 memory usage，可以發現在 1-4 畫紅虛線的部分的記憶體使用量比較少，就是因為移入較多的資料進入 hard disk。但也因為在實驗 1 移入較多的資料進去 hard disk，所以在 page fault (1-2, 2-2) 的比較上，實驗 1 的 page fault 會多出很多。