2017q3 Homework1 (phonebook)

--- tags: sysprog2017 --- # 2017q3 Homework1 (phonebook) contributed by <`Yuessiah`> ###### 本文件多處地方採用超連結來補完報告，建議點開來看 ## 開發環境 ``` eeeeeeeeeeeeeeeee eeeeeeeeeeeeeeeeeeeeeee eeeee eeeeeeeeeeee eeeee OS: elementary os elementary OS 0.3.2 freya eeee eeeee eee eeee Kernel: x86_64 Linux 4.4.0-93-generic eeee eeee eee eeee Shell: bash 4.3.11 eee eee eee eee Virtualization: VT-x eee eee eee eee CPU: Intel Core i7-7700 CPU @ 4.2GHz ee eee eeee eeee CPU(s): 8 ee eee eeeee eeeeee ee eee eeeee eeeee ee RAM: 1691MiB / 15934MiB eee eeee eeeeee eeeee eee L1d cache: 32K eee eeeeeeeeee eeeeee eee L1i cache: 32K eeeeeeeeeeeeeeeeeeeeeeee eeeee L2 cache: 256K eeeeeeee eeeeeeeeeeee eeee L3 cache: 8192K eeeee eeeee eeeeeee eeeeeee BogoMIPS: 7200.65 eeeeeeeeeeeeeeeee ``` ## 優化策略本節只會提及部份效能改進策略, 詳細直接看 [github](https://github.com/Yuessiah/phonebook) 或提供的參考資料 ### code 部份能用內建函式取代就用例如： ```clike /* before */ while (fgets(line, sizeof(line), fp)) { while (line[i] != '\0') i++; line[i - 1] = '\0'; i = 0; e = append(line, e); } /* after */ while (fgets(line, sizeof(line), fp)) { line[strlen(line)-1] = '\0'; e = append(line, e); } ``` ### 使用體積較小的 struct[^1] 將原本詳細的資料結構改用體積較小的結構，當有需要的時候才將詳細資料引進記憶體中。這樣能使 L1 cache 儲存更多筆資料 ### djb2[^2] 將原本 findName() 只能用循序查詢的結構改成隨機存取的結構，能讓查詢時間降為常數時間另外也利用 hash table[^3] 節省預設空間的使用 >hash table size 如何訂? >[name=課程助教][color=red] >>目前是直接沿用 programming small 那份 hackmd 上的，有空會再說明多大比較適合。感謝叮嚀 >>[name=Yuessiah][color=blue] ### 效能差異本實驗使用 ```words.txt``` 字典輸入五次，分別為 ```{"zyxel", "zyrian", "zymophoric", "zymophyte", "zymoplastic"}``` ``` Performance counter stats for './phonebook_orig' (100 runs): 6,347,936 cache-misses # 88.793 % of all cache refs ( +- 0.27% ) 7,149,104 cache-references ( +- 0.11% ) 287,173,261 instructions # 1.38 insns per cycle ( +- 0.03% ) 207,684,016 cycles ( +- 0.25% ) 0.051437951 seconds time elapsed ( +- 0.65% ) --- Performance counter stats for './phonebook_opt' (100 runs): 6,222,490 cache-misses # 75.992 % of all cache refs ( +- 0.74% ) 8,188,303 cache-references ( +- 0.73% ) 430,978,995 instructions # 1.21 insns per cycle ( +- 0.03% ) 357,578,775 cycles ( +- 0.21% ) 0.087407767 seconds time elapsed ( +- 0.23% ) ``` ![](https://i.imgur.com/XNqOLKm.png) 犧牲 append() 效能增進 findName() 的效能是很經濟的改進，因為 append() 的需求一定比 findName() 還來得低。若不是那這個人名就沒有登錄的必要了XD ## 實作遇到的問題本節提及的問題都已理解並解決。 - 原本 hash function 我是回傳 int 型態，導致一直碰上 segmentation fault - opt.txt 要手動清除，不然舊版本的資料會跟新版本的混在一起 ## 議題 ==探討本例選用的 dataset== ### 有代表性嗎？跟真實英文姓氏差異大嗎？ ### 資料難道「數大就是美」嗎？不，資料如果沒有鑑別性或是真實性的話，有礙於汲取這些資料的程式做出正確的判定或是整理出有用的"資訊"。應該說「質大就是美」，資料的品質才是最重要的。 > 企業真正要尋找的是非傳統的、而且未曾被挖掘過的資料，並且從這些資料中去提煉出價值，這才是對大數據應有的正確認知，而非只是執著於資料大小，只要能從看似毫無意義的數據礦坑中挖掘出金礦，有誰會在意那座礦坑原本是大得像座山還是小得像狗屋呢？[^4] ### 如果我們要建立符合真實電話簿情境的輸入，該怎麼作？ - 寫一些規則使得輸入保證正確，例如電話號碼不可能出現中文字。 - 應用模糊理論[^5]使真實情況輸入常發生的一些拼寫錯誤有個更寬鬆的判定。 [^1]: [Hackmd/ Programming Small](https://hackmd.io/s/SkfffA-jZ) [^2]: [cse.yorku.ca/ Hash Functions](http://www.cse.yorku.ca/~oz/hash.html) [^3]: [Hackmd/ Hash Function](https://hackmd.io/s/HJln3jU_e) [^4]: [數位時代/ 一次搞懂大數據](https://www.bnext.com.tw/article/35807/bn-2015-03-31-151014-36) [^5]: [數學傳播/ 模糊理論簡介](http://web.math.sinica.edu.tw/math_media/d181/18102.pdf)