2018q3 Homework3 (dict)

# 2018q3 Homework3 (dict) contributed by < danielian1121 > ## bug ### 逗號 ```bash sukamo:~/dict$ ./test_cpy ternary_tree, loaded 259112 words in 0.112772 sec Commands: a add word to the tree f find word in tree s search words matching prefix d delete word from the tree q quit, freeing all data choice: ``` 尋找 Taipei ，會顯示沒有找到 ```bash choice: f find word in tree: Taipei Taipei not found. ``` 但若是使用 s 來作搜尋 ```bash choice: s find words matching prefix (at least 1 char): Taipei Taipei - searched prefix in 0.000002 sec suggest[0] : Taipei, ``` 很明顯發現 Taipei 後面多了"," 原始輸入資料為以下 ``` ... Taipei, Taiwan Kinshasa, Democratic Republic of the Congo Lima, Peru Cairo, Egypt ... ``` 程式中是使用 `fscanf(fp, "%s", word)` 來讀入資料，而 `%s` 遇到空白、 TAB 或者是換行才會停止，因此會把資料中間的逗號給讀入修改一下程式碼 ```clike while ((rtn = fscanf(fp, "%s", word)) != EOF) { size_t length = strlen(word); if (word[length-1] == ',') word[length-1] = '\0'; ... ``` 這樣結果便正確了 ```bash choice: f find word in tree: Taipei found Taipei in 0.000002 sec. ``` ### bench.c 中的 buffer overflow `$ ./test_cpy --bench` 這指令可以將所有輸入的 cities 取前3字元( Taipei -> Tai )做 search words matching prefix 並紀錄下所耗時的時間程式中使用 `strncpy(prefix, word, 3)` 取出前3字元，然而prefix的宣告卻是 `char prefix[3] = ""` ，這樣看似沒有問題，但卻會略結尾 `\0` 的存在，因此修改為 `char prefix[4] = ""` 結果才會正確 ### test_ref.c 程式碼補齊 `test_ref.c` 少了 `--bench` 功能以及輸出 `ref.txt` ```clike if (argc == 2 && strcmp(argv[1], "--bench") == 0) { int stat = bench_test(root, BENCH_TEST_FILE, LMAX); tst_free_all(root); return stat; } FILE *output; output = fopen("cpy.txt", "a"); if (output != NULL) { fprintf(output, "%.6f\n", t2 - t1); fclose(output); } else printf("open file error\n"); ``` ## gnuplot 觀察一下專案裡的 .gp 檔，發現 `runtime3` 這檔案是針對 test_cpy 而做，而裡面的 `bench_cpy.txt` 則是由 `$ ./test_cpy --bench` 所產生先來試試看會產生什麼圖 ```bash $ ./test_cpy --bench ternary_tree, loaded 259112 words in 0.127841 sec $ gnuplot scripts/runtime3.gp $ eog runtime3.png ``` ![](https://i.imgur.com/UYuL1X4.png) 此為每個 city 前3位元做 search words matching prefix 所需要的時間 ## 效能評比 :::info 提問：作業要求上寫到 ==注意：這兩者都該加入 Bloom filter，而且分析具體效益 (原有程式存在若干缺陷，也該指出並修正)== 但 CPY 是用 TST 方式存取資料，而 REF 則是用 Bloom filter ，我的理解是要比較兩者之間的效能和優缺點這樣的想法和作業要求矛盾，是我理解錯誤了嗎? ::: 想要先從 add 和 find 來比較兩者 * 比較兩者存入預設資料所需的時間和記憶體大小 * 比較兩者搜尋原本資料所需要的時間 * (額外作業) Bloom filter 理論來說會有誤差，可以驗證真實誤差值是否接近理想誤差值 ### add 實驗設計 * `Makefile` ```clike addTest: $(TESTS) echo 3 | sudo tee /proc/sys/vm/drop_caches; perf stat --repeat 100 \ -e cache-misses,cache-references,instructions,cycles \ ./test_cpy --bench q perf stat --repeat 100 \ -e cache-misses,cache-references,instructions,cycles \ ./test_ref --bench q gnuplot scripts/runtimeAdd.gp eog runtimeAdd.png ``` * `runtimeAdd.gp` ```clike reset set xlabel 'order' set ylabel 'time(microsec)' set title 'perfomance comparison(add)' set term png enhanced font 'Verdana,10' set output 'runtimeAdd.png' set format x "%10.0f" set xtic 10 set xtics rotate by 45 right plot [:100][:]'cpy.txt' title 'cpy',\ 'ref.txt' title 'ref' ``` * 結果圖 ![](https://i.imgur.com/QELfv0l.png)