2016q3 Homework1 (phonebook)

contributed by <0140454>

Reviewed bt `finalallpass`

可以嘗試更多的hash function來做比較。

開發環境

Ubuntu 16.04 LTS
Linux 4.4.0-38-generic

在筆電上灌了Ubuntu Linux，為了讓比較基準一致，所以重新檢視先前的進度。

效能分析 - 優化前

./phonebook_orig

size of entry : 136 bytes
execution time of append() : 0.037285 sec
execution time of findName() : 0.005002 sec

make cache-test

 Performance counter stats for './phonebook_orig' (100 runs):

          104,2032      cache-misses #   86.554 % of all cache refs ( +-  0.24% )
          120,3911      cache-references ( +-  0.15% )
       2,6080,5148      instructions #    1.39  insns per cycle ( +-  0.02% )
       1,8753,3207      cycles ( +-  0.31% )

       0.056462097 seconds time elapsed ( +-  0.32% )

透過perf找熱點 (./phonebook_org & sudo perf top -p $!)

   PerfTop:   10440 irqs/sec  kernel:15.3%  exact: 100.0% [5000 cycles:pp],  (target_pid: 4387)
-------------------------------------------------------------------------------

    25.19%  libc-2.23.so      [.] __strcasecmp_l_avx     
    24.03%  phonebook_orig    [.] findName               
     9.70%  phonebook_orig    [.] main                   
     4.49%  libc-2.23.so      [.] _int_malloc            
     4.10%  libc-2.23.so      [.] _IO_fgets              
     3.50%  libc-2.23.so      [.] __memcpy_sse2          
     2.72%  libc-2.23.so      [.] _IO_getline_info       
     2.13%  libc-2.23.so      [.] __strcpy_sse2_unaligned
     1.84%  [unknown]         [k] 0x00007f9ed815602e     
     1.71%  [kernel]          [k] clear_page_c_e         
     1.71%  [kernel]          [.] native_irq_return_iret 
     1.66%  phonebook_orig    [.] append                 
     1.55%  libc-2.23.so      [.] malloc                 
     1.53%  libc-2.23.so      [.] memchr                 
     1.02%  [kernel]          [k] release_pages          
     1.02%  phonebook_orig    [.] strcasecmp@plt         
     1.02%  [kernel]          [k] unmap_page_range       
     0.92%  [kernel]          [k] free_pcppages_bulk     
     0.78%  libc-2.23.so      [.] __strcasecmp_avx       
     0.70%  [kernel]          [k] free_hot_cold_page

「程式輸出的文字訊息」請勿使用圖片表示，請改由文字貼上課程助教

了解！吳勃興

從上面的資料可以發現findName佔了24.03%，append則是佔了1.66%。但我們從執行秒數上面可以發現，append的時間卻是比findName長的。

另外，一開始在執行perf的時候，發現取樣的數量太少，導致append原本沒有出現在perf的列表上，所以用了-c的參數來增加取樣頻率。

Memory Leak

重新觀看main.c後，發現在最後並沒有完整將記憶體釋放。


if (pHead->pNext) free(pHead->pNext);

我們可以透過valgrind來觀看memory leak的程度

==24874== LEAK SUMMARY:
==24874==    definitely lost: 136 bytes in 1 blocks
==24874==    indirectly lost: 46,770,128 bytes in 343,898 blocks
==24874==      possibly lost: 816,000 bytes in 6,000 blocks
==24874==    still reachable: 0 bytes in 0 blocks
==24874==         suppressed: 0 bytes in 0 blocks
==24874== Rerun with --leak-check=full to see details of leaked memory

將程式碼改寫成





while(pHead->pNext) {
    entry *next = pHead->pNext;
    pHead->pNext = next->pNext;
    free(next);
}

valgrind的執行結果顯示沒有memory leak的問題

==24915== All heap blocks were freed -- no leaks are possible

效能分析 - 優化後

方法一：減少結構大小

這邊有參考一下先前學長姐的開發紀錄，另外定義一個結構 __PHONE_BOOK。
在分配 __PHONE_BOOK的時候也會順便分配 __PHONE_BOOK_ENTRY。





typedef struct __PHONE_BOOK {
    char lastName[MAX_LAST_NAME_SIZE];
    struct __PHONE_BOOK_ENTRY *pDetail;
    struct __PHONE_BOOK *pNext;
} entry;

雖然結構的大小縮減為32bytes，但cache miss卻沒有下降，時間反而變得更長。

size of entry : 32 bytes
execution time of append() : 0.047521 sec
execution time of findName() : 0.010678 sec

 Performance counter stats for './phonebook_opt' (100 runs):

          126,9535      cache-misses #   94.437 % of all cache refs ( +-  0.09% )
          134,4318      cache-references ( +-  0.07% )
       3,4116,6636      instructions #    1.23  insns per cycle ( +-  0.01% )
       2,7753,9644      cycles ( +-  0.18% )

       0.084211150 seconds time elapsed ( +-  0.17% )

在想應該是因為在append時做了兩次的malloc。
既然在dictionary內只有lastName的資訊，那append的時候只需要先分配名字的記憶體就好。未來如果有需要，再分配存放詳細資訊的空間。

size of entry : 32 bytes
execution time of append() : 0.030606 sec
execution time of findName() : 0.002104 sec

 Performance counter stats for './phonebook_opt' (100 runs):

           15,5705      cache-misses #   48.712 % of all cache refs ( +-  1.17% )
           31,9643      cache-references ( +-  1.02% )
       2,4467,7584      instructions #    1.90  insns per cycle ( +-  0.02% )
       1,2855,7826      cycles ( +-  0.33% )

       0.038822965 seconds time elapsed ( +-  0.36% )

Image Not Showing Possible Reasons

The image file may be corrupted
The server hosting the image is unavailable
The image path is incorrect
The image format is not supported

Learn More →

方法二：利用hash function

雖然減少結構大小已經有不錯的改善，但假如每一次搜尋都需要從頭找，多少會影響執行效率。
如果可以利用hash function將名字map到數個slot，使得搜尋時只要在特定slot內的尋找，這樣應該也能夠改善效能。

在這邊我使用的是djb2，並將TABLE_SIZE定為32767，也就是原本的


entry *pHead, *e;

變成


entry pHead[TABLE_SIZE], *e[TABLE_SIZE];

findName的時間接近0秒，但append的時間卻增加了0.025秒。可是cache miss卻增加到68%。

透過hash table，只要發生碰撞的機率很低，我們便可以查詢所花的時間。但是，在插入資料的時候，因為必須做額外的運算，所以append的部份有所增長也是預期的結果。

size of entry : 32 bytes
execution time of append() : 0.051574 sec
execution time of findName() : 0.000000 sec

 Performance counter stats for './phonebook_hash' (100 runs):

          102,5574      cache-misses #   68.005 % of all cache refs ( +-  0.56% )
          150,8084      cache-references ( +-  0.09% )
       2,8155,3172      instructions #    0.96  insns per cycle ( +-  0.02% )
       2,9422,2148      cycles ( +-  0.28% )

       0.090163917 seconds time elapsed ( +-  0.28% ) ( +-  0.59% )

Image Not Showing Possible Reasons

The image file may be corrupted
The server hosting the image is unavailable
The image path is incorrect
The image format is not supported

Learn More →

方法三：利用tree structure

利用binary search tree來實作的話，append的時間會增加很多。這是因為dictionary中的word都已經排好序了，使得binary search tree退化成一條線。
因此應該要用其他的樹來進行優化，才可能使執行時間縮短。

字典樹 trie
相較於普通的binary search tree，trie因為不會整個退化成一條線，所以在append的時間少很多，但因為每次都從root開始搜尋且entry大小比較大，所以還是比先前的優化方式慢。

size of entry : 232 bytes
execution time of append() : 0.134898 sec
execution time of findName() : 0.000000 sec

 Performance counter stats for './phonebook_trie' (100 runs):

           96,7428      cache-misses #   67.495 % of all cache refs ( +-  0.26% )
          143,3335      cache-references ( +-  0.19% )
      16,1746,5334      instructions #    1.61  insns per cycle          ( +-  0.01% )
      10,0716,5615      cycles ( +-  0.04% )

       0.299867931 seconds time elapsed ( +-  0.05% )

紅黑樹 red-black tree
因為dictionary的關係，binary search tree會退化，所以append的效率非常的差。還是要使用的話，則應該要使用自平衡樹，這邊是使用紅黑樹，在append的時候因為需要進行旋轉等操作，所以時間會比trie多一點點。

size of entry : 56 bytes
execution time of append() : 0.158551 sec
execution time of findName() : 0.000000 sec

 Performance counter stats for './phonebook_rbtree' (100 runs):

           21,0833      cache-misses #   53.825 % of all cache refs      ( +-  0.47% )
           39,1698      cache-references ( +-  1.25% )
       9,7743,9464      instructions #    1.72  insns per cycle          ( +-  0.00% )
       5,6885,8067      cycles ( +-  0.72% )

       0.169528500 seconds time elapsed ( +-  0.71% )

方法四：利用memory pool

在觀看陳品睿學長Hackpad時有看到老師提到可以將malloc替換成memory pool的方式來做，先分配好一大塊記憶體，等append需要的時候直接取一小塊來用，這樣append的時間也能有所改善，畢竟頻繁的要求/釋放記憶體也會造成系統的負荷。

將memory pool套用至各種版本上之後可看到總時間比原本短。看來對於大量的記憶體操作時，可以優先考慮利用memory pool來取代傳統的malloc，進而提升執行效率。

版本	套用前後時間差異
優化前	-0.01173
小結構	-0.009254
Hash function	-0.007918
字典樹	-0.063499
紅黑樹	-0.014424

演算法和實作的小結

經由上面幾種優化方式後發現，減少結構大小可以使得cache中保存較多資料，因此append與findName的時間都有縮短的趨勢。

而hash與tree則是在findName較有優勢，append因為需要額外處理，所以可能會較原本的慢。再來就是記憶體分配，頻繁的malloc/free對系統來說也是個負擔，善用memory pool可以提升效率。

經過這次作業才知道以前所學科目的重要性，包括計算機組織、作業系統等。因為這次作業也發現自己其實並沒有那麼了解C語言跟硬體方面的東西，這一部份是未來要多加努力來補齊的。

新進度

Dataset 的影響

首先，觀察程式中所使用的words.txt可以發現，已經照字典序排序好，那如果把他打亂 (sort -R words.txt)再測試一次的話，結果是否會有所不同？

打亂後發現，未優化前與前兩種方式所受到的影響比較小。而後面兩種有關樹的方式，均有明顯的不同。原因在於順序的不同，導致在維護樹的時候有所差異。

下一步，應該尋找一個適當的lastname dataset來做測試，因為words.txt中包含了許多沒有名字意義的單字。

Fuzzy Search (Levenshtein Distance)

計算出Levenshtein Distance並列出距離為2的結果。

































int levenshtein(char *src, char *dst)
{
    int i, j;

    int src_len = strlen(src);
    int dst_len = strlen(dst);
    int table[MAX_LAST_NAME_SIZE + 1][MAX_LAST_NAME_SIZE + 1];

    // initialize distance table
    for (i = 0; i <= src_len; ++i) {
        table[i][0] = i;
    }

    for (j = 0; j <= dst_len; ++j) {
        table[0][j] = j;
    }

    // calculate distance
    for (j = 1; j <= dst_len; ++j) {
        for (i = 1; i <= src_len; ++i) {
            table[i][j] = table[i - 1][j - 1];
            if (src[i - 1] != dst[j - 1]) {
                table[i][j] = min(
                    table[i - 1][j] + 1,
                    table[i][j - 1] + 1,
                    table[i - 1][j - 1] + 1
                );
            }
        }
    }

    return table[src_len][dst_len];
}

將原本要搜尋的zyxel改為xyzel的話，結果如下：

aczel (2)
azel (2)
etzel (2)
eyfel (2)
ezel (2)
fazel (2)
gazel (2)
gyel (2)
gymel (2)
gysel (2)
hazel (2)
hywel (2)
kydel (2)
kyzer (2)
kyzyl (2)
lyel (2)
mazel (2)
mykel (2)
mysel (2)
nyel (2)
oxymel (2)
pyhel (2)
rozel (2)
sydel (2)
vozel (2)
xdel (2)
xurel (2)
xylem (2)
xylol (2)
xylyl (2)
xyzzy (2)
yeel (2)
yoel (2)
zyxel (2)

原先想計算相似度，然後返回相似度最大的entry。但是從上面的結果可以看到，假如搜尋xyzel的話，aczel跟zyxel的相似度會是一樣的，所以不如將可能的清單列出，讓使用者判斷，或是增加其他搜尋條件來精確地過濾。