quiz6B - hash table

--- tags: Linux --- # quiz6B - hash table contributed by < [`Chialiang86`](https://github.com/Chialiang86) > > [GitHub](https://github.com/Chialiang86/Hash-Table-C) ###### tags: `linux2021` > [作業說明 - hash table](https://hackmd.io/@sysprog/linux2021-quiz6) > [2021期末專題說明 - hash table](https://hackmd.io/@sysprog/linux2021-projects#quiz6B---hash-table) ## Todo - [x] 整理其他學員資料 - [x] 看 [hash.h](https://github.com/torvalds/linux/blob/master/tools/include/linux/hash.h), [hashtable.h](https://github.com/torvalds/linux/blob/master/include/linux/hashtable.h) - [x] 看文章 [How does the kernel implements Hashtables?](https://kernelnewbies.org/FAQ/Hashtables) - [ ] 改寫更有效率的作法 - [x] 加上新功能(delete, collision count) - [x] 嘗試別的 hash funciton - [ ] 改善效能 - [ ] 參考 linux RCU 機制 - [ ] Hash table 應用 - [ ] sentence to NLP vector by hashing - [ ] Process management - [ ] Scheduler ## Hash Table 程式碼運作原理 ### 資料結構 ```c typedef struct { int bits; struct hlist_head *ht; } map_t; struct hlist_node { struct hlist_node *next, **pprev; }; struct hlist_head { struct hlist_node *first; }; struct hash_key { int key; void *data; struct hlist_node node; }; ``` `hash table` 主要是由一個 `hlist_head` 的動態陣列所構成，每個 entry 指向一個由 `struct hlist_node` 構成的非環狀 doubly linked list ，hash table 長度依照 `bits` 給定，可知大小為 2 的冪。 :::warning power of 2 的翻譯是「2 的冪」，而非「冪次方」 :notes: jserv ::: 而可以發現 `struct hlist_head` 只有一個 `struct hlist_node *` 的成員; 而 `struct hlist_node` 型態卻包含了一個 `struct hlist_node *` 及 `struct hlist_node **` ，其中一個原因為 hash table 指向的為非環狀的 linked list ，因此只需指向 list 的一端點，若單純使用 `struct hlist_node` 當作 head ，無用的 "pprev" 會造成大量的記憶體空間浪費，因此將 head 與 node 分開來實做。而 `struct hlist_node` 中的 pprev 為何使用「指標的指標」而非「指標」？回答這個問題前可以先參考 Linux 原始碼中 [type.h](https://gist.github.com/ian910297/d03edc271105854a0cc3fcf68c1cb527) ： ```c=75 struct list_head { struct list_head *next, *prev; }; struct hlist_head { struct hlist_node *first; }; struct hlist_node { struct hlist_node *next, **pprev; }; ``` 可知在 type.h 中有兩種 list 的結構： #### 1. struct list_head 此結構在 linux 中常實做成環狀 doubly linked list，且可以在行程管理 (process scheduling) 的相關實做上看到，如 [sched.h](https://github.com/torvalds/linux/blob/master/include/linux/sched.h) 中有近 20 處使用到此結構，因為可以快速存取到頭以及尾的 node （時間複雜度 O(1) ）故有較好的效能，適用於行程管理這種對時間要求嚴謹的部分做使用。引用自 linux [sched.h](https://github.com/torvalds/linux/blob/master/include/linux/sched.h) ```c=461 struct sched_entity { /* For load-balancing: */ struct load_weight load; struct rb_node run_node; struct list_head group_node; unsigned int on_rq; ... } ``` #### 2. struct hlist_head 搭配 hlist_node 在 linux 中專門為了 hash table 而使用，`hlist_head` 的設計也省去了 list 起始端 `pprev` 的存放空間、在初始狀態就省去了一半的記憶體容量。而且同時 hash table 不會特別需要存取到 list 的尾端，並且走訪 list 相對沒那麼講求效率（因為 hash 的設計理念就是講求 hash collision rate 要低、因此一個 list 若太長比較需要改進的為 hash function 的設計而非改進整個資料結構）。綜合上述所說單向 list 已滿足 hash table 的需求。此外，回答前面對於 **`pprev` 為何是「指標的指標」** 的問題：若和 `list_head` 一樣使用單純的指標( `hlist_node *`)，則考慮到 list 有方向性，delete node 時需要額外檢查其是否為 list 的 head 或是 NULL 等等，有較冗餘的程式碼必須實做，因此使用 `hlist_node **pprev` 直接存取上一個 node 所在的位址 (`pprev = &node->next`) 。Linux 為求程式碼簡潔故以 pointer to pointer 的方式用 pprev 直接指向前一個元素的記憶體位址本身。以下範例用 hlist 刪除 list 中第一個元素搭配程式碼與圖示來解釋 (先忽略記憶體釋放的程式碼) 範例中值得注意的是，如果把 `hlist_node **pprev` 變成了 `hlist_node *prev` ，可以觀察到 `first->prev` 是沒有指向任何東西的，此時範例中 step2 直接將 `pprev` 指向 `first->pprev` 是會出錯的，需要額外的程式碼來處理刪除第一個 node 的狀況。 - 若 `prev` 為 `hlist_node *` （注意紅色箭頭）: ```graphviz digraph hash_list { node [shape=record]; rankdir=LR; map_t [label = "map|<f0>bits|<f1>ht*"] table [label = "<f0>ht[0]|<f1>ht[1]|<f2>...|<fn>ht[size - 1]"] first [label = "first"] hash_key1 [label = "<name>entry|{<pp>prev|<n>next}"] hash_key2 [label = "<name>entry_next|{<pp>prev|<n>next}"] null [shape=plaintext label="NULL"] blank [shape=plaintext label="NULL"] map_t:f1->table:f0 table:f1->first first->hash_key1:name hash_key1:pp->blank[color=red] hash_key1:n->hash_key2:name hash_key2:pp->hash_key1:name[color=red] hash_key2:n->null } ``` - 若 `pprev` 為 `hlist_node **` （注意紅色箭頭）: ```graphviz digraph hash_list { node [shape=record]; rankdir=LR; map_t [label = "map|<f0>bits|<f1>ht*"] table [label = "<f0>ht[0]|<f1>ht[1]|<f2>...|<fn>ht[size - 1]"] hash_key1 [label = "<name>entry|{<pp>pprev|<n>next}"] hash_key2 [label = "<name>entry_next|{<pp>pprev|<n>next}"] null [shape=plaintext label="NULL"] map_t:f1->table:f0 table:f1->first first->hash_key1:name hash_key1:pp->first[color=red] hash_key1:n->hash_key2:name hash_key2:pp->hash_key1:n[color=red] hash_key2:n->null } ``` 假設要刪除 list 的第一個元素 (first 指向的 entry) ```c // deletes entry from hlist void __hlist_del(struct hlist_node* entry) { struct hlist_node *next = entry->next; // step1 struct hlist_node **pprev = entry->pprev; //step2 *pprev = next; //step3 if (next) next->pprev = pprev; //step4 } ``` step1: `next` 指向 `entry->next` (也就是 `entry_next`) ```graphviz digraph hash_list { node [shape=record]; rankdir=LR; map_t [label = "map|<f0>bits|<f1>ht*"] table [label = "<f0>ht[0]|<f1>ht[1]|<f2>...|<fn>ht[size - 1]"] hash_key1 [label = "<name>entry|{<pp>pprev|<n>next}"] hash_key2 [label = "<name>entry_next|{<pp>pprev|<n>next}"] null [shape=plaintext label="NULL"] map_t:f1->table:f0 table:f1->first first->hash_key1:name hash_key1:pp->first hash_key1:n->hash_key2:name hash_key2:pp->hash_key1:n hash_key2:n->null next->hash_key2:s[color=red] } ``` step2: `pprev` 指向 `entry->pprev` ( 也就是 `&ht[1]->first`) ```graphviz digraph hash_list { node [shape=record]; rankdir=LR; map_t [label = "map|<f0>bits|<f1>ht*"] table [label = "<f0>ht[0]|<f1>ht[1]|<f2>...|<fn>ht[size - 1]"] hash_key1 [label = "<name>entry|{<pp>pprev|<n>next}"] hash_key2 [label = "<name>entry_next|{<pp>pprev|<n>next}"] null [shape=plaintext label="NULL"] map_t:f1->table:f0 table:f1->first first->hash_key1:name hash_key1:pp->first hash_key1:n->hash_key2:name hash_key2:pp->hash_key1 hash_key2:n->null next->hash_key2:s[color=red] pprev->first:s[color=red] } ``` step3: `*pprev` (即 `&ht[1]->first`) 指向「 `next` 指向的 `entry_next` 」 ```graphviz digraph hash_list { node [shape=record]; rankdir=LR; map_t [label = "map|<f0>bits|<f1>ht*"] table [label = "<f0>ht[0]|<f1>ht[1]|<f2>...|<fn>ht[size - 1]"] hash_key1 [label = "<name>entry|{<pp>pprev|<n>next}"] hash_key2 [label = "<name>entry_next|{<pp>pprev|<next>next}"] null [shape=plaintext label="NULL"] map_t:f1->table:f0 table:f1->first first->hash_key2:name[color=red] hash_key2:pp->hash_key1:n hash_key2:n->null next->hash_key2:s[color=red] pprev->first:s[color=red] } ``` step4: 當 `next` 指向的位址不為 `NULL` 則 `next->pprev` 指向 `entry_next->pprev` ```graphviz digraph hash_list { node [shape=record]; rankdir=LR; map_t [label = "map|<f0>bits|<f1>ht*"] table [label = "<f0>ht[0]|<f1>ht[1]|<f2>...|<fn>ht[size - 1]"] hash_key1 [label = "<name>entry|{<pp>pprev|<n>next}"] hash_key2 [label = "<name>entry_next|{<pp>pprev|<next>next}"] null [shape=plaintext label="NULL"] map_t:f1->table:f0 table:f1->first first->hash_key2:name[color=red] hash_key2:pp->hash_key1:n[color=none] hash_key2:pp->first[color=red] hash_key2:n->null next->hash_key2:s[color=red] pprev->first:s[color=red] } ``` 此時 `entry` 已不被任何指標指向、可以釋放其記憶體空間（略），由於以上範例在刪除非 first 元素時行為也相同所以在這邊也忽略。 #### Hash Function 此處僅先列出 hash 程式碼，[下一章節](#Hash-Function1) 會針對此處做詳細說明包含公式由來、相關出處、比較效能 (collision rate, scalability) 以及運算速度等。 ```c #define GOLDEN_RATIO_32 0x61C88647 static inline unsigned int hash(unsigned int val, unsigned int bits) { /* High bits are more random, so use them. */ return (val * GOLDEN_RATIO_32) >> (32 - bits); } ``` #### container_of 此結構常出現在課程以及 linux 核心程式碼中，主要是藉由某個 struct 中成員的記憶體位址回推該 struct 所在的記憶體位址。 ```c #define container_of(ptr, type, member) \ ({ \ void *__mptr = (void *) (ptr); \ ((type *) (__mptr - offsetof(type, member))); \ }) ``` ```c struct hash_key *kn = (struct hash_key *)malloc(sizeof(struct hash_key)); kn->key = key, kn->data = data; struct hlist_node *p = &kn->node; // 注意是取 kn->node 的位址給 p struct hash_key *kn = container_of(p, struct hash_key, node); ``` 例如以上例子， `container_of(p, struct hash_key, node)` 就是藉由 `p` 的位址回推儲存 `p` 的容器 `kn` 的位址，顯然可以用計算 `node` 這個成員在 `struct hash_key` 這個型態的相對記憶體位址來回推，假設 `p` 位址為 `0xff0c`, 而 `node` 這個成員在 `struct hash_key` 的第 12(`0x000c`) 個 byte 的位置，而 `p` 所在的容器位址可計算出為 `0xff0c - 0x000c = 0xff00`。 #### 重要細節 ```c struct hash_key { int key; void *data; struct hlist_node node; }; ``` `node` 在 `struct hash_key` 中不是「 pointer 」是「一般變數」，因為若是 pointer (`struct hlist_node*`) 的話，`container_of` 是算出 (「 pointer 所指的空間」 - 「成員在 struct 的相對位置」)、顯然這樣算會得出一個不合法的記憶體位址，值得注意。 ### 其他 hash table API 實作細節 #### map_init 先藉由 `MAP_HASH_SIZE(map->bits)` 將 hash table 的 bucket size 設定為 2 的 `map->bits` 次方，並將每個 bucket 指出去的 list 初始化為 `NULL` ，若有任何記憶體配置錯誤則初始失敗回傳 NULL ```c map_t *map_init(int bits) { map_t *map = (map_t*)malloc(sizeof(map_t)); if (!map) return NULL; map->bits = bits; map->ht = (struct hlist_head *)malloc(sizeof(struct hlist_head) * MAP_HASH_SIZE(map->bits)); if (map->ht) { for (int i = 0; i < MAP_HASH_SIZE(map->bits); i++) (map->ht)[i].first = NULL; } else { free(map); map = NULL; } return map; } ``` #### map_add 將 `key` 以及其對應的 `data` 加入 hash table 中。若 `find_key` 函數得出此 `key` 已經存在 hash table 中，則直接回傳，否則藉由 hash function 算出要存入 table 中的哪個 bucket 並將其加入 bucket 的 list 中。 ```c void map_add(map_t *map, int key, void *data) { struct hash_key *kn = find_key(map, key); if (kn) return; // step 1 kn = (struct hash_key *)malloc(sizeof(struct hash_key)); kn->key = key, kn->data = data; // step 2 struct hlist_head *h = &map->ht[hash(key, map->bits)]; struct hlist_node *n = &kn->node, *first = h->first; n->next = first; //step 3 if (first) first->pprev = &n->next; //step 4 h->first = n; //step 5 n->pprev = &h->first; //step 6 } ``` #### find_key 根據 `key` 找到對應的 hash bucket 中 list 的 `head`，走訪該 list 來尋找存放此 key 的結構 `struct hash_key`，並回傳此結構的 pointer，沒有比對到 key 則回傳 `NULL`。因為整個 list 只有 `struct hlist_node` 去參與並不是整個 `struct hash_key`，故要藉由 node 得到整個 hash_key 的結構必須用 `container_of` 去計算相對記憶體位址。 ```c static struct hash_key *find_key(map_t *map, int key) { struct hlist_head *head = &(map->ht)[hash(key, map->bits)]; for (struct hlist_node *p = head->first; p; p = p->next) { struct hash_key *kn = container_of(p, struct hash_key, node); if (kn->key == key) return kn; } return NULL; } ``` #### map_deinit 針對每個非空的 bucket 走訪所有節點釋放記憶體空間。 ```c void map_deinit(map_t *map) { if (!map) return; for (int i = 0; i < MAP_HASH_SIZE(map->bits); i++) { struct hlist_head *head = &map->ht[i]; for (struct hlist_node *p = head->first; p;) { struct hash_key *kn = container_of(p, struct hash_key, node); struct hlist_node *n = p; p = p->next; if (!n->pprev) /* unhashed */ goto bail; struct hlist_node *next = n->next, **pprev = n->pprev; *pprev = next; if (next) next->pprev = pprev; n->next = NULL, n->pprev = NULL; bail: free(kn->data); free(kn); } } free(map); } ``` ## Hash Function 程式碼中的 hash function 如下，可以看到其使用一個 magic number 稱為 `GOLDEN_RATIO` ，並且直接與 `val` 相乘並留下較高位元的 bits，此外 `bits` 變數為 hash 最終的值使用的 bit size ，實做中使用 `bits = 10` ，也就是 hash 出來的值介於 0 ~ 1023 之間。 ```c #define GOLDEN_RATIO_32 0x61C88647 static inline unsigned int hash(unsigned int val, unsigned int bits) { /* High bits are more random, so use them. */ return (val * GOLDEN_RATIO_32) >> (32 - bits); } ``` hash function 可以從以下幾個觀點考量： - hash collision 的機率(雜湊值的隨機性), 影響桶內的鏈表的長度 => 走訪的效率 - bucket 的使用率 => 空間使用率 - hash function 本身的計算效率 - 無法從雜湊值回推原值(加解密用, 應該不是 hash table 的考量點) 從 The Art of Computer Programming vol3. 6.4, Dr.Knuth 提到兩種 hash function - based on division - based on multiplication(multiplicative hashing) 書中描述 multiplicative hashing, 可以等價於 mod, 且在多數的 CPU 乘法都比除法快( 特例是 mod (power of 2) 可以轉爲位元運算), 此外對於非隨機性的值像是 "1 2 3 ..." 可以減少碰撞機率。在 [hash.h](https://github.com/torvalds/linux/blob/master/tools/include/linux/hash.h) 註解中出現的論文 [Linux Kernel Hash Table Behavior: Analysis and Improvements](http://www.citi.umich.edu/techreports/reports/citi-tr-00-1.pdf) 也提到 > Our multiplicative hash functions were derived from [Knuth], p. 513ff. The theory posits that machine multiplication by a large number that is likely to cause overflow is the same as finding the modulus by a different number. 利用 golden ratio 做 multiplicative hashing 的稱為 Fibonacci hashing(也是書中提出來的例子), 被上述的論文驗證過 :::danger 務必清楚標註，到底「根據 ref 裡面一篇討論」是什麼 "ref" 呢？又，討論在哪？ :notes: jserv ::: 而根據 ref 裡面一篇討論 Fibonacci hashing 的 blog, 作者認為 power of two binary and 雖然可以有較快的速度, 但是由於會捨棄掉 low bits , 在沒有留意的狀況可能會造成較高 hash collision，後面會用實驗來檢視此說法在各個情況的結果。而 [stackoverflow](https://stackoverflow.com/questions/38994306/what-is-the-meaning-of-0x61c88647-constant-in-threadlocal-java/38994612) 也有提到 > 0x61c88647 = 1640531527 ≈ 2 ^ 32 * (1 - 1 / φ), φ = (√5 + 1) ÷ 2, it is an another Golden ratio Num of 32 bits. > If you convert 0x61c88647 into decimal, you'll get 1640531527 which is nonsensical until you realise that in 32 bits, it's the signed version of 2654435769. Again this number seems a bit odd until you realise that it is 2 ^ 32 ÷ φ where φ is the golden ratio (√5+1)÷2. 再跟 [hash.h](https://github.com/torvalds/linux/blob/master/tools/include/linux/hash.h) 註解做比對 (phi 為黃金比例) >These are the negative, (1 - phi) = phi ^ 2 = (3 - sqrt(5))/2, which is very slightly easier to multiply by and makes no difference to the hash distribution. ### 整理上述資訊一般來說我們常聽到的 Golden Ratio : \begin{equation} Golden Ratio = \phi = \frac{1 + \sqrt{5}}{2} \end{equation} 而將註解中的公式 ` (3 - sqrt(5))/2` 進行推導如下： \begin{equation} \frac{3 - \sqrt{5}}{2} = 1 + \frac{1 - \sqrt{5}}{2} = 1 - \frac{2}{1 + \sqrt{5}} = 1 - \frac{1}{\frac{1 + \sqrt{5}}{2}} = 1 - \frac{1}{\phi} \end{equation} 由推導及引文的整理可知程式碼中的 Golden Ratio 實際上是將 `2 ^ 32 * (1 - 1 / phi)` 得到的，是一種 32 bit 的 Golden Ratio 表達方式，而 32 bit 版本又有兩種： - 第一種公式相當單純： \begin{equation} Golden Ratio = \frac{2 ^ {32}}{\phi} = \texttt{2654435769.5} \approx \texttt {0x9E3779BA} \end{equation} - 第二種稍複雜一些，也是 `hash.h` 使用的版本 \begin{equation} Golden Ratio = 2 ^ {32} \times (1 - \frac{1}{\phi}) = \texttt{1640531526.5} \approx \texttt {0x61C88647} \end{equation} 至於為什麼沒用第一種公式 `hash.h` 註解也有提到，主要是相乘上較為容易、並且表現跟第一種是一樣的。以下實際執行四個實驗來驗證其說法的正確性，包含比較 collision rate 、乘法執行時間比較等。也會進而延伸探討 scalability 、使用不同 bits 下當作 hash 結果等面向。 ### 實驗一： 32-bit 下不同 Golden Ratio 的 collision rate #### 動機及方法檢驗兩種 Golden Ratio 對於 hash 的表現，驗證註解提到的 hash distribution 是否真的相同。以下提供兩種 hash function 對 1~100 當作 key 去 hash 的結果比較圖 :::spoiler 實驗程式碼 hash.h 的 head 新增欄位 `cnt` 欄位紀錄 list 中有多少 key ; `map_add` 中新增一行 : `map->ht[hash(key, map->bits)].cnt += 1`，當然刪除 node 也必須有 `-= 1` 的操作。 ```cpp ... struct hlist_head { int cnt; struct hlist_node *first; }; ... void map_add(map_t *map, int key, void *data) { struct hash_key *kn = find_key(map, key); if (kn) return; kn = (struct hash_key*)malloc(sizeof(struct hash_key)); kn->key = key, kn->data = data; map->ht[hash(key, map->bits)].cnt += 1; struct hlist_head *h = &map->ht[hash(key, map->bits)]; struct hlist_node *n = &kn->node, *first = h->first; n->next = first; if (first) first->pprev = &n->next; h->first = n; n->pprev = &h->first; } ``` generate.c ```c= #include "hash.h" #include <stdio.h> #include <stdlib.h> struct timespec time_diff(struct timespec start, struct timespec end); int main(int argc, char *argv[]) { // hash major code if (argc < 3) { printf("should input bits and basic_size.\n"); exit(0); } int bits = atoi(argv[1]), basic_size = atoi(argv[2]), iteration_num = argc > 3 ? atoi(argv[3]) : 6; printf("hash table size = %d, basic num of key = %d, iteration num = %d .\n", MAP_HASH_SIZE(bits), basic_size, iteration_num); map_t* hmap = map_init(bits); if (!hmap) { printf("hash map malloc error!\n"); exit(0); } FILE *fout_scal; if (!(fout_scal = fopen("dat/scalability.dat", "w"))) { printf("file open error!\n"); exit(0); } for (int iter = 1; iter <= iteration_num; iter++) { printf("iteration %d ...\n", iter); FILE *fout_hash, *fout_dist; char *f_hash = (char *)malloc(sizeof(char) * 32), *f_dist = (char *)malloc(sizeof(char) * 32); if (!f_hash || !f_dist) { printf("filename malloc error!\n"); exit(0); } int size = iter * basic_size; sprintf(f_hash, "dat/generate_%d.dat", size); sprintf(f_dist, "dat/distribution_%d.dat", size); if (!(fout_hash = fopen(f_hash, "w")) || !(fout_dist = fopen(f_dist, "w"))) { printf("file open error!\n"); exit(0); } for (int i = 0; i < size; i++) { fprintf(fout_hash, "%d\n", hash(i, bits)); map_add(hmap, i, NULL); } int ncnt = 0, ucnt = 0, ccnt = 0; for (int i = 0; i < MAP_HASH_SIZE(bits); i++) { int tmp = (hmap->ht)[i].cnt; ncnt += !tmp ? 1 : 0; // count empty : increment if this bucket is empty ucnt += tmp == 1 ? 1 : 0; // count unique : increment if this bucket contains exactly one key ccnt += tmp ? tmp - 1 : 0; // count collision : if this bucket contains more than one key, then add ("num of key in this bucket" minus 1) fprintf(fout_dist, "%d\n", tmp); } // count ratio float load_factor = (float)size / MAP_HASH_SIZE(bits); // n divide by b, n is the num of keys, b is the num of buckets float not_used_rate = (float)ncnt / MAP_HASH_SIZE(bits) * 100; float exactly_one_rate = (float)ucnt / MAP_HASH_SIZE(bits) * 100; float more_than_one_rate = (float)(MAP_HASH_SIZE(bits) - ncnt - ucnt) / MAP_HASH_SIZE(bits) * 100; float collision_rate = (float)ccnt / size * 100; fprintf(fout_scal, "%.2f %f %f %f %f\n", load_factor, not_used_rate, exactly_one_rate, more_than_one_rate, collision_rate); printf("============== result ==============\n"); printf("num of key = %d\n", size); printf("load factor = %f\n", load_factor); printf("not-used rate = %f%%\n", not_used_rate); printf("exactly-one rate = %f%%\n", exactly_one_rate); printf("more-than-one rate = %f%%\n", more_than_one_rate); printf("collision rate = %f%%\n", collision_rate); printf("====================================\n"); fclose(fout_hash); fclose(fout_dist); free(f_hash); free(f_dist); } fclose(fout_scal); return 0; } ``` ::: #### 第一種公式: \begin{equation} Golden Ratio = \frac{2 ^ {32}}{\phi} = \texttt{2654435769.5} \approx \texttt {0x9E3779BA} \end{equation} - bits = 10, for 0~99 ![](https://i.imgur.com/0l0CGkd.png) ![](https://i.imgur.com/hACnTLS.png) ``` load factor = 0.097656 not-used rate = 90.234375% exactly-one rate = 9.765625% more-than-one rate = 0.000000% collision rate = 0.000000% ``` - bits = 10, for 0~499 ![](https://i.imgur.com/Mzy7Pqy.png) ![](https://i.imgur.com/q7I0EDU.png) ``` load factor = 0.488281 not-used rate = 51.171875% exactly-one rate = 48.828125% more-than-one rate = 0.000000% collision rate = 0.000000% ``` - bits = 10, for 0~1023 ![](https://i.imgur.com/4qEZGuA.png) ![](https://i.imgur.com/y4uwC9X.png) ``` load factor = 1.000000 not-used rate = 12.402344% exactly-one rate = 75.195312% more-than-one rate = 12.402344% collision rate = 12.402344% ``` #### 第二種公式 (`hash.h` 的版本)： \begin{equation} Golden Ratio = 2 ^ {32} \times (1 - \frac{1}{\phi}) = \texttt{1640531526.5} \approx \texttt {0x61C88647} \end{equation} - bits = 10, for 0~99 ![](https://i.imgur.com/i1FuGPS.png) ![](https://i.imgur.com/aACl63T.png) ``` load factor = 0.097656 not-used rate = 90.234375% exactly-one rate = 9.765625% more-than-one rate = 0.000000% collision rate = 0.000000% ``` - bits = 10, for 0~499 ![](https://i.imgur.com/XgpEiBQ.png) ![](https://i.imgur.com/mmC6nWu.png) ``` load factor = 0.488281 not-used rate = 51.171875% exactly-one rate = 48.828125% more-than-one rate = 0.000000% collision rate = 0.000000% ``` - bits = 10, for 0~1023 ![](https://i.imgur.com/oZyKj4H.png) ![](https://i.imgur.com/yF4MJ03.png) ``` load factor = 1.000000 not-used rate = 12.402344% exactly-one rate = 75.195312% more-than-one rate = 12.402344% collision rate = 12.402344% ``` #### 結果： 1. 同一個 key 對兩種版本的 hash 的值是不同的。 2. 兩種版本的 collision rate 及 hash table 中 key 在 bucket 中的分佈狀態皆得到完全一致的結果。 ### 實驗二：檢驗 Scability #### 動機及方法由以上實驗已經證實兩個 Golden Ratio 對 hash table 的分佈是一樣的，進一步探討 hash function 的 scalability (以 `hash.h` 中的 Golden Ratio 當作範例) ，同時分別使用連續的 key 值以及隨機的 key 值去做實驗。實驗程式碼同 [實驗一](#實驗一-：-32-bit-下不同-Golden-Ratio-的-collision-rate) ，此處用的 hash table 的 bucket 數皆控制在 1024 (bits = 10)。 #### 1. 使用連續的 key 值（ex: 0 ~ 2048）下去做 hash 以 [load factor](https://programming.guide/hash-table-load-factor-and-capacity.html) 當作 x 軸檢視 hash table 隨著 load factor 升高各個狀況 (not-used-bucket rate, exactly-one-bucket rate, more-than-one-bucket rate, collision rate)，以百分比 (%) 的呈現： ![](https://i.imgur.com/uUzkZOQ.png) #### 結果： - **not-used-bucket rate** : 可以看到 load-factor 一旦到 1.5 時，幾乎沒有 bucket 是空的。 - **exactly-one-bucket rate** : 起初隨著 load-factor 上升而升高（因為越來越多 bucket 被 hash 中），最高為 load-factor = 1 時來到近八成，接著隨著 load-factor 升高而急速降低（因為越來越多的 bucket 被 hash 中超過一次），直到 load-factor 來到 2.5 左右時幾乎所有的 bucket 都被 hash 中不只一次。 - **more-than-one-bucket rate** : 隨著 load-factor 升高而快速升高，直到 load-factor 來到 2.5 左右時幾乎為 100% 。值得注意的是 load-factor 為 1（ key 數量跟 hash table 的 bucket 數量一樣多時），已經有超過 10% 的 bucket 有超過一個 key 的現象。 - **collision rate** : 隨著 load-factor 升高而呈嚴格遞增曲線，增長速度漸緩。 #### 2. 使用隨機產生的 key 值下去做 hash 使用 c `stdlib.h` 的 `rand()` 搭配 `time.h` 的 `time(NULL)` 產生亂數種子，取其中一次結果 ![](https://i.imgur.com/R0YZTFw.png) #### 結果：相對於連續的 key 值來說，可以看到同一個 load factor 下 exactly-one rate 是低了許多，當然主要是因為 random 不像連續的 key 容易產生較為分散的 hash 結果，而 non-used rate 直到 load factor 接近 5 時才趨近於 0 。有趣的現象是，其 collision rate 的曲線竟然跟使用連續的 key 相當接近。 :::info todo : - 閱讀 [Hash Collision Probabilities](https://preshing.com/20110504/hash-collision-probabilities/) 嘗試比較不同 hash 的理論 collision rate 公式 - 做實驗討論若是不用較高位元的 bit 會造成什麼結果 ::: ### 實驗三： 32-bit 下不同 Golden Ratio 的執行時間 #### 動機及方法不確定 "very slightly easier to multiply" 是否是指運算成本較低一些些，因此做實驗來比較兩種版本的執行時間。實驗會將兩種不同的 Golden Ratio 做 hash 十億次分別計算執行時間，重複實驗 30 次 :::spoiler 實驗程式碼 ```c= #include <time.h> #include <stdio.h> #include <stdlib.h> #include "hash.h" #define NINO 1000000000 struct timespec time_diff(struct timespec start, struct timespec end); int main(int argc, char *argv[]) { if (argc < 3) { printf("should input bits and size.\n"); exit(0); } int bits = atoi(argv[1]), size = atoi(argv[2]); struct timespec start, end, diff; clock_gettime(CLOCK_MONOTONIC, &start); for (int i = 0; i < size; i++) int _ = hash(i, bits); clock_gettime(CLOCK_MONOTONIC, &end); diff = time_diff(start, end); printf("execution time = %lu.%lu sec\n", diff.tv_sec, diff.tv_nsec); return 0; } struct timespec time_diff(struct timespec start, struct timespec end) { struct timespec temp; if ((end.tv_nsec-start.tv_nsec) < 0) { temp.tv_sec = end.tv_sec - start.tv_sec - 1; temp.tv_nsec = NINO + end.tv_nsec-start.tv_nsec; } else { temp.tv_sec = end.tv_sec-start.tv_sec; temp.tv_nsec = end.tv_nsec-start.tv_nsec; } return temp; } ``` ::: ![](https://i.imgur.com/A1163wv.png) #### 結果：可以看到兩種版本的執行時間皆落在 1.86 ~ 1.89 秒之間，看起來 `hash.h` 中 `0x61C88647` 的版本確實執行時間略少於 `0x9E3779BA` ，不過有鑑於實驗次數不多、環境也僅限於本身的桌機，因此也難以斷定是否真是如此。 :::warning 1. 建議直接對 30 筆數據取平均值與標準差, 可以更精確地呈現兩者的差距 2. 可參考 [fibdrv#排除干擾效能分析的因素](https://hackmd.io/@KYWeng/rkGdultSU#%E6%8E%92%E9%99%A4%E5%B9%B2%E6%93%BE%E6%95%88%E8%83%BD%E5%88%86%E6%9E%90%E7%9A%84%E5%9B%A0%E7%B4%A0) 來降低數據的飄動幅度. 雖然 1.89 ~ 1.86 僅有約1.6%的差異, 但當兩者速度的差異也在相同級距時, 會降低數據的參考性 3. 可以備註使用的 compiler flags, compiler version 等實驗環境設置, 以利其他讀者重現實驗或是比對數據 -KYG- ::: ### 實驗四：保留不同 bits 下的 collision rate 由[本作業]()提供的[程式碼]()中有一行註解提到： > High bits are more random, so use them. 為了檢驗這個說法，我嘗試保留不同 bit 位置的 hash 結果，從最高位元的 bit 到最低位元的 bit 做一個 collision rate 的比較，關鍵程式碼如下： ```c #define GOLDEN_RATIO_32 0x61C88647 static inline unsigned int hash(unsigned int val, unsigned int bits, int lowbit) { /* High bits are more random, so use them. */ unsigned int ret = (val * GOLDEN_RATIO_32); //step1 ret = ret << (32 - bits - lowbit); // step2 ret = ret >> (32 - bits); // step3 return ret; } ``` 將 `hash` 函數的 parameter 新增 `lowbit` 欄位，`lowbit` 為會被 right shift 掉的位元數，並保留第 `lowbit + 1` 到 `lowbit + bits` 位元、也就是提取中間 `bits` 個位元的值。 - 假設 `bits = 10` 、`lowbit = 3` step1 : 算出 `val * GOLDEN_RATIO_32` 的值存到 `ret` ```graphviz digraph hash_list { node [shape=record]; head [shape=plaintext label="32 - bits - lowbit"] mid [shape=plaintext fontcolor=red label="bits"] tail [shape=plaintext label="lowbit"] ret [label="<h>1000111011100101000|<m>1000101010|<t>001"] head->ret:h mid->ret:m[color=red] tail->ret:t } ``` step2 : 將 `ret` 左移 `32 - bits - lowbit` 個 bits (清除高位元不必要的值) ```graphviz digraph hash_list { node [shape=record]; head [shape=plaintext fontcolor=red label="bits"] mid [shape=plaintext label="lowbit"] ret [label="<h>1000101010|<m>001|<t>0000000000000000000"] head->ret:h mid->ret:m[color=red] } ``` step3 : 將 `ret` 右移 `32 - bits` 個 bits (讓要保留的 `bits` 個 bit 回到最低位元) ```graphviz digraph hash_list { node [shape=record]; mid [shape=plaintext fontcolor=red label="bits"] ret [label="<h>0000000000000000000000|<t>1000101010"] mid->ret:t[color=red] } ``` 最後的 `1000101010` (十進位 554) 為 hash 結果。 #### 動機及方法控制 hash 出來的 key 介於 0 ~ 1023 (10 bits) 之間，用不同的 `lowbit` (0 ~ 22) 做出的 hash function 去對不同的 load factor 測試其 scalability ，並將結果作圖。此外，也嘗試兩種 key ，一種為連續的數字、一種為隨機的數字。 :::spoiler 實驗程式碼 - 對 generate.c 進行更改如下 ```c #include "hash.h" #include <stdio.h> #include <stdlib.h> #include <time.h> struct timespec time_diff(struct timespec start, struct timespec end); int main(int argc, char *argv[]) { // hash major code if (argc < 3) { printf("should input bits and basic_size.\n"); exit(0); } int bits = atoi(argv[1]), basic_size = atoi(argv[2]), iteration_num = argc > 3 ? atoi(argv[3]) : 1; printf("hash table size = %d, basic num of key = %d, iteration num = %d .\n", MAP_HASH_SIZE(bits), basic_size, iteration_num); FILE *fout_scal; if (!(fout_scal = fopen("dat/scalability_cmp.dat", "w"))) { printf("file open error!\n"); exit(0); } srand(time(NULL)); int shift_num = 12; int shift_arr[] = {22, 20, 18, 16, 14, 12, 10, 8, 6, 4, 2, 0}; // 12 float **shift_data = (float **)malloc(sizeof(float*) * (shift_num + 1)); for (int j = 0; j <= shift_num; j++) shift_data[j] = (float *)malloc(sizeof(float) * iteration_num); for (int shift_iter = 0; shift_iter < shift_num; shift_iter++) { printf("iteration %d ...\n", shift_iter); for (int iter = 1; iter <= iteration_num; iter++) { // printf("iteration %d ...\n", iter); map_t* hmap = map_init(bits); if (!hmap) { printf("hash map malloc error!\n"); exit(0); } int size = iter * basic_size; for (int i = 0; i < size; i++) { int v = rand(); // fprintf(fout_hash, "%d\n", hash(i, bits)); map_add(hmap, v, NULL, shift_arr[shift_iter]); } int ncnt = 0, ucnt = 0, ccnt = 0; for (int i = 0; i < MAP_HASH_SIZE(bits); i++) { int tmp = (hmap->ht)[i].cnt; ncnt += !tmp ? 1 : 0; // count empty : increment if this bucket is empty ucnt += tmp == 1 ? 1 : 0; // count unique : increment if this bucket contains exactly one key ccnt += tmp ? tmp - 1 : 0; // count collision : if this bucket contains more than one key, then add ("num of key in this bucket" minus 1) // fprintf(fout_dist, "%d\n", tmp); } // count ratio float load_factor = (float)size / MAP_HASH_SIZE(bits); // n divide by b, n is the num of keys, b is the num of buckets float not_used_rate = (float)ncnt / MAP_HASH_SIZE(bits) * 100; float exactly_one_rate = (float)ucnt / MAP_HASH_SIZE(bits) * 100; float more_than_one_rate = (float)(MAP_HASH_SIZE(bits) - ncnt - ucnt) / MAP_HASH_SIZE(bits) * 100; float collision_rate = (float)ccnt / size * 100; shift_data[0][iter - 1] = load_factor; shift_data[shift_iter][iter] = collision_rate; printf("%.2f %f %f %f %f\n", load_factor, not_used_rate, exactly_one_rate, more_than_one_rate, collision_rate); map_deinit(hmap); } } for (int i = 0; i < iteration_num; i++) { for (int j = 0; j <= shift_num; j++) { fprintf(fout_scal, "%.2f ", shift_data[j][i]); } fprintf(fout_scal, "\n"); } fclose(fout_scal); return 0; } ``` ::: 連續數字下保留不同位置的 `bit` 產生的結果 (shift 22 bits 就是保留最高位元的 10 個 bits) ![](https://i.imgur.com/pUbj55c.png) 用隨機數字保留不同位置的 `bit` 產生的結果 (shift 22 bits 就是保留最高位元的 10 個 bits) ![](https://i.imgur.com/H8sY0f5.png) #### 結果：由第一張圖可以看出，只有在 load factor 較小時不同 hash function 有較不穩定的結果，而低 load factor 的情況下不一定 shift 的 bit 數越多（越高位元）就越穩定。由第二張圖結果可以看出，在隨機情況下取不同的 bit 當作 hash 結果顯然對 collision rate 影響程度不大，且 load factor 在較大的情況的表現幾乎是一致的。但儘管如此，hash function 考慮到 shift bit 是有一定的根據的。有一種 hash function 稱作 [Mid-Square hashing](https://www.geeksforgeeks.org/mid-square-hashing/)，顧名思義就是將數字取平方後、提取中間幾個位數當作 hash 結果，例如對 4765 這個數字做 Mid-Square hashing，首先 4,765 x 4,765 = 22,705,225 ，這時假設只保留中間三個位數、捨棄最低的三位，得到 hash 結果為 705 。至於為什麽要取中間而不取最高位或最低位主要有幾個原因： - 取高位元：萬一只有少數幾個資料的 key 可以到很高的位元、其他都是較小的數字，則可能 hash 出來的值大都為 0, 1, 2 等很小的數字，顯然這不符合負載均衡 (load balance) 或碰撞機率低(low collision rate) 的精神。 - 取低位元：考慮到平方的特性 (2, 8 作為尾數取平方的最低位都是 4 ; 3, 7 作為尾數取平方的最低位都是 9 等) ，顯然容易抓到 hash 的規律、也不符合負載均衡的原則。而容易抓到 hash 的規律也造成在資安上使用這種容易猜測的 hash 函數被駭客破解。因此取中間位數顯然是折衷又相對有效的方法，一般要取中間幾位也沒有制式化規定、要看 key 的範圍跟特性。而至於 Golden Ratio 相乘的情況不同於取平方、也不見得有容易猜出規律的問題，因此是否取高位元就是較 `random` 以實驗結果來看，是一個有還待商榷的說詞。 ## 改進實做參考 [hash.h](https://github.com/torvalds/linux/blob/master/include/linux/hash.h), [hashtable.h](https://github.com/torvalds/linux/blob/master/include/linux/hashtable.h) 的實作方式，嘗試對程式碼做優化改進 ### Macro 從 Linux 核心對於 hashtable 的程式碼中可以有許多巨集 (Macro) 的定義，從中探討實做上的考量以及意義 #### ARRAY_SIZE/HASH_BITS 以下程式碼參考自 [hashtable.h](https://github.com/torvalds/linux/blob/master/include/linux/hashtable.h) ```c=27 #define HASH_SIZE(name) (ARRAY_SIZE(name)) #define HASH_BITS(name) ilog2(HASH_SIZE(name)) ``` 非常容易即可看出 `HASH_SIZE(name)` 的功能就是在計算 `name` 這個 array 的元素個數、並且 HASH_BITS 就是從 array 的元素個數回推其佔有的 bit 數，而簡單的程式碼暗藏玄機，注意到 `ARRAY_SIZE(name)` 這個 Macro 時去進行搜索，發現裡面暗藏許多相當厲害的技巧參考自[從0開始](http://frankchang0125.blogspot.com/2012/10/linux-kernel-arraysize.html) ，從 Linux 程式碼中尋找 `ARRAY_SIZE` 時，可以發現它定義在 [linux/kernel.h](https://github.com/torvalds/linux/blob/master/include/linux/kernel.h) 中： ```c=38 /** * ARRAY_SIZE - get the number of elements in array @arr * @arr: array to be sized */ #define ARRAY_SIZE(arr) (sizeof(arr) / sizeof((arr)[0]) + __must_be_array(arr)) ``` 前面並不難理解，計算 `(sizeof(arr) / sizeof((arr)[0])` 就是因為要根據不同的 type 去計算元素個數，因此若 arr 佔有的連續 byte 數為 `(arr)[0]` （即一個元素）佔有 byte 數的 k 倍顯然結果就是 k 。不過 Macro 有一個很危險的特質是它只會將 define 後面的敘述做帶入的動作，若使用 Macro 做了相對複雜的操作，他是不會幫你檢查程式碼的正確性的，因此要是使用 `ARRAY_SIZE` 這個 Macro 的人不夠謹慎，把錯誤的參數傳入是容易出錯的，因此 Macro 中有時會加入一些檢查用的機制防止誤用。顯然 `__must_be_array(arr)` 由名字來看就是表明 `arr` 本身必須是一個 array （連續的記憶體空間），但為什麼是用「加」的 ? 這就要參考 [linux/compiler.h](https://github.com/torvalds/linux/blob/master/include/linux/compiler.h) 的程式碼： ```c=239 /* &a[0] degrades to a pointer: a different type from an array */ #define __must_be_array(a) BUILD_BUG_ON_ZERO(__same_type((a), &(a)[0])) ``` 此時又發現了 `BUILD_BUG_ON_ZERO` 和 `__same_type` 兩個關鍵字，以下來分別探討。 1. 首先先看 `BUILD_BUG_ON_ZERO` ，參考自 [linux/build_bug.h](https://elixir.bootlin.com/linux/latest/source/include/linux/build_bug.h#L16) ```c=7 `#ifdef __CHECKER__ #define BUILD_BUG_ON_ZERO(e) (0) #else /* __CHECKER__ */ /* * Force a compilation error if condition is true, but also produce a * result (of value 0 and type int), so the expression can be used * e.g. in a structure initializer (or where-ever else comma expressions * aren't permitted). */ #define BUILD_BUG_ON_ZERO(e) ((int)(sizeof(struct { int:(-!!(e)); }))) #endif /* __CHECKER__ */` ``` 從註解可以理解到當 `!!(e)` 這個 condition 為 true 時會造成 compilation error ，先探討這個 condition 的意義： - `!(e)` 有兩種值，若 `e` 為 0 結果為 true(1) ; 反之 `e` 只要非 0 結果就是 0(false) 。 - `!!(e)` 就是再把 `!(e)` 做一次 Logical negation (NOT) - `-!!(e)` 的結果就是： `e` 為 0 則算出 0 ; `e` 為非 0 則算出 -1 而搭配 c 語言 [Bit-field](https://en.wikipedia.org/wiki/Bit_field) 的技巧搭配程式碼來看 `struct { int:(-!!(e)); }` ```c= // bit field box properties struct BoxProps { unsigned int opaque : 1; unsigned int fill_color : 3; unsigned int : 4; // fill to 8 bits unsigned int show_border : 1; unsigned int border_color : 3; unsigned int border_style : 2; unsigned char : 0; // fill to nearest byte (16 bits) unsigned char width : 4, // Split a byte into 2 fields of 4 bits height : 4; }; ``` 以上程式碼引用自維基百科，可以發現冒號後面 (:) 可以直接指定變數使用的 bit 數，如 `opaque` 只用 1 bit，而 `fill_color` 佔用 3 bits，而因為記憶體對齊 (alignment) 所以會有 `fill to 8 bits` 等操作（這邊不贅述）。而顯然指定記憶體的 bit 數必須是一個非負整數，而再看回 `struct { int:(-!!(e)); }` 可以發現當 condition 為 true 時會指定一個 bit 數為 -1 的結構成員，這會引發 compilation error ; 而厲害的是可以指定 bit 數為 0 的成員，其主要的功能是強制將記憶體對齊至下一個 word 邊界 (Force alignment at the next word boundary)，如上面程式碼的 `unsigned char : 0;`，而且不會佔任何的空間。到這邊似乎對前面程式碼越來越有頭緒了。 2. 接著來看 `__same_type((a), &(a)[0])` ，參考 [linux/compiler_type.h](https://elixir.bootlin.com/linux/latest/source/include/linux/compiler_types.h#L264) ```c=264 /* Are two types/vars the same type (ignoring qualifiers)? */ #define __same_type(a, b) __builtin_types_compatible_p(typeof(a), typeof(b)) ``` 可以看到 `__same_type()` 呼叫了 GCC 的 built-in function：[`__builtin_types_compatible_p`](https://gcc.gnu.org/onlinedocs/gcc-4.7.2/gcc/Other-Builtins.html#Other-Builtins) > You can use the built-in function __builtin_types_compatible_p to determine whether two types are the same. This built-in function returns 1 if the unqualified versions of the types type1 and type2 (which are types, not expressions) are compatible, 0 otherwise. The result of this built-in function can be used in integer constant expressions. 可以得知當 `type1` 和 `type2` 一致會回傳 1 否則為 0 ，而在範例中 `a` 如果是 array([]) 則 `__builtin_types_compatible_p(typeof(a), typeof(&(a)[0]))` 會回傳 0；`a` 如果是 pointer(*) 則回傳 1 。總結：看回 `ARRAY_SIZE(arr) (sizeof(arr) / sizeof((arr)[0]) + __must_be_array(arr))` ，顯然 arr 為 array([]) 則 `__must_be_array(arr)` 為 0 不影響結果、而 arr 為 pointer(*) 時會直接導致編譯不通過，而可能有人會問為什麼不用 `assert` 就好，主要是 `assert` 是在 runtime 才會提醒出錯、但發生 runtime error 是有機率性的，若不能儘早發現錯誤會加重 debug 的難度，考量到安全性、可維護性因此必須讓它在 compile time 就過不了。雖然費了很大的心力只為了解釋兩行程式碼，但從過程中也看到了許多 linux 核心的風格，也接觸到了 c 語言的許多技巧，從中也知道自己的不足，可謂相當有收穫。 #### DEFINE_HASHTABLE, DEFINE_READ_MOSTLY_HASHTABLE, DECLARE_HASHTABLE - RCU - Safe ## Project : NLP vector generator ### 動機 [自然語言處理 (NLP)](https://zh.wikipedia.org/wiki/%E8%87%AA%E7%84%B6%E8%AF%AD%E8%A8%80%E5%A4%84%E7%90%86)屬於機器學習的一個分支，主要目的是希望電腦可以對自然語言（中文、英文）等經過訓練而產生認知語言的能力，主要方式是讓電腦把輸入的語言變成有意思的符號和關係，然後根據目的再處理。自然語言生成系統則是把計算機數據轉化為自然語言，常見的應用如我們熟知的 [Siri](https://zh.wikipedia.org/wiki/Siri) ，聊天機器人等。有鑑於此題目和 NLP 都有「將輸入 (key 值) mapping 到某個值域（數字、特徵向量等）」的特性，而事實上 NLP 也有將語言經過 hash function 轉成特徵輸入的作法，而一般而言 hash 的過程大都使用現成的 library，且機器學習大多都以 python 等高階語言入門，能夠接觸到的部分幾乎都是調參數、對資料預處理 (preprocessing) 等很上層的領域。因此藉由本次實作機會想試試看結合 Linux 核心對 hash 的實作和字串特徵擷取，嘗試去實現以字串搭配 fabonacci hashing 的機制產出 NLP 的特徵向量產生器，以不同的面相切入當紅的機器學習領域。 ### 前置作業一：延伸實作出 string based 的 hash function #### 定義新型態、修改原本型態 `map_t` 新增 `type` 去區分 hashtable 做 hash 的型態新增 `strut hash_skey` ，跟 `struct hash_key` 的主要差別在於 key 變成了字串的形式 ```c= typedef struct { char type; // 'i' for integer hashing, 's' for string hashing int bits; struct hlist_head *ht; } map_t; struct hash_key { int key; void *data; struct hlist_node node; }; struct hash_skey { char * skey; void *data; struct hlist_node node; }; ``` (Todo) ```c= // check type #define TYPE_CONSISTENT(map, t) \ (map && map->type == t) ? 1 : 0 void map_adds(map_t *map, char *skey, void *data) { struct hash_skey *kn = find_skey(map, skey); if (kn) return; kn = (struct hash_skey*)malloc(sizeof(struct hash_skey)); if (!kn) // add return; size_t len = strlen(skey) + 1; kn->skey = (char *)malloc(sizeof(char) * len); if (!kn->skey){ free(kn); return; } strncpy(kn->skey, skey, len); kn->data = data; struct hlist_head *h = &map->ht[shash(skey, map->bits)]; struct hlist_node *n = &kn->node, *first = h->first; n->next = first; if (first) first->pprev = &n->next; h->first = n; n->pprev = &h->first; } ``` #### 字串版本的 hash function ```c= static inline unsigned int shash(const char *skey, unsigned int bits) { if (!skey) return 0; unsigned int ret = 1, len = strlen(skey); for (int i = 0; i < len; i++) ret = (ret * skey[i] * GOLDEN_RATIO_32) >> (32 - bits); return ret; } ``` (Todo) ### 前置作業二：檢驗 Scalability 檢驗看是否字串對於 Golden Ratio 為基礎的 hash 也有跟整數一樣的表現 (Todo) ```shell key : "&"8XL'*N3Y#", value : 40 key : "D$SOxkTq'R3Uc+'9", value : 505 key : "GR?w@{qFqr[CP2.:B(dl??", value : 378 key : "O0rL**cdsmTc"3dxgp", value : 195 key : "hbtmwMK6qv)1i*"K;sT&0o:", value : 689 key : "]yx5+!kkZc%25", value : 174 key : ";t`fprd$&hVa?lp,0Vg"QM", value : 347 ... ``` (Todo) ![](https://i.imgur.com/H4Egg1h.png) - number of key = 1024 ![](https://i.imgur.com/uMGWe92.png) - number of key = 2048 ![](https://i.imgur.com/lRoOxIK.png) - number of key = 4096 ![](https://i.imgur.com/t34FJpP.png) ## 參考資料 - 有效率的 hash table 實做 - https://medium.com/@fchern/%E8%A8%AD%E8%A8%88%E9%AB%98%E6%95%88%E8%83%BD%E7%9A%84hash-table-%E4%B8%80-303d9713abab - [Faster Hash Table Probing](https://craftinginterpreters.com/optimization.html#faster-hash-table-probing) - list, hlist, rcu_list - https://ithelp.ithome.com.tw/articles/10219190 - https://danielmaker.github.io/blog/linux/list_hlist_hashtable.html - https://www.programmersought.com/article/34182729928/ - golden ratio - https://stackoverflow.com/questions/38994306/what-is-the-meaning-of-0x61c88647-constant-in-threadlocal-java - https://probablydance.com/2018/06/16/fibonacci-hashing-the-optimization-that-the-world-forgot-or-a-better-alternative-to-integer-modulo/ - hashtable.h 相關討論 - http://frankchang0125.blogspot.com/2012/10/linux-kernel-buildbugonzero.html - RCU 機制 - [Linux 核心設計: RCU 同步機制](https://hackmd.io/@sysprog/linux-rcu) - [What is RCU, Fundamentally?](https://lwn.net/Articles/262464/)