String Interning

--- tags: System Software --- # String Interning ## Material [2021q1 第 2 週測驗題](https://hackmd.io/@sysprog/linux2021-quiz2) ## Start From Simple 我們可以先從一個簡化的實作來窺探 [string interning](https://en.wikipedia.org/wiki/String_interning) 究竟葫蘆裡賣甚麼藥。在此之前，注意到在這個範例中，黃敬群老師為了測驗和教學之便而做了許多簡化，想對此主題有更完整的認識，參考專案 [chriso/intern](https://github.com/chriso/intern) 來理解 [string interning](https://en.wikipedia.org/wiki/String_interning) 當是更好的作法。不過我們還是可以透過此範例對 string interning 的基本觀念有一定的掌握，過程中我也會嘗試指出我發現的問題，有時間的話也再深入探討 [chriso/intern](https://github.com/chriso/intern) 以得到更完整的理解。 :::info :notes: 顯然加一些圖解會更容易理解，不過此筆記目前主要是為了幫助自己學習，因此雖然腦中有對此架構的形狀，~~筆者本人則偷懶而沒有附上~~，也許未來有機會再探討此議題時補充 QQ ::: ### Some Structure ```cpp enum { CSTR_PERMANENT = 1, CSTR_INTERNING = 2, CSTR_ONSTACK = 4, }; #define CSTR_INTERNING_SIZE (32) #define CSTR_STACK_SIZE (128) typedef struct __cstr_data { char *cstr; uint32_t hash_size; uint16_t type; uint16_t ref; } * cstring; typedef struct __cstr_buffer { cstring str; } cstr_buffer[1]; ``` * enum 中定義不同的字串類型，除了可以讓函式對不同的類型做不同的相應處理以外，類型也影響 `cstring` 中的 `ref` / `hash_size` 的實質意義，以及 `cstr` 中用到的空間 * `cstr` 指向的空間可能來自 heap 直接分配，或者 `__cstr_node` 中的 `buffer`，或者 stack) * `CSTR_INTERNING_SIZE` 是會做 intern 的字串長度，換句話說，超過此長度的字串會選擇直接分配空間而不是使用 intern 的 reference * `CSTR_STACK_SIZE` 則是 `CSTR_ONSTACK` 類型的字串長度 * `cstring` 結構表示程式中的字串 * `cstr` 可能指向: * `__cstr_node` 中的 buffer(`CSTR_INTERNING` type) * heap 上直接分配的空間(type == 0) * 字串為 literal (`CSTR_PERMANENT` type) * stack 上的空間(`CSTR_ONSTACK` type) * `hash_size` 可能代表: * 字串長度(`CSTR_ONSTACK` type) * hash value(`CSTR_INTERNING` type) * 或者為 0，不被使用(其他) * `type` 為類型，如前所述根據類型各變數會有不同的使用方式 * `ref` 為 reference 數量: * type == 0 時，`ref` 會代表實際的 reference 數量，因此 release 時可以判斷 reference 數量是否歸零才去 free * 其他類型下 `ref` 則無實際用途 ```cpp #define INTERNING_POOL_SIZE 1024 #define HASH_START_SIZE 16 /* must be power of 2 */ struct __cstr_node { char buffer[CSTR_INTERNING_SIZE]; struct __cstr_data str; struct __cstr_node *next; }; struct __cstr_pool { struct __cstr_node node[INTERNING_POOL_SIZE]; }; struct __cstr_interning { int lock; int index; unsigned size; unsigned total; struct __cstr_node **hash; struct __cstr_pool *pool; }; ``` * `__cstr_node` 儲存 intern 字串 * `struct __cstr_data` 即 `cstring` 結構，其中的 `cstr` 會指向 `buffer` 中的儲存字串 * node 會以 linked list 結構被放在某個 `hash` 中的 bucket ，`next` 指向 linked list 的下一個 * `__cstr_interning` 透過 hash 儲存被 intern 的字串 * 考慮到多執行緒下的擴充，`lock` 用來保護對 `__cstr_interning` 的操作 * `index` 儲存目前取用 `pool` 中 `__cstr_node` 數量 * `size` 為 `hash` 的 bucket 數量 * `hash` 為一個 hash table * `pool` 則為預先分配好的 `__cstr_node`，會被連接到某個 `hash` 的 bucket 上 ### `CSTR_BUFFER` ```cpp #define CSTR_BUFFER(var) \ char var##_cstring[CSTR_STACK_SIZE] = {0}; \ struct __cstr_data var##_cstr_data = {var##_cstring, 0, CSTR_ONSTACK, 0}; \ cstr_buffer var; \ var->str = &var##_cstr_data; ``` * `CSTR_ONSTACK` 是字串的一種類型，可以透過 `CSTR_BUFFER` 建立 * 假設呼叫方法是 `CSTR_BUFFER(a)` * 在 stack 上建立一個名為 `a_cstring` 的 `CSTR_STACK_SIZE` 大小 char 陣列 * 宣告一個 `__cstr_data` 結構，名為 `a_cstr_data`，其字串使用的為前面建立的 char 陣列(因此第 3 項為 `CSTR_ONSTACK`)，`hash_size` 和 `ref` 則初始為 0 * 宣告一個 `cstr_buffer` 結構名為 `a`，且其中的 `cstring` 成員即為前面的 `a_cstr_data` ### `CSTR_LITERAL` ```cpp #define CSTR_LITERAL(var, cstr) \ static cstring var = NULL; \ if (!var) { \ cstring tmp = cstr_clone("" cstr, (sizeof(cstr) / sizeof(char)) - 1); \ if (tmp->type == 0) { \ tmp->type = CSTR_PERMANENT; \ tmp->ref = 0; \ } \ if (!__sync_bool_compare_and_swap(&var, NULL, tmp)) { \ if (tmp->type == CSTR_PERMANENT) \ free(tmp); \ } \ } ``` * 透過 `cstr_clone` 得到 `cstring` 結構 * `type == 0` 表示字串配置在 heap 上，但為了建立 literal 類型更改為 `CSTR_PERMANENT` type(此 type 下 `ref` 無用途，設為 0) * 通過 `__sync_bool_compare_and_swap`，如果 `var` 為 NULL，則更新為 `tmp` * 如果 swap 失敗，且為 `CSTR_PERMANENT` 類型，釋放 `tmp` 避免 leak * 這代表 `CSTR_LITERAL` 存在失敗的可能，但是程式沒有明確阻止使用者操作失敗時的 NULL 結果 :::warning 可以發現 `CSTR_LITERAL` 如果是分配在 heap 上的話，不存在對應的釋放函式(`cstr_release` 只作用於 type == 0 的類型上)，這些 literal 類型的字串應該被紀錄起來，並在程式結束時釋放 ::: ### `CSTR_CLOSE` ```cpp #define CSTR_CLOSE(var) \ do { \ if (!(var)->str->type) \ cstr_release((var)->str); \ } while (0) ``` * 檢查字串類型是否 `type == 0`，若是則 `cstr_release` 釋放 ### `cstr_cat` ```cpp cstring cstr_cat(cstr_buffer sb, const char *str) { cstring s = sb->str; if (s->type == CSTR_ONSTACK) { int i = s->hash_size; while (i < CSTR_STACK_SIZE - 1) { s->cstr[i] = *str; if (*str == 0) return s; ++s->hash_size; ++str; ++i; } s->cstr[i] = 0; } cstring tmp = s; sb->str = cstr_cat2(tmp->cstr, str); cstr_release(tmp); return sb->str; } ``` * 從 `sb` 中取出字串，如果該字串為 `CSTR_ONSTACK`，則 `s->hash_size` 可表字串的長度 `i`，接下來只要將 `str` 內容逐一放到 `s->str` 即可 * 非 `CSTR_ONSTACK` 狀況下，使用 `cstr_cat2` 去處理 * 經 `cstr_cat2` 後，`tmp` 這個 `cstring` 結構已經不再被需要了，因此透過 `cstr_release` 釋放(如果其字串空間來自 `xalloc`) ### `cstr_cat2` ```cpp static cstring cstr_cat2(const char *a, const char *b) { size_t sa = strlen(a), sb = strlen(b); if (sa + sb < CSTR_INTERNING_SIZE) { char tmp[CSTR_INTERNING_SIZE]; memcpy(tmp, a, sa); memcpy(tmp + sa, b, sb); tmp[sa + sb] = 0; return cstr_interning(tmp, sa + sb, hash_blob(tmp, sa + sb)); } cstring p = xalloc(sizeof(struct __cstr_data) + sa + sb + 1); if (!p) return NULL; char *ptr = (char *) (p + 1); p->cstr = ptr; p->type = 0; p->ref = 1; memcpy(ptr, a, sa); memcpy(ptr + sa, b, sb); ptr[sa + sb] = 0; p->hash_size = 0; return p; } ``` * 首先，對於輸入的兩個 `char` 字串如果總和小於 `CSTR_INTERNING_SIZE`，直接在 stack 上接起來。並透過 `cstr_interning` 查找是否已存在相同內容的字串被 intern，若無則將其 intern 一個，最後得到被 intern 的字串 * 如果字串長度長於 `CSTR_INTERNING_SIZE`，這個字串不會被 intern，而是直接 `xalloc` 必要的大小並把這個銜接起來的 string copy 到 cstring 結構中 * `(p + 1)` 讓指標前進一個 `cstring` 大小，此時 `ptr` 指向的是當時分配空間時預留給字串的大小 `sa + sb + 1`，因此 `p->cstr` assign 為 `ptr` 即所表示的字串 * `p->type` 被設為 0，而 `p->ref` 被設為 1，且 `hash_size = 0`，代表此字串沒被 intern 的狀態 ### `cstr_clone` 原則上和 `cstr_cat2` 很相似，只是一個是直接建立一個字串另一個是銜接兩個字串，這裡就不多做解釋。 ### `cstr_release` ```cpp void cstr_release(cstring s) { if (s->type || !s->ref) return; if (__sync_sub_and_fetch(&s->ref, 1) == 0) free(s); } ``` * 如果 `type` 被設為 0 且 reference count 大於零，表示此字串的類型為空間分配在 heap 上(且非 heap 上的 `struct` 中的陣列空間) * `__sync_sub_and_fetch` 從 `&s->ref` 减去 1，並返回操作後的數值，因此如果 reference count 因此被減為 0，`s` 會被釋放。 ### `hash_blob` ```cpp static inline uint32_t hash_blob(const char *buffer, size_t len) { const uint8_t *ptr = (const uint8_t *) buffer; size_t h = len; size_t step = (len >> 5) + 1; for (size_t i = len; i >= step; i -= step) h = h ^ ((h << 5) + (h >> 2) + ptr[i - 1]); return h == 0 ? 1 : h; } ``` * JS Hash invented by Justin Sobel? * 對於輸入的字串 `buffer` 長度為 `len`，得到 hash 後的 index ### `CSTR_LOCK` / `CSTR_UNLOCK` ```cpp /* FIXME: use C11 atomics */ #define CSTR_LOCK() \ ({ \ while (__sync_lock_test_and_set(&(__cstr_ctx.lock), 1)) { \ } \ }) #define CSTR_UNLOCK() ({ __sync_lock_release(&(__cstr_ctx.lock)); }) ``` * `CSTR_LOCK` 透過 [Test-and-Set](https://en.wikipedia.org/wiki/Test-and-set) 方法嘗試去 acquire lock，將 1 寫入 `__cstr_ctx.lock` 並返回 lock 原本的值(透過 gcc builtin 整個流程為 atomic operation)。 * `CSTR_UNLOCK` 則透過 `__sync_lock_release` 將寫為 0，解鎖 lock ### `cstr_interning` ```cpp static cstring cstr_interning(const char *cstr, size_t sz, uint32_t hash) { cstring ret; CSTR_LOCK(); ret = interning(&__cstr_ctx, cstr, sz, hash); if (!ret) { expand(&__cstr_ctx); ret = interning(&__cstr_ctx, cstr, sz, hash); } ++__cstr_ctx.total; CSTR_UNLOCK(); return ret; } ``` * 因為 `__cstr_ctx` 是全域的變數，因此需要透過 lock 來保護(雖然目前還沒有 multi thread 的設計 * `interning` 察看是否已經有被使用過的相同字串，詳見下面對 `interning` 的解釋，簡而言之 * 如果有相同的字串，直接從中拿出來用 * 如果沒有 * 當可被 intern 的字串數量超過 threshold，會返回 NULL，因此需要額外 `expand` 然後重新 `interning` * 否則直接把該字串也 intern 然後拿出來用 * `_cstr_ctx.total` + 1 表示被提取的總量 + 1 :::warning 總覺得以被提取的總量來決定是否 expand 不符合邏輯? `_cstr_ctx.total` 應該要用來代表被 intern 的字串總量? ::: ### `expand` ```cpp static void expand(struct __cstr_interning *si) { unsigned new_size = si->size * 2; if (new_size < HASH_START_SIZE) new_size = HASH_START_SIZE; struct __cstr_node **new_hash = xalloc(sizeof(struct __cstr_node *) * new_size); memset(new_hash, 0, sizeof(struct __cstr_node *) * new_size); for (unsigned i = 0; i < si->size; ++i) { struct __cstr_node *node = si->hash[i]; while (node) { struct __cstr_node *tmp = node->next; insert_node(new_hash, new_size, node); node = tmp; } } free(si->hash); si->hash = new_hash; si->size = new_size; } ``` * `expand` 增加 hash 的儲存空間大小 * 第一次呼叫 `expand` 時，至少確保有 `HASH_START_SIZE` 的大小 * 透過 `xalloc` (對 `malloc` 的包裝，確保 malloc 失敗時的後續處理) 建立 `new_size` 個 `__cstr_node *` 空間，並先重設為 0 * 透過 `insert_node` 將舊的 hash 的內容更新到 `new_hash` 中 * `insert_node` 會根據每個字串的原始 hash 值和新的 hash 範圍，重新找到舊的 node 在新的 hash 中應存在的位置(bucket) * 釋放舊的 hash 後，再新的 hash 和 size assign 到 `si` 結構中 ### `interning` ```cpp static cstring interning(struct __cstr_interning *si, const char *cstr, size_t sz, uint32_t hash) { if (!si->hash) return NULL; int index = (int) (hash & (si->size - 1)); struct __cstr_node *n = si->hash[index]; while (n) { if (n->str.hash_size == hash) { if (!strcmp(n->str.cstr, cstr)) return &n->str; } n = n->next; } // 80% (4/5) threshold if (si->total * 5 >= si->size * 4) return NULL; ``` * 對於字串 `cstr`，尋找是否已經有存在的 instance，避免在記憶體中存在多份副本帶來的空間浪費 * 透過 `hash & (si->size - 1)` 得到 hash 的 `index`，透過 `index` 找到對應的 `__cstr_node *n` * 每個 hash 到的 `__cstr_node` 會形成一個 linked list，linear search 之。在 node 中的 `cstring` 之 `hash_size` 用來代表 hash 值，因此首先比較 hash 值如果不同就必為相異字串 * 如果 hash 值相同再用 `strcmp` 進一步確認，相同則直接返回 reference * `si->total` 是從 intern 提取的字串總量，如果其數量多於 hash 的 size 之 4 / 5，先不對 `cstr` 做 intern，返回 NULL * 在 `cstr_interning` 中可以看到若返回 NULL 會先透過 `expand` 增加 hash 大小，再呼叫 `interning` 重新 intern 一次 ```cpp if (!si->pool) { si->pool = xalloc(sizeof(struct __cstr_pool)); si->index = 0; } n = &si->pool->node[si->index++]; ``` * 如果 `si->pool` 不存在則透過 `xalloc` 建立，並重置 `si->index` * `struct __cstr_interning` 中的 `hash` 是指向 `struct __cstr_node *` 的 pointer。所指向的 `struct __cstr_node *` 空間就來自 `pool` * 因此 `index` 代表下一個被使用的 `struct __cstr_node *` :::warning 可以注意到原始的程式沒有對 `pool` 的使用超出 `INTERNING_POOL_SIZE` 做規範，`expand` 也沒有擴充大小的限制，因此當 `index` 累計超出 `INTERNING_POOL_SIZE` 會產生 overflow! ::: ```cpp memcpy(n->buffer, cstr, sz); n->buffer[sz] = 0; cstring cs = &n->str; cs->cstr = n->buffer; cs->hash_size = hash; cs->type = CSTR_INTERNING; cs->ref = 0; n->next = si->hash[index]; si->hash[index] = n; return cs; } ``` * 每個 `struct __cstr_node *` 計錄一個被 intern 的字串。將字串 copy 到 `n->buffer` 後，就可以將 intern 的字串透過 `cstring` 結構返回 * 因為字串來自 interning，因此 type 是 `CSTR_INTERNING` 且 * `hash_size` 在此 type 下代表此字串的 hash value * *`ref` 設為 0 的正確性???* * 最後將 `struct __cstr_node *` n 接到所屬的 `hash[]` 下的 linked list 的開頭，並回傳被 intern 的字串 ### `cstr_equal` ```cpp int cstr_equal(cstring a, cstring b) { if (a == b) return 1; if ((a->type == CSTR_INTERNING) && (b->type == CSTR_INTERNING)) return 0; if ((a->type == CSTR_ONSTACK) && (b->type == CSTR_ONSTACK)) { if (a->hash_size != b->hash_size) return 0; return memcmp(a->cstr, b->cstr, a->hash_size) == 0; } uint32_t hasha = cstr_hash(a); uint32_t hashb = cstr_hash(b); if (hasha != hashb) return 0; return !strcmp(a->cstr, b->cstr); } ``` * 字串的比較部份，因為字串有許多 metadata，因此可以不用直接透過 `strcmp` 先比較 * `a == b` 即同位置上的字串，自然內容相同 * 如果位置不同且兩者皆為 `CSTR_INTERNING`，則必不同(因為 `CSTR_INTERNING` 對同個字串只有一次 intern) * 對於 `CSTR_ONSTACK` 類型，可以先比較長度(`hash_size`)，再用 `memcmp` 比較實際內容 * 如果其他類型，可以先比對 hash 值，最後的情況再用 `strcmp` 確認 ### `cstr_grab` ```cpp cstring cstr_grab(cstring s) { if (s->type & (CSTR_PERMANENT | CSTR_INTERNING)) return s; if (s->type == CSTR_ONSTACK) return cstr_clone(s->cstr, s->hash_size); if (s->ref == 0) s->type = CSTR_PERMANENT; else __sync_add_and_fetch(&s->ref, 1); return s; } ``` * 取得一份 `cstring s`，根據類型回傳的型態會有差異 ### Bonus: `cstr_interning_free` ```cpp void cstr_interning_free() { free(__cstr_ctx.pool); free(__cstr_ctx.hash); } ``` * 在老師原始的設計中沒有這個 function，不過我們應該在程式結束時主動釋放這些空間 ### 總結上面的整理可能會有些紊亂(我很抱歉QQ)，不過我們還是可以歸納出一些 string interning 的相關技巧: * 將可能出現在多處的相同字串簡化為單一空間，透過 hash 方式在使用這些字串時快速的取得 reference * 不同類型的字串(主要是長度差異，並有部份是 user 的選擇) 需要的 meta-data 可能不同，struct 中的變數可以根據需求直接複用(或許可以考慮改成 union 增加可讀性) * hash 的建立所需要的 `node` 並非每次新的字串進來時再去分配，反之，預先分配好一定數量的 `node`(也就是 pool)，需要時再去取用，減少頻繁呼叫 `malloc` 可能產生的時間問題 * 字串帶有 hash 值、長度等 meta data 可以協助字串的比對，有機會更有效率的產生比對結果