Cache-Sensitive Skip List: Efficient Range Queries on modern CPUs

# Cache-Sensitive Skip List: Efficient Range Queries on modern CPUs ###### tags: `homework` - [ ] [Cache-Sensitive Skip List: Efficient Range Queries on modern CPUs](https://www2.informatik.hu-berlin.de/~sprengsz/papers/cssl.pdf) ## Skip list ![](https://i.imgur.com/BkeHAZz.png) > Figure 1 Skip list 由多個不同層級的鍵值通道 (Lane) 組成，如圖 Figure 1。在第 Level i 的通道中，平均具有 $n*p^i$ 個元素，其中 n 為元素總數且 $0<p<1$ ，所以 Skip list 是一個透過機率所決定出來的資料結構，雖然可以很有效地更新資料，但對於確切的結構仍是不可預測的。因此本論文採用 deterministic variant 的 skip list [perfectly balanced skip lists](https://dl.acm.org/doi/pdf/10.5555/139404.139478) 作為骨架， balanced skip list 的第 Level i+1 通道包含 Level i 通道的每 1/p 個元素，如 Figure 1 所示，當 p = 0.5 時， Level 1 的通道包含了 Level 0 的每 1/0.5 = 2 個元素 (1, 3, 5, 7, 9)、Level 2 的通道包含了 Level 1 的每 2 個元素 (1, 5, 9)。當 p 值較小時，通道可視為稀疏的，可一次跳過許多元素，而在 p 值較大時，通道是密集的，需要在通道中比較多次的元素。通道的功用主要用來縮少搜尋的範圍，例如在 Figure 1 搜尋元素 6 ，從最高 Leve 2 開始搜尋 6 ，當遇到比 6 還大的值就停止，往節點的下個 Level 繼續搜尋，因此找到元素 9 比 6 大，接著就從 5 往 Level 1 搜尋，遇到 7 停止，從 5 往 Level 0 搜尋，找到 6 。在 p = 0.5 的 balanced skip list 之最壞情況，搜尋的複雜度需要 $O(logn)$。原始論文中 [Skip lists: a probabilistic alternative to balanced trees](https://15721.courses.cs.cmu.edu/spring2018/papers/08-oltpindexes1/pugh-skiplists-cacm1990.pdf) 是透過 fat keys 來實現，fat keys 包含了指向每個通道下個節點的指針以及數組資料，所有的節點皆是一致的，因此方便實作，但透過這樣的實作方式，會使得指向高 Level 下個節點的指針在大多情形都是指向 NULL ，且至少需要 $O(m*n)$ 的指針空間，其中 m 為通道的總數，故具有較差的空間使用率。而指針所指向的記憶體空間在大多的情況下為不連續，若採用上述的實作方式實現 skip list ，透過大量的指針操作來進行搜尋，勢必會遇到許多的 cache misses ，不利於現代的 cpu 執行。例如，在一個 64-bit architecture 上使用 32-bit 整數鍵值以及具有 5 個通道數的 skip list，每個節點需要佔 4 bytes + (5+1) * 8 bytes = 52 bytes 的空間大小，假設一個 cache line 為 64 bytes，而在遍歷 skip list 的節點時，會幾乎佔用了所有的 cache line 空間 (52 bytes)，因此大部分情況下都會發生 cache misses ，但其實僅需 4 bytes + 8 bytes = 12 bytes (數組+指向下個節點的指針) 的空間來存放即可。 ## Cache-Sensitive Skip List ![](https://i.imgur.com/PGeicPf.png) > Figure 2 為了能充分利用 cache ，最直覺的想法就是將**通道改為陣列表示**，如此一來具有 spatial locality 的特性可減少 cache misses ，也可允許 SIMD 指令的使用來加速操作。 Figure 2 包含了兩個通道 (Linearized Fast Lane Array) 用來管理 32 筆整數資料 (Data List)，其中 p = 0.5，例如搜尋 7 時的遍歷路徑如紅色指示所示，而 CSSL 是基於 balanced skip list 實作，故 Level i 的每 1/0.5 = 2 個元素為 Level i+1 的元素，因此**通道的元素數量是已知的**，也只需透過簡單計算得出元素後續的位置。此篇論文的實現是先假設最大鍵值 t 並根據 t 來預先分配一定的記憶體空間，當遇到插入的鍵值 n 大於 t 時，就得重新建立通道 (詳見更新策略)。和傳統的 skip list 相比，CSSL 僅需要更少的空間配置，假設鍵值的大小為 k 、指針大小為 r ，傳統 skip list 需要 $n*(m*r+r+k)$ 的空間，而 CSSL 只須 $n*(r+k)+\sum^{m}_{i=1}{p^i*n*k}$ ，其中 n 為資料量、m 為通道數。 ### Optimizations ![](https://i.imgur.com/R5H26WN.png) > Figure 3 進一步優化如下 1. 將通道大小調整為 cache line 大小的倍數，如 Figure 3。 2. 在最低 level 的通道與 data list 之間引入了一個附加的代理通道 (Proxy Lane Array)，如 Figure 2，用來存放指向 data 的指針。 3. 於最高 level 的通道上使用二元搜尋法，這是因為當通道數少的時候，最高 level 的通道所含的元素遠超過 1/p ，使得 cpu 在搜索最高 level 通道的成本往往非常高。 ### Updates 為了能夠支援即時更新 1. Inserting keys: 由於通道使用陣列來管理，因此插入新值的時候，需要使用大量的位移操作來使元素保持有序，所以該實作**僅直接將新值插入到 data list (common linked list) 的適當位置，直到 skip list 需要重新建立時再為其分配通道陣列空間**，可透過 [atomic compare-and-swap instruction](https://en.wikipedia.org/wiki/Compare-and-swap) 來實現 latch-free 的插入演算法。（？） 2. Deleting keys: 與插入相反，需先從通道陣列中刪除目標值，但如果將目標值改成 NULL 則需要重新排序通道陣列，因此改用**與目標值相鄰的元素來取代目標值當作刪除**，直到重建時才將重複的值刪除，而在 data list 中將指向目標節點的節點改為指向目標節點的後繼節點。 3. Updating keys: ## Algorithms 闡述查找與搜尋範圍值的演算法 1. Lookups: Pseudocode ``` Algorithm 1: lookup(key) 1: pos = binary_search_top_lane(flanes, key); 2: for (level = MAX_LEVEL - 1; level > 0; level--) { 3: rPos = pos - level_start_pos[level]; 4: while (key >= flanes[++pos]) 5: rPos++; 6: if (level == 1) break; 7: pos = level_start_pos[level-1] + 1/p * rPos; 8: } 9: if (key == flanes[--pos]) return key; 10: proxy = proxy_nodes[pos - level_start_pos[1]]; 11: for (i = 1; i < 1/p; i++) 12: if (key == proxy->keys[i]) return key; 13: return INT_MAX; ``` - 如果成功搜尋則回傳該值，否則回傳 INT_MAX 。 - `line 1` 透過二元搜尋法於最高 level 的通道中搜索目標值 key 。 - 其他層的通道則採循序搜尋，每個通道只須比較 1/p 個元素即可。 - `line 9` 若在最底層通道搜尋到目標值則立即回傳，否則將在 proxy lane 上搜尋 `line 10-12` ，仍搜尋不到的話則回傳 INT_MAX。 2. Range Queries: 搜尋並返回指向指定範圍的第一個和最後一個元素之指針。 Pseudocode ``` Algorithm 2: searchRange(start, end) 1: RangeSearchResult res; 2: pos = binary_search_top_lane(flanes, start); 3: for (level = MAX_LEVEL - 1; level > 0; level--) { 4: rPos = pos - level_start_pos[level]; 5: while (start >= flanes[++pos]) 6: rPos++; 7: if (level == 1) break; 8: pos = level_start_pos[level-1] + 1/p * rPos; 9: } 10: proxy = proxy_nodes[rPos]; 11: res.start = proxy->pointers[1/p - 1]->next; 12: for (i=0; i < 1/p; i++) { 13: if (start <= proxy->keys[i]) { 14: res.start = proxy->pointers[i]; break; 15: } 16: } 17: sreg = _mm256_castsi256_ps(_mm256_set1_epi32(end)); 18: while (rPos < level_items[1] - 8) { 19: creg = _mm256_castsi256_ps( 20: _mm256_loadu_si256((__m256i const *) &flanes[pos])); 21: res = _mm256_cmp_ps(sreg, creg, 30); 22: bitmask = _mm256_movemask_ps(res); 23: if (bitmask < 0xff) break; 24: pos += 8; rPos += 8; 25: } 26: pos--; rPos--; 27: while (end >= flanes[++pos] && rPos < level_items[1]) 28: rPos++; 29: proxy = proxy_nodes[rPos]; 30: res.end = proxy->pointers[1/p - 1]; 31: for (i=1; i < 1/p; i++) { 32: if (end < proxy->keys[i]) { 33: res.end = proxy->pointers[i - 1]; break; 34: } 35: } 36: return res; ``` - 先透過類似 Lookup 的方式尋找範圍中的第一個元素。 - 回到最底層的通道，透過 vectorized instructions 搜尋範圍中的最後一個元素，使用 [AVX] 使 CSSL 可並行處理 8 個 32-bit 的整數 `line 17-25` 。 - 透過 proxy lane 確認範圍中的最後一個元素，並回傳第一個和最後一個元素的指針。 ## Evaluation - 比較 range queries, lookup, mixed workload, space consumption, real-world data from the bioinformatics domain 的性能 - 以 throughput 作為搜尋的評估指標，即每秒處理多少個查詢。 - n 筆 32-bit 的整數鍵值，對 [1, n] 中的每個鍵值進行索引當作密集的分佈資料; 而稀疏的分佈資料則是從 [1, $2^{31}$] 中隨機取 n 筆出來。 - 採用 CSSL2 (p = 1/2) 和 CSSL5 (p = 1/5) 兩種配置來評估，而通道數設定為 9 。 - 與下列方法比較: - the adaptive radix tree (ART) (a recent radix tree variant designed for main memory) - the CSB+-tree (a cache-sensitive variant of the B+-tree) - a binary search (BS) on a static array - a $B^+$-tree [1] as baseline approach - 測試環境: a Intel Xeon E5-2620 CPU with 6 cores, 12 threads, 15 MB Level 3 Cache, 256-bit SIMD registers (AVX) and a clock speed of 2 GHz. The evaluation system runs Linux and has 32 GB RAM. All experiments are single-threaded. All competitors including CSSL were compiled with GCC 4.8.4 using optimization -O3. We use PAPI to collect performance counters. ### Range Queries ![](https://i.imgur.com/lh4ANV4.png) ![](https://i.imgur.com/OeXjsbD.png) ### Lookup ![](https://i.imgur.com/ORrGq5R.png) ### Mixed Workload ![](https://i.imgur.com/KLXQKHO.png) ### Evaluation with Genomic Data ![](https://i.imgur.com/BrQEwiN.png) ### Space Consumption ![](https://i.imgur.com/mVUQQt8.png) ## Conclusions - CSSL linearizes fast lanes to achieve a CPU-friendly data layout, to reduce cache misses, and to enable the usage of SIMD instructions. - CSSL’s search performance and memory consumption is influenced by the number of elements each fast lane skips over (1/p). - Sparse fast lanes show better results regarding memory consumption and range query execution.