# Designing a Fast, Efficient, Cache-friendly Hash Table, Step by Step

std::unordered_set -> node_hash_set -> dense_hash_set -> **flat_hash_set (awesome)**

Reference
---
**CppCon 2017: Matt Kulukundis "Designing a Fast, Efficient, Cache-friendly Hash Table, Step by Step": https://www.youtube.com/watch?v=ncHmEUmJZf4**

Abseil's Open Source Hashtables: 2 Years In - Matt Kulukundis - CppCon 2019: https://www.youtube.com/watch?v=JZE3_0qvrMg&t=1517s&ab_channel=CppCon

abseil/abseil-cpp: https://github.com/abseil/abseil-cpp

google/hashtable-benchmarks: https://github.com/google/hashtable-benchmarks

(updated 7/23) https://abseil.io/docs/cpp/guides/container

(updated 7/24) greg7mdp/parallel-hashmap: https://github.com/greg7mdp/parallel-hashmap

(updated 12/10) Comprehensive C++ Hashmap Benchmarks 2022: https://martin.ankerl.com/2022/08/27/hashmap-bench-01/

---

+ table == map == set
+ Aiming for 98% true (at the time of 2017)
+ Assume we have a good hash function
+ Not aiming to follow std
  + E.g., elements **do not get pointer stability**: they move when the table rehashes. If you require pointer stability, consider `absl::node_hash_map` or `absl::node_hash_set`.

Background
---
Right now, about 1% of CPUs are computing inside a hash table, and more than 4% of RAM is owned by hash tables. And that is just the C++ ones.

### cold or hot?

**hot:** the entire table fits in L1 cache -> it's all about instructions
**cold:** it doesn't -> it's all about L1 cache misses

![](https://i.imgur.com/hnbeOso.png)

Insights from Google's code base
---
+ Nobody calls `erase()`. Only two operations really matter: `insert()` and `find()`. (Except for the people who need efficient table scans; that's just iterating over the whole table.)
+ Nobody uses the bucket interface. Literally nobody.
+ Nobody uses load factor correctly.

> I audited every user of load factor in the Google repo. They were all wrong.

Get Started
---
### std::unordered_set

![](https://i.imgur.com/wzsi70x.png)

H: hash code, V: value. Each bucket points to a chain terminated by a null pointer.
load factor -> 6/10 here -> used to decide when to grow

![](https://i.imgur.com/zn8WpGS.png)

![](https://i.imgur.com/qthYwgO.png)

What can we learn from this picture? We store the hash and the chain pointer: that's an extra 16 bytes per entry.

`find()`

![](https://i.imgur.com/P8oq0dA.png)

4 different memory accesses, not good.

![](https://i.imgur.com/LIr7MDm.png)

Remove the memory indirection.
- gain: less memory
- loss: iteration requires scanning not just our elements but the entire array -> O(capacity) instead of O(size)

Usually not a problem, except for the genius who inserts 10 million elements -> calls `clear()` -> inserts 3 elements -> does a full table scan. But maybe we can give up on them.

![](https://i.imgur.com/i9Dmrzd.png)

Remove the stored hash value. What did we lose?
- When the table resizes, we need to re-compute every hash.
- `find()` has to run `== V` comparisons more often.

What did we gain? About 0.1% of fleet-wide RAM.

What if we remove the pointers?

![](https://i.imgur.com/BPrKt9C.png)

It's a probing table now; no more chaining. What did we gain? About 0.1% of fleet-wide RAM. Again.

How about buckets? Nobody uses the bucket interface!

How does `find()` work?

![](https://i.imgur.com/lIkkCaz.png)

Why not 4 memory accesses? Access 3 lands right next to access 1 in cache, so it's free -> 3 memory accesses.

How about `erase()`?

![](https://i.imgur.com/p2IQg7S.png)

We just add **sentinels**. But how about **load factor**? Nobody uses load factor correctly anyway.

Did we just break std compatibility? Yes.
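Before giving this design a name, here is a minimal sketch of the probing table just described: linear probing over a flat array of values, with reserved sentinel values marking empty and erased slots. `ProbingSet`, `kEmpty`, and `kDeleted` are illustrative names, not Google's code; growth and rehashing are omitted, so the table must never fill up.

```cpp
#include <cstddef>
#include <functional>
#include <vector>

// Sentinel values the user promises never to insert.
constexpr int kEmpty   = -1;  // slot was never used
constexpr int kDeleted = -2;  // tombstone left behind by erase()

class ProbingSet {
 public:
  explicit ProbingSet(std::size_t capacity) : slots_(capacity, kEmpty) {}

  bool find(int v) const {
    std::size_t i = std::hash<int>{}(v) % slots_.size();
    while (true) {
      if (slots_[i] == v) return true;        // hit
      if (slots_[i] == kEmpty) return false;  // a probe chain ends at empty
      i = (i + 1) % slots_.size();            // kDeleted or mismatch: keep probing
    }
  }

  void insert(int v) {
    std::size_t i = std::hash<int>{}(v) % slots_.size();
    // Probe until we find v (already present) or an empty slot.
    // (A real table would also reuse kDeleted slots here.)
    while (slots_[i] != kEmpty && slots_[i] != v)
      i = (i + 1) % slots_.size();
    slots_[i] = v;
  }

  void erase(int v) {
    std::size_t i = std::hash<int>{}(v) % slots_.size();
    while (slots_[i] != kEmpty) {
      if (slots_[i] == v) {
        slots_[i] = kDeleted;  // can't mark empty: that would cut probe chains
        return;
      }
      i = (i + 1) % slots_.size();
    }
  }

 private:
  std::vector<int> slots_;
};
```

The tombstone in `erase()` is exactly why sentinels matter: marking the slot empty would make `find()` stop early for any element whose probe chain passes through it.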
**Hyrum's Law**

![](https://i.imgur.com/qlE72Iq.png)

The standard guarantees this code never triggers a rehash.

> I've looked in Google. Nobody relies on this, nobody.

Let's give it a name: **node_hash_set**

### node_hash_set

![](https://i.imgur.com/j1zylYH.png)

Performance of finding 4-byte elements that are present:

![](https://i.imgur.com/6Yg5fmk.png)

2x speedup, less memory.

> We're gonna leave the standard further behind.

### dense_hash_set

![](https://i.imgur.com/yl1H57v.png)

![](https://i.imgur.com/1HKRh5d.png)

sentinels: the user defines a value that can never occur in the table, and likewise defines what "empty" looks like.

![](https://i.imgur.com/hHwuhtW.png)

4~5x better, totally worth those sentinels.

Can we get rid of sentinels?

![](https://i.imgur.com/KyHhKuB.png)

### flat_hash_set

![](https://i.imgur.com/F3qYqth.png)

full, empty or deleted -> 1.5 bits per element?

![](https://i.imgur.com/SAK2P4j.png)

Each control byte spends 1 bit on "am I a sentinel?" (empty or deleted); if the slot is full, the remaining 7 bits store a piece of the hash code.

2-level hash

![](https://i.imgur.com/hoBniXp.png)

Note that we can balance/trade off H1 and H2 collisions.

![](https://i.imgur.com/HMgE5XU.png)

![](https://i.imgur.com/FI5tX1L.png)

`find()`

![](https://i.imgur.com/fdCgUXE.png)

![](https://i.imgur.com/zbSNO9Z.png)

2 memory accesses. It still probes, but most of the probing happens in the control bytes -> cache friendly, fits in L1 cache.

![](https://i.imgur.com/Rb8DXGP.png)

![](https://i.imgur.com/cB9A4Q8.png)

![](https://i.imgur.com/byBsrT0.png)

`_mm_set1_epi8` broadcasts one byte across a 128-bit vector.

![](https://i.imgur.com/wSecB3d.png)

![](https://i.imgur.com/3BUwpw1.png)

Compare byte-wise, then take each byte's MSB -> a 16-bit mask telling us which of our 16 control bytes hold the right H2 hash.

Put it all together:

![](https://i.imgur.com/3QwTDyN.png)

16 probes in 3 instructions. WTF.

![](https://i.imgur.com/C5fAm04.png)

![](https://i.imgur.com/QDlq7Lk.png)

16 control bytes plus 16 slots form a **group**; a table is N groups.

`find()`

![](https://i.imgur.com/bQtRE74.png)

![](https://i.imgur.com/nnnwgzl.png)

Observation
---
![](https://i.imgur.com/au74sPm.png)

Remember we have a good hash function? And remember that we can balance/trade off collisions between H1 (across groups) and H2 (within a group's control bytes).

![](https://i.imgur.com/gBVFuZM.png)

`erase()`

![](https://i.imgur.com/XBF4O1R.png)

We only need a tombstone when the group is full; otherwise we just mark the slot empty again. And by the previous observation, a full group is unlikely with a good hash, so tombstones are rare. It's a win-win.

### conclusions

flat_hash_set = metadata (cache friendly) + SSE (fewer instructions)

Performance
---
![](https://i.imgur.com/2A6KiP9.png)

> We've even begun migrating, rolling out these across all of Google. We are sending changes to everybody's code in Google. It's like, hey, you got an unordered set. Switch you to this one. It's pretty awesome.

![](https://i.imgur.com/Q44SPNu.png)

blue: **std::unordered_set**
red: **dense_hash_set**
yellow: **flat_hash_set**
green: **__gnu_cxx::hash_map**

**Goal: replace them all**

![](https://i.imgur.com/oKgzJUF.png)

**Hyrum's Law**

![](https://i.imgur.com/Ork7ySR.png)

> Pretty sure somebody is going to start using flat_hash_set to generate random bits. And then they're gonna complain when I change the distribution.

Q&A
---
Because no one knows how to use max load factor correctly, the method that sets it is a no-op. You're welcome.

On non-SSE platforms, the SSE instructions are emulated.
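To make the group probe concrete, here is a minimal sketch of the H1/H2 split and the 16-way SSE match described above. The constants and helper names are illustrative (they mirror the talk, not necessarily Abseil's exact encoding); the real implementation lives in `absl/container/internal/raw_hash_set.h`.

```cpp
#include <cstdint>
#include <emmintrin.h>  // SSE2

// Control bytes: the sign bit marks a sentinel (empty/deleted); a full slot
// stores 0b0xxxxxxx, the 7-bit H2 fragment of the element's hash.
constexpr int8_t kEmpty   = -128;  // 0b10000000
constexpr int8_t kDeleted = -2;    // 0b11111110

// 2-level hash: H1 picks the group to probe, H2 goes into the control byte.
inline uint64_t H1(uint64_t hash) { return hash >> 7; }
inline int8_t   H2(uint64_t hash) { return hash & 0x7F; }

// "16 probes in 3 instructions": compare all 16 control bytes of one group
// against h2 at once. Bit i of the result is set iff control byte i matches.
inline uint32_t MatchH2(const int8_t* group_ctrl, int8_t h2) {
  __m128i ctrl  = _mm_loadu_si128(reinterpret_cast<const __m128i*>(group_ctrl));
  __m128i match = _mm_cmpeq_epi8(_mm_set1_epi8(h2), ctrl);  // 0xFF where equal
  return static_cast<uint32_t>(_mm_movemask_epi8(match));   // MSBs -> 16-bit mask
}
```

`find()` then walks groups by H1: within each group it compares only the slots named by the set bits of `MatchH2`, and the same trick on `kEmpty` bytes tells it when the probe can stop.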
Swiss tables are heavily used in TensorFlow
---
```
git clone https://github.com/tensorflow/tensorflow.git
grep -nri "absl::node_hash_" tensorflow/
```

![](https://i.imgur.com/GwI5Ea3.png)

```
grep -nri "absl::flat_hash_" tensorflow/
```

![](https://i.imgur.com/UheZW0o.png)

![](https://i.imgur.com/NVzw7fA.png)

![](https://i.imgur.com/6kJioMo.png)

...

However, there is still a lot of `std::unordered_{set, map}` in the repo.

(Updated 7/23)

Migration notes
---
Notes on migrating from `std::unordered_{set, map}` to `absl::{node, flat}_hash_{set, map}`.

Reference: https://abseil.io/docs/cpp/guides/container

Tip: grep for `NOTE:` in `{node, flat}_hash_{set, map}.h` to find the differences from the STL.

1. **`void erase(const_iterator pos)`** instead of `iterator erase(const_iterator pos)`

```
// void erase(const_iterator pos):
//
// Erases the element at `position` of the `flat_hash_set`, returning
// `void`.
//
// NOTE: returning `void` in this case is different than that of STL
// containers in general and `std::unordered_set` in particular (which
// return an iterator to the element following the erased element). If that
// iterator is needed, simply post increment the iterator:
//
// set.erase(it++);
```

https://github.com/abseil/abseil-cpp/blob/701185dbce17a2f49334027ca3cb5788a5d06c6d/absl/container/flat_hash_set.h#L215

(A compilable sketch of this pattern appears at the end of this note.)

2. Pointer stability

If you require pointer stability of values (but not keys), or if your values are large, consider using `absl::flat_hash_map<Key, std::unique_ptr<Value>>` instead. If pointer stability is required for both keys and values, use `absl::node_hash_map`.

https://abseil.io/docs/cpp/guides/container#recommendation
https://github.com/abseil/abseil-cpp/blob/701185dbce17a2f49334027ca3cb5788a5d06c6d/absl/container/flat_hash_map.h#L83

3. Don't depend on iteration order

https://abseil.io/docs/cpp/guides/container#iteration-order-instability

**Conclusion**

- If you are not sure (e.g., it's hard to tell whether the code relies on pointer stability), replace old code with `node_hash_{set, map}` first.
- Use `flat_hash_{set, map}` by default in new code.

(Updated 7/24)

The Parallel Hashmap
---
GitHub: https://github.com/greg7mdp/parallel-hashmap
Write-up: https://greg7mdp.github.io/parallel-hashmap/

Based on flat_hash_map, it splits the hash table into 16 submaps:

- reduced peak memory on resize -> ideally only one of the 16 submaps needs to resize at a time
- insertions can run in parallel -> faster

(A minimal usage sketch appears at the end of this note.)

![](https://i.imgur.com/Ve9wQtx.png)

![](https://i.imgur.com/RkesfZj.png)

(Updated 12/10)

Comprehensive C++ Hashmap Benchmarks
---
https://martin.ankerl.com/2022/08/27/hashmap-bench-01

The flat and node maps score poorly on memory usage, probably because:

> The number presented here calculates the geometric mean of the two very memory-heavy benchmarks, **Insert then Erase 100M int and Random Insert & Erase uint64_t**.

And Google said nobody erases. Table scans also score poorly, which matches the O(capacity) iteration cost noted earlier.
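As referenced in migration note 1 above, a minimal compilable example of the `erase(it++)` pattern (assumes abseil-cpp is available to the build; the even/odd filter is just for illustration):

```cpp
#include <iostream>

#include "absl/container/flat_hash_set.h"

int main() {
  absl::flat_hash_set<int> set = {1, 2, 3, 4, 5};

  // Erase all even elements while iterating. The STL style
  // `it = set.erase(it);` does not compile: absl's
  // erase(const_iterator) returns void. Post-increment instead,
  // so `it` already points at the next element when the current
  // one is erased.
  for (auto it = set.begin(); it != set.end();) {
    if (*it % 2 == 0) {
      set.erase(it++);
    } else {
      ++it;
    }
  }

  for (int v : set) std::cout << v << "\n";  // iteration order is unspecified
}
```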
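And the usage sketch promised in the parallel-hashmap section: `phmap::parallel_flat_hash_map` keeps the flat_hash_map interface, so adopting it is mostly an include and a type change (the 16-submap default corresponds to the library's N = 4 template parameter, per its README).

```cpp
#include <iostream>
#include <string>

#include <parallel_hashmap/phmap.h>

int main() {
  // Drop-in for flat_hash_map; internally split into 2^4 = 16 submaps,
  // which is what caps peak memory during resize.
  phmap::parallel_flat_hash_map<std::string, int> counts;
  ++counts["insert"];
  ++counts["find"];
  ++counts["insert"];
  for (const auto& [word, n] : counts)
    std::cout << word << ": " << n << "\n";  // order unspecified
}
```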