<style>
/* :root { --r-heading-color: #ebebd7; --r-main-color: #ebebd7; } */
.slide-background {
  background: #222 /* #010123 !important; */
}
.reveal {
  font-size: 24px;
}
.reveal section img {
  /*filter: invert(85%) sepia(20%) saturate(100%) hue-rotate(29deg) brightness(85%)
  background:none; border:none; box-shadow:none; */
}
code {
  font-size: 16px;
  line-height: 1.2em;
}
/*h1, h2, h3 { text-transform: none !important; }*/
</style>

# Optimizing Binary Search

C++ Zero Cost Conf 2022

Sergey Slotin

---

## About me

- Author of [Algorithms for Modern Hardware](https://en.algorithmica.org/hpc/) (en.algorithmica.org/hpc)
- Designed the world's fastest **binary search**, B-tree, integer parsing, factorization, prefix sum, array searching, argmin… (github.com/sslotin/amh-code)
- Email: me@sereja.me; Twitter: [@sergey_slotin](https://twitter.com/sergey_slotin); Telegram: [@bydlokoder](https://t.me/bydlokoder)

----

## Outline

- Vanilla binary search (Babylonians, 200 BC; Mauchly, 1946)
- Branchless binary search: 3x on small arrays (LLVM contributors, early 2010s–…)
- Eytzinger binary search: 2x on large arrays (Khuong & Morin, 2015)
- S-tree: 5-8x (Intel, 2010; Slotin, 2020)
- S+ tree: 7-15x (Slotin, 2022)

---

## The problem

```cpp
// perform any O(n) precomputation with a sorted array
void prepare(int *a, int n);

// return the first element y ≥ x in O(log n)
int lower_bound(int x);
```

Random keys & queries, all 32-bit integers; optimize for throughput

----

**Comparison-based search only**

(no lookup tables, tries, interpolation searches, etc.)

<!-- https://miro.medium.com/max/1400/1*GvdVNpUC_d6n80ZJNqrG1A.png =500x) -->

----

```cpp
int lower_bound(int x) {
    int l = 0, r = n - 1;
    while (l < r) {
        int m = (l + r) / 2;
        if (t[m] >= x)
            r = m;
        else
            l = m + 1;
    }
    return t[l];
}
```

As found in the first 50 pages of any CS textbook

----

```cpp
template <class _Compare, class _ForwardIterator, class _Tp>
_LIBCPP_CONSTEXPR_AFTER_CXX17 _ForwardIterator
__lower_bound(_ForwardIterator __first, _ForwardIterator __last, const _Tp& __value_, _Compare __comp) {
    typedef typename iterator_traits<_ForwardIterator>::difference_type difference_type;
    difference_type __len = _VSTD::distance(__first, __last);
    while (__len != 0) {
        difference_type __l2 = _VSTD::__half_positive(__len);
        _ForwardIterator __m = __first;
        _VSTD::advance(__m, __l2);
        if (__comp(*__m, __value_)) {
            __first = ++__m;
            __len -= __l2 + 1;
        } else
            __len = __l2;
    }
    return __first;
}
```

github.com/llvm-mirror/libcxx/blob/master/include/algorithm#L4169

----

```nasm
      │35:   mov  %rax,%rdx
 0.52 │      sar  %rdx
 0.33 │      lea  (%rsi,%rdx,4),%rcx
 4.30 │      cmp  (%rcx),%edi
65.39 │    ↓ jle  b0
 0.07 │      sub  %rdx,%rax
 9.32 │      lea  0x4(%rcx),%rsi
 0.06 │      dec  %rax
 1.37 │      test %rax,%rax
 1.11 │    ↑ jg   35
```

perf record & perf report ($n=10^6$)

<!-- (comment: maybe live demo?) -->
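----

For calibration, the baseline being profiled above is just a thin wrapper over `std::lower_bound` implementing the `prepare`/`lower_bound` interface from the problem statement. A minimal sketch (the globals `t` and `n` follow the convention used in the rest of the talk):

```cpp
#include <algorithm>

int n;  // number of keys
int *t; // the sorted keys themselves

void prepare(int *a, int _n) {
    t = a, n = _n; // no precomputation needed: search the sorted array as-is
}

int lower_bound(int x) {
    // assumes x ≤ t[n - 1], e.g. the array is padded with INT_MAX
    return *std::lower_bound(t, t + n, x);
}
```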
----

![](https://en.algorithmica.org/hpc/data-structures/img/search-std.svg =500x)

Zen 2 @ 2GHz, DDR4 @ 2667MHz

$\frac{100}{12} \times 2 \approx 17$ cycles per iteration seems too high for L1 <!-- .element: class="fragment" data-fragment-index="1" -->

---

## CPU pipeline

To execute *any* instruction, a lot of preparatory work is needed:

- **fetch** a chunk of machine code from memory
- **decode** it and split it into instructions
- **execute** these instructions (possibly involving some **memory** operations)
- **write** the results back into registers

This whole sequence takes 15-20 CPU cycles

----

*Pipelining*: at each stage, the next instruction is processed right away, without waiting for the previous one to fully complete

![](https://en.algorithmica.org/hpc/pipelining/img/pipeline.png =350x)

Doesn't reduce the *actual* latency of an instruction, but makes it *look* as if only the execute and memory stages take time

----

*Pipeline hazard*: the next instruction can't execute on the following clock cycle

![](https://simplecore-ger.intel.com/techdecoded/wp-content/uploads/sites/11/figure-6-3.png =350x)

- *Structural hazard*: two or more instructions need the same part of the CPU
- *Data hazard*: need to wait for operands to be computed by a previous step
- *Control hazard*: the CPU can't tell which instructions to execute next

**^ requires a pipeline flush on branch misprediction** <!-- .element: class="fragment" data-fragment-index="1" -->

---

## Removing branches

```cpp
int lower_bound(int x) {
    int l = 0, r = n - 1;
    while (l < r) { // <- this branch can be easily predicted
        int m = (l + r) / 2;
        if (t[m] >= x) // <- this branch is what we care about
            r = m;
        else
            l = m + 1;
    }
    return t[l];
}
```

----

```cpp
int lower_bound(int x) {
    int *base = t, len = n;
    while (len > 1) {
        int half = len / 2;
        if (base[half - 1] < x) {
            base += half;
            len = len - half;
        } else {
            len = half;
        }
    }
    return *base;
}
```

----

```cpp
int lower_bound(int x) {
    int *base = t, len = n;
    while (len > 1) {
        int half = len / 2;
        if (base[half - 1] < x)
            base += half;
        len -= half; // = ceil(len / 2)
    }
    return *base;
}
```

**Exactly** $\lceil \log_2 n \rceil$ iterations, so not strictly equivalent to binary search <!-- .element: class="fragment" data-fragment-index="1" -->

----

```cpp
int lower_bound(int x) {
    int *base = t, len = n;
    while (len > 1) {
        int half = len / 2;
        base += (base[half - 1] < x) * half; // will be replaced with a "cmov"
        len -= half;
    }
    return *base;
}
```

----

![](https://en.algorithmica.org/hpc/data-structures/img/search-branchless.svg =500x)

Why worse on larger arrays? <!-- .element: class="fragment" data-fragment-index="1" -->

*Prefetching (speculation)* <!-- .element: class="fragment" data-fragment-index="2" -->
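----

Whether the comparison two slides back really becomes a predicated `cmov` rather than a branch is worth verifying, not assuming, e.g. in the perf annotation or `objdump -d` output. The pattern boils down to this branch-free select (an isolated sketch, not part of the benchmark):

```cpp
// no control dependency on the comparison, only a data dependency;
// GCC and Clang typically emit cmov/lea for this at -O2
int *advance(int *base, int half, int x) {
    return (base[half - 1] < x) ? base + half : base;
}
```

The flip side: with the branch gone, the CPU no longer speculates past the comparison and stops loading the next element early, which is why we now have to prefetch manually (next slide)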
----

```cpp
int lower_bound(int x) {
    int *base = t, len = n;
    while (len > 1) {
        int half = len / 2;
        len -= half;
        __builtin_prefetch(&base[len / 2 - 1]);        // middle of the left half
        __builtin_prefetch(&base[half + len / 2 - 1]); // middle of the right half
        base += (base[half - 1] < x) * half;
    }
    return *base;
}
```

----

![](https://en.algorithmica.org/hpc/data-structures/img/search-branchless-prefetch.svg =500x)

Still grows slightly faster than the branchy version, which speculates more than one iteration ahead and thus effectively prefetches deeper

---

## Cache locality

![](https://en.algorithmica.org/hpc/data-structures/img/binary-search.png =500x)

- *Temporal locality*: only okay for the first dozen or so requests <!-- .element: class="fragment" data-fragment-index="1" -->
- *Spatial locality*: only okay for the last 3-4 requests <!-- .element: class="fragment" data-fragment-index="2" -->

(Memory is fetched in 64B blocks called *cache lines*) <!-- .element: class="fragment" data-fragment-index="2" -->

----

![](https://en.algorithmica.org/hpc/data-structures/img/binary-search.png =500x)

![](https://en.algorithmica.org/hpc/data-structures/img/binary-heat.png =500x)

Hot and cold elements are grouped together (the opposite of what we want)

----

![](https://upload.wikimedia.org/wikipedia/commons/5/5c/Eytzinger_-_Thesaurus_principum.jpg =500x)

**Michaël Eytzinger** is a 16th-century Austrian nobleman known for his genealogy work; in particular, for a compact ancestor numbering system known as *ahnentafel* (German for "ancestor table")

----

![](https://upload.wikimedia.org/wikipedia/commons/5/5c/Eytzinger_-_Thesaurus_principum.jpg =500x)

- The subject of the ahnentafel is listed as number 1
- The father of person $k$ is listed as $2k$, and their mother is listed as $(2k+1)$

----

![](https://en.algorithmica.org/hpc/data-structures/img/eytzinger.png =500x)

Used in heaps, segment trees, and other binary tree structures

(Note: the resulting tree is not strictly equivalent to binary search)

----

![](https://en.algorithmica.org/hpc/data-structures/img/eytzinger-search.png =600x)

![](https://en.algorithmica.org/hpc/data-structures/img/eytzinger-heat.png =600x)

Temporal locality is much better

----

### Building

```cpp
int a[n], t[n + 1]; // the original sorted array and the eytzinger array we build
// ^ we need one element more because of one-based indexing

void eytzinger(int k = 1) {
    static int i = 0;
    if (k <= n) {
        eytzinger(2 * k);
        t[k] = a[i++];
        eytzinger(2 * k + 1);
    }
}
```

The sorted array can be permuted this way in $O(n)$ time
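----

A tiny worked example of the layout (a hypothetical check, not part of the benchmark): permuting the sorted array $1..10$ and printing the result

```cpp
#include <cstdio>

const int n = 10;
int a[] = {1, 2, 3, 4, 5, 6, 7, 8, 9, 10}; // the sorted array
int t[n + 1];                              // its Eytzinger permutation

void eytzinger(int k = 1) { // same in-order traversal as the previous slide
    static int i = 0;
    if (k <= n) {
        eytzinger(2 * k);
        t[k] = a[i++];
        eytzinger(2 * k + 1);
    }
}

int main() {
    eytzinger();
    for (int k = 1; k <= n; k++)
        printf("%d ", t[k]); // prints "7 4 9 2 6 8 10 1 3 5"
}
```

Element 7 lands in the root cell $t_1$, its children 4 and 9 in $t_2$ and $t_3$, and so on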
----

### Searching

```cpp
int k = 1;
while (k <= n)
    k = 2 * k + (t[k] < x);
```

But how do we get the lower bound itself? <!-- .element: class="fragment" data-fragment-index="1" -->

----

![](https://i.imgur.com/94CmCA7.png =700x)

$k$ does not necessarily point to the lower bound

…but we compared against it at some point, and after that we kept going "right" <!-- .element: class="fragment" data-fragment-index="1" -->

----

```cpp
int lower_bound(int x) {
    int k = 1;
    while (k <= n)
        k = 2 * k + (t[k] < x);
    k >>= __builtin_ffs(~k);
    return t[k];
}
```

- Right turns are recorded in the binary notation of $k$ as 1-bits
- To "cancel" them, we right-shift $k$ by the number of trailing ones plus one, which is exactly what `ffs(~k)` computes (`ffs` is 1-based)

----

![](https://en.algorithmica.org/hpc/data-structures/img/search-eytzinger.svg =500x)

Worse than branchless on large arrays because the last 3-4 iterations are not cached

----

```cpp
int lower_bound(int x) {
    int k = 1;
    while (k <= n) {
        __builtin_prefetch(&t[k * 2]);
        __builtin_prefetch(&t[k * 2 + 1]);
        k = 2 * k + (t[k] < x);
    }
    k >>= __builtin_ffs(~k);
    return t[k];
}
```

----

```cpp
int lower_bound(int x) {
    int k = 1;
    while (k <= n) {
        __builtin_prefetch(&t[k * 2]);
        //__builtin_prefetch(&t[k * 2 + 1]); // adjacent to t[k * 2]: almost always the same cache line
        k = 2 * k + (t[k] < x);
    }
    k >>= __builtin_ffs(~k);
    return t[k];
}
```

----

```cpp
int lower_bound(int x) {
    int k = 1;
    while (k <= n) {
        __builtin_prefetch(&t[k * 4]);
        __builtin_prefetch(&t[k * 4 + 1]);
        __builtin_prefetch(&t[k * 4 + 2]);
        __builtin_prefetch(&t[k * 4 + 3]);
        k = 2 * k + (t[k] < x);
    }
    k >>= __builtin_ffs(~k);
    return t[k];
}
```

----

```cpp
int lower_bound(int x) {
    int k = 1;
    while (k <= n) {
        __builtin_prefetch(&t[k * 4]);
        //__builtin_prefetch(&t[k * 4 + 1]); // the four grandchildren span
        //__builtin_prefetch(&t[k * 4 + 2]); // at most two cache lines
        //__builtin_prefetch(&t[k * 4 + 3]);
        k = 2 * k + (t[k] < x);
    }
    k >>= __builtin_ffs(~k);
    return t[k];
}
```

----

```cpp
alignas(64) int t[n + 1]; // allocate the array at the beginning of a cache line

int lower_bound(int x) {
    int k = 1;
    while (k <= n) {
        __builtin_prefetch(&t[k * 16]);
        k = 2 * k + (t[k] < x);
    }
    k >>= __builtin_ffs(~k);
    return t[k];
}
```

----

![](https://en.algorithmica.org/hpc/data-structures/img/eytzinger.png =500x)

- When we fetch element $k \cdot 16$, we also fetch its 15 cache line neighbors
- If the array is **aligned**, they are guaranteed to be elements $[k \cdot 16, k \cdot 16 + 15]$
- As memory requests are pipelined, we are always prefetching 4 levels in advance

----

![](https://en.algorithmica.org/hpc/data-structures/img/search-eytzinger-prefetch.svg =500x)

- In Eytzinger binary search, we trade off **bandwidth** for **latency** with prefetching <!-- .element: class="fragment" data-fragment-index="1" -->
- Doesn't work well in bandwidth-constrained environments (multi-threading) <!-- .element: class="fragment" data-fragment-index="1" -->
- Instead, we can fetch fewer cache lines <!-- .element: class="fragment" data-fragment-index="2" -->

---

## B-trees

![](https://en.algorithmica.org/hpc/data-structures/img/b-tree.jpg =500x)

$B \ge 2$ keys in each node, increasing the branching factor and reducing the tree height

$O(\log_B n \cdot B)$ comparisons in total with linear search in each node

----

- For RAM, pick $B$ so that one node fills exactly one cache line ($B = 16$ for 32-bit keys) and no reads are wasted
- $B=16$ reduces the tree height by $\frac{\log_2 n}{\log_{17} n} = \frac{\log 17}{\log 2} = \log_2 17 \approx 4.09$ times

----

We can make static B-trees **implicit** by generalizing the Eytzinger numeration:

- The root node is numbered $0$
- Node $k$ has $(B + 1)$ child nodes numbered $\{k \cdot (B + 1) + i + 1\}$ for $i \in [0, B]$

----

```cpp
const int B = 16;

int nblocks = (n + B - 1) / B;
int btree[nblocks][B];

int go(int k, int i) {
    return k * (B + 1) + i + 1;
}

void build(int k = 0) {
    static int t = 0;
    if (k < nblocks) {
        for (int i = 0; i < B; i++) {
            build(go(k, i));
            btree[k][i] = (t < n ? a[t++] : INT_MAX);
        }
        build(go(k, B));
    }
}
```
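----

One detail assumed by the SIMD code on the next slides (carried over from the full implementation in the repo): the aligned AVX2 loads require each node to start at a 32-byte boundary, so the node array should be cache-line aligned

```cpp
// a sketch; assumes nblocks is a compile-time constant,
// as it is in the benchmark setup
alignas(64) int btree[nblocks][B]; // one node = one cache line = two AVX2 loads
```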
----

<!-- todo: validate, replace ffs with ctz, and transfer code to the article -->

```cpp
// compute the "local" lower bound in a node
int rank(int x, int *node) {
    for (int i = 0; i < B; i++)
        if (node[i] >= x)
            return i;
    return B;
}
```

$B/2$ comparisons on average, unpredictable branching

----

```cpp
int rank(int x, int *node) {
    int mask = (1 << B); // sentinel bit: makes rank = B when all keys are less than x
    for (int i = 0; i < B; i++)
        mask |= (node[i] >= x) << i;
    return __builtin_ffs(mask) - 1;
}
```

No branching, but even more comparisons

----

![](https://i.imgur.com/F4pL8MU.png =600x)

SIMD: Single Instruction, Multiple Data

----

```cpp
typedef __m256i reg;

// compute an 8-bit mask corresponding to "<" elements
int cmp(reg x_vec, int* y_ptr) {
    reg y_vec = _mm256_load_si256((reg*) y_ptr); // load 8 sorted elements
    reg mask = _mm256_cmpgt_epi32(x_vec, y_vec); // compare against the key
    return _mm256_movemask_ps((__m256) mask);    // extract the 8-bit mask
}
```

Zen 2 supports AVX2, so we can compare $256/32=8$ keys at a time

----

```cpp
int rank(reg x, int *node) {
    int mask = ~(
        cmp(x, node) +
        (cmp(x, node + 8) << 8)
    );
    return __builtin_ffs(mask) - 1;
    // alternative: popcount
}
```

----

```cpp
int lower_bound(int _x) {
    int k = 0, res = INT_MAX;
    reg x = _mm256_set1_epi32(_x);
    while (k < nblocks) {
        int i = rank(x, btree[k]);
        if (i < B) // the local lower bound doesn't have to exist in the node
            res = btree[k][i];
        k = go(k, i);
    }
    return res;
}
```

----

![](https://en.algorithmica.org/hpc/data-structures/img/search-btree.svg =500x)

"S-tree": <u>s</u>tatic, <u>s</u>uccinct, <u>s</u>peedy, <u>S</u>IMDized

----

```cpp
int lower_bound(int _x) {
    int k = 0, res = INT_MAX;
    reg x = _mm256_set1_epi32(_x);
    while (k < nblocks) {
        int i = rank(x, btree[k]);
        if (i < B)
            res = btree[k][i];
        k = go(k, i);
    }
    return res;
}
```

1. The update is unnecessary 16 times out of 17 <!-- .element: class="fragment" data-fragment-index="1" -->
2. The non-constant number of iterations causes branch mispredictions <!-- .element: class="fragment" data-fragment-index="2" -->
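----

Both problems are fixable. For the first one, the conditional update can be turned into a predicated move so that the mispredicted branch disappears; a sketch (an illustrative helper, close to but not necessarily identical to the repo's code):

```cpp
void update(int &res, int *node, unsigned i) {
    int candidate = node[i & (B - 1)]; // always a valid in-node load, even when i == B
    if (i < B)                         // simple enough for the compiler to emit a cmov
        res = candidate;
}
```

The second problem goes away if the tree height is made a compile-time constant, so that the loop can be fully unrolled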
<!--

#1: Hugepages

![](https://en.algorithmica.org/hpc/data-structures/img/search-btree-hugepages.svg =500x)

(No significant effect for implementations with some form of prefetching)

#2: Make the # of iterations compile-time constant

```cpp
constexpr int height(int n) {
    // grow the tree until its size exceeds n elements
    int s = 0, // total size so far
        l = B, // size of the next layer
        h = 0; // height so far
    while (s + l - B < n) {
        s += l;
        l *= (B + 1);
        h++;
    }
    return h;
}

const int H = height(N);
```

If the size is not constant, we can `switch` between several fixed-height implementations

#3: Combine masks on the SIMD register level

(`movemask` is relatively slow and is better used only once)

```cpp
unsigned rank(reg x, int* y) {
    reg a = _mm256_load_si256((reg*) y);
    reg b = _mm256_load_si256((reg*) (y + 8));

    reg ca = _mm256_cmpgt_epi32(a, x);
    reg cb = _mm256_cmpgt_epi32(b, x);

    reg c = _mm256_packs_epi32(ca, cb); // combines vector masks
    int mask = _mm256_movemask_epi8(c);

    // we need to divide the result by two because we call movemask_epi8 on 16-bit masks:
    return __tzcnt_u32(mask) >> 1;
}
```

![](https://en.algorithmica.org/hpc/data-structures/img/search-btree-optimized.svg =500x)

```cpp
int lower_bound(int _x) {
    int k = 0, res = INT_MAX;
    reg x = _mm256_set1_epi32(_x - 1);
    for (int h = 0; h < H - 1; h++) {
        unsigned i = rank(x, &btree[k]);
        update(res, &btree[k], i);
        k = go(k, i);
    }
    // the last branch:
    if (k < nblocks) {
        unsigned i = rank(x, &btree[k]);
        update(res, &btree[k], i);
    }
    return res;
}
```

-->

---

## B+ trees

![](https://en.algorithmica.org/hpc/data-structures/img/bplus.png =350x)

- *Data nodes* or *leaves* store up to $B$ keys and a pointer to the next leaf node
- *Internal nodes* store up to $B$ **copied** keys and $(B + 1)$ pointers to child nodes

(the structure is **not** succinct: 6-7% more memory is needed for $B=16$) <!-- .element: class="fragment" data-fragment-index="1" -->

<!--

// the height of a balanced n-key B+ tree
constexpr int height(int n) { /* ... */ }

// where the layer h starts (data layer is numbered 0)
constexpr int offset(int h) { /* ... */ }

const int H = height(N);
const int S = offset(H);

// ... quite a bit of code to build the tree skipped ...

-->

----

```cpp
int lower_bound(int _x) {
    unsigned k = 0;
    reg x = _mm256_set1_epi32(_x - 1);
    for (int h = H - 1; h > 0; h--) {
        unsigned i = rank(x, btree + offset(h) + k);
        k = k * (B + 1) + i * B;
    }
    unsigned i = rank(x, btree + k);
    return btree[k + i];
}
```

1. The last node or its next neighbor has the local lower bound (no update) <!-- .element: class="fragment" data-fragment-index="1" -->
2. The depth is constant because B+ trees grow at the root (no branching) <!-- .element: class="fragment" data-fragment-index="2" -->

Also, we can now *reuse* the sorted array and easily return the *index* of the lower bound <!-- .element: class="fragment" data-fragment-index="3" -->
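----

The `height` and `offset` helpers elided above can be defined as follows: a sketch consistent with the layout used here, though not necessarily the repo's exact code (`N` is a compile-time upper bound on the array size; `B = 16` as before)

```cpp
const int B = 16;        // as defined earlier
const int N = (1 << 20); // example compile-time bound on n

constexpr int blocks(int n) {
    return (n + B - 1) / B; // how many B-key blocks a layer of n keys occupies
}

constexpr int prev_keys(int n) {
    // how many keys the previous (higher) layer needs to have
    // one separator key per block boundary of this layer
    return (blocks(n) + B) / (B + 1) * B;
}

constexpr int height(int n) {
    return (n <= B ? 1 : height(prev_keys(n)) + 1);
}

constexpr int offset(int h) {
    // the tree is stored layer by layer, data layer (h = 0) first
    int k = 0, n = N;
    while (h--) {
        k += blocks(n) * B;
        n = prev_keys(n);
    }
    return k;
}
```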
----

![](https://en.algorithmica.org/hpc/data-structures/img/search-bplus.svg =500x)

(Spikes at the end are caused by spilling out of the L1 TLB)

----

![](https://en.algorithmica.org/hpc/data-structures/img/search-all.svg =500x)

----

![](https://en.algorithmica.org/hpc/data-structures/img/search-relative.svg =500x)

---

![](https://en.algorithmica.org/hpc/data-structures/img/search-all.svg =500x)

Article: en.algorithmica.org/hpc/data-structures/binary-search

Code: github.com/sslotin/amh-code/tree/main/binsearch

---

## Bonus slides

----

```cpp
int *tree;

// internal nodes store (B - 2) keys and (B - 1) pointers;
// leaves store (B - 1) keys

int lower_bound(int _x) {
    unsigned k = root;
    reg x = _mm256_set1_epi32(_x);
    for (int h = 0; h < H - 1; h++) {
        unsigned i = rank(x, &tree[k]);
        k = tree[k + B + i]; // fetch the next pointer
    }
    unsigned i = rank(x, &tree[k]);
    return tree[k + i];
}
```

"B− Tree"

----

![](https://en.algorithmica.org/hpc/data-structures/img/btree-absolute.svg =700x)

----

![](https://en.algorithmica.org/hpc/data-structures/img/btree-relative.svg =700x)

----

![](https://en.algorithmica.org/hpc/data-structures/img/btree-absl.svg =700x)

----

```cpp
clock_t start = clock();

for (int i = 0; i < m; i++)
    checksum ^= lower_bound(q[i]);

float seconds = float(clock() - start) / CLOCKS_PER_SEC;

printf("%.2f ns per query\n", 1e9 * seconds / m);
```

This measures **reciprocal throughput**

----

```cpp
int last = 0;

for (int i = 0; i < m; i++) {
    last = lower_bound(q[i] ^ last);
    checksum ^= last;
}
```

This measures **latency**

----

![](https://en.algorithmica.org/hpc/data-structures/img/search-relative-latency.svg =500x)

----

![](https://en.algorithmica.org/hpc/data-structures/img/search-bplus-other.svg =500x)

----

![](https://en.algorithmica.org/hpc/data-structures/img/search-latency-bplus.svg =500x)
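----

The measurement snippets a few slides back assume some surrounding context; a minimal version of it (a sketch, the repo's harness is more careful about warm-up and timer resolution):

```cpp
#include <climits>
#include <random>

const int m = (1 << 20); // number of queries
int q[m], checksum;

// fill the queries with random keys so that the search path is unpredictable
void generate_queries() {
    std::mt19937 rng(0);
    for (int i = 0; i < m; i++)
        q[i] = rng() % INT_MAX;
}
```

The `checksum` accumulator exists so that the compiler cannot optimize the search calls away; the XOR with `last` in the latency version additionally makes each query *depend* on the previous answer, serializing the memory accesses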
{"metaMigratedAt":"2023-06-17T05:08:47.763Z","metaMigratedFrom":"YAML","title":"Optimizing Binary Search","breaks":true,"slideOptions":"{\"theme\":\"white\"}","contributors":"[{\"id\":\"045b9308-fa89-4b5e-a7f0-d8e7d0b849fd\",\"add\":37160,\"del\":16334}]"}
    870 views