<style>
.slide-background {
  background: #222;
}
.reveal {
  font-size: 24px;
}
code {
  font-size: 16px;
  line-height: 1.2em;
}
</style>
# Optimizing Binary Search
C++ Zero Cost Conf 2022
Sergey Slotin
---
## About me
- Author of [Algorithms for Modern Hardware](https://en.algorithmica.org/hpc/) (en.algorithmica.org/hpc)
- Designed the world's fastest **binary search**, B-tree, integer parsing, factorization, prefix sum, array searching, argmin… (github.com/sslotin/amh-code)
- Email: me@sereja.me; Twitter: [@sergey_slotin](https://twitter.com/sergey_slotin); Telegram: [@bydlokoder](https://t.me/bydlokoder)
----
## Outline
- Vanilla binary search (Babylonians, 200BC; Mauchly, 1946)
- Branchless binary search: 3x on small arrays (LLVM contributors, early 2010s–…)
- Eytzinger binary search: 2x on large arrays (Khuong & Morin, 2015)
- S-tree: 5-8x (Intel 2010; Slotin, 2020)
- S+ tree: 7-15x (Slotin, 2022)
---
## The problem
```cpp
// perform any O(n) precomputation with a sorted array
void prepare(int *a, int n);
// return the first element y ≥ x in O(log n)
int lower_bound(int x);
```
Random keys & queries, all 32-bit integers, optimize for throughput
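For reference, the intended semantics in terms of the standard library (a sketch of mine, not the talk's code; the `t`/`n` globals match the snippets that follow):
```cpp
#include <algorithm>

int n;  // number of keys
int *t; // the array we search in (for now, just the sorted array itself)

void prepare(int *a, int m) { t = a, n = m; } // no precomputation yet

int lower_bound(int x) {
    // first element >= x; assumes an answer exists,
    // i.e. x never exceeds the largest key
    return *std::lower_bound(t, t + n, x);
}
```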
----
**Comparison-based search only**
(no lookup tables, tries, interpolation searches, etc.)
----
```cpp
int lower_bound(int x) {
    int l = 0, r = n - 1;
    while (l < r) {
        int m = (l + r) / 2;
        if (t[m] >= x)
            r = m;
        else
            l = m + 1;
    }
    return t[l];
}
```
As found in the first 50 pages of any CS textbook
----
```cpp
template <class _Compare, class _ForwardIterator, class _Tp>
_LIBCPP_CONSTEXPR_AFTER_CXX17 _ForwardIterator
__lower_bound(_ForwardIterator __first, _ForwardIterator __last,
              const _Tp& __value_, _Compare __comp) {
    typedef typename iterator_traits<_ForwardIterator>::difference_type difference_type;
    difference_type __len = _VSTD::distance(__first, __last);
    while (__len != 0) {
        difference_type __l2 = _VSTD::__half_positive(__len);
        _ForwardIterator __m = __first;
        _VSTD::advance(__m, __l2);
        if (__comp(*__m, __value_)) {
            __first = ++__m;
            __len -= __l2 + 1;
        }
        else
            __len = __l2;
    }
    return __first;
}
```
github.com/llvm-mirror/libcxx/blob/master/include/algorithm#L4169
----
```nasm
│35: mov %rax,%rdx
0.52 │ sar %rdx
0.33 │ lea (%rsi,%rdx,4),%rcx
4.30 │ cmp (%rcx),%edi
65.39 │ ↓ jle b0
0.07 │ sub %rdx,%rax
9.32 │ lea 0x4(%rcx),%rsi
0.06 │ dec %rax
1.37 │ test %rax,%rax
1.11 │ ↑ jg 35
```
perf record & perf report ($n=10^6$)
----

Zen 2 @ 2GHz, DDR4 @ 2667MHz
$\frac{100}{12} \times 2 \approx 17$ cycles per iteration seems too high for L1 <!-- .element: class="fragment" data-fragment-index="1" -->
---
## CPU pipeline
To execute *any* instruction, a lot of preparatory work is needed:
- **fetch** a chunk of machine code from memory
- **decode** it and split into instructions
- **execute** these instructions (possibly involving some **memory** operations)
- **write** the results back into registers
This whole sequence takes 15-20 CPU cycles
----
*Pipelining*: at each stage, the next instruction is processed right away, without waiting for the previous one to fully complete

Doesn't reduce the *actual* latency, but makes it *seem* like only the execution and memory stages take time
----
*Pipeline hazard:* the next instruction can't execute on the following clock cycle

- *Structural hazard*: two or more instructions need the same part of the CPU
- *Data hazard*: need to wait for operands to be computed from a previous step
- *Control hazard*: the CPU can't tell which instructions to execute next
**^ requires a pipeline flush on branch misprediction** <!-- .element: class="fragment" data-fragment-index="1" -->
---
## Removing branches
```cpp
int lower_bound(int x) {
    int l = 0, r = n - 1;
    while (l < r) { // <- this branch can be easily predicted
        int m = (l + r) / 2;
        if (t[m] >= x) // <- this branch is what we care about
            r = m;
        else
            l = m + 1;
    }
    return t[l];
}
```
----
```cpp
int lower_bound(int x) {
    int *base = t, len = n;
    while (len > 1) {
        int half = len / 2;
        if (base[half - 1] < x) {
            base += half;
            len = len - half;
        } else {
            len = half;
        }
    }
    return *base;
}
```
----
```cpp
int lower_bound(int x) {
    int *base = t, len = n;
    while (len > 1) {
        int half = len / 2;
        if (base[half - 1] < x)
            base += half;
        len -= half; // = ceil(len / 2)
    }
    return *base;
}
```
**Exactly** $\lceil \log_2 n \rceil$ iterations, so not strictly equivalent to binary search <!-- .element: class="fragment" data-fragment-index="1" -->
----
```cpp
int lower_bound(int x) {
    int *base = t, len = n;
    while (len > 1) {
        int half = len / 2;
        base += (base[half - 1] < x) * half; // will be replaced with a "cmov"
        len -= half;
    }
    return *base;
}
```
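A quick way to sanity-check these variants against the standard library (my sketch, not from the talk):
```cpp
#include <algorithm>
#include <cassert>
#include <vector>

void verify(int *t, int n, const std::vector<int> &queries) {
    for (int x : queries) {
        int *it = std::lower_bound(t, t + n, x);
        if (it != t + n)                   // both versions assume an answer exists
            assert(lower_bound(x) == *it); // the global lower_bound defined above
    }
}
```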
----

Why worse on larger arrays? <!-- .element: class="fragment" data-fragment-index="1" --> *Prefetching (speculation)* <!-- .element: class="fragment" data-fragment-index="2" -->
----
```cpp
int lower_bound(int x) {
    int *base = t, len = n;
    while (len > 1) {
        int half = len / 2;
        len -= half;
        __builtin_prefetch(&base[len / 2 - 1]);
        __builtin_prefetch(&base[half + len / 2 - 1]);
        base += (base[half - 1] < x) * half;
    }
    return *base;
}
```
----

Still grows slightly faster than the branchy version, which speculatively prefetches more than one step ahead
---
## Cache locality

- *Temporal locality*: only okay for the first dozen or so requests <!-- .element: class="fragment" data-fragment-index="1" -->
- *Spatial locality*: only okay for the last 3-4 requests <!-- .element: class="fragment" data-fragment-index="2" -->
(Memory is fetched in 64B blocks called *cache lines*) <!-- .element: class="fragment" data-fragment-index="2" -->
----


Hot and cold elements are grouped together (the opposite of what we want)
----

**Michaël Eytzinger** was a 16th-century Austrian nobleman known for his genealogy work;
in particular, for a compact ancestor numbering system known as *ahnentafel* (German for "ancestor table")
----

- The subject of the ahnentafel is listed as number 1
- The father of person $k$ is listed as $2k$, and their mother is listed as $(2k+1)$
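In code form (a trivial sketch of the numbering rule above):
```cpp
int father(int k) { return 2 * k; }
int mother(int k) { return 2 * k + 1; }
```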
----

Used in heaps, segment trees, and other binary tree structures
(Note: the tree is not strictly equivalent to binary search)
----


Temporal locality is much better
----
### Building
```cpp
int a[n], t[n + 1]; // the original sorted array and the eytzinger array we build
                    // ^ we need one more element because of one-based indexing

void eytzinger(int k = 1) {
    static int i = 0;
    if (k <= n) {
        eytzinger(2 * k);
        t[k] = a[i++];
        eytzinger(2 * k + 1);
    }
}
```
The sorted array can be permuted in $O(n)$ time
----
### Searching
<!--
Instead of ranges, start with $k=1$ and execute until $k < n$:
- If $x < a_k$, go left: $k := 2k$
- Otherwise, go right: $k := 2k + 1$
-->
```cpp
int k = 1;
while (k <= n)
    k = 2 * k + (t[k] < x);
```
But how do we get the lower bound itself? <!-- .element: class="fragment" data-fragment-index="1" -->
----

$k$ does not necessarily point to the lower bound
…but we compare against it at some point and keep going "right" ever after <!-- .element: class="fragment" data-fragment-index="1" -->
----
```cpp
int lower_bound(int x) {
    int k = 1;
    while (k <= n)
        k = 2 * k + (t[k] < x);
    k >>= __builtin_ffs(~k);
    return t[k];
}
```
- Right-turns are recorded in the binary notation of $k$ as 1-bits
- To "cancel" them, we can right-shift $k$ by the number of trailing ones
----

Worse than branchless on large arrays because the last 3-4 iterations are not cached
----
```cpp
int lower_bound(int x) {
    int k = 1;
    while (k <= n) {
        __builtin_prefetch(&t[k * 2]);
        __builtin_prefetch(&t[k * 2 + 1]);
        k = 2 * k + (t[k] < x);
    }
    k >>= __builtin_ffs(~k);
    return t[k];
}
```
----
```cpp
int lower_bound(int x) {
    int k = 1;
    while (k <= n) {
        __builtin_prefetch(&t[k * 2]);
        //__builtin_prefetch(&t[k * 2 + 1]);
        k = 2 * k + (t[k] < x);
    }
    k >>= __builtin_ffs(~k);
    return t[k];
}
```
----
```cpp
int lower_bound(int x) {
    int k = 1;
    while (k <= n) {
        __builtin_prefetch(&t[k * 4]);
        __builtin_prefetch(&t[k * 4 + 1]);
        __builtin_prefetch(&t[k * 4 + 2]);
        __builtin_prefetch(&t[k * 4 + 3]);
        k = 2 * k + (t[k] < x);
    }
    k >>= __builtin_ffs(~k);
    return t[k];
}
```
----
```cpp
int lower_bound(int x) {
    int k = 1;
    while (k <= n) {
        __builtin_prefetch(&t[k * 4]);
        //__builtin_prefetch(&t[k * 4 + 1]);
        //__builtin_prefetch(&t[k * 4 + 2]);
        //__builtin_prefetch(&t[k * 4 + 3]);
        k = 2 * k + (t[k] < x);
    }
    k >>= __builtin_ffs(~k);
    return t[k];
}
```
----
```cpp
alignas(64) int t[n + 1]; // allocate the array at the start of a cache line

int lower_bound(int x) {
    int k = 1;
    while (k <= n) {
        __builtin_prefetch(&t[k * 16]);
        k = 2 * k + (t[k] < x);
    }
    k >>= __builtin_ffs(~k);
    return t[k];
}
```
----

- When we fetch `t[k * 16]`, we also fetch its 15 neighbors in the same cache line
- If the array is **aligned**, they are guaranteed to be $[k \cdot 16, k \cdot 16 + 15]$
- As I/O is pipelined, we are always prefetching 4 levels in advance
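If the array is allocated on the heap instead of statically, the same alignment guarantee needs an explicit aligned allocation; a sketch (my assumption, C++17):
```cpp
#include <cstdlib>

// round the size up to whole cache lines: std::aligned_alloc
// requires the size to be a multiple of the alignment
size_t bytes = (4 * (n + 1) + 63) / 64 * 64;
int *t = static_cast<int*>(std::aligned_alloc(64, bytes)); // freed with std::free
```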
----

- In Eytzinger binary search, we trade off **bandwidth** for **latency** with prefetching <!-- .element: class="fragment" data-fragment-index="1" -->
- Doesn't work well in bandwidth-constrained environments (multi-threading) <!-- .element: class="fragment" data-fragment-index="1" -->
- Instead, we can fetch fewer cache lines <!-- .element: class="fragment" data-fragment-index="2" -->
---
## B-trees

$B \ge 2$ keys in each node, increasing branching factor and reducing tree height
$O(\log_B n \cdot B)$ comparisons in total with linear search in each node
----
- For RAM, pick $B$ so that a node fills exactly one cache line ($16$ four-byte keys = 64B) and no fetched data is wasted
- $B=16$ reduces the tree height by $\frac{\log_2 n}{\log_{17} n} = \frac{\log 17}{\log 2} = \log_2 17 \approx 4.09$ times
----
We can make static B-trees **implicit** by generalizing the Eytzinger numeration:
- The root node is numbered $0$
- Node $k$ has $(B + 1)$ child nodes numbered $\{k \cdot (B + 1) + i + 1\}$ for $i \in [0, B]$
----
```cpp
const int B = 16;

int nblocks = (n + B - 1) / B;
int btree[nblocks][B];

int go(int k, int i) { return k * (B + 1) + i + 1; }

void build(int k = 0) {
    static int t = 0;
    if (k < nblocks) {
        for (int i = 0; i < B; i++) {
            build(go(k, i));
            btree[k][i] = (t < n ? a[t++] : INT_MAX);
        }
        build(go(k, B));
    }
}
```
----
```cpp
// compute the "local" lower bound in a node
int rank(int x, int *node) {
for (int i = 0; i < B; i++)
if (node[i] >= x)
return i;
return B;
}
```
$B/2$ comparisons on average, unpredictable branching
----
```cpp
int rank(int x, int *node) {
    int mask = (1 << B);
    for (int i = 0; i < B; i++)
        mask |= (node[i] >= x) << i;
    return __builtin_ffs(mask) - 1;
}
```
No branching, but even more comparisons
----

SIMD: Single Instruction, Multiple Data
----
```cpp
typedef __m256i reg;

// compute an 8-bit mask corresponding to "<" elements
int cmp(reg x_vec, int* y_ptr) {
    reg y_vec = _mm256_load_si256((reg*) y_ptr); // load 8 sorted elements
    reg mask = _mm256_cmpgt_epi32(x_vec, y_vec); // compare against the key
    return _mm256_movemask_ps((__m256) mask);    // extract the 8-bit mask
}
```
Zen 2 supports AVX2, so we can compare $256/32=8$ keys at a time
----
```cpp
int rank(reg x_vec, int *node) {
    int mask = ~(
        cmp(x_vec, node) +
        (cmp(x_vec, node + 8) << 8)
    );
    return __builtin_ffs(mask) - 1; // alternative: popcount
}
```
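The popcount alternative mentioned in the comment could look like this (my sketch, not the talk's code): since the keys in a node are sorted, the "<" bits form a prefix of ones, and their count is exactly the rank.
```cpp
int rank_popcnt(reg x_vec, int *node) {
    int mask = cmp(x_vec, node) + (cmp(x_vec, node + 8) << 8);
    return __builtin_popcount(mask); // number of elements less than x
}
```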
----
```cpp
int lower_bound(int _x) {
    int k = 0, res = INT_MAX;
    reg x = _mm256_set1_epi32(_x);
    while (k < nblocks) {
        int i = rank(x, btree[k]);
        if (i < B) // a local lower bound doesn't have to exist in the leaf node
            res = btree[k][i];
        k = go(k, i);
    }
    return res;
}
```
----

"S-tree": <u>s</u>tatic, <u>s</u>uccinct, <u>s</u>peedy, <u>S</u>IMDized
----
```cpp
int lower_bound(int _x) {
    int k = 0, res = INT_MAX;
    reg x = _mm256_set1_epi32(_x);
    while (k < nblocks) {
        int i = rank(x, btree[k]);
        if (i < B)
            res = btree[k][i];
        k = go(k, i);
    }
    return res;
}
```
1. The update is unnecessary 16 times out of 17 <!-- .element: class="fragment" data-fragment-index="1" -->
2. Non-constant number of iterations causes branch mispredictions <!-- .element: class="fragment" data-fragment-index="2" -->
<!--
#1: Hugepages

(No significant effect for implementations with some form of prefetching)
#2: Make the # of iterations compile-time constant
```cpp
constexpr int height(int n) {
// grow the tree until its size exceeds n elements
int s = 0, // total size so far
l = B, // size of the next layer
h = 0; // height so far
while (s + l - B < n) {
s += l;
l *= (B + 1);
h++;
}
return h;
}
const int H = height(N);
```
If size is not constant, we can `switch` between several fixed-height implementations
#3: Combine masks on the SIMD register level
(`movemask` is relatively slow and better be used only once)
```cpp
unsigned rank(reg x, int* y) {
reg a = _mm256_load_si256((reg*) y);
reg b = _mm256_load_si256((reg*) (y + 8));
reg ca = _mm256_cmpgt_epi32(a, x);
reg cb = _mm256_cmpgt_epi32(b, x);
reg c = _mm256_packs_epi32(ca, cb); // combines vector masks
int mask = _mm256_movemask_epi8(c);
// we need to divide the result by two because we call movemask_epi8 on 16-bit masks:
return __tzcnt_u32(mask) >> 1;
}
```

```cpp
int lower_bound(int _x) {
int k = 0, res = INT_MAX;
reg x = _mm256_set1_epi32(_x - 1);
for (int h = 0; h < H - 1; h++) {
unsigned i = rank(x, &btree[k]);
update(res, &btree[k], i);
k = go(k, i);
}
// the last branch:
if (k < nblocks) {
unsigned i = rank(x, &btree[k]);
update(res, &btree[k], i);
}
return res;
}
```
-->
---
## B+ trees

- *Data nodes* or *leaves* store up to $B$ keys and the pointer to the next leaf node
- *Internal nodes* store up to $B$ **copied** keys and $(B + 1)$ pointers to child nodes
(the structure is **not** succinct: each internal layer is $(B+1)$ times smaller than the one below, so the copied keys add up to $\approx \frac{n}{B}$, i.e. 6-7% more memory for $B=16$) <!-- .element: class="fragment" data-fragment-index="1" -->
<!--
// the height of a balanced n-key B+ tree
constexpr int height(int n) { /* ... */ }
// where the layer h starts (data layer is numbered 0)
constexpr int offset(int h) { /* ... */ }
const int H = height(N);
const int S = offset(H);
// ... quite a bit of code to build the tree skipped ...
-->
----
```cpp
int lower_bound(int _x) {
    unsigned k = 0;
    reg x = _mm256_set1_epi32(_x - 1);
    for (int h = H - 1; h > 0; h--) {
        unsigned i = rank(x, btree + offset(h) + k);
        k = k * (B + 1) + i * B;
    }
    unsigned i = rank(x, btree + k);
    return btree[k + i];
}
```
1. The last node or its next neighbor has the local lower bound (no update) <!-- .element: class="fragment" data-fragment-index="1" -->
2. The depth is constant because B+ trees grow at the root (no branching) <!-- .element: class="fragment" data-fragment-index="2" -->
Also, we can now *reuse* the sorted array and easily return the *index* of the lower bound <!-- .element: class="fragment" data-fragment-index="3" -->
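For instance, an index-returning variant is a one-line change (my sketch, using the same `H`/`offset`/`rank` as above): the data layer is the sorted array itself, so the final offset is the index.
```cpp
int lower_bound_index(int _x) {
    unsigned k = 0;
    reg x = _mm256_set1_epi32(_x - 1);
    for (int h = H - 1; h > 0; h--) {
        unsigned i = rank(x, btree + offset(h) + k);
        k = k * (B + 1) + i * B;
    }
    return k + rank(x, btree + k); // position in the original sorted array
}
```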
----

(Spikes at the end due to L1 TLB spill)
----

----

---

Article: en.algorithmica.org/hpc/data-structures/binary-search
Code: github.com/sslotin/amh-code/tree/main/binsearch
---
## Bonus slides
----
```cpp
int *tree; // internal nodes store (B - 2) keys and (B - 1) pointers
           // leaves store (B - 1) keys

int lower_bound(int _x) {
    unsigned k = root;
    reg x = _mm256_set1_epi32(_x);
    for (int h = 0; h < H - 1; h++) {
        unsigned i = rank(x, &tree[k]);
        k = tree[k + B + i]; // fetch the next-node pointer
    }
    unsigned i = rank(x, &tree[k]);
    return tree[k + i];
}
```
"B− Tree"
----

----

----

----
```cpp
clock_t start = clock();

for (int i = 0; i < m; i++)
    checksum ^= lower_bound(q[i]);

float seconds = float(clock() - start) / CLOCKS_PER_SEC;

printf("%.2f ns per query\n", 1e9 * seconds / m);
```
This measures **reciprocal throughput**: the queries are independent, so their execution can overlap
----
```cpp
int last = 0;

for (int i = 0; i < m; i++) {
    last = lower_bound(q[i] ^ last);
    checksum ^= last;
}
```
This measures **latency**: each query depends on the result of the previous one, so they cannot overlap
----

----

----

{"metaMigratedAt":"2023-06-17T05:08:47.763Z","metaMigratedFrom":"YAML","title":"Optimizing Binary Search","breaks":true,"slideOptions":"{\"theme\":\"white\"}","contributors":"[{\"id\":\"045b9308-fa89-4b5e-a7f0-d8e7d0b849fd\",\"add\":37160,\"del\":16334}]"}