<style>
.reveal {
font-size: 24px;
}
.reveal section img {
background:none;
border:none;
box-shadow:none;
}
pre.graphviz {
box-shadow: none;
}
.slide-background-content {
background: #fff;
}
code {
color: #900 !important;
background: #eee !important;
border: 1px solid #ddd;
border-radius: 3px;
/* padding: 0 2px; */
}
.code-wrapper {
font-size: 13px !important;
box-shadow: none !important;
}
a {
font-size: 0.9em;
}
h1, h2 {
text-transform: none !important;
}
.code-wrapper {
font-size: 18px !important;
}
.hljs{display:block;overflow-x:auto;padding:.5em;color:#333;background:#eee}.hljs-comment,.hljs-quote{color:#998;font-style:italic}.hljs-keyword,.hljs-selector-tag,.hljs-subst{color:#333;font-weight:700}.hljs-literal,.hljs-number,.hljs-tag .hljs-attr,.hljs-template-variable,.hljs-variable{color:teal}.hljs-doctag,.hljs-string{color:#d14}.hljs-section,.hljs-selector-id,.hljs-title{color:#900;font-weight:700}.hljs-subst{font-weight:400}.hljs-class .hljs-title,.hljs-type{color:#458;font-weight:700}.hljs-attribute,.hljs-name,.hljs-tag{color:navy;font-weight:400}.hljs-link,.hljs-regexp{color:#009926}.hljs-bullet,.hljs-symbol{color:#990073}.hljs-built_in,.hljs-builtin-name{color:#0086b3}.hljs-meta{color:#999;font-weight:700}.hljs-deletion{background:#fdd}.hljs-addition{background:#dfd}.hljs-emphasis{font-style:italic}.hljs-strong{font-weight:700}
</style>
# CPU Cache Effects
Sergey Slotin
Meeting C++ 2022
Note: Hello everyone, my name is Sergey Slotin…
---
## About me
- Author of *Algorithms for Modern Hardware* (en.algorithmica.org/hpc)
- Designed the world's fastest binary search, B-tree, integer parsing,
integer factorization, prefix sum, array searching, argmin…
- Twitter: [@sergey_slotin](https://twitter.com/sergey_slotin); Telegram: [@bydlokoder](https://t.me/bydlokoder); Email: me@sereja.me
<!-- I have some delays in printed version because people seem to not like the color of my passport right now, but it is freely available online. -->
Note:
…and like many of you C++ enjoyers, I like writing efficient code. I also like teaching and writing, and recently I combined these passions and wrote a book on the topic of performance engineering.
It is called Algorithms for Modern Hardware. It is hosted on github and freely available online so take a look if you're interested.
While working on it, I tried to optimize various classic algorithms and data structures, the ones that you read about in textbooks, both out of curiosity and for pedagogical reasons (to use them later as case studies in the book), and succeeded in speeding up quite a few of them.
But in this talk, I'm not going to pick just one of these algorithms. Instead, I will discuss a broad topic that is crucial for algorithm design and especially for data structure design and for programming in general: memory, and more specifically the CPU cache system.
So, why should you care about memory?
----
$$c = a + b$$
Note:
Consider this fundamental question: how long does it take to add two numbers, two integers together? Well, that depends on what you mean. More specifically, on where these numbers are stored.
----
```cpp
int add(int a, int b) {
    return a + b;
}
```
```nasm
add eax, ebx
```
<!-- .element: class="fragment" data-fragment-index="1" -->
Note: If the values are stored in CPU registers, then we only need one instruction.
(click)
Not all arithmetic instructions take equal time, but being the simplest and also one of the most frequently used instructions, addition takes a single CPU cycle to execute. So, if the data is already loaded into registers, it takes just one cycle.
----
```cpp
void add(int *a, int *b, int *c) {
    *c = *a + *b;
}
```
```nasm
mov eax, DWORD PTR [rsi]
add eax, DWORD PTR [rdi]
mov DWORD PTR [rdx], eax
```
<!-- .element: class="fragment" data-fragment-index="1" -->
`add` = one cycle, `mov` = depends
<!-- .element: class="fragment" data-fragment-index="1" -->
Note:
But in the general case, we need to first fetch the operands of this instruction from the memory, perform the addition, and write the results back where they are needed.
(click)
And this can get much more complicated. When you fetch anything from memory, there is always some latency before the data arrives. And this latency depends on many factors, but mainly on where the data is stored.
----

<!--On top of that, the request doesn’t just go directly to its ultimate storage location, but first through a complex system of address translation units and caching layers designed to both help in memory management and reduce latency.-->
Note: Modern computer memory is hierarchical. It consists of multiple cache layers of varying speed and size, where the higher levels store the most frequently accessed data from the lower levels to reduce latency: each level closer to the CPU is usually an order of magnitude faster, but also smaller and more expensive.
This is a mental model of the cache hierarchy; it neglects a lot of important details, but it is a good starting point.
----
- Registers: 1 cycle
- <!-- .element: class="fragment" data-fragment-index="4" --> L1: ~5 cycles
- <!-- .element: class="fragment" data-fragment-index="3" --> L2: ~10 cycles
- <!-- .element: class="fragment" data-fragment-index="2" --> L3: ~40 cycles
- <!-- .element: class="fragment" data-fragment-index="1" --> RAM: ~100ns (~200 cycles)
- <!-- .element: class="fragment" data-fragment-index="5" --> SSD: ~0.1ms (~10⁵ cycles)
- <!-- .element: class="fragment" data-fragment-index="6" --> HDD: ~10ms (~10⁷ cycles)
- <!-- .element: class="fragment" data-fragment-index="7" --> Internet: 100ms (~10⁸ cycles)
Note:
So, the answer to the question — how long it takes to add two numbers together — depends on where the operands are stored:
- If the data is already in registers, it takes one CPU cycle to execute the instruction.
- If the data is stored in the main memory (RAM), it will typically take around ~100ns, or about 200 CPU cycles, to fetch it, and then probably another 200 cycles to write it back.
- But if it was accessed recently, it is probably cached and will take less than that to fetch, depending on how long ago it was accessed — it could be around 40 cycles for the slowest, largest layer of cache and around 4-5 cycles for the fastest, closest to the processor.
- But it could also be stored on some type of external memory such as a solid state drive, a hard drive, or some network-backed storage that is memory-mapped, so that it appears to the programmer as if the data were in the main memory, but when you access it, the operating system interrupts the program execution and talks to the device that stores it. In this case, the delays are so much larger that they are psychologically perceivable, measured in milliseconds of real time and millions of CPU cycles.
So because of these huge timing differences, before optimizing anything else in your program, you really need to optimize its memory efficiency. And to perform more fine-grained optimization, simple awareness of the cache hierarchy is not enough: you also need to take into account many specific details of how the CPU cache system works.
And in this talk, over the next 50-something minutes, I will discuss these finer details and hopefully teach you how to make use of them in real applications. And instead of giving you dry facts from boring spec sheets and talking about theoretically achievable limits, I will showcase the many features of the memory system through experiments, by running small benchmark programs with very specific access patterns that often resemble the ones that occur in practical code, and give a few examples of how this knowledge might be useful in the real world.
---
## "fast"
- <!-- .element: class="fragment" data-fragment-index="1" --> <b>Latency</b>: the delay between read/write initiation and the data arrival
- <!-- .element: class="fragment" data-fragment-index="2" --> <b>Bandwidth</b>: the amount of data that can be transferred per unit of time
Note:
So… there are many different cache layers between the CPU registers and the RAM, and the layers closer to the processor are smaller in size, but also faster. And the word “faster” can apply to two closely related but separate timings:
- The delay between the moment when a read or a write is initiated and when the data arrives (latency).
- The amount of data that can be transferred per unit of time (bandwidth).
It's important to understand that one is not just the reciprocal of the other, far from it. In the a+b example, we cared about latency, but for many algorithms…
----
## 1. Bandwidth
Note:
…the most important characteristic of the cache system is the memory bandwidth. And at the same time, it is also the easiest to measure, so we are going to start with bandwidth.
----
```cpp
int a[N];
for (int t = 0; t < enough_times_to_be_statistically_significant; t++)
    for (int i = 0; i < N; i++)
        a[i]++;
```
`$(CC) -O3 -march=native -funroll-loops`
Zen 2 @ 2GHz, DDR4 @ 2667MHz
<!-- .element: class="fragment" data-fragment-index="1" -->
Note: To measure it, we write a little program where we create an array and iterate over it, incrementing its values.
We time the whole thing and run this loop many times to mitigate cold start effects and to make measurements more precise.
(click)
To reproduce this and all the other experiments in this talk, you can compile it yourself with your favorite compiler, but make sure to set the highest level of optimization, importantly enabling vectorization, and unroll the loops so that the cost of loop maintenance itself does not significantly affect the measurements.
All benchmarks are run on AMD Zen 2, a particular CPU microarchitecture, but the conclusions that we will draw are generally applicable.
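For reference, here is a minimal sketch of the timing harness (not the exact code behind the graphs; the function name, `K`, and the reporting units are illustrative):
```cpp
#include <chrono>

// Time K passes over the array and report increments per second.
double measure(int *a, int N, int K) {
    auto start = std::chrono::steady_clock::now();
    for (int t = 0; t < K; t++) {
        for (int i = 0; i < N; i++)
            a[i]++;
        // an empty asm with a "memory" clobber keeps the compiler from
        // collapsing the K passes into a single a[i] += K
        asm volatile("" ::: "memory");
    }
    double seconds = std::chrono::duration<double>(
                         std::chrono::steady_clock::now() - start).count();
    return double(N) * K / seconds;
}
```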
----

<!--
const __m256i ones = _mm256_set1_epi32(1);
for (int i = 0; i + 7 < N; i += 8) {
    __m256i x = _mm256_load_si256((__m256i*) &a[i]);
    x = _mm256_add_epi32(x, ones);
    _mm256_store_si256((__m256i*) &a[i], x);
}
-->
Note:
After running the incrementation benchmark with many array sizes and expressing the total running time in “operations per second,” we get a graph like this.
You can clearly see the cache sizes on this graph:
- When the whole array fits into the L1 cache, the program is bottlenecked by the CPU rather than by the L1 cache bandwidth. As the array becomes larger, the overhead associated with the first iterations of the loop becomes smaller, and the performance gets closer to its theoretical maximum
- But then the performance drops: first to 12-13 GFLOPS when it exceeds the L1 cache, and then gradually to about 2 GFLOPS when it can no longer fit in the L3 cache and has to be read from the random access memory, which is an order of magnitude slower.
This situation is typical for many lightweight loops.
----
Tip: fuse loops together
Note: The broad lesson here is that, when the data set is large, what you really want to optimize is not the number of arithmetic operations, but rather the amount of data read from the main memory. One common way to do it is to fuse operations together as much as you can, especially in the context of loops — it is beneficial to join two loops together…
----
```python
# what user wants (Python API)
a = numpy.random.random(10**6)
a = 2 * a + 1
```
```cpp
// what should happen
for (int i = 0; i < n; i++)
    a[i] = 2 * a[i] + 1;
```
```cpp
// what actually happens
for (int i = 0; i < n; i++)
    a[i] *= 2;
for (int i = 0; i < n; i++)
    a[i] += 1;
```
Note: …not only because it reduces the overhead of maintaining the loop itself and helps with optimization (for example, if the operations in two initial loops are using different parts of the CPU, they can be executed concurrently), but also, and crucially for big inputs that don't fit into cache, it helps with bandwidth as you can read the array just once and not multiple times
This issue is particularly painful when you are working with high-level abstractions, because the separation of concerns principle dictates that we should implement separate operations that do not know about each other, but this naturally conflicts with optimizations such as loop fusion
----

Note:
But let's get back to the experiments. In this run, turbo boost was turned off, and the CPU was running at a fixed 2GHz…
----

Are we really measuring bandwidth?
<!-- .element: class="fragment" data-fragment-index="1" -->
Note:
…with turbo boost on, it can run at up to 4.1GHz.
All CPU cache layers are placed on the same microchip as the processor, so their bandwidth, latency, and all other characteristics scale with the clock frequency. The RAM, on the other hand, is separate from the CPU and runs on its own fixed clock, so its timings remain constant — which we can observe on the RAM portion of this graph.
This detail often comes into play when comparing different algorithm implementations. When the working dataset does not fit in the cache, the relative performance of the two implementations may be different depending on the CPU clock rate because the RAM remains unaffected by it (while everything else scales with it). So you might have two implementations, and one of them is faster at one clock rate but slower at another clock rate. For this reason, it is advised to keep the clock rate fixed, and so we run most of the benchmarks in this talk at plain 2GHz.
(click)
There is actually a problem with this experiment. We are not really measuring bandwidth. In our benchmark we are incrementing array cells, that is, we are reading values AND writing them back, while in many applications we may only need to do reading or only writing.
---
## 2. Directional bandwidth
Note: Let's try to measure unidirectional bandwidth.
----
Read only:
```cpp
for (int i = 0; i < N; i++)
    s += a[i];
```
Write only:
```cpp
// replaced with memset
for (int i = 0; i < N; i++)
    a[i] = 0;
```
Note: To make the CPU generate only read requests, we can calculate, say, the sum of the array.
And zeroing out an array (or filling it with any other constant value) only requires memory writes (it doesn't need any reads).
Same as with the incrementing loop, these two loops are easily vectorized by the compiler, and the second one is actually replaced with a memset, so the CPU is not the bottleneck here either
----

Why is write-only slower than read-only?
<!-- .element: class="fragment" data-fragment-index="1" -->
Write-only needs to update cache, requesting a read and doubling the RAM-CPU traffic
<!-- .element: class="fragment" data-fragment-index="2" -->
Note:
If we run all three, we get a graph like this.
- first, all three loops perform the same; then, when we cross the L2 boundary, the performance of the incrementing loop drops, but not dramatically, because it needs to perform both reads and writes, while the read-only and write-only loops still perform the same -- this is expected
- but then, something interesting happens: unlike the caches, the CPU and RAM are connected with a single dual-purpose bus that at any point in time can be used for either read or write requests. Like a single-lane road, the direction of this memory bus is periodically switched by the memory controller depending on the number of pending read and write operations. So a program that does both reads and writes naturally consumes twice as much RAM bandwidth as one that only does reads
(click)
But the interesting anomaly is that the write-only loop has the same performance as the incrementing loop. In fact this starts in the L3 cache. Why?
(click)
This is because the CPU moves the data to the highest level of cache on each access, whether it is a read or a write — which is typically a good optimization, as in many use cases we will be needing it soon. When reading data, this isn’t a problem, as the data travels through the cache hierarchy anyway, but when writing, this causes another implicit read of the same data to be dispatched right after a write — thus requiring twice the bus bandwidth.
---
## 3. Bypassing the cache
Note: We can prevent the CPU from fetching back the data that we have just written by using so-called non-temporal memory accesses.
----
```cpp
const __m256i zeros = _mm256_set1_epi32(0);
for (int i = 0; i + 7 < N; i += 8)
    _mm256_store_si256((__m256i*) &a[i], zeros);
```
Note:
To do this, we need to re-implement the zeroing loop directly, without relying on compiler auto-vectorization. And ignoring some special cases, the optimal way to implement memset is to move 32-byte chunks of zeros into the destination with vector operations, which we can implement manually with SIMD intrinsics.
----

8 elements/vector × 1 write/cycle × 2B cycles/second = 16 GFLOPS (64GB/s)
Note: So this CPU can execute one write operation per cycle. And with each write operation, we can write a vector of 8 zeros to the memory. At 2GHz, we are executing 2B cycles per second, and multiplying these numbers together, we get the theoretical upper limit of 16B iterations per second, or theoretical memory bandwidth of 64GB/s (because each number written is a 4-byte integer), which we actually almost reach.
By the way, the compiler also uses SIMD operations for the read only and read plus write loops, but it is a bit more complicated, and this talk is not about SIMD, so I won't explain how exactly.
----
```cpp
const __m256i zeros = _mm256_set1_epi32(0);
for (int i = 0; i + 7 < N; i += 8)
    _mm256_stream_si256((__m256i*) &a[i], zeros);
// ^ non-temporal operations bypass cache
```
Note: Now that we've reimplemented memset ourselves, we can replace the usual vector store intrinsic with a non-temporal one
Non-temporal memory reads or writes are essentially a way to tell the CPU that we won’t be needing the data that we have just accessed in the future, so in our case, there is no need to read the data back after a write, and so the memory system won't execute the implicit read-back after this non-temporal write operation
----

You can speed up `memset` and `memcpy`
(if you don't need the output right away)
<!-- .element: class="fragment" data-fragment-index="1" -->
Note:
If we add it to the benchmark, we get a picture like this.
At the left side of the graph, when the array is small enough to fit into the cache and we actually access it again shortly after, this has a negative effect because we have to write to the RAM each time instead of using a locally cached version.
And on the right side of the graph, when the array is too large to fit into the cache anyway, this prevents read-backs and lets us use the memory bus more efficiently.
In fact, for RAM-sized arrays the improvement is more than twofold: the non-temporal loop is even faster than the read-only loop, despite both request types using the same bandwidth. This happens because the memory controller doesn't have to switch between reading and writing, and also because writes are simpler: the memory controller can just "fire and forget" non-temporal write requests without tracking them, which allows more concurrent write operations compared to the read-only loop
(click)
Also, if you didn't notice, we just sped up memset by a factor of 3, although only for the very specific case when we don't need the data we wrote right away — and most of the time, we actually do, which is the reason why memset doesn't do it. We can do the same with memcpy: there is also a non-temporal read, and if we don't need the source or destination data right away, we can get similar performance improvements by replacing memcpy with a non-temporal version
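Here is a minimal sketch of such a non-temporal copy, assuming 32-byte-aligned arrays and a length divisible by 8 (note that the non-temporal load hint is only guaranteed to bypass the cache on write-combining memory; on ordinary memory it may act like a regular load):
```cpp
#include <immintrin.h>

// Copy n ints from src to dst with non-temporal loads and stores.
// Assumes both pointers are 32-byte aligned and n is a multiple of 8.
void stream_copy(int *dst, int *src, int n) {
    for (int i = 0; i + 7 < n; i += 8) {
        __m256i x = _mm256_stream_load_si256((__m256i*) &src[i]);
        _mm256_stream_si256((__m256i*) &dst[i], x);
    }
    _mm_sfence(); // make the streamed stores visible before the data is reused
}
```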
----
Tip: RAM and CPU caches are different
Note: The lesson is: RAM and CPU caches are different. L1 and L2 caches are like two-lane roads, but the shared L3 cache and the RAM are like one-lane roads. This matters for performance estimation: in many algorithms, you simply calculate how much data you need to transfer and divide it by the memory bandwidth to estimate the running time — but here you need to take into account whether it is read or write bandwidth.
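As a back-of-the-envelope illustration (the numbers are assumed, not measured): summing an array of $10^9$ 4-byte integers has to read about 4 GB, so at roughly 40 GB/s of read bandwidth the loop cannot finish faster than

$$
t \gtrsim \frac{10^9 \cdot 4 \,\text{B}}{40 \,\text{GB/s}} \approx 0.1 \,\text{s}
$$

no matter how few arithmetic operations it performs.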
----
Tip: cache only the data you need
Note: The second tip is: use non-temporal reads and writes for data that you know you won't be needing again. First, it reduces the memory traffic, as we've seen, and second, it doesn't kick out the data already stored in the cache, which can also be beneficial
---
## 4. Latency
Note: Next, let's try and measure latency — the time it takes to fetch just one byte from the moment of its request to the moment you can use it.
Even though bandwidth is a more complicated concept, it is much easier to observe and measure than latency: you can simply execute a long series of independent read or write operations, and the scheduler, having access to them in advance, reorders and overlaps them, hiding their latency and maximizing the total throughput. But measuring latency is harder.
----
```cpp
int p[N], q[N];

// generate a random permutation
std::iota(p, p + N, 0);
std::random_shuffle(p, p + N);

// turn it into a guaranteed single-cycle permutation
int k = p[N - 1];
for (int i = 0; i < N; i++)
    k = q[k] = p[i];

// repeatedly follow the cycle
for (int i = 0; i < N; i++)
    k = q[k];
```
"Pointer chasing" anti-pattern
Note:
To measure latency, we need to design an experiment where the CPU can’t cheat by knowing the memory locations we will request in advance. One way to ensure this is to generate a random permutation of size N that corresponds to a cycle and then repeatedly follow the permutation
(stop here a bit)
----

Note:
We run it, and compared to sequential iteration, it is much slower to visit all elements of an array this way — by about two orders of magnitude (note that the scales are different). Not only does this access pattern make vectorization impossible, but it also stalls the pipeline, creating a traffic jam of instructions, all waiting for a single piece of data, the next pointer, to be fetched from the memory.
This performance anti-pattern is known as pointer chasing, and it is very frequent in data structures, especially those written in high-level languages that use lots of heap-allocated objects and pointers to them, which are necessary for dynamic typing. This is one of the main reasons why C++ is fast and languages like Java and Python aren't
When talking about latency, it makes more sense to use cycles or nanoseconds rather than throughput units…
----

Note: …so we can replace this graph with its reciprocal, and what we get roughly corresponds to latency. Note that the cliffs on this graph aren’t as distinctive as they were for the bandwidth. This is because we are doing random queries, and so there is still some chance of hitting the previous layer of cache even if the array can’t fit into it entirely.
We can infer latency of each layer using math, or use more direct ways of measuring latency, for example using non-temporal reads to make sure that we are fetching from a specific cache layer, but this benchmark is more representative of practical access patterns — because this is what happens when we, say, query a hash table: we read some random cell which may or may not be cached, with the chance depending on the structure size.
----

Note: Similar to bandwidth, the latency of all CPU caches scales proportionally with the clock frequency, while the RAM timings don't. We can observe this difference if we change the frequency by turning turbo boost on, similar to how we did with bandwidth before.
The graph makes more sense if we plot it as a relative speedup…
----

Note: You would expect the relative speedup to be twofold for array sizes that fit into the cache entirely, but then roughly equal for arrays stored in the RAM. But the latter is not quite what is happening: there is a small, fixed-latency delay on the lower-clocked run even for RAM accesses. This happens because the CPU first has to check its cache before sending a read request to the main memory — to save the RAM bandwidth for other processes that may potentially need it — and this is what we see here when we increase the clock speed.
---
## 5. Cache lines
Note: Another very important feature is that the basic unit of data transfer in the memory system is not individual bits and bytes, but cache lines. On most CPU architectures, the size of a cache line is 64 bytes, meaning that all memory is divided into blocks of 64 bytes, and whenever you fetch a single byte, you are also fetching all of its 63 cache line neighbors, whether you want them or not.
----
```cpp
// tweakable parameter ↓
for (int i = 0; i < N; i += D)
    a[i]++;
```
Note: To demonstrate this, we add a “step” parameter to our incrementing loop. Now we only touch every D-th element
----

$D=16$ needs **one** scalar instruction
$D=1$ needs **two** vector instructions
Note: If we run it with D=1 and D=16, we can observe something interesting
When the array fits into the L1 cache, the strided version completes faster — although not 16 but just two times as fast. This is because it only needs to do half the work: it only executes a single instruction for every 16th element, while the original loop needed two 8-element vector instructions to process the same block of 16 elements. Both computations are bottlenecked by writing the result back: the CPU can only write one word per cycle — regardless of whether it is composed of one integer or eight
As the problem size grows, the graphs of the two loops meet, despite one doing 16 times less work than the other. This is because, in terms of cache lines, we are fetching the exact same memory in both loops (16 integers is exactly the size of a single cache line), and the fact that the strided loop only needs one-sixteenth of it is irrelevant because you can't request one machine word, you need to request the entire cache line
----
Tip: measure memory in terms of cache lines
Note: The important practical lesson when designing and analyzing memory-bound algorithms is to count the number of cache lines accessed and not just the total number of memory reads and writes. It is not entirely accurate, but it is much more accurate than counting memory operations or individual bytes
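As a hedged, illustrative count (not a measurement): a sequential pass over $n$ 4-byte integers touches about

$$
\lceil n / 16 \rceil \text{ cache lines,}
$$

while $n$ accesses at random locations can touch up to $n$ distinct lines, so the random pattern may cost roughly 16 times more memory traffic even though the number of load instructions is exactly the same.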
----
Tip: use memory alignment to minimize cache line fetches
Note: Also, the fact that the memory is partitioned into 64B cache lines makes it difficult to operate on data words that cross a cache line boundary. Because when you need to retrieve some primitive type, such as a 32-bit integer, you really want to have it located on a single cache line — both because retrieving two cache lines requires more memory bandwidth and because stitching the results in hardware wastes transistor space.
This aspect influences algorithm design and how compilers choose the memory layout of data structures.
----
```cpp
alignas(32) float a[n];
```
```cpp
void *a = std::aligned_alloc(32, 4 * n);
```
<!-- .element: class="fragment" data-fragment-index="1" -->
```cpp
struct alignas(64) Data {
    // ...
};
```
<!-- .element: class="fragment" data-fragment-index="2" -->
Note:
By default, when you allocate an array of some primitive type, you are guaranteed that the addresses of all elements are multiples of their size, which ensures that they only span a single cache line. For example, the address of the first and every other element of an integer array is guaranteed to be a multiple of 4 bytes (sizeof int).
Sometimes you need to ensure that this minimum alignment is higher. For example, SIMD instructions read and write data in blocks of 32 bytes, and it is critical for performance that these 32 bytes belong to the same cache line. In such cases, you can use the alignas specifier when defining a static array variable -- it guarantees that the beginning of the array will have an address divisible by the alignment, 32 in this case, so it can be easily read with SIMD instructions
(click)
To allocate a memory-aligned array dynamically, you can use std::aligned_alloc, which takes the alignment value and the size of an array in bytes and returns a pointer to the allocated memory — just like the new operator does:
You can also align memory to sizes larger than the cache line. The only restriction is that the size parameter must be an integral multiple of alignment.
(click)
You can also use the alignas specifier when defining a structure
Here, whenever an instance of Data is allocated, it will be at the beginning of a cache line. So, say, if it is smaller than 64, when you request any of its fields, you fetch the entire cache line and the entire structure. The downside is that the effective size of the structure will be rounded up to the nearest multiple of 64 bytes. This has to be done so that, for example, when allocating an array of Data, not just the first element is properly aligned, but all of them, and so some holes will be inserted in the memory layout.
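A hedged usage sketch of the dynamic variant (the rounding is there because, as mentioned, the size must be an integral multiple of the alignment; the helper name is made up):
```cpp
#include <cstdlib>

// Allocate n ints starting at a 64-byte (cache line) boundary.
int* alloc_cache_aligned(size_t n) {
    size_t bytes = (n * sizeof(int) + 63) / 64 * 64; // round up to a multiple of 64
    return static_cast<int*>(std::aligned_alloc(64, bytes));
}
// ...use the array, then release it with std::free()
```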
----
```cpp
struct Data {
    char a;
    short b;
    int c;
    char d;
};
// sizeof(Data) = ?
```
1 + 2 + 4 + 1 = 8?
<!-- .element: class="fragment" data-fragment-index="1" -->
`sizeof(Data) = 12`
<!-- .element: class="fragment" data-fragment-index="2" -->
Note: This issue becomes more complicated when we need to allocate a group of non-uniform elements, which is the case for structure members. Instead of playing Tetris trying to rearrange the members of a struct so that each of them is within a single cache line — which isn’t always possible — C/C++ compilers rely on the mechanism of memory alignment.
Consider the following toy example
(click)
When packed tightly, this structure needs a total of 1 + 2 + 4 + 1 = 8 bytes per instance. But even assuming that the whole structure is guaranteed the alignment of 4 bytes (that of its largest member, int), it could be allocated in the last 4 bytes of a cache line, and the c member would then be split between two cache lines.
(click)
To fix this, the compiler inserts some unnamed members so that each member gets its required minimum alignment, and also so that the next instance of the structure, when instances are stored sequentially, is aligned correctly as well
----
```cpp
struct Data {
    char a;    // 1 byte
    char x[1]; // 1 byte to align the next "short" member
    short b;   // 2 bytes
    int c;     // 4 bytes (sets the alignment for the whole structure)
    char d;    // 1 byte
    char y[3]; // 3 bytes to make total size 12 bytes (divisible by 4)
};
// sizeof(Data) = 12
// alignof(Data) = alignof(int) = sizeof(int) = 4
```
Note: This potentially wastes space but saves a lot of CPU cycles. This trade-off is mostly beneficial, so structure alignment is enabled by default in most compilers.
Padding is only inserted before a not-yet-aligned member or at the end of the structure. By changing the order of members in a structure to align the structure itself, it is possible to reduce the required number of padding bytes and the total size of the structure.
----
```cpp
struct Data {
    int c;
    short b;
    char a;
    char d;
};
```
Note: So we could reorder the structure members like this
Now, each of them is aligned without any padding, and the size of the structure is just 8 bytes. It seems stupid that the size of a structure and consequently its performance depends on the order of definition of its members, but this is required for binary compatibility.
As a rule of thumb, declare the members from the largest types to the smallest — this greedy approach is guaranteed to work unless you have some weird non-power-of-two type sizes, such as the 10-byte long double.
----
```cpp
struct __attribute__ ((packed)) Data {
    long long a;
    bool b;
};
```
Note: Also, if you know what you are doing, you can disable structure padding and pack your data as tight as possible.
You have to ask the compiler to do it, as such functionality is not part of either the C or the C++ standard yet. In GCC and Clang, this is done with the packed attribute
This makes the instances of Data take just 9 bytes instead of the 16 required by alignment, at the cost of possibly fetching two cache lines when reading its elements.
---
## 6. Memory-level parallelism
Note:
Next, memory requests can overlap in time: while you wait for a read request to complete, you can send a few others, which will be executed concurrently with it. This is the main reason why sequential iteration is so much faster than pointer chasing: the CPU knows which memory locations it needs to fetch next and sends memory requests far ahead of time.
The number of concurrent memory operations is large but limited, and it is different for different types of memory. When designing algorithms and especially data structures, you may want to know this number, as it limits the amount of parallelism your computation can achieve.
----

Note:
To find this limit theoretically for a specific memory type, you can multiply its latency (time to fetch a cache line) by its bandwidth (number of cache lines fetched per second), which gives you the average number of memory operations in progress
The latency of the L1/L2 caches is small anyway, so there is no need for a long pipeline of pending requests, but larger memory types can sustain up to 25-40 concurrent read operations.
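As a hedged illustration with assumed round numbers: if a RAM access takes about 80ns and a single core can stream about 25 GB/s (roughly 400M cache lines per second), then on average

$$
80 \cdot 10^{-9} \,\text{s} \times \frac{25 \cdot 10^9 \,\text{B/s}}{64 \,\text{B}} \approx 31
$$

read requests have to be in flight to sustain that bandwidth, which lands in the 25-40 range mentioned above.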
----
```cpp
const int M = N / D;
int p[M], q[D][M], k[D];

for (int d = 0; d < D; d++) {
    std::iota(p, p + M, 0);
    std::random_shuffle(p, p + M);
    k[d] = p[M - 1];
    for (int i = 0; i < M; i++)
        k[d] = q[d][k[d]] = p[i];
}

for (int i = 0; i < M; i++)
    for (int d = 0; d < D; d++)
        k[d] = q[d][k[d]];
```
Note: We can also measure the available memory parallelism more directly by modifying our pointer chasing benchmark so that we loop around D separate cycles in parallel instead of just one. In the experiment, we keep the sum of the cycle lengths fixed at a few select sizes corresponding to different cache levels and try different values of D.
----

Note: We get mostly similar results for all memory types: the performance scales linearly with the number of parallel memory lanes, but they all max out between 13 and 17, because if you use more iterators than that, there has to be a register spill
You don't always get to the maximum possible level of memory parallelism, register spills being one of the reasons why, but for most applications, a dozen concurrent requests is more than enough
----
Tip: exploit memory-level parallelism
(e.g., avoid explicit pointers)
Note: The lesson is: make use of memory-level parallelism. The CPU can easily have dozens of pending memory operations, and this is a very important consideration that comes up in many data structures, especially those that are easier to implement with pointers but faster with something else
----
- vector-backed binary heap, stack, queue, deque
- <!-- .element: class="fragment" data-fragment-index="1" --> chaining vs. open addressing hash tables
- <!-- .element: class="fragment" data-fragment-index="2" -->binary trees vs. B-trees
- <!-- .element: class="fragment" data-fragment-index="3" --> pointers to data vs. data
Note: This is why it is much much more efficient to implement heaps, stacks, queues and so on, on top of a growable array as opposed to pointer-based structures like they teach you in colleges.
This is also an important issue in hash tables, search trees, and also in various places where you need to store some potentially large objects: you can either store the objects themselves directly, or store the objects somewhere else (on the heap), and store a pointer to these objects. Storing the objects themselves makes access to them faster as you don't need to chase pointers, but makes it problematic to move them if the objects are large (at least, larger than the pointer). So there is a natural trade off here
---
## 7. Pointers vs. indices
Note: Another thing that contributes to latency is how you make the memory access.
----
```cpp
for (int i = 0; i < N; i++)
    k = q[k];
```
3ns or 6 cycles for the L1 cache
Note: In the benchmark we didn’t use actual pointers, but integer indices relative to a base address, the beginning of the array. If the array is small enough to fit into the L1 cache, it measures 3ns or 6 CPU cycles per iteration.
----
```nasm
mov eax, DWORD PTR q[0+rax*4] # k = q[k]
```
`mov` takes 4 cycles in the "simple" case and 5 when address computation is required
<!-- .element: class="fragment" data-fragment-index="1" -->
Note: The memory addressing operator on x86 is fused with the address computation, so the k = q[k] line folds into just a single terse instruction that also does multiplication by 4 and addition under the hood
(click)
Although fully fused, these additional computations add some delay to memory operations. The latency of an L1 fetch is either 4 or 5 cycles — the latter being the case if we need to perform a complex computation of the address, like we do in our benchmark
----
```cpp
struct node { node* ptr; };
// ...
node* k = q + p[N - 1];
for (int i = 0; i < N; i++)
    k = k->ptr;
```
2ns or 4 cycles
Note: We can make our benchmark run slightly faster if we replace “fake pointers” — indices — with actual pointers.
This code now runs in 2ns or 4 cycles for arrays that fit in the L1 cache. Here we save 2 cycles because of a feature of this specific microarchitecture, but on most CPUs we would save 1
----

Note: Unfortunately, there is a problem with it on 64-bit systems: the pointers become twice as large as integer indices, making the array spill out of cache much sooner compared to using a 32-bit index. The latency-versus-size graph looks as if it were shifted by one power of two to the left — exactly as it should
----

Note: Although this problem can be mitigated by switching to 32-bit mode, with its own downsides
----
Tip: prefer direct pointers if the data fits L1
Note: The lesson is to use pointers if the data set is small enough, to save a few cycles accessing it. Otherwise, use indices, maybe wrapped in iterators, to save cache space — it may look like a bad software practice, but there is really nothing wrong with them
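A hedged sketch of what the trade-off looks like for a single linked node on a typical 64-bit system (the struct names are made up):
```cpp
// Index into a contiguous node pool: 4 bytes per link, 8 bytes per node,
// but following a link costs an extra address computation.
struct NodeIdx { int value, next; };

// Raw pointer: 8 bytes per link, 16 bytes per node after alignment,
// but dereferencing it is a cycle or two cheaper.
struct NodePtr { int value; NodePtr *next; };

static_assert(sizeof(NodeIdx) == 8 && sizeof(NodePtr) == 16);
```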
---
## 8. Memory sharing
<!--
https://en.algorithmica.org/hpc/cpu-cache/img/lstopo.png =300x
`lstopo` output
-->
Note: Starting at some level of the hierarchy, the cache becomes shared between different cores…
----

"Haswell" die shot
Note: This lets you add more cores on a single chip but also poses some “noisy neighbor” problems as it limits the effective cache size and bandwidth available to a single execution thread.
On most CPUs, only the last layer of cache, L3, is shared, and not always in a uniform manner.
On Linux, the topology of the memory system can be retrieved with lstopo
(lstopo)
For example, on my machine, there are 8 physical cores, each has access to its own 32 KB of L1 cache, 512KB of L2 cache, but the L3 cache is shared. Moreover, it is not shared uniformly: its total size is 8M, but it is split into two halves: two groups of 4 cores have access to their own 4M region of the L3 cache, and not all of it.
There are also more complex topologies, where accessing certain regions of memory takes non-constant time, different for each core, which is the case for multi-socket systems that have several separate CPU chips installed
And this has some important implications for parallel computing: the performance of multi-threaded memory accesses depends on which cores are running the execution threads. To demonstrate this, we can run the bandwidth benchmarks in parallel. I like writing code, but more than writing code I like not writing code…
----
- `parallel` runs multiple instances of a command in parallel
- `taskset` restricts them to specific cores (CPU affinity)
`parallel taskset -c 0,1,2,3 ./run ::: {0..3}`
<!-- .element: class="fragment" data-fragment-index="1" -->
Note: …so instead of modifying the source code to run on multiple threads, we simply run multiple identical processes with GNU parallel, a Linux utility. And to control which cores are executing which processes, we set their processor affinity with taskset.
(click)
This combined command runs 4 processes that can run on the first 4 cores of the CPU:
----

Note: Here is what we get when we change the number of processes running simultaneously
The L1 and L2 cache are private to each core, so the performance is unaffected there, but as the cores start competing for the shared L3 cache and the RAM, the performance degrades.
When looking at the RAM section of the graph, it may seem that with more cores, the per-process throughput goes ½, ⅓, ¼, and so on, and the total bandwidth remains constant. But this isn’t quite true: the contention hurts, but a single CPU core usually can’t saturate all of the RAM bandwidth.
----

Note: If we plot it more carefully, we see that the total bandwidth actually increases with the number of cores — although not proportionally, with diminishing returns, but it eventually approaches its theoretical maximum of ~42.4 GB/s:
The jump of performance between 4 and 5 cores is not a fluke. We specifically set all processes in each experiment to run on the first n cores — and the first and the second halves of these cores have separate L3 caches and memory controllers. If some of the processes were to be scheduled on the other half of the cores, there would be less contention for the L3 cache and also, in part, memory bandwidth.
----
```bash
parallel taskset -c 0,1 ./run ::: {0..1} # L3 cache sharing
parallel taskset -c 0,4 ./run ::: {0..1} # no L3 cache sharing
```
Note: To show this, let's run another benchmark, now pinning the two processes either to cores that share the L3 cache or to cores from different groups that don't:
----

Note: You can see that when you pin the threads to different core groups, they perform better — as if there were twice as much L3 cache available:
----
Tip: bandwidth is a shared and limited resource
Note: So, bandwidth is a shared resource. It is limited, but a single CPU core usually can't saturate it fully. If you have a memory-intensive task, it makes sense to add more cores, but the returns are usually diminishing. This is especially hard to manage in the cloud setting. Say, in Kubernetes or similar orchestration systems, there is control in terms of which cores you get and how much memory you can allocate, but the memory bandwidth can't be separated between user applications, so increasing the requirement for cores or the memory requirement does not necessarily help.
Also, since there is competition for memory bandwidth, it is important to remove all interference when benchmarking to get accurate results.
----
Tip: CPU affinity matters in NUMA systems
(if threads communicate frequently, put them close; otherwise, put them apart)
<!-- .element: class="fragment" data-fragment-index="1" -->
Note: Also, unless the system is perfectly symmetrical, and it rarely is, it matters which cores run which threads.
(click)
If the threads use a lot of memory bandwidth or the shared L3 cache, put them apart so that they do not compete. On the other hand, if two threads need to communicate, say take locks, they do so via the least common ancestor in the topology. If that ancestor is RAM, this communication would take around five times longer than if the ancestor were the L3 cache. So, if they need to communicate, it may make sense to do the opposite and put them in the same group.
---
## 10. Prefetching
(software & hardware)
<!-- .element: class="fragment" data-fragment-index="1" -->
Note: Let's now get back to the question of memory-level parallelism. Taking advantage of the free concurrency available in memory hardware, it can be beneficial to prefetch data that is likely to be accessed next if its location can be predicted.
This is easy to do when the processor executes a fixed stream of independent instructions as it can just peek ahead, but sometimes the memory locations aren’t in the instruction stream, and yet they can still be predicted with high probability. In these cases, they can be prefetched by other means:
(click)
- either explicitly, by separately reading any byte in the cache line we want to lift up the cache hierarchy. This is called software prefetching.
- or implicitly, by using simple access patterns such as sequential iteration, which are detectable by the memory hardware that can start prefetching automatically. This is called hardware prefetching
----
```cpp
int q[N];
std::iota(q, q + N, 1); // q[i] = i + 1: each element points to the next one
q[N - 1] = 0;           // ...and the last one loops back to the beginning

int k = 0;
for (int i = 0; i < N; i++)
    k = q[k];
```
Flat 3ns (as if in L1) regardless of the array size
Note: Let’s modify the pointer chasing benchmark to show the effect of hardware prefetching. Instead of a random permutation, we cycle over a permutation that always points to the next element and loops around at the end. And we get the performance as if the array was in the L1 cache regardless of its size. The processor doesn't know for sure that we will be reading the next element, but based on the access patterns, the memory controller can speculate that we will need the next cache line and prefetch it ahead of time.
Hardware prefetching only detects simple patterns. You can iterate forward and backward over multiple arrays in parallel, perhaps with some small strides, but that’s about it. For anything more complex, the prefetcher won’t figure out what’s happening, and we need to help it out ourselves with explicit software prefetching. The simplest way to do software prefetching is to load any byte in the cache line with the mov or any other memory instruction…
----
`__builtin_prefetch(ptr)`
Note: …but CPUs have a separate prefetch instruction that lifts a cache line without doing anything with it. This instruction isn’t a part of the C or C++ standard, but is available in most compilers as the __builtin_prefetch intrinsic
It’s quite hard to come up with a simple example when it can be useful. To make the pointer chasing benchmark benefit from software prefetching, we need to construct a permutation that at the same time loops around the entire array, can’t be predicted by hardware prefetcher, and has easily computable next addresses
----
```cpp
const int n = find_prime(N); // largest prime not exceeding N
for (int i = 0; i < n; i++)
    q[i] = (2 * i + 1) % n;
```
If $n$ is a prime, the period will be exactly $n$
(a property of linear congruential generators)
Note: And we can get such a permutation, for example, if we generate it with a linear congruential generator, a well-known and simple random number generator where we multiply the previous value by some constant, add another constant, and take the result modulo n. It has the property that if the modulus n is a prime number, the period of the generator will be exactly n.
So we get all the properties we need: it loops around the entire array, it is too hard to be predicted by hardware prefetcher, and we can compute it in advance.
----
```cpp
int k = 0;

for (int t = 0; t < K; t++) {
    for (int i = 0; i < n; i++) {
        __builtin_prefetch(&q[(2 * k + 1) % n]);
        k = q[k];
    }
}
```
Note: If we generate the permutation using LCG, we get the ability to peek ahead, and if we do prefetch one iteration in advance…
----

Note: …the latency becomes almost half of what it was for large arrays.
----
$$
\begin{aligned}
f(x) &= 2 \cdot x + 1
\\ f^2(x) &= 4 \cdot x + 2 + 1
\\ f^3(x) &= 8 \cdot x + 4 + 2 + 1
\\ &\ldots
\\ f^k(x) &= 2^k \cdot x + (2^k - 1)
\end{aligned}
$$
```cpp
// concurrently fetch D elements ahead
__builtin_prefetch(&q[((1 << D) * k + (1 << D) - 1) % n]);
```
<!-- .element: class="fragment" data-fragment-index="1" -->
Note: Interestingly, we can prefetch more than just one element ahead, making use of this pattern in the LCG function when we expand it
(click)
So, to load the D-th element ahead, we can do this, and if we execute this request on every iteration, we will be simultaneously prefetching D elements ahead on average, increasing the throughput by D times
----

Note: And this way, we can reduce the average latency arbitrarily close to the cost of computing the next index
----
Tip: when you know which elements you may need, prefetch them
(although most of the time, the CPU already does it)
<!-- .element: class="fragment" data-fragment-index="1" -->
Note: So, when you know which elements you will need, prefetch them. This is an artificial example, but there are cases when it is useful, they are just too complicated for this talk.
(click)
But it is also worth noting that a lot of cases are already handled by hardware prefetching, and you actually fail more often than not when trying to insert software prefetching into practical programs. This is largely because you need to issue a separate memory instruction that may compete for resources with the others. At the same time, hardware prefetching is 100% harmless as it only activates when the memory and cache buses are not busy.
---
## 11. Cache associativity
Note: Prefetching is a mechanism that helps performance, and now I'm going to showcase a really weird but quite frequently occurring feature that hurts it.
----
```cpp
const int N = 2e6;
// loop A
for (int i = 0; i < N; i += 256)
    a[i]++;

// loop B
for (int i = 0; i < N; i += 257)
    a[i]++;
```
Which one is faster? <span><!-- .element: class="fragment" data-fragment-index="1" --> **B, by 10x!**</span>
Note: Consider the same strided incrementing loop over an array of about 2 million integers that we used for the bandwidth measurements, but now with step sizes of 256 and 257.
Which one do you think will be faster to finish? There are several considerations that come to mind:
- At first, you might think that there shouldn't be much difference, or maybe that the second loop is about 1/257 faster because it does fewer iterations in total
- But then you recall that 256 is a nice round number, which may have something to do with SIMD or the memory system, so maybe the first one is faster.
But the right answer is very counterintuitive
(click)
the second loop is faster — and by a lot, a factor of 10. This isn’t just a single bad step size…
----

Note: The performance degrades for all indices that are multiples of large powers of two
There is no vectorization or anything, and the two loops produce the same assembly except for the step size. This effect is due only to the memory system, in particular to a feature called cache associativity, which is a peculiar artifact of how CPU caches are implemented in hardware.
In software, the most common caching strategy is the least recently used policy, where we have some fixed-size capacity for cached data, and when we need to add something to the cache, we need to make space for it and we kick out the element that we have not accessed for the longest time — hence the name, least recently used cache.
----

Fully associative cache
Note: …in the context of hardware, such a scheme is called a fully associative cache: we have M cells, each capable of holding a cache line corresponding to any of the N total memory locations, and in case of contention, the cache line not accessed the longest gets kicked out and replaced with the new one.
The problem with a fully associative cache is that implementing the "find the oldest cache line among millions" operation is already pretty hard to do in software and just infeasible in hardware. You can make a fully associative cache that has 16 entries or so, but managing hundreds of cache lines already becomes either prohibitively expensive or so slow that it's not worth it.
We can resort to another, much simpler approach…
----

Direct-mapped cache
Note: …just map each block of 64 bytes in RAM to a single cache line that it can occupy. Say, if we have 4096 blocks in memory and 64 cache lines for them, then each cache line at any time stores the contents of one of 4096/64 = 64 different blocks.
A direct-mapped cache is easy to implement as it doesn’t require storing any additional meta-information associated with a cache line except its tag (the actual memory location of a cached block). The disadvantage is that the entries can be kicked out too quickly — for example, when we are repeatedly jumping between two addresses that map to the same cache line — leading to lower overall cache utilization.
For that reason, we settle for something in-between direct-mapped and fully associative caches…
----

Set-associative cache (2-way associative)
Note: …the set-associative cache. It splits the address space into equal groups, which separately act as small fully-associative caches. It's like sharding in distributed computing.
Associativity is the size of these sets, or, in other words, how many different cache lines each data block can be mapped to. Higher associativity allows for more efficient utilization of cache space but also increases the cost.
For example, on my CPU, the L3 cache is 16-way set-associative, while L1 and L2 caches are 8-way set-associative. Most other CPU caches are also set-associative, except for the small ones that house very few entries and can afford full associativity.
----

- *offset:* the index of the word within a 64B cache line ($\log_2 64 = 6$ bits)
- *index:* the index of the cache line set (the next $12$ bits as there are $2^{12}$ cache line sets in L3)
- *tag:* the rest of the memory address used to tell the memory blocks apart
Note: There is only one ambiguity remaining: how exactly the cache line mapping is done.
If we implemented set-associative cache in software, we would compute some hash function of the memory block address and then use its value as the cache line index. In hardware, we can’t really do that because it is too slow: for example, for the L1 cache, the latency requirement is 4 or 5 cycles, and even taking a modulo takes about a dozen cycles, let alone something more sophisticated.
Instead, the hardware uses the lazy approach. It takes the memory address that needs to be accessed and splits it into three parts — from lower bits to higher:
- offset — the index of the word within a 64B cache line (there are 64 bytes to address, so you need to reserve log 64 = 6 bits);
- index — the index of the cache line set (the next 12 bits, as there are 2^12 cache line sets in the L3 cache);
- tag — the rest of the memory address, which is used to tell the memory blocks stored in the cache lines apart.
In other words, all memory addresses with the same “middle” part, these 12 bits, map to the same set. This makes the cache system simpler and cheaper to implement but also susceptible to certain bad access patterns.
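A hedged sketch of this split for the L3 parameters above (purely illustrative; the bit widths just mirror the numbers on the slide):
```cpp
#include <cstdint>

// Split an address the way a cache with 64-byte lines and 2^12 sets would.
struct CacheAddress {
    std::uint64_t tag;    // everything above the index bits
    std::uint64_t index;  // which set of cache lines the block maps to
    std::uint64_t offset; // byte position within the 64-byte line
};

CacheAddress split(std::uint64_t addr) {
    return {addr >> 18, (addr >> 6) & 0xfff, addr & 0x3f};
}
```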
----
```cpp
for (int i = 0; i < N; i += 256)
    a[i]++;
```
When we iterate in steps of 256, accessed elements map to the same cache lines
<!-- .element: class="fragment" data-fragment-index="1" -->
Note: Now, where were we? Oh, yes: the reason why iteration with strides of 256 causes such a terrible slowdown.
When we jump over 256 integers, the pointer always increments by 1024 bytes, that is, 256 times 4. This means that the last 10 bits of the address remain the same. The cache system uses the lower 6 bits for the offset and the next 12 bits for the set index, and since the last 10 bits stay the same, we are effectively using only 8 of the 12 available index bits of the L3 cache, which shrinks the usable cache space by a factor of 16, and so the data spills into the RAM, which is much slower.
(click)
Performance issues caused by cache associativity effects arise with remarkable frequency in algorithms because, for multiple reasons, programmers just love using powers of two when indexing arrays:
- It is easier to calculate the address for multi-dimensional array accesses if the last dimension is a power of two, as it only requires a binary shift instead of a multiplication.
- It is easier to calculate modulo a power of two, as it can be done with a single bitwise and.
- It is convenient and often even necessary to use power-of-two problem sizes in divide-and-conquer algorithms.
- Two is the smallest integer base, so the sequence of increasing powers of two is a popular choice of problem sizes when benchmarking memory-bound algorithms.
- Also, the more natural powers of ten are always divisible by somewhat smaller powers of two.
----
Matrix multiplication is ~30% slower for matrices of size $256$ than for matrices of size $257$
----
```cpp
for (int i = 0; i < n; i++)
    for (int j = 0; j < n; j++)
        for (int k = 0; k < n; k++)
            c[i * n + j] += a[i * n + k] * b[k * n + j];
```
Note: because the accesses to b jump with a stride of n = 256 elements, which again map to the same cache line sets
----
Binary search is ~20% slower for arrays of size $2^{20}$ than for arrays of size $(2^{20} + 123)$
----
```cpp
int lower_bound(int x) {
    int l = 0, r = n - 1;
    while (l < r) {
        int m = (l + r) / 2;
        if (t[m] >= x)
            r = m;
        else
            l = m + 1;
    }
    return t[l];
}
```
Note: this is slightly more complicated.
When the array size is a multiple of a large power of two, then the indices of the “hottest” elements, the ones we likely request on the first dozen or so iterations, will also be divisible by some large power of two and map to the same cache lines — kicking each other out and causing a ~20% performance decrease
----
Tip: avoid iterating in powers of two, insert “holes” in the memory layout
(better: avoid all "jumps" altogether)
<!-- .element: class="fragment" data-fragment-index="1" -->
Note: Luckily, such issues are more of an anomaly rather than serious problems. The solution is usually simple: avoid iterating in powers of two, make the last dimensions of multi-dimensional arrays a slightly different size or use any other method to insert “holes” in the memory layout, or create some seemingly random mapping between the array indices and the locations where the data is actually stored.
(click)
Or better yet, make it so that you don't have jumps in the first place. Because when you are doing jumps, you are probably doing something wrong anyway. For example, the optimal algorithms for matrix multiplication and binary search both permute the input elements so that the reads are as sequential as possible.
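For binary search, one such permutation is the Eytzinger (BFS) layout described in the book. Here is a minimal sketch, not the exact code from the book: it reuses the global `t` and `n` from the previous slide, but here `t` is 1-indexed with n + 1 entries, `t[0]` holds a sentinel for “every element is smaller than x”, and `sorted` is an assumed name for the 0-indexed sorted input.
```cpp
#include <vector>

int n;                        // number of elements
std::vector<int> sorted, t;   // sorted input (0-indexed) and its permuted copy (1-indexed)

int build(int i = 0, int k = 1) {  // in-order traversal of the implicit binary tree
    if (k <= n) {
        i = build(i, 2 * k);       // left subtree
        t[k] = sorted[i++];        // current node
        i = build(i, 2 * k + 1);   // right subtree
    }
    return i;
}

int lower_bound(int x) {
    int k = 1;
    while (k <= n)
        k = 2 * k + (t[k] < x);    // go left (2k) if t[k] >= x, right (2k + 1) otherwise
    k >>= __builtin_ffs(~k);       // cancel the trailing right turns
    return t[k];                   // t[0] is returned when no element is >= x
}
```
The hot elements now sit near the top of the array next to each other, so the first comparisons hit the same few cache lines instead of addresses spread apart by large powers of two.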
---
## 12. Paging
Note: Now, apart from cache lines, the memory system has another unit of granularity: memory pages. Understanding how they work is also important for performance.
----
```cpp
const int N = (1 << 13); // total number of accesses stays constant
int a[D * N];            // D = step size, varied across runs of the experiment
for (int i = 0; i < D * N; i += D)
    a[i] += 1;
```
Note: Consider the same strided incrementing loop, but now with larger strides, increasing the array size proportionally so that the total number of iterations N remains constant. As the total number of memory accesses also remains constant, for all step sizes we should be fetching exactly N, about 8 thousand, cache lines, and this is precisely the amount that fits into the L2 cache. The number of iterations does not depend on the step size, so the graph should look more or less flat, right?
----

Note: Well, it starts flat, but from around step size 256 onward it is not: the performance degrades.
This anomaly is also due to the cache system, although the standard L1-L3 data caches have nothing to do with it. Virtual memory is at fault.
----

Note: First, some background. Virtual memory is a mechanism that gives each process the impression that it fully controls a contiguous region of memory, which in reality may be mapped to multiple smaller blocks of physical memory, including both the main memory (RAM) and external memory (HDD, SSD).
Some devices, such as GPUs, don't have virtual memory: all pointers that programs hold are real physical ones. But CPUs run operating systems that need to offer some degree of isolation between user processes and some flexibility in memory allocation, and therefore they use virtual memory.
To achieve this, the memory address space is divided into pages (typically 4KB in size), which are the base units of memory that programs can request from the operating system. The memory system maintains a special data structure called the page table, which contains the mappings from virtual page addresses to physical ones. When a process accesses data through a virtual memory address, the memory system calculates its page number (by dividing the address by 4096, the page size), looks up its physical address in the page table, and forwards the read or write request to where that data is actually stored.
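As a back-of-the-envelope sketch of that translation (assuming the 4K pages mentioned above; `page_table` here is just a stand-in array for the real multi-level structure):
```cpp
#include <cstdint>

constexpr uint64_t PAGE_SIZE = 4096;

uint64_t translate(const uint64_t *page_table, uint64_t virtual_addr) {
    uint64_t page   = virtual_addr / PAGE_SIZE;   // which page table entry to consult
    uint64_t offset = virtual_addr % PAGE_SIZE;   // position inside the page is unchanged
    return page_table[page] * PAGE_SIZE + offset; // resulting physical address
}
```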
----
16G RAM / 4K page size = 4M pages
We can **increase page size** and/or introduce a separate **cache for the page table**
<!-- .element: class="fragment" data-fragment-index="1" -->
Note: Since the address translation needs to be done for each memory request, and the number of memory pages may itself be large (e.g., 16G RAM / 4K page size = 4M pages), address translation poses a difficult problem in itself: keeping track of 4M entries while still being able to read them with minimal latency is hard. So there are two solutions:
(click)
- First, we can increase the page size, so that the total number of pages is smaller, but this comes at the cost of granularity: we don't want the minimum amount of memory a process can allocate to be in the megabytes.
- Second, we can speed up the page table lookup using a special cache for the page table itself, called the translation lookaside buffer (TLB).
----
- The default page size is 4K
- The L1 TLB has 64 entries, so it can handle $64 \times 4K = 256K$ of active memory
- The L2 TLB has 2048 entries, and it can handle $2048 \times 4K = 8M$ of active memory
Note: On my CPU, there are two levels of TLB:
- The L1 TLB has 64 entries, and if the page size is 4K, then it can handle 64 × 4K = 256K of active memory without going to the L2 TLB.
- The L2 TLB has 2048 entries, and it can handle 2048 × 4K = 8M of memory without going to the page table.
----

$$
\underbrace{8K}_{N} \times \underbrace{256}_{step} \times \underbrace{4B}_{sizeof(int)} = 8M
$$
<!-- .element: class="fragment" data-fragment-index="1" -->
Note: Let's get back to our experiment.
How much memory is allocated when step size becomes equal to 256, the critical point? 8K x 256 x 4B = 8M, exactly the limit of what the L2 TLB can handle. When step size gets larger than that, some requests start getting redirected to the main page table, which has a large latency and very limited throughput, which bottlenecks the entire computation.
That 8MB of slowdown-free memory seems like a very tight restriction. While we can’t change the characteristics of the hardware to lift it, we can increase the page size, which would in turn reduce the pressure on the TLB capacity.
Modern operating systems allow us to set the page size both globally and for individual allocations. CPUs only support a defined set of page sizes — mine, for example, can use either 4K or 2M pages. Anything over the default 4K is called huge pages on Linux and large pages on Windows.
One way of doing it is to enable huge pages globally. On Linux, there is a special system file that governs the allocation of transparent huge pages:
`sudo su`
`cat /sys/kernel/mm/transparent_hugepage/enabled`
`echo always > /sys/kernel/mm/transparent_hugepage/enabled`
But this is bad practice: we usually don't want every new allocation to be that large.
`echo madvise > /sys/kernel/mm/transparent_hugepage/enabled`
The other option is the `madvise` mode, which lets a process itself request huge pages for specific allocations.
----
```cpp
#include <sys/mman.h> // madvise, MADV_HUGEPAGE
#include <cstdlib>    // std::aligned_alloc

const int PAGE_SIZE = (1 << 21); // 2M huge page
// array_size must be a multiple of PAGE_SIZE
void *ptr = std::aligned_alloc(PAGE_SIZE, array_size);
madvise(ptr, array_size, MADV_HUGEPAGE);
```
```cpp
#include "memoryapi.h"
void *ptr = VirtualAlloc(NULL,
array_size,
MEM_RESERVE | MEM_COMMIT | MEM_LARGE_PAGES,
PAGE_READWRITE);
```
<!-- .element: class="fragment" data-fragment-index="1" -->
Note: Here is how to do it in C++. You allocate some memory, whose size needs to be a multiple of the page size, and then you make an madvise system call on that region. When this memory is first accessed, the operating system will back it with a huge page.
(click)
Windows and other operating systems have similar mechanisms of requesting larger pages.
----

Note: Both ways of allocating huge pages immediately flatten the curve, as expected: now all memory requests go to the L1 or L2 TLB instead of the main page table.
----

Note: Enabling huge pages also improves latency by up to 10-15% for arrays that don’t fit into the L2 cache
----
Tip: units of locality are not only cache lines
Note: The lesson to learn is that the units of spatial locality are not only cache lines but also, for example, memory pages. Keep your data spatially local.
----
Tip: use HugePages in memory-intensive applications
Note: And also, enabling huge pages is a good idea when you have any sort of sparse reads over large data sets, as they usually slightly improve performance and almost never hurt it.
That said, you shouldn’t rely on huge pages if you can avoid it, as they aren’t always available due to hardware or computing-environment restrictions. Also, there are many other reasons why grouping data accesses spatially may be beneficial, and doing so automatically solves the paging problem.
---
## What else
- <!-- .element: class="fragment" data-fragment-index="1" --> Tools (perf, cachegrind)
- <!-- .element: class="fragment" data-fragment-index="2" --> Instruction caches
- <!-- .element: class="fragment" data-fragment-index="3" --> RAM-specific timings
- <!-- .element: class="fragment" data-fragment-index="4" --> Non-uniform memory access (NUMA)
- <!-- .element: class="fragment" data-fragment-index="5" --> MESI protocol, atomic operations
- <!-- .element: class="fragment" data-fragment-index="6" --> Non-volatile memory (SSDs, HDDs, NFS, S3…)
- <!-- .element: class="fragment" data-fragment-index="7" --> File systems, virtualization
- <!-- .element: class="fragment" data-fragment-index="8" --> Error detection and correction
- <!-- .element: class="fragment" data-fragment-index="9" --> Memory management
Note: I initially wanted to name this talk "What you need to know about memory" or something along those lines, but I realized that this talk is very, very far from being enough. There are many related cool things that I intentionally did not cover, and I would like to at least mention them so that you can read up on them later:
- First, there are tools for memory analysis. For more complicated programs, you may want to use tools that measure how many cache misses or page faults your program produces. Perf and similar statistical profilers can do that, and cachegrind can also tell you where these cache misses are happening.
- To execute any sequence of instructions, the processor first needs to load it, and to speed things up there is a system of instruction caches and various program layout optimizations, for example ones that group hot code together so that it occupies the same cache lines and memory pages.
- RAM itself is quite different from the CPU caches and has its own timings and quirks.
- Performance becomes harder to assess on NUMA systems, where you have multiple CPU sockets or multiple RAM sticks with different access latencies.
- There are also complicated things happening when multiple cores want the same cache line or otherwise need to communicate.
- External memory has some very specific properties that are especially important for implementing high-performance storage systems.
- How the operating system interacts with these storage devices.
- How it ensures that the stored data is correct and not corrupted by stray cosmic rays and such.
- And last but definitely not least, how memory allocation and garbage collection happen, or should happen.
These are all large, very important topics with separate books written about them. So no, you do not know everything you need to know about memory. Nobody does. I highly encourage you to read up on them.
---
**Algorithms for Modern Hardware, Chapter 9: "RAM & CPU Caches"**
en.algorithmica.org/hpc/cpu-cache
**"What Every Programmer Should Know About Memory"** by Ulrich Drepper
akkadia.org/drepper/cpumemory.pdf
**"Want fast C++? Know your hardware!"** by Timur Doumler
youtu.be/BP6NxVxDQIs
Note:
In terms of more immediate reading recommendations:
- First, there is my book, which covers everything in this talk and a lot more.
- There is a manual titled "What Every Programmer Should Know About Memory" by Ulrich Drepper. It is 114 pages of two-column text, it was written in 2007, but it is still relevant, still great, and highly recommended.
- And Timur Doumler, who is also a speaker at this conference by the way, gave a talk titled "Want fast C++? Know your hardware!" in 2016 that is very similar in style to this one but covers other things too. If you watched to this point, you probably liked this talk, and if you liked this talk, you will probably like Timur's talk.
That's about it. Thank you, everyone.
{"metaMigratedAt":"2023-06-17T14:20:31.482Z","metaMigratedFrom":"YAML","title":"CPU Cache Effects","breaks":true,"slideOptions":"{\"theme\":\"white\",\"transition\":\"none\"}","contributors":"[{\"id\":\"045b9308-fa89-4b5e-a7f0-d8e7d0b849fd\",\"add\":139391,\"del\":67360}]"}