# data-oriented design

While working on the [Roc compiler](https://github.com/roc-lang/roc), we regularly dive deep on computer science topics. A recurring theme is speed: both the runtime performance of the code that we generate, and the performance of our compiler itself.

One extremely useful technique that we have been playing with is data-oriented design: the idea that the actual data you have should guide how code is structured.

DoD is commonly used in game programming, where runtime speed defines what is possible. Recently it has seen more use in other domains, for instance in the Zig and Rust compilers, and in projects that focus on runtime speed like [Mold](https://github.com/rui314/mold) and [Bun](https://bun.sh/). The talk [Practical Data-Oriented Design](https://vimeo.com/649009599) by Zig creator Andrew Kelley is an excellent introduction to the ideas behind DoD.

In this post, I'll show a change we made to the Roc compiler using some of these ideas.

## The collection is the abstraction

The goal of DoD is faster programs, but it also provides an interesting lens for program design in general.

The most common example of DoD is Array of Structs versus Struct of Arrays. In most programming traditions, it is natural to have values of some shape, the structs, and if you have many of them, you put them into an array. The unit of thought is the individual struct: it is what methods are defined on.

```rust
struct Point2d {
    x: f32,
    y: f32,
}

type Polygon = Vec<Point2d>;
```

This is intuitive, but DoD takes the opposite perspective: the collection as a whole is the unit of thought:

```rust
struct Polygon {
    xs: Vec<f32>,
    ys: Vec<f32>,
}
```

While this example illustrates the transformation, it does not really show the advantages that DoD has. Let's try to apply the DoD hammer to some real problems.

## DoD and compilers

Compilers are usually considered CPU-bound: they transform input (the source code) into output (executables), and are limited by how fast the machine can process the data. In practice, however, the problem is a bit more nuanced. Many modern applications are not limited by the speed of the CPU -- how long it takes to perform an instruction -- but rather by getting the instructions and the data to the CPU. Compilers are mostly memory-bound.

Data-oriented design takes this as a starting point, and provides ideas for how to structure code such that your hardware can perform your computation efficiently.

## Example problem

The Roc parser looks at a Roc source file, and produces a list of top-level definitions in that file. These are stored as a list of `Def`s combined with their start and end position in the file:

```rust
/// A span in the input file
#[derive(Clone, Eq, Copy, PartialEq, PartialOrd, Ord, Hash)]
pub struct Region {
    start_position: u32,
    end_position: u32,
}

#[derive(Clone, Eq, Copy, PartialEq, PartialOrd, Ord, Hash)]
pub struct Loc<T> {
    pub region: Region,
    pub value: T,
}

#[derive(Debug, Clone, Copy, PartialEq)]
pub enum Def<'a> {
    Type(TypeDef<'a>),
    Value(ValueDef<'a>),

    // Blank space (e.g. comments, spaces, newlines) before or after a def.
    // We preserve this for the formatter; canonicalization ignores it.
    SpaceBefore(&'a Def<'a>, &'a [CommentOrNewline<'a>]),
    SpaceAfter(&'a Def<'a>, &'a [CommentOrNewline<'a>]),
}

type Defs<'a> = Vec<Loc<Def<'a>>>;
```

The parsed representation is used both to generate an executable (a program you can run) and to format Roc code. For the formatter, the order of definitions matters, and comments must be preserved: we don't want the formatter to re-order top-level definitions and throw away all the comments!
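To make this concrete, here is a minimal sketch of how a definition preceded by a comment ends up wrapped in `SpaceBefore`. Note that `ValueDef` and its `name` field are hypothetical stand-ins defined only so the example compiles; the real parser types carry much more information.

```rust
#![allow(dead_code)] // illustrative sketch, not the real parser

// Hypothetical stand-in for the real, much richer, parser type.
#[derive(Debug)]
pub struct ValueDef<'a> {
    name: &'a str,
}

#[derive(Debug)]
pub enum CommentOrNewline<'a> {
    Comment(&'a str),
    Newline,
}

#[derive(Debug)]
pub enum Def<'a> {
    Value(ValueDef<'a>),
    SpaceBefore(&'a Def<'a>, &'a [CommentOrNewline<'a>]),
}

fn main() {
    // Parsing something like:
    //
    //     # the answer to everything
    //     answer = 42
    //
    // produces a `Value` def wrapped in a `SpaceBefore` that records the
    // comment, so the formatter can reproduce it in the output.
    let inner = Def::Value(ValueDef { name: "answer" });
    let spaces = [CommentOrNewline::Comment(" the answer to everything")];
    let def = Def::SpaceBefore(&inner, &spaces);

    println!("{def:?}");
}
```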
The `Def` type is at first sight a nice representation: we use `enum`s so we can later match on all the cases. This is close to how one might represent this idea in Haskell or Elm. But when working with this type, there are some problems: there are illegal states that can be represented. For instance, we could in theory see

```
SpaceAfter(SpaceAfter(..., &[ ... ]), &[ ... ])
```

We should never generate such a value, but structurally the type allows it to exist, and Rust makes us deal with such technically-possible cases.

How can data-oriented design help?

## Feeding the CPU

Data-oriented design is commonly used as an optimization technique. The goal is to do the actual work that needs to be performed as fast as possible on real hardware. This in turn means that the data must be represented in a form that the hardware can deal with efficiently.

In university courses we typically learn about performance in terms of big-O classes. But good algorithm design is not enough to make a program run fast in the real world: we must also play to the strengths of the hardware. The limiting factor in our compiler, and in most data-intensive programs, is usually not the CPU, but how much data we can feed the CPU. Most of the time, the CPU is waiting for new inputs to become available. We can, however, design our programs in a way that utilizes the CPU more effectively.

### Loading data

To execute an instruction, two things must happen: first, the instruction and its arguments must be loaded, and second, the CPU actually performs the instruction. On modern hardware, loading the instruction and its arguments usually has a higher latency than performing the instruction. Counter-intuitively, that means it can be faster to perform a computation repeatedly than to cache its result. In other words, the CPU itself is not actually the bottleneck. This was not always the case, but over the decades we've seen much larger improvements in CPU speed than in memory speed.

To remedy the (relative) slowness of retrieving data from memory, CPUs cache the most recently used data: there is a good chance it will be used again. CPUs have several layers of caches:

```
> lscpu | grep "cache:"
L1d cache: 256 KiB
L1i cache: 512 KiB
L2 cache:  4 MiB
L3 cache:  16 MiB
```

The level 1 (L1) cache has two flavors: one cache for the instructions of your program, and one for the data used by your program. Levels 2 and 3 don't make this distinction. These caches make a speed-versus-size tradeoff: L1 is tiny but loading data stored there is very fast; L3 is significantly bigger but somewhat slower (still faster than main memory, though).

The memory in the caches is not just a collection of bytes: it is partitioned into cache lines. Today these are usually 64 bytes wide. When a value is loaded from memory, what is really loaded is its surrounding cache line. For instance, `my_vector[0]` will load not just that element, but a full cache line from main memory. If we subsequently use `my_vector[1]`, the value is likely in that same cache line, which is already loaded into a CPU cache. Consequently, the second lookup is very, very fast.

But the cache has finite space, so putting something new in the cache means something old has to go: an existing cache line must be evicted.

In summary: to go fast, the data you are using should be in cache. To take advantage of the cache, all data you'll use should be adjacent in memory, so that one cache line contains as much useful data as possible. Finally, your data should be compact, so that loading it evicts as few cache lines as possible.
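The effect of cache lines is easy to observe. Here is a small micro-benchmark sketch (my own illustration, not part of the benchmarks discussed later): both loops sum one million `f32` values, but in the first each value is padded out to a full 64-byte cache line, while in the second the values are packed densely, so every loaded cache line is full of useful data. On typical hardware I'd expect the dense version to be several times faster, though the exact ratio will vary.

```rust
use std::hint::black_box;
use std::time::Instant;

// One useful f32 padded out to a full 64-byte cache line: loading `hot`
// drags in 60 bytes that we never use.
#[repr(C)]
struct Padded {
    hot: f32,
    _cold: [u8; 60],
}

fn main() {
    const N: usize = 1_000_000;
    let padded: Vec<Padded> = (0..N).map(|i| Padded { hot: i as f32, _cold: [0; 60] }).collect();
    let dense: Vec<f32> = (0..N).map(|i| i as f32).collect();

    // One useful value per cache line.
    let start = Instant::now();
    let sum: f32 = padded.iter().map(|p| p.hot).sum();
    println!("padded: {:?} (sum {})", start.elapsed(), black_box(sum));

    // Sixteen useful values per cache line.
    let start = Instant::now();
    let sum: f32 = dense.iter().sum();
    println!("dense:  {:?} (sum {})", start.elapsed(), black_box(sum));
}
```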
## Struct of Arrays

Struct of Arrays is a label for an approach that tries to achieve exactly these things: group similar data (likely to be used together) adjacently in memory. Usually when we iterate over a collection, we only use a couple of fields of the elements. But the whole element is loaded, and so the amount of useful information per cache line is low.

Let's look at our example:

```rust
#[derive(Debug, Clone, Copy, PartialEq)]
pub enum Def<'a> {
    Type(TypeDef<'a>),
    Value(ValueDef<'a>),

    // Blank space (e.g. comments, spaces, newlines) before or after a def.
    // We preserve this for the formatter; canonicalization ignores it.
    SpaceBefore(&'a Def<'a>, &'a [CommentOrNewline<'a>]),
    SpaceAfter(&'a Def<'a>, &'a [CommentOrNewline<'a>]),
}

type Defs<'a> = Vec<Loc<Def<'a>>>;
```

The idea is to represent the `Defs` as several arrays:

```rust
struct Defs<'a> {
    defs: Vec<Def<'a>>,
    regions: Vec<Region>,
}
```

The `Def`s and `Region`s are now stored separately, in parallel arrays: for instance, the `Region` at index 12 corresponds to the `Def` at index 12. We now have to refer to definitions by their index; a Rust reference would no longer work, because a reference to a `Def` does not contain the `Region` any more.

This approach stores the data more densely, and unused data is never loaded. For instance, when formatting definitions, the regions are not important. With this representation, the regions are never loaded into a CPU cache during formatting, and hence never evict other useful information.

But why stop there? We can add more arrays to store data that we only use some of the time, data that is usually just in the way. Let's pull out the spaces:

```rust
#[derive(Debug, Clone, Copy, PartialEq)]
pub enum Def<'a> {
    Type(TypeDef<'a>),
    Value(ValueDef<'a>),
}

struct Defs<'a> {
    defs: Vec<Def<'a>>,
    regions: Vec<Region>,
    space_before: Vec<&'a [CommentOrNewline<'a>]>,
    space_after: Vec<&'a [CommentOrNewline<'a>]>,
}
```

In the type checker, where spaces are irrelevant, we just don't look at the `space_*` arrays, and never have to worry about whether there are spaces. In the formatter, we can easily look them up given the index of the `Def` that we're processing.

We made this transformation based on a performance argument, but this representation also eliminates the invalid double-`SpaceAfter` case: our code got better!

## One more step

Let's make one more data structure change. In the formatter, only the ordering of definitions is important, but in most other places we want to treat type and value definitions separately.

```rust
#[derive(Debug, Clone, Copy, PartialEq)]
pub enum DefTag {
    Type { type_defs_index: u32 },
    Value { value_defs_index: u32 },
}

struct Defs<'a> {
    tags: Vec<DefTag>,
    regions: Vec<Region>,
    space_before: Vec<&'a [CommentOrNewline<'a>]>,
    space_after: Vec<&'a [CommentOrNewline<'a>]>,
    value_defs: Vec<ValueDef<'a>>,
    type_defs: Vec<TypeDef<'a>>,
}
```

We could go further still, by storing the tag as an `i32` and using its sign to indicate whether we're dealing with a type or a value.
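For instance, here is a minimal sketch of how that sign trick could be encoded (my illustration, not actual Roc code): non-negative tags index into `type_defs`, and negative tags store a `value_defs` index as `-(index + 1)`, so that index 0 stays representable.

```rust
// A hypothetical unpacked view of a tag, for illustration only.
#[derive(Clone, Copy, Debug, PartialEq)]
enum DefRef {
    Type(u32),  // index into type_defs
    Value(u32), // index into value_defs
}

// The packed tag: 4 bytes instead of the 8 that `DefRef` occupies.
#[derive(Clone, Copy)]
struct DefTag(i32);

impl DefTag {
    fn pack(r: DefRef) -> Self {
        match r {
            DefRef::Type(index) => DefTag(index as i32),
            DefRef::Value(index) => DefTag(-(index as i32) - 1),
        }
    }

    fn unpack(self) -> DefRef {
        if self.0 >= 0 {
            DefRef::Type(self.0 as u32)
        } else {
            DefRef::Value((-(self.0 + 1)) as u32)
        }
    }
}

fn main() {
    // Round-trips for both variants, including index 0.
    for r in [DefRef::Type(0), DefRef::Type(12), DefRef::Value(0), DefRef::Value(7)] {
        assert_eq!(DefTag::pack(r).unpack(), r);
    }

    assert_eq!(std::mem::size_of::<DefTag>(), 4);
    assert_eq!(std::mem::size_of::<DefRef>(), 8);
}
```

The trade-off: the tag halves in size, but indices are limited to `i32::MAX`, and the packing logic must be kept correct by hand.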
Maybe the `space_before` and `space_after` vectors could be combined. DoD can be a really fun puzzle.

## Results

We've moved a lot of code around, but did it actually make a difference? Measuring performance is always tricky, but I have run some benchmarks. The code for them is [here](https://github.com/tweedegolf/data-oriented-design-blog-benchmarks). The benchmark parses a very simple sequence of tokens into our two data structures.

The first thing I tried was to run a `cargo bench` test with criterion. That did not work, because the `standard::parser` runs out of memory. The `dod::parser` runs just fine, showing that at the very least the DoD approach uses less memory.

The easiest way I know to display memory usage is valgrind:

```
> valgrind ./target/release/standard
...
==197038== total heap usage: 46 allocs, 46 frees, 75,499,477 bytes allocated
...

> valgrind ./target/release/dod
...
==197006== total heap usage: 99 allocs, 99 frees, 67,110,853 bytes allocated
...
```

For this example we're using roughly 8 MB, or 11%, less memory. Nice!

In terms of runtime, we can fall back to `hyperfine`:

```
> hyperfine target/release/standard target/release/dod
Benchmark #1: target/release/standard
  Time (mean ± σ):      26.6 ms ±   1.6 ms    [User: 14.9 ms, System: 11.8 ms]
  Range (min … max):    24.1 ms …  33.8 ms    91 runs

Benchmark #2: target/release/dod
  Time (mean ± σ):      23.6 ms ±   1.1 ms    [User: 14.3 ms, System: 9.5 ms]
  Range (min … max):    21.3 ms …  27.8 ms    108 runs

Summary
  'target/release/dod' ran
    1.12 ± 0.09 times faster than 'target/release/standard'
```

Not a big difference here.

### Compare with boxed

In the benchmark above we're helped by the fact that the starting code already uses arena allocation. Suppose we had used the more standard `Box<Def>` instead of `&'a Def`:

```rust
#[derive(Debug, Clone)]
pub enum Def<'a> {
    Type(TypeDef<'a>),
    Value(ValueDef<'a>),

    // was
    // SpaceBefore(&'a Def<'a>, &'a [CommentOrNewline<'a>]),
    // SpaceAfter(&'a Def<'a>, &'a [CommentOrNewline<'a>]),
    SpaceBefore(Box<Def<'a>>, &'a [CommentOrNewline<'a>]),
    SpaceAfter(Box<Def<'a>>, &'a [CommentOrNewline<'a>]),
}
```

Using `Box` is significantly worse, because every type, value, or space wrapper requires its own small allocation:

```
==196475== total heap usage: 399,862 allocs, 399,862 frees, 71,516,421 bytes allocated
```

As a result it is roughly 70% slower:

```
> hyperfine target/release/standard target/release/boxed
Benchmark #1: target/release/standard
  Time (mean ± σ):      25.1 ms ±   1.8 ms    [User: 14.3 ms, System: 10.8 ms]
  Range (min … max):    23.0 ms …  35.4 ms    82 runs

Benchmark #2: target/release/boxed
  Time (mean ± σ):      42.5 ms ±   2.3 ms    [User: 28.3 ms, System: 14.2 ms]
  Range (min … max):    38.9 ms …  49.9 ms    70 runs

Summary
  'target/release/standard' ran
    1.69 ± 0.15 times faster than 'target/release/boxed'
```

## Conclusion

We've seen that the ideas from data-oriented design can make code both faster and better. Like Andrew Kelley [mentions in the talk](https://vimeo.com/649009599), playing with these ideas gives me the feeling that my programming abilities are improving (again).

DoD is not always worth it: it can create a bit of a mess of arrays and indices, and there is no safety around lifetimes and mutability. But it is an amazing tool to have in the toolbox.