--- title: "Optimizing Ruby's Memory Layout: Variable Width Allocation" date: 2022-12-12 description: Learn more about Variable Width Allocation, coming in Ruby 3.2. --- Ruby is a garbage collected language, meaning that the language automatically allocates memory when objects are created and releases it when the object is no longer used. This means that the garbage collector is involved in the whole lifecycle of objects, so optimizing the memory layout of the garbage collector has the potential to improve the overall performance of the Ruby interpreter. In this blog post, I'll be introducing how Shopify is improving CRuby's performance by optimizing the memory layout in the garbage collector through the Variable Width Allocation project. If you're interested in how the garbage collector works for Ruby 3.0 (which is still largely applicable today), check out [my earlier article](/notes-on-ruby-gc). In this article, I'll first provide [background on how Ruby's memory was structured](#object-allocation) and [its limitations](#disadvantages), then I'll introduce [Variable Width Allocation](#variable-width-allocation), [benchmark results](#benchmark-results), [how to collect metrics](#statistics-collection), and finally [how you can disable Variable Width Allocation](#escape-hatch) if you're encountering issues. # Background ## Object allocation <figure> <img src="https://i.imgur.com/vxrJDDA.png"> <figcaption>Illustration of the Ruby heap with 3 heap pages and each page with 16 slots</figcaption> </figure> All Ruby objects (with the exception of immediates like `nil`, `true`, `false`, small integers, and floats) live in 40 byte slots called `RVALUE`. This means that Ruby's garbage collector only allocates memory of a single size. This is done for simplicity, as the garbage collector only needs to manage chunks of memory of a single size. However, the disadvantages are clear: different types of objects have vastly different memory characteristics. For example, arrays are dynamic in size and could change in size during runtime while classes are static in size. This means that any data that cannot fit in the `RVALUE` have to be allocated externally, such as through the system using `malloc`. These slots are managed in Ruby heap pages, with each page being 16KB in size. This results in each page having 409 slots. Ruby uses heap pages as a unit to manage metadata, such as a counter to keep track of how many live objects are on that page. It also improves allocation performance since allocating memory from the system comes with a performance cost, so we want to minimize the number of times we allocate memory from the system by allocating large chunks at a time. ## Disadvantages Although managing only a single sized slot in the Ruby garbage collector simplifies the implementation and avoids issues such as [memory fragmentation](https://en.wikipedia.org/wiki/Fragmentation_(computing)), there are several limitations it introduces. I'll introduce the limitations here and later discuss how we're solving each of them through our project, Variable Width Allocation. ### Poor data locality We often think of our systems as just having one level of memory, the main RAM. However, there are actually multiple teirs of memory inside of your CPU, called caches. Each level of cache is larger in capacity than the last, but also with higher latency and lower bandwidth. For example, on my computer, there's only 384KB of level 1 cache compared to 32MB of level 3 cache, but the level 1 cache is much faster (1500GB/s vs. 500GB/s) and of lower latency (0.9ns vs. 10ns). The main RAM is much slower (50GB/s) and has much higher latency (80ns) compared to the caches. From this data, it's clear that we want to use data in the caches rather than fetching from memory to improve performance. Whenever a piece of data is fetched from the main memory, it fetches it in units of cache lines from the main RAM and stores it in the caches. The size of the cache line varies by platform, it's 64 bytes on x86 and 128 bytes on Apple silicon. When the cache is full, it will evict the least recently used entries to make space for new entries. However, a 40 byte `RVALUE` does not take advantage of the size cache line meaning that some of the fetched memory is unused. Additionally, since the `RVALUE` doesn't provide enough space, objects may allocate memory in other locations, which increases the amount of space used in the caches. ### Performance overhead of `malloc` Allocating memory from the system using `malloc` isn't free, it comes with a performance overhead. By minimizing the number of `malloc` calls, we can decrease this overhead. Reducing the number of malloc calls can improve boot time and speed up object creation. ### Inefficient use of memory Since different data types in Ruby have vastly different memory characteristics, forcing all objects in a 40 byte slot means that some objects won't be able to take advantage of all the space while other objects that require more space will need to store pointers to other regions of memory. Storing pointers will require additional space and some implementations of `malloc` will require additional memory to store metadata about the allocated memory, both increasing the amount of memory used. ### Limitations to extensibility Other languages like Java have support for multiple types of garbage collectors. This allows researchers to experiment and develop new garbage collection techniques. Additionally, this allows users to change garbage collectors depending on their use case (e.g. lower memory usage, faster boot time, faster performance, etc.). In Ruby, since all objects are of the same size and most of the data for objects are externally allocated through `malloc`, it's difficult for other garbage collector implementations to perform any optimizations. In the ["enabling cutting-edge research on Ruby" section](#enabling-cutting-edge-research-on-ruby), you'll see an example of how this work enables academic researchers to perform cutting-edge garbage collector research on Ruby. # Variable Width Allocation Introducing Variable Width Allocation, a feature I co-authored with [Matt Valentine-House](https://www.eightbitraptor.com/) and [Aaron Patterson](https://tenderlovemaking.com/). The goal of Variable Width Allocation is to introduce APIs to allocate dynamic sized objects and add support for variable sized slots inside the Ruby garbage collector. Having variable size slots would allow types with different memory characteristics to allocate a size most optimal for it. ## Size pools <figure> <img src="https://i.imgur.com/CIFEnp9.png"> <figcaption>Illustration of the Ruby heap after Variable Width Allocation with 5 size pools and heap pages in the size pools</figcaption> </figure> With Variable Width Allocation, we introduce a new concept into the garbage collector called "size pools". Each size pool has slots of a particular size. By keeping fixed sized slots in the size pool, we can continue to avoid issues with memory fragmentation and lets us keep the fast allocation speed. We currently have 5 size pools, with the slot sizes being powers of two multiples of `RVALUE` size, meaning they are 40 bytes, 80 bytes, 160 bytes, 320 bytes, and 640 bytes. We chose powers of two multiples since this ensures that we have 75% utilization on average, which is more performant as it avoids frequent resizing when an object expands or shrinks in size. We've measured this hypothesis to be true by performing memory dumps of Ruby processes and determined that around 75% of the memory in the slots was utilized. We chose to have multiples of `RVALUE` size so that we can fit a single `RVALUE` without any wastage in memory for types that are not yet on Variable Width Allocation. Another solution we considered to solve this problem was to create a special size pool that's of `RVALUE` size, which would allow us to make the other size pools have slot sizes in powers of two (e.g. 32 bytes, 64 bytes, 128 bytes, etc.). However, we determined that the small efficiency gains were not worth the extra complexity of having a special size pool. ## Increasing heap page size As part of this work, we increased heap page size from 16KB to 64KB (see [this pull request](https://github.com/ruby/ruby/pull/5749)). We did this because for heap pages with larger slot sizes (especially with the 640 byte slot), there weren't many slots per page (around 25 slots per page). Since there were not many slots per page, more pages were needed to be allocated, costing performance from the frequent page allocation. Additionally, since each heap page has a header used to store metadata, the increase in allocated metadata increased memory overhead. However, to do this, we had to decouple page size from how the garbage collector worked because otherwise changing the page size would affect performance (see [this pull request](https://github.com/ruby/ruby/pull/5732)). Increasing page sizes to 64KB also improves performance because it decreases the number of page allocations. It also has another advantage: on systems with 64KB system memory pages (e.g. PowerPC), it allows the Ruby page size to be equal to the system page size which increases features (e.g. compaction) that can be used on those systems. ## Resizing and compaction Up until now, we've only discussed allocation of objects through Variable Width Allocation. However, we know that objects can change in size during runtime. Objects could be resized upwards to the point where it doesn't fit in the slot or could be resized downwards to the point where a significant amount of memory is wasted. So how do we deal with these two cases? For both cases, we currently rely on the compaction feature of the garbage collector to bring us to the most optimal state. For when an object needs to be increased in size, we first fall back to allocating through the system using `malloc` before using compaction to bring us to the optimal state. Compaction is a phase during garbage collection that moves objects close together to reduce fragmentation. We modified compaction to move objects to the most optimal size pool (see [this pull request](https://github.com/ruby/ruby/pull/5986)). For objects that were resized up, we free the memory allocated from `malloc` and place the contents adjacent to the object again, regaining data locality. ## Types There are currently four types on Variable Width Allocation: classes, arrays, strings, and objects. We'll talk about how these are implemented. ### Class Classes were the first type we introduced to Variable Width Allocation (see [this ticket](https://bugs.ruby-lang.org/issues/17816)). We chose classes first because classes are of a fixed size, meaning we know the exact size of classes at allocation time and so we could implement Variable Width Allocation without needing to implement any logic for resizing. For classes, we now store the `rb_classext_struct` adjacent to the class object. The `rb_classext_struct` stores data for the class, including instance variables, the superclass, and constants. Before Variable Width Allocation, since the `rb_classext_struct` was too big to fit in the `RVALUE`, it was separately allocated using `malloc` when a class is created. Since it's now allocated along with the class object, we no longer need to allocate any memory from the system when a class is allocated. ### String Strings were the second type we introduced to Variable Width Allocation (see [this ticket](https://bugs.ruby-lang.org/issues/18239)). We chose strings because they are dynamic in size and a common type in Ruby programs. Strings already supported storing the contents adjacent to the object, referred to as "embedded". However, since `RVALUE` was previously limited to 40 bytes in size, only strings up to 23 bytes in size could be embedded. After Variable Width Allocation, strings up to 615 bytes can now be embedded. Strings that are too large will still have their contents stored in memory allocated via the system using `malloc`. We saw significant speedups in microbenchmarks (up to a 35% speedup). ### Array We then implemented Variable Width Allocation for arrays (see [this ticket](https://bugs.ruby-lang.org/issues/18634)). Just like strings, there were already embedded arrays. However, they were limited to just 3 elements or fewer. After Variable Width Allocation, arrays up to 78 elements in length can be embedded. Arrays that are too large will still have their contents stored in memory allocated via the system using `malloc`. ### Object Our latest type on Variable Width Allocation is objects (see [this PR](https://github.com/ruby/ruby/pull/6117)). Just to avoid confusion, even though everything in Ruby are objects, we're talking specifically about objects created from classes defined in your code (i.e. not one of the core types in Ruby such as arrays, classes, strings, hashes, integers, etc.). Objects store a list of their instance variables, so we chose to implement that on Variable Width Allocation. Just like arrays, there were already embedded objects, but they were limited to just 3 instance variables or fewer. After Variable Width Allocation, objects up to 78 instance variables can be embedded. Objects that have more than 78 instance variables will still have their contents stored in memory allocated via the system using `malloc`. We saw significant speedups in microbenchmarks when reading and writing to instance variables (around a 20% speedup). ## Enabling cutting-edge research on Ruby This work enables researchers to apply [MMTk (Memory Management Toolkit)](https://www.mmtk.io/) to Ruby. MMTk is a cutting-edge garbage collector research project from Professor Steve Blackburn at the Australian National University. It's exciting because MMTk provides a standard set of memory management APIs that will allow us to try different garbage collection algorithms without modifying the Ruby source code. Different garbage collectors have different characteristics and will allow users to switch depending on their use case. For example, the mark-and-sweep garbage collector (which is the one currently in CRuby) optimizes for lower memory usage while a semispace copying garbage collector optimizes for better performance (at the cost of using twice as much memory). This work is partially sponsored by Shopify and you can learn more about the various Ruby research projects Shopify is sponsoring [on this page](https://shopify.engineering/shopify-ruby-at-scale-research-investment). # Benchmark results <figure> <img src="https://i.imgur.com/XXVo45O.png"> </figure> We can see performance improvements in almost all benchmarks, with significant improvements in workloads that are heavy on reading and writing to data structures, such as parsing YAML files (psych-load) or generating PDF files (hexapdf). In railsbench, we can see a respectable 2.6% increase in performance. You can find the [benchmarking setup](#benchmark-setup) and the [raw data](#raw-benchmark-results) in the appendix. # Statistics collection Variable Width Allocation introduces a new API to collect statistics from the garbage collector. [`GC.stat_heap`](https://docs.ruby-lang.org/en/master/GC.html#method-c-stat_heap) is a new method that returns statistics about the size pools. This method is useful for determining the memory characteristics of your Ruby program or debugging whether a particular size pool is causing performance issues. Since this method returns internal information about the garbage collector, the structure of the data returned is subject to change. Refer to the method documentation for more information. # Escape hatch If you run into performance or compatibility issues (particularly with native gems), you can disable Variable Width Allocation by passing `CPPFLAGS='-DUSE_RVARGC=0'` to the [configure step when building Ruby](https://docs.ruby-lang.org/en/master/contributing/building_ruby_md.html#label-Quick+start+guide). Please also let us know of this issue by filing a bug report to [bugs.ruby-lang.org](https://bugs.ruby-lang.org/projects/ruby-master/issues). Note that this escape hatch will only be available on Ruby 3.2 and support for this will be dropped in the future. # Appendix ## Benchmark setup Benchmarking was done on a Ubuntu 22.04.1 machine with an AMD Ryzen 5 3600X and 32GB of RAM. Ruby was benchmarked on version 3.2.0 RC 1 (commit [81e274c](https://github.com/ruby/ruby/commit/81e274c9907c9ddb8fbf8ad0c28cd2b39d6e1639)). All benchmarks were run using [yjit-bench](https://github.com/Shopify/yjit-bench) on commit [e19e8cf](https://github.com/Shopify/yjit-bench/commit/e19e8cf098418e75fe451a1b3a1cd5448bb672ad). ## Raw benchmark results All benchmark results are in milliseconds for a single iteration, so lower is better. | | VWA enabled | VWA disabled | Speedup | |--------------------|-------------|--------------|---------| | activerecord (ms) | 127 | 129 | 1.016x | | hexapdf (ms) | 2190 | 2298 | 1.049x | | liquid-render (ms) | 147 | 146 | 0.993x | | mail (ms) | 121 | 121 | 1.000x | | psych-load (ms) | 1786 | 1948 | 1.091x | | railsbench (ms) | 1820 | 1868 | 1.026x | | ruby-lsp (ms) | 61 | 63 | 1.033x |