# Proposal: optimized bitlist

###### tags: `data structures`

**Author(s):** Victor Farazdagi (Prysmatic Labs)
**Reviewer(s):** Nishant Das (Prysmatic Labs)

*Last Updated: Jan 19, 2021*

[TOC]

## Overview

This proposal introduces an optimized version of [go-bitfield](https://github.com/prysmaticlabs/go-bitfield), the bitmap implementation used in Prysm. The optimized version should be a drop-in replacement featuring an extended API (with minimal to no breaking changes) and improved performance.

The optimized bitlist is proposed as a replacement for the current `[]byte` based implementation for certain use cases only. This proposal does **NOT** propose removal of the original implementation, which still has its uses.

## Issues with the current implementation

Below are the issues found in our current implementation; they serve as the rationale for the proposed changes. For each issue, I provide a "Proposed fix" paragraph, where I elaborate on what can be done to resolve it.

:::warning
**How hard will it be to implement all these fixes?**
Not hard: all the "Proposed fix" items have already been implemented in the [Bitlist64 PR](https://github.com/prysmaticlabs/go-bitfield/pull/36).
:::

### Word size

The current implementation relies on an array of bytes (`[]byte`), so the word size is 8 bits. For small bitlists (<64 bits) that might be effective, but most of the bitlists we use in Prysm are bigger (64, 128, and even 2048 bits, e.g. for `MAX_VALIDATORS_PER_COMMITTEE`). In that situation a granularity of 1 byte is undesirable.

:::info
**Illustrative example**
Suppose we have a bitlist of 256 bits and we want to find the number of 1s in it. The most optimized way is to rely on the [POPCNT](https://www.felixcloutier.com/x86/popcnt) instruction, and indeed Go 1.9+ supports that optimization in its [bits package](https://golang.org/pkg/math/bits/#example_OnesCount64).

Now, if we rely on `[]byte` as the underlying array, traversing those 256 bits takes 32 iterations and 32 calls to [bits.OnesCount8()](https://golang.org/pkg/math/bits/#example_OnesCount8). If, however, we rely on a bigger word size, say `[]uint64`, then 4 calls to `bits.OnesCount64()` are enough -- 8x fewer.

NB: see the benchmarks for [Count()](#Count), where the switch to a bigger word alone produces up to a ==x4.8== performance boost on bitlists of size 2048.
:::

**Proposed fix:** The first optimization, which improves performance for bitlists of 8 bytes or more, is to switch from `[]byte` to `[]uint64` (`uint64` is proposed as the biggest integral type, giving us the biggest word possible). The sketch below illustrates the difference.
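To make the word-size point concrete, here is a minimal, self-contained sketch (illustrative only, not code from the PR) comparing the two traversal strategies for a 256-bit bitlist:

```golang
package main

import (
	"fmt"
	"math/bits"
)

// countBytes counts set bits in a 256-bit bitlist backed by bytes:
// 32 iterations, one OnesCount8 call per byte.
func countBytes(b [32]byte) int {
	total := 0
	for _, w := range b {
		total += bits.OnesCount8(w)
	}
	return total
}

// countWords counts set bits in the same 256 bits backed by uint64 words:
// only 4 iterations, one OnesCount64 (POPCNT) call per word.
func countWords(w [4]uint64) int {
	total := 0
	for _, x := range w {
		total += bits.OnesCount64(x)
	}
	return total
}

func main() {
	fmt.Println(countBytes([32]byte{0xFF, 0x01})) // 9
	fmt.Println(countWords([4]uint64{0x01FF}))    // 9
}
```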
### Size as part of the type

In our current implementation the actual size of a bitlist is kept inside the bitlist's bytes (as a marker bit in the last byte of the `[]byte` slice). This decision requires quite special circumstances to shine: when we need bitlists whose size is not a multiple of 8 (or 64, if we switch to `[]uint64`). Then, with the additional implementation complexity to pay for it, we use the fewest bits possible, and such an implementation can be considered the most storage-efficient.

Such an implementation has significant drawbacks, however:

- Unless we are talking about bitlists whose size is not a multiple of 8 (how often do we use them?), it is hard to justify keeping the size in the data rather than in the type (it is too C-like). Something that is known at initialization can be saved and reused, and the extra variable holding the size can always be thrown away when the bitlist needs to be persisted, without any extra storage wasted -- it screams for its own field. Additionally, bitlists can be initialized from data (see [NewBitlistFrom()](https://github.com/prysmaticlabs/go-bitfield/pull/36/files#diff-8b9f34228893d8f47f5ded6b5f83cd0658f347fddfe7e28dc8365d136cda936fR39)) -- the size is calculated automatically, so it seems we can safely split the data from its size.
- Not only is that size bit hard to maintain, it also adds extra complexity to almost every operation (we always need to pay attention to whether we are working with data bits or the size bit). Thus, some (albeit limited) performance gains are expected as well.

:::info
**Is it really that bad?**
Well, the size bit adds redundant calculations to almost every operation. Almost every time we interact with a bitlist, we need to compute the MSB and take care of it. This might seem like a small amount of work, up until you are under the heavy load of, say, the MaxCover algorithm, where you greedy-search through 128 bitlists of size 2048 and constantly do all kinds of bitwise operations (`Or()`, `And()`, `Xor()` etc.), so eventually you can save a couple of percent on performance. However, the main gain is mental: the code is greatly simplified, at no additional cost.
:::

**Proposed fix:** Convert the type from

```golang
type Bitlist []byte
```

to

```golang
type Bitlist struct {
	size int
	data []uint64
}
```

### No variants of methods w/o redundant allocations

Quite often the usage pattern is the following: you do some operation, say `Or()` (the union of two bitlists), and you are not interested in preserving the result for long; you only care about the actual bits and consume them immediately (say, by finding the cardinality of the union, or doing something depending on it). In the current implementation, new memory will **always** be allocated to hold the result, which is a very unoptimized way of doing things, especially for bigger *sparse* bitlists.

The issue is exacerbated when we have many operations (MaxCover, I am looking at you!): every bitwise operation pays that costly memory allocation, even if all bitlists are of the same size (so that you could have pre-allocated a buffer to write into and reused that same buffer for *all* those operations).

**Proposed fix:** We should provide non-alloc variants of our bitwise functions, ones that let you pass in a result object so the function simply writes to it without allocating, e.g.:

```golang
// Xor returns the XOR result of the two bitfields (symmetric difference).
func (b *Bitlist64) Xor(c *Bitlist64) *Bitlist64 {
	ret := b.Clone()
	b.NoAllocXor(c, ret)
	return ret
}

// NoAllocXor returns the XOR result of the two bitfields (symmetric difference).
// Result is written into the provided variable, so no allocation takes place inside the function.
func (b *Bitlist64) NoAllocXor(c, ret *Bitlist64) {
	for idx, word := range b.data {
		ret.data[idx] = word ^ c.data[idx]
	}
}
```
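To illustrate the intended usage pattern, here is a minimal, runnable sketch of the buffer-reuse idea, using plain `[]uint64` words instead of the actual `Bitlist64` type (the `noAllocXor` helper below simply mirrors the `NoAllocXor` loop shown above; everything else is illustrative):

```golang
package main

import (
	"fmt"
	"math/bits"
)

// noAllocXor mirrors the proposed NoAllocXor: XOR b and c word by word,
// writing into a caller-provided buffer instead of allocating a new one.
func noAllocXor(b, c, ret []uint64) {
	for i, w := range b {
		ret[i] = w ^ c[i]
	}
}

func main() {
	base := []uint64{0xFF00FF00FF00FF00, 0x0F0F0F0F0F0F0F0F}
	candidates := [][]uint64{
		{0x00FF00FF00FF00FF, 0xF0F0F0F0F0F0F0F0},
		{0xFFFFFFFFFFFFFFFF, 0x0000000000000000},
	}

	// Allocate the scratch buffer once and reuse it for every operation,
	// consuming the result immediately (here: its cardinality).
	scratch := make([]uint64, len(base))
	for _, c := range candidates {
		noAllocXor(base, c, scratch)
		ones := 0
		for _, w := range scratch {
			ones += bits.OnesCount64(w)
		}
		fmt.Println("cardinality of symmetric difference:", ones)
	}
}
```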
### Future work

- The usability of the package can be improved. For example, a method that takes the union of two bitlists and immediately calculates the cardinality of the result (it could be named `OrCount()`) would be quite handy, and could be optimized separately to be more efficient than successive calls to `Or()` and `Count()`. This kind of operation is very common: quite often when we do bitwise operations we are only interested in some summary of the result -- the number of 1s, whether any bit is set, etc. So there is quite a lot of room here for both usability improvements and optimization (see the sketch after this list).
- In addition to the `NoAlloc*()` variants we can have `InPlace*()` functions, which write the result directly into the receiver. These also skip the allocation, at the cost of mutating the caller's bitlist.
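As a rough illustration of the idea (this method does not exist yet, so its name and signature are hypothetical), such a combined operation could fold the union and the popcount into a single pass over the words, with no intermediate bitlist at all:

```golang
// OrCount is a hypothetical combined operation: it returns the number of set
// bits in the union of b and c without materializing the union itself.
// Sketch only -- assumes both bitlists have the same word count; uses math/bits.
func (b *Bitlist64) OrCount(c *Bitlist64) int {
	total := 0
	for idx, word := range b.data {
		total += bits.OnesCount64(word | c.data[idx])
	}
	return total
}
```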
## Reference implementation and benchmarks

:::warning
**No breaking changes**
[go-bitfield/pull/36](https://github.com/prysmaticlabs/go-bitfield/pull/36) introduces a new type, `Bitlist64`, which is a bitlist implementation backed by a `[]uint64` underlying array. The PR does not introduce any backward-incompatible breaking changes, as it doesn't amend the current implementation of `Bitlist`.

Both bitlists have their use cases (see benchmarks). As a rule of thumb:

- The current implementation, `Bitlist`, is more storage efficient and should be preferred if the bitlist size is not aligned to the word length. It should also be preferred if you rely on calling the `Bytes()` method often (as data stored in this type is already in `[]byte` form). You can consider using this implementation whenever you are working with smaller bitlists (say, fewer than 128 bits).
- The new implementation, `Bitlist64`, is geared towards longer bitlists on which you expect to apply many operations (as those operations are highly optimized in this implementation).
:::

While I was pretty sure that the proposed fixes would really improve the performance of bitlist operations for the given worksets, I still wanted to come up with a full-fledged implementation that outperforms the current `Bitlist` in benchmarks. That resulted in [prysmaticlabs/go-bitfield/pull/36](https://github.com/prysmaticlabs/go-bitfield/pull/36). For all bitwise operations I wanted to make sure that my implementation:

- Performs correctly (thus I needed to make sure it passes **all** the `Bitlist` unit tests).
- Performs efficiently (thus I needed to write benchmarks comparing `Bitlist` vs `Bitlist64`).

All of this has been done in that PR. Now, to the most interesting part: benchmark results. I will first list every method's benchmarks and then provide a [summary table](#Summary-Table).

### NewBitlist()

This benchmark measures the performance of new bitlist creation. Test params:

- Bitlists of different sizes are created (see `size` below).
- `[]byte_new` is our current implementation.
- `[]uint64_new` is the optimized implementation.
- `[]uint64_new+from` benchmarks the new `BitlistFrom([]uint64)` method (the new way of creating a bitlist from data, as `Bitlist` is a structure now).

```
± go test . -run NONE -benchmem -bench BenchmarkBitlist_New
goos: darwin
goarch: amd64
pkg: github.com/prysmaticlabs/go-bitfield
size:0/[]byte_new-16             129551409    9.33 ns/op    1 B/op      1 allocs/op
size:0/[]uint64_new-16           340874362    3.51 ns/op    0 B/op      0 allocs/op
size:0/[]uint64_new+from-16      340636495    3.51 ns/op    0 B/op      0 allocs/op
size:256/[]byte_new-16           54717733     18.3 ns/op    48 B/op     1 allocs/op
size:256/[]uint64_new-16         69607699     15.9 ns/op    32 B/op     1 allocs/op
size:256/[]uint64_new+from-16    70782352     16.1 ns/op    32 B/op     1 allocs/op
size:512/[]byte_new-16           50221123     21.8 ns/op    80 B/op     1 allocs/op
size:512/[]uint64_new-16         55918971     19.0 ns/op    64 B/op     1 allocs/op
size:512/[]uint64_new+from-16    58759789     19.1 ns/op    64 B/op     1 allocs/op
size:768/[]byte_new-16           44646360     25.0 ns/op    112 B/op    1 allocs/op
size:768/[]uint64_new-16         50138484     22.4 ns/op    96 B/op     1 allocs/op
size:768/[]uint64_new+from-16    51037354     22.4 ns/op    96 B/op     1 allocs/op
size:1024/[]byte_new-16          40371014     27.6 ns/op    144 B/op    1 allocs/op
size:1024/[]uint64_new-16        45049539     24.8 ns/op    128 B/op    1 allocs/op
size:1024/[]uint64_new+from-16   48649016     24.6 ns/op    128 B/op    1 allocs/op
size:1280/[]byte_new-16          35707808     30.5 ns/op    176 B/op    1 allocs/op
size:1280/[]uint64_new-16        39399187     28.1 ns/op    160 B/op    1 allocs/op
size:1280/[]uint64_new+from-16   38245248     28.1 ns/op    160 B/op    1 allocs/op
size:1536/[]byte_new-16          33921061     33.6 ns/op    208 B/op    1 allocs/op
size:1536/[]uint64_new-16        36879320     31.2 ns/op    192 B/op    1 allocs/op
size:1536/[]uint64_new+from-16   36201370     31.2 ns/op    192 B/op    1 allocs/op
size:1792/[]byte_new-16          31028469     36.8 ns/op    240 B/op    1 allocs/op
size:1792/[]uint64_new-16        33048739     34.3 ns/op    224 B/op    1 allocs/op
size:1792/[]uint64_new+from-16   33383349     34.1 ns/op    224 B/op    1 allocs/op
size:2048/[]byte_new-16          28105516     40.8 ns/op    288 B/op    1 allocs/op
size:2048/[]uint64_new-16        30972069     37.3 ns/op    256 B/op    1 allocs/op
size:2048/[]uint64_new+from-16   31061538     37.1 ns/op    256 B/op    1 allocs/op
PASS
ok  github.com/prysmaticlabs/go-bitfield  34.070s
```

:::info
**Comparison**
- Comparable performance; our optimized version is only slightly better.
- Interestingly, for a zero-sized bitlist our current implementation still does an allocation (see `size:0/[]byte_new-16`). That's actually expected: we are preserving the size bit.
:::

### Len()

The `Len()` operation gives us the length of the bitlist; the optimized version caches the value in a struct field.

```
± go test . -run NONE -benchmem -bench BenchmarkBitlist_Len
goos: darwin
goarch: amd64
pkg: github.com/prysmaticlabs/go-bitfield
size:0/[]byte-16       1000000000    0.215 ns/op    0 B/op    0 allocs/op
size:0/[]uint64-16     1000000000    0.214 ns/op    0 B/op    0 allocs/op
size:512/[]byte-16     1000000000    0.214 ns/op    0 B/op    0 allocs/op
size:512/[]uint64-16   1000000000    0.214 ns/op    0 B/op    0 allocs/op
size:1024/[]byte-16    1000000000    0.213 ns/op    0 B/op    0 allocs/op
size:1024/[]uint64-16  1000000000    0.214 ns/op    0 B/op    0 allocs/op
size:1536/[]byte-16    1000000000    0.214 ns/op    0 B/op    0 allocs/op
size:1536/[]uint64-16  1000000000    0.214 ns/op    0 B/op    0 allocs/op
size:2048/[]byte-16    1000000000    0.214 ns/op    0 B/op    0 allocs/op
size:2048/[]uint64-16  1000000000    0.222 ns/op    0 B/op    0 allocs/op
PASS
ok  github.com/prysmaticlabs/go-bitfield  2.490s
```

:::info
**Comparison**
- Comparable performance, due to the very limited number of instructions needed.
:::

### SetBitAt() and BitAt()

`SetBitAt()` sets the bit at a given position, and `BitAt()` gets the bit at a given position.
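For the `[]uint64`-backed version, locating a bit is plain index arithmetic. A minimal sketch of what setting a bit involves (illustrative only, not the exact PR code; the real methods also validate the index against the bitlist size):

```golang
// setBit sets or clears bit i in a []uint64-backed bitlist: pick the word
// with i/64, then apply a shifted mask within that word.
func setBit(data []uint64, i uint64, val bool) {
	word, mask := i>>6, uint64(1)<<(i&63) // i/64 and 1<<(i%64)
	if val {
		data[word] |= mask
	} else {
		data[word] &^= mask
	}
}
```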
```
± go test . -run NONE -benchmem -bench BenchmarkBitlist_SetBitAt
goos: darwin
goarch: amd64
pkg: github.com/prysmaticlabs/go-bitfield
size:0/[]byte-16       621557503     1.92 ns/op     0 B/op    0 allocs/op
size:0/[]uint64-16     1000000000    0.426 ns/op    0 B/op    0 allocs/op
size:512/[]byte-16     267398437     4.45 ns/op     0 B/op    0 allocs/op
size:512/[]uint64-16   402505532     2.98 ns/op     0 B/op    0 allocs/op
size:1024/[]byte-16    268923798     4.44 ns/op     0 B/op    0 allocs/op
size:1024/[]uint64-16  403903170     2.99 ns/op     0 B/op    0 allocs/op
size:1536/[]byte-16    262672713     4.45 ns/op     0 B/op    0 allocs/op
size:1536/[]uint64-16  403740098     2.98 ns/op     0 B/op    0 allocs/op
size:2048/[]byte-16    273036231     4.47 ns/op     0 B/op    0 allocs/op
size:2048/[]uint64-16  393301538     2.98 ns/op     0 B/op    0 allocs/op
PASS
ok  github.com/prysmaticlabs/go-bitfield  14.631s
```

:::success
**Comparison**
- Improved performance: around a ==x1.5== increase in the number of operations.
:::

### Count()

`Count()` calculates the number of 1s in a bitlist.

```
± go test . -run NONE -benchmem -bench BenchmarkBitlist_Count
goos: darwin
goarch: amd64
pkg: github.com/prysmaticlabs/go-bitfield
size:0/[]byte-16       624794442    1.92 ns/op    0 B/op    0 allocs/op
size:0/[]uint64-16     700838688    1.72 ns/op    0 B/op    0 allocs/op
size:512/[]byte-16     46502336     25.9 ns/op    0 B/op    0 allocs/op
size:512/[]uint64-16   179014315    6.65 ns/op    0 B/op    0 allocs/op
size:1024/[]byte-16    20807616     57.0 ns/op    0 B/op    0 allocs/op
size:1024/[]uint64-16  100000000    11.8 ns/op    0 B/op    0 allocs/op
size:1536/[]byte-16    14512810     81.5 ns/op    0 B/op    0 allocs/op
size:1536/[]uint64-16  66983558     16.9 ns/op    0 B/op    0 allocs/op
size:2048/[]byte-16    11194689     107 ns/op     0 B/op    0 allocs/op
size:2048/[]uint64-16  53831868     22.2 ns/op    0 B/op    0 allocs/op
PASS
ok  github.com/prysmaticlabs/go-bitfield  13.556s
```

:::success
**Comparison**
- Improved performance: up to a ==x4.8== increase in the number of operations.
:::

### Bytes()

`Bytes()` exposes the underlying data as `[]byte`. Since our current implementation already holds its data as `[]byte`, the updated implementation relying on `[]uint64` was expected to perform worse (as we need to slice `uint64`s into `byte`s).

:::info
The original implementation (using `binary.Write()`) was ==x5== worse than `Bitlist`. Then [Nishant](https://github.com/nisdas) suggested a simple way to improve the performance (for those interested, [here](https://github.com/prysmaticlabs/go-bitfield/pull/36/commits/acc8e6beb85c0f06a79b675ca0cd8917acd39d89) is the relevant commit).
:::
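For reference, one way to do the packing without `binary.Write()` is to emit each word with a single `PutUint64` call. This is a sketch only, assuming little-endian byte order within each word; it is not necessarily the exact approach taken in that commit:

```golang
// bytesFromWords packs []uint64 words into a []byte slice, emitting each
// word with one binary.LittleEndian.PutUint64 call instead of going through
// the slower binary.Write() path (uses encoding/binary).
func bytesFromWords(data []uint64) []byte {
	buf := make([]byte, len(data)*8)
	for i, w := range data {
		binary.LittleEndian.PutUint64(buf[i*8:], w)
	}
	return buf
}
```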
Test params:

- `non_empty`: a bitlist with every 10th bit set (relatively non-sparse).
- `half_non_empty`: every 10th bit set in the first half of the bitlist (half of the bitlist is populated, the other half is empty).
- `single_bit_set`: a bitlist with only a single bit (in the middle) set.

```
± go test . -run NONE -benchmem -bench BenchmarkBitlist_Bytes
goos: darwin
goarch: amd64
pkg: github.com/prysmaticlabs/go-bitfield
# Size: 0
non_empty/size:0/[]byte-16                94270675     12.1 ns/op    1 B/op      1 allocs/op
non_empty/size:0/[]uint64-16              803240937    1.49 ns/op    0 B/op      0 allocs/op
half_non_empty/size:0/[]byte-16           100000000    12.3 ns/op    1 B/op      1 allocs/op
half_non_empty/size:0/[]uint64-16         799485600    1.49 ns/op    0 B/op      0 allocs/op
single_bit_set/size:0/[]byte-16           100000000    11.1 ns/op    1 B/op      1 allocs/op
single_bit_set/size:0/[]uint64-16         800681074    1.52 ns/op    0 B/op      0 allocs/op
# Size: 512
non_empty#01/size:512/[]byte-16           42149912     24.6 ns/op    80 B/op     1 allocs/op
non_empty#01/size:512/[]uint64-16         41451386     27.7 ns/op    64 B/op     1 allocs/op
half_non_empty#01/size:512/[]byte-16      31529318     35.5 ns/op    80 B/op     1 allocs/op
half_non_empty#01/size:512/[]uint64-16    39250246     29.4 ns/op    64 B/op     1 allocs/op
single_bit_set#01/size:512/[]byte-16      25351473     46.2 ns/op    80 B/op     1 allocs/op
single_bit_set#01/size:512/[]uint64-16    36364435     31.0 ns/op    64 B/op     1 allocs/op
# Size: 1024
non_empty#02/size:1024/[]byte-16          33537550     34.1 ns/op    144 B/op    1 allocs/op
non_empty#02/size:1024/[]uint64-16        28934690     40.0 ns/op    128 B/op    1 allocs/op
half_non_empty#02/size:1024/[]byte-16     21160369     54.5 ns/op    144 B/op    1 allocs/op
half_non_empty#02/size:1024/[]uint64-16   25636191     45.0 ns/op    128 B/op    1 allocs/op
single_bit_set#02/size:1024/[]byte-16     14377192     81.7 ns/op    144 B/op    1 allocs/op
single_bit_set#02/size:1024/[]uint64-16   23879532     48.9 ns/op    128 B/op    1 allocs/op
# Size: 2048
non_empty#04/size:2048/[]byte-16          24847498     46.0 ns/op    288 B/op    1 allocs/op
non_empty#04/size:2048/[]uint64-16        18126237     66.2 ns/op    256 B/op    1 allocs/op
half_non_empty#04/size:2048/[]byte-16     11737928     93.4 ns/op    288 B/op    1 allocs/op
half_non_empty#04/size:2048/[]uint64-16   15763342     73.2 ns/op    256 B/op    1 allocs/op
single_bit_set#04/size:2048/[]byte-16     8793526      137 ns/op     288 B/op    1 allocs/op
single_bit_set#04/size:2048/[]uint64-16   14756454     81.1 ns/op    256 B/op    1 allocs/op
PASS
ok  github.com/prysmaticlabs/go-bitfield  36.906s
```

:::success
**Comparison**
- On a non-sparse bitlist: ==x1.3== **worse** performance :exclamation:
- On a relatively sparse bitlist: ==x1.3== better in the number of operations.
- With a single bit set: ==x1.6== better in the number of operations.

Overall, performance is comparable to that of `Bitlist`.
:::

### Contains()

The `Contains()` operation checks whether one bitlist is a superset of another.

```
± go test . -run NONE -benchmem -bench BenchmarkBitlist_Contains
goos: darwin
goarch: amd64
pkg: github.com/prysmaticlabs/go-bitfield
size:0/[]byte-16       151745586    8.00 ns/op    0 B/op    0 allocs/op
size:0/[]uint64-16     327278462    3.68 ns/op    0 B/op    0 allocs/op
size:512/[]byte-16     27893593     43.5 ns/op    0 B/op    0 allocs/op
size:512/[]uint64-16   143852472    8.37 ns/op    0 B/op    0 allocs/op
size:1024/[]byte-16    17014540     70.6 ns/op    0 B/op    0 allocs/op
size:1024/[]uint64-16  89776069     13.4 ns/op    0 B/op    0 allocs/op
size:1536/[]byte-16    12062512     99.4 ns/op    0 B/op    0 allocs/op
size:1536/[]uint64-16  63799292     18.8 ns/op    0 B/op    0 allocs/op
size:2048/[]byte-16    9337513      128 ns/op     0 B/op    0 allocs/op
size:2048/[]uint64-16  41539470     29.0 ns/op    0 B/op    0 allocs/op
```

:::success
**Comparison**
- Improved performance: around ==x4.5== better in the number of operations.
:::
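Most of the gain comes simply from performing the superset check one 64-bit word at a time instead of one byte at a time. A rough sketch of a word-wise check (illustrative only, not necessarily the PR's exact code):

```golang
// containsWords reports whether every bit set in c is also set in b,
// comparing whole 64-bit words at a time. Sketch only; the real method
// also checks that both bitlists have the same length.
func containsWords(b, c []uint64) bool {
	for idx, word := range c {
		// If c has any bit that b lacks, AND-ing with b changes the word.
		if b[idx]&word != word {
			return false
		}
	}
	return true
}
```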
### Overlaps()

`Overlaps()` checks whether two bitlists have a common bit set to 1.

```
± go test . -run NONE -benchmem -bench BenchmarkBitlist_Overlaps
goos: darwin
goarch: amd64
pkg: github.com/prysmaticlabs/go-bitfield
size:0/[]byte-16       165332434    7.27 ns/op    0 B/op    0 allocs/op
size:0/[]uint64-16     423296642    2.74 ns/op    0 B/op    0 allocs/op
size:512/[]byte-16     16143088     74.0 ns/op    0 B/op    0 allocs/op
size:512/[]uint64-16   149936602    7.96 ns/op    0 B/op    0 allocs/op
size:1024/[]byte-16    9348651      129 ns/op     0 B/op    0 allocs/op
size:1024/[]uint64-16  97509343     12.1 ns/op    0 B/op    0 allocs/op
size:1536/[]byte-16    6538896      183 ns/op     0 B/op    0 allocs/op
size:1536/[]uint64-16  72636613     16.4 ns/op    0 B/op    0 allocs/op
size:2048/[]byte-16    5026563      238 ns/op     0 B/op    0 allocs/op
size:2048/[]uint64-16  45698238     26.0 ns/op    0 B/op    0 allocs/op
PASS
ok  github.com/prysmaticlabs/go-bitfield  14.839s
```

:::success
**Comparison**
- Improved performance: around ==x9 (sic!)== better in the number of operations.
:::

### Or() and NoAllocOr()

`Or()` is the union of two bitlists. `NoAllocOr()` accepts an extra parameter into which the output is written.

```
± go test . -run NONE -benchmem -bench BenchmarkBitlist_Or
goos: darwin
goarch: amd64
pkg: github.com/prysmaticlabs/go-bitfield
size:0/[]byte-16                 46625775     25.2 ns/op    2 B/op      2 allocs/op
size:0/[]uint64-16               17925796     66.0 ns/op    64 B/op     2 allocs/op
size:0/[]uint64_(noalloc)-16     318798058    3.76 ns/op    0 B/op      0 allocs/op
size:256/[]byte-16               15924896     69.8 ns/op    96 B/op     2 allocs/op
size:256/[]uint64-16             11983741     99.4 ns/op    128 B/op    4 allocs/op
size:256/[]uint64_(noalloc)-16   120699474    9.94 ns/op    0 B/op      0 allocs/op
size:512/[]byte-16               11325868     104 ns/op     160 B/op    2 allocs/op
size:512/[]uint64-16             10553553     113 ns/op     192 B/op    4 allocs/op
size:512/[]uint64_(noalloc)-16   73604287     16.2 ns/op    0 B/op      0 allocs/op
size:1024/[]byte-16              6462662      184 ns/op     288 B/op    2 allocs/op
size:1024/[]uint64-16            8271004      142 ns/op     320 B/op    4 allocs/op
size:1024/[]uint64_(noalloc)-16  42255121     28.2 ns/op    0 B/op      0 allocs/op
size:2048/[]byte-16              3655354      324 ns/op     576 B/op    2 allocs/op
size:2048/[]uint64-16            5955828      200 ns/op     576 B/op    4 allocs/op
size:2048/[]uint64_(noalloc)-16  22867712     52.1 ns/op    0 B/op      0 allocs/op
PASS
ok  github.com/prysmaticlabs/go-bitfield  36.950s
```

:::success
**Comparison**
- Improved performance: around ==x1.5== better in the number of operations for `Or()` and ==x6.25== for `NoAllocOr()`.
- Generally, for repetitive union operations `NoAllocOr()` should be used, which means updating the existing code to rely on this new method.
:::

### And() and NoAllocAnd()

`And()` is the intersection of two bitlists. `NoAllocAnd()` accepts an extra parameter into which the output is written.
```
± go test . -run NONE -benchmem -bench BenchmarkBitlist_And
goos: darwin
goarch: amd64
pkg: github.com/prysmaticlabs/go-bitfield
size:0/[]byte-16                 47553297     24.4 ns/op    2 B/op      2 allocs/op
size:0/[]uint64-16               18500850     62.0 ns/op    64 B/op     2 allocs/op
size:0/[]uint64_(noalloc)-16     325849761    3.66 ns/op    0 B/op      0 allocs/op
size:256/[]byte-16               16739907     69.5 ns/op    96 B/op     2 allocs/op
size:256/[]uint64-16             11672568     101 ns/op     128 B/op    4 allocs/op
size:256/[]uint64_(noalloc)-16   120597835    10.1 ns/op    0 B/op      0 allocs/op
size:512/[]byte-16               11048820     104 ns/op     160 B/op    2 allocs/op
size:512/[]uint64-16             10470393     114 ns/op     192 B/op    4 allocs/op
size:512/[]uint64_(noalloc)-16   73036993     16.2 ns/op    0 B/op      0 allocs/op
size:1024/[]byte-16              6527413      185 ns/op     288 B/op    2 allocs/op
size:1024/[]uint64-16            8419184      142 ns/op     320 B/op    4 allocs/op
size:1024/[]uint64_(noalloc)-16  42037360     28.3 ns/op    0 B/op      0 allocs/op
size:2048/[]byte-16              3669594      324 ns/op     576 B/op    2 allocs/op
size:2048/[]uint64-16            5946556      199 ns/op     576 B/op    4 allocs/op
size:2048/[]uint64_(noalloc)-16  22891801     52.6 ns/op    0 B/op      0 allocs/op
PASS
ok  github.com/prysmaticlabs/go-bitfield  36.902s
```

:::success
**Comparison**
- Improved performance: around ==x1.5== better in the number of operations for `And()` and ==x6.25== for `NoAllocAnd()`.
:::

### Xor() and NoAllocXor()

`Xor()` is the symmetric difference of two bitlists. `NoAllocXor()` accepts an extra parameter into which the output is written.

```
± go test . -run NONE -benchmem -bench BenchmarkBitlist_Xor
goos: darwin
goarch: amd64
pkg: github.com/prysmaticlabs/go-bitfield
size:0/[]byte-16                 45929958     25.7 ns/op    2 B/op      2 allocs/op
size:0/[]uint64-16               17604517     66.3 ns/op    64 B/op     2 allocs/op
size:0/[]uint64_(noalloc)-16     318306673    3.75 ns/op    0 B/op      0 allocs/op
size:256/[]byte-16               15961924     73.5 ns/op    96 B/op     2 allocs/op
size:256/[]uint64-16             11790674     101 ns/op     128 B/op    4 allocs/op
size:256/[]uint64_(noalloc)-16   100000000    10.0 ns/op    0 B/op      0 allocs/op
size:512/[]byte-16               10602619     112 ns/op     160 B/op    2 allocs/op
size:512/[]uint64-16             10241956     114 ns/op     192 B/op    4 allocs/op
size:512/[]uint64_(noalloc)-16   73407048     16.3 ns/op    0 B/op      0 allocs/op
size:1024/[]byte-16              5964783      199 ns/op     288 B/op    2 allocs/op
size:1024/[]uint64-16            8255700      141 ns/op     320 B/op    4 allocs/op
size:1024/[]uint64_(noalloc)-16  42253543     28.4 ns/op    0 B/op      0 allocs/op
size:2048/[]byte-16              3316660      355 ns/op     576 B/op    2 allocs/op
size:2048/[]uint64-16            5386344      199 ns/op     576 B/op    4 allocs/op
size:2048/[]uint64_(noalloc)-16  22799277     52.1 ns/op    0 B/op      0 allocs/op
PASS
ok  github.com/prysmaticlabs/go-bitfield  35.718s
```

:::success
**Comparison**
- Improved performance: around ==x1.6== better in the number of operations for `Xor()` and ==x6.8== for `NoAllocXor()`.
:::

### Not() and NoAllocNot()

`Not()` is the complement of a bitlist. `NoAllocNot()` accepts an extra parameter into which the output is written.
```
± go test . -run NONE -benchmem -bench BenchmarkBitlist_Not
goos: darwin
goarch: amd64
pkg: github.com/prysmaticlabs/go-bitfield
size:0/[]byte-16                 444375667    2.64 ns/op    0 B/op      0 allocs/op
size:0/[]uint64-16               799970971    1.50 ns/op    0 B/op      0 allocs/op
size:0/[]uint64_(noalloc)-16     934162906    1.28 ns/op    0 B/op      0 allocs/op
size:256/[]byte-16               34709737     31.5 ns/op    48 B/op     1 allocs/op
size:256/[]uint64-16             24563046     48.0 ns/op    64 B/op     2 allocs/op
size:256/[]uint64_(noalloc)-16   326730100    3.72 ns/op    0 B/op      0 allocs/op
size:512/[]byte-16               21427358     51.5 ns/op    80 B/op     1 allocs/op
size:512/[]uint64-16             21802969     54.4 ns/op    96 B/op     2 allocs/op
size:512/[]uint64_(noalloc)-16   214917336    5.53 ns/op    0 B/op      0 allocs/op
size:1024/[]byte-16              14705668     77.5 ns/op    144 B/op    1 allocs/op
size:1024/[]uint64-16            18085238     65.1 ns/op    160 B/op    2 allocs/op
size:1024/[]uint64_(noalloc)-16  132950383    9.02 ns/op    0 B/op      0 allocs/op
size:2048/[]byte-16              8730987      134 ns/op     288 B/op    1 allocs/op
size:2048/[]uint64-16            13205708     89.7 ns/op    288 B/op    2 allocs/op
size:2048/[]uint64_(noalloc)-16  75631984     15.8 ns/op    0 B/op      0 allocs/op
PASS
ok  github.com/prysmaticlabs/go-bitfield  36.421s
```

:::success
**Comparison**
- Improved performance: around ==x1.51== better in the number of operations for `Not()` and ==x8.66== for `NoAllocNot()`.
:::

### BitIndices() and NoAllocBitIndices()

`BitIndices()` produces the positions of the set bits in a bitlist. `NoAllocBitIndices()` is the no-alloc variant of that function.
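A big part of the win on sparse bitlists is that whole empty words can be skipped, and set bits within a non-empty word can be extracted with trailing-zero counts. A minimal sketch of one such approach (illustrative only, not necessarily the PR's exact code; uses `math/bits`):

```golang
// bitIndices collects the positions of all set bits. Empty words cost a
// single comparison each; within a non-empty word, each set bit is located
// via bits.TrailingZeros64 and then cleared.
func bitIndices(data []uint64) []int {
	indices := make([]int, 0)
	for idx, word := range data {
		for word != 0 {
			pos := bits.TrailingZeros64(word)
			indices = append(indices, idx*64+pos)
			word &= word - 1 // clear the lowest set bit
		}
	}
	return indices
}
```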
```
± go test . -run NONE -benchmem -bench BenchmarkBitlist_BitIn
goos: darwin
goarch: amd64
pkg: github.com/prysmaticlabs/go-bitfield
# Size: 0
non_empty/size:0/[]byte                       82033018     14.5 ns/op    0 B/op       0 allocs/op
non_empty/size:0/[]uint64                     154052895    8.29 ns/op    0 B/op       0 allocs/op
non_empty/size:0/[]uint64_(noalloc)           585813720    1.91 ns/op    0 B/op       0 allocs/op
half_non_empty/size:0/[]byte                  82114581     14.5 ns/op    0 B/op       0 allocs/op
half_non_empty/size:0/[]uint64                153287430    7.84 ns/op    0 B/op       0 allocs/op
half_non_empty/size:0/[]uint64_(noalloc)      612117058    1.93 ns/op    0 B/op       0 allocs/op
single_bit/size:0/[]byte                      82175704     14.5 ns/op    0 B/op       0 allocs/op
single_bit/size:0/[]uint64                    153201876    7.84 ns/op    0 B/op       0 allocs/op
single_bit/size:0/[]uint64_(noalloc)          622727553    1.93 ns/op    0 B/op       0 allocs/op
# Size: 512
non_empty#01/size:512/[]byte                  1974549      602 ns/op     416 B/op     1 allocs/op
non_empty#01/size:512/[]uint64                8720407      135 ns/op     416 B/op     1 allocs/op
non_empty#01/size:512/[]uint64_(noalloc)      16527469     72.4 ns/op    0 B/op       0 allocs/op
half_non_empty/size:512/[]byte                2066620      585 ns/op     208 B/op     1 allocs/op
half_non_empty/size:512/[]uint64              14824033     79.3 ns/op    208 B/op     1 allocs/op
half_non_empty/size:512/[]uint64_(noalloc)    30524906     39.1 ns/op    0 B/op       0 allocs/op
single_bit#01/size:512/[]byte                 2224578      542 ns/op     0 B/op       0 allocs/op
single_bit#01/size:512/[]uint64               78249454     15.1 ns/op    0 B/op       0 allocs/op
single_bit#01/size:512/[]uint64_(noalloc)     210314572    5.64 ns/op    0 B/op       0 allocs/op
# Size: 1024
non_empty#02/size:1024/[]byte                 916622       1194 ns/op    896 B/op     1 allocs/op
non_empty#02/size:1024/[]uint64               4401720      270 ns/op     896 B/op     1 allocs/op
non_empty#02/size:1024/[]uint64_(noalloc)     8009198      151 ns/op     0 B/op       0 allocs/op
half_non_empty/size:1024/[]byte               966969       1140 ns/op    416 B/op     1 allocs/op
half_non_empty/size:1024/[]uint64             7962246      146 ns/op     416 B/op     1 allocs/op
half_non_empty/size:1024/[]uint64_(noalloc)   15919908     75.6 ns/op    0 B/op       0 allocs/op
single_bit#02/size:1024/[]byte                1000000      1093 ns/op    0 B/op       0 allocs/op
single_bit#02/size:1024/[]uint64              51005234     25.4 ns/op    0 B/op       0 allocs/op
single_bit#02/size:1024/[]uint64_(noalloc)    131435709    9.06 ns/op    0 B/op       0 allocs/op
# Size: 2048
non_empty#04/size:2048/[]byte                 484184       2347 ns/op    1792 B/op    1 allocs/op
non_empty#04/size:2048/[]uint64               2366553      509 ns/op     1792 B/op    1 allocs/op
non_empty#04/size:2048/[]uint64_(noalloc)     4108609      289 ns/op     0 B/op       0 allocs/op
half_non_empty/size:2048/[]byte               517838       2253 ns/op    896 B/op     1 allocs/op
half_non_empty/size:2048/[]uint64             4065385      297 ns/op     896 B/op     1 allocs/op
half_non_empty/size:2048/[]uint64_(noalloc)   7369372      161 ns/op     0 B/op       0 allocs/op
single_bit#04/size:2048/[]byte                569166       2109 ns/op    0 B/op       0 allocs/op
single_bit#04/size:2048/[]uint64              29431207     40.4 ns/op    0 B/op       0 allocs/op
single_bit#04/size:2048/[]uint64_(noalloc)    75044857     15.8 ns/op    0 B/op       0 allocs/op
PASS
ok  github.com/prysmaticlabs/go-bitfield  64.198s
```

:::success
**Comparison**
- Improved performance: from ==x4.5== to ==x51 (sic!)== better in the number of operations for `BitIndices()` (with especially notable results for very sparse bitlists with a single bit set).
- Improved performance: from ==x8.45== to ==x131 (sic!)== for `NoAllocBitIndices()`.
:::

### Summary Table

All values below are taken from the size-2048 benchmark rows.

Method | Bitlist | Bitlist64 | Delta
-------|---------|-----------|------
NewBitlist() | 40.8 ns/op | 37.3 ns/op | Comparable
Len() | 0.214 ns/op | 0.222 ns/op | Comparable
SetBitAt() / BitAt() | 4.47 ns/op | 2.98 ns/op | x1.5 better
Count() | 107 ns/op | 22.2 ns/op | x4.8 better
Bytes() non-sparse | 46 ns/op | 66.2 ns/op | x1.3 worse :exclamation:
Bytes() relatively sparse | 93.4 ns/op | 73.2 ns/op | x1.34 better
Bytes() single bit | 137 ns/op | 81.1 ns/op | x1.6 better
Contains() | 128 ns/op | 29 ns/op | x4.5 better
Overlaps() | 238 ns/op | 26 ns/op | x9 better
Or() | 324 ns/op | 200 ns/op | x1.5 better
NoAllocOr() | 324 ns/op | 52.1 ns/op | x6.25 better
And() | 324 ns/op | 199 ns/op | x1.5 better
NoAllocAnd() | 324 ns/op | 52.6 ns/op | x6.25 better
Xor() | 355 ns/op | 199 ns/op | x1.6 better
NoAllocXor() | 355 ns/op | 52.1 ns/op | x6.8 better
Not() | 134 ns/op | 89.7 ns/op | x1.51 better
NoAllocNot() | 134 ns/op | 15.8 ns/op | x8.66 better
BitIndices() non-sparse | 2347 ns/op | 509 ns/op | x4.88 better
NoAllocBitIndices() non-sparse | 2347 ns/op | 289 ns/op | x8.48 better
BitIndices() single bit | 2109 ns/op | 40.4 ns/op | x51 better
NoAllocBitIndices() single bit | 2109 ns/op | 15.8 ns/op | x131 better

## Conclusion

The proposed `Bitlist64` implementation should significantly improve performance of operation-heavy algorithms (the immediate optimization target being MaxCover attestation aggregation). Additionally, since it extends our existing types without any breaking changes, the switch from `Bitlist` to `Bitlist64` (where warranted) can be done gradually.