This page documents my investigation into panics reported by "SnowyHitch" on the Lighthouse Discord.

There have been three panics:

  1. cache_arena.rs:184:12 (Investigation)
  2. cache.rs:153:14 (Discord)
  3. cache.rs:159:14 (Discord)

Screenshots

Panic #1

Panic #2

Panic #3

Panic Messages

Note: I used an image-to-text tool to scrape these panic messages from the screenshots, so the paths may still be a lil' wonky.

  1. panicked at 'range end index 144115188075905177 out of range for slice of length 61440', D:\a\lighthouse\lighthouse\consensus\cached_tree_hash\src\cache_arena.rs:184:12
  2. panicked at 'cached tree should have a root layer: UnknownAllocId(144115188075856164)', consensus\cached_tree_hash\src\cache.rs:153:14
  3. panicked at 'index out of bounds: the len is 4 but the index is 144115188075855875', consensus\cached_tree_hash\src\cache.rs:159:14

Investigation

All of these panics are out-of-bounds list accesses. Interestingly, the offending indices look alike and are insanely large (insane because a list of this length would take up roughly 144 petabytes even at one byte per element):

  1. 144115188075905177
  2. 144115188075856164
  3. 144115188075855875

As pointed out by "Diophantus" on Discord, all of these values have the 58th bit set (i.e. bit[57] == 1, counting the least-significant bit as bit 0). If we un-set that 58th bit, they become relatively small, sane values:

>>> bin(144115188075905177)
'0b1000000000000000000000000000000000000000001100000010011001'
>>> bin(144115188075856164)
'0b1000000000000000000000000000000000000000000000000100100100'
>>> bin(144115188075855875)
'0b1000000000000000000000000000000000000000000000000000000011'
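To make "relatively small/sane" concrete, here is a minimal Rust sketch that clears bit 57 of each panicking index. The names `SUSPECT_BIT` and `clear_suspect_bit` are mine, purely for illustration; they are not Lighthouse identifiers.

```rust
/// Bit 57 (zero-indexed), i.e. the "58th bit" set in all three indices.
const SUSPECT_BIT: u64 = 1 << 57;

/// Returns `index` with bit 57 cleared (hypothetical helper, not Lighthouse code).
fn clear_suspect_bit(index: u64) -> u64 {
    index & !SUSPECT_BIT
}

fn main() {
    // The indices reported by panics 1, 2 and 3.
    let panic_indices: [u64; 3] = [
        144115188075905177,
        144115188075856164,
        144115188075855875,
    ];
    for index in panic_indices {
        // Every reported index has the suspect bit set.
        assert_ne!(index & SUSPECT_BIT, 0);
        println!("{index} -> {}", clear_suspect_bit(index));
    }
}
```

The masked values are 49305, 292 and 3. For the two panics that report a length, the masked index would have been in bounds: 49305 fits within the slice of length 61440 from panic 1, and 3 is a valid index for the list of length 4 from panic 3.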

This leads me to believe that something is setting the 58th bit of these indices. Let's look at similarities between the storage locations of the panicky indices.

Panics 1 and 2 both use values from CacheArena.offsets:

#[derive(Debug, PartialEq, Clone, Default, Encode, Decode)]
pub struct CacheArena<T: Encode + Decode> {
    /// The backing array, storing cached values.
    backing: Vec<T>,
    /// A list of offsets indicating the start of each allocation.
    offsets: Vec<usize>,
}
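As a rough illustration of how a single flipped bit in `offsets` could produce a panic like #1, here is a toy arena. `ToyArena`, `get` and `read_first_allocation` are hypothetical stand-ins, not the Lighthouse implementation, and I am assuming that an allocation spans from its own offset to the next one. The sketch also assumes a 64-bit target.

```rust
use std::panic;

/// Simplified stand-in for `CacheArena` (hypothetical, not the Lighthouse code).
struct ToyArena {
    backing: Vec<u8>,
    offsets: Vec<usize>,
}

impl ToyArena {
    /// Returns allocation `alloc_id`, assumed to span from its own offset to
    /// the next offset (or to the end of `backing` for the last allocation).
    fn get(&self, alloc_id: usize) -> &[u8] {
        let start = self.offsets[alloc_id];
        let end = self
            .offsets
            .get(alloc_id + 1)
            .copied()
            .unwrap_or(self.backing.len());
        &self.backing[start..end]
    }
}

/// Builds a two-allocation arena, optionally sets bit 57 of the second
/// offset, then reads allocation 0. Returns its length or the panic message.
fn read_first_allocation(corrupt: bool) -> Result<usize, String> {
    let mut arena = ToyArena {
        backing: vec![0; 61440],
        offsets: vec![0, 49305],
    };
    if corrupt {
        // Simulated corruption: the same bit that is set in the real panics.
        arena.offsets[1] |= 1 << 57;
    }
    panic::catch_unwind(|| arena.get(0).len()).map_err(|payload| {
        payload
            .downcast_ref::<String>()
            .cloned()
            .unwrap_or_else(|| "non-string panic".to_string())
    })
}

fn main() {
    println!("intact:    {:?}", read_first_allocation(false));
    println!("corrupted: {:?}", read_first_allocation(true));
}
```

With the offset intact the read succeeds; with bit 57 set, the slice end becomes 144115188075905177 and the read panics, much like panic 1.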

Panic 3 uses a value from TreeHashCache.depth:

#[derive(Debug, PartialEq, Clone, Default, Encode, Decode)]
pub struct TreeHashCache {
    pub initialized: bool,
    /// Depth is such that the tree has a capacity for 2^depth leaves
    depth: usize,
    /// Sparse layers.
    ///
    /// The leaves are contained in `self.layers[self.depth]`, and each other layer `i`
    /// contains the parents of the nodes in layer `i + 1`.
    layers: SmallVec8<CacheArenaAllocation>,
}

Panic 3 is rather illuminating since I think it clearly shows us that there is memory corruption of some sort. In my previous investigation I made a somewhat complicated claim that there must be corruption causing panic 1 (and panic 2, but I hadn't seen it yet). Whilst I stand by that claim, I believe panic 3 is a clear-cut demonstration of corruption. As long as the following points hold, panic 3 proves memory corruption:

  1. TreeHashCache.depth is set once and never modified.
  2. When TreeHashCache.depth is set, a SmallVec8<CacheArenaAllocation> is allocated of that length.
  3. It is impossible for the user to allocate a SmallVec8<CacheArenaAllocation> of length 144115188075855875 without OOMing.

Rust's privacy (depth is not pub) and mutability syntax should make it trivial to infer my first point by looking at usages of "depth" in cache.rs.
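The three points can be sketched with a toy model. Everything here (`ToyCache`, `leaf_layer`, `simulate_corruption`, `can_allocate`) is hypothetical and only mirrors the shape of the argument. I am assuming `depth + 1` layers, which is what "the len is 4" with a masked index of 3 suggests, and a 64-bit target. `Vec<u64>` stands in for `SmallVec8<CacheArenaAllocation>`.

```rust
use std::panic;

mod toy {
    /// Toy stand-in for `TreeHashCache` (hypothetical, not the Lighthouse code).
    pub struct ToyCache {
        /// Private: code outside this module cannot write to it.
        depth: usize,
        layers: Vec<u64>,
    }

    impl ToyCache {
        /// `depth` is set here and by no other normal code path.
        /// Assumes `depth + 1` layers.
        pub fn new(depth: usize) -> Self {
            Self { depth, layers: vec![0; depth + 1] }
        }

        /// Reads the leaf layer at `layers[depth]`.
        pub fn leaf_layer(&self) -> u64 {
            self.layers[self.depth]
        }

        /// Stand-in for whatever is corrupting memory: sets bit 57 of `depth`.
        pub fn simulate_corruption(&mut self) {
            self.depth |= 1 << 57;
        }
    }
}

/// Returns the panic message produced by reading the leaf layer, if any.
fn leaf_layer_panic(corrupt: bool) -> Option<String> {
    let mut cache = toy::ToyCache::new(3);
    if corrupt {
        cache.simulate_corruption();
    }
    panic::catch_unwind(|| cache.leaf_layer())
        .err()
        .map(|payload| payload.downcast_ref::<String>().cloned().unwrap_or_default())
}

/// Reports whether a `Vec<u64>` with room for `len` elements can be allocated.
fn can_allocate(len: usize) -> bool {
    let mut layers: Vec<u64> = Vec::new();
    let ok = layers.try_reserve_exact(len).is_ok();
    std::hint::black_box(&layers); // keep the allocation from being optimised out
    ok
}

fn main() {
    println!("intact:    {:?}", leaf_layer_panic(false));
    println!("corrupted: {:?}", leaf_layer_panic(true));
    println!(
        "can allocate 144115188075855875 layers: {}",
        can_allocate(144115188075855875)
    );
}
```

Without the deliberately added `simulate_corruption` method, safe code outside the module has no way to change `depth`, and a list of 144115188075855875 entries cannot be allocated in the first place. So an index that large in panic 3 points to something writing memory behind Rust's back.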

From this point onward I am going to assume that the Rust code in cache.rs and cache_arena.rs is not at fault. Rather, I am going to assume that memory is being modified outside of the guarantees provided by the Rust compiler.

Corruption

I can think of the following places where the corruption might happen:

  1. Hardware: this seems unlikely to me. The pattern of corruption doesn't lend itself to stochastic bit-flipping due to faulty RAM.
  2. The Rust compiler for Windows: There might be a fault in the Rust compiler. I believe it would only be for Windows because this code has had a huge amount of testing on Linux.
  3. Some other unsafe code in Lighthouse (but only on Windows): it could be that one of our crypto libraries (BLST, hashing?) is "breaking out" and overwriting bits of memory. Notably, none of the affected structs are passed across an FFI boundary, so whatever weird unsafe code we're running would have to just happen to jump out and mess around with the 58th bit of usizes. However, it doesn't seem to flip bits in Hash256 values and corrupt hashes.

My intuition sends me down path (2) first. It could be that we're doing something weird that's triggering a bug in Windows Rust.

Corruption due to the Rust Compiler

The storage locations of the corrupt values for all three panics are similar in that each is a usize inside a struct. However, they differ in that two (panics 1 and 2, from CacheArena.offsets) are elements of a Vec, while the third (panic 3, TreeHashCache.depth) is a plain struct field.

TODO: more investigation. I don't have anything interesting to report yet 😕