This page documents the investigation into panics reported by "SnowyHitch" on the Lighthouse Discord. There have been three panics:

1. `cache_arena.rs:184:12` ([Investigation](https://hackmd.io/Z-kpqUApTHuYWb9N0z35Mw))
2. `cache.rs:153:14` ([Discord](https://discord.com/channels/605577013327167508/605577013331361793/1136858740063219853))
3. `cache.rs:159:14` ([Discord](https://discord.com/channels/605577013327167508/605577013331361793/1137139298651611268))

## Screenshots

### Panic #1

![](https://hackmd.io/_uploads/H1IUy7Aih.jpg)

### Panic #2

![](https://hackmd.io/_uploads/r1ZQkQCsh.png)

### Panic #3

![](https://hackmd.io/_uploads/HJiXJQCin.png)

## Panic Messages

*Note: I used an image-to-text tool to scrape these panic messages from the screenshots, so the paths may be a lil' wonky.*

1. `panicked at 'range end index 144115188075905177 out of range for slice of length 61440', D:\a\lighthouse\lighthouse\consensus\cached_tree_hash\src\cache_arena.rs:184:12`
2. `panicked at 'cached tree should have a root layer: UnknownAllocId(144115188075856164)', consensus\cached_tree_hash\src\cache.rs:153:14`
3. `panicked at 'index out of bounds: the len is 4 but the index is 144115188075855875', consensus\cached_tree_hash\src\cache.rs:159:14`

## Investigation

All of these panics are out-of-bounds list accesses. Interestingly, the offending indices look alike and are insanely large (insane because a list of this length would occupy at least 128 PiB, even at one byte per element):

1. 144115188075905177
2. 144115188075856164
3. 144115188075855875

As [pointed out](https://discord.com/channels/605577013327167508/605577013331361793/1137153570928607302) by "Diophantus" on Discord, all of these values have the 58th bit set (i.e. `bit[57] == 1`, counting from bit 0).
However, if we unset that 58th bit, they become relatively small, sane values:

```python
>>> bin(144115188075905177)
'0b1000000000000000000000000000000000000000001100000010011001'
>>> bin(144115188075856164)
'0b1000000000000000000000000000000000000000000000000100100100'
>>> bin(144115188075855875)
'0b1000000000000000000000000000000000000000000000000000000011'
```

This leads me to believe that *something* is setting the 58th bit of these indices.

Let's look at similarities between the storage locations of the panicky indices.

Panics 1 and 2 both use values from [`CacheArena.offsets`](https://github.com/sigp/lighthouse/blob/dfcb3363c757671eb19d5f8e519b4b94ac74677a/consensus/cached_tree_hash/src/cache_arena.rs#L23-L29):

```rust
#[derive(Debug, PartialEq, Clone, Default, Encode, Decode)]
pub struct CacheArena<T: Encode + Decode> {
    /// The backing array, storing cached values.
    backing: Vec<T>,
    /// A list of offsets indicating the start of each allocation.
    offsets: Vec<usize>,
}
```

Panic 3 uses a value from [`TreeHashCache.depth`](https://github.com/sigp/lighthouse/blob/dfcb3363c757671eb19d5f8e519b4b94ac74677a/consensus/cached_tree_hash/src/cache.rs#L12-L23):

```rust
#[derive(Debug, PartialEq, Clone, Default, Encode, Decode)]
pub struct TreeHashCache {
    pub initialized: bool,
    /// Depth is such that the tree has a capacity for 2^depth leaves
    depth: usize,
    /// Sparse layers.
    ///
    /// The leaves are contained in `self.layers[self.depth]`, and each other layer `i`
    /// contains the parents of the nodes in layer `i + 1`.
    layers: SmallVec8<CacheArenaAllocation>,
}
```

Panic 3 is rather illuminating since I think it clearly shows us that there is memory corruption *of some sort*. In my [previous investigation](https://hackmd.io/Z-kpqUApTHuYWb9N0z35Mw) I made a somewhat complicated claim that there *must* be corruption causing panic 1 (and panic 2, but I hadn't seen it yet). Whilst I stand by that claim, I believe panic 3 is a clear-cut demonstration of corruption.
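As a quick sanity check on the single-flipped-bit idea, here is a minimal sketch that clears bit 57 of each panicky index and compares the result with the container length reported in the panic message, where one is reported. The helper `clear_bit` and the `observations` list are hypothetical names made up for this illustration; they are not part of Lighthouse. The only inputs are the numbers from the panic messages above.

```python
# Hypothetical helper for illustration; not Lighthouse code.
SUSPECT_BIT = 57  # zero-indexed, i.e. the "58th bit" discussed above


def clear_bit(value: int, bit: int) -> int:
    """Return `value` with the given zero-indexed bit forced to 0."""
    return value & ~(1 << bit)


# (observed index, container length from the panic message, or None when the
# message does not report a length)
observations = [
    (144115188075905177, 61440),  # panic 1: slice of length 61440
    (144115188075856164, None),   # panic 2: UnknownAllocId, no length given
    (144115188075855875, 4),      # panic 3: len is 4
]

repaired = [clear_bit(index, SUSPECT_BIT) for index, _ in observations]

for (index, length), fixed in zip(observations, repaired):
    # A strict `<` is the tighter bound and is enough for both the
    # range-end case (panic 1) and the plain index case (panic 3).
    in_bounds = None if length is None else fixed < length
    print(f"{index} -> {fixed} (length={length}, in_bounds={in_bounds})")
```

With bit 57 cleared, the indices become 49305, 292 and 3. The first fits inside the slice of length 61440 and the third fits inside the list of length 4, which is at least consistent with otherwise-valid values having had exactly one bit set on them.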
As long as these points stand, I have proven memory corruption with panic 3:

1. `TreeHashCache.depth` is set once and never modified.
2. When `TreeHashCache.depth` is set, a `SmallVec8<CacheArenaAllocation>` is allocated of that length.
3. It is impossible for the user to allocate a `SmallVec8<CacheArenaAllocation>` of length `144115188075855875` without OOMing.

Rust's privacy (`depth` is not `pub`) and mutability syntax should make it trivial to infer my first point by looking at usages of "depth" in [`cache.rs`](https://github.com/sigp/lighthouse/blob/stable/consensus/cached_tree_hash/src/cache.rs).

From this point onward I am going to assume that the Rust code in `cache.rs` and `cache_arena.rs` is not at fault. Rather, I am going to assume that memory is being modified outside of the guarantees provided by the Rust compiler.

## Corruption

I can think of the following places where the corruption might happen:

1. **Hardware**: this seems unlikely to me. The pattern of corruption (the same bit set in every case) doesn't lend itself to stochastic bit-flipping due to faulty RAM.
2. **The Rust compiler for Windows**: There might be a fault in the Rust compiler. I believe it would only be for Windows because this code has had a huge amount of testing on Linux.
3. **Some other unsafe code in Lighthouse (but only on Windows)**: it could be that one of our crypto libraries (BLST, hashing?) is "breaking out" and overwriting bits of memory. Notably, none of the affected structs are being passed to an FFI, so whatever weird unsafe code we're running just happens to jump out and mess around with the 58th bit of `usize`s. (However, it doesn't seem to flip bits in `Hash256` and corrupt hashes.)

My intuition sends me down path (2) first. It may be that we're doing something weird that's triggering a bug in Windows Rust.

### Corruption due to the Rust Compiler

The two storage locations of the corrupt values (`CacheArena.offsets` for panics 1 and 2, `TreeHashCache.depth` for panic 3) are similar in that both are a `usize` inside a struct.
However, they're different in that one is in a `Vec` and the other isn't.

TODO: more investigation. I don't have anything interesting to report yet 😕