This page documents investigation into panics reported by "SnowyHitch" on the Lighthouse Discord.
There have been three panics:
1. cache_arena.rs:184:12 (Investigation)
2. cache.rs:153:14 (Discord)
3. cache.rs:159:14 (Discord)
Note: I used an image-to-text thing to scrape these panic messages. They're lil' wonky with the paths.
panicked at 'range end index 144115188075905177 out of range for slice of length 61440", D: \a\lighthouse\lighthouse\consensus \cached_tree_hash\src\cache_arena.r5:184:12
panicked at 'cached tree should have a root layer: UnknownAllocId (144115188075856164)', consensus \cached tree hash\src\cache.rs:153:14
panicked at index out of bounds: the len is 4 but the index is 144115188075855875, consensus\cached_tree_hash\sc\cache.rs:159:14
All of these panics are out-of-bounds list accesses. Interestingly, the accessing indices are similar-looking and insanely large (insane because a list of this size would be on the order of petabytes): 144115188075905177, 144115188075856164 and 144115188075855875.
As pointed out by "Diophantus" on Discord, all these values have the 58th bit set (i.e. bit[57] == 1, where 0 is the 1st index). However, if we un-set that 58th bit they would be relatively small/sane values:
>>> bin(144115188075905177)
'0b1000000000000000000000000000000000000000001100000010011001'
>>> bin(144115188075856164)
'0b1000000000000000000000000000000000000000000000000100100100'
>>> bin(144115188075855875)
'0b1000000000000000000000000000000000000000000000000000000011'
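To make the "un-set the 58th bit" step concrete, here is a minimal sketch that clears bit 57 on each of the three indices from the panic messages. The helper names (`clear_bit`, `set_bits`) are just illustrative and are not part of Lighthouse.

```python
# Indices copied from the three panic messages above.
PANIC_INDICES = [
    144115188075905177,  # cache_arena.rs:184:12 (slice length was 61440)
    144115188075856164,  # cache.rs:153:14 (UnknownAllocId)
    144115188075855875,  # cache.rs:159:14 (len was 4)
]

SUSPECT_BIT = 57  # zero-indexed, i.e. the "58th bit"


def clear_bit(value, bit):
    """Return `value` with the given zero-indexed bit forced to 0."""
    return value & ~(1 << bit)


def set_bits(value):
    """Return the zero-indexed positions of every 1 bit in `value`."""
    return [i for i in range(value.bit_length()) if (value >> i) & 1]


cleared = [clear_bit(v, SUSPECT_BIT) for v in PANIC_INDICES]

for original, fixed in zip(PANIC_INDICES, cleared):
    # Bit 57 is the highest set bit in every one of the reported values.
    assert max(set_bits(original)) == SUSPECT_BIT
    print(f"{original} -> {fixed}")
```

With the bit cleared the values become 49305, 292 and 3. The first would fit inside the slice of length 61440 from panic 1, and the last would be a valid index into the length-4 list from panic 3. That is consistent with a single flipped bit, though it doesn't prove it.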
This leads me to believe that something is setting the 58th bit of these indices. Let's look at similarities between the storage locations of the panicky indices.
Panics 1 and 2 both use values from CacheArena.offsets:
#[derive(Debug, PartialEq, Clone, Default, Encode, Decode)]
pub struct CacheArena<T: Encode + Decode> {
    /// The backing array, storing cached values.
    backing: Vec<T>,
    /// A list of offsets indicating the start of each allocation.
    offsets: Vec<usize>,
}
Panic 3 uses a value from TreeHashCache.depth:
#[derive(Debug, PartialEq, Clone, Default, Encode, Decode)]
pub struct TreeHashCache {
    pub initialized: bool,
    /// Depth is such that the tree has a capacity for 2^depth leaves
    depth: usize,
    /// Sparse layers.
    ///
    /// The leaves are contained in `self.layers[self.depth]`, and each other layer `i`
    /// contains the parents of the nodes in layer `i + 1`.
    layers: SmallVec8<CacheArenaAllocation>,
}
Panic 3 is rather illuminating since I think it clearly shows us that there is memory corruption of some sort. In my previous investigation I made a somewhat complicated claim that there must be corruption causing panic 1 (and panic 2, but I hadn't seen it yet). Whilst I stand by that claim, I believe panic 3 is a clear-cut demonstration of corruption. As long as the following points stand, I have proven memory corruption with panic 3:
1. TreeHashCache.depth is set once and never modified.
2. When TreeHashCache.depth is set, a SmallVec8<CacheArenaAllocation> is allocated of that length.
3. It is not possible to allocate a SmallVec8<CacheArenaAllocation> of length 144115188075855875 without OOMing.
Rust's privacy (depth is not pub) and mutability syntax should make it trivial to infer my first point by looking at usages of "depth" in cache.rs.
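The third point is simple arithmetic, sketched below. I don't know the in-memory size of a CacheArenaAllocation offhand, so the sketch assumes a deliberately tiny lower bound of one byte per element. The real figure can only be larger.

```python
# Index reported by the cache.rs:159:14 panic, i.e. the corrupt `depth`.
CORRUPT_DEPTH = 144115188075855875

# ASSUMPTION: a lower bound only. The real size of a CacheArenaAllocation
# is not taken from the Lighthouse source and will be bigger than this.
ASSUMED_MIN_ELEMENT_BYTES = 1

PIB = 2 ** 50  # bytes in one pebibyte


def min_allocation_pib(length, element_bytes=ASSUMED_MIN_ELEMENT_BYTES):
    """Whole pebibytes needed to hold `length` elements of `element_bytes` each."""
    return (length * element_bytes) // PIB


print(min_allocation_pib(CORRUPT_DEPTH))  # prints 128
```

Even at one byte per element, that is 128 PiB for a single list, which no machine running Lighthouse is going to hand out.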
From this point onward I am going to assume that the Rust code in cache.rs and cache_arena.rs is not at fault. Rather, I am going to assume that there is memory modification going on outside of the guarantees provided by the Rust compiler.
I can think of the following places where the corruption might happen:
usizes. However, it doesn't seem to flip bits in Hash256 and corrupt hashes).
My intuition sends me down path (2) first. It may be that we're doing something weird that's triggering a bug in Windows Rust.
The storage locations of the corrupt values are similar across all three panics in that each is a usize inside a struct. However, they differ in that one (CacheArena.offsets) is in a Vec and the other (TreeHashCache.depth) isn't.
TODO: more investigation. I don't have anything interesting to report yet 😕