Rust Consensus-safety heuristics

# Rust Consensus-safety heuristics **Disclaimer**: There are others working on Zcash and other consensus systems with more specialized expertise than me in both rust and consensus software engineering, so take all of this with a grain of salt. Think of it more as a "starting proposal". It's also by no means very comprehensive. ## The Goal The goal for this document (or hopefully a better reviewed and refined descendent) is to provide a Rust style/pattern guide aimed specifically at consensus safety. ## Definitions **Consensus Safety**: The software behaves identically across all platforms, given the same inputs and any reordering of input events allowed by consensus. A problem with this definition is that it's a bit circular about event ordering, which depends on consensus. :-/ Let's try a narrower definition that's easier to achieve: **Functional Consensus Safety** (aka **FCS**): Individual functional components of the software behave identically on all systems when presented with a starting state and a specific order of input events. We model the full software composes these critical FCS components with non-consensus critical glue. Drilling into the "functional components" to be more precise with the definition, we idealize a specific "functional consensus component" with this pseudocode interface: ``` fn my_consensus_function(state, input_event) -> result ``` -we say the compopnent, `my_consensus_function` provides FCS _iff_ the `result` is the same on all architectures given the same `(state, input_event)` bits. This implies the function _must not admit any side-effects_. This definition relies heavily on the notion of "non-consensus critical glue code" and composition. This is a big drawback, but we believe if we can at least isolate some consensus components, we can analyze their safety individually to rule out certain failure modes. We cannot, however, use this approach to reason about consensus safety for the full composition. **TODO:** Can we do better than this approach? ## Rules The idea of this heuristics guide is that if all code within the scope of a "functional consensus component" follows the rules, we exclude known categories of potential FCS failures. These rules do not guarantee the code provides FCS, but it helps us exclude certain failure cases to simplify our analysis of the component's consensus safety. For each rule, we attempt to argue that following the rule is necessary but not sufficient to achieve _functional consensus safety_. ### Rule 1. avoid `as` operator The `as` operator in Rust allows runtime casts between primitive types. The behavior is platform specific, and therefore violates FCS guarantees. By avoiding this operator, we can exclude the category of FCS failures due to variations across platforms in how runtime primtive casts work. #### Alteratives ##### Static Assertions A static assertion about the relationship between two types can ensure that a subset of platforms fail to type-check. This supports FCS by simply and explicitly excluding certain platforms. **TODO:** - How do we do static assertions like this? Give an example. - How do we ensure any runtime casts we rely on are covered by static assertions? What if we use a runtime case but miss covering it with an assertion? - **TODO:** It feels like this is a fruitful area for tooling support like a crate that provides an `safe_as` macro that inserts static assertions automagically. ##### `TryFrom` Replace `x as T` with `T::try_from(x).unwrap()`. This causes different platforms to behave differently, therefore it does not support FCS directly! However, it forces a subset of platforms to panic, and the complementary set to continue execution. We argue this is safer than `as` where the two (or more) subsets continue executing with different behavior. Downsides: - more runtime cost. (How true / how much is this? It seems feasible to me that smart compilers can minimize runtime cost.) - more tedius to write. - are there cases where `impl TryFrom<K> for T` doesn't exist yet `as` does exist? If so, this alternative isn't possible. ### Rule 2. Avoid Non-Deterministic Data Structures In rust, some standard datastructures such as `HashSet` and `HashMap` are nondeterministic: they are seeded with entropy such that they _behave differently on every run over the same inputs_! This violates FCS in two subtle ways: - First, using these _relies on side effects_ (namely seeding the initial entropy). - Second, it's tricky to understand when code using these data structures depends on the non-determinism. For example, iterating over these structures probably produces different iteration order on every run _even with identical inputs_. **Importantly**, this design choice by rust `std` is well motivated: for a deterministic hash-based data structure, maliciously chosen inputs can trigger worst-case pathological performance. So using a deterministic hashing data structure can often expose a denial-of-service vulnerability. #### Alternatives ##### Use Deterministic Datastructures which are DoS-resistant In many cases the nondeterministic datastructures may be swapped for deterministic data structures, for example replacing `HashSet` with `BTreeSet`. However, this doesn't always work conveniently. Hash sets require hashable items, whereas `BTree`-style containers require orderable items. This mismatch may make this alternative messier than at first glance. **TODO:** - Analyze how `BTree` style data structures hold up against pathologically worst-case inputs. - Find alternative data structure crates that meet our needs? #### Rule 2a. Avoid "Deterministic-Approximation Wrapper" Strategies It may be tempting (and common practice!) to try to "encapsulate" the non-determinism in an outer layer to provide a black box that provides an apparent deterministic interface. For example, a wrapper type over `HashSet` that ensures iteration has consistent order by sorting the elements. This strategy seems very likely to be error prone in at least two ways: - First, how can we be sure all non-deterministic behavior is correctly "covered" by the wrapper? It seems likely there will be unanticipated edge cases in which nondeterminism leaks through. - Second, suppose we have high confidence the wrapper covers all of the bahavior our code needs, but _then_ the code is updated (perhaps by new developers with no connection to older developers). While the first case is hard enough for a single codebase in time, the problem becomes even harder for codebases evolving through time. Maybe a simpler way to describe the problem with this approach: the goal of the rules in this doc is to easily exclude categories of FCS failures, but this approach doesn't make it easy to exclude FCS failures! ### Rule 3. Avoid Non-Deterministic Composition within Functional Consensus Components It is common in rust to rely on non-deterministic composition of code, especially for performance improvements via concurrency, ie threads or asynchronous tasks. While these can be used "safely" and may provide performance or code architecture benefits, they are non-determinstic and thus violate FCS! The cop-out we can use here is to define the composition of threads or tasks as part of the "glue" outside of FCS components. This approach would mean a "functional consensus component" could be contained within a concurrency primitive, but concurrency primitives could not be contained within a single functional consensus component. **TODO:** This against suggests the hand-waving about "non-consensus-critical composition glue" isn't useful enough to get us what we really want. But maybe it is? What if consensus rules were defined such that they can always be implemented as "functional consensus components"? In other words, the consensus definition would map more closely to code architecture in order to improve safety guarantees?