# `Onotole` high-level overview
[Onotole](https://github.com/ericsson49/research/tree/experimental5) is an experimental transpiler, which is intended to translate the [Ethereum Proof-of-Stake Consensus Specifications](https://github.com/ethereum/consensus-specs), expressed in a (statically-typed) subset of Python, to other languages, like Kotlin, Java, Dafny, etc.
**NB** For historical reasons, I also use "beacon chain specs" to refer to the Ethereum Proof-of-Stake Consensus Specifications.
In a wider/longer perspective, the goal is to attach formal semantics to the subset of Python, so that there is a solid formal foundation for the Eth PoS Consensus specifications. Basically, it tries to alleviate the fact that Python doesn't have a formal specification and that there is a lack of formal-method tools (though some exist, like [MyPy](http://mypy-lang.org/) or [Nagini](https://github.com/marcoeilers/nagini)).
This is a short overview; there is a longer (but largely outdated) description: [part 1](https://hackmd.io/REqkGV21T8Kf86Rh_FFN_g), [part 2](https://hackmd.io/mq7a6mOPTYacVux7fssL-Q) and [part 3](https://hackmd.io/29oG2wvORLmUekkPnhOjpw).
# Transformation phases
## python parsing
Python has a built-in ability to parse Python code, resulting in [PyAST](https://docs.python.org/3/library/ast.html). Currently, Onotole relies on this functionality to parse the original `md` files of the Eth PoS Consensus specs (together with the specs' utility functions). The result is serialized to a text file.
The approach is somewhat unusual; however, it's employed to avoid calling Python from Kotlin code or implementing a Python parser in Kotlin/Java. In the future, one could implement a JVM-native parser, based on Python's grammar description.
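A minimal sketch of this step, using Python's standard `ast` module (the `source` string and the output file name are illustrative, not Onotole's actual ones):

```python
import ast

# A spec fragment, as if already extracted from the `md` files.
source = """
def get_current_epoch(state: BeaconState) -> Epoch:
    return compute_epoch_at_slot(state.slot)
"""

# Parse to PyAST and serialize it to a text file for the Kotlin side.
tree = ast.parse(source)
with open("spec_ast.txt", "w") as f:
    f.write(ast.dump(tree))
```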
## beacon chain specs specific preprocessing
### `forOps` inlining
Onotole doesn't support locally defined functions - e.g. [`forOps`](https://github.com/ethereum/consensus-specs/blob/dev/specs/phase0/beacon-chain.md#operations). Currently, it's inlined with a dedicated transformation.
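A sketch of the transformation's effect, on a simplified version of `process_operations` (the stub and the `_inlined` name are for illustration only):

```python
from typing import Any, Callable, Sequence

def process_proposer_slashing(state: Any, operation: Any) -> None:
    ...  # stub for illustration

# Before: the spec defines a local helper inside process_operations.
def process_operations(state: Any, body: Any) -> None:
    def for_ops(operations: Sequence[Any], fn: Callable[[Any, Any], None]) -> None:
        for operation in operations:
            fn(state, operation)
    for_ops(body.proposer_slashings, process_proposer_slashing)

# After inlining: the local function is gone, each for_ops call
# becomes the corresponding loop.
def process_operations_inlined(state: Any, body: Any) -> None:
    for operation in body.proposer_slashings:
        process_proposer_slashing(state, operation)
```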
### phase merge
Entities employed by the specs (e.g. `AttestationData`, `BeaconState`, `BeaconBlockBody`, etc) are altered between phases - e.g. a field is added or its type is modified. That leads to type-checking problems.
Therefore, a "smart" merge of phase definitions is performed, which re-defines the previous phase's definitions that are affected by the altered entities. One simple example is [`phase0.get_current_epoch(state: phase0.BeaconState)`](https://github.com/ethereum/consensus-specs/blob/dev/specs/phase0/beacon-chain.md#get_current_epoch): if `BeaconState` is altered in a phase (like `altair`), then the "smart" merger introduces a new version `altair.get_current_epoch(state: altair.BeaconState)`.
See more details [here](https://ethresear.ch/t/how-to-combine-beacon-chain-phase-definitions/9953).
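A sketch of the effect, with spec types reduced to plain Python for illustration (the real specs use SSZ containers, and the merger works with separate `phase0`/`altair` namespaces rather than name suffixes):

```python
SLOTS_PER_EPOCH = 32

# phase0
class BeaconStatePhase0:
    def __init__(self, slot: int):
        self.slot = slot

def get_current_epoch_phase0(state: BeaconStatePhase0) -> int:
    return state.slot // SLOTS_PER_EPOCH

# altair alters BeaconState (it adds fields), so the "smart" merge
# re-introduces every definition affected by it, even when the body
# stays the same.
class BeaconStateAltair:
    def __init__(self, slot: int, inactivity_scores: list):
        self.slot = slot
        self.inactivity_scores = inactivity_scores

def get_current_epoch_altair(state: BeaconStateAltair) -> int:
    return state.slot // SLOTS_PER_EPOCH
```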
## simplifications
A number of simplifications (de-sugarings) are applied to the AST:
- operations (`+`, `/`, etc) are replaced with calls
- `for` loops are de-sugared to `while` loops
- complex assignments, like `a, b = ...`, are destructured into simpler ones
- sub-expressions are introduced as temporary variables
One specific simplification transforms `for (i,v) in enumerate(...)` into a simpler form which doesn't use `enumerate` explicitly - this is done to simplify alias analysis.
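To illustrate, here is a sketch of what the combined de-sugarings might produce for a spec-style loop (illustrative only; Onotole's exact output, e.g. temporary-variable names, differs):

```python
# Original:
#   total = 0
#   for i, v in enumerate(xs):
#       total = total + v * weights[i]

# De-sugared: no enumerate, no tuple assignment, `for` becomes `while`,
# and sub-expressions become temporary variables.
def weighted_sum(xs: list, weights: list) -> int:
    total = 0
    i = 0
    while i < len(xs):
        v = xs[i]
        tmp_1 = weights[i]
        tmp_2 = v * tmp_1
        total = total + tmp_2
        i = i + 1
    return total
```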
## filling holes
There are "ambiguities" in Python AST, which should be resolved in order to get a "formal" representation of the specs:
- variable declarations are missing, since there are only assignment statements, which may or may not introduce a new variable
- a name can mean different things in different contexts, for example, `bit` refers to the `ssz.bit` class, in general, however, occasionally, it's a name of a local variable.
- local variable types and constant types are not typically specified and thus, should be inferred
- there are implicit coercions (e.g. `int` are converted to sub-types, like `Epoch`, `Slot`, etc). This will often break type checking.
- type variables are sometimes missing, e.g. functions like `map`, `soreted`, etc have type parameters, which should be inferred
- same name can be used bey several local variables, their type can differ (e.g. when in-lining the `forOps` function)
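A sketch of the implicit-coercion problem, with the spec's `Slot` (an `uint64` alias) reduced to a trivial `int` subclass:

```python
class Slot(int):
    """Stand-in for the spec's Slot type (a subtype of uint64/int)."""

def next_slot(slot: Slot) -> Slot:
    # `slot + 1` has static type int, not Slot, so under strict type
    # checking the return statement needs an inferred coercion, i.e.
    # `return Slot(slot + 1)` - one of the "holes" Onotole must fill.
    return slot + 1
```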
### conversion to SSA
The AST is converted to [SSA](https://en.wikipedia.org/wiki/Static_single_assignment_form) form first. It helps to resolve two problems:
- variables that are actually distinct but share the same name receive different names
- the [dominator](https://en.wikipedia.org/wiki/Dominator_(graph_theory)) tree can be used to infer the places where a variable is implicitly declared
**NB** The AST is a high-level representation (e.g. it has `while` loops in an explicit form), while SSA assumes a [CFG](https://en.wikipedia.org/wiki/Control-flow_graph) representation. Currently, Onotole depends on the high-level representation (in order to emit more or less readable code). For this reason, the SSA form is destructed, e.g. `Phi` nodes are translated to copy instructions.
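A sketch of SSA conversion and destruction on a tiny example (illustrative naming; the real transformation works on the CFG):

```python
# Original: one name `x`, but two distinct variables.
#   if cond:
#       x = 1
#   else:
#       x = 2
#   y = x
#
# SSA form: every definition gets a fresh name, a Phi node joins them:
#   x_1 = 1 / x_2 = 2;  x_3 = Phi(x_1, x_2);  y_1 = x_3
#
# After SSA destruction, the Phi node becomes copy instructions:
def f(cond: bool) -> int:
    if cond:
        x_1 = 1
        x_3 = x_1  # copy inserted for Phi
    else:
        x_2 = 2
        x_3 = x_2  # copy inserted for Phi
    y_1 = x_3
    return y_1
```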
### type inference and type checking
Renaming different variables that share the same name is important for type inference and type checking.
Type inference is a somewhat complicated phase and there are a couple of implementations. The more or less solid one is based on translating the AST to a set of typing constraints, like `?Ta <: int` or `?Tb <: Seq[?N]`, where names like `?Ta` or `?N` are placeholders (corresponding to local or type variables), which should be filled by the type inference algorithm. The constraint solver is based on an iterative fix-point solver over a sub-typing lattice. A specialized "decision" procedure, based on unification, strongly connected components, and join/meet operations, is used to fill unknown variables or to find contradictions.
Approximate sub-phases:
- instantiate type variables (polymorphic functions and classes like `map` and `List`, list/dict/set literals, lambda functions)
- generate typing constraints (an equality constraint, a sub-typing constraint or a "functional" constraint, associated with operations)
- infer types
- check types
- inline inferred type information into AST
The exact algorithm is Work In Progress, so details can change.
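To make the constraint notation concrete, here is a sketch of a possible representation (hypothetical classes; Onotole's internal representation differs):

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Sub:
    """A sub-typing constraint: lhs <: rhs."""
    lhs: str  # a placeholder like "?Ta" or a concrete type like "int"
    rhs: str

# Constraints of the kinds mentioned above, as they might be generated
# for some spec fragment:
constraints = [
    Sub("?Ta", "int"),      # ?Ta <: int     (upper bound)
    Sub("Slot", "?Ta"),     # Slot <: ?Ta    (lower bound)
    Sub("?Tb", "Seq[?N]"),  # ?Tb <: Seq[?N] (involves a type variable ?N)
]
# The fix-point solver propagates such bounds over the sub-typing lattice
# (joining lower bounds, meeting upper bounds) until every placeholder is
# filled or a contradiction - a type error - is found.
```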
## optional transformations
There are beacon-chain-specific transformations, which are intended to bring the specs to a "pure" functional form:
- all updates become non-destructive, e.g. an update returns a new version of the updated entity
- loops are converted to recursion
- exceptions are converted to explicit outcome propagation, e.g. a `union` type which represents either a normal or an exceptional outcome
There are three corresponding sub-phases (WIP). The phases are based on static analyses, like alias and purity analyses (see some details [here](https://ethresear.ch/t/mutable-forest-memory-model-for-blockchain-specs/10882)).
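A sketch of the first and third points on a trivial state transition (the `SpecError` type and `_pure` naming are hypothetical; the real transformation works on Onotole's own representation):

```python
from dataclasses import dataclass, replace
from typing import Union

@dataclass(frozen=True)
class BeaconState:
    slot: int  # the real BeaconState is an SSZ container with many fields

@dataclass(frozen=True)
class SpecError:
    message: str

# Imperative spec style would do `state.slot += 1` and `raise` on failure.
# In the pure form, the update returns a new state, and a failure becomes
# an explicit outcome instead of an exception:
def process_slot_pure(state: BeaconState) -> Union[BeaconState, SpecError]:
    if state.slot >= 2**64 - 1:
        return SpecError("slot overflow")
    return replace(state, slot=state.slot + 1)
```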
## code generation
After all the "ambiguities" are resolved, the AST can be translated to a target language - currently, Kotlin or Java. Translation to Dafny is trickier, since Dafny doesn't support exceptions; however, I'm working on resolving this: one of the core problems has been type inference, which is important to precisely identify the locations where an exception may be thrown.
**NB** There is a sub-phase which translates function calls involving keyword arguments (which are supported by Python and employed by the beacon chain specs) to a no-keyword form, which is required for languages like Rust, Java or Dafny - while it's not required for Kotlin.
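A sketch of this sub-phase, using the spec's `slash_validator` signature (the stub body and the variables are for illustration):

```python
from typing import Optional

def slash_validator(state: object, slashed_index: int,
                    whistleblower_index: Optional[int] = None) -> None:
    ...  # stub mirroring the spec function's signature

state, index, proposer = object(), 5, 1

# Spec-style call, with a keyword argument:
slash_validator(state, index, whistleblower_index=proposer)

# After the sub-phase: arguments are matched against the callee's
# signature and passed positionally (defaults filled in where omitted):
slash_validator(state, index, proposer)
```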
# Plans
## formal semantics
Once the PyAST is annotated with full type information and transformed to the "pure" functional form, it can be further simplified to a (very) simple language, e.g. a variant of typed lambda calculus, which in turn can be equipped with formal semantics, e.g. based on the [CEK machine](https://en.wikipedia.org/wiki/CEK_Machine) or similar.
Translating the annotated PyAST to Dafny or another verification-friendly language (e.g. Coq) is another possible direction.
## backward code slicer
Another plan is to implement a backward [code slicer](https://en.wikipedia.org/wiki/Program_slicing), aiming to deal with exceptions or (hopefully) to modularize the beacon chain specs.
## test generation
The fully annotated "pure"-form PyAST can be used to generate tests, which can be employed to validate the conformance of beacon chain implementations to the specs.
In particular, a uniform-like SAT (or SMT) sampler like [CMSGen](https://github.com/meelgroup/cmsgen) can be used to sample test cases.