(Py)Spec Formal Engineering

slides: https://hackmd.io/@ericsson49/pyspec-formal-engineering


My background

  • S/W engineer, mostly Financial S/W
  • Familiar with Formal Methods for a long time
    • Mostly around Coq and SMT/SAT solvers
    • didn't have a (good) chance to apply as S/W Engineer
  • Joined Eth2 research recently
    • have a great chance now

My work @TXRX

My work @TXRX/PegaSys/ConsenSys:

  • clock sync, DeFi, oracles
  • fork choice tests + a bit of bug fixing
  • pyspec transpilation to Kotlin
  • pyspec formal engineering

But I'll concentrate on the latter here


(Py)Spec Formal Engineering

  • Want high quality financial S/W
    • Bugs can be costly
  • Even more important for blockchains
    • Lack of control after deployment

==>

Use Formal Methods


(Py)Spec Formal Engineering

  • Formal Methods are a mostly academic topic
    • But things are changing
  • (Very) difficult to employ in a typical industrial setting
    • ConsenSys is, in fact, my first real opportunity to do so

==>

Concentrate on Engineering aspects


(Py)Spec Formal Engineering

  • Concentrate on beacon chain pyspec, currently
  • But it should be applicable to Distributed Protocols in general
    • E.g. network layer (libp2p), other blockchains, oracles

Key problem

The more expressive the language, the harder it is to reason about formally.

==>

As a consequence, S/W engineers and FM practitioners chase different goals.


Key problem

Beacon chain perspective

  • Formal Methods importance is recognized
  • But the spec development process is centered around traditional S/W engineering

==>

FM practitioners' needs are largely ignored.

My research tries to answer "What can be done about it?"


Beacon chain problems (FM perspective)

  • Python is a bad choice from FM perspective
    • lack of tools, semantics, etc
  • pyspec is not a spec
    • not only from FM perspective
  • pyspec is a moving target
    • syncing can be costly
  • manual implementations in different languages
    • easy to introduce bugs

Opportunities

Good stuff

  • EF team is open to suggestions (but won't blindly accept them)
  • FM importance is generally recognized
  • Can introduce simplifying assumptions
  • Transpilation is not crazy difficult

My approach

  • "Engineer" the spec
  • Meta-programming
  • Tailor FM to match problem domain
  • Light(er) weight formal methods

Engineer the spec

  • A spec is arguably the most important artifact
    • Especially true for the beacon chain case
  • "Engineer" the pyspec to make it suitable for applying FMs (and reach other goals)
    • most of the spec is executable - the pyspec
  • Define requirements for the spec
    • what does "good spec" mean?

Tailor FM to problem domain

  • Constrain expressivity but gain simpler formal reasoning
  • The Python subset employed by the pyspec is already restricted (see the sketch below)
    • type annotations, limited OO features, limited set of types, no concurrency
  • Simplified memory model
    • mostly about restricted aliasing
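
A minimal sketch of that restricted style (the types and fields here are made up for illustration, not the actual spec definitions): fully annotated record-like containers, a small set of types, first-order control flow, no concurrency.

```python
# Illustrative sketch of the restricted Python subset (hypothetical types/fields,
# not the real beacon chain spec).
from dataclasses import dataclass
from typing import List


@dataclass
class Validator:
    balance: int
    slashed: bool


@dataclass
class State:
    slot: int
    validators: List[Validator]


def process_slot(state: State) -> None:
    # In-place update of one well-identified piece of state.
    state.slot += 1


def total_active_balance(state: State) -> int:
    # Pure query: no mutation, no internal references escape.
    return sum(v.balance for v in state.validators if not v.slashed)
```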

Meta-programming

  • Transpile pyspec to a "better" form
    • pure functional - for formal needs (see the sketch below)
    • other langs - for more practical stuff
  • Executable semantics for a subset of Python
  • Perhaps, other execution models:
    • Extensions of Datalog
    • Differential dataflow
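
A hedged before/after sketch of the "pure functional" target (hypothetical names; the actual pipeline rewrites the Python AST): the destructive update is turned into a function that returns a fresh state, which is the form proof assistants and pure targets expect.

```python
# Hypothetical illustration of the "purification" rewrite.
from dataclasses import dataclass, replace


@dataclass
class State:
    slot: int


def advance_slot(state: State) -> None:
    # pyspec-style: destructive update of the caller's state.
    state.slot += 1


@dataclass(frozen=True)
class PureState:
    slot: int


def advance_slot_pure(state: PureState) -> PureState:
    # Purified form: same logic, but a new state is returned
    # instead of mutating the argument.
    return replace(state, slot=state.slot + 1)
```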

Light(er)-weight FMs

  • Start from light-weight formal methods (static type checking; example below)
    • easier to introduce
    • fewer annotations (easy to adapt to changes)
    • still useful
  • Continue with middle-weight stuff
    • extended static checking
    • bounded model checking
    • concolic execution
  • But aim at full-blown verification in the long term
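
A tiny example of the pay-off (illustrative function, not from the spec): the annotations that pyspec-style code already carries let a static checker such as mypy reject an ill-typed call before anything is executed.

```python
# Illustrative only: a checker such as mypy flags the commented call below
# using nothing but the annotations the code already has.
from typing import List


def get_active_indices(balances: List[int]) -> List[int]:
    return [i for i, b in enumerate(balances) if b > 0]


indices = get_active_indices([32, 0, 31])   # fine
# bad = get_active_indices(32)              # mypy: incompatible type "int", expected "List[int]"
```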

Current state


Transpilation

  • Semi-automatic translation to Kotlin
    • Regular pyspec updates (phase0/phase1)
    • Used as a base for Mikhail Kalinin's work on Eth1/Eth2 simulation
    • Several pyspec bugs found+fixed
  • Working on dataflow analyses
    • to support translations
    • starting point for semantic work
  • Working on a translation to Rust

Research stuff

Mostly ready, but not published yet

Started to work on "semantics"

  • typing rules
  • memory model

Plans


Near term plans

  • Complete dataflow analyses
  • Fully automated translation to Kotlin
  • Complete translation to Rust
  • Implement executable semantics
    • typing rules
    • memory model (based on Rust's)
  • Spec purification
    • Pure functional/logic language target
  • Static analyses
    • Switch to datalog engines (Souffle, Formulog)

Longer term plans

  • Complete Formal Engineering series of posts
  • Trusted Assumption Base concept
  • Middle-weight Formal Methods

Semantics Highlights


Executable "semantics"

"Semantics" - a model which allows to tranform the python subset (employed by pyspec) to a "better" form.

  • pure functional (Coq, Stainless/Scala, WhyML)
  • other formal frameworks
  • mainstream langs (Kotlin, Rust, Scala, etc)

Components

  • start from annotated Python AST (sketch below)
  • supported features
  • infer variable declarations
  • infer types
  • resolve ambiguities
  • type libraries (BLS, SSZ, pylib, etc)
  • typing rules
  • memory model
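
A minimal sketch of the first component, reading the annotated Python AST with the standard ast module (the spec function shown is illustrative): the explicit annotations are exactly what the later inference and typing-rule passes start from.

```python
# Parse a pyspec-style function and read off its explicit annotations.
# (ast.unparse requires Python 3.9+.)
import ast

SRC = """
def process_slot(state: BeaconState) -> None:
    state.slot += 1
"""

tree = ast.parse(SRC)
fn = tree.body[0]
assert isinstance(fn, ast.FunctionDef)

# Explicit annotations: the starting point for declaration/type inference,
# typing rules and the memory model passes.
params = {a.arg: ast.unparse(a.annotation) for a in fn.args.args if a.annotation}
returns = ast.unparse(fn.returns) if fn.returns else None
print(params, returns)   # {'state': 'BeaconState'} None
```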

Memory Model


Some assumptions

  • Actor model
  • Protocol is built around a persistent data structure (e.g. blockchain)
  • Ignore concurrency (inside an actor)
    • but should be possible in a safe way (e.g. Rust-style)

==>

  • Pure functions mostly
  • But mutation is important for state updates

Formal Engineering perspective

  • Reasoning about mutation can be way more difficult
  • Pure code can be much slower

Want the best of both worlds, under reasonable restrictions


Aliasing

Aliasing is the core problem. There are several ways to address it:

  • detect 'no aliasing' at runtime - reference counting + static analysis (Lean, Whiley)
  • avoid aliasing - linear/uniqueness types (ATS, Idris, Mercury, Clean, SAC, etc)
  • avoid "dangerous" aliasing - e.g. Rust's ownership type system with borrowing

"Dangerous" aliasing

  • Two or more references pointing to the same location
  • At least one of them is mutable

In a concurrent program, access to the location should be synchronized to avoid a data race.

Rust's ownership type system + borrow checker prevent such dangerous aliasing. At any point there is either:

  • one mutable reference
  • or multiple immutable references

Reasoning and aliasing

Formal reasoning about destructive updates is notoriously difficult, for very similar reasons.

  • Formal properties can be seen as immutable references
  • Aliasing means more information about mutations has to be propagated (see the sketch below)
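
A small Python illustration (made-up values): a fact established about one name is silently invalidated by a mutation through an alias; this is precisely the information a verifier must propagate unless aliasing is restricted.

```python
# Two names for one location: the pattern that makes reasoning hard.
balances = [32, 32, 31]
view = balances                          # alias, not a copy

# A fact we would like to rely on (think: a verified property).
assert all(b <= 32 for b in balances)

view[2] += 10                            # mutation through the alias...
# ...and the fact about `balances` no longer holds, although `balances`
# itself never appeared in the updating statement.
assert not all(b <= 32 for b in balances)
```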

Memory model

Typically, some simplifying assumptions are made. Often it's about restricting aliasing: the less aliasing, the less info to propagate.

The Rust memory model is very attractive in this respect:

  • multiple immutable references constitute no problem, as there is no update possible
  • any mutable reference is exclusive, i.e. there is no other reference to the location.
  • the approach works well in practice

Fast and Purious

My hypotheses are:

  • Rust's memory model fits the beacon chain pyspec nicely
  • the same is true about Distributed Protocol specs, in general
  • 'Fast and Purious': both goals are achievable (see the sketch below)
    • fast execution (destructive updates)
    • translation to pure code w/o blow-up (mutable references are exclusive)
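
A hedged sketch of the "both goals" claim (hypothetical state type): when the caller holds the only reference to the state, the destructive version and the pure version are observationally equivalent, so a pure translation may reuse the old storage instead of copying it.

```python
from dataclasses import dataclass, replace
from typing import List


@dataclass
class State:
    slot: int
    balances: List[int]


def increase_balance(state: State, index: int, delta: int) -> None:
    # Fast: in-place update, safe as long as `state` is not aliased elsewhere.
    state.balances[index] += delta


def increase_balance_pure(state: State, index: int, delta: int) -> State:
    # Pure: same logic returning a new state. The copy below is for clarity;
    # a target with exclusive (moved) references, e.g. Rust, can skip it,
    # because no one can observe the old value.
    new_balances = list(state.balances)
    new_balances[index] += delta
    return replace(state, balances=new_balances)
```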

Q&A

