slides: https://hackmd.io/@ericsson49/pyspec-formal-engineering
My background
- S/W engineer, mostly Financial S/W
- Familiar with Formal Methods for a long time
- Mostly around Coq and SMT/SAT solvers
- didn't have a (good) chance to apply as S/W Engineer
- Joined Eth2 research recently
My work @TXRX
My work @TXRX/PegaSys/ConsenSys:
- clock sync, DeFi, oracles
- fork choice tests + a bit of bug fixing
- pyspec transpilation to Kotlin
- pyspec formal engineering
But concentrate on the latter here
- Want high quality financial S/W
- Even more important for blockchains
- Lack of control after deployment
==>
Use Formal Methods
- Formal Methods are a mostly academic topic
- (Very) difficult to employ in a typical industrial setting
- ConsenSys, in fact, the first real opportunity for me
==>
Concentrate on Engineering aspects
- Concentrate on beacon chain pyspec, currently
- But should be applicable for Distributed Protocols, in general
- E.g. network layer (libp2p), other blockchains, oracles
Key problem
Expressive language means it's more difficult to reason about.
==>
S/W engineers and FM practioners chase different goals, as a consequence.
Key problem
Beacon chain perspective
- Formal Methods importance is recognized
- But the spec development process is centered around traditional S/W engineering
==>
FM practioners needs are largely ignored.
My research tries to answer "What can be done about it?"
Beacon chain problems (FM perspective)
- Python is a bad choice from FM perspective
- lack of tools, semantics, etc
- pyspec is not a spec
- not only from FM perspective
- pyspec is a moving target
- manual implementations in different languages
Opportunities
Good stuff
- EF team is open to suggestions (but won't blindly accept them)
- FM importance is generaly recognized
- Can introduce simplifying assumptions
- Transpilation is not crazy difficult
My approach
- "Engineer" the spec
- Meta-programming
- Tailor FM to match problem domain
- Light(er) weight formal methods
Engineer the spec
- A spec is arguably the most important artifact
- Especially true for the beacon chain case
- "Engineer" the pyspec to make it suitable for applying FMs (and reach other goals)
- most part of the spec is executable - pyspec
- Define requirements for the spec
- what does "good spec" mean?
Tailor FM to problem domain
- Constrain expressivity but gain simpler formal reasoning
- Python subset employed by the pyspec is already restricted
- type annotations, limited OO features, limited set of types, no concurrency
- Simplified memory model
- mostly about restricted aliasing
- Transpile pyspec to a "better" form
- pure functional - for formal needs
- other langs - for more practical stuff
- Executable semantics for a subset of Python
- Pehaps, other execution models:
- Extensions of Datalog
- Differential dataflow
Light(er)-weight FMs
- Start from light-weight formal methods (static type checking)
- easier to introduce
- less annotations (easy to adapt to changes)
- still useful
- Continue with middle-weight stuff
- extended static checking
- bounded model checking
- concolic execution
- But aim at full-blown verification in a long term
Current state
Transpilation
- Semi-automatic translation to Kotlin
- Regular pyspec updates (phse0/phase1)
- Used as a base for Mikhail Kalinin's work on Eth1/Eth2 simulation
- Several pyspec bugs found+fixed
- Working on dataflow analyses
- to support translations
- starting point for semantic work
- Working on a translation to Rust
Research stuff
Mostly ready, but not published yet
Started to work on "semantics"
- typing rules
- memory model
Plans
Near term plans
- Complete dataflow analyses
- Fully automated translation to Kotlin
- Complete translation to Rust
- Implement executable semantics
- typing rules
- memory model (based on Rust's one)
- Spec purification
- Pure functional/logic language target
- Static analyses
- Switch to datalog engines (Souffle, Formulog)
Longer term plans
- Complete Formal Engineering series of posts
- Trusted Assumption Base concept
- Middle-weight Formal Methods
Semantics Highlits
Executable "semantics"
"Semantics" - a model which allows to tranform the python subset (employed by pyspec) to a "better" form.
- pure functional (Coq, Stainless/Scala, WhyML)
- other formal frameworks
- mainstream langs (Kotlin, Rust, Scala, etc)
Components
- start from annotated Python AST
- supported features
- infer variable declarations
- infer types
- resolve ambiguities
- type libraries (BLS, SSZ, pylib, etc)
- typing rules
- memory model
Memory Model
Some assumptions
- Actor model
- Protocol is built around a persistent data structure (e.g. blockchain)
- Ignore concurrency (inside an actor)
- but should be possible in a safe way (e.g. Rust-style)
==>
- Pure functions mostly
- But mutation is important for state updates
- Reasoning about mutation can be way more difficult
- Pure code can be way more slow
Want the best of both ways, under reasonable restrictions
Aliasing
Aliasing is the core problem. Several ways to solve:
- detect 'no aliasing' at runtime reference - counting + static analysis (Lean, Whiley) -
- avoid aliasing - linear/uniqueness types (ATS, Idris, Mercury, Clean, SAC, etc)
- avoid "dangerous" aliasing - e.g. Rust's ownership type system with borrowing
"Dangerous" aliasing
- Two or more references pointing to the same location
- At least one of them is mutable
In a concurrent program, access to the location should be synchronized to avoid a data race.
Rust ownership type + borrow checker prevents such dangerous aliasing. There's either:
- one mutable reference
- or multiple immutable references
Reasoning and aliasing
Formal reasoning about detructive updates is notoriously difficult for very similar reasons.
- Formal properties can be seen as immutable references
- Aliasing means more info about mutations to be propgated
Memory model
Typically, some simplifying assumptions are made. Often it's about restricting aliasing: the less aliasing, the less info to propagate.
The Rust memory model is very attractive in this respect:
- multiple immutable references constitute no problem, as there is no update possible
- any mutable reference is exclusive, i.e. there are no other reference to the location.
- the approach works well in practice
Fast and Purious
My hypothesess are:
- the Rust's memory model fits nicely the beacon chain pyspec
- the same is true about Distributed Protocol specs, in general
- 'Fast and Purious', both goals are achievable:
- fast execution (destructive updates)
- translation to a pure code w/o blow-up (mutable references are exclusive)
Q&A