(Py)Spec Formal Engineering

slides: https://hackmd.io/@ericsson49/pyspec-formal-engineering


My background

  • S/W engineer, mostly Financial S/W
  • Familiar with Formal Methods for a long time
    • Mostly around Coq and SMT/SAT solvers
    • didn't have a (good) chance to apply as S/W Engineer
  • Joined Eth2 research recently
    • have a great chance now

My work @TXRX

My work @TXRX/PegaSys/ConsenSys:

  • clock sync, DeFi, oracles
  • fork choice tests + a bit of bug fixing
  • pyspec transpilation to Kotlin
  • pyspec formal engineering

But I'll concentrate on the latter here


(Py)Spec Formal Engineering

  • Want high quality financial S/W
    • Bugs can be costly
  • Even more important for blockchains
    • Lack of control after deployment

==>

Use Formal Methods


(Py)Spec Formal Engineering

  • Formal Methods are a mostly academic topic
    • But things are changing
  • (Very) difficult to employ in a typical industrial setting
    • ConsenSys is, in fact, my first real opportunity to do so

==>

Concentrate on Engineering aspects


(Py)Spec Formal Engineering

  • Concentrate on beacon chain pyspec, currently
  • But it should be applicable to Distributed Protocols in general
    • E.g. network layer (libp2p), other blockchains, oracles

Key problem

The more expressive the language, the harder it is to reason about formally.

==>

As a consequence, S/W engineers and FM practitioners chase different goals.


Key problem

Beacon chain perspective

  • Formal Methods importance is recognized
  • But the spec development process is centered around traditional S/W engineering

==>

FM practitioners' needs are largely ignored.

My research tries to answer "What can be done about it?"


Beacon chain problems (FM perspective)

  • Python is a bad choice from FM perspective
    • lack of tools, semantics, etc
  • pyspec is not a spec
    • not only from FM perspective
  • pyspec is a moving target
    • syncing can be costly
  • manual implementations in different languages
    • easy to introduce bugs

Opportunities

Good stuff

  • EF team is open to suggestions (but won't blindly accept them)
  • FM importance is generally recognized
  • Can introduce simplifying assumptions
  • Transpilation is not crazy difficult

My approach

  • "Engineer" the spec
  • Meta-programming
  • Tailor FM to match problem domain
  • Light(er) weight formal methods

Engineer the spec

  • A spec is arguably the most important artifact
    • Especially true for the beacon chain case
  • "Engineer" the pyspec to make it suitable for applying FMs (and reach other goals)
    • most of the spec is executable - the pyspec
  • Define requirements for the spec
    • what does "good spec" mean?

Tailor FM to problem domain

  • Constrain expressivity but gain simpler formal reasoning
  • The Python subset employed by the pyspec is already restricted (see the sketch below)
    • type annotations, limited OO features, limited set of types, no concurrency
  • Simplified memory model
    • mostly about restricted aliasing
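
A minimal sketch of that restricted style (the types and fields here are made up for illustration, not the actual spec definitions): fully annotated record-like containers, a small set of types, first-order control flow, no concurrency.

```python
# Illustrative sketch of the restricted Python subset (hypothetical types/fields,
# not the real beacon chain spec).
from dataclasses import dataclass
from typing import List


@dataclass
class Validator:
    balance: int
    slashed: bool


@dataclass
class State:
    slot: int
    validators: List[Validator]


def process_slot(state: State) -> None:
    # In-place update of one well-identified piece of state.
    state.slot += 1


def total_active_balance(state: State) -> int:
    # Pure query: no mutation, no internal references escape.
    return sum(v.balance for v in state.validators if not v.slashed)
```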

Meta-programming

  • Transpile pyspec to a "better" form
    • pure functional - for formal needs (see the sketch below)
    • other langs - for more practical stuff
  • Executable semantics for a subset of Python
  • Perhaps, other execution models:
    • Extensions of Datalog
    • Differential dataflow
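
A hedged before/after sketch of the "pure functional" target (hypothetical names; the actual pipeline rewrites the Python AST): the destructive update is turned into a function that returns a fresh state, which is the form proof assistants and pure targets expect.

```python
# Hypothetical illustration of the "purification" rewrite.
from dataclasses import dataclass, replace


@dataclass
class State:
    slot: int


def advance_slot(state: State) -> None:
    # pyspec-style: destructive update of the caller's state.
    state.slot += 1


@dataclass(frozen=True)
class PureState:
    slot: int


def advance_slot_pure(state: PureState) -> PureState:
    # Purified form: same logic, but a new state is returned
    # instead of mutating the argument.
    return replace(state, slot=state.slot + 1)
```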

Light(er)-weight FMs

  • Start from light-weight formal methods (static type checking; example below)
    • easier to introduce
    • fewer annotations (easy to adapt to changes)
    • still useful
  • Continue with middle-weight stuff
    • extended static checking
    • bounded model checking
    • concolic execution
  • But aim at full-blown verification in the long term
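
A tiny example of the pay-off (illustrative function, not from the spec): the annotations that pyspec-style code already carries let a static checker such as mypy reject an ill-typed call before anything is executed.

```python
# Illustrative only: a checker such as mypy flags the commented call below
# using nothing but the annotations the code already has.
from typing import List


def get_active_indices(balances: List[int]) -> List[int]:
    return [i for i, b in enumerate(balances) if b > 0]


indices = get_active_indices([32, 0, 31])   # fine
# bad = get_active_indices(32)              # mypy: incompatible type "int", expected "List[int]"
```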

Current state


Transpilation

  • Semi-automatic translation to Kotlin
    • Regular pyspec updates (phase0/phase1)
    • Used as a base for Mikhail Kalinin's work on Eth1/Eth2 simulation
    • Several pyspec bugs found+fixed
  • Working on dataflow analyses
    • to support translations
    • starting point for semantic work
  • Working on a translation to Rust

Research stuff

Mostly ready, but not published yet

Started to work on "semantics"

  • typing rules
  • memory model

Plans


Near term plans

  • Complete dataflow analyses
  • Fully automated translation to Kotlin
  • Complete translation to Rust
  • Implement executable semantics
    • typing rules
    • memory model (based on Rust's)
  • Spec purification
    • Pure functional/logic language target
  • Static analyses
    • Switch to datalog engines (Souffle, Formulog)

Longer term plans

  • Complete Formal Engineering series of posts
  • Trusted Assumption Base concept
  • Middle-weight Formal Methods

Semantics Highlights


Executable "semantics"

"Semantics" - a model which allows to tranform the python subset (employed by pyspec) to a "better" form.

  • pure functional (Coq, Stainless/Scala, WhyML)
  • other formal frameworks
  • mainstream langs (Kotlin, Rust, Scala, etc)

Components

  • start from annotated Python AST (sketch below)
  • supported features
  • infer variable declarations
  • infer types
  • resolve ambiguities
  • type libraries (BLS, SSZ, pylib, etc)
  • typing rules
  • memory model
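
A minimal sketch of the first component, reading the annotated Python AST with the standard ast module (the spec function shown is illustrative): the explicit annotations are exactly what the later inference and typing-rule passes start from.

```python
# Parse a pyspec-style function and read off its explicit annotations.
# (ast.unparse requires Python 3.9+.)
import ast

SRC = """
def process_slot(state: BeaconState) -> None:
    state.slot += 1
"""

tree = ast.parse(SRC)
fn = tree.body[0]
assert isinstance(fn, ast.FunctionDef)

# Explicit annotations: the starting point for declaration/type inference,
# typing rules and the memory model passes.
params = {a.arg: ast.unparse(a.annotation) for a in fn.args.args if a.annotation}
returns = ast.unparse(fn.returns) if fn.returns else None
print(params, returns)   # {'state': 'BeaconState'} None
```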

Memory Model


Some assumptions

  • Actor model
  • Protocol is built around a persistent data structure (e.g. blockchain)
  • Ignore concurrency (inside an actor)
    • but should be possible in a safe way (e.g. Rust-style)

==>

  • Pure functions mostly
  • But mutation is important for state updates

Formal Engineering perspective

  • Reasoning about mutation can be way more difficult
  • Pure code can be much slower

Want the best of both worlds, under reasonable restrictions


Aliasing

Aliasing is the core problem. There are several ways to address it:

  • detect 'no aliasing' at runtime - reference counting + static analysis (Lean, Whiley)
  • avoid aliasing - linear/uniqueness types (ATS, Idris, Mercury, Clean, SAC, etc)
  • avoid "dangerous" aliasing - e.g. Rust's ownership type system with borrowing

"Dangerous" aliasing

  • Two or more references pointing to the same location
  • At least one of them is mutable

In a concurrent program, access to the location should be synchronized to avoid a data race.

Rust's ownership type system + borrow checker prevent such dangerous aliasing. At any point there is either:

  • one mutable reference
  • or multiple immutable references

Reasoning and aliasing

Formal reasoning about destructive updates is notoriously difficult, for very similar reasons.

  • Formal properties can be seen as immutable references
  • Aliasing means more information about mutations has to be propagated (see the sketch below)
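
A small Python illustration (made-up values): a fact established about one name is silently invalidated by a mutation through an alias; this is precisely the information a verifier must propagate unless aliasing is restricted.

```python
# Two names for one location: the pattern that makes reasoning hard.
balances = [32, 32, 31]
view = balances                          # alias, not a copy

# A fact we would like to rely on (think: a verified property).
assert all(b <= 32 for b in balances)

view[2] += 10                            # mutation through the alias...
# ...and the fact about `balances` no longer holds, although `balances`
# itself never appeared in the updating statement.
assert not all(b <= 32 for b in balances)
```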

Memory model

Typically, some simplifying assumptions are made. Often it's about restricting aliasing: the less aliasing, the less info to propagate.

The Rust memory model is very attractive in this respect:

  • multiple immutable references constitute no problem, as there is no update possible
  • any mutable reference is exclusive, i.e. there is no other reference to the location.
  • the approach works well in practice

Fast and Purious

My hypotheses are:

  • Rust's memory model fits the beacon chain pyspec nicely
  • the same is true about Distributed Protocol specs, in general
  • 'Fast and Purious': both goals are achievable (see the sketch below)
    • fast execution (destructive updates)
    • translation to pure code w/o blow-up (mutable references are exclusive)
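
A hedged sketch of the "both goals" claim (hypothetical state type): when the caller holds the only reference to the state, the destructive version and the pure version are observationally equivalent, so a pure translation may reuse the old storage instead of copying it.

```python
from dataclasses import dataclass, replace
from typing import List


@dataclass
class State:
    slot: int
    balances: List[int]


def increase_balance(state: State, index: int, delta: int) -> None:
    # Fast: in-place update, safe as long as `state` is not aliased elsewhere.
    state.balances[index] += delta


def increase_balance_pure(state: State, index: int, delta: int) -> State:
    # Pure: same logic returning a new state. The copy below is for clarity;
    # a target with exclusive (moved) references, e.g. Rust, can skip it,
    # because no one can observe the old value.
    new_balances = list(state.balances)
    new_balances[index] += delta
    return replace(state, balances=new_balances)
```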

Q&A

