# Measuring Michelson execution and typechecking
Period: <2020-09-30> -- <2020-10-09>
## Methodology and tools
We need a tool to benchmark the execution and the typechecking
of user-written contracts under realistic conditions.
To that end, we extended `tezos-snoop` (which was initially designed to
build the Michelson cost model) to handle a new form of benchmarks defined
by a corpus of scenarios. This extension is submitted for inclusion here:
https://gitlab.com/metastatedev/tezos/-/merge_requests/320
A scenario is composed of a scene (a sequence of instructions to set up an evaluation context)
and a serie of contract calls (that is a contract name, a storage, and a input). A contract
call can omit the storage or/and the input. In that case, we use the random
samplers of `tezos-snoop` to fill the holes.
When provided with such a corpus, `tezos-snoop` executes the scenarios
and produces a trace for each of scenario.
If a trace already existed, it checks that the outputs have not changed.
`tezos-snoop` also outputs the wall-clock time took by typechecking and execution:
```
typechecker : { max_time = 420834.000000 ; min_time = 93.000000 ; mean_time = 1556.214549 ; sigma = 3530.845233 }
interpreter : { max_time = 200876.000000 ; min_time = 8206.000000 ; mean_time = 9970.549261 ; sigma = 2190.420441 }
```
To improve the precision of the evaluation, the measures are done multiple times.
`tezos-snoop` removes outliers to compute the mean and the standard deviation.
As shown by the standard deviation, there is some room for improvement here.
Even with `cpufreq-set -g performance`, a laptop produces too much noise.
Using a dedicated benchmarking machine is probably the right thing to try next.
To get a more precise account of how typechecking and interpretation spent CPU
time, we use `linux-perf`, a performance analyzing tool for linux and `hotspot`
a gui to visualise `linux-perf` data. Since `linux-perf` only provides a statistical
analysis of the number of cycles consumed by the code, it is important to amplify
the time taken by typechecking and interpretation. Fortunately, since we measure
typechecking and interpretation multiple times, their execution time dominate
the cost of the benchmarking infrastructure. Hence, `linux-perf` is perfectly
usable to get useful profiling information.
The benchmark can be found here:
https://gitlab.com/nomadic-labs/michelson-benchmarks
_Next actions_:
- Setup a dedicated benchmarking machine.
- Populate the corpus with contracts from carthagenet.
- Experiments with OCaml's [new memtrace profiler](https://blog.janestreet.com/finding-memory-leaks-with-memtrace/)
## Results of a first experiment
What are the most expensive aspects of Michelson execution and typechecking?
To get a rough idea of some implementation choices that cost CPU time,
we took the following contract that computes the factorial of an integer:
```
parameter int ;
storage int ;
code { CAR; # Ignore the initial store, stack = [n]
PUSH @acc int 1; # We will accumulate the result on top, stack = [1; n]
SWAP; # Put n on top, stack = [n; accu]
PUSH bool True; # It is a do-while loop.
LOOP { DUP; # stack = [n; n; accu]
PUSH int 1; # stack = [1; n; n; accu]
SWAP; # stack = [n; 1; n; accu]
SUB; # stack = [n - 1; n; accu]
SWAP; # stack = [n; n - 1; accu]
DIP 1 { SWAP }; # stack = [n; accu; n - 1]
MUL; # stack = [n * accu; n - 1]
SWAP; # stack = [n - 1; n * accu]
DUP; # stack = [n - 1; n - 1; n * accu]
PUSH int 0; # stack = [0; n - 1; n - 1; n * accu]
COMPARE; # stack = [cmp 0 (n - 1); n - 1; n * accu]
NEQ }; # stack = [n - 1 <> 0; n - 1; n * accu]
SWAP; # stack = [fact n; 0]
DIP 1 { DROP }; # stack = [fact n]
NIL operation ; # stack = [ []; fact n ]
PAIR } # stack = [ ([], fact n) ]
```
and did some measurements of its typechecking and execution when computing `fact 100`.
To get a point of reference, we also implemented this function in OCaml:
```ocaml
let fact n =
let accu = ref Z.one in
let n = ref (Z.of_int n) in
while !n > 0 do
accu := Z.mul !accu !n;
n := Z.sub !n Z.one
done;
!accu
```
On an Intel(R) Core(TM) i7-10510U CPU 4.7Ghz,
when applied to `100`, this OCaml function runs
in ~3μs when compiled in native code
and in ~7μs when compiled in byte code.
The current implementation interprets this Michelson contract in ~350μs and typechecks it in ~6μs.
### Decomposing the interpreter execution cost
To understand how these 350μs are consumed by the interpreter, we rewrite the interpreter
in successive steps. Each step is guided by an analysis of `linux-perf` output.
1. We remove gas computation: ~350μs -> ~65μs.
2. We remove Lwt.t: ~65μs -> ~40μs.
3. We remove the calls to the logger: ~40μs -> ~30μs.
4. We inline the most often used combinators : ~30μs -> ~25μs.
5. We rewrite interpretation of sequences of instruction: ~25μs -> ~20μs.
6. We use integer-specific comparison functions instead of generic comparison: ~20μs -> ~15μs.
7. We remove the error monad and the state monad: ~15μs -> ~12μs.
Note that these numbers are obtained using multiple runs. Yet, as micro-benchmarks
executed on a modern laptop and modern CPU, they only give an idea and cannot be taken
too seriously for many reasons, e.g., controlling CPU's clock frequency is a delicate matter.
Yet, it seems clear that gas computation should be optimized first as it represents ~4/5
of CPU time consumed by the interpretation.
The remaining cost can be related to the contract representation which is less
compact that the OCaml bytecode. Another cost probably comes from the representation
of the stack which is a linked list in the current implementation while it is an
array in the OCaml runtime. Further experiments are needed to confirm that.
The final `linux-perf` output is:

For information, the initial `linux-perf` output was:

One can retrieve the `linux-perf` traces [here](https://we.tl/t-Z1Ejb7TXX0).
(Be aware that this archive weights ~1.5GB when decompressed.)
### Decomposing the typechecker execution cost
We followed a similar path to understand where typechecking spends its time.
1. We remove gas computation: ~6μs -> ~4μs.
2. We remove Lwt.t: ~4μs -> ~3.5μs.
3. We remove parse_var_annot: ~3.5μs -> ~3μs.
4. We remove protection mechanisms to catch large types: ~3μs -> ~2.5μs
Other micro-optimisations could be done but this experiment is limited.
Indeed, we do not evaluate how fast type equality is because the only type here is `int`
and the stack types are of small depths (< 5). Since the efficiency of type equality has
probably an important impact in practice, further experiments are needed.
## Brainstorming: how to optimise gas consumption and checking?
Currently, the execution of `gas.consume` amounts to the execution of the following function:
```ocaml=
let raw_consume block_gas operation_gas cost =
match operation_gas with
| Unaccounted ->
ok (block_gas, Unaccounted)
| Limited {remaining} ->
let gas = cost_to_milligas cost in
if Arith.(gas > zero) then
let remaining = Arith.sub remaining gas in
let block_remaining = Arith.sub block_gas gas in
if Arith.(remaining < zero) then error Operation_quota_exceeded
else if Arith.(block_remaining < zero) then error Block_quota_exceeded
else ok (block_remaining, Limited {remaining})
else ok (block_gas, operation_gas)
```
`Arith` provides arbitrary-precision arithmetic operations.
_Ideas_:
- Use saturating arithmetic over machine integers.
- Decouple the updates of the cost counter and the check that the amount of remaining gas is positive.
- Reify cost update as an instruction and insert cost updates
before execution (at origination time?)
trying to regroup the cost of multiple instructions when possible.
- Draw some inspiration from [existing literature](http://www.iro.umontreal.ca/~feeley/papers/FeeleyFPCA93.pdf) about asynchronous interruption polling to check the gas regularly but not too often (depending on the latency we expect).
- Use a version of Zarith that computes the cost of operations.
- Simplify consume_cost
- Understand how much is spent between cost computations, cost updates, context updates, etc.
- Inline Seq in each node of the AST
## Other optimisations of the interpreter
Xavier gave us the following pointer:
http://www.complang.tuwien.ac.at/papers/ertl+02.ps.gz
## Next actions
- Analyze how gas consumption and checking burn CPU cycles
- Optimize gas consumption and checking
- Checkout Xavier's new version of Z