Measuring Michelson execution and typechecking

# Measuring Michelson execution and typechecking Period: <2020-09-30> -- <2020-10-09> ## Methodology and tools We need a tool to benchmark the execution and the typechecking of user-written contracts under realistic conditions. To that end, we extended `tezos-snoop` (which was initially designed to build the Michelson cost model) to handle a new form of benchmarks defined by a corpus of scenarios. This extension is submitted for inclusion here: https://gitlab.com/metastatedev/tezos/-/merge_requests/320 A scenario is composed of a scene (a sequence of instructions to set up an evaluation context) and a serie of contract calls (that is a contract name, a storage, and a input). A contract call can omit the storage or/and the input. In that case, we use the random samplers of `tezos-snoop` to fill the holes. When provided with such a corpus, `tezos-snoop` executes the scenarios and produces a trace for each of scenario. If a trace already existed, it checks that the outputs have not changed. `tezos-snoop` also outputs the wall-clock time took by typechecking and execution: ``` typechecker : { max_time = 420834.000000 ; min_time = 93.000000 ; mean_time = 1556.214549 ; sigma = 3530.845233 } interpreter : { max_time = 200876.000000 ; min_time = 8206.000000 ; mean_time = 9970.549261 ; sigma = 2190.420441 } ``` To improve the precision of the evaluation, the measures are done multiple times. `tezos-snoop` removes outliers to compute the mean and the standard deviation. As shown by the standard deviation, there is some room for improvement here. Even with `cpufreq-set -g performance`, a laptop produces too much noise. Using a dedicated benchmarking machine is probably the right thing to try next. To get a more precise account of how typechecking and interpretation spent CPU time, we use `linux-perf`, a performance analyzing tool for linux and `hotspot` a gui to visualise `linux-perf` data. Since `linux-perf` only provides a statistical analysis of the number of cycles consumed by the code, it is important to amplify the time taken by typechecking and interpretation. Fortunately, since we measure typechecking and interpretation multiple times, their execution time dominate the cost of the benchmarking infrastructure. Hence, `linux-perf` is perfectly usable to get useful profiling information. The benchmark can be found here: https://gitlab.com/nomadic-labs/michelson-benchmarks _Next actions_: - Setup a dedicated benchmarking machine. - Populate the corpus with contracts from carthagenet. - Experiments with OCaml's [new memtrace profiler](https://blog.janestreet.com/finding-memory-leaks-with-memtrace/) ## Results of a first experiment What are the most expensive aspects of Michelson execution and typechecking? To get a rough idea of some implementation choices that cost CPU time, we took the following contract that computes the factorial of an integer: ``` parameter int ; storage int ; code { CAR; # Ignore the initial store, stack = [n] PUSH @acc int 1; # We will accumulate the result on top, stack = [1; n] SWAP; # Put n on top, stack = [n; accu] PUSH bool True; # It is a do-while loop. LOOP { DUP; # stack = [n; n; accu] PUSH int 1; # stack = [1; n; n; accu] SWAP; # stack = [n; 1; n; accu] SUB; # stack = [n - 1; n; accu] SWAP; # stack = [n; n - 1; accu] DIP 1 { SWAP }; # stack = [n; accu; n - 1] MUL; # stack = [n * accu; n - 1] SWAP; # stack = [n - 1; n * accu] DUP; # stack = [n - 1; n - 1; n * accu] PUSH int 0; # stack = [0; n - 1; n - 1; n * accu] COMPARE; # stack = [cmp 0 (n - 1); n - 1; n * accu] NEQ }; # stack = [n - 1 <> 0; n - 1; n * accu] SWAP; # stack = [fact n; 0] DIP 1 { DROP }; # stack = [fact n] NIL operation ; # stack = [ []; fact n ] PAIR } # stack = [ ([], fact n) ] ``` and did some measurements of its typechecking and execution when computing `fact 100`. To get a point of reference, we also implemented this function in OCaml: ```ocaml let fact n = let accu = ref Z.one in let n = ref (Z.of_int n) in while !n > 0 do accu := Z.mul !accu !n; n := Z.sub !n Z.one done; !accu ``` On an Intel(R) Core(TM) i7-10510U CPU 4.7Ghz, when applied to `100`, this OCaml function runs in ~3μs when compiled in native code and in ~7μs when compiled in byte code. The current implementation interprets this Michelson contract in ~350μs and typechecks it in ~6μs. ### Decomposing the interpreter execution cost To understand how these 350μs are consumed by the interpreter, we rewrite the interpreter in successive steps. Each step is guided by an analysis of `linux-perf` output. 1. We remove gas computation: ~350μs -> ~65μs. 2. We remove Lwt.t: ~65μs -> ~40μs. 3. We remove the calls to the logger: ~40μs -> ~30μs. 4. We inline the most often used combinators : ~30μs -> ~25μs. 5. We rewrite interpretation of sequences of instruction: ~25μs -> ~20μs. 6. We use integer-specific comparison functions instead of generic comparison: ~20μs -> ~15μs. 7. We remove the error monad and the state monad: ~15μs -> ~12μs. Note that these numbers are obtained using multiple runs. Yet, as micro-benchmarks executed on a modern laptop and modern CPU, they only give an idea and cannot be taken too seriously for many reasons, e.g., controlling CPU's clock frequency is a delicate matter. Yet, it seems clear that gas computation should be optimized first as it represents ~4/5 of CPU time consumed by the interpretation. The remaining cost can be related to the contract representation which is less compact that the OCaml bytecode. Another cost probably comes from the representation of the stack which is a linked list in the current implementation while it is an array in the OCaml runtime. Further experiments are needed to confirm that. The final `linux-perf` output is: ![](https://i.imgur.com/e5Ou8Wi.png) For information, the initial `linux-perf` output was: ![](https://i.imgur.com/f4ddGhw.png) One can retrieve the `linux-perf` traces [here](https://we.tl/t-Z1Ejb7TXX0). (Be aware that this archive weights ~1.5GB when decompressed.) ### Decomposing the typechecker execution cost We followed a similar path to understand where typechecking spends its time. 1. We remove gas computation: ~6μs -> ~4μs. 2. We remove Lwt.t: ~4μs -> ~3.5μs. 3. We remove parse_var_annot: ~3.5μs -> ~3μs. 4. We remove protection mechanisms to catch large types: ~3μs -> ~2.5μs Other micro-optimisations could be done but this experiment is limited. Indeed, we do not evaluate how fast type equality is because the only type here is `int` and the stack types are of small depths (< 5). Since the efficiency of type equality has probably an important impact in practice, further experiments are needed. ## Brainstorming: how to optimise gas consumption and checking? Currently, the execution of `gas.consume` amounts to the execution of the following function: ```ocaml= let raw_consume block_gas operation_gas cost = match operation_gas with | Unaccounted -> ok (block_gas, Unaccounted) | Limited {remaining} -> let gas = cost_to_milligas cost in if Arith.(gas > zero) then let remaining = Arith.sub remaining gas in let block_remaining = Arith.sub block_gas gas in if Arith.(remaining < zero) then error Operation_quota_exceeded else if Arith.(block_remaining < zero) then error Block_quota_exceeded else ok (block_remaining, Limited {remaining}) else ok (block_gas, operation_gas) ``` `Arith` provides arbitrary-precision arithmetic operations. _Ideas_: - Use saturating arithmetic over machine integers. - Decouple the updates of the cost counter and the check that the amount of remaining gas is positive. - Reify cost update as an instruction and insert cost updates before execution (at origination time?) trying to regroup the cost of multiple instructions when possible. - Draw some inspiration from [existing literature](http://www.iro.umontreal.ca/~feeley/papers/FeeleyFPCA93.pdf) about asynchronous interruption polling to check the gas regularly but not too often (depending on the latency we expect). - Use a version of Zarith that computes the cost of operations. - Simplify consume_cost - Understand how much is spent between cost computations, cost updates, context updates, etc. - Inline Seq in each node of the AST ## Other optimisations of the interpreter Xavier gave us the following pointer: http://www.complang.tuwien.ac.at/papers/ertl+02.ps.gz ## Next actions - Analyze how gas consumption and checking burn CPU cycles - Optimize gas consumption and checking - Checkout Xavier's new version of Z