# Optimizing the elaboration of Michelson programs
## Lessons from benchmarking Dexter
On the 31st of December 2020, Daniel Hines reported the following results from benchmarking
[Dexter](https://dexter.exchange/):
```
With the current interpreter:
┌───────────────────┬────────────┬──────────┬────────────┬────────────┬────────────┐
│ Name              │   Time/Run │  mWd/Run │   mjWd/Run │   Prom/Run │ Percentage │
├───────────────────┼────────────┼──────────┼────────────┼────────────┼────────────┤
│ FA1.2_Approve     │    29.54us │  19.44kw │     16.58w │     16.58w │      1.85% │
│ FA1.2_Transfer    │    61.12us │  45.91kw │     34.05w │     34.05w │      3.84% │
│ Dexter_xtzToToken │ 1_592.77us │ 673.83kw │ 10_819.06w │ 10_819.06w │    100.00% │
└───────────────────┴────────────┴──────────┴────────────┴────────────┴────────────┘
With the new interpreter:
┌───────────────────┬────────────┬──────────┬────────────┬────────────┬────────────┐
│ Name              │   Time/Run │  mWd/Run │   mjWd/Run │   Prom/Run │ Percentage │
├───────────────────┼────────────┼──────────┼────────────┼────────────┼────────────┤
│ FA1.2_Approve     │    16.64us │   6.92kw │      9.61w │      9.61w │      1.16% │
│ FA1.2_Transfer    │    31.42us │  11.22kw │     13.38w │     13.38w │      2.19% │
│ Dexter_xtzToToken │ 1_435.72us │ 628.41kw │ 10_730.69w │ 10_730.69w │    100.00% │
└───────────────────┴────────────┴──────────┴────────────┴────────────┴────────────┘
```
These results were very surprising: we were expecting a 5-8x speedup
of Michelson execution times.
By analyzing the execution of these contracts, we realized that the performance issue lay not
in the interpreter but in the implementation of some internal primitives:
Micheline serialization suffered from a problematic implementation of recursive encoders.
After fixing this [performance bug](https://gitlab.com/nomadic-labs/data-encoding/-/merge_requests/23),
we observed a more significant speedup:
```
- 008 with the performance bug fixed
┌───────────────────┬──────────┬─────────┬──────────┬──────────┬────────────┐
│ Name              │ Time/Run │ mWd/Run │ mjWd/Run │ Prom/Run │ Percentage │
├───────────────────┼──────────┼─────────┼──────────┼──────────┼────────────┤
│ FA1.2_Approve     │  19.80us │ 15.95kw │    8.37w │    8.37w │     12.67% │
│ FA1.2_Transfer    │  46.28us │ 41.55kw │   24.80w │   24.80w │     29.60% │
│ Dexter_xtzToToken │ 156.35us │ 75.97kw │  744.17w │  744.17w │    100.00% │
└───────────────────┴──────────┴─────────┴──────────┴──────────┴────────────┘
- With the new interpreter and the performance bug fixed
┌───────────────────┬──────────┬─────────┬──────────┬──────────┬────────────┐
│ Name              │ Time/Run │ mWd/Run │ mjWd/Run │ Prom/Run │ Percentage │
├───────────────────┼──────────┼─────────┼──────────┼──────────┼────────────┤
│ FA1.2_Approve     │   9.72us │  3.43kw │    1.33w │    1.33w │      7.79% │
│ FA1.2_Transfer    │  20.47us │  6.86kw │    3.02w │    3.02w │     16.39% │
│ Dexter_xtzToToken │ 124.87us │ 30.55kw │  694.41w │  694.41w │    100.00% │
└───────────────────┴──────────┴─────────┴──────────┴──────────┴────────────┘
```
Analyzing the execution of Dexter with `perf` shows that:
- the interpretation layer is indeed ~5x faster in the new interpreter;
- 30% of the execution time is still spent in the serialization of `Big_map`s;
- the execution time is split more or less evenly between elaboration and execution.
### _Lessons learned_:
- Realistic executions are indeed very informative.
- An audit of the implementations of the internal primitives would be a good idea.
- Elaboration takes a significant part of the execution time; we should try to optimize it.
## Lazy elaboration
The scripts found in the chain are often a mere sequence of branching instructions with no loops.
(See [Dexter's source code](https://gitlab.com/marigold/tezos/-/blob/dh@benchmark-suite/src/proto_alpha/lib_protocol/benchmark/contracts/dexter.tz).)
Since we elaborate a whole program but execute only a specific path through it,
most of the elaborated code is never executed in scripts of this shape.
Elaborating that code is therefore a waste of resources.
Xavier Leroy suggested elaborating lazily: typically, a branching instruction should only
elaborate the branch that is actually executed.
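To make the idea concrete, here is a minimal sketch on a toy untyped instruction set; the `instr` and `elaborate` names are illustrative and unrelated to the actual Michelson elaborator. Elaborating a conditional suspends the elaboration of its branches; `Lazy.force` then memoizes the work the first time a branch is taken:
```ocaml
(* A toy instruction set mirroring the shape of the scripts above. *)
type instr =
  | Push of int
  | Dup
  | Drop
  | Eq of int             (* compare the top of the stack to a literal *)
  | Seq of instr list
  | If of instr * instr

type stack = int list

(* Elaboration turns an instruction into a closure over stacks. *)
let rec elaborate (i : instr) : stack -> stack =
  match i with
  | Push n -> fun s -> n :: s
  | Dup -> (function x :: s -> x :: x :: s | [] -> failwith "underflow")
  | Drop -> (function _ :: s -> s | [] -> failwith "underflow")
  | Eq n ->
      (function x :: s -> (if x = n then 1 else 0) :: s | [] -> failwith "underflow")
  | Seq is ->
      (* Sequences are still elaborated eagerly, left to right. *)
      List.fold_left (fun k i -> let e = elaborate i in fun s -> e (k s)) (fun s -> s) is
  | If (bt, bf) ->
      (* The key point: each branch is elaborated only when it is first taken,
         and [Lazy.force] memoizes the result for subsequent runs. *)
      let bt = lazy (elaborate bt) and bf = lazy (elaborate bf) in
      (function
        | c :: s -> if c <> 0 then Lazy.force bt s else Lazy.force bf s
        | [] -> failwith "underflow")
```
With this scheme, running a cascade of conditionals only elaborates the branches along the executed path.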
To start exploring this idea, we implemented such an [elaboration](https://gitlab.com/-/snippets/2061526)
on a subset of Michelson corresponding to the shape of the scripts described earlier.
A micro-benchmark compares this lazy elaboration to the current implementation
on a specific family of scripts of the form:
```
((push k);
 (dup; (eq 1); (if (drop; (push 1))
   (dup; (eq 2); (if (drop; (push 2))
     (dup; (eq 3); (if (drop; (push 3))
       ...
       (dup; (eq n); (if (drop; (push n)) ...)
```
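For reference, a hypothetical generator for this family, in terms of the toy `instr` type from the sketch above:
```ocaml
(* [if_seq k n] pushes [k] and then compares it against 1, ..., n. *)
let if_seq k n : instr =
  let rec chain i =
    if i > n then Seq []
    else Seq [ Dup; Eq i; If (Seq [ Drop; Push i ], chain (i + 1)) ]
  in
  Seq [ Push k; chain 1 ]
```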
We vary the two parameters over `[(k, n) | 1 <= n <= 6; 1 <= k <= n]`, where `n` is the number
of branches and `k` selects the branch that is taken, and obtain the following results:
```
┌────────────────────┬──────────┬──────────┬─────────────────┬─────────┬────────────┬─────────┐
│ Name               │ Time R^2 │ Time/Run │            95ci │ mWd/Run │ Percentage │ Speedup │
├────────────────────┼──────────┼──────────┼─────────────────┼─────────┼────────────┼─────────┤
│ if_seq_1_1 Current │     1.00 │  32.80ns │ -0.02ns +0.03ns │  69.00w │    100.00% │    1.15 │
│ if_seq_1_1 Lazy    │     1.00 │  28.53ns │ -0.02ns +0.03ns │  58.00w │     86.97% │    1.00 │
│ if_seq_1_2 Current │     1.00 │ 102.71ns │ -0.32ns +0.28ns │ 163.00w │    100.00% │    1.59 │
│ if_seq_1_2 Lazy    │     1.00 │  64.72ns │ -0.05ns +0.06ns │ 127.00w │     63.02% │    1.00 │
│ if_seq_2_2 Current │     1.00 │ 102.66ns │ -0.24ns +0.22ns │ 163.00w │    100.00% │    1.59 │
│ if_seq_2_2 Lazy    │     1.00 │  64.57ns │ -0.04ns +0.04ns │ 127.00w │     62.89% │    1.00 │
│ if_seq_1_3 Current │     1.00 │ 154.27ns │ -0.14ns +0.15ns │ 229.00w │    100.00% │    2.39 │
│ if_seq_1_3 Lazy    │     1.00 │  64.63ns │ -0.03ns +0.03ns │ 127.00w │     41.89% │    1.00 │
│ if_seq_2_3 Current │     1.00 │ 154.44ns │ -0.16ns +0.17ns │ 229.00w │    100.00% │    1.50 │
│ if_seq_2_3 Lazy    │     1.00 │ 103.09ns │ -0.08ns +0.11ns │ 196.00w │     66.75% │    1.00 │
│ if_seq_3_3 Current │     1.00 │ 154.26ns │ -0.17ns +0.16ns │ 229.00w │    100.00% │    1.46 │
│ if_seq_3_3 Lazy    │     1.00 │ 105.60ns │ -0.14ns +0.14ns │ 196.00w │     68.46% │    1.00 │
│ if_seq_1_4 Current │     1.00 │ 175.03ns │ -0.26ns +0.24ns │ 295.00w │    100.00% │    2.70 │
│ if_seq_1_4 Lazy    │     1.00 │  64.80ns │ -0.12ns +0.14ns │ 127.00w │     37.02% │    1.00 │
│ if_seq_2_4 Current │     1.00 │ 177.56ns │ -0.24ns +0.22ns │ 295.00w │    100.00% │    1.72 │
│ if_seq_2_4 Lazy    │     1.00 │ 103.18ns │ -0.05ns +0.06ns │ 196.00w │     58.11% │    1.00 │
│ if_seq_3_4 Current │     1.00 │ 177.80ns │ -0.24ns +0.22ns │ 295.00w │    100.00% │    1.25 │
│ if_seq_3_4 Lazy    │     1.00 │ 142.60ns │ -0.10ns +0.12ns │ 265.00w │     80.20% │    1.00 │
│ if_seq_4_4 Current │     1.00 │ 177.64ns │ -0.22ns +0.21ns │ 295.00w │    100.00% │    1.12 │
│ if_seq_4_4 Lazy    │     1.00 │ 158.80ns │ -0.10ns +0.11ns │ 265.00w │     89.40% │    1.00 │
│ if_seq_1_5 Current │     1.00 │ 206.95ns │ -0.27ns +0.24ns │ 361.00w │    100.00% │    3.20 │
│ if_seq_1_5 Lazy    │     1.00 │  64.67ns │ -0.06ns +0.06ns │ 127.00w │     31.25% │    1.00 │
│ if_seq_2_5 Current │     1.00 │ 206.25ns │ -1.67ns +1.71ns │ 361.00w │    100.00% │    2.00 │
│ if_seq_2_5 Lazy    │     1.00 │ 103.31ns │ -0.05ns +0.06ns │ 196.00w │     50.09% │    1.00 │
│ if_seq_3_5 Current │     1.00 │ 206.14ns │ -1.23ns +0.92ns │ 361.00w │    100.00% │    1.44 │
│ if_seq_3_5 Lazy    │     1.00 │ 143.01ns │ -0.13ns +0.14ns │ 265.00w │     69.37% │    1.00 │
│ if_seq_4_5 Current │     1.00 │ 206.86ns │ -0.76ns +0.57ns │ 361.00w │    100.00% │    1.11 │
│ if_seq_4_5 Lazy    │     1.00 │ 186.32ns │ -0.12ns +0.12ns │ 334.00w │     90.07% │    1.00 │
│ if_seq_5_5 Current │     1.00 │ 207.01ns │ -0.58ns +0.51ns │ 361.00w │    100.00% │    1.11 │
│ if_seq_5_5 Lazy    │     1.00 │ 187.10ns │ -0.17ns +0.21ns │ 334.00w │     90.38% │    1.00 │
│ if_seq_1_6 Current │     1.00 │ 236.13ns │ -1.50ns +1.39ns │ 427.00w │    100.00% │    3.65 │
│ if_seq_1_6 Lazy    │     1.00 │  64.65ns │ -0.03ns +0.04ns │ 127.00w │     27.38% │    1.00 │
│ if_seq_2_6 Current │     1.00 │ 226.65ns │ -0.92ns +1.08ns │ 427.00w │    100.00% │    2.19 │
│ if_seq_2_6 Lazy    │     1.00 │ 103.28ns │ -0.08ns +0.09ns │ 196.00w │     45.57% │    1.00 │
│ if_seq_3_6 Current │     1.00 │ 229.08ns │ -1.60ns +1.66ns │ 427.00w │    100.00% │    1.62 │
│ if_seq_3_6 Lazy    │     1.00 │ 141.61ns │ -0.08ns +0.09ns │ 265.00w │     61.82% │    1.00 │
│ if_seq_4_6 Current │     1.00 │ 237.45ns │ -1.64ns +1.57ns │ 427.00w │    100.00% │    1.27 │
│ if_seq_4_6 Lazy    │     1.00 │ 186.28ns │ -0.12ns +0.14ns │ 334.00w │     78.45% │    1.00 │
│ if_seq_5_6 Current │     1.00 │ 231.58ns │ -1.48ns +1.62ns │ 427.00w │    100.00% │    1.04 │
│ if_seq_5_6 Lazy    │     1.00 │ 223.28ns │ -0.14ns +0.16ns │ 403.00w │     96.42% │    1.00 │
│ if_seq_6_6 Current │     1.00 │ 234.23ns │ -1.58ns +1.58ns │ 427.00w │    100.00% │    1.04 │
│ if_seq_6_6 Lazy    │     1.00 │ 225.05ns │ -0.12ns +0.12ns │ 403.00w │     96.08% │    1.00 │
└────────────────────┴──────────┴──────────┴─────────────────┴─────────┴────────────┴─────────┘
```
#### _Remarks:_
- With lazy elaboration, the execution times show that we only pay for the code that is actually executed, as expected.
- Even though laziness introduces an extra indirection, the execution time of this elaboration
seems similar to that of the current one.
- The difference between the current elaboration and the lazy one is not simply the introduction of
laziness but also a slightly different style: we implemented a strict variant of the lazy elaboration and
found no significant difference attributable to laziness in the fully-elaborated cases.
- There is, however, a significant difference between the current elaboration and the lazy one: in the lazy case,
we never explicitly check that the output stack types of the two branches of a conditional are equal.
As far as I understand, the two types are made compatible by the actual execution of the continuation elaboration.
#### Next actions
- Determine if we can get some more performance by defunctionalizing the lazy elaboration.
- Check if the approach scales to the actual elaboration.
- Can we also deserialize Micheline lazily?
## Hash-consing types and stack types
Hash-consing is a standard technique to optimize equality tests and to reduce memory usage.
Unfortunately, the standard way to implement it
as found in the [literature](https://www.lri.fr/~filliatr/ftp/publis/hash-consing2.pdf)
cannot be applied in the case of Michelson types and stack types, because they are GADTs.
Indeed, this implementation relies on OCaml's physical equality:
```ocaml
( == ) : 'a -> 'a -> bool
```
while we need a way to compare values with distinct static types. Typically,
we need to compare a type representation `t1 : a ty` to another one `t2 : b ty`,
and if they are equal, the type equality `a = b` should become known to the typechecker,
to justify the well-typedness of the rest of the program.
### A failed attempt
A [first attempt](https://gitlab.com/tezos/tezos/-/merge_requests/2477)
consisted in introducing a richly typed physical equality:
```ocaml
( === ) : 'a t -> 'b t -> ('a, 'b) eq option
```
implemented using `Obj.magic` and requiring a proof that `t` is injective.
Unfortunately, injectivity is not enough, as shown by the counter-example
of `list`: the empty list is a single value shared by every instantiation of
`'a list`, so physical equality can hold between lists with distinct element types.
We probably need a proof that the GADT is a singleton type to get a safe
physical equality between GADTs of that type.
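Concretely, here is a minimal demonstration of that counter-example, assuming the usual equality-witness type and an `Obj`-based `===` specialized to `list`:
```ocaml
type (_, _) eq = Eq : ('a, 'a) eq

(* [list] is injective, yet physical equality across its instantiations
   is unsound as an equality witness. *)
let ( === ) : type a b. a list -> b list -> (a, b) eq option =
 fun l1 l2 -> if Obj.repr l1 == Obj.repr l2 then Some (Obj.magic Eq) else None

(* The empty list is one shared immediate value at every type, so this
   wrongly produces a proof that [int = string]. *)
let bogus : (int, string) eq option = ([] : int list) === ([] : string list)
```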
### A new hope
To safely compare GADTs, we cannot easily use their physical addresses but we can
simulate one using type-indexed names implemented using a combination of
an extensible GADT and local modules.
```ocaml
type (_, _) eq = Eq : ('a, 'a) eq

(* Extensible type of type-indexed names. *)
type _ id = ..

module type SGen = sig
  type u

  (* The name of [u]: a fresh extension constructor. *)
  type _ id += Self : u id

  val eq : 'v id -> (u, 'v) eq option
end

module Gen (X : sig
  type u
end) : SGen with type u = X.u = struct
  type u = X.u
  type _ id += Self : X.u id

  (* Matching against [Self] teaches the typechecker that [u' = X.u]. *)
  let eq (type u') (this : u' id) : (X.u, u') eq option =
    match this with Self -> Some Eq | _ -> None
end

type 'a gid = (module SGen with type u = 'a)

let eq_id : type a b. a gid -> b gid -> (a, b) eq option =
 fun (module G) (module O) -> G.eq O.Self

(* Each call mints a fresh name for the type [a]. *)
let gen (type a) () : a gid =
  (module Gen (struct type u = a end) : SGen with type u = a)
```
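As a quick sanity check (hypothetical, not part of the snippet), a generated name is equal to itself and to nothing else, even at the same type index:
```ocaml
let () =
  let n1 : int gid = gen () and n2 : int gid = gen () in
  assert (eq_id n1 n1 <> None); (* a name is equal to itself *)
  assert (eq_id n1 n2 = None)   (* distinct names differ, even both at [int] *)
```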
(This technique was actually already used by Daniel Bünzli
in [hmap](https://github.com/dbuenzli/hmap/blob/master/src/hmap.ml).
The Pleiade effect.)
Thanks to these indexed names, it is possible to memoize GADT data constructors,
which is enough to maintain the invariant of uniqueness of value representation
required for hash-consing.
The construction of stack cells can be optimized similarly: by embedding
a memoization table in each stack type `s`, we can memoize the stack type `t : s`
obtained by consing a type `t` on top of `s`. Such a local table is
probably faster to query than a global one; see the sketch below.
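Here is a minimal sketch of these two ideas on a toy type language, building on the `gid` machinery above; the `ty` representation and its per-node memo slot are illustrative, not the protocol's actual definitions:
```ocaml
type 'a ty = {
  id : 'a gid;                           (* type-indexed identity *)
  node : 'a node;
  mutable memo_list : 'a list ty option; (* local memo for the node built on top *)
}

and _ node =
  | Int_t : int node
  | List_t : 'a ty -> 'a list node

(* A single shared representative for [int]. *)
let int_t : int ty = { id = gen (); node = Int_t; memo_list = None }

(* Memoized constructor: [list_ty t] returns the same physical value on every
   call with the same [t], which is the uniqueness invariant of hash-consing. *)
let list_ty (type a) (t : a ty) : a list ty =
  match t.memo_list with
  | Some l -> l
  | None ->
      let l = { id = gen (); node = List_t t; memo_list = None } in
      t.memo_list <- Some l;
      l

(* Equality then reduces to a constant-time comparison of identities. *)
let ty_eq : type a b. a ty -> b ty -> (a, b) eq option =
 fun t1 t2 -> eq_id t1.id t2.id
```
Since every construction goes through a memoizing smart constructor, each type has a unique representative; a per-stack table for consing cells would follow the same pattern, with the memo slot living in the tail stack.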
The code is available here: <https://gitlab.com/-/snippets/2060960>
#### Next actions:
- Check that this approach is robust.
- Check that this approach scales to the actual elaboration,
and provides actual speedups on realistic executions.