# Optimizing the gas update and checking Period: <2020-10-14> -- <2020-10-22> ## Preliminary notes We have moved to a dedicated machine to run the benchmark. Hence, the numbers are slightly different from the ones obtained in the previous benchmark. They are probably more robust because no other active processes run on the machine and this machine is not a laptop. ## How much CPU time is devoted to gas update and checking? The current implementation of Michelson computes `fact 100` in 260μs on the benchmarking machine (Intel(R) Xeon(R) CPU E3-1245 v5 @ 3.50GHz). ~260μs - ~60μs = ~200μs is spent in updating and checking gas. More precisely, if we remove the call to `cost_of_instr`, the execution time decreases from ~260μs to ~150μs and if we remove the call to `Gas.consume`, the execution time decreases from ~150μs to ~60μs. Decomposing a little bit more, `Gas.consume` uses ~50% of its CPU time to substract the instruction cost to two gas counters (`block_gas` and `operation_gas`) and to test if they are still positive. ~50% are used to copy the context in order to update the two gas counters (!). Roughly speaking, we followed three strategies to optimize the gas update and the check for the gas exhaustion: - (i) reduce the number of times `cost_of_instr` and `Gas.consume` are called ; - (ii) optimize `Gas.consume` ; - (iii) optimize `cost_of_instr`. To make sure that each optimization did not change the way gas is consumed, we checked that the final amount of gas did not change after each optimization. ## Summary of results | Optimization | Running time | |----------------------------------------------------------|-------| | Initial running time | 260μs | | Reducing the frequency of gas updates and checks | 200μs | | Optimizing the gas updates | 170μs | | Optimizing the context update | 160μs | | Using efficient comparison operators for native integers | 155μs | | Using a version of Zarith with a fast-path | 145μs | | Using saturating arithmetic | 120μs | | Several Micro-optimizations of the cost model | 95μs | | Representing the gas counter as a parameter of step | 75μs | These optimizations could have been applied in a different order and probably interact with each other. The initial report from `linux-perf` can be found here: https://hackmd.io/3nkjs8EWRM2UmVBVXoWp9w The final report from `linux-perf` can be found here: https://hackmd.io/cJqCYvCkQGmjOTagulYm6w ## Reducing the frequency of gas updates and checks https://gitlab.com/yrg/tezos/-/commit/ab9dd1dcc2abda1e6597527f16db9fe94d4c382e There are many sequences of instructions whose costs do not depend on the data. These costs can be computed statically and regrouped. _Idea_: We apply a preprocessing to Michelson scripts before executing them. This preprocessor inserts new instructions of the form `PayGas (d, i)` where `i` is a sequence of instructions of statically determined cost `d`. This preprocessor rewrites each atomic instruction `i` which has a data independent (hence statically determined) cost `d` into an instruction of the form `PayGas (d, i)`. While moving up to the root of the script AST, sequences of `PayGas` are merged together as follows: PayGas (c1, i1); PayGas (c2, i2) => PayGas (c1 + c2, i1; i2) A similar rule is valid for `Dip` and `Dipn`, which apply an operation deeply in the stack. It is NOT a valid rule for any instruction that iterates nested instructions, e.g. `Loop`. On the factorial, the preprocessor produces the following script: ``` { PAYGAS 220 { CAR ; { PUSH int 1 ; { SWAP ; PUSH bool True } } } ; { LOOP { PAYGAS 160 { DUP ; { PUSH int 1 ; SWAP } } ; { SUB ; { PAYGAS 200 { SWAP ; DIP 1 { SWAP ; {} } } ; { MUL ; { PAYGAS 160 { SWAP ; { DUP ; PUSH int 0 } } ; { COMPARE int ; PAYGAS 120 { NEQ ; {} } } } } } } } ; PAYGAS 360 { SWAP ; { DIP 1 { DROP ; {} } ; { NIL ; { PAIR ; {} } } } } } } ``` _Performance gain_: In the case of `fact 100`, this optimization reduces execution time from ~260μs to ~200μs. _Remarks_: - The preprocessing is linear in the size of the input script and runs pretty fast. - This optimization may become obsolete if we introduce a superinstruction-based JIT. ## Optimizing `Gas.consume` ### Optimizing the gas updates https://gitlab.com/yrg/tezos/-/commit/3f026374a9452c3b198ade8053adcb3a4ab6783 Let `min_gas` be the minimum between `block_gas` and `operation_gas`, and `max_gas` be the maximum of them. If one of the two counters has to hit 0, then it must be `min_gas`. _Idea_: Gas updates and checks should only be done on `min_gas`. By remembering which of `block_gas` and `operation_gas` was the minimum using a boolean, we can update 'max_gas' afterwards. This boolean can also be used to produce the right error message in case of gas exhaustion. This optimization removes two `Zarith` operations over four in the critical routine `consume_gas`. _Performance gain_: In the case of `fact 100`, this optimization reduces execution time from ~200μs to ~170μs. ### Optimizing the context update https://gitlab.com/yrg/tezos/-/commit/5615a682b89acd798f2565708c56b168ecfefc85 In the function `Raw_context.consume`: ```ocaml let consume_gas ctxt cost = Gas_limit_repr.raw_consume ctxt.gas_counter ctxt.back.count_block_gas cost >>? fun gas_counter -> ok {ctxt with gas_counter} ``` The construction of the returned context requires the copy of 24 fields. Most of them are rarely updated during execution except the `gas_counter` (introduced in the previous optimization). _Idea_: Structure the context in two layers to only copy 2 fields instead of 24. _Performance gain_: In the case of `fact 100`, this optimization reduces execution time from ~170μs to ~160μs. ### Using efficient comparison operators for native integers The following patch has been applied to properly compare native integers in the protocol codebase: https://gitlab.com/tezos/tezos/-/merge_requests/2171 _Performance gain_: In the case of `fact 100`, this optimization combined with the previous one reduces execution time from ~160μs to ~155μs. ### Using a version of Zarith with a fast-path https://gitlab.com/yrg/tezos/-/commit/819d82a85937c6185493cc9d7fbe085840651e9a Xavier Leroy developed a branch of Zarith with a fast-path for small integers where operations are directly implemented in OCaml. This should prevent some expensive calls to C functions to occur. _Performance gain_: In the case of `fact 100`, this optimization reduces execution time from ~155μs to ~145μs. ### Using saturating arithmetic https://gitlab.com/yrg/tezos/-/commit/fb569b3446e8f3b3daec73e69a32b17a52f6d713 Since the gas limit per operation is 1040000000 mgas, the `gas_counter` can be represented using an OCaml `int` (even on 32-bits architectures). Indeed, 2 ^ 30 = 1073741824 > 1040000000. The cost model can produce values outside of this range but since they are ultimately subtracted to `gas_counter`, we simply have to use a saturated arithmetic over `int` instead of the arbitrary precision arithmetic from `ZArith.t`. _Performance gain_: In the case of `fact 100`, this optimization reduces execution time from ~145μs to ~120μs. ### Some room for improvement #### Put `gas_counter` outside the context In an experiment, we tried to represent `gas_counter` as a parameter of the step function to have the compiler store its value in a register. The difficulty is to properly embed this parameter in the context when the interpreter calls a function that consumes some gas. For the moment, there is an error in the code since the final amount of gas is invalid. However, the performance gain seems significant. #### Break the `raw_protocol_alpha` barrier for more inlining The build system of Tezos is not standard because protocols are implemented as plugins executed under a specific environment. Surprisingly, this seems to have some impact on the compiler's potency for inlining even inside a given protocol implementation. For instance, we failed at convincing the compiler to inline the functions in `Alpha_context.Gas` even though some of them are small and called in a critically hot spot, e.g. `Raw_context.consume_gas`. ## Optimizing `cost_of_instr` `cost_of_instr` inspects the instruction and the current stack and computes a cost using the cost model provided by the module `Michelson_v1_gas`. This cost is decomposed into a constant cost named `cycle` and a cost which may vary depending on the values on the stack. #### A general remark about `cost_of_instr` `cost_of_instr` would be faster if it was mapping each instruction to a mere constant, i.e., if it did not depend on the size of the data manipulated by the instruction. Unfortunately, this cannot be done to properly reject some adversarial programs. However, we can try to optimise `cost_of_instr` to follow that principle for the instructions manipulating "small enough to be harmless" data. To that end, we could specialize `cost_of_instr` for these harmless (and probably most common) cases. We could also simplify the cost model to make cost function faster to compute. Finally, we have found some optimizations opportunities in the current implementation. #### Micro-optimizations of the cost model https://gitlab.com/yrg/tezos/-/commit/7b096266dbf848c8db571513ea626aa73594e576 Thanks to a few micro-optimizations of the cost model implementation, we were able to reduce the execution time of `fact 100` from ~120μs to 95μs. ## Representing the gas counter as a parameter of step Even though we reduce the cost of the context functional updates, the gas counter is still a field of a record, that is a memory cell in a heap allocated block. Given that the gas counter is a value which is read and write very often, it would make sense to have it stored in a machine register. _Idea_: - By passing the gas counter as a parameter and a return value of step, the OCaml compiler is likely to store its value is a machine register. - The difficulty here is to make sure to properly update the gas counter which is in the context when the interpreter calls functions that consume gas. Fortunately, by introducing a new type for `outdated_context`, we make sure that all these calls are tracked down by the typechecker. _Performance gain_: - In the case of `fact 100`, this optimization reduces execution time from 95μs to ~75μs. ## Checkpoint At this point, if we deactivate the gas update and check for exhaustion, we still get an execution time of ~60μs for `fact 100`, this means that we have reduced the execution time for gas update and check for exhaustion from 200μs to 15μs. ## Next actions - Try flambda - Evaluate the usage of tabulation-based techniques to compute cost functions - Evaluate the idea of having Zarith numbers carry their sizes - Rebase with respect to metastatedev/proto-proposal - Ask Pierre Chambart what [@@inline always] means for the vanilla compiler - Prepare some MRs for the well understood optimizations