# Optimizing the gas update and checking
Period: <2020-10-14> -- <2020-10-22>
## Preliminary notes
We have moved to a dedicated machine to run the benchmark.
Hence, the numbers are slightly different from the ones obtained in the
previous benchmark. They are probably more robust because no other
active processes run on the machine and this machine is not a laptop.
## How much CPU time is devoted to gas update and checking?
The current implementation of Michelson computes `fact 100` in
260μs on the benchmarking machine (Intel(R) Xeon(R) CPU E3-1245 v5 @ 3.50GHz).
~260μs - ~60μs = ~200μs is spent in updating and checking gas.
More precisely, if we remove the call to `cost_of_instr`,
the execution time decreases from ~260μs to ~150μs
and if we remove the call to `Gas.consume`,
the execution time decreases from ~150μs to ~60μs.
Decomposing a little bit more, `Gas.consume` uses ~50% of its CPU time to substract
the instruction cost to two gas counters (`block_gas` and `operation_gas`) and to
test if they are still positive. ~50% are used to copy the context
in order to update the two gas counters (!).
Roughly speaking, we followed three strategies
to optimize the gas update and the check for the gas exhaustion:
- (i) reduce the number of times `cost_of_instr` and `Gas.consume` are called ;
- (ii) optimize `Gas.consume` ;
- (iii) optimize `cost_of_instr`.
To make sure that each optimization did not change the way gas is
consumed, we checked that the final amount of gas did not change
after each optimization.
## Summary of results
| Optimization | Running time |
|----------------------------------------------------------|-------|
| Initial running time | 260μs |
| Reducing the frequency of gas updates and checks | 200μs |
| Optimizing the gas updates | 170μs |
| Optimizing the context update | 160μs |
| Using efficient comparison operators for native integers | 155μs |
| Using a version of Zarith with a fast-path | 145μs |
| Using saturating arithmetic | 120μs |
| Several Micro-optimizations of the cost model | 95μs |
| Representing the gas counter as a parameter of step | 75μs |
These optimizations could have been applied in a different order
and probably interact with each other.
The initial report from `linux-perf` can be found here:
https://hackmd.io/3nkjs8EWRM2UmVBVXoWp9w
The final report from `linux-perf` can be found here:
https://hackmd.io/cJqCYvCkQGmjOTagulYm6w
## Reducing the frequency of gas updates and checks
https://gitlab.com/yrg/tezos/-/commit/ab9dd1dcc2abda1e6597527f16db9fe94d4c382e
There are many sequences of instructions whose costs do
not depend on the data. These costs can be computed statically
and regrouped.
_Idea_:
We apply a preprocessing to Michelson scripts before executing
them. This preprocessor inserts new instructions of the form `PayGas (d, i)`
where `i` is a sequence of instructions of statically determined cost `d`.
This preprocessor rewrites each atomic instruction `i` which has a data
independent (hence statically determined) cost `d` into an
instruction of the form `PayGas (d, i)`. While moving up to the
root of the script AST, sequences of `PayGas` are merged together
as follows:
PayGas (c1, i1); PayGas (c2, i2) => PayGas (c1 + c2, i1; i2)
A similar rule is valid for `Dip` and `Dipn`, which apply an operation
deeply in the stack.
It is NOT a valid rule for any instruction that iterates nested instructions,
e.g. `Loop`.
On the factorial, the preprocessor produces the following script:
```
{ PAYGAS 220 { CAR ; { PUSH int 1 ; { SWAP ; PUSH bool True } } } ;
{ LOOP { PAYGAS 160 { DUP ; { PUSH int 1 ; SWAP } } ;
{ SUB ;
{ PAYGAS 200 { SWAP ; DIP 1 { SWAP ; {} } } ;
{ MUL ;
{ PAYGAS 160 { SWAP ; { DUP ; PUSH int 0 } } ;
{ COMPARE int ; PAYGAS 120 { NEQ ; {} } } } } } } } ;
PAYGAS
360
{ SWAP ; { DIP 1 { DROP ; {} } ; { NIL ; { PAIR ; {} } } } } } }
```
_Performance gain_:
In the case of `fact 100`, this optimization reduces execution
time from ~260μs to ~200μs.
_Remarks_:
- The preprocessing is linear in the size of the input script and runs pretty fast.
- This optimization may become obsolete if we introduce a superinstruction-based JIT.
## Optimizing `Gas.consume`
### Optimizing the gas updates
https://gitlab.com/yrg/tezos/-/commit/3f026374a9452c3b198ade8053adcb3a4ab6783
Let `min_gas` be the minimum between `block_gas` and `operation_gas`,
and `max_gas` be the maximum of them.
If one of the two counters has to hit 0, then it must be `min_gas`.
_Idea_:
Gas updates and checks should only be done on `min_gas`.
By remembering which of `block_gas` and `operation_gas`
was the minimum using a boolean, we can update 'max_gas' afterwards. This boolean can
also be used to produce the right error message in case of gas exhaustion.
This optimization removes two `Zarith` operations over four in
the critical routine `consume_gas`.
_Performance gain_:
In the case of `fact 100`, this optimization reduces execution
time from ~200μs to ~170μs.
### Optimizing the context update
https://gitlab.com/yrg/tezos/-/commit/5615a682b89acd798f2565708c56b168ecfefc85
In the function `Raw_context.consume`:
```ocaml
let consume_gas ctxt cost =
Gas_limit_repr.raw_consume ctxt.gas_counter ctxt.back.count_block_gas cost
>>? fun gas_counter -> ok {ctxt with gas_counter}
```
The construction of the returned context requires the copy of 24 fields.
Most of them are rarely updated during execution except the `gas_counter` (introduced in the previous optimization).
_Idea_:
Structure the context in two layers to only copy 2 fields instead of 24.
_Performance gain_:
In the case of `fact 100`, this optimization reduces execution time from ~170μs to ~160μs.
### Using efficient comparison operators for native integers
The following patch has been applied to properly compare native integers
in the protocol codebase:
https://gitlab.com/tezos/tezos/-/merge_requests/2171
_Performance gain_:
In the case of `fact 100`, this optimization combined with the
previous one reduces execution time from ~160μs to ~155μs.
### Using a version of Zarith with a fast-path
https://gitlab.com/yrg/tezos/-/commit/819d82a85937c6185493cc9d7fbe085840651e9a
Xavier Leroy developed a branch of Zarith with a fast-path for small integers
where operations are directly implemented in OCaml. This should prevent
some expensive calls to C functions to occur.
_Performance gain_:
In the case of `fact 100`, this optimization reduces execution time from
~155μs to ~145μs.
### Using saturating arithmetic
https://gitlab.com/yrg/tezos/-/commit/fb569b3446e8f3b3daec73e69a32b17a52f6d713
Since the gas limit per operation is 1040000000 mgas, the `gas_counter`
can be represented using an OCaml `int` (even on 32-bits architectures). Indeed,
2 ^ 30 = 1073741824 > 1040000000.
The cost model can produce values outside of this range but since they are
ultimately subtracted to `gas_counter`, we simply have to use a
saturated arithmetic over `int` instead of the arbitrary precision
arithmetic from `ZArith.t`.
_Performance gain_:
In the case of `fact 100`, this optimization reduces execution time from
~145μs to ~120μs.
### Some room for improvement
#### Put `gas_counter` outside the context
In an experiment, we tried to represent `gas_counter` as a parameter of
the step function to have the compiler store its value in a register.
The difficulty is to properly embed this parameter in the context when
the interpreter calls a function that consumes some gas.
For the moment, there is an error in the code since the final amount of
gas is invalid. However, the performance gain seems significant.
#### Break the `raw_protocol_alpha` barrier for more inlining
The build system of Tezos is not standard because protocols are implemented
as plugins executed under a specific environment.
Surprisingly, this seems to have some impact on the compiler's potency for
inlining even inside a given protocol implementation. For instance, we
failed at convincing the compiler to inline the functions in `Alpha_context.Gas` even though some of them are small and called in
a critically hot spot, e.g. `Raw_context.consume_gas`.
## Optimizing `cost_of_instr`
`cost_of_instr` inspects the instruction and the current stack and computes
a cost using the cost model provided by the module `Michelson_v1_gas`. This
cost is decomposed into a constant cost named `cycle` and a cost which may
vary depending on the values on the stack.
#### A general remark about `cost_of_instr`
`cost_of_instr` would be faster if it was mapping each instruction to a mere
constant, i.e., if it did not depend on the size of the data manipulated by
the instruction. Unfortunately, this cannot be done to properly reject some adversarial programs.
However, we can try to optimise `cost_of_instr` to follow that principle for the
instructions manipulating "small enough to be harmless" data. To that end,
we could specialize `cost_of_instr` for these harmless (and probably most common) cases.
We could also simplify the cost model to make cost function faster to compute.
Finally, we have found some optimizations opportunities in the current implementation.
#### Micro-optimizations of the cost model
https://gitlab.com/yrg/tezos/-/commit/7b096266dbf848c8db571513ea626aa73594e576
Thanks to a few micro-optimizations of the cost model implementation, we
were able to reduce the execution time of `fact 100` from ~120μs to 95μs.
## Representing the gas counter as a parameter of step
Even though we reduce the cost of the context functional updates, the gas
counter is still a field of a record, that is a memory cell in a heap
allocated block. Given that the gas counter is a value which is read
and write very often, it would make sense to have it stored in a machine
register.
_Idea_:
- By passing the gas counter as a parameter and a return value of step,
the OCaml compiler is likely to store its value is a machine register.
- The difficulty here is to make sure to properly update the gas counter which
is in the context when the interpreter calls functions that consume gas.
Fortunately, by introducing a new type for `outdated_context`, we make
sure that all these calls are tracked down by the typechecker.
_Performance gain_:
- In the case of `fact 100`, this optimization reduces execution time from 95μs to ~75μs.
## Checkpoint
At this point, if we deactivate the gas update and check for exhaustion, we still get
an execution time of ~60μs for `fact 100`, this means that we have reduced the execution
time for gas update and check for exhaustion from 200μs to 15μs.
## Next actions
- Try flambda
- Evaluate the usage of tabulation-based techniques to compute cost functions
- Evaluate the idea of having Zarith numbers carry their sizes
- Rebase with respect to metastatedev/proto-proposal
- Ask Pierre Chambart what [@@inline always] means for the vanilla compiler
- Prepare some MRs for the well understood optimizations