# 2022 XX.YY Q4 Gas costs regressions detected twice a week

Gas cost regressions are detected twice a week and developer processes are automated.

tags: blockchain security, smart contracts, gas.

## Workload

### Current Capacity (October 17th)

- Nomadic Labs: 3 x eng @ 10 weeks. (Lucas, Nicolas, Raphaël)
- Dai Lambda: 2.5 eng
- Personal Time Off: ?? (1w+3w)
- Eng: ??

### Project load

## Running benchmarks twice a week (continuous workload)

https://gitlab.com/tezos/tezos/-/milestones/143

- 3 weeks
- Start: 2022-10-17
- End: 2022-10-28
- Run the full Snoop benchmark and infer all parameters twice a week on the reference machine. Reaching this frequency should require neither parallelism nor optimization of Snoop.

Out of scope:

- Publishing data

## Automated regression report on inferred models

https://gitlab.com/tezos/tezos/-/milestones/143

- 2 weeks
- Start: 2022-10-31
- End: 2022-11-11
- Depends on "Running benchmarks twice a week"
- Create and publish a report for each run that shows the difference (of inferred values) from some reference (could be the previous run or something fixed).

## Regression alerts

https://gitlab.com/tezos/tezos/-/milestones/143

- 2 weeks
- Start: 2022-11-14
- End: 2022-11-25
- Depends on "Automated regression report on inferred models"
- Send an alert when the difference from the reference exceeds a given threshold.

## Export quality statistics from benchmarks and inference

- 5 weeks
- Start: 2022-10-17
- End: 2022-11-11
- A complementary way to improve our confidence in the security of this critical part of the protocol is to measure the quality of the benchmarks (in particular the standard deviation of individual timings) and of the inference (how well the model fits the empirical data). Snoop could be adapted to report this information.
- Check with Illias.

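
The two quality measures named in "Export quality statistics from benchmarks and inference" can be sketched as follows. This is only an illustration in Python, not Snoop code: the function names `timing_dispersion` and `r_squared` are hypothetical, and the linear model `t = intercept + slope * size` is an assumption standing in for whatever model a given benchmark actually uses.

```python
import statistics


def timing_dispersion(timings_ns):
    """Summarize the spread of the raw timings of one benchmark input point.

    Returns (median, sample standard deviation, coefficient of variation).
    Expects at least two timings with a non-zero mean.
    """
    median = statistics.median(timings_ns)
    stdev = statistics.stdev(timings_ns)
    mean = statistics.fmean(timings_ns)
    return median, stdev, stdev / mean


def r_squared(sizes, timings_ns, intercept, slope):
    """Coefficient of determination of an assumed linear model.

    The model predicts t = intercept + slope * size. A value close to 1
    means the inferred parameters explain the measurements well; a low or
    negative value points at a bad fit.
    """
    mean_t = statistics.fmean(timings_ns)
    ss_res = sum(
        (t - (intercept + slope * n)) ** 2 for n, t in zip(sizes, timings_ns)
    )
    ss_tot = sum((t - mean_t) ** 2 for t in timings_ns)
    if ss_tot == 0:
        # All timings are identical: the fit is perfect only if it matches them.
        return 1.0 if ss_res == 0 else 0.0
    return 1.0 - ss_res / ss_tot
```

A report could flag a benchmark as low-quality when its coefficient of variation is high or its fit score is low; which cut-off values to use is left open here.
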
## Fix flaky benchmarks

https://gitlab.com/tezos/tezos/-/milestones/143

- 3 weeks
- Start: 2022-11-28
- End: 2022-12-16
- Depends on either "Regression alerts" or "Export quality statistics from benchmarks and inference"; ideally on both.
- We expect some benchmarks to raise false-positive alerts or be reported as low-quality. These alerts will be analysed and the flaky benchmarks will be fixed.

## Solving the problem of ambiguity of parameters / Improve iterative developer workflow in snoop

- 7 weeks
- The handling of parameters in Snoop is fragile: if two models refer to conceptually different parameters using the same name, the inference can produce incorrect results. There are probably several solutions to this problem, but all of them apparently require a deep refactoring of Snoop.

Inverse workflow: From variable name, find models, and then benchmarks.

## Benchmarking more metrics (part 1)

- 6 weeks
- Snoop currently only benchmarks time measurements and ignores the result of the benchmarked computations (in particular, it does not store how much gas was consumed). Various projects to improve the security of the gas model require Snoop to measure other metrics. This requires a refactoring of the part of Snoop related to benchmarking.
- 1) Verify a posteriori
- 2) Less machine dependent
- 3) Handle memory allocation

## Fine-Tune machines for benchmarking

- No CPU-freq scaling.
- Disable address space randomization for memory.
- Disable hyperthreading to minimize cache pollution, or switch to machines that do not have it.

## Count CPU instructions

- 2 weeks.
- Depends on "Benchmarking more metrics".
- Counting CPU instructions makes Snoop much more machine independent (but still very architecture dependent). We want this for two reasons: the availability of the benchmarking machine is at risk (Dedibox no longer offers the exact same hardware), and it would enable running benchmarks in the CI.

How:

- 1) Run all benchmark processes using `perf stat record`.
- 1a) Report: Verify/doc any measurable overhead.
- 1b) Report: Investigate if numbers are usable.
- 2) Optional: Only measure region/function instead of full process (See fd perf record).

## Counting memory allocation

- 3 weeks
- Depends on "Benchmarking more metrics"
- Gas in the Tezos protocol is used to bound two kinds of resources needed to validate operations and blocks: time and memory. Snoop currently only provides time models; memory models are still hand-written. Snoop should be extended to account for memory consumption and somehow plug memory results of benchmarks into the models.

## Validation of gas models using micro benchmarks

- 5 weeks
- Depends on "Benchmarking more metrics"
- Currently, empirical data coming from benchmarks is used to infer gas parameters, which are then plugged into the Tezos protocol. Several steps carry a small risk of introducing an error that could have dramatic consequences for the security of the protocol:
  - during benchmarks, the automated detection of outliers (by default, Snoop takes for each input point the median of 3000 measures) could be too optimistic;
  - during inference, the model could be wrong, leading to a bad fit of the empirical data;
  - during code generation, parameters are rounded (and the error incurred by this rounding could accumulate on large inputs);
  - finally, the code in the protocol is usually edited by hand and only reviewed by human beings.
- To increase our confidence in the security of the gas model, we could record at benchmarking time the gas consumed during the run of each benchmark (using the micro benchmarks that we already have, which cover pretty well all cases of carbonated code in the protocol) and store the time/gas ratio to detect cases where it is abnormally high.

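
The time/gas ratio check described above could look roughly like the sketch below. It is an illustration only: the function `flag_high_time_per_gas`, the benchmark names in the usage example, and the rule "abnormally high means more than `factor` times the median ratio" are all assumptions, since the text does not say how the abnormal cases would be detected.

```python
import statistics


def flag_high_time_per_gas(samples, factor=2.0):
    """Return the benchmarks whose time/gas ratio is abnormally high.

    `samples` maps a benchmark name to a (time_ns, gas_consumed) pair.
    A ratio counts as abnormal when it exceeds `factor` times the median
    ratio over all benchmarks (an assumed rule, chosen for illustration).
    Benchmarks that consumed no gas are skipped to avoid dividing by zero.
    """
    ratios = {
        name: time_ns / gas
        for name, (time_ns, gas) in samples.items()
        if gas > 0
    }
    if not ratios:
        return []
    reference = statistics.median(ratios.values())
    return sorted(name for name, r in ratios.items() if r > factor * reference)


# Hypothetical measurements: (time in ns, gas consumed) per micro benchmark.
measurements = {
    "add": (100, 100),
    "mul": (210, 200),
    "concat": (900, 300),
    "hash": (1000, 1000),
}
suspicious = flag_high_time_per_gas(measurements)
```

A benchmark flagged this way takes more time per unit of gas than its peers, which is the situation where the gas model undercharges and the protocol is exposed.
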
## Validation of gas models using macro benchmarks

- 4 weeks
- The gas model is based on the assumption that timing each instruction independently using micro benchmarks leads to a secure system when the gas costs are combined for larger programs. We should challenge this assumption by comparing the execution time and gas consumption of operations that take as much gas as the protocol allows. As a first step, these macro benchmarks would be hand-crafted.

## M Gas protocol update (Frej)

## SCORU Carbonation assistance (Frej)

## Risks:

- We are understaffed (we have 30 weeks of engineering time but 42 weeks of work) -> we should keep half of this list and postpone the other half to Q5
- "Might be a bit tricky to convince world to add Q5 in 2022"