VampIR optimization

To optimize VampIR we can either: 1. Support more features (e.g. Lookups gates, costum gates, efficient built in intrinsics) 2. Pack 3AC circuits intelligently in plonk-ish provable models, as drafted in [here](https://github.com/anoma/vamp-ir/tree/plonk3). 3. Semplify 3AC circuits themselves. We decided to opt for the 3rd approach, which could in turn be more heavy on compilation, but results in faster proving and verification. ## External systems for simplification + [GAP](https://www.gap-system.org/Overview/overview.html) - Can be used to manipulate (maybe simplify) polynomials over finite fields. - Does this require converting 3ac into a single polynomial and, if so, does this cost more than it saves? + We could treat our system as a system of rational numbers. This would allow any CAS - Examples: Sage, Mathematica, others? - Does this preseve sematics? Could there be more (or fewer) valid witnesses to a rationally simplified circuit? #### LLVM The strategy is to convert vampir 3ac into llvm IR, then use [LLVM passes](https://llvm.org/docs/Passes.html) to perform optimization and analysis. We have a WIP script for converting VampIR 3AC circuit in LLVM IR in [here](https://github.com/anoma/vamp-ir/blob/llvm_optimization/src/llvm_transform.rs). The demo outpus an `.ll` llvm IR script. For example for `fresh.pir` we get the following `test.ll`: ```. define i1 @main(i64 %a, i64 %b) { entry: %v165 = add i64 %a, %b %v178p = alloca i64 store i64 %v165, i64* %v178p %v178= load i64, i64* %v178p %v179 = urem i64 %v178, 2 %v180 = udiv i64 %v165, 2 %v181 = urem i64 %v180, 2 %v182 = udiv i64 %v165, 4 %v183 = urem i64 %v182, 2 %v184 = udiv i64 %v165, 8 %v185 = urem i64 %v184, 2 %v186 = udiv i64 %v165, 16 %v187 = urem i64 %v186, 2 %v199p = alloca i64 store i64 %v179, i64* %v199p %v199= load i64, i64* %v199p %v201p = alloca i64 store i64 %v181, i64* %v201p %v201= load i64, i64* %v201pMid-level spec synthesis %v203p = alloca i64 store i64 %v183, i64* %v203p %v203= load i64, i64* %v203p %v205p = alloca i64 store i64 %v185, i64* %v205p %v205= load i64, i64* %v205p %v207p = alloca i64 store i64 %v187, i64* %v207p %v207= load i64, i64* %v207p %v210 = sub i64 %v199, 1 %v211 = mul i64 %v199, %v210 %v213 = sub i64 %v201, 1 %v214 = mul i64 %v201, %v213 %v216 = sub i64 %v203, 1 %v217 = mul i64 %v203, %v216 %v219 = sub i64 %v205, 1 %v220 = mul i64 %v205, %v219 %v222 = sub i64 %v207, 1 %v223 = mul i64 %v207, %v222 %v224 = mul i64 2, %v201 %v225 = add i64 %v199, %v224 %v226 = mul i64 4, %v203 %v227 = add i64 %v225, %v226 %v228 = mul i64 8, %v205 %v229 = add i64 %v227, %v228 %v230 = mul i64 16, %v207 %v231 = add i64 %v229, %v230 %subexp1 = add i64 %a, %b %constr1 = icmp eq i64 %v165, %subexp1 %subexp2 = sub i64 %v199, 1 %constr2 = icmp eq i64 %v210, %subexp2 %subexp3 = mul i64 %v199, %v210 %constr3 = icmp eq i64 %v211, %subexp3 %subexp4 = sub i64 %v201, 1 %constr4 = icmp eq i64 %v213, %subexp4 %subexp5 = mul i64 %v201, %v213 %constr5 = icmp eq i64 %v214, %subexp5 %subexp6 = sub i64 %v203, 1 %constr6 = icmp eq i64 %v216, %subexp6 %subexp7 = mul i64 %v203, %v216 %constr7 = icmp eq i64 %v217, %subexp7 %subexp8 = sub i64 %v205, 1 %constr8 = icmp eq i64 %v219, %subexp8 %subexp9 = mul i64 %v205, %v219 %constr9 = icmp eq i64 %v220, %subexp9 %subexp10 = sub i64 %v207, 1 %constr10 = icmp eq i64 %v222, %subexp10 %subexp11 = mul i64 %v207, %v222 %constr11 = icmp eq i64 %v223, %subexp11 %subexp12 = mul i64 2, %v201 %constr12 = icmp eq i64 %v224, %subexp12 %subexp13 = add i64 %v199, %v224 %constr13 = icmp eq i64 %v225, %subexp13 %subexp14 = mul i64 4, %v203 %constr14 = icmp eq i64 %v226, %subexp14 %subexp15 = add i64 %v225, %v226 %constr15 = icmp eq i64 %v227, %subexp15 %subexp16 = mul i64 8, %v205 %constr16 = icmp eq i64 %v228, %subexp16 %subexp17 = add i64 %v227, %v228 %constr17 = icmp eq i64 %v229, %subexp17 %subexp18 = mul i64 16, %v207 %constr18 = icmp eq i64 %v230, %subexp18 %subexp19 = add i64 %v229, %v230 %constr19 = icmp eq i64 %v231, %subexp19 %constr20 = icmp eq i64 %v211, 0 %constr21 = icmp eq i64 %v214, 0 %constr22 = icmp eq i64 %v217, 0 %constr23 = icmp eq i64 %v220, 0 %constr24 = icmp eq i64 %v223, 0 %constr25 = icmp eq i64 %v165, %v231 %constr26 = icmp eq i64 %v199, 0 %constr27 = icmp eq i64 %v201, 1 %res1 = and i1 %constr1, %constr2 %res2 = and i1 %res1, %constr3 %res3 = and i1 %res2, %constr4 %res4 = and i1 %res3, %constr5 %res5 = and i1 %res4, %constr6 %res6 = and i1 %res5, %constr7 %res7 = and i1 %res6, %constr8 %res8 = and i1 %res7, %constr9 %res9 = and i1 %res8, %constr10 %res10 = and i1 %res9, %constr11 %res11 = and i1 %res10, %constr12 %res12 = and i1 %res11, %constr13 %res13 = and i1 %res12, %constr14 %res14 = and i1 %res13, %constr15 %res15 = and i1 %res14, %constr16 %res16 = and i1 %res15, %constr17 %res17 = and i1 %res16, %constr18 %res18 = and i1 %res17, %constr19 %res19 = and i1 %res18, %constr20 %res20 = and i1 %res19, %constr21 %res21 = and i1 %res20, %constr22 %res22 = and i1 %res21, %constr23 %res23 = and i1 %res22, %constr24 %res24 = and i1 %res23, %constr25 %res25 = and i1 %res24, %constr26 %res26 = and i1 %res25, %constr27Mid-level spec synthesis ret i1 %res26 } ``` This file is structured with: definitions of variables first, then constraints that return a boolean value and then all the booleans returned by the constraint are ANDed together so that the transformations do not alter the original function. One can then use LLVM passes to expore optimization strategies from command line. :::info **CL instruction for LLVM IR** First convert from `.ll` to `.bc`: ``` llvm-as <test.ll> -o <output_code.bc> ``` Then apply the pass (here instcombine): ``` opt -instcombine <output_code.bc> -o <optimized_instcombine.bc> ``` To get back to a `.ll` file one runs: ``` llvm-dis <optimized_instcombine.bc> -o <optimized_instcombine.ll> ``` ::: The result we get in this example is: ```. ; ModuleID = '/home/carlo/Documents/Work/optimized_fresh_instcombine.bc' source_filename = "/home/carlo/Documents/Work/test.ll" define i1 @main(i64 %a, i64 %b) { entry: %v165 = add i64 %a, %b %v179 = and i64 %v165, 1 %v180 = lshr i64 %v165, 1 %v181 = and i64 %v180, 1 %v182 = lshr i64 %v165, 2 %v183 = and i64 %v182, 1 %v184 = lshr i64 %v165, 3 %v185 = and i64 %v184, 1 %v186 = lshr i64 %v165, 4 %v187 = and i64 %v186, 1 %v210 = add nsw i64 %v179, -1 %v211 = mul nuw nsw i64 %v179, %v210 %v213 = add nsw i64 %v181, -1 %v214 = mul nuw nsw i64 %v181, %v213 %v216 = add nsw i64 %v183, -1 %v217 = mul nuw nsw i64 %v183, %v216 %v219 = add nsw i64 %v185, -1 %v220 = mul nuw nsw i64 %v185, %v219 %v222 = add nsw i64 %v187, -1 %v223 = mul nuw nsw i64 %v187, %v222 %subexp3 = mul nuw nsw i64 %v179, %v210 %constr3 = icmp eq i64 %v211, %subexp3 %subexp5 = mul nuw nsw i64 %v181, %v213 %constr5 = icmp eq i64 %v214, %subexp5 %subexp7 = mul nuw nsw i64 %v183, %v216 %constr7 = icmp eq i64 %v217, %subexp7 %subexp9 = mul nuw nsw i64 %v185, %v219 %constr9 = icmp eq i64 %v220, %subexp9 %subexp11 = mul nuw nsw i64 %v187, %v222 %constr11 = icmp eq i64 %v223, %subexp11 %constr27 = icmp ne i64 %v181, 0 %res4 = and i1 %constr3, %constr5 %res6 = and i1 %res4, %constr7 %res8 = and i1 %res6, %constr9 %res10 = and i1 %res8, %constr11 %0 = or i64 %v211, %v214 %1 = or i64 %0, %v217 %2 = or i64 %1, %v220 %3 = or i64 %2, %v223 %4 = and i64 %v165, -31 %5 = or i64 %4, %3 %6 = icmp eq i64 %5, 0 %7 = and i1 %6, %res10 %res26 = and i1 %7, %constr27 ret i1 %res26 } ``` We see how this optimization pass introduces new operations between numbers such as `lshr` which have to be handled with care when compiling back to vamp-ir as they could be costly in an arithmetic circuit representation. :::info :bulb: **Quick notes on LLVM passes** There are two sorts of passes; analysis passes and transformation passes. Some of the analysis passes don't effect the program, but produce useful information that we may be able to use. Specifically -domfrontier, -domtree, and -regions, which may provide relevant info for us to use for custom optimizations. Here are the transformation passes I think could be used; -adce, -dce, and -die as they merely remove dead code. -instcombine and -aggressive-instcombine, MAYBE. In the examples, it describes multiplying by 2 being turned into a shift by one. This specific example could easily be undone during translation back into Vamp-IR; just turn shifts into multiplications by powers of 2. Transformations may snowball, so that some operation turns into shifts that turn into boolean operations, etc, and it may become nontrivial to recover arithmetic equivalents. I do wonder if this might violate some assumption about the field size since these optimizations could assume we're doing arithmetic mod 2^64, or something. Care needs to be taken, but it's obviously relevant. All the implemented simplifications can be found here; https://llvm.org/doxygen/InstructionCombining_8cpp.html https://llvm.org/doxygen/AggressiveInstCombine_8cpp.html Reading through the list, it doesn't look like it actually does much, and the aggressive variant seems to mostly add bit and array-fold stuff that's not relevant to us. It does do basic algebraic transformations, like factoring, though. My concern may be unfounded, but I'd keep it in mind during testing. -sccp, I think this is the transformation that removes variables which are assigned directly to constants. The vast majority of analysis and transformation passes don't really make sense for Vamp-IR, as they're for things like loop manipulation, stack manipulation, value vs reference, block optimization, allocation, inlining, etc that simply doesn't apply to Vamp-IR ::: ## Internal VampIR optimization functions In this [blog post example](https://baptiste-wicht.com/posts/2012/02/local-optimization-on-three-address-code.html) are explained several optimizations for 3AC cictuits that we can implement by hand on VampIR. Already implemented `copy_propagete` and `eliminate_dead_equalities` are in `transform.rs`. Following this lines, we could implement more optimization strategies. :::info :bulb: **Quick notes on Blog post** I'm not sure about the applicability of 2. Is trading a multiplication by a constant worse for performance than the addition of two variables? I've been operating on the assumption that it isn't for arithmetic circuits, but I'm not actually sure. A lot of the optimizations pertain to assembly blocks, which aren't really relevant to Vamp-IR. It didn't give much in the way of insights into doing simplifications for our problem; though some equations in section 1 might be missing from the `infix_op` function in transform.rs. Although x = 0 / a => x = 0 and x = a / a => x = 1 are only guaranteed if we know that a != 0, so I'm not confident we should use them. ::: String diagram simplification of 3AC: + Most are obvious, but interactions between duplicators, addition, and multiplciation less so. - [Towards an algebraic theory of Boolean circuits](https://core.ac.uk/download/pdf/82687279.pdf) - [String Diagram Rewrite Theory](https://arxiv.org/pdf/2012.01847.pdf) + Prototype implementation can be found [here](https://github.com/anoma/vamp-ir/tree/extra-export). :::info **Using string diagram simplifier** Right now, the demo simply prints a constaiMid-level spec synthesisnt system to a file. Calling ``` cargo run --release generate three-address-file -s tests/fresh.pir -o tests/fresh.3ac ``` in the demo repo will compile and simplify the constraint system in `fresh.pir` and store it into `tests/fresh.3ac`. For a point of comparison, the default 3ac circuit for this file looks like this ``` Var202 = Var191 - 1, Var203 = Var191 * Var202, Var205 = Var193 - 1, Var206 = Var193 * Var205, Var208 = Var195 - 1, Var209 = Var195 * Var208, Var211 = Var197 - 1, Var212 = Var197 * Var211, Var214 = Var199 - 1, Var215 = Var199 * Var214, Var216 = 2 * Var193, Var217 = Var191 + Var216, Var218 = 4 * Var195, Var219 = Var217 + Var218, Var220 = 8 * Var197, Var221 = Var219 + Var220, Var222 = 16 * Var199, Var223 = Var221 + Var222, Var203 = 0, Var206 = 0, Var209 = 0, Var212 = 0, Var215 = 0, Varx = Var223, Var191 = 0, Var193 = 1, ``` after simplification, this becomes ``` Var3 = Varx + Var4, Var5 = 4 * Var2, Var6 = 0, Var7 = 16 * Var1, Var8 = 0, Var2 = 1 + Var10, Var1 = 1 + Var11, Var0 = 1 + Var12, Var4 = Var7 + Var5, Var13 = 8 * Var0, Var6 = Var2 * Var10, Var8 = Var1 * Var11, Var14 = Var0 * Var12, Var13 = 2 + Var3, Var14 = 0, ``` Note that the demo only preserves public variables (x/Varx, in this case), so private variables may be disposed of or replaced. ::: ## Inspirations from other fields + ML optimization for silicon circuits optimization - [Nvidia reinforcement learning approach](https://developer.nvidia.com/blog/designing-arithmetic-circuits-with-deep-reinforcement-learning/), [paper](https://arxiv.org/pdf/2205.07000.pdf) + Supercompilation + Tensor network rapresentation of arithmetic circuit - Forumla compression - [From mulitvariate functions to Tensor Networks](https://journals.aps.org/prresearch/pdf/10.1103/PhysRevResearch.5.013156) - [Tensor networks optimization by google](https://www.tensorflow.org/guide/graph_optimization)