# Week 42 / Notes / Report Working on understanding the impl for `ProveExpression` in `evaluation_gpu.rs` particularly `_eval_gpu()` within it. Working on understanding `Evaluator` and particularly `evaluate_h()` within it, in `evaluation.rs`. This is stillvery much a WIP to fully understand and document the evaluation using GPU kernel, modify and improve the running time for poly_h. ### Evaluator Inside `evaluation.rs`: L275 ```Rust! impl<C: CurveAffine> Evaluator<C> { new(); add_rotation(); add_constant(); add_expression(); evaluate_h(); // non cuda feautre evaluate_h(); // cuda gpu feautre both evaluate_h() return -> Polynomial<C::ScalarExt, ExtendedLagrangeCoeff> } ``` where i left off from last week. ### evaluate_h() Investigate: Inside `evaluation.rs` cuda feature for `evaluate_h()` L801 create new BTreemap "units" which is `BTreeMap<usize, ProveExpression>` and "unit_stat" which is `BTreeMap<usize, usize>` for `gpu_gates_expr ` prepetch units and unit_stat get values (type Polynomial) run permutations. run lookups. ### ProveExpression & reconstruct Inside `evaluation_gpu.rs` ```Rust! impl<F: FieldExt> ProveExpression<F> { pub(crate) fn new() -> Self pub(crate) fn from_expr(e: &Expression<F>) -> Self pub(crate) fn add_gate(self, e: &Expression<F>) -> Self fn reconstruct_coeff(coeff: BTreeMap<u32, F>) -> Self fn reconstruct_units(mut us: BTreeMap<ProveExpressionUnit, u32>) -> Self fn reconstruct_units_coeff( us: BTreeMap<ProveExpressionUnit, u32>, coeff: BTreeMap<u32, F>, ) -> Self fn reconstruct_tree( mut tree: Vec<(BTreeMap<ProveExpressionUnit, u32>, BTreeMap<u32, F>)>, r_deep_limit: u32, ) -> Self pub(crate) fn reconstruct( tree: &[(Vec<ProveExpressionUnit>, BTreeMap<u32, F>)] ) -> Self pub(crate) fn disjoint( es: &Vec<ProveExpressionUnit>, src: &Vec<ProveExpressionUnit> ) -> bool pub(crate) fn disjoint_group( gs: &Vec<(Vec<ProveExpressionUnit>, BTreeMap<u32, F>)>, src: &Vec<ProveExpressionUnit> ) -> bool pub(crate) fn mk_group( tree: &Vec<(Vec<ProveExpressionUnit>, BTreeMap<u32, F>)> ) -> Vec<Vec<(Vec<ProveExpressionUnit>, BTreeMap<u32, F>)>> pub(crate) fn get_complexity(&self) -> ComplexityProfiler pub(crate) fn get_r_deep(&self) -> u32 fn ys_add_assign(l: &mut BTreeMap<u32, F>, r: BTreeMap<u32, F>) fn ys_mul(l: &BTreeMap<u32, F>, r: &BTreeMap<u32, F>) -> BTreeMap<u32, F> pub(crate) fn flatten(self) -> BTreeMap<Vec<ProveExpressionUnit>, BTreeMap<u32, F>> } ``` L1419: ```Rust! pub(crate) fn reconstruct(tree: &[(Vec<ProveExpressionUnit>, BTreeMap<u32, F>)]) -> Self { let tree = tree .into_iter() .map(|(us, v)| { let mut map = BTreeMap::new(); for u in us { if let Some(c) = map.get_mut(u) { *c = *c + 1; } else { map.insert(u.clone(), 1); } } (map, v.clone()) }) .collect(); let r_deep = std::env::var("HALO2_PROOF_GPU_EVAL_R_DEEP").unwrap_or("6".to_owned()); let r_deep = u32::from_str_radix(&r_deep, 10).expect("Invalid HALO2_PROOF_GPU_EVAL_R_DEEP"); Self::reconstruct_tree(tree, r_deep) } ``` L1437: reconstruct_tree() ## interpolation from [here](https://github.com/ZK-Garage/plonk/blob/master/plonk-book/src/chapter_1.md) The Lagrange interpolation consists of 2 steps: 1. Construct a **Lagrange basis**. This is a set of $n$ polynomials of degree $n-1$ that take the value 0 at all points of the set except one, where their value is 1. Expressed in a formula: $$ L_i(x) = \begin{cases} 0, & \text{if }\ x = {x_j}, j \in [n], j\neq i\\ 1, & \text{if }\ x = x_i \end{cases} $$ The polynomials $L_i$ can be constructed in the following way: $$ L_i(x) = \prod_{0 \leq j < n\text{, }j\neq i} \frac{x-x_j}{x_i - x_j} $$ Notice that this product has $n-1$ terms, and therefore results in a degree $n-1$ polynomial. 2. Scale and sum the polynomials of the basis. $$ f(x) = \sum_{i=0}^{n-1} y_i \cdot L_i(x) $$ The properties of the Lagrange basis now allow us to scale each polynomial to its target value -- multiplying by $y_i$ and then add up all the terms. The important observation we can extract from the Lagrange interpolation is that given a fixed set of points $x_1,\dots,x_n$ (an evaluation domain) we can represent any polynomial of degree $d<n$ by its evaluations $f(x_i)$ at $d+1$ points in the set. As it turns out, this representation is much more convenient than the usual coefficient representation as it provides a very simple and fast way of computing sums and multiplication of polynomials. However, the coefficient form is still useful for evaluating the polynomial at points outside the evaluation domain. Switching between these two forms of representation is very useful. The coefficient form is preferred when the polynomial must be evaluated at a random point (outside of the evaluation domain). The evaluation form is better suited for operations between polynomials such as addition, multiplication and exact quotients. The algorithm that allows us to efficiently switch between representations is the Fast Fourier Transform (FFT). This is an efficient algorithm for the more general discrete Fourier Transform (DFT). It has a complexity of $\mathcal{O}(n \cdot log(n))$ with $n$ being the degree of the polynomial. ### GPU evaluation --> _eval_gpu() this is a WIP looking into `LookupProveExpression` and its ascoiated prove expression blocks ```Rust! impl<F: FieldExt> LookupProveExpression<F> { // match for type: LookupProveExpression::Expression(e) LookupProveExpression::LcTheta(l, r) LookupProveExpression::LcBeta(l, r) LookupProveExpression::AddGamma(l) } ``` and looking into `ProveExpression` with its functions, particularly `_eval_gpu()` ```Rust! impl<F: FieldExt> ProveExpression<F> { pub(crate) fn gen_cache_policy(&self, unit_cache: &mut Cache<Buffer<F>>) pub(crate) fn eval_gpu<C: CurveAffine<ScalarExt = F>>(...) pub(crate) fn _eval_gpu_buffer<C: CurveAffine<ScalarExt = F>>(...) pub(crate) fn _eval_gpu<C: CurveAffine<ScalarExt = F>>(...) } ``` `_eval_gpu()` is: ```Rust! pub(crate) fn _eval_gpu<C: CurveAffine<ScalarExt = F>>(...) { // match on: ProveExpression::Op(l, r, op) ProveExpression::Y(ys) ProveExpression::Unit(u) ProveExpression::Scale(l, ys) } ``` looking at and writing documentation down for the below: op = Operation (Sum or Product) the l & r. ```Rust ProveExpression::Op(l, r, op) => {...} ``` Multiplication with a fixed order. ```Rust ProveExpression::Y(ys) => {...} ``` for the below a `ProveExpressionUnit` is made up of fixed, advice & instance coulmns. ```Rust ProveExpression::Unit(u) => {...} ``` if cached return v, if not in the cache then calculate de_extenxed_fft(), which is an ExtendedField FFT calculation. The poly that is passed in to _eval_gpu() is in coefficient form. If the polynomial is in the coefficient form we dont know the evaulation of each point. If its in the evaluation from then we dont have to run the else{ code, ext fft etc}. **still working on this...** Task: understand and study the evaluation of binary tree in the evaluaor, in terms of the running kernel on GPU. modify the kernel to take different operands (not just two) and bench performace.