## Week 1

**Work topics for Jan:**

* Deal with the Montgomery-to-normal representation conversion from the CUDA GPU code.
* Finish getting the radix-2 CUDA implementation working; bench against the current CL impl.
* Write the extended split radix-2/8 algorithm in CUDA; bench against the above and the current CL impl.
* Understand the diff to `evaluation_gpu` from better-polyeval.
* Understand and document `fft.cl` and Xin testing.

#### Montgomery investigation from gpu fft:

Spent the start of the week understanding the Montgomery representation in the context of the CUDA code in the polyeval PR from last year, the field impl, and what is required to write the Montgomery backend for `FieldElement`.

#### benching with zkWasm:

Looking at `evaluation_gpu.rs` (better_polyeval). The constants change from:

```rust
const MAX_LOG2_RADIX: u32 = 8;
const MAX_LOG2_LOCAL_WORK_SIZE: u32 = 7;
```

To:

```rust
let max_log2_radix = if log_n % 8 == 1 {
    7 // scale the local worker size of the last round
} else {
    8
};
let max_log2_local_work_size: u32 = 6;
```

`_eval_gpu()` is modified to:

```rust
if let Some(v) = self.eval_flat_scale(
    pk, program, memory_cache, advice, instance, y, unit_cache, allocator, helper
)? {
    Ok((Some((v, 0)), None))
} else {
    match self {
        ProveExpressi ...
}
```

Addition of:

```rust
fn eval_flat_scale<C: CurveAffine<ScalarExt = F>>( )
```

with kernel:

```rust
let kernel_name = format!("{}_eval_batch_scale", "Bn256_Fr");
```

Take a look at `ec-gpu-gen` `fft.cl`:

```cl
KERNEL void FIELD_eval_batch_scale(
    GLOBAL FIELD* res,
    GLOBAL FIELD* l,
    GLOBAL int* l_rot,
    uint nb_scale,
    uint size,
    GLOBAL FIELD* c
) {
    uint gid = GET_GLOBAL_ID();
    uint idx = gid;

    uint lidx = (idx + size + l_rot[0]) & (size - 1);
    res[idx] = FIELD_mul(l[lidx], c[0]);

    for (uint i = 1; i < nb_scale; i++) {
        uint lidx = (idx + size + l_rot[i]) & (size - 1);
        res[idx] = FIELD_add(res[idx], FIELD_mul(l[lidx], c[i]));
    }
}
```
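
Reading the kernel: each work item computes `res[idx] = sum_i c[i] * l[(idx + l_rot[i]) mod size]`, with the `& (size - 1)` mask standing in for the modulo, so `size` has to be a power of two. A CPU reference may be handy for checking GPU output while benching. The following is only a minimal sketch: it assumes a toy prime field (`P = 2^31 - 1`) in place of `Bn256_Fr`, and `eval_batch_scale_ref`, `field_add` and `field_mul` are hypothetical names, not part of `ec-gpu-gen`.

```rust
/// Toy prime modulus (2^31 - 1). ASSUMPTION: stands in for the Bn256 scalar
/// field purely for illustration; the real kernel works on FIELD elements.
const P: u64 = 2_147_483_647;

/// Hypothetical stand-in for FIELD_add.
fn field_add(a: u64, b: u64) -> u64 {
    (a + b) % P
}

/// Hypothetical stand-in for FIELD_mul (operands < 2^31, so no u64 overflow).
fn field_mul(a: u64, b: u64) -> u64 {
    (a * b) % P
}

/// CPU reference for the indexing and accumulation of the kernel:
/// res[idx] = sum over i of c[i] * l[(idx + size + l_rot[i]) & (size - 1)].
fn eval_batch_scale_ref(l: &[u64], l_rot: &[i32], c: &[u64]) -> Vec<u64> {
    let size = l.len();
    assert!(size.is_power_of_two(), "the index mask needs a power-of-two size");
    assert_eq!(l_rot.len(), c.len(), "one coefficient per rotation");

    let mask = size as i64 - 1;
    (0..size)
        .map(|idx| {
            l_rot.iter().zip(c).fold(0u64, |acc, (&rot, &coeff)| {
                // Same wrap-around as the kernel; masking also handles
                // negative rotations because size is a power of two.
                let lidx = ((idx as i64 + size as i64 + rot as i64) & mask) as usize;
                field_add(acc, field_mul(l[lidx], coeff))
            })
        })
        .collect()
}
```

For example, with `l = [1, 2, 3, 4]`, rotations `[0, 1, -1]` and coefficients `[1, 10, 100]`, index 0 evaluates to `1*1 + 2*10 + 4*100 = 421`. Unlike the kernel, which reads `l_rot[0]` unconditionally and so expects `nb_scale >= 1`, the sketch starts from zero and also accepts an empty rotation list.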