## Week 1
**Work topics for January:**
* Handle the conversion from Montgomery to normal representation in the CUDA GPU code.
* Finish getting the radix-2 CUDA implementation working and bench it against the current OpenCL implementation.
* Write the extended split-radix 2/8 algorithm in CUDA and bench it against the radix-2 version above and the current OpenCL implementation.
* Understand the diff to `evaluation_gpu` from better-polyeval.
* Understand and document `fft.cl` and Xin testing.
#### Montgomery investigation from GPU FFT:
Spent the start of the week understanding Montgomery representation in the context of the CUDA code in the polyeval PR from last year, the field implementation, and what is required to write the Montgomery backend for `FieldElement`.
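As a reminder of what the Montgomery-to-normal conversion involves, here is a minimal single-limb sketch in Rust. It is not the CUDA code from the polyeval PR or the `FieldElement` backend. The modulus `N` (2^61 - 1), the choice `R = 2^64`, and all function names are assumptions made for illustration; the real field would be multi-limb.

```rust
// Toy single-limb Montgomery arithmetic with R = 2^64.
// N is a hypothetical odd modulus below 2^63, not the real field modulus.
const N: u64 = (1u64 << 61) - 1;

/// Computes -N^{-1} mod 2^64 by Newton iteration.
/// Each step doubles the number of correct low bits (1 -> 2 -> ... -> 64).
fn neg_inv(n: u64) -> u64 {
    let mut inv: u64 = 1; // correct modulo 2 because n is odd
    for _ in 0..6 {
        inv = inv.wrapping_mul(2u64.wrapping_sub(n.wrapping_mul(inv)));
    }
    inv.wrapping_neg()
}

/// Montgomery reduction: returns t * R^{-1} mod N for t < N * R.
fn redc(t: u128) -> u64 {
    let m = (t as u64).wrapping_mul(neg_inv(N));
    // t + m * N is divisible by 2^64; the sum fits in u128 because N < 2^63.
    let u = ((t + (m as u128) * (N as u128)) >> 64) as u64;
    if u >= N { u - N } else { u }
}

/// Normal -> Montgomery form: a * R mod N.
fn to_mont(a: u64) -> u64 {
    (((a as u128) << 64) % (N as u128)) as u64
}

/// Montgomery -> normal form: one reduction strips the factor R.
fn from_mont(a_mont: u64) -> u64 {
    redc(a_mont as u128)
}

/// Product of two Montgomery-form values, result also in Montgomery form.
fn mont_mul(a_mont: u64, b_mont: u64) -> u64 {
    redc((a_mont as u128) * (b_mont as u128))
}
```

In this sketch the conversion back to normal form is a single `redc` call, so a GPU kernel that keeps values in Montgomery form only needs one extra reduction per element when results are read back.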
#### Benching with zkWasm:
The diff in `evaluation_gpu.rs` (better_polyeval) changes these constants from:
```rust
const MAX_LOG2_RADIX: u32 = 8;
const MAX_LOG2_LOCAL_WORK_SIZE: u32 = 7;
```
changed to:
```rust
let max_log2_radix = if log_n % 8 == 1 {
    7 // scale the local worker size of the last round
} else {
    8
};
let max_log2_local_work_size: u32 = 6;
```
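One plausible reading of this change, sketched in Rust. It assumes the FFT rounds are split the way the `radix_fft` loop in `ec-gpu-gen` does it; the actual loop in `evaluation_gpu.rs` is not shown in these notes. `fft_rounds` and `pick_max_log2_radix` are hypothetical helper names.

```rust
use std::cmp::min;

/// Mirrors the radix selection from the diff above.
fn pick_max_log2_radix(log_n: u32) -> u32 {
    if log_n % 8 == 1 { 7 } else { 8 }
}

/// Returns (degree, local_work_size, global_work_size) for each FFT round,
/// assuming each round consumes up to `max_log2_radix` bits of `log_n`.
fn fft_rounds(
    log_n: u32,
    max_log2_radix: u32,
    max_log2_local_work_size: u32,
) -> Vec<(u32, u32, u32)> {
    let max_deg = min(max_log2_radix, log_n);
    let n = 1u32 << log_n;
    let mut rounds = Vec::new();
    let mut log_p = 0;
    while log_p < log_n {
        // The last round gets whatever bits are left over.
        let deg = min(max_deg, log_n - log_p);
        let local_work_size = 1u32 << min(deg - 1, max_log2_local_work_size);
        let global_work_size = n >> deg;
        rounds.push((deg, local_work_size, global_work_size));
        log_p += deg;
    }
    rounds
}
```

Under these assumptions, `log_n = 17` with the old constants gives round degrees 8, 8, 1, so the last round runs with a local work size of 1. With the new constants the degrees are 7, 7, 3 and the last round has a local work size of 4.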
`_eval_gpu()` is modified to:
```rust
if let Some(v) = self.eval_flat_scale(
    pk,
    program,
    memory_cache,
    advice,
    instance,
    y,
    unit_cache,
    allocator,
    helper
)? {
    Ok((Some((v, 0)), None))
} else {
    match self {
        ProveExpressi
        ...
    }
```
Addition of:
```rust
fn eval_flat_scale<C: CurveAffine<ScalarExt = F>>( )
```
with kernel name:
```rust
let kernel_name = format!("{}_eval_batch_scale", "Bn256_Fr");
```
Take a look at the corresponding kernel in `ec-gpu-gen` `fft.cl`:
```cl
KERNEL void FIELD_eval_batch_scale(
  GLOBAL FIELD* res,
  GLOBAL FIELD* l,
  GLOBAL int* l_rot,
  uint nb_scale,
  uint size,
  GLOBAL FIELD* c
) {
  // One work-item per output element.
  uint gid = GET_GLOBAL_ID();
  uint idx = gid;
  // Rotated index, wrapped with a mask (only valid if size is a power of two).
  uint lidx = (idx + size + l_rot[0]) & (size - 1);
  res[idx] = FIELD_mul(l[lidx], c[0]);
  // Accumulate the remaining rotated, scaled terms.
  for (uint i = 1; i < nb_scale; i++) {
    uint lidx = (idx + size + l_rot[i]) & (size - 1);
    res[idx] = FIELD_add(res[idx], FIELD_mul(l[lidx], c[i]));
  }
}
```
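To check my reading of the kernel: each output element is a sum of rotated, scaled copies of `l`, i.e. `res[idx] = Σ_i c[i] * l[(idx + l_rot[i]) mod size]`. Below is a minimal CPU reference in Rust under stated assumptions: a toy `u64` modulus `P` stands in for `Bn256_Fr`, and `eval_batch_scale_ref` is a hypothetical name, not something in `ec-gpu-gen` or better_polyeval.

```rust
// Toy modulus (2^61 - 1) standing in for the real scalar field.
const P: u64 = (1u64 << 61) - 1;

/// CPU reference for what the kernel appears to compute:
/// res[idx] = sum over i of c[i] * l[(idx + l_rot[i]) mod size].
fn eval_batch_scale_ref(l: &[u64], l_rot: &[i32], c: &[u64]) -> Vec<u64> {
    let size = l.len();
    // The kernel wraps indices with `& (size - 1)`, which needs a power of two.
    assert!(size.is_power_of_two());
    assert_eq!(l_rot.len(), c.len());
    (0..size)
        .map(|idx| {
            let mut acc: u64 = 0;
            for (&rot, &coeff) in l_rot.iter().zip(c) {
                // rem_euclid handles negative rotations, giving the same index
                // as the kernel's `(idx + size + rot) & (size - 1)`.
                let lidx = (idx as i64 + rot as i64).rem_euclid(size as i64) as usize;
                let term = (((l[lidx] as u128) * (coeff as u128)) % (P as u128)) as u64;
                acc = (acc + term) % P; // both operands are below 2^61, no overflow
            }
            acc
        })
        .collect()
}
```

For example, with `l = [1, 2, 3, 4]`, rotations `[0, 1, -1]` and coefficients `[1, 10, 100]`, the first output is `1*1 + 2*10 + 4*100 = 421`. A reference like this could be used to compare against the GPU output when benching.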