Implement the following algorithm (loosely based on the Sumcheck protocol) using CUDA.
The following implementation of the BabyBear field may be used for finite field arithmetic: https://github.com/supranational/sppark/tree/main/ff
You are encouraged to create an implementation that demonstrates good coding style, engineering practices and efficient scheduling and communication between CPU and GPU. Execution speed of individual kernels is less important.
### Part 1
Let $\mathbb{F}$ be the BabyBear finite field. Let $N$ and $n$ be integers.
You are given $N$ $n$-variate multi-linear polynomials $\{ f_i \}_{i=0}^{N-1}$, for $f_i : \mathbb{F}^n \rightarrow \mathbb{F}$, in the form of evaluations on the boolean hypercube $B^n$ (where $B = \{0,1\}$).
Evaluate the univariate polynomials:
$$g_i(x) = \sum_{(x2, \ldots, x_n) \in B^{n-1}} f_i(x, x_2, \ldots, x_n)$$
at $x=0, 1$.
Write the outputs of each polynomial to the terminal.
### Part 2
Given *evaluation points* $\{ r_i \}_{i=0}^{N-1} \in \mathbb{F}^N$, compute a new set of $n-1$-variate multi-linear polynomials $\{ f' \}_{i=0}^{N-1}$, where
$$f_i'(x_2, \ldots, x_n) = f_i(r_i, x_2, \ldots, x_n)$$
These polynomials should again be expressed as evaluations over the boolean hypercube $B^{n-1}$.
### Part 3
Assume that the *evaluation point* $r_i$ are to be computed as
$$r_i = g_i(0)^{-1} \oplus g_i(1)^{-1}$$
$\oplus$ represents XOR over the 32-bit canonical representation of elements in $\mathbb{F}$.
Given $N$ polynomials in the form of evaluations over the boolean hypercube (as above), use the above expression for $r_i$ to execute parts 1 and 2 iteratively to reduce each polynomial to a single value.