## Introduction
This is my approach to solving the Polynomial Evaluation Challenge set by the Ingonyama team.
Here is the [Challenge Link](https://github.com/ingonyama-zk/Challenges/blob/main/Polynomial_Evaluation_Challenge.md).
GitHub repository of the [solution](https://github.com/Sahilgill24/poly-eval).
***The repository is currently public; please let me know if this is a concern.***
## Polynomial Evaluation
My first thought was to skip the given Horner's method and use an FFT/NTT instead, which has better asymptotic complexity than Horner's method. However, the given prime $32483$ is not of the form $2^{k} + 1$, so there is no power-of-two evaluation domain to exploit, which makes the FFT approach far worse in practice; the `batch_size` and `domain_size` were also too small for it to be effective.
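To make this concrete: a radix-2 NTT of length $n$ needs an $n$-th root of unity modulo $p$, which (for a prime modulus) exists only when $n$ divides $p - 1$. For the challenge modulus $p = 32483$ we get $p - 1 = 2 \times 16241$, so the largest power-of-two transform size is just $2$. A minimal host-side check of this (plain C++, also compiles under `nvcc`; the value of $p$ is the one quoted above):

```cuda
#include <cstdio>

// How many factors of two divide p - 1? This bounds the size of any
// radix-2 NTT domain modulo p.
int main()
{
  unsigned p = 32483; // modulus quoted in the challenge
  unsigned m = p - 1; // 32482 = 2 * 16241
  int two_adicity = 0;
  while (m % 2 == 0) {
    m /= 2;
    ++two_adicity;
  }
  std::printf("2-adicity of p - 1 = %d, max radix-2 NTT size = %u\n",
              two_adicity, 1u << two_adicity);
  return 0;
}
```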
So I implemented a Horner kernel for the challenge instead.
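For reference, Horner's rule evaluates a polynomial with coefficients $a_0, \dots, a_{n-1}$ at a point $x$ as

$$p(x) = a_0 + x\,\bigl(a_1 + x\,\bigl(a_2 + \cdots + x\,a_{n-1}\bigr)\bigr),$$

i.e. $n-1$ multiply-adds per evaluation point, which is exactly the inner loop of the kernels below.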
There are two Horner kernels in `poly_eval.cu`: `horner_kernel_v1` and `horner_kernel`.
- `horner_kernel_v1`
v1 is a straightforward 1:1 translation from C++ to CUDA with very little parallelism (and correspondingly poor benchmarks): each thread takes one polynomial of the batch and evaluates it over the entire domain.
```cuda
// v1: one thread per polynomial; each thread walks the whole domain.
template <typename T>
__global__ void horner_kernel_v1(
  const T *__restrict__ coeffs,
  const T *__restrict__ domain,
  int coeffs_size,
  int domain_size,
  int batch_size,
  T *__restrict__ evals /*OUT*/)
{
  int idx_in_batch = blockIdx.x * blockDim.x + threadIdx.x;
  if (idx_in_batch >= batch_size)
    return;
  // Coefficients and output slice owned by this thread's polynomial.
  const T *curr_coeffs = coeffs + idx_in_batch * coeffs_size;
  T *curr_evals = evals + idx_in_batch * domain_size;
  for (int eval_idx = 0; eval_idx < domain_size; ++eval_idx)
  {
    // Horner's rule: start at the leading coefficient and fold in the rest.
    curr_evals[eval_idx] = curr_coeffs[coeffs_size - 1];
    for (int coeff_idx = coeffs_size - 2; coeff_idx >= 0; --coeff_idx)
    {
      curr_evals[eval_idx] =
        curr_evals[eval_idx] * domain[eval_idx] + curr_coeffs[coeff_idx];
    }
  }
}
```
RTX 2000 Ada -> $2028.831$ ms
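For completeness, the launch configuration paired with this kernel looks roughly like the sketch below; the block size and the wrapper name are illustrative, the actual host code lives in the repository.

```cuda
#include <cuda_runtime.h>

// Sketch of a host-side launch for horner_kernel_v1: a 1-D grid with one
// thread per polynomial in the batch. T is deduced from the device pointers.
template <typename T>
void launch_horner_v1(const T *d_coeffs, const T *d_domain,
                      int coeffs_size, int domain_size, int batch_size,
                      T *d_evals)
{
  const int block = 256;                             // threads per block
  const int grid = (batch_size + block - 1) / block; // enough blocks to cover the batch
  horner_kernel_v1<<<grid, block>>>(d_coeffs, d_domain, coeffs_size,
                                    domain_size, batch_size, d_evals);
  cudaDeviceSynchronize();                           // wait for the kernel before timing
}
```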
- `horner_kernel`
This kernel exposes far more parallelism than the previous one: it is launched on a 2-D grid where the x-dimension indexes the domain points and the y-dimension indexes the polynomials in the batch, so each thread evaluates one polynomial at a single domain point.
```cuda
// Each thread evaluates one polynomial (selected by blockIdx.y) at one
// domain point (indexed along x), so the whole batch runs in parallel.
template <typename T>
__global__ void horner_kernel(
  const T *__restrict__ coeffs,
  const T *__restrict__ domain,
  int coeffs_size,
  int domain_size,
  int batch_size,
  T *__restrict__ evals /*OUT*/)
{
  int idx_in_domain = blockIdx.x * blockDim.x + threadIdx.x;
  int poly = blockIdx.y; // one grid row per polynomial in the batch
  if (idx_in_domain >= domain_size)
    return;
  const T *curr_coeffs = coeffs + poly * coeffs_size;
  T *curr_evals = evals + poly * domain_size;
  // Horner's rule for a single (polynomial, domain point) pair.
  curr_evals[idx_in_domain] = curr_coeffs[coeffs_size - 1];
  for (int coeff_idx = coeffs_size - 2; coeff_idx >= 0; --coeff_idx)
  {
    curr_evals[idx_in_domain] =
      curr_evals[idx_in_domain] * domain[idx_in_domain] + curr_coeffs[coeff_idx];
  }
}
```
- without `__restrict__`: RTX 2000 Ada -> $286.529$ ms
- with `__restrict__`: RTX 2000 Ada -> $227.083$ ms
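The corresponding launch uses a 2-D grid, roughly as sketched below (again, the block size and wrapper name are just illustrative):

```cuda
#include <cuda_runtime.h>

// Sketch of a host-side launch for horner_kernel: grid.x covers the domain
// points and grid.y selects the polynomial, i.e. batch_size * domain_size
// threads in total.
template <typename T>
void launch_horner(const T *d_coeffs, const T *d_domain,
                   int coeffs_size, int domain_size, int batch_size,
                   T *d_evals)
{
  const int block = 256;                                    // threads per block
  dim3 grid((domain_size + block - 1) / block, batch_size); // (x: domain, y: batch)
  horner_kernel<<<grid, block>>>(d_coeffs, d_domain, coeffs_size,
                                 domain_size, batch_size, d_evals);
  cudaDeviceSynchronize();
}
```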
To try the code, clone the repository and run `make`.
## Contact
Email: gursahilgill24@gmail.com
X: [Sahilgill24](https://x.com/Sahilgill24)
Discord: gills0924 (already present in the Ingonyama Discord channel)