## Introduction

This is my approach to solving the Polynomial Evaluation Challenge given by the Ingonyama team. Here is the [Challenge Link](https://github.com/ingonyama-zk/Challenges/blob/main/Polynomial_Evaluation_Challenge.md), and here is the GitHub link to the [Solution](https://github.com/Sahilgill24/poly-eval).

***Currently I have set the repository to public; please mention if this is a concern.***

## Polynomial Evaluation

My first thought was to replace the given Horner's method with an FFT-based approach, which has better asymptotic complexity. However, an efficient NTT requires $p - 1$ to be divisible by a large power of two so that high-order roots of unity exist, and for the given modulus $32483$ we have $32483 - 1 = 2 \times 16241$, which contains only a single factor of two. In addition, the batch size and domain size in the challenge are too small for an FFT approach to pay off. So I implemented a Horner kernel for the challenge instead.

There are two Horner kernels in `poly_eval.cu`: `horner_kernel_v1` and `horner_kernel`.

- `horner_kernel_v1`

This version is a direct 1:1 translation from C++ to CUDA with limited parallelism and unimpressive benchmarks: each thread handles one polynomial of the batch and loops over all domain points sequentially.

```cuda
template <typename T> // T is the field element type
__global__ void horner_kernel_v1(
    const T *__restrict__ coeffs,
    const T *__restrict__ domain,
    int coeffs_size,
    int domain_size,
    int batch_size,
    T *__restrict__ evals /*OUT*/)
{
  // One thread per polynomial in the batch.
  int idx_in_batch = blockIdx.x * blockDim.x + threadIdx.x;
  if (idx_in_batch >= batch_size) return;

  const T *curr_coeffs = coeffs + idx_in_batch * coeffs_size;
  T *curr_evals = evals + idx_in_batch * domain_size;

  // Evaluate the polynomial at every domain point sequentially using Horner's rule.
  for (int eval_idx = 0; eval_idx < domain_size; ++eval_idx) {
    curr_evals[eval_idx] = curr_coeffs[coeffs_size - 1];
    for (int coeff_idx = coeffs_size - 2; coeff_idx >= 0; --coeff_idx) {
      curr_evals[eval_idx] = curr_evals[eval_idx] * domain[eval_idx] + curr_coeffs[coeff_idx];
    }
  }
}
```

Benchmark: RTX 2000 Ada -> $2028.831$ ms

- `horner_kernel`

This kernel is far more parallel than the previous one: it uses a 2-D grid in which the y-dimension indexes the polynomial in the batch and the x-dimension indexes the domain point, so each thread evaluates one polynomial at a single domain point. (A sketch of the host-side launch configuration is given in the appendix at the end.)

```cuda
template <typename T> // T is the field element type
__global__ void horner_kernel(
    const T *__restrict__ coeffs,
    const T *__restrict__ domain,
    int coeffs_size,
    int domain_size,
    int batch_size,
    T *__restrict__ evals /*OUT*/)
{
  // 2-D grid: x indexes the domain point, y indexes the polynomial in the batch.
  int idx_in_domain = blockIdx.x * blockDim.x + threadIdx.x;
  int poly = blockIdx.y;
  if (idx_in_domain >= domain_size) return;

  const T *curr_coeffs = coeffs + poly * coeffs_size;
  T *curr_evals = evals + poly * domain_size;

  // Horner's rule at a single domain point.
  curr_evals[idx_in_domain] = curr_coeffs[coeffs_size - 1];
  for (int coeff_idx = coeffs_size - 2; coeff_idx >= 0; --coeff_idx) {
    curr_evals[idx_in_domain] = curr_evals[idx_in_domain] * domain[idx_in_domain] + curr_coeffs[coeff_idx];
  }
}
```

Benchmarks:

- without `__restrict__`: RTX 2000 Ada -> $286.529$ ms
- with `__restrict__`: RTX 2000 Ada -> $227.083$ ms

To try the code, clone the repository and run `make` on your local machine.

## Contact

- Email: gursahilgill24@gmail.com
- X: [Sahilgill24](https://x.com/Sahilgill24)
- Discord: gills0924 (already present in the Ingonyama Discord channel)
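
## Appendix: Launch configuration sketch

For reference, here is a minimal sketch of how the 2-D grid for `horner_kernel` could be launched from the host. This is illustrative only: the helper name `launch_horner` and the block size of 256 are assumptions, not taken from the repository.

```cuda
// Minimal host-side launch sketch for horner_kernel (illustrative; the helper
// name and the block size of 256 are assumptions, not from the repository).
template <typename T>
void launch_horner(const T *d_coeffs, const T *d_domain,
                   int coeffs_size, int domain_size, int batch_size,
                   T *d_evals)
{
  constexpr int threads_per_block = 256; // assumed block size
  dim3 block(threads_per_block);
  // x-dimension covers the domain points, y-dimension covers the polynomials
  // in the batch, matching the indexing inside horner_kernel.
  dim3 grid((domain_size + threads_per_block - 1) / threads_per_block, batch_size);
  horner_kernel<T><<<grid, block>>>(d_coeffs, d_domain,
                                    coeffs_size, domain_size, batch_size, d_evals);
}
```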