# Partial reduction of $x \bmod (B + c)$
To implement modular arithmetic (e.g. prime field arithmetic), we want to have an efficient reduction algorithm. Often the modulus is chosen to be $B+c$ where $B$ is a power of two, and $c$ is *very* small in magnitude compared to $B$. For example, Curve25519 uses $B = 2^{255}$ and $c = -19$.
In fact it's normally sufficient to have efficient "partial reduction", which leaves the result within a range that is a few times larger than $B+c$. Given an efficient partial reduction algorithm where the bound on the output can be calculated given a bound on the input, we can easily perform full reduction when needed (e.g. when we're comparing or hashing values).
You could use [Montgomery](https://en.wikipedia.org/wiki/Montgomery_reduction) or [Barrett](https://en.wikipedia.org/wiki/Barrett_reduction) reduction, but I want to describe a simpler algorithm that may be more efficient in some cases, and that does not require a special representation of numbers such as Montgomery form. I don't claim any novelty; I expect that this is just a reinvention of a folklore technique.
The algorithm I'm going to describe works when $c$ is not necessarily *very* small, but just signficantly smaller than $B$ (say, half the bit length). This is useful in situations where you can't arbitrarily choose the prime $B+c$, but still have enough freedom to make $c$ small enough. For instance, that applies to the fields used for elliptic curves suited to [Halo](https://eprint.iacr.org/2019/1021), and it could also apply in cases where $B$ and $B+c$ are related by a [Hasse bound](https://en.wikipedia.org/wiki/Hasse_bound). Note that we don't need to assume that $B$ is a power of two; the important thing is that there is an efficient way to convert numbers into base-$B$ representation.
In practice we would make specific choices of $L_{0,1,2}$ and $L'_{0,1}$ suited for a given implementation of the low-level arithmetic, and work out all the bounds exactly rather than asymptotically (as done in the example below).
## With signed limbs
Write the input $x$ as $B^2 x_2 + B x_1 + x_0$ with $|x_i| \leq L_i$. (The $B^2$ term is optional; you can instead just increase the length of $x_1$ slightly.)
$x = [(B+c)(B-c) + c^2]x_2 + [(B + c) - c]x_1 + x_0$
$\hphantom{x} \equiv c^2 x_2 - c x_1 + x_0 \pmod{B + c}$
Let $x' = c^2 x_2 - c x_1 + x_0$. Then $|x'| \leq c^2 L_2 + c L_1 + L_0$.
Now write $x'$ as $B x'_1 + x'_0$ with $|x'_i| \leq L'_i$.
$x \equiv x' \equiv -c x'_1 + x'_0 \pmod{B + c}$.
Let $x'' = -c x'_1 + x'_0$. Then $|x''| \leq c L'_1 + L'_0$.
Suppose $L_0, L_1$ are $O(B)$, $L_2$ is $O(B^{1/2})$, and $c$ is $O(B^{1/2})$. Then $|x'|$ is $O(B^{3/2})$.
If $L'_0$ is $O(B)$, then $L'_1$ is $O(B^{1/2})$, and $|x''|$ is $O(B)$ as required.
## With unsigned limbs
If we want to use only unsigned limbs, we have to avoid the possibility that $x'$ can be negative. The simplest way to do this is to unconditionally add $k(B+c)$ to $x'$ where $k = \left\lceil \frac{c L_1}{B+c} \right\rceil$.
That is, write the input $x$ as $B^2 x_2 + B x_1 + x_0$ with $0 \leq x_i \leq L_i$. (Again, the $B^2$ term is optional.)
$x = [(B+c)(B-c) + c^2]x_2 + [(B + c) - c]x_1 + x_0$
$\hphantom{x} \equiv c^2 x_2 + k(B+c) - c x_1 + x_0 \pmod{B + c}$
Let $x' = c^2 x_2 + k(B+c) - c x_1 + x_0$. Then $0 \leq x' \leq c^2 L_2 + k(B+c) + L_0$.
Now write $x'$ as $B x'_1 + x'_0$ with $0 \leq x'_i \leq L'_i$.
We have the same issue again that $-c x'_1 + x'_0$ could be negative, so we'll add $k'(B+c)$ where $k' = \left\lceil \frac{c L'_1}{B+c} \right\rceil$ (in practice we'll usually have $k' = 1$).
$x \equiv x' \equiv k'(B+c) - c x'_1 + x'_0 \pmod{B + c}$.
Let $x'' = k'(B+c) - c x'_1 + x'_0$. Then $0 \leq x'' \leq k'(B+c) + L'_0$.
Suppose $L_0, L_1$ are $O(B)$, $L_2$ is $O(B^{1/2})$, and $c$ is $O(B^{1/2})$. Then $k$ is $O(B^{1/2})$, and so $x'$ is $O(B^{3/2})$.
If $L'_0$ is $O(B)$, then $L'_1$ is $O(B^{1/2})$, $k'$ is $O(1)$, and $x''$ is $O(B)$ as required.
## Example with exact bounds
Let's say we want to implement the field over which the [Tweedledum curve](https://github.com/daira/tweedle) from Halo is defined.
Set $B = 2^{254}$ and $c = 4707489545178046908921067385359695873$ (which is 122 bits).
If we have ``uint64`` arithmetic available, then we probably want to express field elements using 4 unsigned limbs. So for example, set $L_0 = L_1 = 2^{255}-1$ and $L_2 = 15$. Then we have $k = 9414979090356093817842134770719391746$, and $0 \leq x' \leq c^2 L_2 + k(B+c) + L_0$ which is 377 bits.
Then, we set $L'_0 = B-1$ and obtain $L'_1 = (c^2 L_2 + k(B+c) + L_0) \gg 254 = 9414979090356093817842134770719391748$. We have $k' = 1$, so $0 \leq x'' \leq 2B + c - 1$, which is only 1 bit away from being fully reduced.
Furthermore, $(2B + c - 1)^2 \gg 508 = 4$, so if we multiply two such partially reduced values and express the result as $B^2 x_2 + B x_1 + x_0$ for the next reduction, we have $0 \leq x_2 \leq 4$. So we could have chosen any $L_2 \geq 4$ (but still $O(B^{1/2})$); the reason we chose $L_2 = 15$ is so that we can add together several products as 512-bit values without immediate reduction.
The cost of the reduction was:
* expressing the input $x$ in base $B$;
* one multiplication of a 4-bit $x_2$ with a 244-bit constant $c^2$ (maybe using a lookup table);
* one multiplication of a 256-bit $x_1$ with a 122-bit constant $c$, i.e. $s = c x_1$;
* a 256-bit addition $t = c^2 x_2 + x_0$;
* a 377-bit subtraction from a constant and addition of 256-bit $t$, $x' = k(B+c) - s + t$;
* expressing $x'$ in base $B$;
* one multiplication of a 123-bit $x'_1$ with a 122-bit constant $c$, i.e. $u = c x'_1$;
* a 255-bit subtraction from a constant and 256-bit addition $x'' = (B + c) - u + x'_0$.
We could also have avoided the multiplication by $c^2$, by setting $L_2 = 0$ and $L_1 = 2^{261}-1$ for example. In that case the cost would be:
* expressing the input $x$ in base $B$;
* one multiplication of a 261-bit $x_1$ with a 122-bit constant $c$, i.e. $s = c x_1$;
* a 383-bit subtraction from a constant and addition of 256-bit $t$, $x' = k(B+c) - s + t$;
* expressing $x'$ in base $B$;
* one multiplication of a 129-bit $x'_1$ with a 122-bit constant $c$, i.e. $u = c x'_1$;
* a 255-bit subtraction from a constant and 256-bit addition $x'' = (B + c) - u + x'_0$.