# The Barrett-Montgomery duality and a new multi-precision modular reduction scheme with only $n^2+1$ digit multiplications

*Yuval Domb* yuval@ingonyama.com

![](https://hackmd.io/_uploads/r1BaiuVoyx.png)

Multiplication in prime fields with large unstructured primes is a fundamental building block of modern cryptography. Modular multiplication of two field elements $a, b \in \mathbb{F}_p$ requires computing their full product $ab$ and reducing it modulo $p$. The two most widely used reduction algorithms are due to [Barrett](https://en.wikipedia.org/wiki/Barrett_reduction) and [Montgomery](https://en.wikipedia.org/wiki/Montgomery_modular_multiplication), with the latter operating in a permuted coset of the field (Montgomery form).

Single-precision implementations of both algorithms typically require three full multiplications: one for $ab$ and two for the reduction. With schoolbook multiplication, each full multiplier costs $n^2$ digit multiplications, resulting in a total of $2n^2$ for the reduction. Faster methods like [Karatsuba](https://en.wikipedia.org/wiki/Karatsuba_algorithm) can decrease the number of digit multiplications but introduce significant overhead in additions and complex internal dependencies, making them less suitable for CPU-based implementations. Multi-precision approaches (operand scanning), particularly with Montgomery reduction, are the most widely used today. The best known algorithm (at least to us and [ChatGPT](https://chatgpt.com/)) achieves reduction in $n^2 + n$ digit multiplications without increasing additions or introducing complex dependencies, making it highly efficient for CPU-based systems.

In this note, we aim to explore modular reduction from a fresh perspective, balancing conciseness with an informal approach. Specifically, let us:

- Introduce an alternative way to conceptualize modular reduction, revealing a natural duality between the Barrett and Montgomery methods.
- Present the $n^2 + n$ Montgomery reduction algorithm.
- Derive the dual $n^2 + n$ Barrett reduction counterpart.
- Explain why Montgomery reduction is inherently better suited to CPU-based implementations.
- Introduce a new $n^2 + 1$ algorithm, demonstrating its applicability to both Montgomery and Barrett reductions.
- Show how Barrett and Montgomery reductions can often be used interchangeably in key cryptographic primitives.

As far as we are aware, much of this content is novel, providing new insights into the structure and efficiency of modular reduction techniques.

## Barrett-Montgomery duality

In Montgomery form, a field element is mapped to its corresponding representation in the coset $\mathbb{F}_p R$, where $R$ is a value greater than and coprime to $p$. In practice, $R$ is always chosen as a power of two, and for multi-precision implementations it is typically set to $R=2^{nw}$, where $n$ and $w$ are the number of digits and the machine's digit width in bits, respectively. Specifically, elements $a$ and $b$ are replaced by $aR$ and $bR$, so their product becomes $abR^2$. To recover the product's Montgomery form $abR$, one must multiply by $R^{-1}$ and reduce modulo $p$. However, these two steps can be efficiently combined into a single operation, known as Montgomery reduction. From this perspective, Montgomery reduction computes $cR^{-1} \mod p$, whereas Barrett reduction computes $c \mod p$, given some $c = ab$. To generalize, let us assume $c \in [0, R^2 - 1]$, a range that slightly exceeds $[0, p^2]$ but simplifies the explanation.

The number $c$ can be represented as
$$
c = c_1R+c_0
$$
where $c_1$ and $c_0$ are the Most-Significant (MS) and Least-Significant (LS) words of $c$, respectively. This leads to a Barrett reduction of the form
$$
c \mod p = (c_1R \mod p) + c_0
$$
and a Montgomery reduction of the form
$$
cR^{-1} \mod p = c_1 + (c_0R^{-1} \mod p)
$$
Clearly, there is a dual relation between the two: the first adds the LS word to a reduction of the MS word scaled by $R$, while the second adds the MS word to a reduction of the LS word scaled by $R^{-1}$.
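To see the duality concretely, here is a quick numeric check in Python. The prime and the value of $c$ below are arbitrary illustrative choices; the two assertions verify the Barrett and Montgomery identities above.

```python
# Sanity check of the Barrett-Montgomery duality on the two-word split
# c = c1*R + c0. The prime p and the value c are arbitrary examples.
p = 4294967291          # a 32-bit prime
R = 1 << 64             # R > p, a power of two (hence coprime to p)
R_inv = pow(R, -1, p)   # R^{-1} mod p

c = 0x123456789ABCDEF0123456789ABCDEF   # any c in [0, R^2 - 1]
c1, c0 = divmod(c, R)                   # MS and LS words of c

barrett = (c1 * R % p) + c0             # reduce the MS word, keep the LS word
montgomery = c1 + (c0 * R_inv % p)      # reduce the LS word, keep the MS word

assert barrett % p == c % p               # Barrett computes c mod p
assert montgomery % p == (c * R_inv) % p  # Montgomery computes c*R^{-1} mod p
```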
## Multi-precision $n^2+n$ Montgomery reduction

In a multi-precision setting, let $R = r^n$, where $r = 2^w$ represents a single machine digit. The multi-precision Montgomery reduction operates iteratively on
$$
c = c'_{2n-1}r^{2n-1} + \dots + c'_nr^n + c'_{n-1}r^{n-1} + \dots + c'_0
$$
At each iteration, the digits of $c$ are reduced one by one, starting with the LS digit. The LS digit is reduced by executing the first partial reduction
$$
\begin{align}cr^{-1} \mod p &= c'_{2n-1}r^{2n-2} + \dots + c'_nr^{n-1} + c'_{n-1}r^{n-2} + \dots + c'_1 + (c'_0r^{-1} \mod p) \\&= c''_{2n-1}r^{2n-2} + \dots + c''_nr^{n-1} + c''_{n-1}r^{n-2} + \dots + c''_1\end{align}
$$
The second digit is reduced by executing the next partial reduction
$$
cr^{-2} \mod p = c''_{2n-1}r^{2n-3} + \dots + c''_nr^{n-2} + c''_{n-1}r^{n-3} + \dots + (c''_1r^{-1} \mod p)
$$
and so on. After $n$ rounds, $c$ is reduced to $n$ digits, as required.

In each round, the free coefficient $\tilde{c} \in \{c'_0, c''_1, \dots\}$ is reduced by adding $mp$ (a multiple of $p$) to the current sum, ensuring divisibility by $r$ without residue. This step is valid because adding a multiple of $p$ does not affect the residue modulo $p$. To find $m$, let us solve the following equation, assuring divisibility by $r$
$$
\begin{align}&\tilde{c} + mp \equiv 0 \mod r \\ &\Rightarrow m = -p^{-1} \cdot \tilde{c} \mod r = \mu_r \tilde{c} \mod r\end{align}
$$
where $\mu_r \equiv -p^{-1} \mod r$ is a precomputed constant. We note that $p^{-1}$ always exists modulo $r$ since $r$ is coprime to $p$.

A complete round can be summarized by the following steps:

1. Compute $m=\mu_r \tilde{c} \mod r$.
2. Compute $mp$ and add it to the current sum.
3. Remove the zeroed LS digit, i.e., divide by $r$.

The final sum requires one last reduction step to eliminate a potential redundant copy of $p$. The total cost in digit multiplications for steps 1 and 2 is $n + 1$ per round. Over $n$ rounds, this results in a total of $n^2 + n$ digit multiplications. Below is an illustration of the first reduction round for $n=4$.

<p style="text-align:center;">
  <img src="https://hackmd.io/_uploads/BkJZI32qyl.png" alt="drawing" width="500"/>
</p>
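To make the rounds concrete, below is a minimal Python sketch of this $n^2+n$ reduction. The parameters ($w=32$, $n=4$, and a Mersenne prime) are illustrative choices, not requirements; Python's big integers stand in for the multi-precision digit arithmetic.

```python
# Multi-precision n^2 + n Montgomery reduction, digit by digit.
w = 32                        # digit width in bits
r = 1 << w                    # one machine digit, r = 2^w
n = 4                         # number of digits
R = r ** n                    # R = r^n = 2^(n*w)
p = (1 << 127) - 1            # an illustrative prime that fits in n digits
mu_r = (-pow(p, -1, r)) % r   # precomputed mu_r = -p^{-1} mod r

def montgomery_reduce(c: int) -> int:
    """Compute c * R^{-1} mod p for 0 <= c < R*p."""
    for _ in range(n):               # one round per digit
        c_tilde = c % r              # the current free (LS) coefficient
        m = (mu_r * c_tilde) % r     # step 1: one digit multiplication
        c += m * p                   # step 2: n digit multiplications
        c >>= w                      # step 3: drop the zeroed LS digit
    return c - p if c >= p else c    # remove a potential redundant copy of p

# Usage: a full Montgomery multiplication of a and b.
a, b = 1234567890123456789 % p, 9876543210987654321 % p
aR, bR = a * R % p, b * R % p                       # Montgomery forms
assert montgomery_reduce(aR * bR) == a * b * R % p  # Montgomery form of a*b
```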
## Multi-precision $n^2+n$ Barrett reduction

In a dual manner, Barrett reduction proceeds from the MS digit. The first digit is reduced by executing the following partial reduction
$$
\begin{align}c \mod kp &= (c'_{2n-1}R \mod kp)r^{n-1} + c'_{2n-2}r^{2n-2} + \dots + c'_nr^{n-1} + c'_{n-1}r^{n-2} + \dots + c'_0 \\&= c''_{2n-2}r^{2n-2} + \dots + c''_nr^{n-1} + c''_{n-1}r^{n-2} + \dots + c''_0\end{align}
$$
where the parameter $k$ is the **smallest** integer such that $kp > R$ and, as we shall see, enables restricting $m$ to a single digit (i.e. $0 \leq m < r$, in contrast to the standard Barrett implementation). The second digit is reduced by executing the next partial reduction
$$
\hat{c} \mod kp = (c''_{2n-2}R \mod kp)r^{n-2} + c''_{2n-3}r^{2n-3} + \dots + c''_nr^{n-1} + c''_{n-1}r^{n-2} + \dots + c''_0
$$
and so on, for a total of $n$ rounds, until reaching $n$ digit coefficients. (The term $\hat{c}$ will be explained shortly.)

Per round, the MS coefficient $\tilde{c}\in\{c'_{2n-1}, c''_{2n-2},\dots\}$ is reduced by subtracting $mkp$ from the current sum. The idea is to attempt to zero the resulting MS coefficient by removing a sufficient number of $p$'s. Under these constraints, the optimal $m$ can be found by solving the following optimization
$$
\begin{align} \text{maximize:}\quad& m\\ \text{subject to:}\quad&m < r \\ & \tilde{c}R - mkp \geq 0 \end{align}
$$
which results in the maximal single-digit $m$ such that subtracting $mkp$ from the sum never yields a negative result. The solution to the optimization is
$$
m = R (kp)^{-1} \cdot \tilde{c} \;\;\text{dom}\;\; r = \mu_r \tilde{c} \;\;\text{dom}\;\; r
$$
where the operator $\text{dom}$, termed the *domulo* of a number, is the dual of modulo: it extracts the MS $r$-digit (rather than the LS $r$-digit, as modulo does), and $\mu_r \equiv R (kp)^{-1} \;\;\text{dom}\;\; r$ is a precomputed constant. Without providing a more formal definition of the domulo operator, we will just add that $\forall i\in\mathbb{Z},\;\;x \;\;\text{dom}\;\; r \equiv r^i x \;\;\text{dom}\;\; r$, providing a simple way to compute $\mu_r$ using
$$
\mu_r = rR (kp)^{-1} \;\;\text{dom}\;\; r = \left\lfloor\frac{rR}{kp}\right\rfloor
$$

A complete round can be summarized by the following steps:

1. Compute $m=\mu_r \tilde{c} \;\;\text{dom}\;\; r$.
2. Compute $mkp$ and subtract it from the current sum. (Note that $mkp$ must be left-aligned with respect to the sum.)
3. Remove the MS "zeroed" digit.

The reason that "zeroed" above is in quotation marks is that the MS digit at the end of a round is not identically $0$ and can often take the value $1$. This happens because the optimization only considers the MS digit $\tilde{c}$ and ignores the lower digits, a restriction necessary for meeting the required total number of digit multiplications. As a result, the sum after $n$ rounds is of the form
$$
c \mod p = \beta_{2n-1}r^{2n-1} + \dots + \beta_n r^n + \alpha
$$
where $[\beta_{2n-1}, \dots, \beta_n]$ is a sequence of $n$ bits and $\alpha < R$ is the current, not yet final, reduction. The $2^n$ possible combinations $(\beta_{2n-1}r^{2n-1} + \dots + \beta_n r^n) \mod p$ can be precomputed, and the relevant one, $\beta$, added to $\alpha$ upon completion of all rounds. The final sum $\alpha + \beta$ then needs to undergo a final reduction step, as it may contain up to $k$ redundant copies of $p$. This inaccuracy is also the reason that we used $\hat{c}$ instead of $c$ in the second equation at the beginning of this section.

The total cost in digit multiplications for steps 1 and 2 is $n + 1$ per round. Over $n$ rounds, this results in a total of $n^2 + n$ digit multiplications. Below is an illustration of the first reduction round for $n=4$.

<p style="text-align:center;">
  <img src="https://hackmd.io/_uploads/S1Wq_o3qkg.png" alt="drawing" width="500"/>
</p>
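The following Python sketch mirrors the dual structure, with the same illustrative parameters as the Montgomery sketch above. For simplicity it leaves the leftover MS "zeroed" digits in place during the rounds and folds them in generically at the end, where the real scheme would instead use the $2^n$-entry precomputed table.

```python
# Multi-precision n^2 + n Barrett rounds, MS digit first.
w, n = 32, 4
r = 1 << w
R = r ** n
p = (1 << 127) - 1        # an illustrative prime that fits in n digits
k = R // p + 1            # smallest k such that k*p > R
mu_r = (r * R) // (k * p) # precomputed mu_r = rR(kp)^{-1} dom r

def barrett_rounds(c: int) -> int:
    """Run the n reduction rounds on c in [0, R^2 - 1]."""
    for t in range(2 * n - 1, n - 1, -1):   # MS digit positions, high to low
        c_tilde = (c >> (w * t)) % r        # current MS coefficient
        m = (mu_r * c_tilde) >> w           # step 1: m = mu_r*c_tilde dom r
        c -= (m * k * p) << (w * (t - n))   # step 2: left-aligned subtraction
        assert c >= 0                       # m was chosen to guarantee this
    return c                                # leftover MS digits plus alpha

c = 0xF123456789ABCDEFFEDCBA98765432100F1E2D3C4B5A69781122334455667788
beta, alpha = divmod(barrett_rounds(c), R)  # leftover MS digits and alpha
result = alpha + (beta * R) % p             # a table lookup in the real scheme
while result >= p:                          # final reduction step
    result -= p
assert result == c % p
```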
## Barrett vs. Montgomery

As shown in the last two sections, Barrett and Montgomery reductions form dual multi-precision schemes. However, Montgomery reduction holds a key efficiency advantage, particularly in CPU systems where minimizing dependencies is crucial.

The key distinction lies in the reduction order: Montgomery operates from the LS to the MS digit. As a result, any overflow in step 2 of a given round naturally propagates to a more significant digit that has yet to be processed. In this sense, overflows are *causal*: they occur before they become relevant to the reduction algorithm. In contrast, Barrett reduction follows the opposite order due to its dual nature. Overflows arise *non-causally*: they propagate into digits that have already been processed and therefore cannot be handled once they occur. The most efficient way to address them, as discussed earlier, is through post-processing after the final round, adding extra computational overhead. Moreover, overflow handling in Barrett reduction is inherently conditional, which is less desirable for efficiency. Finally, when $k > 1$, Barrett reduction tends to produce more redundant copies of $p$.

Nevertheless, the primary advantage of Barrett reduction is that it operates directly on the zeroth coset, eliminating the need for costly transformations. In some applications, working within the zeroth coset is not just beneficial but necessary.

## The new $n^2 + 1$ algorithm

The new algorithm arises from a simple observation about the multi-precision representation of a product. In hindsight, the insight is remarkably trivial, yet, surprisingly, it was overlooked.

Per round of the Montgomery reduction, we previously computed $\tilde{c}r^{-1} \mod p$ using $n + 1$ digit multiplications. An alternative method that achieves the same with only $n$ digit multiplications is to compute $\tilde{c}(r^{-1} \mod p)$ instead, using the precomputed constant $\rho_1 = r^{-1} \mod p$. One way to understand this algorithm is that, in each round, the current sum is divided by $r$, and the resulting fraction is converted into an $(n+1)$-digit number $\tilde{c} \cdot \rho_1$, which is then added to the integer part of the division. Because the product $\tilde{c} \cdot \rho_1$ consists of $n + 1$ digits, this method can only be used for the first $n - 1$ rounds; the final reduction round remains identical to the standard multi-precision reduction. This results in a saving of $n - 1$ digit multiplications, for a total of $n^2 + 1$ digit multiplications for the complete reduction. The first round of this algorithm for $n = 4$ is illustrated in the figure below.

<p style="text-align:center;">
  <img src="https://hackmd.io/_uploads/HJrGIa-j1e.png" alt="drawing" width="500"/>
</p>

By precomputing $\rho_i = r^{-i} \mod p$, multiple digits can be reduced at once. Given that the resulting product of $i$ reduced digits has $n + i$ digits, the maximum feasible reduction per step is half of the remaining digits to be reduced. This leads to $\log_2 n$ rounds, where each successive round halves the number of digits being removed concurrently. The first round in this logarithmic scheme for $n=4$ is illustrated in the figure below.

<p style="text-align:center;">
  <img src="https://hackmd.io/_uploads/H1wN8aZjkx.png" alt="drawing" width="500"/>
</p>

As before, this is followed by one standard multi-precision round. While this does not reduce the total number of digit multiplications, it minimizes dependencies, as all intra-round digit multiplications can be executed independently. A fully parallelized single-step approach can be achieved by multiplying each digit by its corresponding $\rho_i$. This is described by the following expression
$$
cr^{-(n-1)} \mod p = c'_{2n-1}r^n + \dots + c'_nr + c'_{n-1} + c'_{n-2}\rho_1 + \dots + c'_0\rho_{n-1}
$$
and illustrated for $n=4$ in the figure below.

<p style="text-align:center;">
  <img src="https://hackmd.io/_uploads/rkrwUTboJe.png" alt="drawing" width="500"/>
</p>

As before, this is followed by one standard multi-precision round.
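A Python sketch of the single-digit $n^2+1$ variant is given below, with the same illustrative parameters as before. The first $n-1$ rounds use the precomputed $\rho_1$ and cost $n$ digit multiplications each; the last round is a standard Montgomery round costing $n+1$, for $(n-1)\,n + (n+1) = n^2+1$ in total.

```python
# The n^2 + 1 Montgomery reduction: n - 1 cheap rounds plus one standard one.
w, n = 32, 4
r = 1 << w
R = r ** n
p = (1 << 127) - 1            # an illustrative prime that fits in n digits
mu_r = (-pow(p, -1, r)) % r   # standard-round constant, -p^{-1} mod r
rho_1 = pow(r, -1, p)         # the new precomputed constant r^{-1} mod p

def montgomery_reduce_fast(c: int) -> int:
    """Compute c * R^{-1} mod p using n^2 + 1 digit multiplications."""
    for _ in range(n - 1):              # n - 1 rounds of n digit mults each
        c_tilde = c % r                 # the digit lost when dividing by r
        c = (c >> w) + c_tilde * rho_1  # integer part + fraction folded mod p
    m = (mu_r * (c % r)) % r            # final round is a standard one
    c = (c + m * p) >> w
    while c >= p:                       # final reduction step
        c -= p
    return c

a, b = 1234567890123456789 % p, 9876543210987654321 % p
aR, bR = a * R % p, b * R % p
assert montgomery_reduce_fast(aR * bR) == a * b * R % p
```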
The counterpart Barrett reduction employs $\rho_i = Rr^i \mod p$ in the same, yet dual, manner.

## Transparent Barrett-Montgomery multiplications

As discussed previously, Montgomery reduction is more efficient for CPU systems, while Barrett reduction is more convenient, as it eliminates the need for transformations. This raises the question: can we leverage Montgomery reduction while achieving the overall behavior of Barrett? For many important cryptographic primitives, the answer is affirmative. In this section, we present two methods and three examples demonstrating how this can be achieved.

The first method involves multiplication by constants. Assume we have an input $x$ that we would like to multiply by the constant $K$. The desired output is $Kx$, yet a Montgomery multiplier would output $R^{-1}Kx$. To mend this, we can simply store the constant $RK$ in place of $K$. A Montgomery multiplier will now provide us with the desired output, indistinguishable to the external user. The first example that can utilize the multiplication-by-constants method is the [NTT](https://github.com/ingonyama-zk/papers/blob/main/ntt_201_book.pdf) primitive, since in NTT all multiplications are by constants. The second example that can utilize it is arithmetic hash functions such as [Poseidon2](https://eprint.iacr.org/2023/323.pdf). Many arithmetic hashes use linear mixing in the form of matrix multiplications where the matrix elements are constant.

The second method is slightly more involved and applies specifically to [projective elliptic curve addition](https://hackmd.io/@cailynyongyong/HkuoMtz6o). The [complete formula](https://eprint.iacr.org/2015/1060.pdf) for such an adder produces an output triplet $(x_o, y_o, z_o)$ from two inputs $(x_1, y_1, z_1)$ and $(x_2, y_2, z_2)$. Careful examination of the formula shows that a constant factor of $R^{-1}$ in the base-field multipliers used in the adder results in the weighted output $(R^{-3}x_o, R^{-3}y_o, R^{-3}z_o)$. Since the affine representation is a normalization of the $x, y$ coordinates by $z$, the output remains valid.

## Acknowledgments

The duality between Barrett and Montgomery reductions has always fascinated me, though I could never quite pinpoint it precisely. Over the past month, I’ve been working on a World Foundation project focused on client-side zero-knowledge proving. Optimizing modular multiplication for low-end mobile devices has given me the chance to explore this topic in depth, and I’m excited to share some of my findings!

I’d like to extend my gratitude to Remco Bloemen from World and my teammates on the project, Xander van der Goot from Nethermind and Koh Wei Jie, for insightful discussions. A special thanks to Suyash Bagad, Karthik Inbasekar, Tomer Solberg, and Omer Shlomovits from Ingonyama for their invaluable input.