goff: fast modular arithmetic in Go

goff (go finite field): a tool that generates pure Go source code for fast modular arithmetic operations for any modulus.

goff at a glance

goff produces pure Go code (no assembly[1]) that is competitive with state-of-the-art libraries written in C++, Rust, or assembly. goff is available on GitHub under the Apache 2.0 License.

Edit (Sept 2020): you will find up-to-date benchmarks here.
goff has not been independently audited and is provided "as-is". Use goff at your own risk. goff offers no security guarantees such as constant-time implementation or side-channel attack resistance.
goff: a short example

The command

goff -o ./bn256/ -p bn256 -e Element -m 21888242871839275222246405745257275088696311157297823662689037894645226208583

outputs three Go source files in ./bn256/:
element.go
element_test.go
arith.go
These files define a Go type (Element in this example) that behaves similarly to big.Int from the Go standard library.
For example, modular multiplication has the following signature:
// Mul z = x * y mod q
func (z *Element) Mul(x, y *Element) *Element
It can be used like so:
var a, b Element
a.SetUint64(2)
b.SetString("984896738")
a.Mul(a, b) // alternatively: a.MulAssign(b)
a.Sub(a, a).Add(a, b).Inv(a)
b.Exp(b, 42)
b.Neg(b)
goff vs. math/big?

The Go standard library provides a math/big package capable of modular arithmetic. Why didn't we use it?
Our journey started with math/big, but we quickly noticed that math/big is 10 times slower than optimized libraries written in Rust or C++, for several reasons:

- big.Int is a general-purpose library. It is not targeted at modular arithmetic, so core operations such as integer multiplication do not take advantage of fast algorithms for this task.
- big.Int is meant to handle integers of varying size. Integers are stored as slices, and these slices are constantly expanding and shrinking in memory.
- big.Int memory usage patterns induce significant overhead from Go's garbage collector.

It is interesting to see just how much overhead is incurred from garbage collection. Consider the following program[1], which allocates a slice of 100M big.Int:
:
package main

import (
	"fmt"
	"math/big"
	"runtime"
	"time"
)

func main() {
	a := make([]big.Int, 1e8)
	for i := 0; i < 10; i++ {
		start := time.Now()
		runtime.GC()
		fmt.Printf("GC took %s\n", time.Since(start))
	}
	runtime.KeepAlive(a)
}
This program tells us that garbage collection takes about 190ms each time it's invoked, just to iterate through the slice. We can reduce this to 100 microseconds by replacing big.Int with [4]uint64: because these fixed-size arrays contain no pointers, the garbage collector skips them entirely:
a := make([][4]uint64, 1e8)
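To see the flip side directly, here is a sketch of the same experiment with pointer-free [4]uint64 arrays (gcPauses is our own illustrative helper, and we use a smaller slice here so the example runs quickly; it is not part of goff):

```go
package main

import (
	"fmt"
	"runtime"
	"time"
)

// gcPauses allocates a slice of n pointer-free [4]uint64 arrays and
// times a few forced collections. Because the backing array contains
// no pointers, the garbage collector can skip it when scanning.
func gcPauses(n, rounds int) []time.Duration {
	a := make([][4]uint64, n)
	out := make([]time.Duration, 0, rounds)
	for i := 0; i < rounds; i++ {
		start := time.Now()
		runtime.GC()
		out = append(out, time.Since(start))
	}
	runtime.KeepAlive(a)
	return out
}

func main() {
	for _, d := range gcPauses(1e7, 3) {
		fmt.Printf("GC took %s\n", d)
	}
}
```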
In many cryptographic applications (including our application to zero-knowledge proofs) the modulus is fixed and known in advance. This knowledge makes it practical to store multi-precision integers in fixed-size arrays, which allows us to avoid costly memory allocations.
Of course, the use of arrays instead of big.Int means that we must implement our own arithmetic. But we need to do this anyway in order to take advantage of fast algorithms for modular arithmetic. We could not have achieved state-of-the-art performance any other way.
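As an illustration of what "our own arithmetic" on fixed-size arrays looks like, here is a minimal sketch of modular addition for a hypothetical 4-word modulus q = 2^256 - 189. This is not goff's generated code, just the general shape of allocation-free, fixed-width arithmetic:

```go
package main

import (
	"fmt"
	"math/bits"
)

// element is a 4-word (256-bit) integer, least-significant word first.
type element [4]uint64

// q = 2^256 - 189, a hypothetical modulus chosen for illustration.
var q = element{0xffffffffffffff43, 0xffffffffffffffff, 0xffffffffffffffff, 0xffffffffffffffff}

// add computes z = x + y mod q with no heap allocation: everything
// lives in fixed-size arrays and CPU registers.
func (z *element) add(x, y *element) {
	var carry uint64
	z[0], carry = bits.Add64(x[0], y[0], 0)
	z[1], carry = bits.Add64(x[1], y[1], carry)
	z[2], carry = bits.Add64(x[2], y[2], carry)
	z[3], carry = bits.Add64(x[3], y[3], carry)
	// Since x, y < q, one conditional subtraction of q suffices.
	if carry != 0 || !z.lessThan(&q) {
		var b uint64
		z[0], b = bits.Sub64(z[0], q[0], 0)
		z[1], b = bits.Sub64(z[1], q[1], b)
		z[2], b = bits.Sub64(z[2], q[2], b)
		z[3], _ = bits.Sub64(z[3], q[3], b)
	}
}

// lessThan reports whether z < x, comparing from the most significant word.
func (z *element) lessThan(x *element) bool {
	for i := 3; i >= 0; i-- {
		if z[i] != x[i] {
			return z[i] < x[i]
		}
	}
	return false
}

func main() {
	// (q - 1) + 5 mod q = 4
	x := element{0xffffffffffffff42, 0xffffffffffffffff, 0xffffffffffffffff, 0xffffffffffffffff}
	y := element{5, 0, 0, 0}
	var z element
	z.add(&x, &y)
	fmt.Println(z) // [4 0 0 0]
}
```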
How fast is goff?

In this article we focus on modular multiplication, which is the most performance-critical operation in many cryptographic applications.
goff automatically generates rudimentary benchmark code in element_test.go. A deep dive into benchmarks can be found in our companion article, Faster big-integer modular multiplication for most moduli. Those benchmarks show that goff outperforms other optimized libraries written in Rust, C++, or assembly.
Most other goff operations (addition, subtraction, etc.) are also competitive with the state of the art. (The modular inverse operation is an exception: we didn't bother to optimize it because it's rarely invoked in our setting.)
For a 6-word modulus, we shrank the execution time of modular multiplication from roughly 500ns to 50ns, a 10x improvement. How did we do it?[1]

From 500ns to 160ns: no memory allocation
As discussed above, avoiding memory allocations by using fixed-size arrays instead of big.Int yields a significant performance improvement.
math/bits over assembly

math/bits provides blazingly fast uint64 addition, subtraction, and multiplication. The Go compiler recognizes some function signatures (like the ones in math/bits) and replaces them with the best platform-specific instructions.
Thanks to these Go compiler intrinsics, we quickly eliminated the need to write assembly code. There are several advantages to avoiding assembly; for example, the generated code remains portable, readable pure Go built on math/bits functions.

Go does not allow developers to force the compiler to inline a function, which can hurt performance in some cases. However, since goff generates Go source code, we are able to bypass this restriction and force inlining simply by instructing goff to replicate certain code snippets throughout the codebase.
From 160ns to 90ns: unrolling for loops
The Go compiler implements every for loop with JMP and CMP instructions, even if the number of iterations is known at compile time (example: for i := 0; i < 6; i++). To combat this, goff's code generation uses templates to unroll loops:
{{- range $i := .NbWordsIndexesFull}}
z[{{$i}}] = x[{{$i}}]
{{- end}}
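The fragment above can be exercised directly with Go's text/template package. Here is a small sketch showing the unrolled code it expands to for a 4-word element (the field name NbWordsIndexesFull matches the template; the expand helper is ours, not goff's):

```go
package main

import (
	"fmt"
	"strings"
	"text/template"
)

// src is the loop-unrolling template fragment shown above.
const src = `{{- range $i := .NbWordsIndexesFull}}
z[{{$i}}] = x[{{$i}}]
{{- end}}`

// expand renders the template for word indexes 0..nbWords-1.
func expand(nbWords int) string {
	indexes := make([]int, nbWords)
	for i := range indexes {
		indexes[i] = i
	}
	var b strings.Builder
	tmpl := template.Must(template.New("unroll").Parse(src))
	if err := tmpl.Execute(&b, struct{ NbWordsIndexesFull []int }{indexes}); err != nil {
		panic(err)
	}
	return b.String()
}

func main() {
	fmt.Println(expand(4))
}
```

Each loop iteration becomes a straight-line assignment, with no JMP or CMP in the generated source.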
Loop unrolling may not always be the best option, and it can become impractical as the code size grows, due to caching issues. But we used it to considerable advantage in goff.
From 90ns to 65ns

In the algorithm we implemented for modular multiplication, most of the work is done by a function that computes hi, lo = a*b + c + d. Our initial implementation looked like this:
func madd2(a, b, c, d uint64) (hi uint64, lo uint64) {
a, b = bits.Mul64(a, b)
lo, c = bits.Add64(c, d, 0)
lo, hi = bits.Add64(b, lo, 0)
hi, _ = bits.Add64(a, c, hi)
return
}
The above code runs in 1.6ns. You might think there isn't much room for improvement. Surprisingly, adding an instruction made it run in 1.1ns:
// madd2 hi, lo = a*b + c + d
func madd2(a, b, c, d uint64) (hi uint64, lo uint64) {
var carry uint64
hi, lo = bits.Mul64(a, b)
c, carry = bits.Add64(c, d, 0)
hi, _ = bits.Add64(hi, 0, carry)
lo, carry = bits.Add64(lo, c, 0)
hi, _ = bits.Add64(hi, 0, carry)
return
}
The assembly output of the first version showed SBBQ and NEGQ instructions that we didn't expect. It turns out CPUs have a special register (the carry flag) to store the carry from an ADD, and "saving it for later" is counter-productive.
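When rewriting a primitive like this, it is cheap to cross-check both versions against math/big on a few inputs. A sketch (maddSlow is our own name for the first version above):

```go
package main

import (
	"fmt"
	"math/big"
	"math/bits"
)

// maddSlow is the first version from above; madd2 is the faster rewrite.
func maddSlow(a, b, c, d uint64) (hi uint64, lo uint64) {
	a, b = bits.Mul64(a, b)
	lo, c = bits.Add64(c, d, 0)
	lo, hi = bits.Add64(b, lo, 0)
	hi, _ = bits.Add64(a, c, hi)
	return
}

// madd2 hi, lo = a*b + c + d
func madd2(a, b, c, d uint64) (hi uint64, lo uint64) {
	var carry uint64
	hi, lo = bits.Mul64(a, b)
	c, carry = bits.Add64(c, d, 0)
	hi, _ = bits.Add64(hi, 0, carry)
	lo, carry = bits.Add64(lo, c, 0)
	hi, _ = bits.Add64(hi, 0, carry)
	return
}

func main() {
	max := ^uint64(0)
	for _, v := range [][4]uint64{{2, 3, 4, 5}, {max, max, max, max}, {max, 2, max, 1}} {
		a, b, c, d := v[0], v[1], v[2], v[3]
		h1, l1 := maddSlow(a, b, c, d)
		h2, l2 := madd2(a, b, c, d)
		// Reference result computed with math/big.
		want := new(big.Int).Mul(new(big.Int).SetUint64(a), new(big.Int).SetUint64(b))
		want.Add(want, new(big.Int).SetUint64(c)).Add(want, new(big.Int).SetUint64(d))
		got := new(big.Int).Lsh(new(big.Int).SetUint64(h1), 64)
		got.Add(got, new(big.Int).SetUint64(l1))
		if h1 != h2 || l1 != l2 || got.Cmp(want) != 0 {
			panic("madd2 mismatch")
		}
	}
	fmt.Println("both versions agree")
}
```

Note that a*b + c + d cannot overflow 128 bits: the maximum is (2^64-1)^2 + 2*(2^64-1) = 2^128 - 1.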
From 65ns to 50ns

The last nanoseconds came from an algorithmic improvement described in our companion article, and from trimming the edges. Typically, the first iteration of a for loop may have some variables that are always 0, or the final iteration may write directly to the result, avoiding a few extra copies.
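As a sketch of such edge-trimming (our own illustrative helper, not goff's generated code), here is a 4-word-by-1-word multiplication where the first iteration is specialized because there is nothing to accumulate yet:

```go
package main

import (
	"fmt"
	"math/bits"
)

// mulWord computes the 5-word product z = x * w for a 4-word x.
// The first iteration is trimmed: with no previous high word to add,
// a bare Mul64 replaces the general multiply-add.
func mulWord(x [4]uint64, w uint64) (z [5]uint64) {
	// First iteration, specialized.
	z[1], z[0] = bits.Mul64(x[0], w)
	// Remaining iterations follow the general pattern.
	for i := 1; i < 4; i++ {
		hi, lo := bits.Mul64(x[i], w)
		var carry uint64
		z[i], carry = bits.Add64(z[i], lo, 0)
		z[i+1], _ = bits.Add64(hi, 0, carry)
	}
	return
}

func main() {
	fmt.Println(mulWord([4]uint64{2, 0, 0, 0}, 3)) // [6 0 0 0 0]
}
```

In goff's generated (unrolled) code, such specializations cost nothing to apply everywhere, since the template simply emits different code for the first and last iterations.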
Some tricks are obvious, like:
// IsZero returns z == 0
func (z *Element) IsZero() bool {
return (z[3] | z[2] | z[1] | z[0]) == 0
}
instead of
// IsZero returns z == 0
func (z *Element) IsZero() bool {
return z[3] == 0 && z[2] == 0 && z[1] == 0 && z[0] == 0
}
And some are more obscure, like scoping variables (a hint to the compiler that it doesn't need to MOV an unneeded value) or avoiding cache misses. There is a lot to say about CPU caches (here is a good read), and there are not many (any?) tools in Go to help developers understand what's happening.
A widely known rule of thumb that helped us make goff faster: consider memory access patterns. CPUs fetch an entire cache line at a time, so use what you access immediately instead of scattering accesses across a function.
For example, in goff with a 12-word (768-bit) modulus, declaring (1) var c [3]uint64 or (2) var c0, c1, c2 uint64 results in a modular multiplication execution time of ~800ns for (1) versus ~220ns for (2)!
From 50ns to 33ns

(More on that here.)

What's next?

Our main focus is gnark, a fast library for zero-knowledge proofs. Work on gnark may drive more effort into goff in the coming months. Next steps for goff include:
goff was written by Gautam Botrel, Gus Gutoski, and Thomas Piellard.

We're a research team at ConsenSys. If you are interested in our work (fast finite field arithmetic, elliptic curve pairings, zero-knowledge proofs), give us a shout.