---
title: goff - fast modular arithmetic in Go
description: goff (go finite field) is a tool that generates pure Go source code for fast modular arithmetic operations for any modulus. Developed at ConsenSys by Gautam Botrel, Gus Gutoski and Thomas Piellard.
image: https://i.imgur.com/iAj7POt.png
---

# Introducing goff: fast modular arithmetic in Go

<img style="display: block;margin: auto;" width="60%" src="https://i.imgur.com/UUppBSP.png">

#### We are proud to introduce goff (go finite field): a tool that generates pure Go source code for fast modular arithmetic operations for any modulus.

### goff at a glance

* goff produces pure Go code (no assembly^[until [this PR](https://github.com/ConsenSys/goff/pull/13)]) that is competitive with state-of-the-art libraries written in C++, Rust, or assembly.
* Modular multiplication is especially fast in goff. :fire: We discovered and implemented an algorithmic improvement that yields a 10%--15% speed gain over state-of-the-art algorithms. See our article [Faster big-integer modular multiplication for most moduli](https://hackmd.io/@zkteam/modular_multiplication) for details and benchmarks.
* The source code for goff is [available on GitHub](https://github.com/consensys/goff) under the [Apache 2.0 License](https://www.apache.org/licenses/LICENSE-2.0.html).

:::info
**Edit (Sept 2020)**: [you will find up-to-date benchmarks here](https://github.com/ConsenSys/gurvy#benchmarks).
:::

:::warning
:warning: **Warning** :warning:

goff has not been independently audited and is provided "as-is". Use goff at your own risk. goff offers no security guarantees such as constant-time implementation or side-channel attack resistance.
:::

## goff: a short example

The command

```bash
goff -o ./bn256/ -p bn256 -e Element -m 21888242871839275222246405745257275088696311157297823662689037894645226208583
```

outputs three Go source files in ./bn256/:

* element.go
* element_test.go
* arith.go

These files define a Go type (Element in this example) that behaves similarly to big.Int from the Go standard library. For example, modular multiplication has the following signature:

```go
// Mul z = x * y mod q
func (z *Element) Mul(x, y *Element) *Element
```

It can be used like so:

```go
var a, b Element
a.SetUint64(2)
b.SetString("984896738")

a.Mul(&a, &b) // alternatively: a.MulAssign(&b)

a.Sub(&a, &a).
	Add(&a, &b).
	Inv(&a)

b.Exp(b, 42)
b.Neg(&b)
```

:arrow_right: Visit our [GitHub repository](https://github.com/consensys/goff) for details on how to install and use goff.

## Why not use math/big?

The Go standard library provides a math/big package capable of modular arithmetic. Why didn't we use it?

Our journey started with math/big, but we quickly noticed that math/big is 10 times slower than optimized libraries written in Rust or C++, for several reasons:

* big.Int is a general-purpose library. It is not targeted at modular arithmetic, so core operations such as integer multiplication do not take advantage of fast algorithms for this task.
* big.Int is meant to handle integers of varying size. Integers are stored as slices, and these slices are constantly expanding and shrinking in memory.
* big.Int memory usage patterns induce significant overhead from Go's garbage collector.

### The cost of garbage collection

It is interesting to see just how much overhead is incurred from garbage collection.
Consider the following program^[adapted from [Avoiding high GC overhead with large heaps](https://blog.gopheracademy.com/advent-2018/avoid-gc-overhead-large-heaps/)], which allocates a slice of 100M big.Int:

```go=
func main() {
	a := make([]big.Int, 1e8)
	for i := 0; i < 10; i++ {
		start := time.Now()
		runtime.GC()
		fmt.Printf("GC took %s\n", time.Since(start))
	}
	runtime.KeepAlive(a)
}
```

This program tells us that garbage collection takes 190ms each time it's invoked, just to iterate through the slice. We can reduce this to 100 microseconds by replacing big.Int with [4]uint64, as arrays (not slices) are ignored by the garbage collector:

```go
a := make([][4]uint64, 1e8)
```

### The inescapable conclusion: build it ourselves

In many cryptographic applications (including [our application](https://hackmd.io/@zkteam/gnark) to zero-knowledge proofs) the modulus is fixed and known in advance. This knowledge makes it practical to store multi-precision integers in fixed-size arrays, which allows us to avoid costly memory allocations.

Of course, the use of arrays instead of big.Int means that we must implement our own arithmetic. But we need to do this anyway in order to take advantage of fast algorithms for modular arithmetic. We could not have achieved state-of-the-art performance any other way.

## How fast is goff?

In this article we focus on modular multiplication, which is the most performance-critical operation in many cryptographic applications.

goff automatically generates rudimentary benchmark code in element_test.go. A deep dive into benchmarks can be found in our companion article [Faster big-integer modular multiplication for most moduli](https://hackmd.io/@zkteam/modular_multiplication#Benchmarks). Those benchmarks show that goff outperforms other optimized libraries written in Rust, C++, or assembly. Most other goff operations (addition, subtraction, etc.) are also competitive with state-of-the-art.
(The modular inverse operation is an exception---we didn't bother to optimize it because it's rarely invoked in our setting.)

### High-performance Go: a case study in modular multiplication

For a 6-word modulus, we shrank the execution time of modular multiplication from roughly 500ns to 50ns, a 10x improvement. How did we do it?^[[Dave Cheney's workshop materials](https://dave.cheney.net/high-performance-go-workshop/gophercon-2019.html) are a great resource for high-performance Go.]

#### Avoid memory allocations and garbage collection ==500ns $\to$ 160ns==

As discussed above, avoiding memory allocations by using fixed-size arrays instead of big.Int yields a significant performance improvement.

#### Prefer math/bits over assembly

[math/bits](https://golang.org/pkg/math/bits/) provides blazingly fast uint64 addition, subtraction and multiplication. The Go compiler can recognize some function signatures (like the ones in math/bits) and replace them with the best platform-specific instructions. Thanks to these [Go compiler intrinsics](https://dave.cheney.net/2019/08/20/go-compiler-intrinsics), we quickly eliminated the need to write assembly code.

There are several advantages to avoiding the use of assembly code. For example:

* Assembly is not easy to maintain, nor to test across platforms.
* The Go compiler won't inline assembly code, but will inline math/bits functions.

Go does not allow developers to force the compiler to inline a function, which can hurt performance in some cases. However, since goff generates Go source code, we are able to bypass this restriction and force inlining simply by instructing goff to replicate certain code snippets throughout the codebase.

#### Flatten for loops ==160ns $\to$ 90ns==

The Go compiler implements every for loop with JMP and CMP instructions, even if the number of iterations is known at compile time (for example, `for i := 0; i < 6; i++`).
To combat this, goff code generation uses templates to unroll loops:

```go
{{- range $i := .NbWordsIndexesFull}}
	z[{{$i}}] = x[{{$i}}]
{{- end}}
```

Loop flattening may not always be the best option, and can become impractical as the code size grows due to caching issues. But we used it to considerable advantage in goff.

#### Disassemble and optimize low-level arithmetic ==90ns $\to$ 65ns==

In the algorithm we implemented for modular multiplication, most of the work is done by a function that computes:

$$(hi, lo) = a*b + c + d$$

Our initial implementation looked like this:

```go
func madd2(a, b, c, d uint64) (hi uint64, lo uint64) {
	a, b = bits.Mul64(a, b)
	lo, c = bits.Add64(c, d, 0)
	lo, hi = bits.Add64(b, lo, 0)
	hi, _ = bits.Add64(a, c, hi)
	return
}
```

The above code runs in 1.6ns. You might think there isn't much room for improvement. Surprisingly, *adding* an instruction made it run in 1.1ns:

```go
// madd2 hi, lo = a*b + c + d
func madd2(a, b, c, d uint64) (hi uint64, lo uint64) {
	var carry uint64
	hi, lo = bits.Mul64(a, b)
	c, carry = bits.Add64(c, d, 0)
	hi, _ = bits.Add64(hi, 0, carry)
	lo, carry = bits.Add64(lo, c, 0)
	hi, _ = bits.Add64(hi, 0, carry)
	return
}
```

The assembly output of the first version showed SBBQ and NEGQ instructions that we didn't expect. It turns out CPUs have a [special register to store the carry](https://en.wikipedia.org/wiki/Carry_flag) from the ADD instruction, and "saving it for later" is counter-productive.

#### Do less ==65ns $\to$ 50ns==

Hunting these last nanoseconds came from an algorithmic improvement described in [our companion article](https://hackmd.io/@zkteam/modular_multiplication#Implementation), and from trimming the edges. Typically, the first iteration of a for loop may have some variables always set to 0, or the final iteration may write directly to the result, avoiding a few extra copies.
Some tricks are obvious, like:

```go
// IsZero returns z == 0
func (z *Element) IsZero() bool {
	return (z[3] | z[2] | z[1] | z[0]) == 0
}
```

instead of

```go
// IsZero returns z == 0
func (z *Element) IsZero() bool {
	return z[3] == 0 && z[2] == 0 && z[1] == 0 && z[0] == 0
}
```

And some are more obscure, like scoping variables (a hint to the compiler that it doesn't need to MOV an unneeded value) or avoiding cache misses.

#### Cache friendliness

There is a lot to say about CPU caches---[here is a good read](https://lwn.net/Articles/252125/)---and not many (any?) tools in Go to help developers understand what's happening. A widely known rule of thumb that helped us make goff faster: consider memory access patterns. CPUs fetch an entire cache line at a time---use what you access immediately, rather than scattering accesses across a function.

#### A strange example: 3 variables or a length-3 array?

In goff with a 12-word (768-bit) modulus, declaring (1)

```go
var c [3]uint64
```

or (2)

```go
var c0, c1, c2 uint64
```

results in a modular multiplication execution time of ~800ns for (1) or ~220ns for (2)! :exploding_head: Yet, for moduli smaller than 11 words, (1) is faster than (2). We would love to know why. :slightly_smiling_face:

#### Generating x64 assembly ==50ns $\to$ 33ns== (:exploding_head:)

More on that [here](https://www.intel.com/content/dam/www/public/us/en/documents/white-papers/ia-large-integer-arithmetic-paper.pdf).

## What's next?

Our main focus is [gnark](https://hackmd.io/@zkteam/gnark), a fast library for zero-knowledge proofs. Work on gnark may drive more effort into [goff](https://github.com/consensys/goff) in the coming months. Next steps for goff include:

- templates (code generation) clean-up
- extensive testing
- constant-time versions of some APIs
- further optimizations for specific modulus sizes
- security audit

:+1: [Issues](https://github.com/consensys/goff/issues), new feature suggestions and PRs are welcome.
---

<img style="float:left;width:90px;margin-right:20px;" src="https://i.imgur.com/tYQ8i9r.png">

*goff was written by [Gautam Botrel](https://www.linkedin.com/in/gautam-botrel/), [Gus Gutoski](https://www.linkedin.com/in/ggutoski/), and [Thomas Piellard](https://www.linkedin.com/in/thomas-piellard-53b881129/).*

*We're a research team at [ConsenSys](https://consensys.net/). If you are interested in our work (fast finite field arithmetic, elliptic curve pairings, zero-knowledge proofs), [give us a shout](mailto:zkteam@consensys.net).*