--- title: goff - fast modular arithmetic in Go description: goff (go finite field) is a tool that generates pure Go source code for fast modular arithmetic operations for any modulus. Developed at ConsenSys by Gautam Botrel, Gus Gutoski and Thomas Piellard. image: https://i.imgur.com/iAj7POt.png --- # Introducing `goff`: fast modular arithmetic in Go <img style="display: block;margin: auto;" width="60%" src="https://i.imgur.com/UUppBSP.png"> #### We are proud to introduce `goff` (go finite field): a tool that generates pure Go source code for fast modular arithmetic operations for any modulus. ### `goff` at a glance * `goff` produces pure Go code (no assembly^[until [this PR](https://github.com/ConsenSys/goff/pull/13)]) that is competitive with state-of-the-art libraries written in C++, Rust, or assembly. * Modular multiplication is especially fast in `goff`. :fire: We discovered and implemented an algorithmic improvement that yields a 10%--15% speed gain over state-of-the-art algorithms. See our article [Faster big-integer modular multiplication for most moduli](https://hackmd.io/@zkteam/modular_multiplication) for details and benchmarks. * The source code for `goff` is [available on GitHub](https://github.com/consensys/goff) under the [Apache 2.0 License](https://www.apache.org/licenses/LICENSE-2.0.html). :::info **Edit (Sept 2020)**: [you will find up to date benchmarks here](https://github.com/ConsenSys/gurvy#benchmarks). ::: :::warning :warning: **Warning** :warning: `goff` has not been independently audited and is provided "as-is". Use `goff` at your own risk. `goff` offers no security guarantees such as constant-time implementation or side-channel attack resistance. ::: ## `goff`: a short example The command ```bash goff -o ./bn256/ -p bn256 -e Element -m 21888242871839275222246405745257275088696311157297823662689037894645226208583 ``` outputs three Go source files in `./bn256/`: * `element.go` * `element_test.go` * `arith.go` These files define a Go type (`Element` in this example) that behaves similarly to `big.Int` from the Go standard library. For example, modular multiplication has the following signature: ```go // Mul z = x * y mod q func (z *Element) Mul(x, y *Element) *Element ``` It can be used like so: ```go var a, b Element a.SetUint64(2) b.SetString("984896738") a.Mul(a, b) // alternatively: a.MulAssign(b) a.Sub(a, a) .Add(a, b) .Inv(a) b.Exp(b, 42) b.Neg(b) ``` :arrow_right: Visit our [GitHub repository](https://github.com/consensys/goff) for details on how to install and use `goff`. ## Why not use `math/big`? The Go standard library provides a `math/big` package capable of modular arithmetic. Why didn't we use it? Our journey started with `math/big` but we quickly noticed that `math/big` is 10 times slower than optimized libraries written in Rust or C++ for several reasons: * `big.Int` is a general-purpose library. It is not targeted for modular arithmetic, so core operations such as integer multiplication do not take advantage of fast algorithms for this task. * `big.Int` is meant to handle integers of varying size. Integers are stored as slices, and these slices are constantly expanding and shrinking in memory. * `big.Int` memory usage patterns induce significant overhead from Go's garbage collector. ### The cost of garbage collection It is interesting to see just how much overhead is incurred from garbage collection. Consider the following program^[adapted from [Avoiding high GC overhead with large heaps](https://blog.gopheracademy.com/advent-2018/avoid-gc-overhead-large-heaps/)], which allocates a slice of 100M `big.Int`: ```go= func main() { a := make([]big.Int, 1e8) for i := 0; i < 10; i++ { start := time.Now() runtime.GC() fmt.Printf("GC took %s\n", time.Since(start)) } runtime.KeepAlive(a) } ``` This program tells us that garbage collection takes 190ms each time it’s invoked, just to iterate through the slice. We can reduce this to 100 microseconds by replacing `big.Int` with `[4]uint64`, as arrays (not slices) are ignored by the garbage collector: ```go a := make([][4]uint64, 1e8) ``` ### The inescapable conclusion: build it ourselves In many cryptographic applications (including [our application](https://hackmd.io/@zkteam/gnark) to zero-knowledge proofs) the modulus is fixed and known in advance. This knowledge makes it practical to store multi-precision integers in fixed-size arrays, which allows us to avoid costly memory allocations. Of course, the use of arrays instead of `big.Int` means that we must implement our own arithmetic. But we need to do this anyway in order to take advantage of fast algorithms for modular arithmetic. We could not have achieved state-of-the-art prformance any other way. ## How fast is `goff`? In this article we focus on modular multiplication, which is the most performance-critical operation in many cryptographic applications. `goff` automatically generates rudimentary benchmark code in `element_test.go`. A deep dive into benchmarks can be found in our companion article [Faster big-integer modular multiplication for most moduli](https://hackmd.io/@zkteam/modular_multiplication#Benchmarks). Those benchmarks show that `goff` outperforms other optimized libraries written in Rust, C++, or assembly. Most other `goff` operations (addition, subtraction, etc) are also competitive with state-of-the-art. (The modular inverse operation is an exception---we didn't bother to optimize it because it's rarely invoked in our setting.) ### High performance Go: case study in modular multiplication For a 6-word modulus, we shrinked execution time of the modular multiplication from roughly `500ns` to `50ns`, a 10x improvement. How did we do it?^[[Dave Cheney's workshop materials](https://dave.cheney.net/high-performance-go-workshop/gophercon-2019.html) is a great resource for high-performance Go.] #### Avoid memory allocations, garbage collection ==`500ns` $\to$ `160ns`== As discussed above, avoiding memory allocations using fixed size arrays instead of `big.Int` yields a significant performance improvment. #### Prefer `math/bits` over assembly [`math/bits`](https://golang.org/pkg/math/bits/) provides blazingly fast `uint64` addition, subtraction and multiplication. The Go compiler can recognize some function signatures (like the ones in `math/bits`) and replace them with the best platform-specific instructions. Thanks to these [Go compiler intrinsics](https://dave.cheney.net/2019/08/20/go-compiler-intrinsics), we quickly eliminated the need to write assembly code. There are several advantages to avoiding the use of assembly code. For example: * Assembly is not easy to maintain, nor to test across platforms. * The Go compiler won’t inline assembly code, but will inline `math/bits` functions. Go does not allow developers to force the compiler to inline a function, which can hurt performance in some cases. However, since `goff` generates Go source code, we are able to bypass this restriction and force inlining simply by instructing `goff` to replicate certain code snippets thoughout the codebase. #### Flatten `for` loops ==`160ns` $\to$ `90ns`== The Go compiler implements every `for` loop with `JMP` and `CMP` instructions, even if the number of iterations is known at compile-time. (Example: `for i:=0; i<6; i++`.) To combat this, `goff` code generation uses templates to unroll loops: ```go {{- range $i := .NbWordsIndexesFull}} z[{{$i}}] = x[{{$i}}] {{- end}} ``` Loop flattening may not always be the best option, and can become impractical as the code size grows due to caching issues. But we used it to considerable advantage in `goff`. #### Dissassemble and optimize low-level arithmetics ==`90ns` $\to$ `65ns`== In the algorithm we implemented for modular multiplication, most of the work is done by a function that computes: $$(hi, lo) = a*b + c + d$$ Our initial implementations looked like this: ```go func madd2(a, b, c, d uint64) (hi uint64, lo uint64) { a, b = bits.Mul64(a, b) lo, c = bits.Add64(c, d, 0) lo, hi = bits.Add64(b, lo, 0) hi, _ = bits.Add64(a, c, hi) return } ``` The above code runs in `1.6ns`. You might think here there isn't much room for improvement. Surprisingly, *adding* an instruction made it run in `1.1ns`: ```go // madd2 hi, lo = a*b + c + d func madd2(a, b, c, d uint64) (hi uint64, lo uint64) { var carry uint64 hi, lo = bits.Mul64(a, b) c, carry = bits.Add64(c, d, 0) hi, _ = bits.Add64(hi, 0, carry) lo, carry = bits.Add64(lo, c, 0) hi, _ = bits.Add64(hi, 0, carry) return } ``` The assembly output of the first version showed `SBBQ` and `NEGQ` instructions that we didn't expect. Turns out CPUs have a [special register to store the carry](https://en.wikipedia.org/wiki/Carry_flag) from the `ADD`, and "saving it for later" is counter-productive. #### Do less ==`65ns` $\to$ `50ns`== Hunting these last nanoseconds came from an algorithmic improvment described in [our companion article](https://hackmd.io/@zkteam/modular_multiplication#Implementation), and from trimming the edges. Typically, the first iteration of a `for` loop may have some variables always set to 0, or the final iteration may write directly to the result, avoiding a few extra copies. Some tricks are obvious, like: ```go // IsZero returns z == 0 func (z *Element) IsZero() bool { return (z[3] | z[2] | z[1] | z[0]) == 0 } ``` instead of ```go // IsZero returns z == 0 func (z *Element) IsZero() bool { return z[3] == 0 && z[2] == 0 && z[1] == 0 && z[0] == 0 } ``` And some are more obscure. Like scoping variables (it's a hint to the compiler that it doesn't need to `MOV` an un-needed value) or cache misses. #### Cache friendliness There is a lot to say about CPU caches---[here is good read](https://lwn.net/Articles/252125/), and not many (any?) tools in Go to help developers understand what's happening. A widely known rule of thumb that helped us make `goff` faster: consider memory access patterns. CPUs fetch an entire line in the cache at a time---consider using what you access immediatly, and not scattered accross a function. #### A strange example: 3 variables or a length-3 array? `goff` with a 12-word (768-bit) modulus, declaring (1) ```go var c [3]uint64 ``` or (2) ```go var c0, c1, c2 uint64 ``` will result in an execution time of the modular multiplication of `~800ns`(1) or `~220ns`(2) !:exploding_head: Yet, for moduli smaller than 11 words, (1) is faster than (2). We would love to know why. :slightly_smiling_face: #### Generating x64 assembly ==`50ns` $\to$ `33ns`== (:exploding_head:) More on that [here](https://www.intel.com/content/dam/www/public/us/en/documents/white-papers/ia-large-integer-arithmetic-paper.pdf). ## What's next? Our main focus is [`gnark`](https://hackmd.io/@zkteam/gnark), a fast library for zero-knowledge proofs. Work on `gnark` may drive more effort into [`goff`](https://github.com/consensys/goff) in the coming months. Next steps for `goff` include: - templates (code generation) clean up - extensive testing - constant time version of some APIs - further optimizations for specific modulus sizes - security audit :+1: [Issues](https://github.com/consensys/goff/issues), new features suggestions and PRs are welcome. --- <img style="float:left;width:90px;margin-right:20px;" src="https://i.imgur.com/tYQ8i9r.png"> *`goff` was written by [Gautam Botrel](https://www.linkedin.com/in/gautam-botrel/), [Gus Gutoski](https://www.linkedin.com/in/ggutoski/), and [Thomas Piellard](https://www.linkedin.com/in/thomas-piellard-53b881129/).* *We’re a research team at [ConsenSys](https://consensys.net/). If you are interested in our work (fast finite field arithmetic, elliptic curve pairings, zero-knowledge proofs), [give us a shout](mailto:zkteam@consensys.net).*