Introducing `goff`: fast modular arithmetic in Go

Image Not Showing Possible Reasons

The image file may be corrupted
The server hosting the image is unavailable
The image path is incorrect
The image format is not supported

Learn More →

We are proud to introduce `goff` (go finite field): a tool that generates pure Go source code for fast modular arithmetic operations for any modulus.

`goff` at a glance

goff produces pure Go code (no assembly^[1]) that is competitive with state-of-the-art libraries written in C++, Rust, or assembly.
Modular multiplication is especially fast in goff.
Image Not Showing Possible Reasons
- The image file may be corrupted
- The server hosting the image is unavailable
- The image path is incorrect
- The image format is not supported
Learn More →
We discovered and implemented an algorithmic improvement that yields a 10%–15% speed gain over state-of-the-art algorithms. See our article Faster big-integer modular multiplication for most moduli for details and benchmarks.
The source code for goff is available on GitHub under the Apache 2.0 License.

Edit (Sept 2020): you will find up to date benchmarks here.

Image Not Showing Possible Reasons

The image file may be corrupted
The server hosting the image is unavailable
The image path is incorrect
The image format is not supported

Learn More →

Warning

Image Not Showing Possible Reasons

The image file may be corrupted
The server hosting the image is unavailable
The image path is incorrect
The image format is not supported

Learn More →

goff has not been independently audited and is provided "as-is". Use goff at your own risk. goff offers no security guarantees such as constant-time implementation or side-channel attack resistance.

`goff`: a short example

The command

goff -o ./bn256/ -p bn256 -e Element -m 21888242871839275222246405745257275088696311157297823662689037894645226208583

outputs three Go source files in ./bn256/:

element.go
element_test.go
arith.go

These files define a Go type (Element in this example) that behaves similarly to big.Int from the Go standard library.

For example, modular multiplication has the following signature:

// Mul z = x * y mod q
func (z *Element) Mul(x, y *Element) *Element

It can be used like so:

var a, b Element
a.SetUint64(2)
b.SetString("984896738")

a.Mul(a, b) // alternatively: a.MulAssign(b)

a.Sub(a, a)
 .Add(a, b)
 .Inv(a)
 
b.Exp(b, 42)
b.Neg(b)

Image Not Showing Possible Reasons

The image file may be corrupted
The server hosting the image is unavailable
The image path is incorrect
The image format is not supported

Learn More →

Visit our GitHub repository for details on how to install and use goff.

Why not use `math/big`?

The Go standard library provides a math/big package capable of modular arithmetic. Why didn't we use it?

Our journey started with math/big but we quickly noticed that math/big is 10 times slower than optimized libraries written in Rust or C++ for several reasons:

big.Int is a general-purpose library. It is not targeted for modular arithmetic, so core operations such as integer multiplication do not take advantage of fast algorithms for this task.
big.Int is meant to handle integers of varying size. Integers are stored as slices, and these slices are constantly expanding and shrinking in memory.
big.Int memory usage patterns induce significant overhead from Go's garbage collector.

The cost of garbage collection

It is interesting to see just how much overhead is incurred from garbage collection. Consider the following program^[1], which allocates a slice of 100M big.Int:











func main() {
   a := make([]big.Int, 1e8)

   for i := 0; i < 10; i++ {
       start := time.Now()
       runtime.GC()
       fmt.Printf("GC took %s\n", time.Since(start))
   }

   runtime.KeepAlive(a)
}

This program tells us that garbage collection takes 190ms each time it’s invoked, just to iterate through the slice. We can reduce this to 100 microseconds by replacing big.Int with [4]uint64, as arrays (not slices) are ignored by the garbage collector:

a := make([][4]uint64, 1e8)

The inescapable conclusion: build it ourselves

In many cryptographic applications (including our application to zero-knowledge proofs) the modulus is fixed and known in advance. This knowledge makes it practical to store multi-precision integers in fixed-size arrays, which allows us to avoid costly memory allocations.

Of course, the use of arrays instead of big.Int means that we must implement our own arithmetic. But we need to do this anyway in order to take advantage of fast algorithms for modular arithmetic.

We could not have achieved state-of-the-art prformance any other way.

How fast is `goff`?

In this article we focus on modular multiplication, which is the most performance-critical operation in many cryptographic applications.

goff automatically generates rudimentary benchmark code in element_test.go. A deep dive into benchmarks can be found in our companion article Faster big-integer modular multiplication for most moduli. Those benchmarks show that goff outperforms other optimized libraries written in Rust, C++, or assembly.

Most other goff operations (addition, subtraction, etc) are also competitive with state-of-the-art. (The modular inverse operation is an exception–-we didn't bother to optimize it because it's rarely invoked in our setting.)

High performance Go: case study in modular multiplication

For a 6-word modulus, we shrinked execution time of the modular multiplication from roughly 500ns to 50ns, a 10x improvement. How did we do it?^[1]

Avoid memory allocations, garbage collection `500ns`
$\to$ `160ns`

As discussed above, avoiding memory allocations using fixed size arrays instead of big.Int yields a significant performance improvment.

Prefer `math/bits` over assembly

math/bits provides blazingly fast uint64 addition, subtraction and multiplication. The Go compiler can recognize some function signatures (like the ones in math/bits) and replace them with the best platform-specific instructions.

Thanks to these Go compiler intrinsics, we quickly eliminated the need to write assembly code. There are several advantages to avoiding the use of assembly code. For example:

Assembly is not easy to maintain, nor to test across platforms.
The Go compiler won’t inline assembly code, but will inline math/bits functions.

Go does not allow developers to force the compiler to inline a function, which can hurt performance in some cases. However, since goff generates Go source code, we are able to bypass this restriction and force inlining simply by instructing goff to replicate certain code snippets thoughout the codebase.

Flatten `for` loops `160ns`
$\to$ `90ns`

The Go compiler implements every for loop with JMP and CMP instructions, even if the number of iterations is known at compile-time. (Example: for i:=0; i<6; i++.) To combat this, goff code generation uses templates to unroll loops:

{{- range $i := .NbWordsIndexesFull}}
    z[{{$i}}] = x[{{$i}}]
{{- end}}

Loop flattening may not always be the best option, and can become impractical as the code size grows due to caching issues. But we used it to considerable advantage in goff.

Dissassemble and optimize low-level arithmetics `90ns`
$\to$ `65ns`

In the algorithm we implemented for modular multiplication, most of the work is done by a function that computes:

(h i, l o) = a * b + c + d

Our initial implementations looked like this:

func madd2(a, b, c, d uint64) (hi uint64, lo uint64) {
	a, b = bits.Mul64(a, b) 
	lo, c = bits.Add64(c, d, 0)
	lo, hi = bits.Add64(b, lo, 0)
	hi, _ = bits.Add64(a, c, hi)
	return
}

The above code runs in 1.6ns. You might think here there isn't much room for improvement. Surprisingly, adding an instruction made it run in 1.1ns:

// madd2 hi, lo = a*b + c + d
func madd2(a, b, c, d uint64) (hi uint64, lo uint64) {
	var carry uint64
	hi, lo = bits.Mul64(a, b)
	c, carry = bits.Add64(c, d, 0)
	hi, _ = bits.Add64(hi, 0, carry)
	lo, carry = bits.Add64(lo, c, 0)
	hi, _ = bits.Add64(hi, 0, carry)
	return
}

The assembly output of the first version showed SBBQ and NEGQ instructions that we didn't expect. Turns out CPUs have a special register to store the carry from the ADD, and "saving it for later" is counter-productive.

Do less `65ns`
$\to$ `50ns`

Hunting these last nanoseconds came from an algorithmic improvment described in our companion article, and from trimming the edges.

Typically, the first iteration of a for loop may have some variables always set to 0, or the final iteration may write directly to the result, avoiding a few extra copies.

Some tricks are obvious, like:

// IsZero returns z == 0
func (z *Element) IsZero() bool {
	return (z[3] | z[2] | z[1] | z[0]) == 0
}

instead of

// IsZero returns z == 0
func (z *Element) IsZero() bool {
	return z[3] == 0 && z[2] == 0 && z[1] == 0 && z[0] == 0
}

And some are more obscure. Like scoping variables (it's a hint to the compiler that it doesn't need to MOV an un-needed value) or cache misses.

Cache friendliness

There is a lot to say about CPU caches–-here is good read, and not many (any?) tools in Go to help developers understand what's happening.

A widely known rule of thumb that helped us make goff faster: consider memory access patterns. CPUs fetch an entire line in the cache at a time–-consider using what you access immediatly, and not scattered accross a function.

A strange example: 3 variables or a length-3 array?

goff with a 12-word (768-bit) modulus, declaring (1)

var c [3]uint64

or (2)

var c0, c1, c2 uint64

will result in an execution time of the modular multiplication of ~800ns(1) or ~220ns(2) !

Image Not Showing Possible Reasons

The image file may be corrupted
The server hosting the image is unavailable
The image path is incorrect
The image format is not supported

Learn More →

Yet, for moduli smaller than 11 words, (1) is faster than (2). We would love to know why.

Image Not Showing Possible Reasons

The image file may be corrupted
The server hosting the image is unavailable
The image path is incorrect
The image format is not supported

Learn More →

Generating x64 assembly `50ns`
$\to$ `33ns` (

Image Not Showing Possible Reasons
The image file may be corrupted
The server hosting the image is unavailable
The image path is incorrect
The image format is not supported
Learn More →

)

What's next?

Our main focus is gnark, a fast library for zero-knowledge proofs. Work on gnark may drive more effort into goff in the coming months. Next steps for goff include:

templates (code generation) clean up
extensive testing
constant time version of some APIs
further optimizations for specific modulus sizes
security audit

Image Not Showing Possible Reasons

The image file may be corrupted
The server hosting the image is unavailable
The image path is incorrect
The image format is not supported

Learn More →

Issues, new features suggestions and PRs are welcome.

Image Not Showing Possible Reasons

The image file may be corrupted
The server hosting the image is unavailable
The image path is incorrect
The image format is not supported

Learn More →

goff was written by Gautam Botrel, Gus Gutoski, and Thomas Piellard.

We’re a research team at ConsenSys. If you are interested in our work (fast finite field arithmetic, elliptic curve pairings, zero-knowledge proofs), give us a shout.

Introducing goff: fast modular arithmetic in Go

We are proud to introduce goff (go finite field): a tool that generates pure Go source code for fast modular arithmetic operations for any modulus.

goff at a glance

goff: a short example

Why not use math/big?

The cost of garbage collection

The inescapable conclusion: build it ourselves

How fast is goff?

High performance Go: case study in modular multiplication

Avoid memory allocations, garbage collection 500ns → 160ns

Prefer math/bits over assembly

Flatten for loops 160ns → 90ns

Dissassemble and optimize low-level arithmetics 90ns → 65ns

Do less 65ns → 50ns

Cache friendliness

A strange example: 3 variables or a length-3 array?

Generating x64 assembly 50ns → 33ns ( Image Not Showing Possible Reasons The image file may be corruptedThe server hosting the image is unavailableThe image path is incorrectThe image format is not supported Learn More → )

What's next?

Read more

What changes for BW6-761?

Yet another curve, but THE curve for your KZG!

BW6 over BLS12-381

Benchmarking pairing-friendly elliptic curves libraries

Introducing `goff`: fast modular arithmetic in Go

We are proud to introduce `goff` (go finite field): a tool that generates pure Go source code for fast modular arithmetic operations for any modulus.

`goff` at a glance

`goff`: a short example

Why not use `math/big`?

How fast is `goff`?

Avoid memory allocations, garbage collection `500ns`
$\to$ `160ns`

Prefer `math/bits` over assembly

Flatten `for` loops `160ns`
$\to$ `90ns`

Dissassemble and optimize low-level arithmetics `90ns`
$\to$ `65ns`

Do less `65ns`
$\to$ `50ns`

Generating x64 assembly `50ns`
$\to$ `33ns` (

Image Not Showing Possible Reasons
The image file may be corrupted
The server hosting the image is unavailable
The image path is incorrect
The image format is not supported
Learn More →

)