_____     _           ____ _ __     __
|  ___|_ _| | _____   / ___| |\ \   / /
| |_ / _` | |/ / _ \ | |  _| | \ \ / /  
|  _| (_| |   <  __/ | |_| | |__\ V /   
|_|  \__,_|_|\_\___|  \____|_____\_/

You don't need an efficient endomorphism to implement GLV-like scalar multiplication in SNARK circuits

Introduction
- Other applications
Background
The fake GLV trick
- Implementation
  - Benchmark
  - Comparison

Introduction

P-256, also known as secp256r1 and prime256v1, is a 256-bit prime field Weierstrass curve standardized by the NIST. It is widely adopted in internet systems, which explains its myriad use cases in platforms such as TLS, DNSSEC, Apple’s Secure Enclave, Passkeys, Android Keystore, and Yubikey. The key operation in elliptic curves based cryptography is the scalar multiplication. When the curve is equipped with an efficient endomorphism it is possible to speed up this operation through the well-known GLV algorithm. P-256 does unfortunately not have an efficient endomorphism (see parameters) to enjoy this speedup.

Verifying ECDSA signatures on Ethereum through precompiled contracts, i.e. smart contracts built into the Ethereum protocol (there are only 9) is only possible with the secp256k1 curve and not the P-256.
Verifying ECDSA signatures on P-256 requires computing scalar multiplications in Solidity and is especially useful for smart-contract wallets, enabling hardware-based signing keys and safer, easier self-custody. Different solutions can bring P-256 signatures on-chain. There are primarily three interesting approaches: (zk)-SNARK based verifiers, smart contract verifiers (e.g. [Dubois23], Ledger/FCL (deprecated), smoo.th/SCL and daimo/p256verifier), and native protocol precompiles (EIP/RIP 7212).

Using SNARK (succinctness) properties, provides a great way to reduce gas cost for computation on Ethereum (e.g. ~232k gas for Groth16, ~285k gas for PLONK and ~185k gas for FFLONK). This is very competitive with (and sometimes better that) the currently gas-optimal smart contract verifier. Moreover one can batch many ECDSA verifications in a single proof, amortizing thus the gas cost. However verifying P-256 signatures in a SNARK circuit can be very expensive i.e. long proving time. This is because the field where the points on the P-256 curve lie is different than the field where the SNARK computation is usually expressed. To be able to verify the proof onchain through the procompile the SNARK field needs to be the BN254 scalar field. Different teams tried to implement the ECDSA verification on P-256 in a BN254 SNARK circuit efficiently. Among these: zkwebauthn/webauthn-halo2, https://github.com/zkwebauthn/webauthn-circom and PSE/circom-ecdsa-p256.

If P-256 had an efficient endomorphism we could have optimized the proving time a great deal!

In this note we show a way to implement a GLV-like scalar multiplications in-circuit without having an efficient endomorphism.

Other applications

This technique can be applied to any elliptic curve without an efficient endomorphism (e.g. Curve25519, P-384, MNT-753 (k=4, k=6), STARK curve,
$B$ of "cycle5", …). See this database for other curves.
This would question the choice of Bandersnatch (an embedded endomorphism-equipped curve over BLS12-381) over Jubjub (an embedded curve over BLS12-381 without endomorphism) for Ethereum Verkle trees.
This can speedup ECDSA verification in Starknet and Cairo (through the STARK curve).
This can speedup natively the folding step (à la Nova) of Ed25519 signatures through the 2-cycles proposed here by Aurore Guillevic.

Background

Standard scalar multiplication

Let

E

be an elliptic curve defined over the prime field

F_{p}

and let

r

be a prime divisor of the curve order

# E (F_{p})

(i.e. the number of points).
Let

s \in F_{r}

and

P (x, y) \in E (F_{p})

, we are interested in proving scalar multiplication

s \cdot P

over the

r

-torsion subgroup of

E

, denoted

E [r]

(i.e. the subset of points of order

r

The simplest algorithm is the standard left-to-right double-and-add:

INPUT: s = (s_{t−1},..., s_1, s_0), P ∈ E(Fp).
OUTPUT: sP.
1. Q ← ∞.
2. For i from t−1 downto 0 do
    2.1 Q ← 2Q.
    2.2 If s_i = 1 then Q ← Q + P.
3. Return(Q).

If/else branching is not possible in SNARK circuits so this is replaced by constant window table lookups inside the circuit. This can be achieved using polynomials which vanish at the constants that aren’t being selected, i.e. a 1-bit table lookup Q ← s_i * Q + (1 - s_i) * (Q+P). Hence this double-and-add algorithm requires

t

doublings,

t

additions and

t

1-bit table lookup.
This can be extended to windowed double-and-add, i.e. scanning more than a bit per iteration using larger window tables, but the multiplicative depth of the evaluation increases exponentially. We use affine coordinates for doubling/adding points because inverses cost as much as multiplications, i.e. instead of checking that

1 / x

y

we provide

y

out-circuit and check in-circuit that

x \cdot y = 1

. However since we start with

Q \leftarrow \infty

it is infeasible to avoid conditional branching since affine formulas are incomplete. Instead, we scan the bits right-to-left and assume that the first bit s_0 is 1 (so that we start at Q ← P), we double the input point P instead of the accumulator Q in this algorithm and finally conditionally subtract (using the 1-bit lookup) P if s_0 was 0.

INPUT: s = (s_{t−1},..., s_1, s_0), P ∈ E(Fp).
OUTPUT: sP.
1. Q ← P.
2. For i from 1 to t−1 do
    2.1 If s_i = 1 then Q ← Q + P.
    2.2 P ← 2P.
3. if s_0 = 0 then Q ← Q - P
4. Return(Q).

GLV scalar multiplication

However it is well known that if the curve is equipped with an efficient endomorphism then there exists a faster algorithm known as [GLV].

Example 1 : suppose that

E

has Complex Multiplication (CM) with discrimant

- D = - 3

, i.e.

E

is of the form

y^{2} = x^{3} + b

, with

b \in F_{p}

. This is the case of BN254, BLS12-381 and secp256k1 elliptic curves used in Ethereum. There is an efficient endomorphism

ϕ : E \to E

defined by

(x, y) \mapsto (ω x, y)

(and

O \mapsto O

) that acts on

P \in E [r]

ϕ (P) = λ \cdot P

. Both

ω

and

λ

are cube roots of unity in

F_{p}

and

F_{r}

respectively, i.e.

ω^{2} + ω + 1 \equiv 0 (\mod p)

and

λ^{2} + λ + 1 \equiv 0 (\mod r)

Example 2 : suppose that

E

has Complex Multiplication (CM) with discrimant

- D = - 8

, meaning that the endomorphism ring is

Z [\sqrt{- 2}]

. This is the case of the Bandersnatch elliptic curves specified in Ethereum Verkle trie. There is an efficient endomorphism

ϕ : E \to E

whose kernel is generated by a 2-torsion point. The map can be found by looking at 2-isogeneous curves and applying Vélu's formulas. For Bandersnatch it is defined by

(x, y) \mapsto (u^{2} \cdot \frac{x^{2} + w x + t}{x + w}, u^{3} \cdot y \cdot \frac{x^{2} + 2 w x + v}{(x + w)^{2}})

for some constants

u, v, w, t

(and

O \mapsto O

) that acts on

P \in E [r]

ϕ (P) = λ \cdot P

where

λ^{2} + 2 \equiv 0 (\mod r)

The GLV algorithm starts by decomposing

s

s = s_{0} + λ s_{1}

and then replacing the scalar multiplication

s \cdot P

s_{0} \cdot P + s_{1} \cdot ϕ (P)

. Because

s_{0}

and

s_{1}

are guaranteed to be

\leq \sqrt{r}

(see Sec.4 of [GLV] and Sec.4 of [FourQ] for an optimization trick), we can halve the size of the for loop in the double-and-add algorithm. We can then scan simultaenously the bits of

s_{0}

and

s_{1}

and apply the Strauss-Shamir trick. This results in a significant speed up but only when an endomorphism is available. For example the left-to-right double-and-add would become:

INPUT: s and P ∈ E(Fp).
OUTPUT: sP.
1. Find s1 and s2 s.t. s = s1 + 𝜆 * s2 mod r 
    1.1 let s1 = (s1_{t−1},..., s1_1, s1_0) 
    1.2 and s2 = = (s2_{t−1},..., s2_1, s2_0)
2. P1 ← P, P2 ← 𝜙(P) and Q ← ∞.
3. For i from t−1 downto 0 do
    3.1 Q ← 2Q.
    3.2 If s1_i = 0 and s2_i = 0 then Q ← Q.
    3.3 If s1_i = 1 and s2_i = 0 then Q ← Q + P1.
    3.4 If s1_i = 0 and s2_i = 1 then Q ← Q + P2.
    3.5 If s1_i = 1 and s2_i = 1 then Q ← Q + P1 + P2.
4. Return(Q).

Using the efficient endomorphism in-circuit is also possible (see [Halo, Sec. 6.2 and Appendix C] or [gnark implementation] for short Weierstrass curves and [arkworks] and [gnark] implementations for twisted Edwards). But one should be careful about some extra checks of the decomposition

s = s_{0} + λ s_{1} \mod r

(not the SNARK modulus). The integers

s_{0}, s_{1}

can possibly be negative in which case they will be reduced in-circuit modulo the SNARK field and not

r

The fake GLV trick

Remember that we are proving that

s \cdot P = Q

and not computing it. We can "hint" the result

Q

and check in-circuit that

s \cdot P - Q = O

. Now, if we can find

u, v \leq \sqrt{r}

such that

v \cdot s = u (\mod r)

then we can check instead that

(v \cdot s) \cdot P - v \cdot Q = O

which is equivalent to

u \cdot P - v \cdot Q = O

The thing now is that

u

and

v

are "small" and we can, similarly to the GLV algorithm, halve the size of the double-and-add loop and apply the Strauss-Shamir trick.

Solution: running the half-GCD algorithm (i.e. running GCD half-way) is sufficient to find

u

and

v

. We can apply the exact same trick for finding the lattice basis as in the GLV paper (Sec. 4). For completeness we recall the algorithm hereafter.
We apply the extended Euclidean algorithm to find the greatest common divisor of

r

and

s

(This gcd is 1 since

r

is prime.) The algorithm produces a sequence of equations

w_{i} \cdot r + v_{i} \cdot s = u_{i}

for

i = 0, 1, 2, \dots

where

w_{0} = 1, v_{0} = 0, u_{0} = r, w_{1} = 0, v_{1} = 1, u_{1} = s

, and

u_{i} \geq 0

for all

i

. We stop at the index

m

for which

u_{m} \geq \sqrt{r}

and take

u = u_{m + 1}

and

v = - v_{m + 1}

.
Note: By construction

u

is guaranteed to be a positive integer but

v

can be negative, in which case it would be reduced in-circuit modulo the SNARK modulus and not

r

. To circumvent this we return in the hint

u

v

and a

b = 1

v

is negative and

b = 0

otherwise. In-circuit we negate

Q

instead when

b = 1

Implementation

A generic implementation in the gnark library is available at gnark.io (feat/fake-GLV branch). For Short Weierstrass (e.g. P256) look at the scalarMulFakeGLV method in the emulated package and for twisted Edwards (e.g. Bandersnatch/Jubjub) look at the scalarMulFakeGLV method in the native package.

Benchmark

The best algorithm to implement scalar multiplication in a non-native circuit (i.e. circuit field ≠ curve field) when an efficient endomorphism is not available is an adaptation of [Joye07] (implemented in gnark here).
Next we compare this scalar multiplication with our fake GLV in a PLONKish vanilla (i.e. no custom gates) circuit (scs) over the BN254 curve (Ethereum compatible). We also give benchmarks in R1CS.

P-256	Old (Joye07)	New (fake GLV)
$[s] P$	738,031 scs 186,466 r1cs	385,412 scs 100,914 r1cs
ECDSA verification	1,135,876 scs 293,814 r1cs	742,541 scs 195,266 r1cs

Note here that the old ECDSA verification uses Strauss-Shamir trick for computing
$[s] P + [t] Q$ while the new version is merely two fake GLV multiplications and an addition.

Comparison

p256wallet.org is an ERC-4337 smart contract wallet that leverages zk-SNARKs for WebAuthn and P-256 signature verification. It uses PSE/circom-ecdsa-p256 to generate the webAuthn proof, and underneath PSE/circom-ecdsa-p256 to generate the ECDSA proof on P-256 curve. The github README reports 1,972,905 R1CS. Compiling our circuit in R1CS results in 195,266 R1CS. This is more than a 10x reduction, which is not only due to the fake GLV algorithm but also to optimized non-native field arithmetic in gnark.

Other curves

Similar results are noticed for other curves in short Weirstrass, e.g. P-384 and STARK curve:

P-384	Old (Joye07)	New (fake GLV)
$[s] P$	1,438,071 scs	782,674 scs
ECDSA verification	2,174,027 scs	1,419,929 scs

STARK curve	Old (Joye07)	New (fake GLV)
$[s] P$	727,033 scs	380,210 scs
ECDSA verification	1,137,459 scs	732,131 scs

and also in twisted Edwards e.g. Jubjub vs. Bandersnatch:

Jubjub	Old (2-bit double-and-add)	New (fake GLV)
$[s] P$	5,863 scs 3,314 r1cs	4,549 scs 2,401 r1cs

Bandersnatch	Old (GLV)	New (fake GLV)
$[s] P$	4,781 scs 2,455 r1cs	4,712 scs 2,420 r1cs

EDIT: Thanks to Ben Smith for reporting that a similar idea was proposed in [SAC05:ABGL+] for ECDSA verification. We note that, in our context, the trick applies to a single scalar multiplication and that the half GCD is free through the hint.

You don't need an efficient endomorphism to implement GLV-like scalar multiplication in SNARK circuits

Introduction

Other applications

Background

Standard scalar multiplication

GLV scalar multiplication

The fake GLV trick

Implementation

Benchmark

Comparison

Other curves

Read more

Multiset hashing based on elliptic curves

Notes about optimizing emulated pairing (part 2)

BN254 pairing in gnark vs. google and cloudflare

Elliptic nets