Hash to Curve halo2 Circuit

In this document we attempt to sketch a plan for turning the `hash_to_curve` ciphersuite `BLS12381G2_XMD:SHA-256_SSWU_RO_` into a halo2 circuit. The specification is given [here](https://datatracker.ietf.org/doc/html/draft-irtf-cfrg-hash-to-curve-09#name-bls12-381-g2). A `hash_to_curve` algorithm takes as its input a byte string of arbitrary length, and outputs a point on a prime-order subgroup of an elliptic curve group, in such a way that the distribution induced by this function is close (within the desired security parameter) to being uniformly random. Since this is a complex algorithm with many component functions, we will break it down piece by piece, designing chips for each of the subroutines. These will be linked together in the final circuit. # Parameters ``` BLS12381G2_XMD:SHA-256_SSWU_RO_ is defined as follows: encoding type: hash_to_curve (Section 3) E: y^2 = x^3 + 4 * (1 + I) base field F is GF(p^m), where p: 0x1a0111ea397fe69a4b1ba7b6434bacd764774b84f38512bf6730d2a0f6b0f6241eabfffeb153ffffb9feffffffffaaab m: 2(1, I) is the basis for F, where I^2 + 1 == 0 in F k: 128 expand_message: expand_message_xmd (Section 5.4.1) H: SHA-256 L: 64 f: Simplified SWU for AB == 0, Section 6.6.3 Z: -(2 + I) E': y'^2 = x'^3 + A' * x' + B', where A' = 240 * I B' = 1012 * (1 + I) iso_map: the isogeny map from E' to E given in Appendix C.3 h_eff: 0xbc69f08f2ee75b3584c6a0ea91b352888e2a8e9145ad7689986ff031508ffe1329c2f178731db956d82bf015d1212b02ec0ec69d7477c1ae954cbc06689f6a359894c0adebbf6b4e8020005aaa95551 ``` # `hash_to_curve` ``` hash_to_curve(msg) Input: msg, an arbitrary-length byte string. Output: P, a point in G. Steps: u = hash_to_field(msg, 2) Q0 = map_to_curve(u[0]) Q1 = map_to_curve(u[1]) R = Q0 + Q1 # Point addition P = clear_cofactor(R) return P ``` # `hash_to_field` Here is the pseudocode for `hash_to_field`, copied from the source document for the reader's convenience. Like almost all algorithms in the source document, it works in bytes. ``` hash_to_field(msg, count) Parameters: - DST, a domain separation tag (see discussion above). - F, a finite field of characteristic p and order q = p^m. - p, the characteristic of F (see immediately above). - m, the extension degree of F, m >= 1 (see immediately above). - L = ceil((ceil(log2(p)) + k) / 8), where k is the security parameter of the suite (e.g., k = 128). - expand_message, a function that expands a byte string and domain separation tag into a uniformly random byte string (see discussion above). Inputs: - msg, a byte string containing the message to hash. - count, the number of elements of F to output. Outputs: - (u_0, ..., u_(count - 1)), a list of field elements. Steps: len_in_bytes = count * m * L uniform_bytes = expand_message(msg, DST, len_in_bytes) for i in (0, ..., count - 1): for j in (0, ..., m - 1): elm_offset = L * (j + i * m) tv = substr(uniform_bytes, elm_offset, L) e_j = OS2IP(tv) mod p u_i = (e_0, ..., e_(m - 1)) return (u_0, ..., u_(count - 1)) ``` Substituting the parameters given in the last section into the pseudocode from the source document, we have the following. ``` len_in_bytes = count * 2 * 64 uniform_bytes = expand_message_xmd(msg, DST, len_in_bytes) for i in (0, ..., count - 1): for j in (0, 1): elm_offset = 64 * (j + i * 2) tv = substr(uniform_bytes, elm_offset, 64) e_j = OS2IP(tv) mod p u_i = (e_0, e_1) return (u_0, ..., u_(count - 1)) ``` Now the domain separation tag `DST` is a parameter which is not given explicitly in the source document, although they do provide guides for how to form one. However, `count`, the number of field elements to output, is an input to the function itself. Since the `hash_to_curve` function only uses `hash_to_field(msg, 2)`, we only have to consider the case `count = 2`. Making this substitution and unrolling the for loop, we have the following. ``` len_in_bytes = 256 uniform_bytes = expand_message_xmd(msg, DST, len_in_bytes) tv = substr(uniform_bytes, 0, 64) e_0 = OS2IP(tv) mod p tv = substr(uniform_bytes, 64, 64) e_1 = OS2IP(tv) mod p u_0 = (e_0, e_1) tv = substr(uniform_bytes, 128, 64) e_0 = OS2IP(tv) mod p tv = substr(uniform_bytes, 192, 64) e_1 = OS2IP(tv) mod p u_1 = (e_0, e_1) return (u_0, u_1) ``` Here `tv` stands for "temporary variable." If `byte_string` is a byte string, then `substr(byte_string, m, n)` is just the bytestring consisting of entries `m` to `m+n-1`, inclusive, of `byte_string`. Using Rust's slice syntax, if `byte_string` is an array of `u8`s, then `substr(byte_string, m, n)` is just `byte_string[m..m+n]`. The function `OS2IP` stands for "Octet Stream to Integer Primitive", which means that it converts a byte string of a known fixed length to an integer, interpreting the byte string as big-endian. See [this post on stack exchange](https://crypto.stackexchange.com/questions/37537/what-are-i2osp-os2ip-in-rsa-pkcs1) for more details. In our case, every number, string, or any piece of information inside the circuit whatsoever will neither be represented by bytes nor integers. Instead, everything is represented by field elements. So all invocations of `OS2IP` or `I2OSP` are essentially meaningless to us, except as a reminder to be careful about how we represent the data within the circuit. Suppose for the moment that we are representing the byte string `tv` within the circuit as a list of cells, with one byte stored in each cell. Each cell entry is of the form `(a,b)` where $a + bi \in \mathbb F_{p^2} = \mathbb F_p[X]/ \langle X^2 + 1 \rangle = \mathbb F_p(i)$. Suppose also that we are ignoring the second, "imaginary" coordinate, just to keep things simple for now (this is likely not the most efficient way to store and compute with bytes, but let's try to focus on getting the basic algorithm down before worrying about any optimizations). Then the way to constrain something like `e_0 = OS2IP(tv) mod p`, knowing that `tv` is a byte string of length 64, is to add the constraint that `e_0 = tv[0] + (256 * tv[1]) + ... + (256^{63} * tv[63])` The `mod p` part is done automatically by halo2, since, in $\mathbb F_p(i)$, $\Sigma_n (a_n + 0i) = (\Sigma_n a_n \mod p) + 0i$. So now we can clarify the pseudocode even further: ``` uniform_bytes = expand_message_xmd(msg, DST, 256) e_0 = uniform_bytes[0..64] e_1 = uniform_bytes[64..128] e_2 = uniform_bytes[128..192] e_3 = uniform_bytes[192..256] u = ((e_0, e_1), (e_2, e_3)) // This is (e_0 + e_1 * I, e_2 + e_3 * I) return u ``` Now, suppose for the moment that we are prohibited from working in bytes and have to work in bits instead. I believe this is what we will be forced to do, given how SHA256 acts on bits, but if anyone has any ideas for how to work in bytes, or in another system that is even more efficient, please let me know. The parameter `L` is derived from other parameters by the formula `L = ceil(ceil(log2(p)) + k / 8)`, which is the number of bytes needed to store a bit string large enough to fit one field element plus `k` additional bits of security. If we're working in bits instead of bytes, then the outer `ceil` and division by 8 goes away. The BLS12-381 curve is defined over a field with characteristic $p$, where $p$ is 381 bits wide, so `ceil(log2(p)) = 381`. This suggests that we should set `L = 381 + 128 = 509`. *However*, at some point we will need to test the `hash_to_curve` circuit against an existing native implementation by hashing the empty string (for example) to see if the ouput, when converted back to bytes, is what we expect. Therefore our parameters and intermediate computations should be exactly the bit representations of their counterparts in bytes, so we will set `L = 64 * 8 = 512` bits, plain and simple. Now, `len_in_bytes` is given to be `count * m * L`. The parameter `m` is the degree of the field extension and doesn't change, and we are still using `count = 2`. If we are working in bits, then `len_in_bits` should be `2 * 2 * L = 2 * 2 * 512 = 2048`. So the pseudocode for working in bits would be: ``` uniform_bits = expand_message_xmd(msg, DST, 1024) e_0 = uniform_bits[0..512] e_1 = uniform_bits[512..1024] e_2 = uniform_bytes[1024..1536] e_3 = uniform_bytes[1536..2048] u = ((e_0, e_1), (e_2, e_3)) // This is (e_0 + e_1 * I, e_2 + e_3 * I) return u ``` If we are storing `uniform_bits` as a list of cells, where `uniform_bits[n]` holds the field element $x_n + 0i$ where $x_n$ is the $n^{\text{th}}$ bit of `uniform_bits`, then the way to enforce that (for example) `e_0` is as claimed is to add the constraint that `e_0 = uniform_bits[0] + (2 * uniform_bits[1]) + ... + (2^{1023} * uniform_bits[1023])` Now definitely $2^n > p$ whenever $n \geq 382$, so the $n^{\text{th}}$ coefficient in this sum is actually $2^n \mod p$. I believe that halo2 does this conversion automatically, but if not then we need to pre-calculate the powers of 2 modulo $p$ up to $2^{1023}$ and hard-code that into our circuit. ## `expand_message_xmd` SHA256 operates on 32-bit words, and applies operations such as (cyclic shift right by 17 bits) and (shift right 3 bits). Since 3 and 17 are not multiples of 8, this means that SHA256 does not operate independently on each of the four bytes within a 32-bit word. To describe the operations using bytes, one would need to use a lookup table containing all $2^{64}$ possible pairs of 4-byte strings and the results of applying to them the bitwise shifts and logical operators required by SHA256. The same problem occurs if we decide to treat the 32-bit words as our basic units of computation. Thus, we have to work in bits, storing each bit that appears in the computation trace in a separate cell. However, the specification of `hash_to_curve` referenced here mainly operates on bytes. So we have to translate the byte logic into bit logic. The pseudocode given in the reference is the following: ``` expand_message_xmd(msg, DST, len_in_bytes) Parameters: - H, a hash function (see requirements above). - b_in_bytes, b / 8 for b the output size of H in bits. For example, for b = 256, b_in_bytes = 32. - r_in_bytes, the input block size of H, measured in bytes (see discussion above). For example, for SHA-256, r_in_bytes = 64. Input: - msg, a byte string. - DST, a byte string of at most 255 bytes. See below for information on using longer DSTs. - len_in_bytes, the length of the requested output in bytes. Output: - uniform_bytes, a byte string. Steps: ell = ceil(len_in_bytes / b_in_bytes) ABORT if ell > 255 DST_prime = DST || I2OSP(len(DST), 1) Z_pad = I2OSP(0, r_in_bytes) l_i_b_str = I2OSP(len_in_bytes, 2) msg_prime = Z_pad || msg || l_i_b_str || I2OSP(0, 1) || DST_prime b_0 = H(msg_prime) b_1 = H(b_0 || I2OSP(1, 1) || DST_prime) for i in (2, ..., ell): b_i = H(strxor(b_0, b_(i - 1)) || I2OSP(i, 1) || DST_prime) uniform_bytes = b_1 || ... || b_ell return substr(uniform_bytes, 0, len_in_bytes) ``` ### `I2OSP` Before we try to convert this pseudocode, let's think about `I2OSP` for a minute. The function `I2OSP` converts an unsigned integer into a big-endian byte array of a fixed size, and produces an error if the integer is too large. The first argument to the function `I2OSP` is the integer to be converted, and the second argument is the number of bytes that `I2OSP` will try to convert the integer into. So `I2OSP(x, n)` does two things: 1. Expresses $x$ as a string of $n$ bytes. 2. Asserts that $x$ is at most $n$ bytes long. If we want to verify a satisfying assignment to `I2OSP(x, n)`, we need a list `x_bytes` of $n$ cells each holding a field element which represents a byte (assume for simplicity's sake that we are just using the real component to store bytes as field elements), and another cell holding a value `x`. Suppose that `n` is a fixed circuit parameter. Then we need to constrain the following: 0. Each element of `x_bytes` actually holds a byte. 1. `x = x_bytes[0] + (256 * x_bytes[1]) + ... + (256^{n-1} * x_bytes[n-1])` Item 0. can be enforced with the constraint $\prod_{j=0}^{255} (x_i-j) = 0$ applied to each value $x_i$ held in a cell of `x_bytes`, or by using a lookup table. Now unfortunately we cannot replicate 2., because there is no natural way to express, *within* a fixed circuit, the relation that "the prover's input to cell `c` is the number of cells that they have to assign to express the value `x`." So we can't build *one* circuit or chip that takes inputs `x` and `n` and produces a correct proof if and only if the prover correctly computes `I2OSP(x, n)` in the circuit. This is a general problem with variable-length input in arithmetic circuits. However, what we *can* do is pass `n` as a circuit parameter, in which case halo2 will compile a different circuit `I2OSP_n(x)` for each `n` upon request. It's in the prover's best interest to compute the proof according to the same `n` that the verifier has in mind: otherwise, the prover will send the verifier an elliptic curve point which is a proof corresponding to an entirely different circuit, which will fail with high probability. Let's call the bit version of this `I2BSP_n` (Integer to Bit Stream Primitive, $n$ bits). Then `I2BSP_n(x)` could be a chip with access to a cell `x` and a list of cells `x_bits`, that enforces these constraints: 1. For all `i` in `0..n`, `x_bits[i].square() - x_bits[i] = 0` 2. `x = x_bits[0] + (2 * x_bits[1]) + ... + (2^{n-1} * x_bits[n-1])` Apologies for conflating the cells with the values held in them. Keeping these constraints in mind wherever we see `I2BSP_n`, let's rewrite the pseudocode. Below we have substituted in the parameters specific to our ciphersuite, converted byte logic to bit logic, and simplified a little. Recall that in `hash_to_field` we fixed `count = 2`, which meant that the requested output length of `expand_message_xmd` should be 2048 bits. So we also substitute `len_in_bits = 2048`. This implies that `ell = ceil(len_in_bits / 256) = 8`, which means that the requirement that `ell < 255` enforced by the second line is automatically satisfied, and the second line can be skipped. ``` expand_message_xmd(msg, DST, len_in_bits) Input: - msg, a bit string. - DST, a bit string of at most 2040 bits. See below for information on using longer DSTs. - len_in_bits, the length of the requested output in bits. Output: - uniform_bits, a bit string. Steps: DST_prime = DST || I2BSP_8(len(DST)) Z_pad = I2BSP_512(0) l_i_b_str = I2BSP_16(2048) msg_prime = Z_pad || msg || l_i_b_str || I2BSP_8(0) || DST_prime b_0 = SHA256(msg_prime) b_1 = SHA256(b_0 || I2BSP_8(1) || DST_prime) for i in (2, ..., 8): b_i = SHA256(strxor(b_0, b_(i - 1)) || I2BSP_8(i) || DST_prime) uniform_bits = b_1 || ... || b_ell return substr(uniform_bits, 0, 2048) ``` Actually, most instances of `I2BSP_n` are called on constants, so we can just denote these by an array. In the notation of the source document, `strxor` is just a bitwise xor of two bytestrings, so we'll just denote it `xor` with the understanding that it should be applied bitwise to two arrays of the same length. Using Rust's syntax for arrays and slices, we have: ``` DST_prime = DST || I2BSP_8(len(DST)) msg_prime = [0; 512] || msg || [0; 4] || [1] || [0; 19] || DST_prime b_0 = SHA256(msg_prime) b_1 = SHA256(b_0 || [0, 0, 0, 0, 0, 0, 0, 1] || DST_prime) uniform_bits = b_0 || b_1 for i in 2..9: b_i = SHA256(b_{i-2} xor b_{i-1} || I2BSP_8(i) || DST_prime) uniform_bits = uniform_bits || b_i return uniform_bits ``` Let `|s|` denote the length of a bit string `s`. By specification, `|DST| <= 2040`, so `|DST_prime| <= |DST| + 8 = 2048`. The input to SHA256 to calculate `b_{i+1}` is `256 + 8 + |DST_prime|` bits, so at most 2320 bits. The input block size to SHA256 is 512 bits, so to compute `b_1` through `b_4` would require four SHA256 hashes of at most five blocks each. Since we are trying to build a variable-length input circuit, `msg` may be considerably longer, and we may have to hash many more blocks. #### On the length of `msg` According to the specification of SHA256 itself, the maximum input size is $2^{64}$ bits. We know that `|msg_prime| = 512 + |msg| + 24 + |DST| + 8 <= |msg| + 2584`. So to ensure that `msg_prime` can be hashed, we need only to ensure that `|msg| <= 2^{64} - 2584`. In practice, I don't think we can hash messages anywhere near as long as SHA256's maximum input size. I'm not sure what the maximum table size is for halo2, but I think I remember one of the 0xPARC lecturers saying that the maximum number of rows that halo2 can handle is $2^{26}$. And $2^{26}$ bits is 64 Mebibits (Mib), which is about 8.3 Megabytes (MB). I don't know if this is a reasonable assumption, but suppose we need to cut that in half for the other parts of the circuit and computing overhead. Then our users would be able to hash about 4MB of data. I'm not sure what the demands are for our particular implementation, so maybe this is totally sufficient. However, maybe we want to have a smaller, faster circuit only capable of hashing around, say, 16Kib of data. Then, if the prover wants to hash more data, we make them construct a Merkle tree using several passes through the circuit with successive blocks of their message. If our users need to hash more than 4MB at a time, then we will have to take this approach (or something similar) anyway. #### Variable-length messages and zero knowledge At any rate, if we tell halo2 to make a variable-length input `hash_to_curve` circuit, halo2 will actually compile the circuit only after it knows the length of the input message. Different input message lengths will result in the selection of a different circuit. Since the circuit description itself is public, the verifier can deduce the length of the prover's input. This means that a variable-length input circuit proof would *not* be zero-knowledge. This is a big problem if the message is short: if the verifier knows that the message is $n$ bits long, then they can find the original message with at most $2^n$ guesses. There are a few possible ways to deal with this problem, and the best approach depends on how this circuit is meant to be used. 1. We could instead make the circuit fixed length. Under this scheme, if the prover wants to hash more data they would need to construct a Merkle tree, run several SHA256 circuits on the leaves and intermediate vertices, and use the elliptic curve point corresponding to the root hash as the signature. The verifier might gain some knowledge about the length of the original input from the height of the Merkle tree. We could fix this as well, since a relatively short Merkle tree can have more space at the root level than a reasonable prover would ever need to use. Another option would be for the prover to divide their message into blocks, run the circuit on each block, then somehow batch the resulting elliptic curve points together to come up with the final elliptic curve point. This could possibly be done by adding (maybe a random linear combination of?) the elliptic curve points corresponding to the individual blocks. 2. We could specify a minimum message length, long enough that the verifier would be unable to guess a random minimum-length message by brute force. Given that the message needs to be padded to the correct block size in the way specified by SHA256, if the message is very short then the verifier could brute force it anyway by applying the padding herself. This would still not be zero-knowledge, since longer messages requiring more SHA256 blocks could still be distinguished from shorter messages. But it does leak less information about the prover's input. 3. We could accept the fact that the circuit is not zero-knowledge, and warn the user that they should not input secrets, especially if the secrets are short. If the circuit does not need to be zero-knowledge, then this isn't a problem. ## `SHA256` https://nvlpubs.nist.gov/nistpubs/FIPS/NIST.FIPS.180-4.pdf ### Padding and Parsing #### Padding Let $M$ be a message expressed as a bitstring of length $\ell$. The padding algorithm for SHA256 is as follows: 1. Append 1. 2. Append $k$ 0s, where $k$ is the minimum natural number such that $\ell + 1 + k = -64 \mod 512$. 3. Append 64 bits whose big-endian interpretation as a signed 64-bit integer is equal to $\ell$. Let us denote the padded message by $M'$, the $i^{\text{th}}$ bit of the original message by $m_i$, and the $i^{\text{th}}$ bit of the padded message by $m_i'$. Let $N$ be the number of 512-bit blocks of the padded message, so that $N = \frac{\ell + k + 65}{512} = \left \lceil \frac{\ell + 65}{512} \right \rceil$. We want to enforce the following relations: 1. $m_i' = m_i$ if $i < \ell$. 2. $m_i' = 1$ if $i = \ell$. 3. $m_i' = 0$ if $\ell + 1 \leq i < 512N-64$. 4. $\sum_{i=512N-64}^{512N-1} 2^{i - (512N-64)} m_i' = \ell$. If there is a maximum number of blocks that the prover can hash, then we can do this in a way which is actually zero knowledge. If $K$ is the maximum number of blocks, let $\text{max} = 512K$. We will show how to constrain an advice column $a$ so that $a_i = 1$ if $i < \ell$ and $a_i = 0$ if $\ell \leq i < \text{max}$. First, we add an auxiliary advice column $\text{diff}^a$, and suggest that the prover set $\text{diff}^a_0 = \ell$. Note that this is our *definition* of $\ell$: the prover can put whatever they want in $\text{diff}^a_0$, but if they put a value which is not equal to the length of their message, their proof should be invalid. Then for $i \in \{1, \ldots, \text{max} - 1\}$ we constrain $$\text{diff}^a_i = \text{diff}^a_{i-1} - 1$$ so that $\text{diff}^a_i$ is exactly $\ell - i$ for each $i$. Now we create another auxiliary column $\text{inv}^a$ and suggest that the prover set $\text{inv}^a_i = (\ell - i)^{-1}$ whenever $\ell \neq i$, that is, whenever $\text{diff}^a_i \neq 0$. We must have that either $\text{diff}^a_i = 0$ or $\text{diff}^a_i \cdot \text{inv}^a_i = 1$. To enforce this, we constrain $$\text{diff}_i (1 - \text{diff}_i \cdot \text{inv}_i) = 0$$ Now we want the $a$ column to read 1, 1, 1, .. until index $\ell$, and for it to read 0 from $\ell$ onwards. To enforce that the entries of the $a$ column remain 0 after the first 0 appears, we can multiply $a_i$ by $a_{i-1}$. We can then initialize $a_0$ to 1 so that multiplying by $a_{i-1}$ does not affect $a_i$ until $i = \ell$. In general, we should have that $a_i = a_{i-1}$ if $i \neq \ell$, and $a_i = 0$ if $i = \ell$. Given that $\text{inv}^a_i$ is constrained to be $(\text{diff}_i^a)^{-1}$ unless $\text{diff}^a_i = 0$ (i.e. unless $i = \ell$), we can enforce this relation with the following constraints: $$a_0 = \text{inv}^a_0 \cdot \text{diff}^a_0$$ $$a_i = a_{i-1}(\text{inv}_i \cdot \text{diff}_i)$$ for $i \in \{1, \ldots, \text{max}\}$. The first constraint is essentially just initializing $a_0$ to 1, but it also handles the corner case that $\ell = 0$, i.e. the empty message. Constraining $a$ in this way allows us to use it as a kind of selector column to enforce some constraints at indices less than $\ell$ and others at indices greater than or equal to $\ell$. The difference is that the values in column $a$ depend on $\ell$, which is a prover input, whereas selector columns are fixed. Using $a$ we can constrain the first $\ell$ entries of the padded message to be equal to the first $\ell$ entries of the original message. We also need to be able to talk about the last 64 entries of the padded message, as we need to constrain these to be a big-endian bit representation of $\ell$. So we will go about constructing another auxiliary column $b$, similar to how we constructed $a$. We want $b$ to be 1 until index $512N - 64$, and 0 from $512N - 64$ to $512N - 1$. So first we need to constrain $N$ appropriately, i.e. enforce that $N = \left \lceil \frac{\ell + 65}{512} \right \rceil$. Note that $N = \left \lceil \frac{\ell + 65}{512} \right \rceil$ means that $512(N-1) < \ell + 65 \leq 512N$, or $0 \leq 512N - \ell - 65 < 512$. This means that $512N - \ell - 65$ can be expressed in nine bits. In other words, $N = \sum_{j=0}^8 2^jn_j$ where $n_j$ is constrained to be either 0 or 1. If there is another $N' \neq N$ such that $512N' - \ell - 65$ can also be expressed in nine bits, then $512(N' - N) \mod p < 512$, where $p$ is the characteristic of our base field $\mathbb F_p(i)$. Note that we care about the characteristic rather than the total field size since, by constraining $N$ and $N'$ to have a bit decomposition, we are ensuring that their imaginary part is zero. Since $N' \neq N$, this means that $512(N' - N) \geq p$, so that $N' \geq \frac{N + p}{512} > \frac{p}{512} > \frac{2^{381}}{512} = 2^{372}$. However, from the SHA256 specification we have that $\ell < 2^{64}$, thus $N \leq \frac{2^{64}}{512} + 1< 2^{56}$. In practice, $N$ will have to be much smaller, since halo2 can only handle around $2^{26}$ rows: thus, if we want to store one bit per cell, the length $\ell$ of the message will have to be at most $2^{26}$ (minus a few rows of overhead), so that in practice we will have $N \leq 2^{17}$. Since we will later constrain $\ell$ to be represented in 64 bits, we might not have to constrain $N$ at this stage. On the other hand, we are going to use $N$ to select the 64 bits used to constrain $\ell$. So, just to be safe, we should constrain that $N < \frac{\text{max}}{512}$. For the moment we are agnostic about what $\text{max}$ should be, but we know $\text{max} \leq 2^{26}$, from which it follows that $N$ is the *only* possible solution to the equation $512N - \ell - 65 = \sum_{j=0}^8 2^jn_j$ where each $n_j$ is constrained to be a bit. So in summary, to enforce $N = \left \lceil \frac{\ell + 65}{512} \right \rceil$, we store auxiliary values $n_0, \ldots, n_8$ in an advice column and constrain $$n_j^2 = n_j$$ $$512N - \ell - 65 = \sum_{j=0}^8 2^jn_j$$ $$N \in \{1, \ldots, \tfrac{\text{max}}{512}\}$$ The last constraint can be enforced either by using a lookup table, or by constraining $N$ to have a bit decomposition. Next, we want a column $b$ holding values $b_i$ such that $b_i = 1$ when $i < 512N - 64$ and $b_i = 0$ when $512N - 64 \leq i < \text{max}$. Proceeding as in our construction of $a$, we form two auxiliary advice columns, $\text{diff}^b$ and $\text{inv}^b$. We constrain $$\text{diff}_0^b = 512N - 64$$ $$\text{diff}_i^b = \text{diff}_{i-1}^b - 1$$ for all $i \in \{1, \ldots, \text{max}-1 \}$, so that $\text{diff}_i^b$ is constrained to be $512N - 64 - i$ for all $i \in \{0, \ldots, \text{max} - 1\}$. Note that, whereas $\text{diff}^b_0$ was unconstrained and interpreted as the prover's self-reported message length, we actually *constrain* our initialization $\text{diff}_0^b$ to be $512N - 64$, which is itself tied to the value $\ell$ by the constraints on $N$ previously described. Next, we suggest that the prover set $\text{inv}^b_i = (512N - 64 - i)^{-1}$ whenever $i \neq 512N - 64$, and enforce this with the constraint $$\text{diff}_i^b (1 - \text{inv}_i^b \cdot \text{diff}_i^b) = 0$$ for all $i \in \{0, \ldots, \text{max} - 1\}$. Then we constrain $$b_0 = 1$$ $$b_i = b_{i-1} (\text{inv}_i^b \cdot \text{diff}_i^b)$$ Note that since $N \geq 1$ (there is at least one block, even for an empty message), the least $i$ for which $512N - 64 = i$ is 448. So, unlike in our construction of $a$, we can safely initialize $b_0 = 1$. Arguing as we did previously, this enforces that $b_i = 1$ for $i \in \{0, \ldots, 512N-65 \}$ and $b_i = 0$ for $i \in \{512N - 64, \ldots, \text{max} - 1 \}$. However, this is not exactly what we want. Column $b$ distinguishes indices less than $512N - 64$ from those greater than or equal to $512N - 64$, but cannot tell the difference between indices less than $512N$ and indices greater than or equal to $512N$. To fix this, consider the expression $b_i \oplus b_{i-64}$. When $i \in \{64, \ldots, 512N-64 \}$, both $b_i$ and $b_{i-64}$ are equal to 0, therefore $b_i \oplus b_{i-64} = 0$. When $i \in \{512N - 64, 512N - 1\}$, $b_i = 1$ but $b_{i-64} = 0$, so $b_i \oplus b_{i-64} = 1$. Then when $i \in \{512N, \ldots, \text{max} - 1\}$ we have that $b_i \oplus b_{i-64} = 0$ again. Later we will describe the constraint corresponding to the $\oplus$ operation on two bits. For now just note that there is such a polynomial constraint, and so we can avoid instantiating another advice column. Let us recall the conditions on the padded message $M'$. They are: 1. $m_i' = m_i$ if $i < \ell$. 2. $m_i' = 1$ if $i = \ell$. 3. $m_i' = 0$ if $\ell + 1 \leq i < 512N-64$. 4. $\sum_{i=512N-64}^{512N-1} 2^{i - (512N-64)} m_i' = \ell$. Note that $a_i = 1$ if and only if $i < \ell$, so we can enforce 1. by setting $m_i' = a_im_i + \ldots$, where $\ldots$ is yet to be determined but should be 0 when $i < \ell$. Now $i = \ell$ if and only if $\text{diff}^a_i = 0$, and $i \neq \ell$ if and only if $\text{inv}^a_i \cdot \text{diff}^a_i = 1$. So we can enforce 1. and 2. by requiring $m_i' = a_im_i + (1 - \text{inv}^a_i \cdot \text{diff}^a_i) + \ldots$, where $\ldots$ is equal to 0 whenever $i \leq \ell$. Next, we know that $b_i \oplus b_{i-64} = 1$ when $i \in \{512N-64, \ldots, 512N-1\}$ and $b_i \oplus b_{i-64} = 0$ otherwise. So if we constrain $$m_i' = a_im_i + (1 - \text{inv}^a_i \cdot \text{diff}^a_i) + (b_i \oplus b_{i-64}) m_i'$$ then the last term will be 0 whenever $\ell + 1 \leq i < 512N - 64$, ensuring 3. Recall that $\ell$ is the prover's input to $\text{diff}^a_0$. To enforce relation 4. we need to pair the right 64 bits with the right powers of two, without informing the verifier which 64 bits we're selecting. Since exponentiation is not polynomial (of low degree), we can't constrain the exponentiation using the indices directly. One solution to this is to use a fixed column, call it $\text{exp}$, such that $\text{exp}_i = 2^{(i \mod 64)}$ for all $i \in \{0, \ldots, \text{max} - 1 \}$. The idea is that, regardless of where the last 64 bits of the padded message are, they will begin at an index divisible by 64. So we can ensure that 4. holds by constraining $$\text{diff}^a_0 = \sum_{i=0}^{\text{max}-1} (b_i \oplus b_{i-64}) \text{exp}_i$$ #### Parsing In the NIST SHA256 specification, the padded message (which we will refer to as $M$ from now on to maintain consistency with the notation in the specification document) is divided into blocks of 512 bits each, denoted $M^{(1)}, \ldots, M^{(N)}$. These in turn are divided into sixteen 32-bit words, so for each $i \in \{1, \ldots, N\}$ we have $M^{(i)} = M_0^{(i)} || \cdots || M_{15}^{(i)}$. If we are hiding the message length, the problem we now face is that we have blocks $M^{(1)}, \ldots, M^{(N)}, M^{(N+1)}, \ldots, M^{(K)}$ but only need $M^{(1)}, \ldots, M^{(N)}$ in the hashing algorithm. However, we can store the hash of each sequential block $H^{(1)}, \ldots, H^{(K)}$ and select $H^{(N)}$ using constraints. ### Constraints for basic operations on 32-bit words The official SHA256 specification by NIST uses operations on 32-bit words to compute the hash. To make a SHA256 circuit, we need to constrain those operations. We describe how to do that here. In what follows, let $x = (x_0, x_1, \ldots, x_{31})$ $y = (y_0, y_1, \ldots, y_{31})$ $z = (z_0, z_1, \ldots, z_{31})$ $w = (w_0, w_1, \ldots, w_{31})$ be 32-bit words, where $w_i, x_i, y_i, z_i \in \{0, 1\}$ for all $i \in \{0, \ldots, 31\}$. The convention for the next section is that we are trying to constrain $w$ to be the result of some bitwise operation on some subset of $x$, $y$, and $z$. If $x$, $y$, or $z$ is given directly by the prover, or is the result of some fallible operation, we may have to constrain $x_i$, $y_i$ or $z_i$ to be a bit by enforcing the constraint $x_i^2 = x_i$ (or similarly for $y_i$, $z_i$). However, given the constraints we are about to describe, $w_i$ will always be a bit if $x_i$, $y_i$, and $z_i$ are all bits; otherwise the circuit will crash/the proof will be incorrect. Therefore we never need to *constrain* that $w_i$ is a bit, and can skip the constraint $w_i^2 = w_i$. #### $\lnot$ (bitwise NOT) To constrain $w = \lnot x$, constrain $$w_i = 1-x_i$$ for all $i \in \{0, \ldots, 31\}$. In most cases we will not have to write $w$ out separately using 32 new cells in the circuit and constrain it; instead we can just use $1-x_i$ directly in the context of our calculation. #### $\oplus$ (bitwise XOR) To constrain $w = x \oplus y$, constrain $$2w_i = (2x_i - 1)(2y_i - 1) + 1$$ for all $i \in \{0, \ldots, 31\}$. #### $\land$ (bitwise AND) To constrain $w = x \land y$, constrain $$w_i = x_iy_i$$ for all $i \in \{0, \ldots, 31\}$. ##### $\text{ROTR}^n$ If $x = (x_0, x_1, \ldots, x_{31})$ is a 32-bit word, then $\text{ROTR}(x)$ is defined to be $(x_{31}, x_0, x_1, \ldots, x_{30})$. $\text{ROTR}^n$ is just $\text{ROTR}$ applied $n$ times. So to constrain $w = \text{ROTR}^n(x)$, we can use copy constraints to enforce that $$ w_i = x_{(i-n) \mod 32}$$ for all $i \in \{0, \ldots, 31\}$. However, in many instances there will be no need to write $w$ using 32 new cells in the circuit. Instead, we can just directly refer to $x_{(i-n) \mod 32}$ in any calculation that uses $\text{ROTR}^n$. #### $\text{SHR}^n$ $\text{SHR}(x)$ is defined to be $(0, x_0, \ldots, x_{30})$, and $\text{SHR}^n$ is just $\text{SHR}$ applied $n$ times. So to constrain $w = \text{SHR}^n(x)$, constrain $$w_i = 0 \text{ if } i < n$$ and use copy constraints to constrain $$w_i = x_{i-n} \text{ if } i \geq n$$ Like with $\text{ROTR}^n$, in many cases we won't need to write out $y$ and constrain it; instead, we can use either $0$ or $x_{i-n}$ in the calculation where appropriate. #### $\text{Ch}(x, y, z)$ By definition, $$\text{Ch}(x, y, z) = (x \land y) \oplus (\lnot x \land z)$$ Combining the constraints for $\lnot$, $\land$, and $\oplus$ described previously, to enforce that $w = \text{Ch}(x, y, z)$ we can constrain $$2w_i = (2x_iy_i - 1)(2(1-x_i)z_i - 1) + 1$$ for all $i \in \{0, \ldots, 31 \}$. #### $\text{Maj}(x, y, z)$ By definition, $$\text{Maj}(x, y, z) = (x \land y) \oplus (x \land z) \oplus (y \land z)$$ Combining the constraints for $\oplus$ and $\land$ described previously, we can constrain $w = \text{Maj}(x, y, z)$ by constraining $$2w_i = (2x_iy_i - 1)(2x_iz_i - 1)(2y_iz_i - 1) + 1$$ for all $i \in \{0, \ldots, 31\}$. Note that $\text{Maj}$ is indeed a *majority term*: in other words, $\text{Maj}(x_i, y_i, z_i) = 1$ if and only if at least two of $x_i, y_i, z_i$ are equal to $1$. #### $+$ (addition modulo $2^{32}$) Remember that in the NIST SHA256 specification, 32-bit words are considered *big-endian*. Therefore $x_{31}$ is the least significant bit of $x_i$, and to apply the addition algorithm we need to start from there. Let $x$ and $y$ be the summands, and $w$ the sum. Following the usual addition algorithm, we introduce a carry, $c = (c_0, c_1, \ldots, c_{31})$. Initially we set the carry to zero, so $c_{31} = 0$. The least significant bit of the sum is $x_{31} \oplus y_{31} = x_{31} \oplus y_{31} \oplus 0 = x_{31} \oplus y_{31} \oplus c_{31}$. Now the next carry is $1$ exactly when $x_{31} + y_{31} + c_{31} \geq 2$, in other words, when at least two out of three of the previous summand bits and the carry is equal to $1$. This means that $c_{30} = \text{Maj}(x_{31}, y_{31}, c_{31})$. Now, when it comes to adding the last bits $x_0$, $y_0$, and $c_0$, we have no need for an additional carry, since the addition is being done modulo $2^{32}$. At this point in the chip construction, we already have a $\text{Maj}$ gate that we can reuse. We could construct another auxiliary 32-bit word $v$ and constrain $v_i = x_i \oplus y_i$ and $w_i = v_i \oplus c_i$ using our previously-established $\oplus$ gate. Perhaps this is the most efficient way. However, taking the bitwise XOR of three inputs comes up pretty often in SHA256, so at this stage it may be useful to establish a ternary XOR gate $\bigoplus (x, y, z)$. This is the approach we're taking in this document, but it could easily be revised to favor using an additional 32 cells and reusing the two-input $\oplus$ gate. Using polynomial constraints previously established for $\text{Maj}$, our constraints are: $$c_0 = 0$$ $$c_{i+1} = \text{Maj}(x_i, y_i, c_i)$$ $$2w_i = (2x_i - 1)(2y_i - 1)(2c_i - 1) + 1$$ where the second constraint is enforced for all $i \in \{0, \ldots, 30 \}$, and the third is a new $\bigoplus(x_i, y_i, c_i)$ gate applied for all $i \in \{0, \ldots, 31\}$. We wrote these constraints using the "natural" addition algorithm, which uses an additional 64 cells of space (32 for the sum $w$ and 32 for the carry $c$). It may be possible to save a 32 cells of space by absorbing the role of the carry $c$ into the constraints for $w$. We think this is a bad idea, though, since it would increase the degree and complexity of the gates needed to constrain $w$. #### $\Sigma_0^{256}$ According to the specification, $\Sigma_0^{256}$ is defined by $$\Sigma_0^{256}(x) = \text{ROTR}^2(x) \oplus \text{ROTR}^{13}(x) \oplus \text{ROTR}^{22}(x)$$ So we can constrain $w = \Sigma_0^{256}(x)$ by constraining $$w_i = \bigoplus (x_{(i-2) \mod 32}, x_{(i-13) \mod 32}, x_{(i-22) \mod 32})$$ using the three-input $\bigoplus$ gate established in the previous section for all $i \in \{0, \ldots, 31\}$. #### $\Sigma_1^{256}$ We have $$\Sigma_1^{256}(x) = \text{ROTR}^6(x) \oplus \text{ROTR}^{11}(x) \oplus \text{ROTR}^{25}(x)$$ which we can constrain by enforcing $$w_i = \bigoplus (x_{(i-6) \mod 32}, x_{(i-11) \mod 32}, x_{(i-25) \mod 32})$$ for all $i \in \{0, \ldots, 31\}$. #### $\sigma_0^{256}$ Now $\sigma_0^{256}$ is defined by $$\sigma_0^{256}(x) = \text{ROTR}^7(x) \oplus \text{ROTR}^{18}(x) \oplus \text{SHR}^{3}(x)$$ Using the piecewise definition of $\text{SHR}$ given above, we have that $\text{SHR}^3(x_i) = 0$ if $i < 3$ and $\text{SHR}^3(x_i) = x_{i-3}$ otherwise. We also have the identity $b \oplus 0 = b$ for any bit $b$. Therefore we can constrain $w = \sigma_0^{256}(x)$ by enforcing $$w_i = x_{(i-7) \mod 32} \oplus x_{(i-18) \mod 32} \text{ for } i \in \{0, 1, 2\}$$ $$w_i = \bigoplus (x_{(i-7) \mod 32}, x_{(i-18) \mod 32}, x_{i-3}) \text{ for } i \in \{3, \ldots, 31\}$$ #### $\sigma_1^{256}$ Similarly, $\sigma_1^{256}$ is defined by $$\sigma_1^{256}(x) = \text{ROTR}^{17}(x) \oplus \text{ROTR}^{19}(x) \oplus \text{SHR}^{10}(x)$$ which we can constrain by enforcing $$w_i = x_{(i-17) \mod 32} \oplus x_{(i-19) \mod 32} \text{ for } i \in \{0, \ldots, 9\}$$ $$w_i = \bigoplus (x_{(i-17) \mod 32}, x_{(i-19) \mod 32}, x_{i-10}) \text{ for } i \in \{10, \ldots, 31\}$$ ### Initialization The hash is divided into eight 32-bit words $H_0^{(0)}, \ldots H_7^{(0)}$, which are initialized to known constants. These can be hard-coded into the circuit either as expressions, or as entries in a fixed column. Using a fixed column would probably be more convenient. We will need a fixed column anyway to store constants $K_0, \ldots, K_{63}$, so if the number of fixed columns is an issue we can store $H_0^{(0)}, \ldots H_7^{(0)}$ in the same fixed column. Each message block $M^{(i)}$ is divided up into 16 32-bit words $M_t^{(i)}$ for $t \in \{0, \ldots, 15\}$. Then $H^{(i)}$ is computed from $H^{(i-1)}$ and $M^{(i)}$ for each message block $M^{(i)}$. ### Hashing the $i^{\text{th}}$ block #### Preparing the message schedule The message schedule consists of 32-bit words $W_t$ where $t \in \{0, \ldots, 63\}$. Each $W_t$ is defined by $$W_t = M_t^{(i)} \text{ for } t \in \{0, \ldots, 15\}$$ $$W_t = \sigma_1^{256}(W_{t-2}) + W_{t-7} + \sigma_0^{256}(W_{t-15}) + W_{t-16} \text{ for } t \in \{16, \ldots, 63\}$$ Unfortunately, we cannot save space by referring directly to the $i^{\text{th}}$ block of $M$, and will have to constrain $W_t$ for $t \in \{0, \ldots, 15\}$ via copy constraints. This is because we will want each region to represent a round of the hashing algorithm: if we tried to include everything in the same region as the original padded message, the offset indices would quickly become large and messy. We can constrain each application of $\sigma$ with only one intermediate 32-bit word, and $+$ with two, so in total we need eight 32-bit words (including $W_t$ itself) to appropriately constrain $W_t$ for each $t \in \{16, \ldots, 63\}$. This comes out to 8 * 32 * (64 - 16) = 12,288 cells to compute the last 48 words of the message schedule. How should these cells be stored in the circuit? For a SHA256 chip it is natural to have eight advice columns for the eight working variables $a$ through $h$ (see next section). Since eight intermediate words are needed to calculate each $W_t$, the message schedule can fit into existing advice columns in this way: |$a$ |$b$ |$c$ |$d$ |$e$ |$f$ |$g$ |$h$ | |-----|----|----|----|----|----|----|----| |$W_0$| | | | | | | | |$\vdots$| | | | | | | | |$W_{15}$| | | | | | | | |$W_{16}$| $w_{16}^{(0)}$|$w_{16}^{(1)}$|$w_{16}^{(2)}$|$w_{16}^{(3)}$|$w_{16}^{(4)}$|$w_{16}^{(5)}$|$w_{16}^{(6)}$| |$\vdots$| | | | | | | | |$W_{63}$| $w_{63}^{(0)}$|$w_{63}^{(1)}$|$w_{63}^{(2)}$|$w_{63}^{(3)}$|$w_{63}^{(4)}$|$w_{63}^{(5)}$|$w_{63}^{(6)}$| where $w_t^{(j)}$ are intermediate 32-bit words used to constrain $W_t$. This uses a total of 64 * 32 = 2048 rows in 8 advice columns. It may seem that a lot of space is wasted in the first 16 * 32 rows, but this is to ensure that the same relative references are used to calculate $W_t$ for each $t \in \{16, \ldots, 63 \}$. #### The eight working variables The next step of SHA256 is to initialize eight working variables $a, b, c, d, e, f, g, h$ to the previous hash value: $a = H_0^{(i-1)}, \ldots, h = H_7^{(i-1)}$. Then for $t \in \{0, \ldots, 63\}$ we must constrain $$T_1 = h + \Sigma_1^{256}(e) + \text{Ch}(e, f, g) + K_t^{256} + W_t \\ T_2 = \Sigma_0^{256}(a) + \text{Maj}(a, b, c) \\ h = g \\ g = f \\ f = e \\ e = d + T_1 \\ d = c \\ c = b \\ b = a \\ a = T_1 + T_2$$ Here $\{K_0^{256}, \ldots, K_{63}^{256}\}$ is a set of constants specific to the SHA256 algorithm. Definitely we will want to design a chip to do this intermediate computation. Therefore we cannot attempt to save space by, for example, referring directly to $c$ in the previous round to stand for $d$ in the next round. Keeping track of such references for all 64 rounds would be impossible anyway. So we need to use copy constraints to constrain the equality relations. The second consequence of this is that we cannot hard-code the constants $K_t$ as expressions; instead we will have to do it by instantiating them in a fixed column. But this is more elegant anyway. Since $W_t$ only refers to the current round, we can pass it in as an input to the chip and don't need use space copying it inside the chip. Each application of $\text{Ch}$, $\Sigma$, or $\text{Maj}$ requires one intermediate 32-bit word, and each application of $+$ requires two. Therefore we need ten 32-bit words to constrain $T_1$ (including $T_1$ itself), and four to constrain $T_2$ (including $T_2$). Once we have these written down, no additional intermediate 32-bit words need to be assigned to constrain $e$ and $a$. This brings us to a total of 14 intermediate 32-bit words, or 14 * 32 = 448 cells. Since we have eight working variables, each with their own advice column, and 14 intermediate values, this suggests a chip design like the following. |$v_0$|$v_1$|$v_2$|$T_1$| |$w_0$|$w_1$|$w_2$| | --- | --- | --- | --- | --- | --- | --- | --- | |$w_3$|$w_4$|$w_5$|$w_6$|$w_7$|$w_8$|$T_2$| | | $a$ | $b$ | $c$ | $d$ | $e$ | $f$ | $g$ | $h$ | Here $v_0, v_1, v_2, w_0, \ldots, w_8, T_1, T_2$ are intermediate 32-bit words, and $a$ through $h$ are the *outputs* of the chip. These are constrained by referring to the values located outside the chip, which are passed to the chip as inputs. These inputs include the previous values of $a$ through $h$, as well as $W_t$ from the previously-computed message schedule and $K_t$ from the fixed column. Using this design requires 3 * 32 = 96 rows and 8 advice columns. One could make the chip more self-contained by filling in the empty cells with $K_t$ and $W_t$: this would not use any more space but would introduce 64 new copy constraints per round. Since this chip would be assigned once for each $t \in \{0, \ldots, 63\}$, computing $H^{(i)}$ from $H^{(i-1)}$ requires 8 advice columns and 64 * 96 = 6,144 rows. Once we have the output of the last iteration of this chip, we constrain $H^{(i+1)}_0 = a + H^{(i)}, \ldots, H_7^{(i+1)} = h + H_7^{(i)}$. Once this has been done for $i \in \{1, \ldots, N\}$, where $N$ is the number of blocks, the final SHA256 hash of the message is $$H_0^{(N)} || \cdots || H_7^{(N)}$$ (TODO: describe how to store the $i^{\text{th}}$ hashed block in an advice column and select the indices of the result using $N$) # `map_to_curve` ## Simplified SWU for $AB = 0$. TODO: use this instead. ``` The following is a straight-line implementation of the Simplified SWU mapping that applies to any curve over GF(q) where q = 9 (mod 16). This includes the curve isogenous to BLS12-381 G2 (Section 8.8.2). map_to_curve_simple_swu_9mod16(u) Input: u, an element of F. Output: (xn, xd, yn, yd) such that (xn / xd, yn / yd) is a point on the target curve. Constants: 1. c1 = (q - 9) / 16 # Integer arithmetic 2. c2 = sqrt(-1) 3. c3 = sqrt(c2) 4. c4 = sqrt(Z^3 / c3) 5. c5 = sqrt(Z^3 / (c2 * c3)) Steps: 1. tv1 = u^2 2. tv3 = Z * tv1 3. tv5 = tv3^2 4. xd = tv5 + tv3 5. x1n = xd + 1 6. x1n = x1n * B 7. xd = -A * xd 8. e1 = xd == 0 9. xd = CMOV(xd, Z * A, e1) # If xd == 0, set xd = Z * A 10. tv2 = xd^2 11. gxd = tv2 * xd # gxd == xd^3 12. tv2 = A * tv2 13. gx1 = x1n^2 14. gx1 = gx1 + tv2 # x1n^2 + A * xd^2 15. gx1 = gx1 * x1n # x1n^3 + A * x1n * xd^2 16. tv2 = B * gxd 17. gx1 = gx1 + tv2 # x1n^3 + A * x1n * xd^2 + B * xd^3 18. tv4 = gxd^2 19. tv2 = tv4 * gxd # gxd^3 20. tv4 = tv4^2 # gxd^4 21. tv2 = tv2 * tv4 # gxd^7 22. tv2 = tv2 * gx1 # gx1 * gxd^7 23. tv4 = tv4^2 # gxd^8 24. tv4 = tv2 * tv4 # gx1 * gxd^15 25. y = tv4^c1 # (gx1 * gxd^15)^((q - 9) / 16) 26. y = y * tv2 # This is almost sqrt(gx1) 27. tv4 = y * c2 # check the four possible sqrts 28. tv2 = tv4^2 29. tv2 = tv2 * gxd 30. e2 = tv2 == gx1 31. y = CMOV(y, tv4, e2) 32. tv4 = y * c3 33. tv2 = tv4^2 34. tv2 = tv2 * gxd 35. e3 = tv2 == gx1 36. y = CMOV(y, tv4, e3) 37. tv4 = tv4 * c2 38. tv2 = tv4^2 39. tv2 = tv2 * gxd 40. e4 = tv2 == gx1 41. y = CMOV(y, tv4, e4) # if x1 is square, this is its sqrt 42. gx2 = gx1 * tv5 43. gx2 = gx2 * tv3 # gx2 = gx1 * Z^3 * u^6 44. tv5 = y * tv1 45. tv5 = tv5 * u # This is almost sqrt(gx2) 46. tv1 = tv5 * c4 # check the four possible sqrts 47. tv4 = tv1 * c2 48. tv2 = tv4^2 49. tv2 = tv2 * gxd 50. e5 = tv2 == gx2 51. tv1 = CMOV(tv1, tv4, e5) 52. tv4 = tv5 * c5 53. tv2 = tv4^2 54. tv2 = tv2 * gxd 55. e6 = tv2 == gx2 56. tv1 = CMOV(tv1, tv4, e6) 57. tv4 = tv4 * c2 58. tv2 = tv4^2 59. tv2 = tv2 * gxd 60. e7 = tv2 == gx2 61. tv1 = CMOV(tv1, tv4, e7) 62. tv2 = y^2 63. tv2 = tv2 * gxd 64. e8 = tv2 == gx1 65. y = CMOV(tv1, y, e8) # choose correct y-coordinate 66. tv2 = tv3 * x1n # x2n = x2n / xd = Z * u^2 * x1n / xd 67. xn = CMOV(tv2, x1n, e8) # choose correct x-coordinate 68. e9 = sgn0(u) == sgn0(y) # Fix sign of y 69. y = CMOV(-y, y, e9) 70. return (xn, xd, y, 1) ``` ``` c1 = 0x2a437a4b8c35fc74bd278eaa22f25e9e2dc90e50e7046b466e59e49349e8bd050a62cfd16ddca6ef53149330978ef011d68619c86185c7b292e85a87091a04966bf91ed3e71b743162c338362113cfd7ced6b1d76382eab26aa00001c718e3 c2 = 0x0 + 0x1 * I c3 = 0x135203e60180a68ee2e9c448d77a2cd91c3dedd930b1cf60ef396489f61eb45e304466cf3e67fa0af1ee7b04121bdea2 + 0x06af0e0437ff400b6831e36d6bd17ffe48395dabc2d3435e77f76e17009241c5ee67992f72ec05f4c81084fbede3cc09 * I c4 = 0x136753aead177603ecbfaf2395ee800fb38ef1737f8232e72bb1880c78ae1cabd529aa5c0667f539924950420e408e1b + 0x11eb95120939a15aed4b108ad51262f33bf72acf3adb46259d28f0306d0e27ffe7d29afc46792c103e535c80de7bc0f6 * I c5 = 0x0f5d0d63d2797471e6d39f306cc0dc0ab85de3bd9f39ce46f3649ac0de9e844417cc8de88716c1fd323fa68040801aea + 0x0ab1c2ffdd6c253ca155231eb3e71ba044fd562f6f72bc5bad5ec46a0b7a3b0247cf08ce6c6317f40edbc653a72dee17 * I ``` --- ``` Operations: 1. tv1 = inv0(Z^2 * u^4 + Z * u^2) 2. x1 = (-B / A) * (1 + tv1) 3. If tv1 == 0, set x1 = B / (Z * A) 4. gx1 = x1^3 + A * x1 + B 5. x2 = Z * u^2 * x1 6. gx2 = x2^3 + A * x2 + B 7. If is_square(gx1), set x = x1 and y = sqrt(gx1) 8. Else set x = x2 and y = sqrt(gx2) 9. If sgn0(u) != sgn0(y), set y = -y 10. return (x, y) ``` ### `inv0` By definition, $\text{inv0}(x)$ returns the inverse of $x$ if $x$ is invertible, and returns 0 otherwise. In the reference document they recommend implementing this by computing $\text{inv0}(x) = x^{q-2}$ where $q$ is the size of the field. This is unnecessary in our case, however. Notice that the purpose of $\text{inv0}$ in this function is to condition the value of $x_1$ on whether or not $\text{tv}_1$ is invertible. So as usual we can create an auxiliary value $\text{inv}_1$ and suggest that the prover set $\text{inv}_1 = \text{tv}^{-1}$. We can then constrain $x_1$ by constraining $$x_1 = (\text{tv}_1 \cdot \text{inv}_1)(-B/A)(1 + \text{tv}_1) + (1 - \text{tv}_1 \cdot \text{inv}_1)(B/ZA)$$ If $\text{tv}_1$ is invertible, then $\text{tv}_1 \cdot \text{inv}_1 = 1$, the second term vanishes, and $x_1 = (-B/A)(1+\text{tv}_1)$. Otherwise, $\text{tv}_1 \cdot \text{inv}_1 = 0$, the first term vanishes, and $x_1 = B/ZA$. ### `is_square` ``` is_square(x) Parameters: - F, an extension field of characteristic p and order q = p^2 with basis (1, I). Input: x, an element of F. Output: True if x is square in F, and False otherwise. Constants: 1. c1 = (p - 1) / 2 # Integer arithmetic Procedure: 1. tv1 = x_1^2 2. tv2 = I * x_2 3. tv2 = tv2^2 4. tv1 = tv1 - tv2 5. tv1 = tv1^c1 6. e1 = tv1 != -1 # Note: -1 in F 7. return e1 ``` Everything here is pretty easy to constrain except until line 5, in which we must raise $\text{tv}_1$ to the power $c_1$. Exponentiation is not a (low degree) polynomial operation in finite fields, so we will have to build constraints for this. We can, however, decompose $c_1$ into bits and use the square-and-multiply algorithm to constrain the result of exponentiation by $c_1$. Thankfully, $c_1 = \frac{p-1}{2}$ is a constant, and we can instantiate its binary decomposition in a fixed column. For the time being let us denote the temporary variable we are trying to exponentiate by $t$, and the exponent by $c$. Let $c_0, \ldots, c_{379}$ be the big endian binary decomposition of $c$. We initialize by constraining $t_{-1} = 1$. Then we want to enforce the relations 1. $t_i = t_{i-1}^2$ if $c_i = 0$. 2. $t_i = t_{i-1}^2 \cdot t$ if $c_i = 1$. So our constraints should be $$t_{-1} = 1$$ $$t_i = t_{i-1}^2 (c_i(t-1)+1)$$ Then the final value $t_{379}$ will be constrained to be $t^c$, as desired. ### `sqrt` ``` sqrt_9mod16(x) Parameters: - F, a finite field of characteristic p and order q = p^m. Input: x, an element of F. Output: z, an element of F such that (z^2) == x, if x is square in F. Constants: 1. c1 = sqrt(-1) in F, i.e., (c1^2) == -1 in F 2. c2 = sqrt(c1) in F, i.e., (c2^2) == c1 in F 3. c3 = sqrt(-c1) in F, i.e., (c3^2) == -c1 in F 4. c4 = (q + 7) / 16 # Integer arithmetic Procedure: 1. tv1 = x^c4 2. tv2 = c1 * tv1 3. tv3 = c2 * tv1 4. tv4 = c3 * tv1 5. e1 = (tv2^2) == x 6. e2 = (tv3^2) == x 7. tv1 = CMOV(tv1, tv2, e1) # Select tv2 if (tv2^2) == x 8. tv2 = CMOV(tv4, tv3, e2) # Select tv3 if (tv3^2) == x 9. e3 = (tv2^2) == x 10. z = CMOV(tv1, tv2, e3) # Select the sqrt from tv1 and tv2 11. return z ``` There is a subtlety to this algorithm and its use in the `map_to_curve` function. Fixing $c_1, c_2, c_3$ such that $c_1^2 = -1$, $c_2^2 = c_1$, and $c_3^2 = -c_1$, this algorithm deterministically produces a square root $z$ of a given input $x$. The chosen $z$ depends on our initial choices of $c_1, c_2, c_3$. So it would seem that $c_1, c_2, c_3$ should be given concrete values in the specification so that calling functions will follow a certain expected behavior. Instead, we are free to choose $c_1, c_2, c_3$. This apparent ambiguity is resolved in the calling functions themselves. Any function that calls `sqrt` also calls `sgn0` at some point, to determine whether the square root found by the `sqrt` function is "positive" or "negative" according to `sgn0`. The given algorithm is fairly close to how one would specify it in a circuit, but it is not especially human-readable. For convenience, let us exercise our freedom as implementors and make some arbitrary choices. Suppose that $c_1 = i$. Suppose that $c_2 = \omega$ is a primitive eighth root of unity in $\mathbb F_p(i)$, so that $\omega^2 = i$. And suppose that $c_3 = \omega^3$, so that $c_3^2 = \omega^6 = -i$. Also, let $a = c_4 = \frac{q+7}{16} = \frac{p^2 + 7}{16}$. Then what the algorithm says is that, for $x \in \mathbb F_p(i)$, 1. $z = \omega x^a$ if $x^{2a - 1} = -i$. 2. $z = \omega^3 x^a$ if $x^{2a - 1} = i$. 3. $z = ix^a$ if $x^{2a-1} = -1$. 4. $z = x^a$ if $x^{2a-1} = 1$. Note: all operations are performed in $\mathbb F_p(i)$ and all numbers are field elements *except* $a$, which is an integer. Through some experimentation on https://www.boxentriq.com/code-breaking/big-number-calculator , the author has found that the following constants ``` c1 = 0x0 + 0x1 * I c2 = 0x135203e60180a68ee2e9c448d77a2cd91c3dedd930b1cf60ef396489f61eb45e304466cf3e67fa0af1ee7b04121bdea2 + 0x6af0e0437ff400b6831e36d6bd17ffe48395dabc2d3435e77f76e17009241c5ee67992f72ec05f4c81084fbede3cc09 * I c3 = 0x135203e60180a68ee2e9c448d77a2cd91c3dedd930b1cf60ef396489f61eb45e304466cf3e67fa0af1ee7b04121bdea2 + 0x135203e60180a68ee2e9c448d77a2cd91c3dedd930b1cf60ef396489f61eb45e304466cf3e67fa0af1ee7b04121bdea2 * I c4 = 0x2a437a4b8c35fc74bd278eaa22f25e9e2dc90e50e7046b466e59e49349e8bd050a62cfd16ddca6ef53149330978ef011d68619c86185c7b292e85a87091a04966bf91ed3e71b743162c338362113cfd7ced6b1d76382eab26aa00001c718e4 ``` satisfy the conditions $c_1^2 = -1$, $c_2^2 = c_1$, $c_3^2 = -c_1$, and $c_4 = \frac{p^2+7}{16}$, as well as our simplifying assumptions made above. ### `sgn0` `sgn0` is a way of assigning a "sign" to a field element. Unlike in the integers, the decision to call certain field elements positive and others negative is somewhat arbitrary, as there is no ordering of a finite field compatible with the addition and multiplication operations. So why would we want to assign a "sign" to each element? Because for any ordered pair of field elements $(x,y)$ where $y^2 = x^3 + ax + b$, i.e. for any elliptic curve point, we have that $(x, -y)$ is also an elliptic curve point. For some algorithms, we want to be able to designate which of $y, -y$ is "positive" in a deterministic way, even if that choice is arbitrary. This is the function of `sgn0`. In the reference document they give a `sgn0` algorithm for general finite fields, but they also give simplified versions for prime fields and degree-2 extension fields. Since we are working in a degree-2 extension field $\mathbb F_p(i)$, we use the simplified algorithm below. ``` sgn0_m_eq_2(x) Input: x, an element of GF(p^2). Output: 0 or 1. Steps: sign_0 = x_0 mod 2 zero_0 = x_0 == 0 sign_1 = x_1 mod 2 return sign_0 OR (zero_0 AND sign_1) # Avoid short-circuit logic ops ``` In plain English, we examine the real part $x_0$ of a field element $x = x_0 + x_1i$. If $x_0 \neq 0$, then we declare $x$ to be positive if $x_0$ is even, and negative if $x_0$ is odd. If $x_0 = 0$, then we examine the imaginary part $x_1$. Then we declare $x$ to be positive if $x_1$ is even and negative if $x_1$ is odd. #### `mod 2` The first problem with the `sgn0` algorithm given in the specification is that "mod 2" has no meaning inside a finite field. In the integers, $x \mod 2 = 0$ if $x$ is divisible by 2. However, in a finite field, 2 is invertible, so *every* element is divisible by 2. So we need a way of determining whether an element of a prime field is even or odd when viewed as an integer. Let $x \in \mathbb F_p$, where $p$ is a prime which is 381 bits long. The naiive way of checking whether $x$ is even or odd is to have the prover give us a list of values $x_0, \ldots, x_{380}$, thought of as the big-endian binary representation of $x$, constrain $$x_i^2 = x_i \text{ for } i \in \{0, \ldots, 380\}$$ $$x = \sum_{i=0}^{380} 2^{380 - i} x_i $$ and check whether $x_{380} = 1$ or $x_{380} = 0$, the idea being that if $x_{380} = 0$, then $x$ is even, otherwise $x$ is odd. The problem with this approach is that, since the sum is taken mod $p$, the binary representation of $x$ may not be unique. However, the binary representation of $x$ *as an integer less than $p$* is unique. In order to check that the prover filled in the "canonical" binary representation which does not exceed $p$, we have to compare the given binary representation of $x$ with the binary representation of $p$ itself. Let $(p_0, \ldots, p_{380})$ be the binary representation of $p$ as an integer. (This is a constant of the protocol independent of our particular proof, and can be stored in a fixed column, so we don't have to worry about its correctness). If $(x_0, \ldots, x_{380})$ is *lexicographically* less than $(p_0, \ldots, p_{380})$, then we have the canonical binary representation of $x$, and can declare that $x$ is even if and only if $x_{380} = 0$. By the way, this is why we chose to use the *big-endian* binary representation of $x$, because the notation is easier when comparing lexicographically from most significant to least significant bit. Note that since $p_i$ and $x_i$ are bits, $p_i - x_i = 1$ if $p_i > x_i$, $p_i - x_i = 0$ when $p_i = x_i$, and $p_i = x_i = -1$ when $p_i < x_i$. Let $j$ be the least $i$ such that $p_i - x_i \neq 0$. If no such $j$ exists, then $p = x$. If $p_j - x_j = 1$, then none of the remaining bits matter, and we can immediately conclude that $p > x$. Similarly, if $p_j - x_j = -1$, then again none of the remaining bits matter, and we can conclude that $p < x$. One way to capture this idea in a circuit is to have an auxiliary column containing values $y_0, \ldots, y_{380}$, which we would like to constrain so that 1. $y_i = p_i - x_i$ for all $i \leq j$. 2. $y_i = y_j$ for all $i \geq j$. Then to compare $p$ and $x$, we only need to look at $y_{380}$. If $y_{380} = 0$, then $p = x$ since $j$ does not exist and so we never got to step 2. If $y_{380} = 1$, then $p > x$, and if $y_{380} = -1$, then $p < x$. The following conditions are equivalent to 1. and 2. 1. $y_0 = p_0 - x_0$. 2. For all $i \in \{1, \ldots, 380\}$, $y_i = y_{i-1}$ if $y_{i-1} \neq 0$, otherwise $y_i = p_i - x_i$. The equivalence is clear since, as soon as $p_i - x_i$ hits a nonzero value (at index $j$ by definition of $j$), that value is copied to $y_{j+1}$, which is now nonzero so it gets copied to $y_{j+2}$, and so on for all $i \geq j$. This rephrasing is convenient because it allows us to express the comparison in polynomial constraints as follows. 1. $y_0 = p_0 - x_0$. 2. $y_i = (1 - y_{i-1}^2)(p_i - x_i) + y_{i-1}$ for all $i \in \{1, \ldots, 380\}$. Note that $y_i \in \{-1, 0, 1\}$ for all $i \in \{0, \ldots, 380\}$ (by induction using the above constraints), so that $y_i^2 = 1$ if and only if $y_i \neq 0$. Thus if $y_{i-1} \neq 0$ the second constraint says that $y_i = y_{i-1}$; on the other hand if $y_{i-1} = 0$, the second constraint says that $y_i = p_i - x_i$, as desired. Now if $y_{380} = 0$ or $y_{380} = -1$, then $x$ overflows $p$. However, since we only allotted 381 bits for the prover to represent $x$, it is impossible for $x$ to overflow $p$ twice. More formally, $x < 2^{382}$ and $p > 2^{381}$, so we always have $0 \leq x < 2p$. So if $x$ as given is not the canonical representation of $x \bmod p$, then at least we know that $x - p$ is. Since $p$ is odd, this means that, if $y_{380} = -1$, then $x \bmod p$ is even if $x_{380} = 1$ and odd if $x_{380} = 0$. If $x = p$ then $x \bmod p = 0$, so $x$ is even. For the moment let us denote $x \bmod 2$ by $s$. Then, in summary, we have 1. If $y_{380} = 1$, then $s = 0$ if $x_{380} = 0$ and $s = 1$ if $x_{380} = 1$. 2. If $y_{380} = 0$, then $s = 0$. 3. If $y_{380} = -1$, then $s = 0$ if $x_{380} = 1$ and $s = 1$ if $x_{380} = 0$. The reader can check that this logic is encapsulated by the following polynomial constraint: $$2s = y_{380}(2x_{380}+y_{380}-1)$$ --- Returning to the original `sgn0` function, if $x = x_0 +x_1i$, then we can constrain $s_0 = x_0 \bmod 2$ and $s_1 = x_1 \bmod 2$ as above. For line 2 of the function we need to constrain a variable $z$ so that $z = 1$ if $x_0 = 0$ and $z = 0$ otherwise. To do this, have the prover fill in an auxiliary value $z'$ and constrain it by $$x_0(1-z'x_0) = 0$$ so that either $x_0 = 0$ or $z' = x_0^{-1}$. Then constrain $$z = 1-z'x_0$$ If $x_0 \neq 0$, then $z' = x_0^{-1}$ and $z = 1 - x_0^{-1}x_0 = 0$, but if $x_0 = 0$, then $z = 1$, as desired. Note that if $a$ and $b$ are bits, then we can enforce $c = a \vee b$ by the constraint $$c = a + b - ab$$ Thus if $s$ is the final output of our `sgn0` function, we constrain $$s = s_0 + zs_1 - zs_0s_1$$ ## `iso_map` # `clear_cofactor` https://crypto.stackexchange.com/questions/37537/what-are-i2osp-os2ip-in-rsa-pkcs1