Efficient Substring Matching Circuit in Halo2

# Efficient Substring Matching Circuit in Halo2 ## Introduction Currently, our substring matching circuit in Halo2(https://github.com/zkemail/halo2-regex) requires O($m n\log{n}$) constraints, where $n, m$ denote max bytes and the number of substring definitions. This is because it computes a variable shift of the characters, which consumes O($n\log{n}$) constraints, for each substring definition. Here, I propose a new design of the substring matching circuit whose constraints are $O(n)$, independent of $m$! ## Table Design This is an example of the table design. | Characters (Advice) | States (Advice) | SubstrId (Advice) | IsIdNotZero(Advice) | ValidChars (Lookup) | ValidCurStates (Lookup) | ValidNextStates (Lookup) | ValidSubstrId (Lookup)| | -------- | -------- | -------- | -------- | -------- | -------- | -------- | -------- | | f | 0 | 0 | 0 | f | 0 | 1 | 0 | | r | 1 | 0 | 0 | r | 1 | 2 | 0 | | o | 2 | 0 | 0 | o | 2 | 3 | 0 | | m | 3 | 0 | 0 | o | 3 | 4 | 0 | | a | 4 | 1 | 1 | a | 4 | 5 | 1 | | l | 5 | 1 | 1 | b | 4 | 5 | 1 | | $...$ | $...$ | $...$ | $...$ | $...$ | $...$ | $...$ | $...$ | | o | 5 | 1 | 1 | k | 4 | 5 | 1 | | m | 5 | 1 | 1 | l | 4 | 5 | 1 | | t | 6 | 0 | 0 | m | 4 | 5 | 1 | | $...$ | $...$ | $...$ | $...$ | $...$ | $...$ | $...$ | $...$ | | s | 40 | 0 | 0 | a | 5 | 5 | 1 | | u | 41 | 0 | 0 | b | 5 | 5 | 1 | | b | 42 | 0 | 0 | c | 5 | 5 | 1 | | j | 43 | 0 | 0 | d | 5 | 5 | 1 | | e | 44 | 0 | 0 | e | 5 | 5 | 1 | | c | 45 | 0 | 0 | f | 5 | 5 | 1 | | t | 46 | 0 | 0 | g | 5 | 5 | 1 | | : | 47 | 0 | 0 | h | 5 | 5 | 1 | | H | 48 | 2 | 1 | j | 5 | 5 | 1 | | e | 49 | 2 | 1 | k | 5 | 5 | 1 | | l | 49 | 2 | 1 | l | 5 | 5 | 1 | | $...$ | $...$ | $...$ | $...$ | $...$ | $...$ | $...$ | $...$ | This table performs both of the regex checking and the substring matching in our current regex circuit. Here, I introduced a new concept, a substring id. When the table handles $m$ substring definitions, a state transition (apair of ValidCurStates and ValidNextStates) in the $i$-th ($1\leq i \leq m$) substring definition has the substring id $i$. The transition that does not belong to any definitions has the id 0. We make lookup constraints that (Characters, States[cur], States[next], SubstrId) are equal to one of (ValidChars, ValidCurStates, ValidNextStates, ValidSubstrId). The table also has a IsIdNotZero column, where each row value is 1 iff the value of SubstrId in the same row is 0 and 1 otherwsie. We outputs $Characters * IsIdNotZero$ for each row as a public input. It implies that it reveals the characters in any substrings and masks 0 for the other characters. In our example, the public input is [NULL,NULL,...,NULL,a,l,i,c,e,@,c,o,m,NULL,...,NULL,H,e,l,l,o,NULL,...,NULL]. ("alice@com" is the from email address and "Hello" is the subject.) The prover sends a verifier an initial position, an end position, and the substring bytes for each substring definition. The verifier can construct the same masked string by padding NULL between substrings. However, in some cases, it is not practical because the verifier has to specify O($n$) values as the public input. To solve this problem, the prover and the verifier computes the hash of the masked substring, which is used as the public input. While it still requires the verifier to consume O($n$) computation for the hash function, its gas on Ethereum will be lower than that of assigning EC points for all bytes.