ZK Regex - HackMD

# ZK Regex ## Introduction *This is a draft version of the official post [https://katat.me/blog/ZK+Regex](https://katat.me/blog/ZK+Regex)* The [zk regex concept](https://blog.aayushg.com/posts/zkemail#regex-deterministic-finite-automata-in-zk) originated from the [zk-email project](https://github.com/zk-email-verify/zk-email-verify). I found it unique in the zk circuit world and reckon it would play an important role in use cases of attestations from web2 on web3. So I dived into it to understand how it works, including how the regular expression works and how to convert it to a circom circuit. This process, to me, sheds light on how regex works internally (even though I have been using it for years) via NFA / DFA. Meanwhile, I refactored the circuit compiler a bit, into a standalone project [zk-regex](https://github.com/zk-email-verify/zk-regex), to try to encapsulate it in a way that makes it easier for developers to use and better aligns with regex usages in conventional programming languages. This post will focus on the introduction of why zk regex is needed and its advantage over common search via substring. Then it will try to explain the theory behind regex and how a regex circuit works to assert patterns via some examples, diagrams, and mathematical artifacts. Going through these might also help you to understand why zk helps bridge the web2 and web3, how a circuit is built to pave the way, and get inspired to use zk regex for potential use cases. ## Why ZK needs regex In web3, where decentralized applications (dApps) are built on blockchain platforms like Ethereum, there is often a need to attest that a piece of content follows specific string patterns without revealing the content itself. This is typically done to prove authority from web2 in web3, but it can also serve other purposes, such as enforcing constraints for integrity. Being able to attest to materials in web2 on web3, without middlemen, will serve as a bridge between web2 and web3. It would serve a critical role in porting the existing web2 infrastructures to web3, accelerating the web3 adoption. However, submitting the whole content to a smart contract to attest to its pattern would leak the content and be counterproductive to the purposes of on-chain infrastructures. This is where ZK Regex comes in. ZK Regex allows for proving the pattern of a piece of content without revealing the content itself. This is achieved through the succinctness of zero-knowledge tech, which allows a prover to prove that they got content following a particular set of patterns without submitting the entire content. Unlike other methods, such as substring searching, ZK Regex can prove the pattern of a piece of content without having to check every possible substring. In other words, Regex has much better encapsulations of the searching purposes. It can be used for both flexibility and privacy. The attester only requires the content to follow a particular set of patterns instead of any extract substrings. Thus, regex searching is better aligned with the purposes of zero-knowledge proof, enabling data to be processed at a higher degree of security and privacy. ## How regex works In a nutshell, there is a process of converting regular expression(regex) to DFA, such that the matching algorithm can run through a specified content and generate data points of various matches. The basic workflow for converting a regex into a DFA is as follows: 1. Parse the expression string: The first step is to parse the regex expression string into a data structure, such as a parse tree or an abstract syntax tree (AST), representing the expression's structure. This process involves breaking down the expression into component parts, such as characters, operators, and sub-groups, and representing them in a structured format. 2. Convert the parse tree to a NFA: The next step is to use the parse tree to construct a NFA that recognizes the language defined by the regex. An NFA is a finite-state machine with multiple transitions from a single state for a given input symbol. 3. Convert the NFA to a DFA: The NFA can then be converted into a DFA, a finite state machine with precisely one transition from a state for each possible input symbol. The conversion process involves using the NFA to construct a new DFA that recognizes the same language but deterministically. Here won't dive into the details of the process from regex to DFA. I recommend checking out [this blog post](https://dmitrysoshnikov.medium.com/building-a-regexp-machine-part-1-regular-grammars-d4986b585d7e) to understand the theory behind the regular expression. This will focus on the explanation of converting DFA to a circom circuit. ### DFA formal definition To try to convey the features behind DFA for the regex matching algorithm, we describe its definition in a concise and standardized way. $$M = (S, \Sigma, \delta, s_0, F)$$ where: - $S$ is a finite set of states. - $\Sigma$ is the input alphabet, a finite set of symbols. - $\delta: S \times \Sigma \rightarrow S$ is the transition function that takes a state and a symbol as inputs and returns a new state. - $s_0 \in S$ is the initial state. - $F \subseteq S$ is the set of final or accepting states. In other words, a DFA is a directed graph where the nodes are states and the edges are labeled with input symbols that recognize a pattern. ### DFA transition example Let's use the regular expression `d(a|b)*c` as an example to illustrate the state transitions with the math notations defined above. First, we can represent the DFA as a 5-tuple: $(S, \Sigma, \delta, s_0, F)$, where: - $S = {s_0,s_1,s_2,s_3}$, where $s_0$ is the initial state and $s_3$ is the accept state defined in $F$. - $\Sigma = {d, a, b, c}$ - $\delta$: We can represent this function using a table, where the rows represent the states and the columns represent the input symbols. The entry $s_j$ in row $i$ and column $j$ is the next state $s_{j+1}$ when the DFA is in state $j$ and reads the input symbol at row $i$. Using the $M$ defined above, we can prove that the DFA accepts the string `dabc`. #### step-by-step proof 1. Start in the initial state $s_0$ 2. $\delta(d, s_0)=s_1$ 3. $\delta(a, s_1)=s_2$ 4. $\delta(b, s_1)=s_2$ 5. $\delta(a, s_2)=s_2$ 6. $\delta(b, s_2)=s_2$ 7. $\delta(c, s_2)=s_3$ Since we have reached the final state $s_3$, the DFA accepts the string `dabc`. The following table and graph representations provide another visual perspectives to help understand the concept. #### table representation | δ | $s_0$ | $s_1$ | $s_2$ | | --- | ----- | ----- | ----- | | d | $s_1$ | | | | a | | $s_2$ | | | b | | $s_2$ | | | c | | | $s_3$ | #### graph representation ```mermaid graph LR s0((s0)) -- d --> s1 s1((s1)) -- a --> s2 s1((s1)) -- b --> s2 s2 -- a --> s2 s2 -- b --> s2 s2 -- c --> s3 s3((s3)) ``` ## Transition fulfillment representation In the circom circuit, what we really want to deal with are the data points, such as the number of entire matches, the number of sub-group matches, and the positions of the matches etc. Therefore, we need a convenient representation of these data points in circom circuits. At the core, the relationships between character inputs and state transitions can be used to derive these data points. Let's define these relationships as fulfillments $f: X \times Y \rightarrow {0,1}$, such that - $f=1$ when $X = Y$ - $f=0$ when $X \neq Y$ Wrapping up, we have $f(\delta(char(i), s_{j-1}),s_j)$ to indicate whether or not a state is fulfilled with an input character. Then these indications, as the following table shows, can be used to derive the desired data points introduced above. | index | char | $s_1$ | $s_2$ | $s_3$ | | --- | ---- | ----- | ----- | ----- | | 1 | d | 1 | | | | 2 | a | | 1 | | | 3 | b | | 1 | | | 4 | c | | | 1 | With the data points from the circom execution, they can be used for the constraints at the outer layers. For example, the table reflects the following data points. - index 1 is the start position of the entire match, not because of $index=1$ but induced from $f(\delta(char(1), s_{0}),s_1)=1$ - fulfillments at $s_2$ reflects the substring `bc` match in the index range $[2,3]$. That is the reveal string for the regular expression part `(a|b)*`. - fulfillments at $s_3$ reflects the entire regular expression `d(a|b)*c` is matched, since $s_3$ is the accept state. Note that in a normal regular expression algorithm, without the fulfillments at the accept state $s_3$ in this example, the fulfillments at previous states are not valid. ## Fulfillment representation to circuit Using a circom language data structure `states[i][j]` to record the results of $f(\delta(char(i), s_{j-1}),s_j)$, here is the key section of generated circuit code based on the rules derived from the $\delta$ state transition table: ```js for (var i = 0; i < num_bytes; i++) { /* test s1 fulfillment */ // matching char d eq[0][i] = IsEqual(); eq[0][i].in[0] <== in[i]; eq[0][i].in[1] <== 100; and[0][i] = AND(); and[0][i].a <== states[i][0]; and[0][i].b <== eq[0][i].out; states[i+1][1] <== and[0][i].out; /* test s2 fulfillment */ /** transition from different state **/ // matching char a eq[1][i] = IsEqual(); eq[1][i].in[0] <== in[i]; eq[1][i].in[1] <== 97; // matching char b eq[2][i] = IsEqual(); eq[2][i].in[0] <== in[i]; eq[2][i].in[1] <== 98; and[1][i] = AND(); and[1][i].a <== states[i][1]; multi_or[0][i] = MultiOR(2); multi_or[0][i].in[0] <== eq[1][i].out; multi_or[0][i].in[1] <== eq[2][i].out; and[1][i].b <== multi_or[0][i].out; /** transition from same state **/ // matching char a eq[3][i] = IsEqual(); eq[3][i].in[0] <== in[i]; eq[3][i].in[1] <== 97; // matching char b eq[4][i] = IsEqual(); eq[4][i].in[0] <== in[i]; eq[4][i].in[1] <== 98; and[2][i] = AND(); and[2][i].a <== states[i][2]; multi_or[1][i] = MultiOR(2); multi_or[1][i].in[0] <== eq[3][i].out; multi_or[1][i].in[1] <== eq[4][i].out; and[2][i].b <== multi_or[1][i].out; /** testing both transition paths **/ multi_or[2][i] = MultiOR(2); multi_or[2][i].in[0] <== and[1][i].out; multi_or[2][i].in[1] <== and[2][i].out; states[i+1][2] <== multi_or[2][i].out; /* test s3 fulfillment */ // matching char c eq[5][i] = IsEqual(); eq[5][i].in[0] <== in[i]; eq[5][i].in[1] <== 99; and[3][i] = AND(); and[3][i].a <== states[i][2]; and[3][i].b <== eq[5][i].out; states[i+1][3] <== and[3][i].out; } ``` ## Welcome I hope this post has shed some light on the theory behind zk-regex and provided practical insights into the relationship between theory and the generated circom circuit code. If this has motivated you to dive further into zk-regex, that's great! You can visit [zk-regex](https://github.com/zk-email-verify/zk-regex) to play around with the code and don't hesitate to share any suggestions or questions you might have.