# Constraint efficient neural networks

The recent interest in merging zero-knowledge cryptography and machine learning has led to progress at an impressive clip. In April of last year we had projects generating proofs for the final layer of a [small multi-layer perceptron trained on MNIST](https://0xparc.org/blog/zk-mnist); as of May 2023 we're getting entire transformer-based models into circuits (see [here](https://hackmd.io/mGwARMgvSeq2nGvQWLL2Ww) and [here](https://github.com/ddkang/zkml)).

We're naturally arriving at a point where we should begin exploring architectures that are "easier" to prove. To quantify this ease of proving I want to introduce a metric I've been calling constraint efficiency: given a network with $N$ parameters, how many constraints $M$ are generated within a ZK-circuit? Obviously $M$ depends on a number of factors, in particular:

- The particular gates and constraints used to represent a neural network layer. Some frameworks will trade a larger number of constraints for greater proving efficiency, for instance by way of "accumulated arguments" (see the accumulated dot product argument [here](https://hackmd.io/mGwARMgvSeq2nGvQWLL2Ww) for an example).
- The architecture of the network. Convolutional layers with large strides will generate fewer constraints than their fully-connected counterparts.
- The proof system used. Proof systems with lookup arguments might have fewer constraints than those without.

To keep this enquiry focused we should fix the proof system and the gates used to represent layers. More formally:

> Given a proof system $\mathcal{P}$ and a set of gates $\{g_i\}$, what architectures minimize the ratio $\frac{M}{N}$?

### Some numbers

In our recent work getting [nanoGPT into Halo2-KZG](https://hackmd.io/mGwARMgvSeq2nGvQWLL2Ww), we noticed that transformers seemed to generate disproportionately larger $\frac{M}{N}$ ratios than we had experienced before. For instance, nanoGPT with an embedding size of 64 gives an $\frac{M}{N}$ ratio of $\approx 64$:

- A 1 million parameter model generates 64 million constraints.
- A 250k parameter model generates 16 million constraints.

By contrast, a 4-layer convolutional network (with some ReLU peppered throughout) with 3047 parameters generates 13152 constraints, giving us a ratio $\frac{M}{N}$ of $\approx 4$. Quite the difference! (A short sketch reproducing these ratios appears at the end of this post.)

### Current explorations

- [Zero Gravity](https://hackmd.io/@benjaminwilson/zero-gravity) explores the efficiency of weightless neural networks in generating ZK-SNARKs.
- [Privacy Scaling Explorations](https://mirror.xyz/privacy-scaling-explorations.eth/K88lOS4XegJGzMoav9K5bLuT9Zhn3Hz2KkhB3ITq-m8) has launched an initiative to research ZK-efficient neural network architectures.

### Call to action

We'd love to see some explorations of efficient architectures for SNARKs. If you're keen to discuss this with me / the EZKL community, join us [here](https://t.me/+CoTsgokvJBQ2NWQx).
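
For reference, here is a minimal Python sketch (not part of the measurements themselves, purely illustrative) that reproduces the $\frac{M}{N}$ ratios quoted in the numbers above; the parameter and constraint counts are taken directly from the examples in this post.

```python
# Minimal sketch: compute the constraint-efficiency ratio M/N for the
# models cited in the post. Parameter counts (N) and constraint counts (M)
# are the figures quoted above, not freshly measured.

examples = {
    # name: (parameters N, constraints M)
    "nanoGPT (1M params)": (1_000_000, 64_000_000),
    "nanoGPT (250k params)": (250_000, 16_000_000),
    "4-layer conv net": (3_047, 13_152),
}

for name, (n_params, n_constraints) in examples.items():
    ratio = n_constraints / n_params
    print(f"{name}: M/N ≈ {ratio:.1f}")

# Expected output:
# nanoGPT (1M params): M/N ≈ 64.0
# nanoGPT (250k params): M/N ≈ 64.0
# 4-layer conv net: M/N ≈ 4.3
```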