In this note, I discuss the initial prototype implementation and benchmark results, give informal conclusions and suggest next steps for this project. Prototype codebase and benchmarks are available at [ChainSafe/lumenos](https://github.com/ChainSafe/lumenos) ## Code walkthrough ### Stack - FHE - [Lattigo v6](https://github.com/tuneinsight/lattigo) - Delegated proof - [Ligero](https://eprint.iacr.org/2022/1608) - Proof of Decryption - [Blind zkSNARKs: Section 7](https://eprint.iacr.org/2024/1684#page=36) via [Lazer](https://github.com/lazer-crypto/lazer) ### Server - Ligero [commit](https://github.com/ChainSafe/lumenos/blob/bddfe3dd0708f02bc2f7b5f84ca0326f4c11728d/fhe/ligero.go#L95) routine consists of [homomorphic NTT](https://github.com/ChainSafe/lumenos/blob/bddfe3dd0708f02bc2f7b5f84ca0326f4c11728d/fhe/ntt.go#L12) and Merkalization. - NTT here is a tiled algorithm where the first 2,4,8 levels are unwinded butterfly blocks and larger arbitrary size blocks are evaluated according [Daisuke Takahashi's "Six-Step NTT Algorithm"](https://www.ccs.tsukuba.ac.jp/wp-content/uploads/sites/14/2022kisti-ws-takahashi.pdf#page=11) (2D-NTT) - Ciphertexts are batched column-wise so that only one NTT instance is evaluated over rows. - Ligero [prove](https://github.com/ChainSafe/lumenos/blob/bddfe3dd0708f02bc2f7b5f84ca0326f4c11728d/fhe/ligero.go#L194) routine consists of homomorphic dot products $ct[\langle r_i,M_{i,j}\rangle]$ (also $ct[\langle b_i,M_{i,j}\rangle]$ for eval phase) and generating Merkle proofs for queried columns $ct[\hat{M}_{i,j}] i \in \lambda$. - Row slots in batched column ciphertexts are ct-pt multiplied by batched challenge column vector, then summed via [InnerSum](https://github.com/tuneinsight/lattigo/blob/84f6bc33cb5bd086f595ea3cc2b63f4dd74b2662/schemes/bgv/evaluator.go#L1527) using Galois automorphism (slot rotation). This operation puts the inner sum result into the first slot, other garbage slots can be masked or discarded via ring switch (see [Gama et al. section 7.4](https://eprint.iacr.org/2024/1684.pdf#page=41)). - Before sending to the client, ciphertexts with dot products and queries columns are mod switched to lower Q modulus via [Rescale](https://github.com/tuneinsight/lattigo/blob/84f6bc33cb5bd086f595ea3cc2b63f4dd74b2662/schemes/bgv/evaluator.go#L1415); this reduces communication and decryption overhead for client, but limits further FHE operations for these “rescaled” ciphertexts. - If the client requested [ring switch](https://github.com/ChainSafe/lumenos/blob/bddfe3dd0708f02bc2f7b5f84ca0326f4c11728d/fhe/ring_switch.go#L8-L9) to a specified LogN, the server also applies ring switch evaluation key to discard garbage slots. - Note: ring switch is currently incomplete on the server side, because it corrupts data on batched ciphertexts, and the missing [SlotsToCoeffs](https://github.com/tuneinsight/lattigo/blob/84f6bc33cb5bd086f595ea3cc2b63f4dd74b2662/circuits/ckks/dft/dft.go#L306) routine to "un-batch" ciphertexts. I therefore refer to it as an experimental feature. ### Client (designated verifier) - After receiving encrypted proof, the client [decrypts](https://github.com/ChainSafe/lumenos/blob/bddfe3dd0708f02bc2f7b5f84ca0326f4c11728d/fhe/ligero.go#L381) it with limited concurrency. - For [verifiable decryption](https://github.com/ChainSafe/lumenos/blob/bddfe3dd0708f02bc2f7b5f84ca0326f4c11728d/vdec/prover.go#L101) (Proof-of-Decryption), I used Cosic’s [reference implementation](https://github.com/KULeuven-COSIC/blind_zkSNARKs/tree/main/proof-of-decryption) via C to Go bindings. There are two versions: [BFV](https://github.com/ChainSafe/lumenos/blob/bddfe3dd0708f02bc2f7b5f84ca0326f4c11728d/vdec/c/src/vdec.c#L139) (slower and less optimized) and [GBFV](https://github.com/ChainSafe/lumenos/blob/bddfe3dd0708f02bc2f7b5f84ca0326f4c11728d/vdec/c/src/vdec_gbfv.c#L142) (faster [rotation routine](https://github.com/ChainSafe/lumenos/blob/95650a007ace2f7733f82e9432c9295e1fee8eb4/vdec/c/src/vdec_gbfv.c#L1055)). - PoD cost is amortized via [ciphertext batching](https://eprint.iacr.org/2024/1684.pdf#page=39) on the client side, as it must connect with query Merkle proofs. Batched ciphertext is then rescaled to a single 60 bit modulus and used as witness in PoD prover. - PoD prover uses a larger [68bit modQ](https://github.com/ChainSafe/lumenos/blob/bddfe3dd0708f02bc2f7b5f84ca0326f4c11728d/vdec/c/src/vdec_params.h#L9-L10), smaller ~62-66 bit modulus should likely be okay. ### Public verifier - Normal Ligero [verifier](https://github.com/ChainSafe/lumenos/blob/bddfe3dd0708f02bc2f7b5f84ca0326f4c11728d/fhe/ligero.go#L517); PoD proof verification isn’t implemented on Go side yet, currently PoD prover just returns result (1/0) of verification. ## **Conclusions** - Server FHE cost is **bottlenecked by column random dot products** (ct-pt mul & inner sum for each column). - Encrypted proof size is also dominated by encrypted dot products, but for each dot product ciphertext only the first slot contains useful information, while the rest are garbage slots that can be masked and discarded via ring switch. - Since ciphertexts are batched column-wise, the current implementation prefers tall matrices to wider ones. - The alternative construction—batch row-wise and perform NTT evaluation via matrix transformation (akin to [ckks\_dft](https://github.com/tuneinsight/lattigo/blob/84f6bc33cb5bd086f595ea3cc2b63f4dd74b2662/circuits/ckks/dft/dft.go#L2) circuit)—could potentially reduce inner sum overhead as it won't need Galois automorphism for inner sum (instead needed for NTT). - **Client decryption** is also bottlenecked by inner products. - In the experimental benchmarks, the server applies a ring switch to discard some garbage slots (as described in the Blind SNARK [paper p.41](https://eprint.iacr.org/2024/1684.pdf#page=41)) - This dramatically lowers the decryption cost (-165x) and encrypted proof size (-32x) - **Verifiable decrypt (Proof-of-Decryption)** is generally the most expensive part for the client. - For the most part, it's fixed cost thanks to ciphertext batching, scaling linearly with LogN. Additional measurements are needed for higher query counts e.g. [BCI+20.](https://github.com/ChainSafe/lumenos/blob/75a599bf06d8c1e6f3faa5f4fcdf66ba6b5da533/fhe/ligero.go#L73C6-L73C27) - Two PoD versions yield different client overhead: BFV ~22s, GBFV ~3s. Preliminary, this is due to a faster [rotation routine](https://github.com/ChainSafe/lumenos/blob/95650a007ace2f7733f82e9432c9295e1fee8eb4/vdec/c/src/vdec_gbfv.c#L1055) and overall more optimized code. Further investigation is needed to better understand this difference. - Compared to local proof generation results, FHE-SNARK shows cautious promise: - For larger configurations (8192x4096, 16384x4096), it results in **190-1000x less overhead**. - *With experimental features (ring switch + GBFV PoD)*, this difference is more pronounced: it takes **2,000x less time to decrypt and generate PoD than generate Ligero proof locally**. - Communication cost and RAM usage are the main reasons for caution: - Without ring-switch, encrypted proofs are growing to multiple GiBs (2.2 GB for 8192x4096 config). Note: with ring-switch (to LogN=10), it's down to roughly 200MB. - **RAM usage is high** (5-6GB peak for 8192x4096 config). *Most of it comes from poor memory management in PoD prover* (w/o PoD RAM peaks at 1.8GB for 8192x4096 with ring switch). - More things to note: - Lattigo currently supports a max plaintext modulus of 60bit, but secure proof generation requires larger fields for soundness. This can be done by adding GBFV support to work over Goldilocks prime. Larger extension fields would have to be simulated in FHE circuit. - Current BFV params choice is likely not optimal or not fully secure. - The code for PPD-Ligero and Local generation is far from optimized. ## Next steps - BFV SlotsToCoeffs for correct ring switch - Garbage slots in dot products cause the most overhead; experimental benchmarks prove ring switch effectiveness. - Reference implementation: [FeanorTheElf/galois-bootstrapping](https://github.com/FeanorTheElf/galois-bootstrapping/blob/main/seal/seal/transform.cpp) based on [<u>eprint/2023/1304</u>](https://eprint.iacr.org/2023/1304.pdf) - GBFV support for Lattigo v6 - GBFV version of PoD is more optimal. - should also unlock larger plaintext moduli, theoretically because other Lattigo components might not support that. - Reference implementation: [KULeuven-COSIC/Bootstrapping\_BGV\_BFV](https://github.com/KULeuven-COSIC/Bootstrapping_BGV_BFV/commit/438e728ce0c3d5b57db2ecc84e36a409a1bb2a48) - Further PoD analysis; overhaul / rewrite for optimization and wider hardware support. - Cosic’s reference implementation has many memory management issues; fixing these would show clearer picture of minimal client-side resources. - It’s limited to newer Intel CPUs supporting AVX512 and AES instruction sets — Adding ARM64 intrinsics would allow testing on mobile devices via Android/iOS wrappers. - Alternatively, a full Rust rewrite: there are many new prominent lattice-based ZK libs e.g. [larkworks](https://github.com/zhenfeizhang/larkworks), [Lazarus](https://github.com/lattice-complete/Lazarus), [lattirust](https://github.com/lattirust/lattirust). However, C/C++ implementation might be preferred for mobile device support due to better support in Kotlin/Swift tooling.