<style> section { text-align: left; } </style>

## Publicly Verifiable, Private & Collaborative AI Training

#### April 26, 2025 @ ZKTokyo
#### Yuriko / X: @yurikonishijima

---

![Screenshot 2025-04-25 at 21.07.56](https://hackmd.io/_uploads/ByRE3wtJxg.png)

---

![Screenshot 2025-04-25 at 21.26.10](https://hackmd.io/_uploads/SJBrluFkgg.png)

---

## Train To Earn?

---

### What if you didn't have to reveal your data, but could still contribute to training a model?

---

## -> 💡 Federated learning + ZKP

---

### What is Federated Learning?

![Screenshot 2025-04-28 at 08.59.48](https://hackmd.io/_uploads/rkGoH2nygg.png)

---

## Why ZK?

### Client: to prove correct execution of training (+ masking) a local model

### Server: to prove correct execution of aggregating the local model updates

---

Traditionally, verifiability has not been required in federated learning, because entities form a business agreement before collaborating

↓

For **mutually distrusting parties in a decentralized network** to collaborate on training, it is necessary

---

<img src="https://hackmd.io/_uploads/rJenk9tyeg.png" width="850">

---

#### 1. Local training on the client side

- multi-class classification task
- [Iris dataset](https://archive.ics.uci.edu/dataset/53/iris) 🌸
- training algorithm: [logistic regression circuit](https://github.com/hashcloak/noir-mpc-ml/blob/master/src/ml.nr) (imported the one built by HashCloak for their [co-noir-ml](https://github.com/hashcloak/noir-mpc-ml-report/tree/main) project)
- clients train on their raw data

---

![Screenshot 2025-04-25 at 23.40.45](https://hackmd.io/_uploads/H1GTy5Kyxg.png)

---

<img src="https://hackmd.io/_uploads/rJenk9tyeg.png" width="850">

---

### Benchmark comparison

<div style="display: flex; align-items: baseline;">
  <h4 style="margin: 0;">co-noir ML</h4>
  <small style="margin-left: 10px;">
    <a href="https://github.com/hashcloak/noir-mpc-ml?tab=readme-ov-file#training-using-co-noir">implementation</a>
  </small>
</div>
<div style="margin-top: -20px; margin-bottom: 40px;">
  <img src="https://hackmd.io/_uploads/S156ycF1xe.png" width="650">
</div>

<div style="display: flex; align-items: baseline;">
  <h4 style="margin: 0;">Verifiable Federated Learning</h4>
  <small style="margin-left: 10px;">
    <a href="https://github.com/yuriko627/vfl-demo">implementation</a>
  </small>
</div>
<div style="margin-top: -20px;">
  <img src="https://hackmd.io/_uploads/HykWlqt1ee.png" width="350">
</div>

---

#### 2. Masking the model

- This is the only cryptographic part! (apart from the ZK proofs)
- Why mask a model? To prevent input recovery attacks:
    - Gradient Inversion Attack, Membership Inference Attack, Property Inference Attack, etc.
- In production FL, Differential Privacy is used instead (because it's more efficient)

---

### So, how can we mask the models in such a way that the server can compute an **aggregate of the raw models** without learning any individual values?

...

We need additive homomorphism... MPC? FHE? 🥺

---

### No, just a one-time pad 😳

I guess you can call it a type of MPC, but there is no decryption at the end, because the masks naturally cancel each other out ✨

---

### How does that work?

---

![Screenshot 2025-04-22 at 18.41.03](https://hackmd.io/_uploads/SJqa48Skgg.png)

---

![Screenshot 2025-04-19 at 15.19.12](https://hackmd.io/_uploads/H13dFrrJxg.png)

---

For example,

- client 1: masked model $M_1$ = raw model $R_1$ + $m_{1,2}$ - $m_{3,1}$
- client 2: masked model $M_2$ = raw model $R_2$ + $m_{2,3}$ - $m_{1,2}$
- client 3: masked model $M_3$ = raw model $R_3$ + $m_{3,1}$ - $m_{2,3}$

Then, when the server sums up the masked models $M_n$:

$M_1 + M_2 + M_3$
$= (R_1 + m_{1,2} - m_{3,1}) + (R_2 + m_{2,3} - m_{1,2}) + (R_3 + m_{3,1} - m_{2,3})$
$= R_1 + R_2 + R_3$

---

Privacy of the raw models $R_n$: each client can only compute masks with its own neighbors. For example,

- client 1 does not know $m_{2,3}$ => cannot reconstruct either $R_2$ or $R_3$
- client 2 does not know $m_{3,1}$ => cannot reconstruct either $R_1$ or $R_3$
- client 3 does not know $m_{1,2}$ => cannot reconstruct either $R_1$ or $R_2$
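---

#### Mask cancellation in code

Here's a minimal numeric sketch of the cancellation above (plain Python with toy values; the modulus `P` and the concrete numbers are illustrative stand-ins, not the actual circuit or the BN254 field):

```python
# Toy sketch: three clients in a ring, each adds the mask shared with its
# right neighbor and subtracts the mask shared with its left neighbor.
P = 2**61 - 1  # toy prime modulus (stand-in for the BN254 field)

raw = {1: 42, 2: 7, 3: 13}  # R_n: toy "models" (single weights for simplicity)
m = {(1, 2): 1111, (2, 3): 2222, (3, 1): 3333}  # m_{i,j}: pairwise shared masks

masked = {
    1: (raw[1] + m[(1, 2)] - m[(3, 1)]) % P,  # M_1 = R_1 + m_{1,2} - m_{3,1}
    2: (raw[2] + m[(2, 3)] - m[(1, 2)]) % P,  # M_2 = R_2 + m_{2,3} - m_{1,2}
    3: (raw[3] + m[(3, 1)] - m[(2, 3)]) % P,  # M_3 = R_3 + m_{3,1} - m_{2,3}
}

# The server only ever sees masked values, yet their sum equals the raw sum:
assert sum(masked.values()) % P == sum(raw.values()) % P  # 62 = 42 + 7 + 13
```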
---

<img src="https://hackmd.io/_uploads/Sk9PT4Wkeg.png" height="500">

#### I used this [ECDH library](https://github.com/privacy-scaling-explorations/zk-kit.noir/tree/main/packages/ecdh) from the `zk-kit.noir` library set developed by PSE

---

<div>

```mermaid
sequenceDiagram
    participant Client_n
    participant Blockchain
    participant Server
    Client_n-->>Client_n: Train local model R_n,<br>Generate training proof π_train_n
    Client_n->>Blockchain: Submit (π_train_n + public key pk_n)
    Blockchain-->>Blockchain: if π_train_n verified, then pk_n registered
    Client_n->>Blockchain: Fetch pk_{n+1} (right neighbor) and pk_{n-1} (left neighbor)
    Client_n-->>Client_n: Locally compute shared masks m_right_n=sk_n*pk_{n+1}, m_left_n=sk_n*pk_{n-1},<br>Mask the model: R_n + m_right_n - m_left_n,<br>Generate masking proof π_mask_n
    Client_n->>Blockchain: Submit masked model M_n + proof π_mask_n
    Blockchain-->>Blockchain: if π_mask_n verified, then M_n registered
    Server->>Blockchain: Fetch masked models M_n for all n
    Server-->>Server: Aggregate local models,<br>Generate aggregation proof π_agg
    Server->>Blockchain: Submit global model M_g + proof π_agg
    Blockchain-->>Blockchain: if π_agg verified, then M_g registered
    Client_n->>Blockchain: Fetch global model M_g
```

</div>
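---

#### How neighbors agree on a mask

A sketch of the pairwise mask derivation under stand-in assumptions: the demo uses ECDH over an elliptic curve via `zk-kit.noir`, but the same Diffie-Hellman idea can be shown with a toy multiplicative group. The group parameters and the hash-to-mask step below are illustrative, not the library's API:

```python
import hashlib
import secrets

P = 2**127 - 1  # toy group modulus (a Mersenne prime); NOT the real curve
G = 3           # toy generator

def keygen() -> tuple[int, int]:
    sk = secrets.randbelow(P - 2) + 1
    return sk, pow(G, sk, P)

def shared_mask(sk_self: int, pk_other: int) -> int:
    # Both neighbors arrive at the same element G^(sk_i * sk_j),
    # then hash it down to a field-sized mask m_{i,j}.
    shared_point = pow(pk_other, sk_self, P)
    return int.from_bytes(hashlib.sha256(str(shared_point).encode()).digest(), "big") % P

sk1, pk1 = keygen()  # client 1
sk2, pk2 = keygen()  # client 2

# client 1 and client 2 derive the same mask m_{1,2} without ever sending it:
assert shared_mask(sk1, pk2) == shared_mask(sk2, pk1)
# one neighbor adds m_{1,2} to its model and the other subtracts it,
# so the masks cancel in the server's sum with no decryption step
```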
---

### Fixed-point arithmetic range check inside the Noir circuit

---

### ML: decimal numbers <-> ZK: BN254 field

- first 126 bits: positive numbers
- middle 2 bits: unused
- last 126 bits: negative numbers

Let's look at the [example dataset](https://github.com/yuriko627/vfl-demo/blob/main/clients/client1/training/Prover.toml) for client1

---

### Safe addition

You don't want to overflow!

For a + b = `c`, you want: bitsize(`c`) <= 126

=> constraint: bitsize(a) <= 125 && bitsize(b) <= 125

---

### Safe multiplication

You don't want to overflow!

For a * b = `c`, you want: bitsize(`c`) <= 126

=> constraint: bitsize(a) + bitsize(b) <= 125
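---

#### Encoding sketch

A plain-Python sketch of this signed fixed-point encoding: negatives wrap to the top of the field, and a range check mirrors the circuit-side safe-addition constraint. The scaling factor `SCALE` is an assumption for illustration, not the value the demo actually uses:

```python
# BN254 scalar field modulus
BN254 = 21888242871839275222246405745257275088548364400416034343698204186575808495617
SCALE = 2**16  # assumed fixed-point scaling factor (illustrative)

def encode(x: float) -> int:
    v = round(x * SCALE)
    assert abs(v) < 2**126   # must fit in 126 bits either way
    return v % BN254         # negative v wraps to p - |v| (the "last 126 bits")

def decode(f: int) -> float:
    v = f - BN254 if f > BN254 // 2 else f  # top half of the field is negative
    return v / SCALE

def assert_bitsize(f: int, n: int) -> None:
    # plain-Python mirror of a circuit-side range check before an addition
    v = f - BN254 if f > BN254 // 2 else f
    assert abs(v) < 2**n, f"value exceeds {n} bits"

a, b = encode(1.5), encode(-2.25)
assert_bitsize(a, 125)  # safe-addition precondition
assert_bitsize(b, 125)
print(decode((a + b) % BN254))  # -0.75
```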
{"title":"[ZKTokyo Slide] Publicly Verifiable, Private & Collaborative AI Training","description":"Screenshot 2025-04-25 at 21.07.56","contributors":"[{\"id\":\"d65f771a-92d7-493e-807c-a52c97138c1e\",\"add\":11932,\"del\":4356}]"}
---

### Future Research Directions

---

### 1. Training dataset validation

![Screenshot 2025-04-23 at 12.23.56](https://hackmd.io/_uploads/SJUkCBLJxg.png)

<small style="margin-top: -20px">[Reference](https://www.youtube.com/watch?v=mdMpQMe5_KQ)</small>

---

### 2. Replacing local training with fine-tuning

---

### 3. Dropout tolerance

![Screenshot 2025-04-26 at 01.49.28](https://hackmd.io/_uploads/rJVgRjtkgg.png)

This [paper](https://arxiv.org/pdf/2205.06117) shows that if each client communicates with $O(\log n)$ other clients (where $n$ is the total number of clients), the server can tolerate client dropouts during aggregation

---

### 4. Rewarding system

![Screenshot 2025-04-26 at 01.46.01](https://hackmd.io/_uploads/ByFA2jF1lx.png)

<small style="margin-top: -20px">[Reference](https://www.vana.org/posts/model-influence-functions-measuring-data-quality)</small>

... or the key question is "who taught the AI about it first?"

=> Time-ranking based compensation might be appropriate