# Distributed Private Machine Learning Survey
A research repository to benchmark various distributed private machine learning solutions for fraud detection in banking.
## Problem Statement
### Key Challenge: Tracking Money Mules Across Banks
Money mule accounts are frequently used in money laundering operations that move funds between Thai banks. These accounts serve as intermediaries to receive and forward fraudulent funds, making them difficult to trace.
While banks can identify and block mule accounts within their own systems, the challenge arises when funds move across different banks:
1. The National Interbank Transaction Management and Exchange (ITMX) system facilitates interbank transfers but lacks visibility into personally identifiable information (PII).
2. Transaction data is forwarded without PII, making it difficult to identify and block suspicious transfers in real time.
3. Once fraudulent funds reach a receiving bank, they are quickly transferred to other accounts or withdrawn as cash.
4. By the time victims report the fraud, the money trail is lost and funds are unrecoverable.
### Privacy and Performance Requirements
Any solution must address these critical constraints:
1. **Data Privacy:** Banks cannot share customer data or PII with each other
2. **Inference Privacy:** No data leakage should occur during the fraud detection process
3. **Performance:** The system must operate efficiently enough to detect and prevent fraudulent transfers in real time
This repository explores distributed private machine learning approaches to enable cross-bank fraud detection without compromising customer privacy.

## Possible Solutions
### SecureBoost
[paper link](https://arxiv.org/abs/1901.08755)

- Privacy-preserving vertical federated learning for XGBoost-style gradient-boosted trees
- Uses Paillier homomorphic encryption, which is additively homomorphic: ciphertexts can be added together, and multiplied by plaintext scalars
- The scheme works because split-gain computation only needs sums of per-sample gradients (first-order derivatives) and Hessians (second-order derivatives), which stays within Paillier's supported operations; see the sketch after this list
- The protocol is chatty (many encrypted aggregates are exchanged per candidate split) and computationally expensive
- Requires each passive party (which holds features but not the labels) to store a lookup table of the split thresholds used on its features during XGBoost calculation, which makes the scheme hard for banks to adopt
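A minimal sketch of the core trick, using the `phe` (python-paillier) library; the gradient/Hessian values and the split bucket below are made up for illustration:

```python
# Minimal sketch of SecureBoost's core trick using the `phe`
# (python-paillier) library. The numeric values and the split bucket
# are illustrative placeholders.
from functools import reduce

from phe import paillier

public_key, private_key = paillier.generate_paillier_keypair()

# Active party (holds the labels): encrypt each sample's gradient and Hessian.
gradients = [0.31, -0.12, 0.05, -0.44]
hessians = [0.21, 0.11, 0.05, 0.25]
enc_g = [public_key.encrypt(g) for g in gradients]
enc_h = [public_key.encrypt(h) for h in hessians]

# Passive party (holds features, no labels): for a candidate split, sum the
# encrypted gradients/Hessians of samples in the left bucket. Ciphertext
# addition is all that is needed -- exactly what Paillier provides.
left_bucket = [0, 2]  # indices of samples below the candidate threshold
enc_G_left = reduce(lambda a, b: a + b, (enc_g[i] for i in left_bucket))
enc_H_left = reduce(lambda a, b: a + b, (enc_h[i] for i in left_bucket))

# Active party decrypts only the aggregates (never per-sample values)
# and computes the split gain from them.
print(private_key.decrypt(enc_G_left))  # ~0.36
print(private_key.decrypt(enc_H_left))  # ~0.26
```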
### Differential Privacy using Flower & Diffprivlib + EZKL Co-SNARKs
This approach is split into two phases:
**Training**: Federated Learning w/ Differential Privacy.
We use the bounded Laplace mechanism for DP ([more details](https://arxiv.org/abs/1808.10410)).
Each bank client trains the shared public model architecture on its own data and sends differentially private weight updates to ITMX.
These updates leak noised, aggregate information about each bank's data distribution, but not the PII of individual customers.
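A minimal sketch of the training-side DP step, assuming Flower (`flwr`) for federation and diffprivlib's `LaplaceBoundedDomain` mechanism; `local_train`, `CLIP`, and `EPSILON` are hypothetical placeholders:

```python
# Sketch only: clip each weight delta, noise it with diffprivlib's bounded
# Laplace mechanism, and return the noised update via a Flower client.
# `local_train`, CLIP, and EPSILON are hypothetical stand-ins.
import numpy as np
import flwr as fl
from diffprivlib.mechanisms import LaplaceBoundedDomain

CLIP, EPSILON = 0.1, 1.0

def noise_delta(delta: np.ndarray) -> np.ndarray:
    mech = LaplaceBoundedDomain(
        epsilon=EPSILON,
        delta=0.0,
        sensitivity=2 * CLIP,  # per-coordinate sensitivity after clipping
        lower=-CLIP,
        upper=CLIP,
    )
    clipped = np.clip(delta, -CLIP, CLIP)
    return np.vectorize(mech.randomise)(clipped)

class BankClient(fl.client.NumPyClient):
    def __init__(self, local_train):
        self.local_train = local_train  # hypothetical local training routine

    def fit(self, parameters, config):
        new_weights, num_examples = self.local_train(parameters)
        # Only noised deltas ever leave the bank, never raw fitted weights.
        noised = [
            old + noise_delta(new - old)
            for old, new in zip(parameters, new_weights)
        ]
        return noised, num_examples, {}
```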
**Production Inference**: EZKL w/ Co-SNARKs (WIP)
Because the sending and receiving banks would otherwise have to send PII to ITMX for inference, we need a way to keep that PII private.
ITMX uses EZKL to compile the trained model into a zero-knowledge-provable circuit.
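As a rough sketch, the standard single-prover EZKL pipeline for this step looks like the following. The function names come from ezkl's Python bindings, but exact argument lists vary between releases, so treat the signatures as assumptions; the co-SNARK variant described below replaces the single prover with an MPC session:

```python
# Sketch of EZKL's standard single-prover flow. These functions exist in
# ezkl's Python bindings, but signatures differ across releases -- check
# the ezkl docs for your version. All file paths are placeholders.
import asyncio
import ezkl

async def build_circuit():
    # Derive and calibrate circuit settings from the exported ONNX model.
    ezkl.gen_settings("model.onnx", "settings.json")
    await ezkl.calibrate_settings("input.json", "model.onnx", "settings.json", "resources")
    # Compile to a circuit and run the proving/verifying key setup.
    ezkl.compile_circuit("model.onnx", "model.compiled", "settings.json")
    await ezkl.get_srs("settings.json")
    ezkl.setup("model.compiled", "vk.key", "pk.key")

async def prove_one_inference():
    # Prove a single inference and check the proof.
    await ezkl.gen_witness("input.json", "model.compiled", "witness.json")
    ezkl.prove("witness.json", "model.compiled", "pk.key", "proof.json")
    assert ezkl.verify("proof.json", "settings.json", "vk.key")

# e.g. asyncio.run(build_circuit()) once, then prove per transaction
```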
When an interbank transfer happens, a co-SNARK session is set up between the sending bank (provides data), the receiving bank (provides data), and ITMX (provides the model).
The banks keep their PII private, and the ITMX keeps the global model weights private.
At the end of the co-SNARK session, ITMX accepts or rejects the transaction based on the inference result.
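Under the hood, co-SNARKs run the prover as a multi-party computation over secret-shared inputs, which is what lets each party keep its values private. A toy illustration of additive secret sharing (the concept only, not EZKL's implementation; the field modulus and values are made up):

```python
# Toy additive secret sharing over a prime field: each individual share is
# uniformly random, so no single party learns anything about the input.
# Illustrative only -- not EZKL's co-SNARK machinery.
import secrets

PRIME = 2**61 - 1  # toy field modulus

def share(value: int, n_parties: int = 3) -> list[int]:
    shares = [secrets.randbelow(PRIME) for _ in range(n_parties - 1)]
    shares.append((value - sum(shares)) % PRIME)
    return shares

def reconstruct(shares: list[int]) -> int:
    return sum(shares) % PRIME

amount = 125_000  # a private model input, e.g. the transfer amount
s_sender, s_receiver, s_itmx = share(amount)  # one share per party
assert reconstruct([s_sender, s_receiver, s_itmx]) == amount
```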
There are some caveats to the co-SNARK approach on EZKL as of 30 Apr 2025: non-determinism in the setup makes secret-sharing values challenging, so the technique cannot yet be applied to all models.
### Eguard
[paper link](https://arxiv.org/html/2411.05034v1)

- Run Eguard on the input embeddings before they leave the bank
- Eguard passes the input through a projection network that obfuscates the data while preserving task utility (a sketch follows this list)
- Send the relatively private embeddings to the ITMX server
- Not as private as cryptographic solutions: in the paper's evaluation on text tokenizers, roughly 5-6% of tokens can still be recovered
- PII may have lower entropy than ordinary text and could therefore leak at a higher rate
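A hypothetical sketch of the projection idea (the layer sizes and architecture here are made up; Eguard's actual design and training objective are in the paper):

```python
# Illustrative only: the ProjectionNetwork below is a stand-in, not Eguard's
# real architecture. It shows the data flow: embeddings are obfuscated
# inside the bank, and only the projected vectors are sent to ITMX.
import torch
import torch.nn as nn

class ProjectionNetwork(nn.Module):
    """Maps private embeddings to an obfuscated space of the same shape."""

    def __init__(self, dim: int = 768, hidden: int = 1024):
        super().__init__()
        self.proj = nn.Sequential(
            nn.Linear(dim, hidden),
            nn.ReLU(),
            nn.Linear(hidden, dim),
        )

    def forward(self, emb: torch.Tensor) -> torch.Tensor:
        return self.proj(emb)

projector = ProjectionNetwork()
private_emb = torch.randn(16, 768)    # stand-in for embeddings of PII-bearing text
obfuscated = projector(private_emb)   # only these vectors leave the bank
```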