Notes from Turing Synthetic Data workshop

---
tags: QUIPP
---

# Notes from Turing Synthetic Data workshop

05 Nov 2020

## Differentially Private Mixture of Generative Neural Networks

Emiliano De Cristofaro

### The membership inference problem

- Adversary wants to know whether a target's data has been used to train a model. Not just an issue for generative models: Shokri et al., S&P 2017 show this for discriminative models.
- Related to overfitting.
- White box attack: all model parameters are known to the attacker.
- Black box attack: the attacker only has API access to generate examples and has to infer discriminator properties.
- Emiliano and colleagues experimented with several image datasets.
    - White box accuracy very high (up to ~100%?)
    - Black box accuracy between 0.4 and 0.8
    - Giving the attacker partial knowledge of the training data (5-10%) -> almost the same accuracy as the white box approach.
- Clustering: based on the composition theorem in differential privacy. Approach: cluster and add noise to each of these clusters.
- Need techniques heavily tailored to datasets and to tasks.

Q/A:

- The problem of privacy versus utility was particularly pronounced for k-anonymity and l-diversity. Do you think that generative modeling solves this duality?
- In DP, do you add noise to the parameters or the original dataset?
    - Add noise to the parameters of the generative model.
- The failure of k-anonymity came from the fact that it was difficult to capture the datasets that the adversary has access to. In differential privacy, we model the worst-case scenario.

#### Differential privacy as a solution

Links:

- Clustering work: https://arxiv.org/pdf/1709.04514.pdf
- Turing report: https://emilianodc.com/PAPERS/PPGM-report.pdf (part of https://www.turing.ac.uk/research/research-projects/evaluating-privacy-preserving-generative-models-wild)

## Conditional Sig-Wasserstein GANs

Hao Ni - UCL

- Reference: https://arxiv.org/abs/2006.05421
- ESig, Signatory, and signature are three Python libraries for signature computation.
- Worth comparing this with CTGAN and the other libraries mentioned in the talk.

Q/A:

- Link to CG-MMN: https://papers.nips.cc/paper/2016/file/0245952ecff55018e2a459517fdb40e3-Paper.pdf

## QUIPP

Greg

### Q/A

Chris Williams:

- In the CTGAN paper, TVAE does even better than CTGAN. From the paper:
    - "TVAE outperforms CTGAN in several cases, but GANs do have several favorable attributes, and this does not indicate that we should always use VAEs rather than GANs to model tables. The generator in GANs does not have access to real data during the entire training process; thus, we can make CTGAN achieve differential privacy [14] easier than TVAE."
- Derek commented on TVAE and implementing differential privacy; apparently the results degraded quite significantly.
- What is disclosure risk?

Maria Liakata:

- How should we be engaging with this project if we are developing methods for synthetic data generation in our Turing projects?
- Great! We have private data involving language and heterogeneous user-generated content over time. We will get in touch to arrange a meeting as we are embarking on the synthetic data generation.

Blanka:

- What kind of data would be most useful for you for this?

## A rough perspective on Market Generators

Blanka NH

- Motivated by the work: https://arxiv.org/abs/1802.03042
- Main paper: https://papers.ssrn.com/sol3/papers.cfm?abstract_id=3632431
- Github repo for the paper: https://github.com/imanolperez/market_simulator
- Possibly a good combination of a VAE method and an interesting dataset to try? Plus she said she is interested in QUIPP.
## FCA Data Sprint: Synthetic Network Generation

Andrew Elliot

- Synthetic data for Digital sandbox (https://www.digitalsandboxpilot.co.uk), different data types including transaction network data, lending and entity data.
- FCA data sprint
- Deloitte had access to the real dataset:
    - TriBank data
    - Based on SME data
- Iterative way of working where Deloitte provided info about deviations and these were used by DSs to improve utility metrics.
- 20+ graph evaluation metrics to evaluate FCA data quality
    - Simple account statistics (e.g., in-degree, ...), structural measures (e.g., strongly/weakly connected components, triadic census, path structures ...)
- We don't necessarily need a match on everything (all the metrics); it really depends on what the data is going to be used for.

## Private data collection via local Diff. Privacy

Graham Cormode

Intro (general comments on DP):

- DP by design can lead to low utility (note: we have seen this mentioned in the literature)
- Different models are needed for different types of data

## DeepSynth: ML guided programme synthesis

Nathanael Fijalkow - ATI

- Reference: Data Generation for Neural Programming by Example
- Examples: FlashFill (Gulwani et al, 2011), TF-coder, DeepCoder paper

## Moderated session

Polls:

- https://presenter.ahaslides.com/share/1604530945954-tzx5klhlua?presenting=true
- Most people are researching (generating/evaluating synthetic datasets) or want to use synthetic datasets.
- Most people wrote their own code in different languages (Python, Matlab, ...)
- Popular generating models: GAN, VAEs, conditional models / imputation
- What privacy metric are you interested in or using? DP/LDP, reidentification
- What utility metrics are useful for the type of data you work with? No answers
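
Follow-up sketch for the FCA Data Sprint notes: the talk listed simple account statistics (in-degree) and structural measures (weakly connected components) among the graph evaluation metrics. A minimal way to compare a real and a synthetic transaction network on two of these might look like the code below. This is not the sprint's implementation. The function names (`in_degrees`, `weakly_connected_components`, `summarise`, `compare`) and the toy edge lists are hypothetical, only the Python standard library is used, and the edge lists are assumed to be non-empty. Triadic census and path structures are not covered.

```python
from collections import Counter, defaultdict


def in_degrees(edges):
    """Count incoming edges per node for a directed edge list."""
    counts = Counter()
    for src, dst in edges:
        counts[dst] += 1
        counts[src] += 0  # register source-only nodes with in-degree 0
    return dict(counts)


def weakly_connected_components(edges):
    """Group nodes into components, ignoring edge direction."""
    neighbours = defaultdict(set)
    for src, dst in edges:
        neighbours[src].add(dst)
        neighbours[dst].add(src)
    seen, components = set(), []
    for start in neighbours:
        if start in seen:
            continue
        stack, component = [start], set()
        while stack:  # iterative depth-first search
            node = stack.pop()
            if node in component:
                continue
            component.add(node)
            stack.extend(neighbours[node] - component)
        seen |= component
        components.append(component)
    return components


def summarise(edges):
    """Collect a few summary metrics for one (non-empty) edge list."""
    degs = in_degrees(edges)
    comps = weakly_connected_components(edges)
    return {
        "n_nodes": len(degs),
        "n_edges": len(edges),
        "max_in_degree": max(degs.values()),
        "mean_in_degree": sum(degs.values()) / len(degs),
        "n_weak_components": len(comps),
        "largest_component": max(len(c) for c in comps),
    }


def compare(real_edges, synthetic_edges):
    """Per-metric difference (synthetic minus real)."""
    real, synth = summarise(real_edges), summarise(synthetic_edges)
    return {name: synth[name] - real[name] for name in real}


# Toy, made-up transaction networks: (payer, payee) pairs.
real = [("A", "B"), ("B", "C"), ("C", "A"), ("D", "E")]
synthetic = [("A", "B"), ("B", "C"), ("D", "E"), ("E", "F")]
print(compare(real, synthetic))
```

As noted in the talk, a match on every metric is not necessarily required. Which differences matter depends on what the synthetic data will be used for.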
