# How Faithful is Your Synthesized Data? Sample-level Metrics for Evaluating and Auditing Generative Models

[Paper link](https://arxiv.org/abs/2102.08921) [Slide link](https://docs.google.com/presentation/d/14ho-zy6PJiTOrwNWB-Lw9b87jyo1n4Ly5UFZQ7vHQfs/edit?usp=sharing)

## 1 Motivation

Evaluating generative models is a complicated and unsolved problem. Likelihood is a natural candidate metric, but many generative models, such as VAEs and GANs, do not have tractable likelihoods. Moreover, likelihood scales poorly to high dimensions and collapses different modes of model failure into a single score [1]. We therefore need a domain- and model-agnostic evaluation metric for generative models. Specifically, the metric should capture three qualities:

- **Fidelity**: do the generated samples resemble real samples?
- **Diversity**: are the generated samples diverse enough to cover the variability of the real data?
- **Generalization**: are the generated samples more than mere copies of the real samples?

## 2 Overview

This paper introduces a 3-dimensional metric, ($\alpha$-Precision, $\beta$-Recall, Authenticity), that characterizes the fidelity, diversity, and generalization performance of any generative model. The main contribution is $\alpha$-Precision and $\beta$-Recall, which generalize the standard precision and recall metrics introduced in [2] by inspecting the density level sets of the real and synthetic distributions. The paper also proposes an Authenticity metric that quantifies the likelihood that a synthetic sample is copied from the training data.

## 3 Introduction

### 3.1 How is this metric different?

The proposed precision and recall metrics assume that, for $\alpha \in [0,1]$, a fraction $1-\alpha$ of the real data are outliers and a fraction $\alpha$ are typical. $\alpha$-Precision is then the fraction of synthetic samples that resemble the "most typical" fraction $\alpha$ of real samples. Similarly, $1-\beta$ of the synthetic data are treated as outliers and $\beta$ as typical, and $\beta$-Recall is the fraction of real samples covered by the most typical fraction $\beta$ of synthetic samples. To compute both metrics, the data is embedded into hyperspheres, with typical samples concentrated around the centers and outliers closer to the boundaries. Standard precision and recall as introduced in [2] are the special case $\alpha=\beta=1$. By accounting for the density levels of the distributions, the proposed metrics overcome several drawbacks of standard precision and recall, such as their lack of robustness to outliers and their inability to distinguish types of model failure (e.g., mode collapse vs. mode invention).

### 3.2 Pipeline

Denote real and generated data as $X_r \sim \mathbb{P}_r$ and $X_g \sim \mathbb{P}_g$ respectively, where $X_r, X_g \in \mathcal{X}$, with $\mathbb{P}_r$ and $\mathbb{P}_g$ being the real and generative distributions. The real and synthetic data sets are $\mathcal{D}_{\text{real}}=\left\{X_{r, i}\right\}_{i=1}^n$ and $\mathcal{D}_{\text{synth}}=\left\{X_{g, j}\right\}_{j=1}^m$. The goal of the metric is to compute:

$$\mathcal{E}(\mathcal{D}_{\text{real}}, \mathcal{D}_{\text{synth}}) \triangleq(\underbrace{\alpha \text{-Precision}}_{\text{Fidelity}}, \underbrace{\beta \text{-Recall}}_{\text{Diversity}}, \underbrace{\text{Authenticity}}_{\text{Generalization}})$$

![pipe](https://hackmd.io/_uploads/SkzJOw8S6.png)

**Figure 1.** Illustration of the evaluation and auditing pipelines.
As illustrated in Figure 1, the evaluation pipeline starts by embedding $X_r$ and $X_g$ into a meaningful feature space through $\Phi$, and then classifies the embedded data to obtain $\alpha$-Precision, $\beta$-Recall, and Authenticity. These three metrics are defined in the following subsections; the embedding and the classifiers are introduced in the Methodology section. For the auditing pipeline, the authors show how the proposed metrics can be used to reject low-quality samples.

### 3.3 $\alpha$-Precision and $\beta$-Recall

Let $\widetilde{X}_r=\Phi\left(X_r\right)$ and $\widetilde{X}_g=\Phi\left(X_g\right)$ be the embedded real and synthetic data. For simplicity, $\mathbb{P}_r$ and $\mathbb{P}_g$ refer interchangeably to the distributions over raw and embedded features. Let $\mathcal{S}_r=\operatorname{supp}\left(\mathbb{P}_r\right)$ and $\mathcal{S}_g=\operatorname{supp}\left(\mathbb{P}_g\right)$, where $\operatorname{supp}(\mathbb{P})$ is the support of $\mathbb{P}$. The $\alpha$-support is the minimum-volume subset of $\mathcal{S}=\operatorname{supp}(\mathbb{P})$ that supports a probability mass of $\alpha$, i.e.,

$$ \mathcal{S}^\alpha \triangleq \min _{s \subseteq \mathcal{S}} V(s), \text { s.t. } \mathbb{P}(s)=\alpha, $$

where $V(s)$ is the volume (Lebesgue measure) of $s$, and $\alpha \in[0,1]$. One can think of the $\alpha$-support as dividing the full support of $\mathbb{P}$ into "normal" samples concentrated in $\mathcal{S}^\alpha$ and "outliers" residing in $\overline{\mathcal{S}}^\alpha$, where $\mathcal{S}=\mathcal{S}^\alpha \cup \overline{\mathcal{S}}^\alpha$.

Given this definition, $P_{\alpha}$ is the probability that a synthetic sample resides in the $\alpha$-support of the real distribution, i.e.,

$$ P_\alpha \triangleq \mathbb{P}\left(\widetilde{X}_g \in \mathcal{S}_r^\alpha\right), \text { for } \alpha \in[0,1]. $$

Similarly, $R_\beta$ is the probability that a real sample resides in the $\beta$-support of the synthetic distribution, i.e.,

$$ R_\beta \triangleq \mathbb{P}\left(\widetilde{X}_r \in \mathcal{S}_g^\beta\right), \text { for } \beta \in[0,1]. $$

Figure 2 illustrates how the proposed $P_{\alpha}$ and $R_{\beta}$ overcome the shortcomings of standard precision and recall. In Figure 2, $\mathbb{P}_r$ is shown in blue and $\mathbb{P}_g$ in red; shaded areas represent the probability mass covered by the $\alpha$- and $\beta$-supports. For the mode collapse in 2(a), standard precision gives $P_{1}=1$, whereas the $P_{\alpha}$ curve is suboptimal because the generated distribution does not spread over the $\alpha$-supports of the real data. For the mode invention in 2(b), both standard metrics give $P_{1}=R_{1}=1$, while the generated distribution does not concentrate on the real $\alpha$-supports and the real distribution is not covered by the synthetic $\beta$-supports, so the poor $P_{\alpha}$ and $R_{\beta}$ curves reflect the failure more faithfully. For the density drift in 2(c), this is intuitively the best of the three models, yet it appears inferior to 2(b) when judged by $P_{1}$ and $R_{1}$ alone.

![P-R_curve](https://hackmd.io/_uploads/HJdppFrSa.png)

**Figure 2.** Interpretation of the $P_\alpha$ and $R_\beta$ curves.
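To build intuition for these definitions, the following toy sketch (my own illustration, not code from the paper; the helper names `support_interval`, `precision_curve`, and `recall_curve` are made up) computes $P_\alpha$ and $R_\beta$ curves in one dimension, where the $\alpha$-support of a unimodal symmetric density is simply the central $\alpha$-quantile interval. A mode-collapsed generator achieves $P_1 = 1$, yet its $P_\alpha$ curve departs sharply from the ideal $P_\alpha = \alpha$, mirroring Figure 2(a).

```python
import numpy as np

rng = np.random.default_rng(0)
n = m = 20_000

# Real data: standard Gaussian. "Mode-collapsed" generator: a narrow Gaussian
# sitting on the densest region of the real distribution.
x_real = rng.normal(0.0, 1.0, size=n)
x_gen_collapsed = rng.normal(0.0, 0.1, size=m)
x_gen_faithful = rng.normal(0.0, 1.0, size=m)

def support_interval(samples, level):
    """Empirical alpha-support of a unimodal symmetric 1-D density:
    the central interval holding a `level` fraction of the mass."""
    lo = np.quantile(samples, (1.0 - level) / 2.0)
    hi = np.quantile(samples, (1.0 + level) / 2.0)
    return lo, hi

def precision_curve(real, gen, levels):
    """P_alpha: fraction of generated samples inside the real alpha-support."""
    return np.array([
        np.mean((gen >= lo) & (gen <= hi))
        for lo, hi in (support_interval(real, a) for a in levels)
    ])

def recall_curve(real, gen, levels):
    """R_beta: fraction of real samples inside the synthetic beta-support."""
    return precision_curve(gen, real, levels)

alphas = np.linspace(0.0, 1.0, 11)
print("alpha              :", alphas.round(2))
print("P_alpha (collapsed):", precision_curve(x_real, x_gen_collapsed, alphas).round(2))
print("P_alpha (faithful) :", precision_curve(x_real, x_gen_faithful, alphas).round(2))
# The collapsed generator saturates P_alpha near 1 even for small alpha
# (far above the diagonal), while the faithful generator tracks P_alpha ~ alpha.
```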
The grey dashed line in 2(d) corresponds to an optimal model ($P_\alpha = \alpha$ and $R_\beta = \beta$ for all levels), so the area between a model's curve and this line measures its deviation from optimality. The authors therefore define the integrated metrics $I P_\alpha= 1-2 \Delta P_\alpha$ and $I R_\beta=1-2 \Delta R_\beta$, where

$$ \Delta P_\alpha=\int_0^1\left|P_\alpha-\alpha\right| d \alpha, \quad \Delta R_\beta=\int_0^1\left|R_\beta-\beta\right| d \beta. $$
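As a quick numerical check (a standalone sketch of this formula, not code from the paper; `integrated_precision` is a made-up name), $\Delta P_\alpha$ can be approximated with the trapezoidal rule on a grid of $\alpha$ levels; the same function applies unchanged to an $R_\beta$ curve. A curve that saturates early, as under mode collapse, is heavily penalized even though $P_1 = 1$:

```python
import numpy as np

def integrated_precision(p_curve, levels):
    """IP_alpha = 1 - 2 * Delta_P_alpha, where Delta_P_alpha is the integral of
    |P_alpha - alpha| over alpha, approximated with the trapezoidal rule."""
    dev = np.abs(p_curve - levels)
    delta = np.sum(0.5 * (dev[1:] + dev[:-1]) * np.diff(levels))
    return 1.0 - 2.0 * delta

levels = np.linspace(0.0, 1.0, 101)
optimal = levels                              # P_alpha = alpha everywhere
collapsed = np.minimum(1.0, 10.0 * levels)    # saturates early, as under mode collapse
print(integrated_precision(optimal, levels))    # 1.0
print(integrated_precision(collapsed, levels))  # roughly 0.1: heavily penalized
```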
### 3.4 Authenticity

Generalization is independent of precision and recall, since a model can achieve perfect fidelity and diversity without truly generating any samples, simply by resampling the training data. The authenticity score $A \in [0,1]$ quantifies the rate at which a model generates genuinely new samples. To pin down a mathematical definition for $A$, $\mathbb{P}_g$ is reformulated as a mixture of densities:

$$ \mathbb{P}_g=A \cdot \mathbb{P}_g^{\prime}+(1-A) \cdot \delta_{g, \epsilon}, $$

where $\mathbb{P}_g^{\prime}$ is the generative distribution conditioned on the synthetic samples not being copied, and $\delta_{g, \epsilon}$ is a noisy distribution over the training data. In particular, $\delta_{g, \epsilon}=\delta_g * \mathcal{N}\left(0, \epsilon^2\right)$, where $\delta_g$ is a discrete distribution that places an unknown probability mass on each training data point in $\mathcal{D}_{\text{real}}$, $\epsilon$ is an arbitrarily small noise variance, and $*$ is the convolution operator. Essentially, this formulation assumes that the model flips a (biased) coin: with probability $1-A$ it pulls a training sample and adds a small amount of noise to it, and with probability $A$ it innovates a new sample.

## 4 Methodology

### 4.1 Estimating the Evaluation Metric

The estimated metric is $\widehat{\mathcal{E}}=\left(\widehat{P}_\alpha, \widehat{R}_\beta, \widehat{A}\right)$, where each component is calculated by averaging sample-level decisions: $\widehat{P}_\alpha=\frac{1}{m} \sum_j \widehat{P}_{\alpha, j}$, $\widehat{R}_\beta=\frac{1}{n} \sum_i \widehat{R}_{\beta, i}$, and $\widehat{A}=\frac{1}{m} \sum_j \widehat{A}_j$. This requires three binary classifiers $f_P, f_R, f_A: \widetilde{\mathcal{X}} \rightarrow\{0,1\}$, with $\widehat{P}_{\alpha, j}=f_P\left(\widetilde{X}_{g, j}\right)$, $\widehat{R}_{\beta, i}=f_R\left(\widetilde{X}_{r, i}\right)$, and $\widehat{A}_j=f_A\left(\widetilde{X}_{g, j}\right)$.

1. **$\alpha$-Precision classifier $f_P$.** Given a generated sample, $f_P$ determines whether it belongs to $\widehat{\mathcal{S}}_r^\alpha$: $f_P\left(\widetilde{X}_g\right)=\mathbf{1}\left\{\widetilde{X}_g \in \widehat{\mathcal{S}}_r^\alpha\right\}$. An evaluation embedding $\Phi$ is trained to cast $\mathcal{S}_r$ into a hypersphere of radius $r$ by minimizing $L=\sum_i\ell_i$, where $\ell_i=r^2+\frac{1}{\nu} \max \left\{0,\left\|\Phi\left(X_{r, i}\right)-c_r\right\|^2-r^2\right\}$. This loss is based on one-class SVMs [3]; the evaluation embedding squeezes the real data into a minimum-volume hypersphere centered at $c_r$. The $\alpha$-support is then estimated as $\widehat{\mathcal{S}}_r^\alpha=\boldsymbol{B}\left(c_r, \widehat{r}_\alpha\right)$ with $\widehat{r}_\alpha=\widehat{Q}_\alpha\left\{\left\|\widetilde{X}_{r, i}-c_r\right\|: 1 \leq i \leq n\right\}$, where $\boldsymbol{B}\left(c, \widehat{r}\right)$ is a Euclidean ball and $\widehat{Q}_\alpha$ is the $\alpha$-quantile function. Therefore, $f_P\left(\widetilde{X}_g\right)=\mathbf{1}\left\{\left\|\widetilde{X}_g-c_r\right\| \leq \widehat{r}_\alpha\right\}$.
2. **$\beta$-Recall classifier $f_R$.** Here $f_R\left(\widetilde{X}_r\right)=\mathbf{1}\left\{\widetilde{X}_r \in \widehat{\mathcal{S}}_g^\beta\right\}$, with the synthetic $\beta$-support estimated as $\boldsymbol{B}\left(c_g, \widehat{r}_\beta\right)$, where $c_g=\frac{1}{m} \sum_j \widetilde{X}_{g, j}$ and $\widehat{r}_\beta=\widehat{Q}_\beta\left\{\left\|\widetilde{X}_{g, j}-c_g\right\|: 1 \leq j \leq m\right\}$. Concretely, $f_R$ checks whether a real sample $\widetilde{X}_{r, i}$ has a generated sample $\widetilde{X}_{g, j^*}^\beta$ near it: $f_R\left(\widetilde{X}_{r, i}\right)=\mathbf{1}\left\{\widetilde{X}_{g, j^*}^\beta \in \boldsymbol{B}\left(\widetilde{X}_{r, i}, \mathrm{NND}_k\left(\widetilde{X}_{r, i}\right)\right)\right\}$, where $\widetilde{X}_{g, j^*}^\beta$ is the synthetic sample in $\boldsymbol{B}\left(c_g, \widehat{r}_\beta\right)$ closest to $\widetilde{X}_{r, i}$, and $\mathrm{NND}_k(\widetilde{X}_{r, i})$ is the distance between $\widetilde{X}_{r, i}$ and its $k$-nearest neighbor in $\mathcal{D}_{\text{real}}$.
3. **Authenticity classifier $f_A$.** Let $d_{g, j}=d\left(\widetilde{X}_{g, j}, \widetilde{X}_{r, i^*}\right)$ be the distance between $\widetilde{X}_{g, j}$ and its closest real sample $\widetilde{X}_{r, i^*}$, and let $d_{r, i^*}$ be the distance between $\widetilde{X}_{r, i^*}$ and its closest real sample other than itself. A likely copy is flagged by $a_j=\mathbf{1}\left\{d_{g, j} \leq d_{r, i^*}\right\}$, and $f_A$ issues $\widehat{A}_j = 1$ if $a_j = 0$, and $\widehat{A}_j = 0$ otherwise.
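The three classifiers reduce to simple distance computations in the embedding space. Below is a minimal NumPy/SciPy sketch of how they could be implemented on pre-computed embeddings, assuming the one-class embedding $\Phi$ and its center $c_r$ have already been trained as described above; the function names and the default $k=5$ are my own choices, not the authors' implementation.

```python
import numpy as np
from scipy.spatial.distance import cdist

def alpha_precision(emb_real, emb_synth, c_r, alpha):
    """f_P averaged over synthetic samples: fraction falling inside the ball
    around c_r whose radius is the alpha-quantile of real distances to c_r."""
    r_alpha = np.quantile(np.linalg.norm(emb_real - c_r, axis=1), alpha)
    return float(np.mean(np.linalg.norm(emb_synth - c_r, axis=1) <= r_alpha))

def beta_recall(emb_real, emb_synth, beta, k=5):
    """f_R averaged over real samples: a real sample counts as covered if the
    nearest synthetic sample inside the beta-support ball is closer than the
    sample's k-NN distance within the real data."""
    c_g = emb_synth.mean(axis=0)
    d_to_center = np.linalg.norm(emb_synth - c_g, axis=1)
    core = emb_synth[d_to_center <= np.quantile(d_to_center, beta)]
    if len(core) == 0:
        return 0.0
    d_core = cdist(emb_real, core).min(axis=1)   # nearest "core" synthetic sample
    d_rr = cdist(emb_real, emb_real)
    np.fill_diagonal(d_rr, np.inf)
    knn = np.sort(d_rr, axis=1)[:, k - 1]        # k-NN distance within the real data
    return float(np.mean(d_core <= knn))

def authenticity(emb_real, emb_synth):
    """f_A averaged over synthetic samples: a sample is flagged as a likely copy
    if it is closer to its nearest real neighbour than that neighbour is to any
    other real sample; authenticity is the fraction of non-flagged samples."""
    d_gr = cdist(emb_synth, emb_real)
    i_star = d_gr.argmin(axis=1)                 # closest real sample per synthetic sample
    d_gj = d_gr.min(axis=1)
    d_rr = cdist(emb_real, emb_real)
    np.fill_diagonal(d_rr, np.inf)
    d_ri = d_rr.min(axis=1)[i_star]              # that real sample's nearest other real sample
    a_j = d_gj <= d_ri
    return float(np.mean(~a_j))                  # A_j = 1 - a_j
```

Note that the pairwise `cdist` calls keep the sketch readable but cost $O(nm)$ memory; a production implementation would use a nearest-neighbor index instead, and the embeddings would come from the one-class loss described in item 1.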
## 5 Experiments

### 5.1 Evaluating & auditing generative models for synthesizing COVID-19 data

#### 5.1.1 Dataset, models, and baselines

**Dataset**: SIVEP-Gripe [4], a database of 99,557 COVID-19 patients in Brazil, including sensitive data such as ethnicity.

**Models**: GAN, VAE, Wasserstein GAN with gradient penalty (WGAN-GP) [5], and ADS-GAN, which is specifically designed to prevent patient identifiability in the generated data [6].

**Baselines**: Fréchet Inception Distance (FID) [7], Precision/Recall (P1/R1) [2], Density/Coverage (D/C) [8], Parzen window likelihood (PW) [9], and Wasserstein distance (W).

![image](https://hackmd.io/_uploads/ryFDwUSSp.png)

**Figure 3.** Predictive modeling with synthetic data. (a) Ranking of the 4 generative models. (b) Hyper-parameter tuning for ADS-GAN (dashed lines are linear regression fits). (c) Post-hoc auditing of ADS-GAN.

#### 5.1.2 Ranking generative and predictive models

On each synthetic data set produced by the 4 models, a (predictive) logistic regression model is fit to predict patient-level COVID-19 mortality. In principle, a better synthetic data set leads to a better predictive model, i.e., a higher AUC-ROC. The ground-truth ranking is: ADS-GAN, WGAN-GP, VAE, GAN. Figure 3(a) shows the rankings produced by the different metrics, together with the AUC-ROC achieved by the model each metric ranks first. Among these rankings, only $IP_\alpha$ recovers the fully correct order.

#### 5.1.3 Hyper-parameter tuning & the privacy-utility tradeoff

Take the best-performing model from the previous experiment, ADS-GAN. This model has a hyper-parameter $\lambda \in \mathbb{R}$ that controls the weight of the privacy-preservation loss. Smaller values of $\lambda$ make the model more prone to overfitting, and hence to privacy leakage; increasing $\lambda$ improves privacy at the expense of precision. Figure 3(b) shows this trade-off between privacy and precision across different $\lambda$ values.

#### 5.1.4 Improving synthetic data via model auditing

By discarding unauthentic or imprecise samples, the synthetic data can be improved in a post-hoc way. This not only leads to nearly optimal precision and authenticity for the curated data (Figure 3(c)), but also improves the AUC-ROC of the predictive model fitted to the audited data (from 0.76 to 0.78 for the audited ADS-GAN synthetic data, p < 0.005).

### 5.2 Diagnosing generative distributions of MNIST

![image](https://hackmd.io/_uploads/ryJWe9rBp.png)

**Figure 4.** (a) Diagnosing mode collapse in MNIST data. (b) Results for the hide-and-seek competition.

The authors fit a conditional GAN (CGAN) model [13] on the MNIST data set and sample 1,000 instances of each of the digits 0-9. They then delete individual samples of digits 1 to 9 with probability $P_{drop}$ and replace the deleted samples with new samples of the digit 0, so that the data set keeps 10,000 instances (a sketch of this corruption procedure is given below). Figure 4(a) shows that all metrics register a growing discrepancy between $P_r$ and $P_g$ as $P_{drop}$ increases, but only the proposed metrics separate the change in fidelity from the change in diversity, and they are also more sensitive to increases in $P_{drop}$ than standard Precision & Recall.
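A rough sketch of the corruption procedure is given below; it is my own reconstruction from the description above (the function name `simulate_mode_collapse` is made up), and it recycles existing digit-0 samples instead of drawing fresh zeros from the CGAN, which is a simplification of the paper's setup.

```python
import numpy as np

def simulate_mode_collapse(samples, labels, p_drop, rng=None):
    """Corrupt a balanced set of CGAN outputs: each sample of digits 1-9 is
    dropped with probability p_drop and replaced by a digit-0 sample, keeping
    the total number of instances fixed."""
    if rng is None:
        rng = np.random.default_rng(0)
    samples, labels = samples.copy(), labels.copy()
    zeros = samples[labels == 0]

    drop = (labels != 0) & (rng.random(len(labels)) < p_drop)
    samples[drop] = zeros[rng.integers(0, len(zeros), size=int(drop.sum()))]
    labels[drop] = 0
    return samples, labels

# Sweeping p_drop from 0 to 1 interpolates between the original CGAN output
# and a distribution fully collapsed onto the digit 0; each corrupted set is
# then scored against real MNIST with the proposed metrics.
```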
### 5.3 Revisiting the Hide-and-Seek challenge for synthesizing time-series data

The authors use their metric to re-evaluate the generative models submitted to the NeurIPS 2020 Hide-and-Seek competition [10], in which participants were required to synthesize intensive-care time-series data based on real data from the AmsterdamUMCdb database. The winning submission was a very simplistic model that adds Gaussian noise to the real data to create new samples. To apply the metrics to time-series data, the authors trained a Seq-2-Seq embedding augmented with the one-class representations to transform time series into fixed-length feature vectors. The winning submission comes out as one of the least authentic models, despite performing competitively in terms of precision and recall. This highlights the detrimental impact of using naive metrics to evaluate generative models: based on the competition results, clinical institutions seeking to create synthetic data sets may be led to believe that Submission 1 in Figure 4(b) is the right model to use.

### 5.4 Image generation benchmarks

The authors generate 10,000 samples from StyleGAN2-ADA [11] and DDPM [12] models pre-trained on CIFAR-10, and evaluate the $IP_\alpha$ and $IR_\beta$ metrics for both sets of samples against ground-truth samples from the original CIFAR-10 data. In Table 1, StyleGAN2-ADA shows better overall performance, but DDPM achieves a better $\alpha$-Precision; the proposed metrics thus paint a more nuanced picture of the comparison.

![image](https://hackmd.io/_uploads/rkCi6LrHa.png)

**Table 1.** Comparison between StyleGAN2-ADA and DDPM on the CIFAR-10 data set using the proposed metrics.

## 6 Discussion

### 6.1 Strengths

1. Introducing density levels into precision and recall is a novel idea.
2. The problem this work tries to solve is an important and challenging one for the generative modeling community.
3. The paper is well written and the method is clearly presented.

### 6.2 Weaknesses

1. The method requires training an additional embedding to compute $\alpha$-Precision and $\beta$-Recall, whereas metrics such as FID can directly use pretrained models. The performance of the metric therefore depends on the quality of the trained embedding.
2. The method computes $\alpha$-Precision and $\beta$-Recall using Euclidean distance in the embedding space. However, Euclidean distance may not be appropriate for every embedding space, which could make the metrics less accurate.
3. Error: the authors claim the embedding casts the real data into an isotropic-density hypersphere, but the loss cannot enforce isotropy: $\ell_i=r^2+\frac{1}{\nu} \max \left\{0,\left\|\Phi\left(X_{r, i}\right)-c_r\right\|^2-r^2\right\}$.
4. The experiments are inadequate.
   (a) The main experiment is based on a COVID-19 dataset rather than standard image generation datasets such as FFHQ or CelebA-HQ, and state-of-the-art models are not used in the main experiment.
   (b) In the "Hyper-parameter tuning & the privacy-utility tradeoff" experiment (Figure 3(b)), there is a sharp change in the last three points that is left unexplained; removing these three points would leave the remaining points almost on a horizontal line.
   (c) In the "Improving synthetic data via model auditing" experiment (Figure 3(c)), authenticity improves to nearly 1, yet the gain in downstream performance is marginal (from 0.76 to 0.78), and no variance is reported.
   (d) The "Comparison between StyleGAN2-ADA and DDPM" result in Table 1 is counter-intuitive: the metrics show that DDPM has higher precision but lower recall than StyleGAN2-ADA, whereas a diffusion model would be expected to generalize better than a GAN. The result may stem from an unfair comparison.

### 6.3 Improvements

1. Add more experiments using standard datasets and state-of-the-art image generation models.
2. Add an ablation study on the distance metric.
3. Add qualitative experiments assessing the outliers that are ruled out by the $\alpha$- and $\beta$-supports.

### References

[1] Theis, L., Oord, A. v. d., and Bethge, M. A note on the evaluation of generative models. arXiv preprint arXiv:1511.01844, 2015.

[2] Sajjadi, M. S., Bachem, O., Lucic, M., Bousquet, O., and Gelly, S. Assessing generative models via precision and recall. Advances in Neural Information Processing Systems, 31:5228–5237, 2018.

[3] Schölkopf, B., Platt, J. C., Shawe-Taylor, J., Smola, A. J., and Williamson, R. C. Estimating the support of a high-dimensional distribution. Neural Computation, 13(7):1443–1471, 2001.

[4] SIVEP-Gripe public dataset. Ministry of Health, Brazil. http://plataforma.saude.gov.br/coronavirus/dadosabertos/, 2020 (accessed May 10, 2020; in Portuguese).

[5] Gulrajani, I., Ahmed, F., Arjovsky, M., Dumoulin, V., and Courville, A. Improved training of Wasserstein GANs. In Proceedings of the 31st International Conference on Neural Information Processing Systems, pp. 5769–5779, 2017.

[6] Yoon, J., Drumright, L. N., and van der Schaar, M. Anonymization through data synthesis using generative adversarial networks (ADS-GAN). IEEE Journal of Biomedical and Health Informatics, 2020.

[7] Heusel, M., Ramsauer, H., Unterthiner, T., Nessler, B., Klambauer, G., and Hochreiter, S. GANs trained by a two time-scale update rule converge to a local Nash equilibrium. Advances in Neural Information Processing Systems, 2017.

[8] Naeem, M. F., Oh, S. J., Uh, Y., Choi, Y., and Yoo, J. Reliable fidelity and diversity metrics for generative models. International Conference on Machine Learning (ICML), 2020.

[9] Bengio, Y., Mesnil, G., Dauphin, Y., and Rifai, S. Better mixing via deep representations. In International Conference on Machine Learning, pp. 552–560. PMLR, 2013.

[10] Jordon, J., Jarrett, D., Yoon, J., Barnes, T., Elbers, P., Thoral, P., Ercole, A., Zhang, C., Belgrave, D., and van der Schaar, M. Hide-and-seek privacy challenge. arXiv preprint arXiv:2007.12087, 2020.
[11] Karras, T., Aittala, M., Hellsten, J., Laine, S., Lehtinen, J., and Aila, T. Training generative adversarial networks with limited data. arXiv preprint arXiv:2006.06676, 2020.

[12] Ho, J., Jain, A., and Abbeel, P. Denoising diffusion probabilistic models. In NeurIPS, 2020.

[13] Wang, T.-C., Liu, M.-Y., Zhu, J.-Y., Tao, A., Kautz, J., and Catanzaro, B. High-resolution image synthesis and semantic manipulation with conditional GANs. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 8798–8807, 2018.