## Introduction

- Introduce the sampling problem
  - We are interested in computing expectations, obtaining approximate samples, and estimating the log normalization constant.
  - Mention where these different objectives have use cases, e.g., estimating the log-normalizer arises in statistical physics.
- Explain why we need this paper
  - Many methods are evaluated on only a small subset of tasks and use different performance criteria.
  - Some methods only consider tasks where, e.g., the log normalizer is known. Others only consider tasks where ground-truth samples from the target are available to compute discrepancies between a learned model and the target distribution using IPMs such as the Wasserstein distance or MMD.
- Talk about mode collapse
  - There is no established metric to quantify mode collapse. There is the forward ESS, but it is overly sensitive to a model lacking support on target samples.
- In this work, we provide a more comprehensive evaluation protocol for variational methods on a broad set of tasks, including tasks with known $\log Z$, tasks with ground-truth samples, and tasks with neither.
- We put the most recent methods in perspective by providing a unified view of how these metrics are computed. (By this I mean that all metrics except IPMs are based on log-ratios, i.e., importance weights. This should be highlighted and used to cluster methods according to how they compute the importance weights; a minimal sketch of these log-ratio metrics follows at the end of this outline.)

## Related Work

- Explain 'simple' VI, i.e., on the non-extended state space using the reverse KL
- Introduce VI on an extended state space, incl. sequential importance sampling, AIS, SMC, and SDE-based methods
- Introduce VI with other divergences, such as $\alpha$-divergences or the log-variance loss

## Evaluating Variational Methods

- Without access to $\log Z$ or samples from $\pi$
  - ESS and ELBO
- With access to $\log Z$
  - $\Delta \log Z$
- With access to samples from $\pi$
  - IPMs (Wasserstein, MMD)
  - Forward ESS (EUBO?)
- With access to mode descriptors
  - Compute the entropy over modes

## Benchmarking Methods

- Introduce methods, but not one by one. Cluster them according to common properties to provide a unified perspective.
- Proposal:
  - Tractable density models: MFVI, GMMVI, NFVI
    - Straightforward to compute the importance weights
  - Sequential Importance Sampling (SIS) methods: SMC, AFT, CRAFT, SNF, FAB (I chose SIS here instead of AIS to reflect that the incremental importance sampling weights from AIS differ from those of flow-transport methods. From my understanding, SMC and FAB use AIS incremental importance sampling weights, whereas AFT, CRAFT, and SNF additionally use a flow to transport samples in the incremental importance sampling weights.)
    - Importance weights are computed from incremental importance sampling weights
  - Girsanov methods (maybe there is a better name?): DDS, PIS, DIS (or, more generally, Schrödinger bridges; cite Lorenz and Julius)
    - Importance weights are computed by applying Girsanov's theorem to the Radon-Nikodym derivative (RND)
  - Langevin diffusion-based methods: ULA, UHA, MCD, LDVI, CMCD
    - Importance weights are computed from the Markov chain discretization of the RND

## Benchmarking Targets

- Mostly finished. Nothing special or difficult here.

## Evaluation

- We need to think of a way to present the results appealingly. Seems a bit difficult since we have tons of tasks, baselines, and performance criteria.
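To make the unified log-ratio view concrete, here is a minimal sketch, assuming we already have log-ratios $\log w(x) = \log \gamma(x) - \log q(x)$ for samples $x \sim q$; the function name and normalization conventions are illustrative, not decided yet:

```python
import numpy as np
from scipy.special import logsumexp

def log_ratio_metrics(log_w):
    """Density-ratio-based criteria from log importance weights
    log_w[i] = log gamma(x_i) - log q(x_i) with x_i ~ q (illustrative sketch)."""
    n = log_w.shape[0]
    elbo = log_w.mean()                              # lower bound on log Z
    log_z_is = logsumexp(log_w) - np.log(n)          # importance-weighted log Z estimate
    # normalized reverse ESS in [1/n, 1], computed in log space for stability
    rev_ess = np.exp(2.0 * logsumexp(log_w) - logsumexp(2.0 * log_w)) / n
    return {"elbo": elbo, "log_z_is": log_z_is, "rev_ess": rev_ess}
```

When the ground-truth $\log Z$ is available, $\Delta \log Z$ would presumably be the (absolute) gap between it and such an estimate.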
## Introduction (2)

- Introduce the sampling problem
- Mention that there is growing interest in developing sampling methods. However:
  - i) most methods are evaluated on different tasks
  - ii) they use different performance criteria: some use density-ratio-based criteria such as the evidence lower bound or the effective sample size; others use integral probability metrics between empirical measures when samples from the target are available, such as the Wasserstein distance or the maximum mean discrepancy.
- Point out shortcomings of existing approaches:
  - ELBO and ESS fail to detect mode collapse, which can be misleading for target densities with well-separated modes.
  - IPMs require design choices, such as kernels for MMD or cost functions for Wasserstein, which can heavily bias the result; this is undesirable for evaluation.
- The contributions of this work are as follows:
  - We provide a comprehensive task suite for evaluating sampling methods.
  - We introduce recent methods from a unified perspective: they all provide tractable density ratios on a marginal or extended state space. We use this perspective to introduce novel performance criteria that are sensitive to mode collapse.
  - We thoroughly evaluate all methods on the task suite, which provides valuable insights into
    - a) methods (efficiency, robustness, mode collapse, and sample quality)
    - b) performance criteria (we analyze which performance criteria are suited to different settings)

**Remark:** I think this section should not be technical at all. It should give a high-level overview of why this paper is needed. The only formula should be at the beginning, for introducing the sampling problem.

## Related Work

- Intro to MCMC for tackling the sampling problem. AIS and SMC extensions additionally use annealing.
- Intro to VI on the marginal and extended state space (mean-field VI, GMMVI, and NFs on the marginal; ULA, UHA, MCD, LDVI, CMCD, DDS, PIS, DIS, etc. on the extended space). Mention that these methods are all based on forward-backward Markov chains. Briefly mention that VI is possible with other divergences (VI with $\alpha$-divergences, VarGrad).
- Intro to combinations of VI and MCMC/AIS (AFT, SNF, CRAFT, and FAB)

**Remark:** I think this section should not be that technical. It should introduce the notation for the extended state space and maybe the notation for annealing (i.e., the geometric average between proposal and target).

## Performance Criteria

Density-ratio-based
- The ELBO is commonly used to compare different methods. Mention that it makes methods comparable since it is a lower bound on $\log Z$. Mention that it is related to the reverse KL. Introduce the ELBO on the extended state space. Mention that it is a lower bound on the marginal ELBO, and that they are equivalent if the forward and backward processes are aligned.
  - Explain shortcomings, i.e., that it is not sensitive to mode collapse.
- Importance-weighted estimate of $\log Z$: not meaningful if the ground-truth partition function is unknown. Also not sensitive to mode collapse.
- (Reverse) effective sample size: introduction and shortcomings.

Integral Probability Metrics
- Introduce MMD and Wasserstein. Mention shortcomings (see the MMD sketch below for the kernel design choice).
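As a concrete example of the design-choice issue for IPMs, a minimal MMD sketch with an RBF kernel; the kernel, its bandwidth, and the use of the biased V-statistic estimator are assumptions made for brevity, not a fixed choice:

```python
import numpy as np

def rbf_mmd2(x, y, bandwidth=1.0):
    """Biased (V-statistic) estimate of squared MMD between sample sets
    x: (n, d) and y: (m, d) under an RBF kernel with the given bandwidth."""
    def gram(a, b):
        sq_dists = np.sum((a[:, None, :] - b[None, :, :]) ** 2, axis=-1)
        return np.exp(-sq_dists / (2.0 * bandwidth ** 2))
    return gram(x, x).mean() + gram(y, y).mean() - 2.0 * gram(x, y).mean()
```

Different bandwidths (or cost functions for Wasserstein) can change the ranking of methods, which is exactly the shortcoming this subsection should point out.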
## Evaluating Mode-Collapse

- The forward KL is sensitive to mode collapse. Introduce the EUBO and the extended EUBO and their connection to $\log Z$. Mention that the Jeffreys divergence equals EUBO minus ELBO. Maybe put a small illustrating figure in here.
- Mention that it is not explored how to compute the density ratio when samples come from the target, which is needed to compute the EUBO.
- Introduce the entropy over modes as a heuristic approach for detecting mode collapse (a sketch of both criteria is given at the end of these notes).

## Methods

- Mention that we cluster methods according to how the density ratios are computed. See old notes. → Rest should be rather easy.
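Rough sketch of the two mode-collapse-sensitive criteria above, assuming ground-truth target samples, log-ratios that can be evaluated at those samples, and known mode locations; all names are placeholders, and the nearest-mode assignment is only one possible implementation of the heuristic:

```python
import numpy as np

def eubo(log_w_at_target):
    """EUBO = E_{x ~ pi}[log gamma(x) - log q(x)] >= log Z; together with the
    ELBO it yields the Jeffreys divergence as EUBO - ELBO."""
    return log_w_at_target.mean()

def entropy_over_modes(samples, mode_locations):
    """Heuristic mode-coverage score: assign each model sample to its nearest
    mode descriptor and return the entropy of the assignment histogram
    (close to log K when all K modes are covered evenly, 0 under full collapse)."""
    dists = np.linalg.norm(samples[:, None, :] - mode_locations[None, :, :], axis=-1)
    counts = np.bincount(dists.argmin(axis=1), minlength=mode_locations.shape[0])
    probs = counts[counts > 0] / counts.sum()
    return float(-(probs * np.log(probs)).sum())
```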