# Importance Weighting Problem

Let $f:\{1,\ldots,K\}\rightarrow \mathbb{R}$ be a scalar function over $K$ values. Let $\theta$ be the parameters of a $K$-dimensional Dirichlet distribution, and $\pi$ a draw from it (the sample $\pi$ is therefore itself a probability distribution over $K$ outcomes, with $\sum_{k=1}^{K} \pi_{k}=1$):

$$
\pi \sim \mathcal{Dir}(\theta)
$$

Let $a$ be a sample from $\pi$, that is,

$$
a \sim \pi
$$

It's not difficult to see that the marginal distribution of $a$ is

\begin{align}
p(a\vert \theta) &= \int p(a\vert \pi)p(\pi\vert \theta)\,\text{d}\pi \\
&= \mathbb{E}_{\pi \sim \mathcal{Dir}(\theta)} \pi_a \\
&= \frac{\theta_a}{\sum_{k=1}^K \theta_k}
\end{align}

### Estimating the expectation

I'm interested in estimating

$$
F(\theta) = \mathbb{E}_{a \sim p(a\vert \theta)} f(a)
$$

If I could sample from $p(a\vert \theta)$, this would be easy to estimate as a Monte Carlo average:

$$
F(\theta) = \mathbb{E}_{a \sim p(a\vert \theta)} f(a) \approx \frac{1}{N}\sum_{n=1}^{N} f(a_n)
$$

where

\begin{align}
\pi_n &\sim \mathcal{Dir}(\theta), \text{ i.i.d.}\\
a_n &\sim \pi_n
\end{align}

## Importance sampling

However, I will assume that I can't sample from $p(a\vert \theta)$, only from $p(a\vert \tilde{\theta})$ for some other parameter vector $\tilde{\theta}$. So my samples are:

\begin{align}
\tilde{\pi}_n &\sim \mathcal{Dir}(\tilde{\theta}), \text{ i.i.d.}\\
\tilde{a}_n &\sim \tilde{\pi}_n
\end{align}

Using importance sampling I can estimate my quantity of interest as follows:

\begin{align}
F(\theta) &= \mathbb{E}_{a \sim p(a\vert \theta)} f(a) \\
&= \sum_{a=1}^{K} p(a \vert \theta)f(a) \\
&= \sum_{a=1}^{K} p(a\vert \tilde{\theta}) \frac{p(a\vert \theta)}{p(a\vert \tilde{\theta})} f(a) \\
&= \mathbb{E}_{a \sim p(a\vert \tilde{\theta})} \frac{p(a\vert \theta)}{p(a\vert \tilde{\theta})} f(a)
\end{align}

Now I have expressed my quantity of interest as an expectation over $p(a\vert \tilde{\theta})$, so I can use my samples $\tilde{a}_n$, as follows:

\begin{align}
F(\theta) &= \mathbb{E}_{a \sim p(a\vert \tilde{\theta})} \frac{p(a\vert \theta)}{p(a\vert \tilde{\theta})} f(a) \\
&\approx \frac{1}{N} \sum_{n=1}^N \frac{p(\tilde{a}_n\vert \theta)}{p(\tilde{a}_n\vert \tilde{\theta})} f(\tilde{a}_n).
\end{align}

So, instead of simply averaging $f(a)$ over the samples, I weight each sample by the _importance weight_ $w(a) = \frac{p(a\vert \theta)}{p(a\vert \tilde{\theta})}$. This is a widely used method for estimating expectations under one distribution when you only have samples from another. Its main drawback is that it can be a high-variance estimator. Let's call this estimator $\tilde{F}_{IS}(\theta)$.
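To make this concrete, here is a minimal numerical sketch of the exact value $F(\theta)$, the naive Monte Carlo estimator, and $\tilde{F}_{IS}(\theta)$. It assumes NumPy is available; the particular values of $K$, $\theta$, $\tilde{\theta}$, $N$ and the function $f$ are arbitrary placeholders, and the closed-form marginal $p(a\vert\theta)=\theta_a/\sum_k \theta_k$ derived above is used both for the exact value and for the importance weights.

```python
# Minimal sketch: exact F(theta), naive Monte Carlo, and F_IS.
# K, theta, theta_tilde, N and f are illustrative placeholders.
import numpy as np

rng = np.random.default_rng(0)

K = 5
theta = np.array([1.0, 2.0, 3.0, 4.0, 5.0])        # target Dirichlet parameters
theta_tilde = np.array([3.0, 3.0, 3.0, 3.0, 3.0])  # proposal Dirichlet parameters
f = rng.normal(size=K)                             # an arbitrary f: {1..K} -> R

# Exact value, using the marginal p(a|theta) = theta_a / sum_k theta_k
F_exact = np.dot(theta / theta.sum(), f)

N = 10_000

# Naive Monte Carlo: pi_n ~ Dir(theta), a_n ~ pi_n
pi = rng.dirichlet(theta, size=N)                  # shape (N, K)
a = np.array([rng.choice(K, p=p) for p in pi])
F_mc = f[a].mean()

# Importance sampling: sample from the proposal, weight each sample by
# w(a) = p(a|theta) / p(a|theta_tilde), both marginals in closed form
pi_t = rng.dirichlet(theta_tilde, size=N)
a_t = np.array([rng.choice(K, p=p) for p in pi_t])
w = (theta / theta.sum()) / (theta_tilde / theta_tilde.sum())
F_is = (w[a_t] * f[a_t]).mean()

print(F_exact, F_mc, F_is)
```

Note that the weights here are available in closed form precisely because the marginal $p(a\vert\theta)$ was computed above; the Dirichlet samples $\tilde{\pi}_n$ play no role in the weights.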
## Alternative importance sampling

In this example, we sample $a$ in a two-stage process: first we sample $\pi$, then we sample $a$ from it. The IS estimator above doesn't exploit this; it ignores $\pi$ completely.

We can construct a similar estimator that does take $\pi$ into account:

\begin{align}
F(\theta) &= \mathbb{E}_{a \sim p(a\vert \theta)} f(a) \\
&= \sum_{a=1}^{K} p(a \vert \theta)f(a) \\
&= \sum_{a=1}^{K} \int p(a, \pi \vert \theta) f(a) \,\text{d}\pi \\
&= \sum_{a=1}^{K} \int p(a, \pi\vert \tilde{\theta}) \frac{p(a, \pi \vert \theta)}{p(a, \pi\vert \tilde{\theta})} f(a) \,\text{d}\pi \\
&= \mathbb{E}_{\pi \sim \mathcal{Dir}(\tilde{\theta}),\, a \sim \pi} \frac{p(a, \pi\vert \theta)}{p(a, \pi\vert \tilde{\theta})} f(a)
\end{align}

Based on this, an empirical estimator can be constructed:

\begin{align}
F(\theta) &= \mathbb{E}_{a \sim p(a\vert \theta)} f(a)\\
&= \mathbb{E}_{\pi \sim \mathcal{Dir}(\tilde{\theta}),\, a \sim \pi} \frac{p(a, \pi\vert \theta)}{p(a, \pi\vert \tilde{\theta})} f(a) \\
&\approx \frac{1}{N} \sum_{n=1}^{N}\frac{p(\tilde{a}_n, \tilde{\pi}_n\vert \theta)}{p(\tilde{a}_n, \tilde{\pi}_n\vert \tilde{\theta})} f(\tilde{a}_n)
\end{align}

Let's call this estimator $\tilde{F}_{IS2}(\theta)$. Note that it is very similar to the previous one, except that the importance weights now depend not just on $a$ but also on $\pi$.

## Questions

### Closed-form importance weights in each estimator

### Which estimator is better?

### How do the importance weights behave in the two estimators?

### Which estimator has lower variance?
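To explore these questions empirically, here is a minimal sketch of $\tilde{F}_{IS2}(\theta)$ under the same assumed placeholder setup as the snippet above, now also using SciPy's `scipy.stats.dirichlet` density. The joint density is evaluated directly from the two-stage factorization, $p(a, \pi\vert\theta) = \pi_a \,\mathcal{Dir}(\pi; \theta)$, without simplifying the ratio.

```python
# Minimal sketch of F_IS2: same assumed setup as above, but with
# joint importance weights p(a, pi | theta) / p(a, pi | theta_tilde).
import numpy as np
from scipy.stats import dirichlet

rng = np.random.default_rng(0)

K = 5
theta = np.array([1.0, 2.0, 3.0, 4.0, 5.0])        # target Dirichlet parameters
theta_tilde = np.array([3.0, 3.0, 3.0, 3.0, 3.0])  # proposal Dirichlet parameters
f = rng.normal(size=K)                             # same placeholder f as above

N = 10_000

# Sample (pi_n, a_n) from the proposal via the two-stage process
pi_t = rng.dirichlet(theta_tilde, size=N)          # shape (N, K)
a_t = np.array([rng.choice(K, p=p) for p in pi_t])

# Joint weights: p(a, pi | theta) = pi_a * Dir(pi; theta), likewise for theta_tilde
pi_a = pi_t[np.arange(N), a_t]
num = pi_a * np.array([dirichlet.pdf(p, theta) for p in pi_t])
den = pi_a * np.array([dirichlet.pdf(p, theta_tilde) for p in pi_t])
w = num / den

F_is2 = (w * f[a_t]).mean()
print(F_is2)
```

One way to probe the last two questions is to run both snippets many times with different seeds and compare the empirical spread of the weighted terms `w[a_t] * f[a_t]` and `w * f[a_t]`.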