# Importance Weighting Problem
Let
$f:\{1,\ldots,K\}\rightarrow \mathbb{R}$ be a scalar function over $K$ values.
Let $\theta$ be the parameters of a $K$-dimensional Dirichlet distribution, and let $\pi$ be a draw from it (the sample $\pi$ is itself a probability distribution over $K$ outcomes, such that $\sum_{k=1}^{K} \pi_{k}=1$):
$$
\pi \sim \mathcal{Dir}(\theta)
$$
Let $a$ be a sample from $\pi$, that is
$$
a \sim \pi
$$
It's not difficult to see that the marginal distribution of $a$ is
\begin{align}
p(a\vert \theta) &= \int p(a\vert \pi)p(\pi\vert \theta)\text{d}\pi \\
&= \mathbb{E}_{\pi \sim \mathcal{Dir}(\theta)} \pi_a \\
&= \frac{\theta_a}{\sum_{k=1}^K \theta_k}
\end{align}
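As a quick numerical sanity check of this closed form, here is a minimal sketch in Python/NumPy (the particular $K$ and $\theta$ are illustrative choices of mine, not part of the problem):
```python
import numpy as np

rng = np.random.default_rng(0)
theta = np.array([1.0, 2.0, 3.0, 4.0])   # illustrative Dirichlet parameters, K = 4

pi = rng.dirichlet(theta, size=20_000)                   # pi_n ~ Dir(theta)
a = np.array([rng.choice(len(theta), p=p) for p in pi])  # a_n ~ pi_n

print(np.bincount(a, minlength=len(theta)) / len(a))  # empirical marginal of a
print(theta / theta.sum())                            # closed form theta_a / sum_k theta_k
```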
### Estimating expectation
I'm interested in estimating
$$
F(\theta) = \mathbb{E}_{a \sim p(a\vert \theta)} f(a)
$$
If I could sample from $p(a\vert \theta)$, this would be easy to estimate as a Monte Carlo average.
$$
F(\theta) = \mathbb{E}_{a \sim p(a\vert \theta)} f(a) \approx \frac{1}{N}\sum_{n=1}^{N} f(a_n)
$$
where
\begin{align}
\pi_n &\sim \mathcal{Dir}(\theta), \text{i.i.d.}\\
a_n &\sim \pi_n
\end{align}
## Importance sampling
However, I will assume that I can't sample from $p(a\vert \theta)$, only from $p(a\vert \tilde{\theta})$ for some other parameter vector $\tilde{\theta}$. So my samples are:
\begin{align}
\tilde{\pi}_n &\sim \mathcal{Dir}(\tilde{\theta}), \text{i.i.d.}\\
\tilde{a}_n &\sim \tilde{\pi}_n
\end{align}
Using importance sampling I can estimate my quantity of interest as follows:
\begin{align}
F(\theta) &= \mathbb{E}_{a \sim p(a\vert \theta)} f(a) \\
&= \sum_{a=1}^{K} p(a \vert \theta)f(a) \\
&= \sum_{a=1}^{K} p(a\vert \tilde{\theta}) \frac{p(a\vert \theta)}{p(a\vert \tilde{\theta})} f(a) \\
&= \mathbb{E}_{a \sim p(a\vert \tilde{\theta})} \frac{p(a\vert \theta)}{p(a\vert \tilde{\theta})} f(a)
\end{align}
Now I have expressed my quantity of interest as an expectation over $p(a\vert \tilde{\theta})$, so I can use my samples $\tilde{a}_n$, as follows:
\begin{align}
F(\theta) &= \mathbb{E}_{a \sim p(a\vert \tilde{\theta})} \frac{p(a\vert \theta)}{p(a\vert \tilde{\theta})} f(a) \\
&\approx \frac{1}{N} \sum_{n=1}^N \frac{p(\tilde{a}_n\vert \theta)}{p(\tilde{a}_n\vert \tilde{\theta})} f(\tilde{a}_n).
\end{align}
So, instead of simply averaging $f(a)$ over the samples, I weight each sample by the _importance weight_ $w(a) = \frac{p(a\vert \theta)}{p(a\vert \tilde{\theta})}$.
This is a widely used method for estimating expectations under one distribution when you only have samples from another. The problem is that it can be a high-variance estimator.
Let's call this estimator $\tilde{F}_{IS}(\theta)$.
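Here is a minimal sketch of $\tilde{F}_{IS}(\theta)$ in Python/NumPy. The particular $\theta$, $\tilde{\theta}$ and $f$ are again illustrative choices, not part of the problem statement; the weights use the closed-form marginal derived above.
```python
import numpy as np

rng = np.random.default_rng(2)
theta = np.array([1.0, 2.0, 3.0, 4.0])        # target parameters (illustrative)
theta_tilde = np.array([1.5, 2.5, 3.5, 4.5])  # proposal we can actually sample from
f = np.array([0.0, 1.0, 4.0, 9.0])            # arbitrary f over K = 4 values

N = 10_000
pi_tilde = rng.dirichlet(theta_tilde, size=N)                    # pi~_n ~ Dir(theta~)
a_tilde = np.array([rng.choice(len(f), p=p) for p in pi_tilde])  # a~_n ~ pi~_n

p_a = theta / theta.sum()               # p(a | theta), closed form from above
q_a = theta_tilde / theta_tilde.sum()   # p(a | theta~)
w = p_a[a_tilde] / q_a[a_tilde]         # importance weights w(a~_n)

F_is = (w * f[a_tilde]).mean()
print(F_is, f @ p_a)   # IS estimate vs. exact value
```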
## Alternative importance sampling
In this example, we sample $a$ in a two-stage process: first we sample $\pi$ and then we sample $a$. The IS estimator above doesn't exploit this; it completely ignores $\pi$. We can construct a similar estimator that does take $\pi$ into account:
\begin{align}
F(\theta) &= \mathbb{E}_{a \sim p(a\vert \theta)} f(a) \\
&= \sum_{a=1}^{K} p(a \vert \theta)f(a) \\
&= \sum_{a=1}^{K} \int p(a, \pi \vert \theta) f(a) \text{d}\pi \\
&= \sum_{a=1}^{K} \int p(a, \pi\vert \tilde{\theta}) \frac{p(a, \pi \vert \theta)}{p(a, \pi\vert \tilde{\theta})} f(a) \text{d}\pi \\
&= \mathbb{E}_{a \sim \pi, \pi \sim \mathcal{Dir}(\tilde{\theta})} \frac{p(a, \pi\vert \theta)}{p(a, \pi\vert \tilde{\theta})} f(a)
\end{align}
Based on this, an empirical estimator can be constructed:
\begin{align}
F(\theta) &= \mathbb{E}_{a \sim p(a\vert \theta)} f(a)\\
&= \mathbb{E}_{a \sim \pi, \pi \sim \mathcal{Dir}(\tilde{\theta})} \frac{p(a, \pi\vert \theta)}{p(a, \pi\vert \tilde{\theta})} f(a) \\
&\approx \frac{1}{N} \sum_{n=1}^{N}\frac{p(\tilde{a}_n, \tilde{\pi}_n\vert \theta)}{p(\tilde{a}_n, \tilde{\pi}_n\vert \tilde{\theta})} f(\tilde{a}_n).
\end{align}
Let's call this estimator $\tilde{F}_{IS2}(\theta)$.
Note that this estimator is very similar to the previous one, except that the importance weights now depend not just on $a$ but also on $\pi$.
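A sketch of $\tilde{F}_{IS2}(\theta)$ along the same lines, now evaluating the joint densities $p(a, \pi \vert \theta) = \mathcal{Dir}(\pi; \theta)\,\pi_a$ with SciPy (same illustrative $\theta$, $\tilde{\theta}$ and $f$ as before):
```python
import numpy as np
from scipy.stats import dirichlet

rng = np.random.default_rng(3)
theta = np.array([1.0, 2.0, 3.0, 4.0])        # target parameters (illustrative)
theta_tilde = np.array([1.5, 2.5, 3.5, 4.5])  # proposal parameters
f = np.array([0.0, 1.0, 4.0, 9.0])            # arbitrary f over K = 4 values

N = 10_000
pi_tilde = rng.dirichlet(theta_tilde, size=N)                    # pi~_n ~ Dir(theta~)
a_tilde = np.array([rng.choice(len(f), p=p) for p in pi_tilde])  # a~_n ~ pi~_n

# joint densities p(a~_n, pi~_n | theta) = Dir(pi~_n; theta) * pi~_n[a~_n],
# and likewise under theta~; the weights now depend on both a and pi
num = np.array([dirichlet.pdf(p, theta) * p[k] for p, k in zip(pi_tilde, a_tilde)])
den = np.array([dirichlet.pdf(p, theta_tilde) * p[k] for p, k in zip(pi_tilde, a_tilde)])
w = num / den

F_is2 = (w * f[a_tilde]).mean()
print(F_is2, f @ (theta / theta.sum()))   # IS2 estimate vs. exact value
```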
## Questions
### Closed form importance weights in each estimator
### Which estimator is better?
### How do the importance weights behave in the two estimators?
### Which estimator has lower variance?