# Importance Weighting Problem
Let
$f:\{1,\ldots,K\}\rightarrow \mathbb{R}$ be a scalar function over $K$ values.
Let $\theta$ be the parameters of a $K$-dimensional Dirichlet distribution, and let $\pi$ be a draw from it (the sample $\pi$ is itself a probability distribution over $K$ outcomes, such that $\sum_{k=1}^{K} \pi_{k}=1$):
$$
\pi \sim \mathcal{Dir}(\theta)
$$
Let $a$ be a sample from $\pi$, that is
$$
a \sim \pi
$$
It's not difficult to see that the marginal distribution of $a$ is
\begin{align}
p(a\vert \theta) &= \int p(a\vert \pi)p(\pi\vert \theta)\text{d}\pi \\
&= \mathbb{E}_{\pi \sim \mathcal{Dir}(\theta)} \pi_a \\
&= \frac{\theta_a}{\sum_{k=1}^K \theta_k}
\end{align}
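As a quick numerical sanity check of this closed form, here is a minimal sketch in Python/NumPy (the particular $K$ and $\theta$ are illustrative choices of mine, not part of the problem):
```python
import numpy as np

rng = np.random.default_rng(0)
theta = np.array([1.0, 2.0, 3.0, 4.0])   # illustrative Dirichlet parameters, K = 4

pi = rng.dirichlet(theta, size=20_000)                   # pi_n ~ Dir(theta)
a = np.array([rng.choice(len(theta), p=p) for p in pi])  # a_n ~ pi_n

print(np.bincount(a, minlength=len(theta)) / len(a))  # empirical marginal of a
print(theta / theta.sum())                            # closed form theta_a / sum_k theta_k
```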
### Estimating expectation
I'm interested in estimating
$$
F(\theta) = \mathbb{E}_{a \sim p(a\vert \theta)} f(a)
$$
If I could sample from $p(a\vert \theta)$, this would be easy to estimate as a Monte Carlo average.
$$
F(\theta) = \mathbb{E}_{a \sim p(a\vert \theta)} f(a) \approx \frac{1}{N}\sum_{n=1}^{N} f(a_n)
$$
where
\begin{align}
\pi_n &\sim \mathcal{Dir}(\theta), \text{i.i.d.}\\
a_n &\sim \pi_n
\end{align}
## Importance sampling
However, I will assume that I can't sample from $p(a\vert \theta)$, only from $p(a\vert \tilde{\theta})$ for some other parameter vector $\tilde{\theta}$. So my samples are:
\begin{align}
\tilde{\pi}_n &\sim \mathcal{Dir}(\tilde{\theta}), \text{i.i.d.}\\
\tilde{a}_n &\sim \tilde{\pi}_n
\end{align}
Using importance sampling I can estimate my quantity of interest as follows:
\begin{align}
F(\theta) &= \mathbb{E}_{a \sim p(a\vert \theta)} f(a) \\
&= \sum_{a=1}^{K} p(a \vert \theta)f(a) \\
&= \sum_{a=1}^{K} p(a\vert \tilde{\theta}) \frac{p(a\vert \theta)}{p(a\vert \tilde{\theta})} f(a) \\
&= \mathbb{E}_{a \sim p(a\vert \tilde{\theta})} \frac{p(a\vert \theta)}{p(a\vert \tilde{\theta})} f(a)
\end{align}
Now I have expressed my quantity of interest as an expectation over $p(a\vert \tilde{\theta})$, so I can use my samples $\tilde{a}_n$, as follows:
\begin{align}
F(\theta) &= \mathbb{E}_{a \sim p(a\vert \tilde{\theta})} \frac{p(a\vert \theta)}{p(a\vert \tilde{\theta})} f(a) \\
&\approx \frac{1}{N} \sum_{n=1}^N \frac{p(\tilde{a}_n\vert \theta)}{p(\tilde{a}_n\vert \tilde{\theta})} f(\tilde{a}_n).
\end{align}
So, instead of simply averaging $f(a)$ over the samples, I weight each sample by the _importance weight_ $w(a) = \frac{p(a\vert \theta)}{p(a\vert \tilde{\theta})}$.
This is a widely used method for estimating expectations under one distribution when you only have samples from another. The problem is that it can be a high-variance estimator.
Let's call this estimator $\tilde{F}_{IS}(\theta)$.
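Here is a minimal sketch of $\tilde{F}_{IS}(\theta)$ in Python/NumPy. The particular $\theta$, $\tilde{\theta}$ and $f$ are again illustrative choices, not part of the problem statement; the weights use the closed-form marginal derived above.
```python
import numpy as np

rng = np.random.default_rng(2)
theta = np.array([1.0, 2.0, 3.0, 4.0])        # target parameters (illustrative)
theta_tilde = np.array([1.5, 2.5, 3.5, 4.5])  # proposal we can actually sample from
f = np.array([0.0, 1.0, 4.0, 9.0])            # arbitrary f over K = 4 values

N = 10_000
pi_tilde = rng.dirichlet(theta_tilde, size=N)                    # pi~_n ~ Dir(theta~)
a_tilde = np.array([rng.choice(len(f), p=p) for p in pi_tilde])  # a~_n ~ pi~_n

p_a = theta / theta.sum()               # p(a | theta), closed form from above
q_a = theta_tilde / theta_tilde.sum()   # p(a | theta~)
w = p_a[a_tilde] / q_a[a_tilde]         # importance weights w(a~_n)

F_is = (w * f[a_tilde]).mean()
print(F_is, f @ p_a)   # IS estimate vs. exact value
```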
## Alternative importance sampling
In this example, we sample $a$ in a two-stage process: first we sample $\pi$ and then we sample $a$. The IS estimator above doesn't exploit this; it completely ignores $\pi$. We can construct a similar estimator that does take $\pi$ into account:
\begin{align}
F(\theta) &= \mathbb{E}_{a \sim p(a\vert \theta)} f(a) \\
&= \sum_{a=1}^{K} p(a \vert \theta)f(a) \\
&= \sum_{a=1}^{K} \int p(a, \pi \vert \theta) f(a) \text{d}\pi \\
&= \sum_{a=1}^{K} \int p(a, \pi\vert \tilde{\theta}) \frac{p(a, \pi \vert \theta)}{p(a, \pi\vert \tilde{\theta})} f(a) \text{d}\pi \\
&= \mathbb{E}_{a \sim \pi, \pi \sim \mathcal{Dir}(\tilde{\theta})} \frac{p(a, \pi\vert \theta)}{p(a, \pi\vert \tilde{\theta})} f(a)
\end{align}
Based on this, an empirical estimator can be constructed:
\begin{align}
F(\theta) &= \mathbb{E}_{a \sim p(a\vert \theta)} f(a)\\
&= \mathbb{E}_{a \sim \pi, \pi \sim \mathcal{Dir}(\tilde{\theta})} \frac{p(a, \pi\vert \theta)}{p(a, \pi\vert \tilde{\theta})} f(a) \\
&\approx \frac{1}{N} \sum_{n=1}^{N}\frac{p(\tilde{a}_n, \tilde{\pi}_n\vert \theta)}{p(\tilde{a}_n, \tilde{\pi}_n\vert \tilde{\theta})} f(\tilde{a}_n).
\end{align}
Let's call this estimator $\tilde{F}_{IS2}(\theta)$.
Note that this estimator is very similar to the previous one, except that the importance weights now depend not just on $a$ but also on $\pi$.
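A sketch of $\tilde{F}_{IS2}(\theta)$ along the same lines, now evaluating the joint densities $p(a, \pi \vert \theta) = \mathcal{Dir}(\pi; \theta)\,\pi_a$ with SciPy (same illustrative $\theta$, $\tilde{\theta}$ and $f$ as before):
```python
import numpy as np
from scipy.stats import dirichlet

rng = np.random.default_rng(3)
theta = np.array([1.0, 2.0, 3.0, 4.0])        # target parameters (illustrative)
theta_tilde = np.array([1.5, 2.5, 3.5, 4.5])  # proposal parameters
f = np.array([0.0, 1.0, 4.0, 9.0])            # arbitrary f over K = 4 values

N = 10_000
pi_tilde = rng.dirichlet(theta_tilde, size=N)                    # pi~_n ~ Dir(theta~)
a_tilde = np.array([rng.choice(len(f), p=p) for p in pi_tilde])  # a~_n ~ pi~_n

# joint densities p(a~_n, pi~_n | theta) = Dir(pi~_n; theta) * pi~_n[a~_n],
# and likewise under theta~; the weights now depend on both a and pi
num = np.array([dirichlet.pdf(p, theta) * p[k] for p, k in zip(pi_tilde, a_tilde)])
den = np.array([dirichlet.pdf(p, theta_tilde) * p[k] for p, k in zip(pi_tilde, a_tilde)])
w = num / den

F_is2 = (w * f[a_tilde]).mean()
print(F_is2, f @ (theta / theta.sum()))   # IS2 estimate vs. exact value
```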
## Questions
### Closed form importance weights in each estimator
### Which estimator is better?
### How do the importance weights behave in the two estimators?
### Which estimator has lower variance?