Importance Weighting Problem

Let $f:\{1,\ldots,K\}\rightarrow\mathbb{R}$ be a scalar function over $K$ values.

Let $\theta$ be the parameters of a $K$-dimensional Dirichlet distribution, and $\pi$ a draw from it. (The sample from the Dirichlet, $\pi$, is therefore a probability distribution over $K$ outcomes, such that $\sum_{k=1}^K \pi_k = 1$.)

$$\pi \sim \mathrm{Dir}(\theta)$$

Let $a$ be a sample from $\pi$, that is,

$$a \sim \pi$$

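It's not difficult to see that the marginal distribution of $a$ is

$$p(a|\theta) = \int p(a|\pi)\,p(\pi|\theta)\,d\pi = \mathbb{E}_{\pi\sim\mathrm{Dir}(\theta)}[\pi_a] = \frac{\theta_a}{\sum_{k=1}^K \theta_k}$$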
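As a quick sanity check of this marginal, here is a minimal sketch that compares the closed form $\theta_a / \sum_k \theta_k$ with empirical frequencies from the two-stage sampling process. The function names are my own, outcomes are indexed from 0 rather than 1, and the example values of $\theta$ are arbitrary.

```python
import numpy as np

def marginal_closed_form(theta):
    """p(a | theta) = theta_a / sum_k theta_k."""
    theta = np.asarray(theta, dtype=float)
    return theta / theta.sum()

def marginal_by_simulation(theta, n_samples, rng):
    """Empirical frequencies of a, with pi ~ Dir(theta) and a ~ pi."""
    theta = np.asarray(theta, dtype=float)
    K = theta.size
    pis = rng.dirichlet(theta, size=n_samples)         # shape (n_samples, K)
    # Inverse-CDF draw of a_n ~ Categorical(pi_n), one per row.
    u = rng.random(n_samples)
    a = (pis.cumsum(axis=1) < u[:, None]).sum(axis=1)
    a = np.minimum(a, K - 1)                           # guard against round-off
    return np.bincount(a, minlength=K) / n_samples

rng = np.random.default_rng(0)
theta = [1.0, 2.0, 3.0]                                # arbitrary example values
print(marginal_closed_form(theta))
print(marginal_by_simulation(theta, 200_000, rng))
```

The two printed vectors should agree up to Monte Carlo noise.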

Estimating expectation

I'm interested in estimating

$$F(\theta) = \mathbb{E}_{a\sim p(a|\theta)}[f(a)]$$

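If I could sample from $p(a|\theta)$, this would be easy to estimate as a Monte Carlo average:

$$F(\theta) = \mathbb{E}_{a\sim p(a|\theta)}[f(a)] \approx \frac{1}{N}\sum_{n=1}^N f(a_n)$$

where

$$\pi_n \overset{\text{i.i.d.}}{\sim} \mathrm{Dir}(\theta), \qquad a_n \sim \pi_n$$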
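The plain Monte Carlo average can be sketched as follows. This is only an illustration under my own assumptions: `f_values` is a hypothetical array holding $f(1),\ldots,f(K)$, outcomes are 0-indexed, and `exact_F` uses the marginal $p(a|\theta)=\theta_a/\sum_k\theta_k$ to give a reference value.

```python
import numpy as np

def sample_a(theta, n_samples, rng):
    """Draw pi_n ~ Dir(theta) i.i.d., then a_n ~ pi_n (0-indexed outcomes)."""
    theta = np.asarray(theta, dtype=float)
    pis = rng.dirichlet(theta, size=n_samples)
    u = rng.random(n_samples)
    a = (pis.cumsum(axis=1) < u[:, None]).sum(axis=1)  # inverse-CDF draw
    return np.minimum(a, theta.size - 1)               # guard against round-off

def mc_estimate(f_values, theta, n_samples, rng):
    """(1/N) * sum_n f(a_n) with a_n drawn under theta."""
    f_values = np.asarray(f_values, dtype=float)
    return f_values[sample_a(theta, n_samples, rng)].mean()

def exact_F(f_values, theta):
    """F(theta) = sum_a p(a | theta) f(a), using the closed-form marginal."""
    f_values = np.asarray(f_values, dtype=float)
    theta = np.asarray(theta, dtype=float)
    return float(f_values @ (theta / theta.sum()))

rng = np.random.default_rng(0)
f_values = [1.0, -2.0, 4.0]   # arbitrary example function
theta = [1.0, 2.0, 3.0]       # arbitrary example parameters
print(exact_F(f_values, theta), mc_estimate(f_values, theta, 200_000, rng))
```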

Importance sampling

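However, I will assume that I can't sample from $p(a|\theta)$, only from $p(a|\tilde{\theta})$. So my samples are:

$$\tilde{\pi}_n \overset{\text{i.i.d.}}{\sim} \mathrm{Dir}(\tilde{\theta}), \qquad \tilde{a}_n \sim \tilde{\pi}_n$$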

Using importance sampling I can estimate my quantity of interest as follows:

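$$F(\theta) = \mathbb{E}_{a\sim p(a|\theta)}[f(a)] = \sum_{a=1}^K p(a|\theta)\,f(a) = \sum_{a=1}^K p(a|\tilde{\theta})\,\frac{p(a|\theta)}{p(a|\tilde{\theta})}\,f(a) = \mathbb{E}_{a\sim p(a|\tilde{\theta})}\left[\frac{p(a|\theta)}{p(a|\tilde{\theta})}\,f(a)\right]$$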

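Now I have expressed my quantity of interest as an expectation over $p(a|\tilde{\theta})$, so I can use my samples $\tilde{a}_n$, as follows:

$$F(\theta) = \mathbb{E}_{a\sim p(a|\tilde{\theta})}\left[\frac{p(a|\theta)}{p(a|\tilde{\theta})}\,f(a)\right] \approx \frac{1}{N}\sum_{n=1}^N \frac{p(\tilde{a}_n|\theta)}{p(\tilde{a}_n|\tilde{\theta})}\,f(\tilde{a}_n).$$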

So, instead of simply averaging $f(a)$ over the samples, I weight each sample by the importance weight $w(a) = \frac{p(a|\theta)}{p(a|\tilde{\theta})}$.

This is a widely used method for estimating expectations under one distribution when you only have samples from another. Its drawback is that the resulting estimator can have high variance.

Let's call this estimator $\tilde{F}_{IS}(\theta)$.
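Plugging the marginal $p(a|\theta)=\theta_a/\sum_k\theta_k$ into the weight gives $w(a) = \frac{\theta_a / \sum_k \theta_k}{\tilde{\theta}_a / \sum_k \tilde{\theta}_k}$, so this estimator can be sketched as below. The function names, the 0-indexed outcomes and the example values are my own assumptions, not part of the problem statement.

```python
import numpy as np

def sample_a(theta, n_samples, rng):
    """Draw pi_n ~ Dir(theta) i.i.d., then a_n ~ pi_n (0-indexed outcomes)."""
    theta = np.asarray(theta, dtype=float)
    pis = rng.dirichlet(theta, size=n_samples)
    u = rng.random(n_samples)
    a = (pis.cumsum(axis=1) < u[:, None]).sum(axis=1)  # inverse-CDF draw
    return np.minimum(a, theta.size - 1)               # guard against round-off

def is_weights(theta, theta_tilde):
    """w(a) = p(a | theta) / p(a | theta_tilde) for every outcome a."""
    theta = np.asarray(theta, dtype=float)
    theta_tilde = np.asarray(theta_tilde, dtype=float)
    return (theta / theta.sum()) / (theta_tilde / theta_tilde.sum())

def is_estimate(f_values, theta, theta_tilde, n_samples, rng):
    """(1/N) * sum_n w(a_n) f(a_n), with a_n drawn under theta_tilde."""
    f_values = np.asarray(f_values, dtype=float)
    a = sample_a(theta_tilde, n_samples, rng)
    return np.mean(is_weights(theta, theta_tilde)[a] * f_values[a])

rng = np.random.default_rng(0)
f_values = [1.0, -2.0, 4.0]      # arbitrary example function
theta = [4.0, 5.0, 6.0]          # target parameters (example values)
theta_tilde = [5.0, 5.0, 5.0]    # sampling parameters (example values)
print(is_estimate(f_values, theta, theta_tilde, 200_000, rng))
```

Because $a$ takes only $K$ values, the weights here take at most $K$ distinct values and are bounded.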

Alternative importance sampling

In this example, we sample $a$ in a two-stage process: first we sample $\pi$ and then we sample $a$. The IS estimator above doesn't exploit this; it completely ignores $\pi$. We can construct a similar estimator that does take $\pi$ into account:

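$$F(\theta) = \mathbb{E}_{a\sim p(a|\theta)}[f(a)] = \sum_{a=1}^K p(a|\theta)\,f(a) = \sum_{a=1}^K \int p(a,\pi|\theta)\,f(a)\,d\pi = \sum_{a=1}^K \int p(a,\pi|\tilde{\theta})\,\frac{p(a,\pi|\theta)}{p(a,\pi|\tilde{\theta})}\,f(a)\,d\pi = \mathbb{E}_{a\sim\pi,\ \pi\sim\mathrm{Dir}(\tilde{\theta})}\left[\frac{p(a,\pi|\theta)}{p(a,\pi|\tilde{\theta})}\,f(a)\right]$$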

Based on this, an empirical estimator can be constructed:

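$$F(\theta) = \mathbb{E}_{a\sim p(a|\theta)}[f(a)] = \mathbb{E}_{a\sim\pi,\ \pi\sim\mathrm{Dir}(\tilde{\theta})}\left[\frac{p(a,\pi|\theta)}{p(a,\pi|\tilde{\theta})}\,f(a)\right] \approx \frac{1}{N}\sum_{n=1}^N \frac{p(\tilde{a}_n,\tilde{\pi}_n|\theta)}{p(\tilde{a}_n,\tilde{\pi}_n|\tilde{\theta})}\,f(\tilde{a}_n)$$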

Let's call this estimator $\tilde{F}_{IS2}(\theta)$.

Note that this estimator is very similar to the previous one, except that the importance weights now depend not just on $a$ but on $\pi$ also.
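To make the joint weight concrete: since $p(a,\pi|\theta) = \pi_a\,\mathrm{Dir}(\pi|\theta)$, the factor $\pi_a$ cancels in the ratio, leaving a ratio of Dirichlet densities,

$$\frac{p(a,\pi|\theta)}{p(a,\pi|\tilde{\theta})} = \frac{\mathrm{Dir}(\pi|\theta)}{\mathrm{Dir}(\pi|\tilde{\theta})} = \frac{B(\tilde{\theta})}{B(\theta)}\prod_{k=1}^K \pi_k^{\theta_k-\tilde{\theta}_k}, \qquad B(\theta) = \frac{\prod_k \Gamma(\theta_k)}{\Gamma\left(\sum_k \theta_k\right)}.$$

Below is a minimal sketch of the resulting estimator, computing the weights in log space. The function names, 0-indexed outcomes and example values are my own assumptions, and the sketch assumes no sampled $\pi_k$ is exactly zero (which can happen numerically for very small Dirichlet parameters).

```python
from math import lgamma
import numpy as np

def log_beta(theta):
    """Log of the multivariate Beta function B(theta), the Dirichlet normaliser."""
    return sum(lgamma(t) for t in theta) - lgamma(sum(theta))

def is2_estimate(f_values, theta, theta_tilde, n_samples, rng):
    """(1/N) * sum_n [Dir(pi_n | theta) / Dir(pi_n | theta_tilde)] * f(a_n)."""
    f_values = np.asarray(f_values, dtype=float)
    theta = np.asarray(theta, dtype=float)
    theta_tilde = np.asarray(theta_tilde, dtype=float)
    # Two-stage sampling under theta_tilde: pi_n ~ Dir(theta_tilde), a_n ~ pi_n.
    pis = rng.dirichlet(theta_tilde, size=n_samples)
    u = rng.random(n_samples)
    a = (pis.cumsum(axis=1) < u[:, None]).sum(axis=1)  # inverse-CDF draw
    a = np.minimum(a, theta.size - 1)                  # guard against round-off
    # Log of the joint importance weight; pi_a has cancelled in the ratio.
    log_w = (log_beta(theta_tilde) - log_beta(theta)
             + np.log(pis) @ (theta - theta_tilde))
    return np.mean(np.exp(log_w) * f_values[a])

rng = np.random.default_rng(0)
f_values = [1.0, -2.0, 4.0]      # arbitrary example function
theta = [4.0, 5.0, 6.0]          # target parameters (example values)
theta_tilde = [5.0, 5.0, 5.0]    # sampling parameters (example values)
print(is2_estimate(f_values, theta, theta_tilde, 200_000, rng))
```

With these example values the exact answer is $F(\theta) = (4 - 10 + 24)/15 = 1.2$. Unlike $w(a)$, these weights are continuous functions of $\pi$ and can be unbounded whenever some $\theta_k < \tilde{\theta}_k$.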

Questions

Closed form importance weights in each estimator

Which estimator is better?

How do the importance weights behave in the two estimators?

Which estimator has lower variance?