---
title: Gaussian AT
tags: research
---
# Gaussian-reparameterized Antithetics
:::info
Edited by Prof. Huei-Wen Teng at NYCU and updated on 2021/10/29. All rights reserved.
:::
## Infrastructure
1. [Data and Code](https://drive.google.com/drive/folders/1IsGzzYtKJgg_d7h3KDKMbD80Ohqyx6-t?usp=sharing)
2. [Jay](https://hackmd.io/263BXf8NTmuGhnbmf7saVA)
3. [Tom](https://hackmd.io/iJ0fYbhnRwaoLCNb_fQSGg)
4. [Introduction to deep learning](https://hackmd.io/0eUjIbHSQme8b7QrjAbpcw)
5. [Student reports](https://hackmd.io/RyNu7V3nToaT6Qb2WQujjw)
6. [NCHC high-performance computing](https://hackmd.io/HXY75BRpRzimkWdClAbuLw)
7. [Project report implementation quick-start guide](https://hackmd.io/RyNu7V3nToaT6Qb2WQujjw)
## Debug 2021/12/03
1. Gradient descent computing time? Integration, differentiation.
2. Zero-variance estimator?
3. GAN?
### Questions
1. Why does $V(\theta)$ stay constant across iterations in Example 1?
2. Why can $V(\theta)$ converge to zero in the other cases?
In practice, it takes too long to find $\theta$. Solutions:
1. Reduce $m$ from 100 to 10 and reuse the same results.
### Procedures to debug. Example 1
1. $m = 3$, $d =2$, $r = 1000$, and we will calculate $P(X_1>0, X_2>0)$.
2. Fix $k = 1$, $\theta=(t,t,t,t,t,t)$ is of size $(md\times k) = (3\times 2) \times 1 = 6$:
3. The [data matrix $X$](https://drive.google.com/file/d/1Nsj02rZ8nQs4xCcveS-FMSWrM6gyax8v/view?usp=sharing) is of size $(md\times r)$; each column is one replication, and the first $d$ rows form the first of the $m$ $d$-dimensional samples.
4. Given $\theta$ all zeros, $V(\theta)$?
5. Given $\theta$ all ones, $V(\theta)$?
6. Print the final results of $\theta$.
7. To minimize $V(\theta)$, show the iterations plot.
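As a sanity check for Q4, the steps above can be sketched in NumPy. The mapping from $\theta$ to $\Sigma_\theta$ below (a single equicorrelation scalar, Kronecker-multiplied with $I_d$) is a hypothetical simplification, not the parameterization in the notes; with $\theta = 0$ it reduces to i.i.d. sampling, where the true estimator variance is $p(1-p)/m = 0.0625$ for $p = P(X_1>0, X_2>0) = 0.25$.

```python
import numpy as np

def V(theta, m=3, d=2, r=1000, seed=0):
    """Sample variance over r replications of the MC estimate of
    P(X_1 > 0, X_2 > 0).  Hypothetical parameterization: the scalar theta
    is the correlation between corresponding coordinates of any two of
    the m samples (positive definite for -1/(m-1) < theta < 1)."""
    rng = np.random.default_rng(seed)
    R = np.full((m, m), theta) + (1.0 - theta) * np.eye(m)   # m x m equicorrelation
    Sigma = np.kron(R, np.eye(d))                            # md x md, I_d diagonal blocks
    L = np.linalg.cholesky(Sigma)
    delta = rng.standard_normal((m * d, r))                  # i.i.d. N(0, I) draws
    eps = L @ delta                                          # correlated Gaussians
    x = eps.reshape(m, d, r)                                 # m samples of dim d per column
    mu_hat = (x > 0).all(axis=1).mean(axis=0)                # one MC estimate per replication
    return mu_hat.var(ddof=1)

print(V(0.0))   # i.i.d. case; should land near p(1-p)/m = 0.0625
```

Example 2 only changes the event to $(X_1>0, X_2>1)$, i.e. the thresholds in the indicator.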
| | Teng | Jay | Tom |
| ---|----|---|---|
| Q4 | | 0.0671 | 0.0671|
| Q5 | | 0.1144 | 0.1144|
| Q6 || Q4. $[0, 0, 0, 0, 0, 0]^T$ <br> Q5. $[1, 1, 1, 1, 1, 1]^T$ | Q4. $[0, 0, 0, 0, 0, 0]^T$ <br> Q5. $[1, 1, 1, 1, 1, 1]^T$|
| Q7 |||See below.|
### Tom
***
Target function: $V(\theta) = \cfrac{1}{r-1} \sum\limits_{i=1}^r \left( \hat{\mu}_i - \bar{\mu} \right)^2$, where $\bar{\mu} = \cfrac{1}{r} \sum\limits_{i=1}^r \hat{\mu}_i$.
 
#### Consider $V(\theta(t))$ vs $\theta(t)$, where $\theta(t) = (t,t,t,t,t,t)^T$



#### Simple examples:
A. $f(x) = e^x + 2x \sin(x)$

B. $f(x) = x^3$

***
1. Summarize the GAN model in [Ren et al., 2019, ICML](http://proceedings.mlr.press/v97/ren19b/ren19b.pdf)
2. Review of stochastic optimization
3. Review of complex portfolio management
4. Andrew Lo's paper on the Sharpe ratio
5. Stochastic optimization (presented by Jay earlier)
6. Stochastic optimization (presented by Jay on 2021/8/13)
7. [Python built-in function](https://stackoverflow.com/questions/19070943/numpy-scipy-analog-of-matlabs-fminsearch)
### Procedures to debug. Example 2
1. $m = 3$, $d =2$, $r = 1000$, and we will calculate $P(X_1>0, X_2>1)$.
2. Fix $k = 1$, $\theta$ is of size $(md\times k) = (3\times 2) \times 1 = 6$:
3. The [data matrix $X$](https://drive.google.com/file/d/1Nsj02rZ8nQs4xCcveS-FMSWrM6gyax8v/view?usp=sharing) is of size $(md\times r)$; each column is one replication, and the first $d$ rows form the first of the $m$ $d$-dimensional samples.
4. Given $\theta$ all zeros, $V(\theta)$?
5. Given $\theta$ all ones, $V(\theta)$?
6. Print the final results of $\theta$.
7. To minimize $V(\theta)$, show the iterations plot.
| | Jay | Tom |
|----|---|---|
| Q4 | 0.0257 | 0.0257|
| Q5 | 0.0288 | 0.0288|
| Q6 |Q4. $[0,0,0,0,0,0]^T$ <br> Q5. $[1,1,1,1,1,1]^T$ | Q4. $[0,0,0,0,0,0]^T$ <br> Q5. $[1,1,1,1,1,1]^T$|
| Q7 | |See below.|
***
## Plan
| No | Topic|
|---| ------|
| 1| Debug python codes: $V_\theta$|
| | Check the correctness of $V_\theta$ (with and without common random numbers) |
|2 | Try four examples in Tom's slides (3 simple examples, GARCH option pricing)|
| 2.1 | $P(X_1>0, X_2>1)$ |
| 2.2 | $E[X_1 X_2]$ |
| 2.3 | $E[X_1 + X_2]$ |
| 2.4 | GARCH option pricing for Asian option |
### Algorithm for the method
1. Find $\theta^*$
2. Use the generalized antithetic variable $\mu_{\theta^*}$.
:::info
The following is updated from Jay's slides.
:::
## 1. Problem statement
The goal is to estimate $\mu = \mathbb{E}_{p(x)}[f(x)]$ or, more generally, to optimize the expectation
$$ \mu(\phi) = \mathbb{E}_{p(x)}[f(x;\phi)]$$
To compute the expectation, Monte Carlo is widely used:
$$ \mathbb{E}_{p(x)}[f(x)] \approx \dfrac{1}{m}\sum\limits_{i=1}^mf(x_i) := \hat{\mu}_f(x_{1:m}) $$
For i.i.d. samples $x_{1:m} \sim p(x_{1:m})$,
$$ Var_{p(x_{1:m})}[\hat{\mu}_f(x_{1:m})] = \dfrac{Var_{p(x)}[f(x)]}{m} $$
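The $1/m$ variance scaling can be checked numerically; a minimal sketch with $f(x) = x^2$ under $\mathcal{N}(0,1)$, where $Var_{p}[f] = 2$:

```python
import numpy as np

rng = np.random.default_rng(0)
m, r = 100, 20_000
x = rng.standard_normal((m, r))    # r independent batches of m i.i.d. samples
est = (x**2).mean(axis=0)          # one Monte Carlo estimate per batch; f(x) = x^2

# The estimator's variance should be Var[f]/m = 2/100 = 0.02
print(est.var(ddof=1))
```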
---
## 2. The method
### 2.1. Review on antithetic sampling
#### Antithetic sampling
Classical Monte Carlo uses i.i.d. sampling.
Antithetic sampling forgoes i.i.d. sampling and instead uses a joint distribution $q(x_{1:m})$ s.t.
$$ Var_{q(x_{1:m})}[\hat{\mu}_f(x_{1:m})] < Var_{p(x_{1:m})}[\hat{\mu}_f(x_{1:m})] $$
If $q(x_i) = p(x_i)$, $\forall i=1, \cdots,m$, then the Monte Carlo estimator $\hat{\mu}_f(x_{1:m})$ is unbiased since
$$ \mathbb{E}_{q(x_{1:m})}\left[ \dfrac{1}{m}\sum\limits_{i=1}^m f(x_i) \right] = \dfrac{1}{m}\sum\limits_{i=1}^m \mathbb{E}_{q(x_{1:m})}[f(x_i)] = \mathbb{E}_{p(x)}[f(x)] $$
---
#### An easy example for antithetic sampling
Suppose $p(x)$ is symmetric (e.g., a normal distribution with zero mean) and $f$ is odd. One choice of $q(x_1, x_2)$ is to sample $x_1 \sim p(x)$ and set $x_2 = -x_1$. Then:
1. $q(x_1) = q(x_2) = p(x_1)$
2. $\dfrac{1}{2}(f(x_1)+f(x_2)) = 0 = \mathbb{E}_{p(x)}[f(x)]$; in general,
$Var\left( \dfrac{f(x_1)+f(x_2)}{2} \right) \propto Var(f(x_1))+Var(f(x_2))+2Cov(f(x_1),f(x_2))$, so a negative covariance reduces the variance.
3. However, if $f$ is even, $Cov(f(x_1), f(x_2)) > 0$ and this scheme increases the variance instead.
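A quick numerical illustration of the odd and even cases, with $f = \sin$ and $f(x) = x^2$ under a standard normal:

```python
import numpy as np

rng = np.random.default_rng(1)
x = rng.standard_normal(10_000)

# Odd f: antithetic pairs (x, -x) cancel exactly, so the pair average
# is identically 0 and the estimator has zero variance.
anti_odd = 0.5 * (np.sin(x) + np.sin(-x))
print(anti_odd.var())

# Even f: f(-x) = f(x), so pairing gives no cancellation at all; the
# pair average has the same variance as a single draw (an i.i.d. pair
# would halve it).
anti_even = 0.5 * (x**2 + (-x)**2)
print(anti_even.var(), np.var(x**2))
```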
---
#### Issues for antithetic sampling
1. The decision of the distribution $q(x_{1:m})$ is case by case
2. However, there is one method can applied to all case : sampling without replacement
3. Unfortunately, it reduces variance just by a small amount
---
#### Gaussian-reparameterized Antithetics
Our goal is to define antithetic samplers $q_{\theta}(x_{1:m})$ and choose suitable parameters $\theta$.
The first difficulty is to find an unbiased sampler. That is,
\begin{align}
q_{\theta}(x_i)=p(x_i) \text{ , for all } i = 1, \cdots, m
\end{align}
---
#### Construction for unbiased sampler
Since $F(x)$ is uniformly distributed when $x \sim p(x)$ (the probability integral transform), we can generate $x \sim p(x)$ by the following process:
$$ \epsilon \sim \mathcal{N}(0, I_d) \quad,\quad x = F^{-1}(\Phi(\epsilon)) := g(\epsilon), $$
where $F$ is the cdf of $p(x)$ and $\Phi$ is the cdf of $d$-dim standard Gaussian.
So the expectation is converted to
$$ \mathbb{E}_{p(x)}[f(x)] = \mathbb{E}_{\mathcal{N}(0, I_d)}[f(g(\epsilon))] $$
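For instance, with $p(x)$ the Exp(1) distribution, $g(\epsilon) = F^{-1}(\Phi(\epsilon)) = -\log(1 - \Phi(\epsilon))$. A quick check of this construction, using `math.erf` for $\Phi$:

```python
import numpy as np
from math import erf, sqrt

rng = np.random.default_rng(0)
eps = rng.standard_normal(100_000)                  # eps ~ N(0, 1)

Phi = np.vectorize(lambda t: 0.5 * (1.0 + erf(t / sqrt(2.0))))
u = Phi(eps)                                        # Phi(eps) ~ Uniform(0, 1)
x = -np.log1p(-u)                                   # F^{-1}(u) for Exp(1): x = g(eps)

print(x.mean(), x.var())   # both should be near 1 for Exp(1)
```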
---
#### Definition 1: A Gaussian antithetic of order $m$
Let $x$ be a continuous random vector with density $p(x)$. A *Gaussian antithetic of order $m$ for $p(x)$* is the family of distributions $q_{\theta}(x_1, \cdots, x_m)$ defined by the procedure
$$ (\epsilon_1, \cdots, \epsilon_m) \sim \mathcal{N}(0, \Sigma_{\theta}) \text{ , } x_i=g(\epsilon_i), $$
where $\Sigma_{\theta} \in \Sigma_{\text{unbiased}} := \{ \Sigma \in \mathbb{R}^{md \times md} : \Sigma \succ 0, \Sigma_{\mathcal{I}\mathcal{I}} = I_d \}$.
#### Remarks
1. For any $\Sigma_{\theta} \in \Sigma_{\text{unbiased}}$, the estimator $\hat{\mu}_f(x_{1:m})$ is unbiased
2. If $\Sigma_{\theta} = I_{md}$, the scheme is equivalent to i.i.d. sampling
3. Given a Cholesky decomposition $\Sigma_{\theta} = L_{\theta}L_{\theta}^T$, we can sample $\delta = (\delta_1, \cdots, \delta_m)^T$ from $\mathcal{N}(0, I_{md})$ and set $\epsilon_{1:m} = L_{\theta}\delta$, $x_i = g(\epsilon_i)$
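A minimal sketch of unbiasedness and Cholesky sampling, assuming (hypothetically) an equicorrelated member of $\Sigma_{\text{unbiased}}$:

```python
import numpy as np

d, m, rho = 2, 3, -0.4    # rho must keep Sigma positive definite: -1/(m-1) < rho < 1
Sigma = np.kron(np.full((m, m), rho) + (1 - rho) * np.eye(m), np.eye(d))

# Each d x d diagonal block is I_d, so every marginal q(x_i) is N(0, I_d).
for i in range(m):
    assert np.allclose(Sigma[i*d:(i+1)*d, i*d:(i+1)*d], np.eye(d))

# Sampling via the Cholesky factor.
L = np.linalg.cholesky(Sigma)     # Sigma = L L^T
delta = np.random.default_rng(0).standard_normal((m * d, 100_000))
eps = L @ delta                    # columns of eps are draws of (eps_1, ..., eps_m)

# The empirical covariance of the first sample eps_1 should be close to I_d.
print(np.cov(eps[:d]))
```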
---
#### Learning an Antithetic Distribution
This section deals with the optimization
$$ \min\limits_{\theta} Var_{q_{\theta}(x_{1:m})}[\hat{\mu}_f(x_{1:m})] = \mathbb{E}_{\mathcal{N}(0, \psi(\theta))}[(\hat{\mu}_{f \circ g}(\epsilon_{1:m})-\mu)^2], $$
which is equivalent to
$$ \min\limits_{\theta} tr (Cov_{q_{\theta}(x_{1:m})}[\hat{\mu}_f(x_{1:m})]) = \mathbb{E}_{\mathcal{N}(0, \psi(\theta))} \left[ ||\hat{\mu}_{f \circ g}(\epsilon_{1:m})-\mu|| ^2 \right] $$
---
#### Gradient Based Variance Minimization
Note that our objective function converts to
$$\mathbb{E}_{\delta \sim \mathcal{N}(0, I)}[\phi(L(\theta)\delta)] $$
Suppose $\nabla_{\epsilon} (f \circ g)(\epsilon)$ exists; then we can compute $\nabla_{\epsilon_i} \phi(\epsilon_{1:m})$, so
$$ \nabla_{\theta}\mathbb{E}_{\delta \sim \mathcal{N}(0, I)}[\phi(L(\theta)\delta)] = \mathbb{E}_{\delta \sim \mathcal{N}(0, I)}[\nabla_{\theta} \phi(L(\theta)\delta)] $$
and we can minimize it by SGD.
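A toy pathwise-gradient SGD run, under simplifying assumptions not in the slides: $m = 2$, $d = 1$, $f = g = \mathrm{id}$, so $\mu = 0$ and $\phi(\epsilon_{1:2}) = \hat{\mu}^2$, with $L(\theta)$ the Cholesky factor of a $2\times 2$ correlation matrix with off-diagonal $\theta$ (which keeps both marginals $\mathcal{N}(0,1)$):

```python
import numpy as np

# L(theta) is the Cholesky factor of [[1, theta], [theta, 1]]:
# eps1 = delta1, eps2 = theta*delta1 + sqrt(1 - theta^2)*delta2.
rng = np.random.default_rng(0)
theta, lr = 0.0, 0.05

for _ in range(500):
    delta = rng.standard_normal((2, 64))
    s = np.sqrt(1.0 - theta**2)
    mu_hat = 0.5 * (delta[0] + theta * delta[0] + s * delta[1])
    # Pathwise gradient: d(phi)/d(theta) = 2*mu_hat * 0.5 * d(eps2)/d(theta)
    deps2 = delta[0] - (theta / s) * delta[1]
    grad = (2.0 * mu_hat * 0.5 * deps2).mean()
    theta = float(np.clip(theta - lr * grad, -0.99, 0.99))

print(theta)   # driven toward -1
```

SGD pushes $\theta$ to the boundary $\theta = -1$, recovering the classical antithetic pair $x_2 = -x_1$ with a zero-variance estimator.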
---
#### Gradient Free Variance Minimization
By the REINFORCE (score-function) estimator, $\nabla_{\theta}p_{\theta}(x) = p_{\theta}(x)\nabla_{\theta}\log p_{\theta}(x)$, so
$$ \nabla_{\theta}\mathbb{E}_{\mathcal{N}(0,\Sigma_{\theta})}[\phi(\epsilon_{1:m})] = \mathbb{E}_{\mathcal{N}(0,\Sigma_{\theta})}[\phi(\epsilon_{1:m})\nabla_{\theta}\log \mathcal{N}(0, \Sigma_{\theta})] $$
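A sanity check of the score-function identity in a simpler setting (gradient with respect to a Gaussian mean rather than $\Sigma_{\theta}$): for $x \sim \mathcal{N}(\theta, 1)$ and $\phi(x) = x^2$, we have $\mathbb{E}[x^2] = \theta^2 + 1$ with gradient $2\theta$, and the score is $\nabla_{\theta} \log \mathcal{N}(x; \theta, 1) = x - \theta$.

```python
import numpy as np

rng = np.random.default_rng(0)
theta = 1.5
x = rng.normal(theta, 1.0, 200_000)

score = x - theta                        # grad_theta log N(x; theta, 1)
grad_est = (x**2 * score).mean()         # REINFORCE estimate of grad_theta E[x^2]

print(grad_est)   # should be near 2 * theta = 3.0
```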
---
## 3. Applications
- $d$: dimension
- $m$: sample size
- $r$: number of replications used to estimate the variance

| Method | mean | variance |
| --- | ----| ---- |
| Standard | | |
| AT | | |
| GPAT | | |
### 3.1: Three simple examples
Tom, please help to organize this! :sunflower:
### 3.2: GARCH option pricing
Tom, please help to organize this! :sunflower:
### 3.3. Variational Bayes
In Bayesian estimation, we are usually given a likelihood $p(x|z)$ and a prior $p(z)$; the objective is to estimate $\log p(x)$.
A typical approach:
\begin{align*}
\log p(x) &= \log \mathbb{E}_{p(z)}[p(x|z)] = \log \mathbb{E}_{q(z)}\left[\dfrac{p(x,z)}{q(z)}\right] \\
&\geq \mathbb{E}_{q(z)}\left[\log\dfrac{p(x,z)}{q(z)}\right] := \mathcal{L}(x)
\end{align*}
The quantity $\mathcal{L}(x)$ is called the evidence lower bound (ELBO). We can maximize the ELBO to get the tightest lower bound.
---
Burda et al. proposed using multi-sample importance sampling
$$ \mathcal{L}^{is}(x) = \mathbb{E}_{q(z_1), \cdots, q(z_m)} \left[\log\dfrac{1}{m}\sum\limits_i\dfrac{p(x,z_i)}{q(z_i)}\right] $$
and proved that $\mathcal{L}(x) \leq \mathcal{L}^{is}(x) \leq \log p(x)$.
This paper considers
$$ \mathcal{L}^{anti}(x) = \mathbb{E}_{q_{\theta}(z_1, \cdots, z_m)} \left[\log\dfrac{1}{m}\sum\limits_i\dfrac{p(x,z_i)}{q(z_i)}\right] $$
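The ordering $\mathcal{L}(x) \leq \mathcal{L}^{is}(x) \leq \log p(x)$ can be checked in a conjugate toy model where $\log p(x)$ has closed form: prior $z \sim \mathcal{N}(0,1)$, likelihood $x \mid z \sim \mathcal{N}(z,1)$, observation $x = 1$, and an arbitrary (hypothetical) proposal $q(z) = \mathcal{N}(0.5, 1)$, so that $p(x) = \mathcal{N}(1; 0, 2)$:

```python
import numpy as np

rng = np.random.default_rng(0)
x_obs, m, r = 1.0, 10, 5_000
LOG2PI = np.log(2.0 * np.pi)

def log_joint(z):   # log p(x, z) = log N(x; z, 1) + log N(z; 0, 1)
    return -0.5 * ((x_obs - z)**2 + z**2) - LOG2PI

def log_q(z):       # proposal q(z) = N(0.5, 1)
    return -0.5 * (z - 0.5)**2 - 0.5 * LOG2PI

z = rng.normal(0.5, 1.0, (m, r))                  # m importance samples per replication
logw = log_joint(z) - log_q(z)

elbo = logw.mean()                                # single-sample ELBO L(x)
iwae = np.log(np.exp(logw).mean(axis=0)).mean()   # m-sample bound L^is(x)
log_px = -0.5 * np.log(4.0 * np.pi) - x_obs**2 / 4.0   # exact log N(1; 0, 2)

print(elbo, iwae, log_px)   # elbo <= iwae <= log p(x)
```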
---
### 3.4 Generative Adversarial Networks (GAN)
The typical objective is
$$ \min\limits_{G}\max\limits_{D}\mathcal{L}(G,D) = \mathbb{E}_{p_{\text{data}}(x)}[D(x)] - \mathbb{E}_{p(z)}[D(G(z))], $$which is usually solved by SGD.
The gradients are computed by Monte Carlo:
$$ \nabla_G \mathcal{L}(G,D) \approx - \dfrac{1}{m}\sum\limits_i \nabla_GD(G(z_i)) $$
$$ \nabla_D \mathcal{L}(G,D) \approx \dfrac{1}{m}\sum\limits_i \nabla_D D(x_i) -\dfrac{1}{m}\sum\limits_i \nabla_D D(G(z_i)) $$
---
## Appendix A.
1. [CS229](https://cs229.stanford.edu/)
2. Ren, Zhao, and Ermon (2019), "Adaptive Antithetic Sampling for Variance Reduction," ICML.
3. [張潼](http://tongzhang-ml.org/publication.html) M. Yang, A. Milzarek, Z. Wen, and T. Zhang, "A Stochastic Extra-Step Quasi-Newton Method for Nonsmooth Nonconvex Optimization," Mathematical Programming, 2021.