---
title: Gaussian AT
tags: research
---

# Gaussian-reparameterized Antithetics

:::info
Edited by Prof. Huei-Wen Teng at NYCU and updated on 2021/10/29. All rights reserved.
:::

## Infrastructure

1. [Data and Code](https://drive.google.com/drive/folders/1IsGzzYtKJgg_d7h3KDKMbD80Ohqyx6-t?usp=sharing)
2. [Jay](https://hackmd.io/263BXf8NTmuGhnbmf7saVA)
3. [Tom](https://hackmd.io/iJ0fYbhnRwaoLCNb_fQSGg)
4. [Introduction to deep learning](https://hackmd.io/0eUjIbHSQme8b7QrjAbpcw)
5. [Student reports](https://hackmd.io/RyNu7V3nToaT6Qb2WQujjw)
6. [NCHC high-performance computing](https://hackmd.io/HXY75BRpRzimkWdClAbuLw)
7. [Project-report implementation quick guide](https://hackmd.io/RyNu7V3nToaT6Qb2WQujjw)

## Debug 2021/12/03

1. Gradient-descent computing time? Integration, differentiation.
2. Zero-variance estimator?
3. GAN?

### Questions

1. Why does $V(\theta)$ stay constant across iterations in Example 1?
2. Why can $V(\theta)$ converge to zero in the other cases?

In practice, it takes too long to find $\theta$. Solutions:

1. Reduce $m$ from 100 to 10 and use the same results.

### Procedures to debug: Example 1

1. $m = 3$, $d = 2$, $r = 1000$, and we calculate $P(X_1>0, X_2>0)$.
2. Fix $k = 1$; $\theta = (t,t,t,t,t,t)$ is of size $(md\times k) = (3\times 2) \times 1 = 6$.
3. The [data matrix $X$](https://drive.google.com/file/d/1Nsj02rZ8nQs4xCcveS-FMSWrM6gyax8v/view?usp=sharing) is of size $(md\times r)$. The first $m$ rows hold the first $d$-dimensional random sample.
4. Given $\theta$ all zeros, what is $V(\theta)$?
5. Given $\theta$ all ones, what is $V(\theta)$?
6. Print the final value of $\theta$.
7. To minimize $V(\theta)$, show the plot of $V(\theta)$ versus iteration. (A code sketch of the $V(\theta)$ computation follows the results table below.)

| | Teng | Jay | Tom |
| --- | --- | --- | --- |
| Q4 | | 0.0671 | 0.0671 |
| Q5 | | 0.1144 | 0.1144 |
| Q6 | | Q4. $[0, 0, 0, 0, 0, 0]^T$ <br> Q5. $[1, 1, 1, 1, 1, 1]^T$ | Q4. $[0, 0, 0, 0, 0, 0]^T$ <br> Q5. $[1, 1, 1, 1, 1, 1]^T$ |
| Q7 | | | See below. |
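
As a sanity check for Q4/Q5: $V(\theta)$ is the sample variance of $\hat{\mu}$ across the $r$ replications, i.e. the target function in Tom's section below. This note does not spell out how $\theta$ parameterizes the sampler, so the following is only a minimal sketch in plain `numpy` for the Example 1 setting with the i.i.d. standard-normal sampler (the identity-covariance case of Section 2); `draw_samples` and `mu_hat` are hypothetical helper names, and the project's $\theta$-dependent sampler would replace `draw_samples`.

```python
import numpy as np

rng = np.random.default_rng(2021)
m, d, r = 3, 2, 1000          # Example 1 setting

def draw_samples():
    """One replication: m i.i.d. d-dimensional standard-normal samples (placeholder sampler)."""
    return rng.standard_normal((m, d))

def mu_hat(x):
    """Estimate P(X1 > 0, X2 > 0) from one replication x of shape (m, d)."""
    return np.mean((x[:, 0] > 0) & (x[:, 1] > 0))

mu = np.array([mu_hat(draw_samples()) for _ in range(r)])
V = mu.var(ddof=1)            # V(theta) = (1/(r-1)) * sum_i (mu_hat_i - mu_bar)^2
print(f"mu_bar = {mu.mean():.4f}, V = {V:.4f}")
```
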

### Tom

***

Target function: $V(\theta) = \cfrac{1}{r-1} \sum\limits_{i=1}^r \left( \hat{\mu}_i - \bar{\mu} \right)^2$, where $\bar{\mu} = \cfrac{1}{r} \sum\limits_{i=1}^r \hat{\mu}_i$.

![](https://i.imgur.com/saMTvqY.png =350x) ![](https://i.imgur.com/55XxmvS.png =350x)

#### Consider $V(\theta(t))$ vs $\theta(t)$, where $\theta(t) = (t,t,t,t,t,t)^T$

![](https://i.imgur.com/YgZ356F.png =350x)![](https://i.imgur.com/GW3BSwt.png =350x)
![](https://i.imgur.com/jnbwkyG.png =350x)![](https://i.imgur.com/7tDKsgE.png =350x)
![](https://i.imgur.com/V80CbGe.png =350x)![](https://i.imgur.com/KFDcPnH.png =350x)

#### Simple examples

A. $f(x) = e^x + 2x \sin(x)$

![](https://i.imgur.com/YnnfmNn.png =350x)![](https://i.imgur.com/QP5LTHO.png =350x)

B. $f(x) = x^3$

![](https://i.imgur.com/gnxgytd.png =350x)![](https://i.imgur.com/jlvvvxk.png =350x)

***

1. Summarize the GAN model in [Ren et al., 2019, ICML](http://proceedings.mlr.press/v97/ren19b/ren19b.pdf)
2. Review of stochastic optimization
3. Review of complex portfolio management
4. Andrew Lo's paper on the Sharpe ratio
5. Stochastic optimization (Jay presented at an earlier time)
6. Stochastic optimization (Jay presented 2021/8/13)
7. [Python built-in function](https://stackoverflow.com/questions/19070943/numpy-scipy-analog-of-matlabs-fminsearch)

### Procedures to debug: Example 2

1. $m = 3$, $d = 2$, $r = 1000$, and we calculate $P(X_1>0, X_2>1)$.
2. Fix $k = 1$; $\theta$ is of size $(md\times k) = (3\times 2) \times 1 = 6$.
3. The [data matrix $X$](https://drive.google.com/file/d/1Nsj02rZ8nQs4xCcveS-FMSWrM6gyax8v/view?usp=sharing) is of size $(md\times r)$. The first $m$ rows hold the first $d$-dimensional random sample.
4. Given $\theta$ all zeros, what is $V(\theta)$?
5. Given $\theta$ all ones, what is $V(\theta)$?
6. Print the final value of $\theta$.
7. To minimize $V(\theta)$, show the plot of $V(\theta)$ versus iteration.

| | Jay | Tom | |
| --- | --- | --- | --- |
| Q4 | 0.0257 | 0.0257 | ![](https://i.imgur.com/dG03myS.png =150x) |
| Q5 | 0.0288 | 0.0288 | ![](https://i.imgur.com/blt44IG.png =200x) |
| Q6 | Q4. $[0,0,0,0,0,0]^T$ <br> Q5. $[1,1,1,1,1,1]^T$ | Q4. $[0,0,0,0,0,0]^T$ <br> Q5. $[1,1,1,1,1,1]^T$ | |
| Q7 | | See below. | |

***

![](https://i.imgur.com/h5h9NIR.png =350x)![](https://i.imgur.com/2qWoAUj.png =350x)

***

## Plan

| No | Topic |
| --- | ------ |
| 1 | Debug Python codes: $V_\theta$ |
| | Check the correctness of $V_\theta$ (with and without common random numbers) |
| 2 | Try four examples in Tom's slides (3 simple examples, GARCH option pricing) |
| 2.1 | $P(X_1>0, X_2>1)$ |
| 2.2 | $E[X_1 X_2]$ |
| 2.3 | $E[X_1 + X_2]$ |
| 2.4 | GARCH option pricing for an Asian option |

### Algorithm for the method

1. Find $\theta^*$.
2. Use the generalized antithetic variable $\mu_{\theta^*}$.

:::info
The following is updated from Jay's slides.
:::

## 1. Problem statement

The goal is to estimate $\mu = \mathbb{E}_{p(x)}[f(x)]$ or, in addition, to optimize the expectation
$$ \mu(\phi) = \mathbb{E}_{p(x)}[f(x;\phi)]. $$
To compute the expectation, Monte Carlo is widely used:
$$ \mathbb{E}_{p(x)}[f(x)] \approx \dfrac{1}{m}\sum\limits_{i=1}^m f(x_i) := \hat{\mu}_f(x_{1:m}). $$
For i.i.d. samples $x_{1:m} \sim p(x_{1:m})$,
$$ Var_{p(x_{1:m})}[\hat{\mu}_f(x_{1:m})] = \dfrac{Var_{p(x)}[f(x)]}{m}. $$

---

## 2. The method

### 2.1. Review of antithetic sampling

#### Antithetic sampling

Classical Monte Carlo uses i.i.d. sampling. Antithetic sampling forgoes i.i.d. sampling and uses a joint distribution $q(x_{1:m})$ such that
$$ Var_{q(x_{1:m})}[\hat{\mu}_f(x_{1:m})] < Var_{p(x_{1:m})}[\hat{\mu}_f(x_{1:m})]. $$
If $q(x_i) = p(x_i)$ for all $i=1, \cdots, m$, then the Monte Carlo estimator $\hat{\mu}_f(x_{1:m})$ is unbiased, since
$$ \mathbb{E}_{q(x_{1:m})}\left[ \dfrac{1}{m}\sum\limits_{i=1}^m f(x_i) \right] = \dfrac{1}{m}\sum\limits_{i=1}^m \mathbb{E}_{q(x_{1:m})}[f(x_i)] = \mathbb{E}_{p(x)}[f(x)]. $$

---

#### An easy example of antithetic sampling

Suppose $p(x)$ is symmetric (e.g., a normal distribution with zero mean) and $f$ is odd. The choice of $q(x_1, x_2)$ is to sample $x_1 \sim p(x)$ and set $x_2 = -x_1$.

1. $q(x_1) = q(x_2) = p(x_1)$, so the estimator is unbiased.
2. $\dfrac{1}{2}(f(x_1)+f(x_2)) = 0 = \mathbb{E}_{p(x)}[f(x)]$, and in general
$$ Var\left( \dfrac{f(x_1)+f(x_2)}{2} \right) \propto Var(f(x_1))+Var(f(x_2))+2Cov(f(x_1),f(x_2)). $$
3. However, if $f$ is even, this pairing increases the variance instead (see the numerical check after the next subsection).

---

#### Issues with antithetic sampling

1. The choice of the distribution $q(x_{1:m})$ is case by case.
2. One method that can be applied to all cases is sampling without replacement.
3. Unfortunately, it reduces the variance only by a small amount.
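
A minimal numerical check of the easy example above, assuming only `numpy`: with $f(x) = x^3$ (simple example B in Tom's section, an odd function) the antithetic pair average is exactly zero, while with the even function $f(x) = x^2$ the same pairing increases the variance, matching points 2 and 3.

```python
import numpy as np

rng = np.random.default_rng(0)
n_pairs = 50_000

def pair_var(f, antithetic):
    """Sample variance of the two-sample mean (f(x1) + f(x2)) / 2 over many pairs."""
    x1 = rng.standard_normal(n_pairs)
    x2 = -x1 if antithetic else rng.standard_normal(n_pairs)
    return ((f(x1) + f(x2)) / 2).var(ddof=1)

f_odd = lambda x: x ** 3    # simple example B: odd, so the antithetic variance is exactly 0
f_even = lambda x: x ** 2   # even: the antithetic pairing increases the variance

for name, f in [("x^3 (odd)", f_odd), ("x^2 (even)", f_even)]:
    print(f"{name:10s}  i.i.d.: {pair_var(f, False):.4f}   antithetic: {pair_var(f, True):.4f}")
```
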

---

#### Gaussian-reparameterized Antithetics

Our goal is to define antithetic samplers $q_{\theta}(x_{1:m})$ and choose suitable parameters $\theta$. The first difficulty is to find an unbiased sampler, that is, one with
\begin{align}
q_{\theta}(x_i)=p(x_i) \text{ for all } i = 1, \cdots, m.
\end{align}

---

#### Construction of an unbiased sampler

Since applying the cdf of a continuous random variable to the variable itself yields a uniform random variable, we can generate $x \sim p(x)$ by the following process:
$$ \epsilon \sim \mathcal{N}(0, I_d), \quad x = F^{-1}(\Phi(\epsilon)) := g(\epsilon), $$
where $F$ is the cdf of $p(x)$ and $\Phi$ is the cdf of the $d$-dimensional standard Gaussian. The expectation is therefore converted to
$$ \mathbb{E}_{p(x)}[f(x)] = \mathbb{E}_{\mathcal{N}(0, I_d)}[f(g(\epsilon))]. $$

---

#### Definition 1: A Gaussian antithetic of order $m$

Let $x$ be a continuous random vector with density $p(x)$. A *Gaussian antithetic of order $m$ for $p(x)$* is the family of distributions $q_{\theta}(x_1, \cdots, x_m)$ defined by the procedure
$$ (\epsilon_1, \cdots, \epsilon_m) \sim \mathcal{N}(0, \Sigma_{\theta}), \quad x_i=g(\epsilon_i), $$
where $\Sigma_{\theta} \in \Sigma_{\text{unbiased}} := \{ \Sigma \in \mathbb{R}^{md \times md} : \Sigma \succ 0, \Sigma_{\mathcal{I}\mathcal{I}} = I_d \}$.

Remarks:
1. For any $\Sigma_{\theta} \in \Sigma_{\text{unbiased}}$, the estimator $\hat{\mu}_f(x_{1:m})$ is unbiased.
2. If $\Sigma_{\theta} = I_{md}$, the procedure is equivalent to i.i.d. sampling.
3. Given a Cholesky decomposition $\Sigma_{\theta} = L_{\theta}L_{\theta}^T$, we can sample $\delta = (\delta_1, \cdots, \delta_m)^T$ from $\mathcal{N}(0, I_{md})$ and set $\epsilon_{1:m} = L_{\theta}\delta$.

---

#### Learning an Antithetic Distribution

This section deals with the optimization problem
$$ \min\limits_{\theta} Var_{q_{\theta}(x_{1:m})}[\hat{\mu}_f(x_{1:m})] = \mathbb{E}_{\mathcal{N}(0, \psi(\theta))}[(\hat{\mu}_{f \circ g}(\epsilon_{1:m})-\mu)^2], $$
which is equivalent to
$$ \min\limits_{\theta} tr (Cov_{q_{\theta}(x_{1:m})}[\hat{\mu}_f(x_{1:m})]) = \mathbb{E}_{\mathcal{N}(0, \psi(\theta))} \left[ \|\hat{\mu}_{f \circ g}(\epsilon_{1:m})-\mu\| ^2 \right]. $$

---

#### Gradient Based Variance Minimization

The objective function converts to
$$ \mathbb{E}_{\delta \sim \mathcal{N}(0, I)}[\phi(L(\theta)\delta)]. $$
Suppose $\nabla_{\epsilon} (f \circ g)(\epsilon)$ exists; then we can compute $\nabla_{\epsilon_i} \phi(\epsilon_{1:m})$, and
$$ \nabla_{\theta}\mathbb{E}_{\delta \sim \mathcal{N}(0, I)}[\phi(L(\theta)\delta)] = \mathbb{E}_{\delta \sim \mathcal{N}(0, I)}[\nabla_{\theta} \phi(L(\theta)\delta)], $$
so the objective can be minimized with SGD (a minimal sketch follows the next subsection).

---

#### Gradient Free Variance Minimization

By the REINFORCE estimator $\nabla_{\theta}p_{\theta}(x) = p_{\theta}(x)\nabla_{\theta}\log p_{\theta}(x)$,
$$ \nabla_{\theta}\mathbb{E}_{\mathcal{N}(0,\Sigma_{\theta})}[\phi(\epsilon_{1:m})] = \mathbb{E}_{\mathcal{N}(0,\Sigma_{\theta})}[\phi(\epsilon_{1:m})\nabla_{\theta}\log \mathcal{N}(\epsilon_{1:m}; 0, \Sigma_{\theta})]. $$
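
To make the gradient-based route concrete, here is a minimal sketch rather than the project's implementation. It assumes PyTorch, takes $p(x)$ to be the standard normal so that $g$ is the identity, uses a $d = 1$, order-$m = 2$ antithetic whose single parameter is the pair correlation $\rho = \tanh(\theta)$ (an illustrative parameterization chosen only so that $\Sigma_\theta$ keeps a unit diagonal), and runs SGD on the variance objective for the test integrand $f(x) = e^x$.

```python
import torch

torch.manual_seed(0)

f = lambda x: torch.exp(x)                  # test integrand; E[exp(X)] = exp(1/2) for X ~ N(0, 1)
mu_true = torch.exp(torch.tensor(0.5))

theta = torch.zeros(1, requires_grad=True)  # unconstrained parameter (illustrative parameterization)
opt = torch.optim.SGD([theta], lr=0.05)

batch, m = 512, 2                           # 512 replications per step, antithetic order m = 2
for step in range(201):
    rho = torch.tanh(theta)                 # pair correlation, kept inside (-1, 1)
    # Cholesky factor of Sigma_theta = [[1, rho], [rho, 1]]; unit diagonal keeps the estimator unbiased
    L = torch.stack([torch.cat([torch.ones(1), torch.zeros(1)]),
                     torch.cat([rho, torch.sqrt(1.0 - rho ** 2)])])
    delta = torch.randn(batch, m)           # delta ~ N(0, I_m)
    eps = delta @ L.T                       # eps_{1:m} = L_theta * delta
    mu_hat = f(eps).mean(dim=1)             # \hat{mu}_{f o g}(eps_{1:m}); g is the identity here
    loss = ((mu_hat - mu_true) ** 2).mean() # Monte Carlo estimate of the variance objective
    opt.zero_grad()
    loss.backward()
    opt.step()
    if step % 50 == 0:
        print(f"step {step:3d}  rho = {rho.item():+.3f}  objective = {loss.item():.4f}")
```

Because the diagonal of $\Sigma_\theta$ equals one for every $\theta$, the estimator stays unbiased throughout training; for this $f$, the learned correlation drifts toward $\rho = -1$, i.e., the classical antithetic pair.
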

---

## 3. Applications

Notation:
- $d$: dimension
- $m$: sample size
- $r$: replication size used to calculate the variance

| Method | mean | variance |
| --- | ---- | ---- |
| Standard | | |
| AT | | |
| GPAT | | |

### 3.1. Three simple examples

Tom, please help to organize this! :sunflower:

### 3.2. GARCH option pricing

Tom, please help to organize this! :sunflower:

### 3.3. Variational Bayes

In Bayesian estimation, we are usually given a likelihood $p(x|z)$ and a prior $p(z)$, and the objective is to estimate $\log p(x)$. The typical way is
\begin{align*}
\log p(x) &= \log \mathbb{E}_{p(z)}[p(x|z)] = \log \mathbb{E}_{q(z)}\left[\dfrac{p(x,z)}{q(z)}\right] \\
&\geq \mathbb{E}_{q(z)}\left[\log\dfrac{p(x,z)}{q(z)}\right] := \mathcal{L}(x).
\end{align*}
$\mathcal{L}(x)$ is called the evidence lower bound (ELBO). We can maximize the ELBO to get the tightest lower bound.

---

Burda et al. proposed multi-sample importance sampling,
$$ \mathcal{L}^{is}(x) = \mathbb{E}_{q(z_1), \cdots, q(z_m)} \left[\log\dfrac{1}{m}\sum\limits_i\dfrac{p(x,z_i)}{q(z_i)}\right], $$
and proved that $\mathcal{L}(x) \leq \mathcal{L}^{is}(x) \leq \log p(x)$. This paper considers
$$ \mathcal{L}^{anti}(x) = \mathbb{E}_{q_{\theta}(z_1, \cdots, z_m)} \left[\log\dfrac{1}{m}\sum\limits_i\dfrac{p(x,z_i)}{q(z_i)}\right]. $$

---

### 3.4. Generative Adversarial Networks (GAN)

The typical objective is
$$ \min\limits_{G}\max\limits_{D}\mathcal{L}(G,D) = \mathbb{E}_{p_{\text{data}}(x)}[D(x)] - \mathbb{E}_{p(z)}[D(G(z))], $$
which is usually solved by SGD. The gradients are computed by Monte Carlo:
$$ \nabla_G \mathcal{L}(G,D) \approx - \dfrac{1}{m}\sum\limits_i \nabla_G D(G(z_i)), $$
$$ \nabla_D \mathcal{L}(G,D) \approx \dfrac{1}{m}\sum\limits_i \nabla_D D(x_i) -\dfrac{1}{m}\sum\limits_i \nabla_D D(G(z_i)). $$

---

## Appendix A.

1. [CS229](https://cs229.stanford.edu/)
2. Ren, Zhao, and Ermon (2019), ICML (linked above).
3. [Tong Zhang](http://tongzhang-ml.org/publication.html): M. Yang, A. Milzarek, Z. Wen, and T. Zhang, "A Stochastic Extra-Step Quasi-Newton Method for Nonsmooth Nonconvex Optimization," Mathematical Programming, 2021.