Literature Reading
Sparse Coding
Autoencoder
Paper
2019, under review: this version has more figures and a more detailed derivation of the formulas (in the appendix).
Variational auto-encoders (VAEs) offer a tractable approach when performing approximate inference in otherwise intractable generative models. However, standard VAEs often produce latent codes that are disperse and lack interpretability, thus making the resulting representations unsuitable for auxiliary tasks (e.g. classification) and human interpretation. We address these issues by merging ideas from variational auto-encoders and sparse coding, and propose to explicitly model sparsity in the latent space of a VAE with a Spike and Slab prior distribution. We derive the evidence lower bound using a discrete mixture recognition function thereby making approximate posterior inference as computationally efficient as in the standard VAE case. With the new approach, we are able to infer truly sparse representations with generally intractable non-linear probabilistic models. We show that these sparse representations are advantageous over standard VAE representations on two benchmark classification tasks (MNIST and Fashion-MNIST) by demonstrating improved classification accuracy and significantly increased robustness to the number of latent dimensions. Furthermore, we demonstrate qualitatively that the sparse elements capture subjectively understandable sources of variation.
Figure 1: Schematic of the variational sparse coding model (right) compared with a standard VAE (left). In both cases the observed variables \(x_i\) are assumed to be generated from unobserved variables \(z_i\); variational sparse coding, however, models sparsity in the latent space with a Spike and Slab prior distribution. Each prior is illustrated alongside a sample from the MNIST dataset.
Maximising the ELBO is equivalent to minimising \(D_{KL}\), i.e. pushing the distribution of z toward the distribution we specify (the constraint).
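For reference, this is the standard VAE bound (a standard identity, written in the usual VAE notation rather than the paper's exact symbols):

\[\log p_\theta(x) \;\ge\; \mathcal{L}(\theta,\phi;x) = \mathbb{E}_{q_\phi(z|x)}\bigl[\log p_\theta(x|z)\bigr] - D_{KL}\bigl(q_\phi(z|x)\,\|\,p(z)\bigr),\]

so, with the reconstruction term held comparable, maximising the ELBO drives \(D_{KL}(q_\phi(z|x)\,\|\,p(z))\) down, i.e. pulls the approximate posterior toward the chosen prior.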
The parameter α controls how concentrated the spike distribution is, i.e. the strength of the sparsity.
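A sketch of the prior being referred to (reconstructed from the VSC paper; the exact notation there may differ):

\[p_s(z) = \prod_{j=1}^{J}\Bigl[\alpha\,\mathcal{N}(z_j;0,1) + (1-\alpha)\,\delta(z_j)\Bigr],\]

where \(\delta(\cdot)\) is a point mass (spike) at zero; \(\alpha\) is the prior probability that a dimension is active, so a smaller \(\alpha\) enforces stronger sparsity.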
The recognition function \(q_φ(z|x)\) is chosen to be a discrete mixture model of the form
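As best I can reconstruct it from the paper (the exact symbols may differ; \(\gamma_j\) is the per-dimension spike probability and \(\mu_j, \sigma_j\) the slab parameters produced by the encoder):

\[q_\phi(z|x) = \prod_{j=1}^{J}\Bigl[\gamma_{j}\,\mathcal{N}(z_j;\mu_{j},\sigma_{j}^{2}) + (1-\gamma_{j})\,\delta(z_j)\Bigr].\]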
Derivation of \(D_{KL}(q_φ(z|x)\,\|\,p_{s}(z))\):
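The closed-form result, read back from the `prior_loss` code below (so treat the constants as the code's convention rather than an independently verified derivation):

\[D_{KL}\bigl(q_\phi(z|x)\,\|\,p_s(z)\bigr) = \sum_{j}\Bigl[-\tfrac{\gamma_j}{2}\bigl(1+\log\sigma_j^{2}-\mu_j^{2}-\sigma_j^{2}\bigr) + (1-\gamma_j)\log\tfrac{1-\gamma_j}{1-\alpha} + \gamma_j\log\tfrac{\gamma_j}{\alpha}\Bigr].\]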
# KL divergence between the Spike-and-Slab posterior q_phi(z|x) and the prior p_s(z),
# summed over latent dimensions and averaged over the batch.
# (Assumes `import torch`; self.alpha is the prior spike probability.)
def prior_loss(self, mu, logvar, logspike):
    # see Appendix B from VSC paper / Formula 6
    # Clamp the spike probabilities away from 0 and 1 to keep the logs finite.
    spike = torch.clamp(logspike.exp(), 1e-6, 1.0 - 1e-6)
    # Slab term: Gaussian KL, weighted by each dimension's spike probability.
    prior1 = -0.5 * torch.sum(spike.mul(1 + logvar - mu.pow(2) - logvar.exp()), dim=1)
    # Spike term: KL between the Bernoulli selection variables and the prior alpha.
    prior21 = (1 - spike).mul(torch.log((1 - spike) / (1 - self.alpha)))
    prior22 = spike.mul(torch.log(spike / self.alpha))
    prior2 = torch.sum(prior21 + prior22, dim=1)
    PRIOR = prior1 + prior2  # Slab + Spike KL divergence
    #LOSS = 0.01 * PRIOR
    LOSS = PRIOR.mean()
    return LOSS
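A minimal sketch of how this KL term would be combined with the reconstruction term into the full training loss (the `beta` weight and the Bernoulli reconstruction likelihood are my assumptions, not taken from the snippet above):

```python
import torch.nn.functional as F

def vsc_loss(model, recon_x, x, mu, logvar, logspike, beta=1.0):
    # Bernoulli reconstruction likelihood, summed over pixels, averaged over the batch.
    recon = F.binary_cross_entropy(recon_x, x, reduction="sum") / x.size(0)
    # Spike-and-Slab KL term from prior_loss() above (already batch-averaged).
    kl = model.prior_loss(mu, logvar, logspike)
    return recon + beta * kl
```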
The parameter `c` controls the activation sensitivity (sharpness) of the spike selection function:
def reparameterize(self, mu, logvar, logspike):
    # Standard Gaussian reparameterisation for the slab component.
    std = torch.exp(0.5 * logvar)
    eps = torch.randn_like(std)
    gaussian = eps.mul(std).add_(mu)
    # Relaxed (differentiable) spike selection: a steep sigmoid approximates the
    # binary "keep / zero out" decision for each latent dimension.
    eta = torch.rand_like(std)
    #selection = torch.sigmoid(125 * (eta + logspike.exp() - 1))
    selection = torch.sigmoid(self.c * (eta + logspike.exp() - 1))
    return selection.mul(gaussian)
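A hedged sketch of where `reparameterize` sits in the forward pass; `encode` and `decode` are hypothetical helper names, not taken from the snippet:

```python
def forward(self, x):
    # Hypothetical encoder returning Gaussian parameters and log spike probabilities.
    mu, logvar, logspike = self.encode(x)
    # Sparse latent sample: Gaussian slab gated by the smoothed spike selection.
    z = self.reparameterize(mu, logvar, logspike)
    # Hypothetical decoder mapping the sparse code back to pixel space.
    return self.decode(z), mu, logvar, logspike
```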
What the VLB (variational lower bound) tells us:
Figure 2: When given too many redundant latent dimensions, the VAE actually approximates the distribution worse; in contrast, VSC uses the sparsity constraint to deactivate unused dimensions to 0, confining the distribution of z to a smaller subspace and therefore behaving much more stably.
Unsupervised discovery of interpretable features and controllable generation with high-dimensional data are currently major challenges in machine learning, with applications in data visualisation, clustering and artificial data synthesis. We propose a model based on variational auto-encoders (VAEs) in which interpretation is induced through latent space sparsity with a mixture of Spike and Slab distributions as prior. We derive an evidence lower bound for this model and propose a specific training method for recovering disentangled features as sparse elements in latent vectors. In our experiments, we demonstrate superior disentanglement performance to standard VAE approaches when an estimate of the number of true sources of variation is not available and objects display different combinations of attributes. Furthermore, the new model provides unique capabilities, such as recovering feature exploitation, synthesising samples that share attributes with a given input object and controlling both discrete and continuous features upon generation.
Compared with the 2019 version, this one adds an analysis of feature disentanglement.
Figure 3: Absolute value of correlation between source features and latent variables for the Smiley data set. The x-axis indicates the index of the latent variable. Variables have been permuted to group dimensions that display the highest correlation with the same feature. Feature disentanglement is achieved if each latent variable correlates strongly with one source attribute but weakly with the remaining ones thus leading to a block diagonal structure.
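A minimal sketch of how such a correlation matrix could be computed, assuming `latents` is an N×D array of inferred codes and `features` an N×K array of ground-truth source attributes (names are illustrative only):

```python
import numpy as np

def latent_feature_correlation(latents, features):
    """Absolute Pearson correlation between every source feature and latent dimension."""
    corr = np.zeros((features.shape[1], latents.shape[1]))
    for k in range(features.shape[1]):
        for d in range(latents.shape[1]):
            corr[k, d] = abs(np.corrcoef(features[:, k], latents[:, d])[0, 1])
    # Disentanglement shows up as a (permuted) block-diagonal pattern in this matrix.
    return corr
```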
Hands-On Convolutional Neural Networks with TensorFlow
The KL divergence loss is one that will produce a number indicating how close two distributions are to each other.
The closer two distributions get to each other, the lower the loss becomes. In the following graph, the blue distribution is trying to model the green distribution. As the blue distribution comes closer and closer to the green one, the KL divergence loss will get closer to zero.
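A small toy illustration of this behaviour (my own example, not from the book), using the closed-form KL divergence between two univariate Gaussians:

```python
import numpy as np

def kl_gaussian(mu_q, sigma_q, mu_p, sigma_p):
    """KL(q || p) for q = N(mu_q, sigma_q^2) and p = N(mu_p, sigma_p^2)."""
    return (np.log(sigma_p / sigma_q)
            + (sigma_q ** 2 + (mu_q - mu_p) ** 2) / (2 * sigma_p ** 2)
            - 0.5)

# As the "blue" distribution N(mu, 1) moves toward the fixed "green" N(0, 1),
# the KL divergence loss shrinks toward zero.
for mu in [3.0, 2.0, 1.0, 0.5, 0.0]:
    print(f"mu = {mu:.1f}  KL = {kl_gaussian(mu, 1.0, 0.0, 1.0):.3f}")
```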
Why Is Cross Entropy Equal to KL-Divergence?
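Short answer, for reference: they differ only by a constant. Writing the cross entropy between a fixed target \(p\) and a model \(q\):

\[H(p,q) = -\sum_x p(x)\log q(x) = H(p) + D_{KL}(p\,\|\,q),\]

and since \(H(p)\) does not depend on \(q\), minimising the cross entropy with respect to \(q\) is the same as minimising \(D_{KL}(p\,\|\,q)\).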
Variational Inference is a mathematical method for approximating complex distributions. We posit a family of computationally simple candidate distributions q(z) alongside the complex, intractable posterior p(z|x), and we look for the candidate q*(z) closest to p(z|x); this is equivalent to minimising KL(q(z)‖p(z|x)).
Basic idea
In probabilistic models we often need to approximate intractable probability distributions. In Bayesian statistics, every inference problem about unknown quantities can be viewed as computing a posterior probability, which is usually intractable. MCMC (Markov chain Monte Carlo) can be used for approximation, but for large datasets MCMC is slow; Variational Inference provides a faster and simpler approximate inference method suited to large amounts of data.
The goal of variational inference is to approximate the conditional probability of the latent variables given the observed variables. This is posed as an optimisation problem; specifically, variational inference solves
\[q^{*}(z)= \operatorname*{arg\,min}_{q(z)\in \mathfrak{Q}} \operatorname{KL}\bigl(q(z)\,\|\,p(z|x)\bigr).\]
- Variational inference is therefore equivalent to minimising a KL divergence.
ELBO stands for Evidence Lower Bound. Here the "evidence" refers to the probability density of the data, i.e. of the observed variables.
Suppose \(x=x_{1:n}\) denotes the observed data and \(z=z_{1:m}\) a set of latent variables. Then \(p(z,x)\) is the joint probability, \(p(z\mid x)\) the conditional (posterior) probability, and \(p(x)\) the evidence.
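Writing out the standard decomposition with this notation makes the link explicit:

\[\log p(x) = \underbrace{\mathbb{E}_{q(z)}\bigl[\log p(z,x) - \log q(z)\bigr]}_{\text{ELBO}} + D_{KL}\bigl(q(z)\,\|\,p(z|x)\bigr),\]

and because \(\log p(x)\) does not depend on \(q\), minimising \(D_{KL}(q(z)\,\|\,p(z|x))\) is equivalent to maximising the ELBO.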
How to understand variational inference in a simple, intuitive way?
The original goal is to infer the needed distribution p from the available data. When p is hard to express and cannot be solved for directly, we can try variational inference: find a distribution q that is easy to express and solve, and once the gap between q and p is small enough, q can serve as the approximation to p and be returned as the result. In this process the key point shifts from the inference problem of "finding a distribution" to the optimisation problem of "shrinking a distance".