Rota link
Next up
Past meetings
Auto-encoding variational Bayes.
📅 12 Oct 2020
👤 Feri
📄 paper and 📝 notes
Auto-encoding variational Bayes. (cont'd)
📅 26 Oct 2020
When learning to generate sequences of symbols $x_1, x_2, \ldots, x_T$, we often define a probabilistic generative model, i.e. a probability distribution over sequences $p(x_1, \ldots, x_T)$, by making use of the chain rule of probabilities:
$$
p(x_1, \ldots, x_T) = p(x_1) p(x_2\vert x_1) p(x_3\vert x_1, x_2) \cdots p(x_T\vert x_1, \ldots, x_{T-1})
$$
This makes computational sense because each conditional distribution above is a distribution over a single symbol $x_t$: even though the entire sequence $x_1, \ldots, x_T$ can take combinatorially many values, each component $p(x_t\vert x_1, \ldots, x_{t-1})$ is a distribution over a relatively small number of options and can be modelled easily.
A generative model defined as a product of conditional distributions like this is called autoregressive.
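To make the factorisation concrete, here is a minimal sketch (mine, not from the original notes) of scoring a sequence by summing conditional log-probabilities; `conditional_logits` is a hypothetical stand-in for whatever network (RNN, Transformer, etc.) produces the per-step distribution.
```python
import numpy as np

def log_softmax(logits):
    # numerically stable log-softmax over a vector of logits
    logits = logits - logits.max()
    return logits - np.log(np.exp(logits).sum())

def conditional_logits(prefix, vocab_size, rng):
    # placeholder for a learned model of p(x_t | x_1, ..., x_{t-1});
    # random logits here so the sketch runs end to end
    return rng.normal(size=vocab_size)

def sequence_log_prob(sequence, vocab_size, rng):
    # chain rule: log p(x_1, ..., x_T) = sum_t log p(x_t | x_<t)
    total = 0.0
    for t, x_t in enumerate(sequence):
        logits = conditional_logits(sequence[:t], vocab_size, rng)
        total += log_softmax(logits)[x_t]
    return total

rng = np.random.default_rng(0)
print(sequence_log_prob([2, 0, 1, 3], vocab_size=5, rng=rng))
```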
This short note (and the corresponding colab notebook) is about an example of a Markov chain whose state distribution does not converge to a stationary distribution.
These are probably irrelevant technicalities for reinforcement learning, but an interesting topic to understand nevertheless.
Consider a positive integer $k$ and the homogeneous Markov chain $S_{t}$ on the states $\{0, 1, \ldots, 2k-1\}$ with transition probabilities:
$$
\mathbb{P}(S_{t+1} = (n+1) \bmod 2k \vert S_t = n) = \mathbb{P}(S_{t+1} = (n-1) \bmod 2k \vert S_t = n) = \frac{1}{2}
$$
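A minimal simulation of this chain (my addition; the actual colab may differ) shows the problem: the chain has period 2, so the state distribution keeps alternating between the even and the odd states and never settles on the uniform stationary distribution.
```python
import numpy as np

k = 3                      # the chain lives on {0, ..., 2k-1}
n_states = 2 * k

# transition matrix: move to (n+1) mod 2k or (n-1) mod 2k with prob 1/2 each
P = np.zeros((n_states, n_states))
for n in range(n_states):
    P[n, (n + 1) % n_states] = 0.5
    P[n, (n - 1) % n_states] = 0.5

dist = np.zeros(n_states)
dist[0] = 1.0              # start deterministically in state 0
for t in range(6):
    # the printed distributions alternate between even- and odd-indexed states
    print(t, np.round(dist, 3))
    dist = dist @ P        # state distribution at time t+1
```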
Independent Component Analysis
We are familiar with the problem that high-dimensional data sources (images, video, fMRI, etc.) are redundant. We don't really care about the values of individual pixels/voxels; what we need is some compact representation. Do you know state-space models? Something like that would be great here.
Of course, we already have dimensionality-reduction methods, like Principal Component Analysis (PCA) or Factor Analysis (FA). Why do we need another one, then? Did someone just tweak an existing method a bit to get something published? Fortunately, Independent Component Analysis is much more than that.
Notation
$s$ - signal sources/latents
$x$ - signal mixtures/observations
$y$ - reconstructed signal sources
$g$ - prescribed cumulative distribution function (CDF)
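As a quick illustration of the $s \to x \to y$ pipeline in this notation (my addition, not part of the original note), scikit-learn's FastICA recovers independent, non-Gaussian sources from a linear mixture, up to permutation and scaling:
```python
import numpy as np
from sklearn.decomposition import FastICA

rng = np.random.default_rng(0)
t = np.linspace(0, 8, 2000)

# s: independent, non-Gaussian sources
s = np.c_[np.sign(np.sin(3 * t)), rng.laplace(size=t.size)]

# x: observed mixtures, x = A s
A = np.array([[1.0, 0.5],
              [0.7, 1.2]])
x = s @ A.T

# y: reconstructed sources (up to permutation and scaling)
y = FastICA(n_components=2, random_state=0).fit_transform(x)

# each recovered component should correlate strongly with one true source
print(np.corrcoef(y.T, s.T).round(2))
```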
Jensen-Shannon and the Mutual Information
The Jensen-Shannon divergence between $P$ and $Q$ is defined as
$$
\operatorname{JS}[P,Q] = \frac{1}{2}\left(\operatorname{KL}\left[P\,\middle\|\,\frac{P+Q}{2}\right] + \operatorname{KL}\left[Q\,\middle\|\,\frac{P+Q}{2}\right]\right)
$$
This has a neat information-theoretic interpretation. To see it, consider two urns of balls, one in which the colour of a ball is distributed like $P$ and one in which it is distributed like $Q$.
Consider a random coin flip $Y$ such that $\mathbb{P}[Y=1] = \mathbb{P}[Y=0] = \frac{1}{2}$, and let $X$ be the colour of a ball drawn from the $P$-urn if $Y=1$ and from the $Q$-urn if $Y=0$. Then $\operatorname{JS}[P,Q]$ is exactly the mutual information $\operatorname{I}[X; Y]$ between the drawn colour and the coin flip.
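A quick numerical check of this identity (my addition): compute $\operatorname{JS}[P,Q]$ from the definition and compare it with the mutual information of the joint distribution in which $Y$ is a fair coin and $X$ is drawn from $P$ when $Y=1$ and from $Q$ when $Y=0$.
```python
import numpy as np

def kl(p, q):
    # KL divergence between two discrete distributions (in nats)
    return np.sum(p * np.log(p / q))

P = np.array([0.7, 0.2, 0.1])   # colour distribution in the P-urn
Q = np.array([0.1, 0.3, 0.6])   # colour distribution in the Q-urn
M = 0.5 * (P + Q)

js = 0.5 * kl(P, M) + 0.5 * kl(Q, M)

# mutual information I[X; Y] for Y ~ Bernoulli(1/2), X|Y=1 ~ P, X|Y=0 ~ Q
joint = np.vstack([0.5 * Q, 0.5 * P])          # rows: Y=0, Y=1
marg_x = joint.sum(axis=0)
marg_y = joint.sum(axis=1, keepdims=True)
mi = np.sum(joint * np.log(joint / (marg_y * marg_x)))

print(js, mi)   # the two numbers agree
```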
The Implicit Bias of Gradient Descent on Separable Data
(Soudry et al, 2018)
Shows that gradient descent on linear models with the logistic loss converges in direction to the max-margin (hard-margin SVM) solution on linearly separable data.
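A small demonstration (my sketch, not code from the paper): run plain gradient descent on the logistic loss for a separable 2-D dataset and watch the direction of the weight vector drift towards the hard-margin SVM direction.
```python
import numpy as np
from scipy.special import expit
from sklearn.svm import SVC

rng = np.random.default_rng(0)

# a linearly separable 2-D dataset
X = np.r_[rng.normal([2, 2], 0.5, size=(50, 2)),
          rng.normal([-2, -2], 0.5, size=(50, 2))]
y = np.r_[np.ones(50), -np.ones(50)]

# hard-margin SVM direction (a very large C approximates the hard margin)
w_svm = SVC(kernel="linear", C=1e6).fit(X, y).coef_.ravel()
w_svm /= np.linalg.norm(w_svm)

# plain gradient descent on the (unregularised) logistic loss
w = np.zeros(2)
for step in range(1, 100_001):
    margins = y * (X @ w)
    grad = -(X * (y * expit(-margins))[:, None]).mean(axis=0)
    w -= 0.1 * grad
    if step in (100, 10_000, 100_000):
        # cosine similarity with the max-margin direction creeps towards 1
        print(step, float(w / np.linalg.norm(w) @ w_svm))
```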
Deep Residual Learning for Image Recognition
Reading log/further reading
Papers:
Exploring Randomly Wired Neural Networks for Image Recognition
Densely Connected Convolutional Networks
Residual Networks are Exponential Ensembles of Relatively Shallow Networks
An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale
BlockDrop: Dynamic Inference Paths in Residual Networks
Kalman Filtering - an introduction
Slides
Colab
The need for information fusion
Bayes helps us with his famous theorem:
$$
\mathrm{posterior} = \dfrac{\mathrm{likelihood} \times \mathrm{prior}}{\mathrm{evidence}}
$$
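As a concrete instance of this fusion (a minimal sketch of mine, not the linked Colab): with a one-dimensional Gaussian prior and a Gaussian measurement, the posterior above has a closed form, which is exactly the Kalman update.
```python
# 1-D Kalman-style fusion of a Gaussian prior with a Gaussian measurement.
# prior:       x ~ N(mu_prior, var_prior)
# measurement: z = x + noise, noise ~ N(0, var_meas)

def fuse(mu_prior, var_prior, z, var_meas):
    # the Kalman gain weights the measurement by its relative precision
    K = var_prior / (var_prior + var_meas)
    mu_post = mu_prior + K * (z - mu_prior)
    var_post = (1.0 - K) * var_prior
    return mu_post, var_post

# example: an uncertain prior fused with a fairly precise measurement
print(fuse(mu_prior=0.0, var_prior=4.0, z=1.2, var_meas=1.0))
```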
Jensen's inequality
Let $f$ be a concave function and let $X$ be a random variable. Then:
$$
E[f(X)] \leq f(E[X])
$$
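A quick numerical sanity check (my addition) with the concave function $f = \log$:
```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.exponential(scale=2.0, size=100_000)   # a positive random variable

# for concave f = log, E[f(X)] <= f(E[X])
print(np.mean(np.log(x)), np.log(np.mean(x)))  # the first number is smaller
```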
EM
Suppose we have $N$ observations $x_1, \ldots, x_N$, and we wish to estimate the parameters $\theta$ of a model that maximize the following log likelihood:
$$
\ell(\theta) = \sum_{n=1}^{N} \log p(x_n \vert \theta)
$$
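The note is cut off here, so as an illustration only: a compact EM loop for a one-dimensional mixture of two Gaussians (the choice of model is my assumption, not stated above), alternating responsibilities (E-step) and parameter re-estimation (M-step).
```python
import numpy as np

# EM for a 1-D mixture of two Gaussians (an illustrative choice of model)
rng = np.random.default_rng(0)
x = np.r_[rng.normal(-2, 1, 500), rng.normal(3, 1, 500)]   # N observations

pi, mu, sigma = 0.5, np.array([-1.0, 1.0]), np.array([1.0, 1.0])
for _ in range(50):
    # E-step: responsibilities r[n, k] = p(z_n = k | x_n, current params)
    dens = np.exp(-0.5 * ((x[:, None] - mu) / sigma) ** 2) / (sigma * np.sqrt(2 * np.pi))
    weighted = dens * np.array([1 - pi, pi])
    r = weighted / weighted.sum(axis=1, keepdims=True)
    # M-step: re-estimate the parameters from the responsibilities
    Nk = r.sum(axis=0)
    mu = (r * x[:, None]).sum(axis=0) / Nk
    sigma = np.sqrt((r * (x[:, None] - mu) ** 2).sum(axis=0) / Nk)
    pi = Nk[1] / len(x)

print(mu.round(2), sigma.round(2), round(float(pi), 2))
```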
What is the goal?
The problem the VAE tries to solve is unsupervised learning. We assume that our observations $x_1, \ldots, x_N$ come from some unknown distribution $p_\mathcal{D}(x)$ and were generated independently of each other (i.i.d., independent and identically distributed). Our goal is to approximate the distribution of the data based on the observations, or to describe it with a model $p_\theta(x)$. Here $p_\theta(x)$ is a probability distribution over the space of observations, described by some parameters $\theta$.
One way to do this is to maximize the likelihood of the model, i.e. to look for parameters under which the probability of the observations is maximal:
$$
\theta^{ML} = \operatorname{argmax}_\theta \sum_{n=1}^N \log p_\theta(x_n)
$$
This, however, is usually hard, because evaluating $p_\theta(x_n)$ is only possible for very simple distributions; in more complicated spaces and for more complicated models, maximum likelihood estimation is difficult.
Let $f:\{1,\ldots,K\}\rightarrow \mathbb{R}$ be a scalar function over $K$ values.
Let $\theta$ be the parameters of a $K$-dimensional Dirichlet distribution, and $\pi$ a draw from it. (The sample from the Dirichlet, $\pi$, is therefore a probability distribution over $K$ outcomes, such that $\sum_{k=1}^{K} \pi_{k}=1$.)
$$
\pi \sim \operatorname{Dir}(\theta)
$$
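To ground the notation (my addition; how $f$ and $\pi$ are used later is not shown above), here is a Monte Carlo estimate of $\mathbb{E}_{\pi \sim \operatorname{Dir}(\theta)}\left[\sum_k \pi_k f(k)\right]$, checked against the closed form implied by $\mathbb{E}[\pi_k] = \theta_k / \sum_j \theta_j$:
```python
import numpy as np

theta = np.array([2.0, 1.0, 0.5, 3.0])          # Dirichlet parameters (K = 4)
f = np.array([1.0, -2.0, 0.3, 5.0])             # f over the K values

# Monte Carlo: draw pi ~ Dir(theta) and average f under each draw
pi = np.random.default_rng(0).dirichlet(theta, size=100_000)   # shape (S, K)
mc_estimate = (pi * f).mean(axis=0).sum()

# closed form, since E[pi_k] = theta_k / sum(theta)
exact = (theta / theta.sum()) @ f
print(mc_estimate, exact)
```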
Links
paper
slides
Background (Dimensionality reduction, PCA)
The main motivation for methods like SFA is to map the high-dimensional data $X = \left(x_1, x_2, \dots, x_N \right)$, $(\forall x_i \in \mathbb{R}^n)$, into a more compact representation in the so-called latent space, $Y = \left(y_1, y_2, \dots, y_N \right)$, $(\forall y_i \in \mathbb{R}^k)$, with $k < n$. $N$ denotes the sample size (the sample is independent and identically distributed, or i.i.d. for short).
What is the goal of dimensionality reduction?
The obvious answer is to decrease the dimensionality of the data. However, this answer is incomplete: what we really want is to decrease dimensionality while preserving as much information as possible.
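As a concrete baseline for this $X \to Y$ mapping (my addition), PCA keeps the $k$ directions of largest variance; a minimal NumPy version via the SVD:
```python
import numpy as np

rng = np.random.default_rng(0)
N, n, k = 500, 10, 2

# X: N samples of n-dimensional data with most variance in a few directions
X = rng.normal(size=(N, 3)) @ rng.normal(size=(3, n)) + 0.05 * rng.normal(size=(N, n))

# PCA via SVD of the centred data: keep the top-k right singular vectors
Xc = X - X.mean(axis=0)
U, S, Vt = np.linalg.svd(Xc, full_matrices=False)
Y = Xc @ Vt[:k].T          # latent representation, shape (N, k)

# fraction of the total variance preserved by the k components
print((S[:k] ** 2).sum() / (S ** 2).sum())
```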
Some notation:
$p_\mathcal{D}(x)$: data distribution
$q_\psi(z\vert x)$: representation distribution
$q_\psi(z) = \int p_\mathcal{D}(x)q_\psi(z\vert x)\,dx$: aggregate posterior, the marginal distribution of the representation $Z$
$q_\psi(x\vert z) = \frac{q_\psi(z\vert x)p_\mathcal{D}(x)}{q_\psi(z)}$: "inverted posterior"
Setup
We'll start from just the representation $q_\psi(z\vert x)$, with no generative model of the data. We'd like this representation to satisfy two properties:
Details
Title: Auto-Encoding Variational Bayes
Authors: Diederik P Kingma, Max Welling
Link: https://arxiv.org/abs/1312.6114
Motivation
What is the goal?
The problem the VAE tries to solve is unsupervised learning. We assume that our observations $x_1, \ldots, x_N$ come from some unknown distribution $p_\mathcal{D}(x)$ and were generated independently of each other (i.i.d., independent and identically distributed). Our goal is to approximate the distribution of the data based on the observations, or to describe it with a model $p_\theta(x)$. Here $p_\theta(x)$ is a probability distribution over the space of observations, described by some parameters $\theta$.
What is a partial model?
In this paper, a model that predicts a future observation $y_T$ by conditioning only on the agent's initial state $s_0$ and an action sequence $a_{<T}$ is called a partial model:
$$
q_\theta(y_T\vert a_{<T}, s_0)
$$
This is contrasted with models that also condition explicitly on past observations, as reviewed in the introduction of the paper.
How to train your partial model?