ML Fundamentals Journal Club

@mljc

reading group covering the foundations of machine learning

Public team

Joined on Sep 18, 2020

  • Rota link · Next up · Past meetings: Auto-encoding variational Bayes 📅 12 Oct 2020 👤 Feri 📄 paper and 📝 notes; Auto-encoding variational Bayes (cont'd) 📅 26 Oct 2020
  • Contrastive Learning: Contrastive Learning Inverts the Data Generating Process (Roland/Yash/Wieland); CPC; Contrastive Representation Learning blogpost. Meta-learning (Feri): MAML and Reptile [1, 2]
  • When learning to generate sequences of symbols $x_1, x_2, \ldots, x_T$, we often do so by defining a probabilistic generative model, a probability distribution over sequences $p(x_1, \ldots, x_T)$, using the chain rule of probabilities: $$ p(x_1, \ldots, x_T) = p(x_1) p(x_2\vert x_1) p(x_3\vert x_1, x_2) \cdots p(x_T\vert x_1, \ldots, x_{T-1}) $$ This makes computational sense because each conditional distribution is easier to model: it is a distribution over a single symbol $x_t$, so even though the entire sequence $x_1, \ldots, x_T$ can take combinatorially many values, each component distribution we model, $p(x_t\vert x_1, \ldots, x_{t-1})$, is only a distribution over a relatively small number of options, which can be modelled easily. A generative model that defines a probabilistic model as a product of conditional distributions like this is often called autoregressive (see the sketch below).
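A minimal sketch of what this factorization buys us in practice (my own toy example, not from the note): to sample a sequence we draw one symbol at a time from the conditionals, and the sum of the log-conditionals we used is exactly $\log p(x_1, \ldots, x_T)$ by the chain rule. The bigram table standing in for the conditional model is an assumption for illustration; a real autoregressive model would condition on the whole prefix.

```python
import numpy as np

rng = np.random.default_rng(0)
VOCAB = 4          # number of distinct symbols
T = 10             # sequence length

# Toy stand-in for a learned conditional model p(x_t | x_1, ..., x_{t-1}).
# Here it only looks at the previous symbol (a bigram table); a real
# autoregressive model (RNN, Transformer, ...) would use the whole prefix.
bigram_logits = rng.normal(size=(VOCAB, VOCAB))

def conditional(prefix):
    """Return p(x_t | prefix) as a length-VOCAB probability vector."""
    last = prefix[-1] if prefix else 0
    logits = bigram_logits[last]
    p = np.exp(logits - logits.max())
    return p / p.sum()

def sample_sequence():
    """Ancestral sampling: draw x_1, then x_2 | x_1, then x_3 | x_1, x_2, ..."""
    seq, log_prob = [], 0.0
    for _ in range(T):
        p = conditional(seq)
        x = rng.choice(VOCAB, p=p)
        log_prob += np.log(p[x])   # accumulates log p(x_1, ..., x_T) via the chain rule
        seq.append(int(x))
    return seq, log_prob

print(sample_sequence())
```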
  • This short note (and a corresponding colab notebook) is about an example of a Markov chain whose state distribution does not converge to a stationary distribution. These are probably irrelevant technicalities for reinforcement learning, but an interesting topic to understand nevertheless. Consider an integer $k$ and the homogeneous Markov chain $S_{t}$ with transition probabilities $$ \mathbb{P}(S_{t+1} = (n+1) \bmod 2k \,\vert\, S_t = n) = \mathbb{P}(S_{t+1} = (n-1) \bmod 2k \,\vert\, S_t = n) = \frac{1}{2}, $$ simulated in the snippet below.
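A quick numerical check of the non-convergence (my own sketch, not the note's colab): propagate the state distribution of this random walk on a cycle of length $2k$ and watch the probability mass alternate between the even and the odd states.

```python
import numpy as np

# Random walk on a cycle of even length 2k: from state n, move to (n+1) mod 2k
# or (n-1) mod 2k with probability 1/2 each. The chain is periodic with period 2,
# so started from a single state the distribution keeps alternating between the
# even and the odd states and never converges (even though uniform is stationary).
k = 3
n_states = 2 * k

P = np.zeros((n_states, n_states))
for n in range(n_states):
    P[n, (n + 1) % n_states] = 0.5
    P[n, (n - 1) % n_states] = 0.5

dist = np.zeros(n_states)
dist[0] = 1.0                      # start deterministically in state 0

for t in range(1, 9):
    dist = dist @ P                # one step of the chain
    print(f"t={t}: P(S_t even) = {dist[::2].sum():.1f}")
# Prints 0.0, 1.0, 0.0, 1.0, ...: the state distribution oscillates and has no limit.
```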
  • Independent Component Analysis We are familiar with the problem that our high-dimensional data sources (images, video, fMRI, etc.) are redundant. In fact, we don't care about the values of individual pixels/voxels; we need some compact representation. Do you know state-space models? Well, something like that would be great. Of course, we already have dimensionality-reduction methods such as Principal Component Analysis (PCA) or Factor Analysis (FA). Why do we need another one, then? Did someone just try to get a PhD by tweaking things a bit to get something published? Fortunately, Independent Component Analysis is much more than that. Notation: $s$ - signal sources/latents; $x$ - signal mixtures/observations; $y$ - reconstructed signal sources; $g$ - prescribed CDF
  • $$ Z \sim P_s, \qquad X = A Z, \qquad P_s(Z) = \prod_i P_s(Z_i) $$
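A small sketch of this generative model and its inversion, assuming independent non-Gaussian (Laplace) sources and using scikit-learn's FastICA as the unmixing method; the specific sources and mixing matrix are illustrative choices, not taken from the note.

```python
import numpy as np
from sklearn.decomposition import FastICA

rng = np.random.default_rng(0)

# Independent, non-Gaussian sources Z: P_s(Z) = prod_i P_s(Z_i)
N, d = 5000, 2
Z = rng.laplace(size=(N, d))

# Linear mixing X = A Z (A is the unknown mixing matrix)
A = np.array([[1.0, 0.6],
              [0.4, 1.0]])
X = Z @ A.T

# ICA tries to recover the sources up to permutation and scaling
ica = FastICA(n_components=d, random_state=0)
Y = ica.fit_transform(X)

# Correlation between recovered and true sources should look like a scaled permutation matrix
corr = np.corrcoef(Y.T, Z.T)[:d, d:]
print(np.round(corr, 2))
```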
  • Jensen-Shannon and the Mutual Information The Jensen-Shannon divergence between $Q$ and $P$ is defined as $$ \operatorname{JS}[P,Q] = \frac{1}{2}\left(\operatorname{KL}\left[P\,\middle\|\,\frac{P+Q}{2}\right] + \operatorname{KL}\left[Q\,\middle\|\,\frac{P+Q}{2}\right]\right) $$ This has a neat information-theoretic interpretation (spelled out below). For the explanation, consider two urns of balls, one in which the colour of a ball is distributed like $P$ and one in which it is distributed like $Q$. Consider a random coinflip $Y$ such that $\mathbb{P}[Y=1] = \mathbb{P}[Y=0] = \frac{1}{2}$.
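To spell out where the urn setup is heading (a standard identity, added here to complete the truncated preview): let $X$ be the colour of a ball drawn from the $P$-urn if $Y=1$ and from the $Q$-urn if $Y=0$, so that marginally $X \sim \frac{P+Q}{2}$. Then the mutual information between the coinflip and the colour is exactly the Jensen-Shannon divergence: $$ I(X;Y) = \mathbb{E}_{Y}\,\operatorname{KL}\left[p(X \mid Y)\,\middle\|\,p(X)\right] = \frac{1}{2}\operatorname{KL}\left[P\,\middle\|\,\frac{P+Q}{2}\right] + \frac{1}{2}\operatorname{KL}\left[Q\,\middle\|\,\frac{P+Q}{2}\right] = \operatorname{JS}[P,Q]. $$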
  • The Implicit Bias of Gradient Descent on Separable Data (Soudry et al, 2018) Shows that gradient descent on linear models with the logistic loss converges, in direction, to the max-margin (hard-margin SVM) solution on linearly separable data (see the numerical sketch below).
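A numerical sketch of this claim (my own toy setup, not from the paper): run plain gradient descent on the unregularized logistic loss over separable blobs and compare the normalized weight direction with a hard-margin SVM fit, approximated here by scikit-learn's SVC with a very large C.

```python
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(0)

# Linearly separable 2D data: two Gaussian blobs, labels in {-1, +1}
n = 100
X = np.vstack([rng.normal([2.0, 2.0], 0.5, size=(n, 2)),
               rng.normal([-2.0, -2.0], 0.5, size=(n, 2))])
y = np.hstack([np.ones(n), -np.ones(n)])

# Plain gradient descent on the unregularized logistic loss
w = np.zeros(2)
lr = 0.1
for _ in range(100_000):
    margins = y * (X @ w)
    weights = 1.0 / (1.0 + np.exp(np.minimum(margins, 500)))   # sigmoid(-margin)
    grad = -(X * (y * weights)[:, None]).mean(axis=0)
    w -= lr * grad

# Hard-margin SVM direction (approximated with a very large C)
w_svm = SVC(kernel="linear", C=1e6).fit(X, y).coef_.ravel()

print("GD direction :", w / np.linalg.norm(w))
print("SVM direction:", w_svm / np.linalg.norm(w_svm))
# ||w|| keeps growing, but the normalized direction approaches the max-margin one.
```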
  • Deep Residual Learning for Image Recognition Reading log/further reading. Papers: Exploring Randomly Wired Neural Networks for Image Recognition; Densely Connected Convolutional Networks; Residual Networks are Exponential Ensembles of Relatively Shallow Networks; An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale; BlockDrop: Dynamic Inference Paths in Residual Networks
  • Kalman Filtering - an introduction Slides Colab The need for information fusion Bayes helps us with his famous theorem $$ \mathrm{posterior} = \dfrac{\mathrm{likelihood} \times \mathrm{prior}}{\mathrm{evidence}} $$ (a 1D worked example follows below)
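A minimal 1D sketch of this fusion (my own toy example, assuming a random-walk state with Gaussian process and measurement noise; it follows the usual predict/update split rather than anything specific to the slides):

```python
import numpy as np

rng = np.random.default_rng(0)

# 1D random-walk state with Gaussian process noise, observed with Gaussian noise
q, r = 0.01, 0.5          # process and measurement noise variances
T = 50

# Simulate ground truth and noisy measurements
x_true = np.cumsum(rng.normal(0, np.sqrt(q), T))
z = x_true + rng.normal(0, np.sqrt(r), T)

# Kalman filter: the posterior stays Gaussian, so we only track its mean and variance
mu, P = 0.0, 1.0          # prior belief
estimates = []
for z_t in z:
    # Predict: push the belief through the (identity) dynamics, inflate uncertainty
    mu_pred, P_pred = mu, P + q
    # Update: fuse prediction (prior) with measurement (likelihood), Bayes-style
    K = P_pred / (P_pred + r)          # Kalman gain = how much we trust the measurement
    mu = mu_pred + K * (z_t - mu_pred)
    P = (1 - K) * P_pred
    estimates.append(mu)

print("measurement MSE:", np.mean((z - x_true) ** 2))
print("filtered MSE   :", np.mean((np.array(estimates) - x_true) ** 2))
```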
  • Jensen's inequality Let $f$ be a concave function and let $X$ be a random variable. Then: $$ E[f(X)] \leq f(E[X]) $$ EM Suppose we have $N$ observations and wish to estimate the parameters of a latent-variable model by maximizing the following log likelihood (the Jensen step that EM builds on is spelled out below): $$ \ell(\theta) = \sum_{n=1}^{N} \log p(x_n \mid \theta) = \sum_{n=1}^{N} \log \sum_{z_n} p(x_n, z_n \mid \theta) $$
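To connect the two parts (a standard step, written out here to complete the cut-off preview): introduce any distribution $q(z_n)$ over the latent variables and apply Jensen's inequality to the concave $\log$ to obtain the familiar EM lower bound, $$ \ell(\theta) = \sum_{n=1}^{N} \log \sum_{z_n} q(z_n)\,\frac{p(x_n, z_n \mid \theta)}{q(z_n)} \;\geq\; \sum_{n=1}^{N} \sum_{z_n} q(z_n) \log \frac{p(x_n, z_n \mid \theta)}{q(z_n)}, $$ with equality when $q(z_n) = p(z_n \mid x_n, \theta)$, which is exactly the E-step.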
  • # [Memo] Probabilistic principal component analysis
  • # [Memo] $\beta$-VAE notes
  • # [Memo] Learning Fair Representations
  • What is the goal? The problem that the VAE tries to solve is unsupervised learning. We assume that our observations $x_1, \ldots, x_N$ come from some unknown distribution, $p_\mathcal{D}(x)$, and were generated independently of one another (i.i.d., independent and identically distributed). Our goal is to approximate the distribution of the data based on the observations, or to describe it with a model, $p_\theta(x)$. Here $p_\theta(x)$ is a probability distribution over the space of observations, described by some parameters $\theta$. One way to do this is to maximize the likelihood of the model, i.e. to look for parameters under which the probability of the observations is maximal: $$ \theta^{\mathrm{ML}} = \operatorname{argmax}_\theta \sum_{n=1}^N \log p_\theta(x_n) $$ However, this is generally hard, because evaluating $p_\theta(x_n)$ is only feasible for very simple distributions; in more complicated spaces and with more complicated models, maximum likelihood estimation is difficult (a simple tractable case is sketched below).
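A minimal sketch of the easy case of this objective (my own illustration, not from the note): for a one-dimensional Gaussian model $p_\theta(x)$ with $\theta = (\mu, \sigma)$, the log likelihood can be evaluated and maximized directly.

```python
import numpy as np
from scipy.stats import norm
from scipy.optimize import minimize

rng = np.random.default_rng(0)
x = rng.normal(loc=3.0, scale=2.0, size=1000)   # "observations" from an unknown distribution

# Negative log likelihood of a Gaussian model p_theta(x), theta = (mu, log_sigma)
def nll(theta):
    mu, log_sigma = theta
    return -norm.logpdf(x, loc=mu, scale=np.exp(log_sigma)).sum()

theta_ml = minimize(nll, x0=np.array([0.0, 0.0])).x
print("ML estimate: mu =", theta_ml[0], "sigma =", np.exp(theta_ml[1]))
# Matches the closed-form answer (sample mean and std); for a VAE-style model,
# log p_theta(x) itself is intractable, which is exactly the problem the paper addresses.
```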
  • Let $f:\{1,\ldots,K\}\rightarrow \mathbb{R}$ be a scalar function over $K$ values. Let $\theta$ be the parameters of a $K$-dimensional Dirichlet distribution, and $\pi$ a draw from it (the sample from the Dirichlet, $\pi$, is therefore a probability distribution over $K$ outcomes, such that $\sum_{k=1}^{K} \pi_{k}=1$), as in the sketch below. $$ \pi \sim \mathcal{Dir}(\theta) $$
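A quick sketch of this setup (the particular $\theta$ and $f$ are made up for illustration): draw $\pi$ from the Dirichlet and evaluate $\mathbb{E}_{x\sim\pi}[f(x)] = \sum_k \pi_k f(k)$, both for a single draw and averaged over draws, where the average has a closed form via $\mathbb{E}[\pi_k] = \theta_k / \sum_j \theta_j$.

```python
import numpy as np

rng = np.random.default_rng(0)

K = 4
theta = np.array([2.0, 1.0, 1.0, 0.5])       # Dirichlet parameters (illustrative)
f = np.array([1.0, -1.0, 0.5, 3.0])          # a scalar function over the K outcomes

# One draw pi ~ Dir(theta): a probability distribution over K outcomes, sums to 1
pi = rng.dirichlet(theta)
print("single draw:", pi, "E_pi[f] =", pi @ f)

# Averaging E_pi[f] over many draws of pi matches the closed form using
# E[pi_k] = theta_k / sum_j theta_j
samples = rng.dirichlet(theta, size=100_000)
print("Monte Carlo :", (samples @ f).mean())
print("closed form :", (theta / theta.sum()) @ f)
```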
  • Links paper slides Background (Dimensionality reduction, PCA) The main motivation for methods like SFA is to recover a more compact representation of high-dimensional data $X = \left(x_1, x_2, \dots, x_N \right), \quad (\forall i: x_i \in \mathbb{R}^n)$ in a so-called latent space, $Y = \left(y_1, y_2, \dots, y_N \right), \quad (\forall i: y_i \in \mathbb{R}^k),$ with $k < n$. Here $N$ denotes the sample size (the sample is independent and identically distributed, or i.i.d. for short). What is the goal of dimensionality reduction? The obvious answer is to decrease the dimensionality of the data. However, this answer is insufficient: what we want is to decrease dimensionality while preserving as much information as possible (a PCA baseline is sketched below).
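As a baseline for the PCA part of the background (my own sketch, not taken from the slides): project the data onto the top-$k$ principal directions and check how much of the variance the $k$-dimensional representation retains.

```python
import numpy as np

rng = np.random.default_rng(0)

# High-dimensional data X (N samples in R^n) that actually lives near a k-dim subspace
N, n, k = 1000, 10, 2
latent = rng.normal(size=(N, k))
mixing = rng.normal(size=(k, n))
X = latent @ mixing + 0.1 * rng.normal(size=(N, n))

# PCA via the eigendecomposition of the covariance matrix
Xc = X - X.mean(axis=0)
cov = Xc.T @ Xc / N
eigvals, eigvecs = np.linalg.eigh(cov)          # eigenvalues in ascending order
top = eigvecs[:, -k:]                           # top-k principal directions

Y = Xc @ top                                    # compact k-dimensional representation
explained = eigvals[-k:].sum() / eigvals.sum()
print(f"variance retained by the {k}-dim representation: {explained:.1%}")
```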
  • Some notation: $p_\mathcal{D}(x)$: data distribution; $q_\psi(z\vert x)$: representation distribution; $q_\psi(z) = \int p_\mathcal{D}(x)\,q_\psi(z\vert x)\,dx$: aggregate posterior, the marginal distribution of the representation $Z$; $q_\psi(x\vert z) = \frac{q_\psi(z\vert x)\,p_\mathcal{D}(x)}{q_\psi(z)}$: "inverted posterior" (these definitions are checked in the discrete toy example below). Setup We'll start from just the representation $q_\psi(z\vert x)$, with no generative model of the data. We'd like this representation to satisfy two properties:
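These definitions are easy to check in a discrete toy case (my own illustration, with made-up tables for $p_\mathcal{D}$ and $q_\psi(z\vert x)$):

```python
import numpy as np

# Discrete toy case: 3 possible observations x, 2 possible codes z
p_data = np.array([0.5, 0.3, 0.2])               # p_D(x)
q_z_given_x = np.array([[0.9, 0.1],              # q_psi(z | x), one row per x
                        [0.5, 0.5],
                        [0.2, 0.8]])

# Aggregate posterior: q_psi(z) = sum_x p_D(x) q_psi(z | x)
q_z = p_data @ q_z_given_x
print("aggregate posterior q(z):", q_z)

# "Inverted posterior": q_psi(x | z) = q_psi(z | x) p_D(x) / q_psi(z)
q_x_given_z = (q_z_given_x * p_data[:, None]) / q_z[None, :]
print("q(x | z) columns sum to 1:", q_x_given_z.sum(axis=0))
```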
  • Details Title: Auto-Encoding Variational Bayes Authors: Diederik P Kingma, Max Welling Link: https://arxiv.org/abs/1312.6114 Motivation What is the goal? The problem that the VAE tries to solve is unsupervised learning. We assume that our observations $x_1, \ldots, x_N$ come from some unknown distribution, $p_\mathcal{D}(x)$, and were generated independently of one another (i.i.d., independent and identically distributed). Our goal is to approximate the distribution of the data based on the observations, or to describe it with a model, $p_\theta(x)$. $p_\theta(x)$ is a probability distribution over the space of observations, described by some parameters $\theta$.
  • What is a partial model: In this paper, a model that predicts a future observation $y_T$ by conditioning only on the agent's initial state $s_0$ and an action sequence $a_{<T}$ is called a partial model. $$ q_\theta(y_T\vert a_{<T}, s_0) $$ This is contrasted with models that also condition explicitly on past observations, as reviewed in the intro of the paper. How to train your partial model?