ML Fundamentals Journal Club

@mljc

reading group covering the foundations of machine learning

Public team

Joined on Sep 18, 2020

  • Rota link · Next up · Past meetings: Auto-encoding variational Bayes 📅 12 Oct 2020 👤 Feri 📄 paper and 📝 notes; Auto-encoding variational Bayes (cont'd) 📅 26 Oct 2020
  • Contrastive Learning: Contrastive Learning Inverts the Data Generating Process (Roland/Yash/Wieland); CPC; Contrastive Representation Learning blogpost. Meta-learning (Feri): MAML and Reptile [1, 2]
  • When learning to generate sequences of symbols $x_1, x_2, \ldots, x_T$, we often do so by defining a probabilistic generative model, a probability distribution over sequences $p(x_1, \ldots, x_T)$, using the chain rule of probabilities: $$ p(x_1, \ldots, x_T) = p(x_1) p(x_2\vert x_1) p(x_3\vert x_1, x_2) \cdots p(x_T\vert x_1, \ldots, x_{T-1}) $$ This makes computational sense because each conditional distribution is easier to model: it is a distribution over a single symbol $x_t$, so even though the entire sequence $x_1, \ldots, x_T$ can take combinatorially many values, each component distribution we model, $p(x_t\vert x_1, \ldots, x_{t-1})$, is only a distribution over a relatively small number of options, which can be modelled easily. A generative model that defines a probabilistic model as a product of conditional distributions like this is often called autoregressive (see the sketch below).
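A minimal sketch of what this factorization buys us in practice (my own toy example, not from the note): to sample a sequence we draw one symbol at a time from the conditionals, and the sum of the log-conditionals we used is exactly $\log p(x_1, \ldots, x_T)$ by the chain rule. The bigram table standing in for the conditional model is an assumption for illustration; a real autoregressive model would condition on the whole prefix.

```python
import numpy as np

rng = np.random.default_rng(0)
VOCAB = 4          # number of distinct symbols
T = 10             # sequence length

# Toy stand-in for a learned conditional model p(x_t | x_1, ..., x_{t-1}).
# Here it only looks at the previous symbol (a bigram table); a real
# autoregressive model (RNN, Transformer, ...) would use the whole prefix.
bigram_logits = rng.normal(size=(VOCAB, VOCAB))

def conditional(prefix):
    """Return p(x_t | prefix) as a length-VOCAB probability vector."""
    last = prefix[-1] if prefix else 0
    logits = bigram_logits[last]
    p = np.exp(logits - logits.max())
    return p / p.sum()

def sample_sequence():
    """Ancestral sampling: draw x_1, then x_2 | x_1, then x_3 | x_1, x_2, ..."""
    seq, log_prob = [], 0.0
    for _ in range(T):
        p = conditional(seq)
        x = rng.choice(VOCAB, p=p)
        log_prob += np.log(p[x])   # accumulates log p(x_1, ..., x_T) via the chain rule
        seq.append(int(x))
    return seq, log_prob

print(sample_sequence())
```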
  • This short note (and a corresponding colab notebook) is about an example of a Markov chain whose state distribution does not converge to a stationary distribution. These are probably irrelevant technicalities for reinforcement learning, but an interesting topic to understand nevertheless. Consider an integer $k$ and the homogeneous Markov chain $S_{t}$ with transition probabilities $$ \mathbb{P}(S_{t+1} = (n+1) \bmod 2k \,\vert\, S_t = n) = \mathbb{P}(S_{t+1} = (n-1) \bmod 2k \,\vert\, S_t = n) = \frac{1}{2}, $$ simulated in the snippet below.
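A quick numerical check of the non-convergence (my own sketch, not the note's colab): propagate the state distribution of this random walk on a cycle of length $2k$ and watch the probability mass alternate between the even and the odd states.

```python
import numpy as np

# Random walk on a cycle of even length 2k: from state n, move to (n+1) mod 2k
# or (n-1) mod 2k with probability 1/2 each. The chain is periodic with period 2,
# so started from a single state the distribution keeps alternating between the
# even and the odd states and never converges (even though uniform is stationary).
k = 3
n_states = 2 * k

P = np.zeros((n_states, n_states))
for n in range(n_states):
    P[n, (n + 1) % n_states] = 0.5
    P[n, (n - 1) % n_states] = 0.5

dist = np.zeros(n_states)
dist[0] = 1.0                      # start deterministically in state 0

for t in range(1, 9):
    dist = dist @ P                # one step of the chain
    print(f"t={t}: P(S_t even) = {dist[::2].sum():.1f}")
# Prints 0.0, 1.0, 0.0, 1.0, ...: the state distribution oscillates and has no limit.
```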
  • Independent Component Analysis We are familiar with the problem that our high-dimensional data sources (images, video, fMRI, etc.) are redundant. In fact, we don't care about the values of individual pixels/voxels; we need some compact representation. Do you know state-space models? Well, something like that would be great. Of course, we already have dimensionality-reduction methods such as Principal Component Analysis (PCA) or Factor Analysis (FA). Why do we need another one, then? Did someone just try to get a PhD by tweaking things a bit to get something published? Fortunately, Independent Component Analysis is much more than that. Notation: $s$ - signal sources/latents; $x$ - signal mixtures/observations; $y$ - reconstructed signal sources; $g$ - prescribed CDF
  • $$ Z \sim P_s, \qquad X = A Z, \qquad P_s(Z) = \prod_i P_s(Z_i) $$
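A small sketch of this generative model and its inversion, assuming independent non-Gaussian (Laplace) sources and using scikit-learn's FastICA as the unmixing method; the specific sources and mixing matrix are illustrative choices, not taken from the note.

```python
import numpy as np
from sklearn.decomposition import FastICA

rng = np.random.default_rng(0)

# Independent, non-Gaussian sources Z: P_s(Z) = prod_i P_s(Z_i)
N, d = 5000, 2
Z = rng.laplace(size=(N, d))

# Linear mixing X = A Z (A is the unknown mixing matrix)
A = np.array([[1.0, 0.6],
              [0.4, 1.0]])
X = Z @ A.T

# ICA tries to recover the sources up to permutation and scaling
ica = FastICA(n_components=d, random_state=0)
Y = ica.fit_transform(X)

# Correlation between recovered and true sources should look like a scaled permutation matrix
corr = np.corrcoef(Y.T, Z.T)[:d, d:]
print(np.round(corr, 2))
```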
  • Jensen-Shannon and the Mutual Information The Jensen-Shannon divergence between $Q$ and $P$ is defined as $$ \operatorname{JS}[P,Q] = \frac{1}{2}\left(\operatorname{KL}\left[P\,\middle\|\,\frac{P+Q}{2}\right] + \operatorname{KL}\left[Q\,\middle\|\,\frac{P+Q}{2}\right]\right) $$ This has a neat information-theoretic interpretation (spelled out below). For the explanation, consider two urns of balls, one in which the colour of a ball is distributed like $P$ and one in which it is distributed like $Q$. Consider a random coinflip $Y$ such that $\mathbb{P}[Y=1] = \mathbb{P}[Y=0] = \frac{1}{2}$.
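To spell out where the urn setup is heading (a standard identity, added here to complete the truncated preview): let $X$ be the colour of a ball drawn from the $P$-urn if $Y=1$ and from the $Q$-urn if $Y=0$, so that marginally $X \sim \frac{P+Q}{2}$. Then the mutual information between the coinflip and the colour is exactly the Jensen-Shannon divergence: $$ I(X;Y) = \mathbb{E}_{Y}\,\operatorname{KL}\left[p(X \mid Y)\,\middle\|\,p(X)\right] = \frac{1}{2}\operatorname{KL}\left[P\,\middle\|\,\frac{P+Q}{2}\right] + \frac{1}{2}\operatorname{KL}\left[Q\,\middle\|\,\frac{P+Q}{2}\right] = \operatorname{JS}[P,Q]. $$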
  • The Implicit Bias of Gradient Descent on Separable Data (Soudry et al, 2018) Shows that gradient descent on linear models with the logistic loss converges, in direction, to the max-margin (hard-margin SVM) solution on linearly separable data (see the numerical sketch below).
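A numerical sketch of this claim (my own toy setup, not from the paper): run plain gradient descent on the unregularized logistic loss over separable blobs and compare the normalized weight direction with a hard-margin SVM fit, approximated here by scikit-learn's SVC with a very large C.

```python
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(0)

# Linearly separable 2D data: two Gaussian blobs, labels in {-1, +1}
n = 100
X = np.vstack([rng.normal([2.0, 2.0], 0.5, size=(n, 2)),
               rng.normal([-2.0, -2.0], 0.5, size=(n, 2))])
y = np.hstack([np.ones(n), -np.ones(n)])

# Plain gradient descent on the unregularized logistic loss
w = np.zeros(2)
lr = 0.1
for _ in range(100_000):
    margins = y * (X @ w)
    weights = 1.0 / (1.0 + np.exp(np.minimum(margins, 500)))   # sigmoid(-margin)
    grad = -(X * (y * weights)[:, None]).mean(axis=0)
    w -= lr * grad

# Hard-margin SVM direction (approximated with a very large C)
w_svm = SVC(kernel="linear", C=1e6).fit(X, y).coef_.ravel()

print("GD direction :", w / np.linalg.norm(w))
print("SVM direction:", w_svm / np.linalg.norm(w_svm))
# ||w|| keeps growing, but the normalized direction approaches the max-margin one.
```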
  • Deep Residual Learning for Image Recognition Reading log/further reading. Papers: Exploring Randomly Wired Neural Networks for Image Recognition; Densely Connected Convolutional Networks; Residual Networks are Exponential Ensembles of Relatively Shallow Networks; An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale; BlockDrop: Dynamic Inference Paths in Residual Networks
  • Kalman Filtering - an introduction Slides Colab The need for information fusion Bayes helps us with his famous theorem $$ \mathrm{posterior} = \dfrac{\mathrm{likelihood} \times \mathrm{prior}}{\mathrm{evidence}} $$ (a 1D worked example follows below)
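A minimal 1D sketch of this fusion (my own toy example, assuming a random-walk state with Gaussian process and measurement noise; it follows the usual predict/update split rather than anything specific to the slides):

```python
import numpy as np

rng = np.random.default_rng(0)

# 1D random-walk state with Gaussian process noise, observed with Gaussian noise
q, r = 0.01, 0.5          # process and measurement noise variances
T = 50

# Simulate ground truth and noisy measurements
x_true = np.cumsum(rng.normal(0, np.sqrt(q), T))
z = x_true + rng.normal(0, np.sqrt(r), T)

# Kalman filter: the posterior stays Gaussian, so we only track its mean and variance
mu, P = 0.0, 1.0          # prior belief
estimates = []
for z_t in z:
    # Predict: push the belief through the (identity) dynamics, inflate uncertainty
    mu_pred, P_pred = mu, P + q
    # Update: fuse prediction (prior) with measurement (likelihood), Bayes-style
    K = P_pred / (P_pred + r)          # Kalman gain = how much we trust the measurement
    mu = mu_pred + K * (z_t - mu_pred)
    P = (1 - K) * P_pred
    estimates.append(mu)

print("measurement MSE:", np.mean((z - x_true) ** 2))
print("filtered MSE   :", np.mean((np.array(estimates) - x_true) ** 2))
```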
  • Jensen's inequality Let $f$ be a concave function and let $X$ be a random variable. Then: $$ E[f(X)] \leq f(E[X]) $$ EM Suppose we have $N$ observations and wish to estimate the parameters of a latent-variable model by maximizing the following log likelihood (the Jensen step that EM builds on is spelled out below): $$ \ell(\theta) = \sum_{n=1}^{N} \log p(x_n \mid \theta) = \sum_{n=1}^{N} \log \sum_{z_n} p(x_n, z_n \mid \theta) $$
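To connect the two parts (a standard step, written out here to complete the cut-off preview): introduce any distribution $q(z_n)$ over the latent variables and apply Jensen's inequality to the concave $\log$ to obtain the familiar EM lower bound, $$ \ell(\theta) = \sum_{n=1}^{N} \log \sum_{z_n} q(z_n)\,\frac{p(x_n, z_n \mid \theta)}{q(z_n)} \;\geq\; \sum_{n=1}^{N} \sum_{z_n} q(z_n) \log \frac{p(x_n, z_n \mid \theta)}{q(z_n)}, $$ with equality when $q(z_n) = p(z_n \mid x_n, \theta)$, which is exactly the E-step.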
  • # [Memo] Probabilistic principal component analysis
  • # [Memo] $\beta$-VAE notes
  • # [Memo] Learning Fair Representations
  • What is the goal? The problem that the VAE tries to solve is unsupervised learning. We assume that our observations $x_1, \ldots, x_N$ come from some unknown distribution, $p_\mathcal{D}(x)$, and were generated independently of one another (i.i.d., independent and identically distributed). Our goal is to approximate the distribution of the data based on the observations, or to describe it with a model, $p_\theta(x)$. Here $p_\theta(x)$ is a probability distribution over the space of observations, described by some parameters $\theta$. One way to do this is to maximize the likelihood of the model, i.e. to look for parameters under which the probability of the observations is maximal: $$ \theta^{\mathrm{ML}} = \operatorname{argmax}_\theta \sum_{n=1}^N \log p_\theta(x_n) $$ However, this is generally hard, because evaluating $p_\theta(x_n)$ is only feasible for very simple distributions; in more complicated spaces and with more complicated models, maximum likelihood estimation is difficult (a simple tractable case is sketched below).
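A minimal sketch of the easy case of this objective (my own illustration, not from the note): for a one-dimensional Gaussian model $p_\theta(x)$ with $\theta = (\mu, \sigma)$, the log likelihood can be evaluated and maximized directly.

```python
import numpy as np
from scipy.stats import norm
from scipy.optimize import minimize

rng = np.random.default_rng(0)
x = rng.normal(loc=3.0, scale=2.0, size=1000)   # "observations" from an unknown distribution

# Negative log likelihood of a Gaussian model p_theta(x), theta = (mu, log_sigma)
def nll(theta):
    mu, log_sigma = theta
    return -norm.logpdf(x, loc=mu, scale=np.exp(log_sigma)).sum()

theta_ml = minimize(nll, x0=np.array([0.0, 0.0])).x
print("ML estimate: mu =", theta_ml[0], "sigma =", np.exp(theta_ml[1]))
# Matches the closed-form answer (sample mean and std); for a VAE-style model,
# log p_theta(x) itself is intractable, which is exactly the problem the paper addresses.
```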
  • Let $f:\{1,\ldots,K\}\rightarrow \mathbb{R}$ be a scalar function over $K$ values. Let $\theta$ be the parameters of a $K$-dimensional Dirichlet distribution, and $\pi$ a draw from it (the sample from the Dirichlet, $\pi$, is therefore a probability distribution over $K$ outcomes, such that $\sum_{k=1}^{K} \pi_{k}=1$), as in the sketch below. $$ \pi \sim \mathcal{Dir}(\theta) $$
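A quick sketch of this setup (the particular $\theta$ and $f$ are made up for illustration): draw $\pi$ from the Dirichlet and evaluate $\mathbb{E}_{x\sim\pi}[f(x)] = \sum_k \pi_k f(k)$, both for a single draw and averaged over draws, where the average has a closed form via $\mathbb{E}[\pi_k] = \theta_k / \sum_j \theta_j$.

```python
import numpy as np

rng = np.random.default_rng(0)

K = 4
theta = np.array([2.0, 1.0, 1.0, 0.5])       # Dirichlet parameters (illustrative)
f = np.array([1.0, -1.0, 0.5, 3.0])          # a scalar function over the K outcomes

# One draw pi ~ Dir(theta): a probability distribution over K outcomes, sums to 1
pi = rng.dirichlet(theta)
print("single draw:", pi, "E_pi[f] =", pi @ f)

# Averaging E_pi[f] over many draws of pi matches the closed form using
# E[pi_k] = theta_k / sum_j theta_j
samples = rng.dirichlet(theta, size=100_000)
print("Monte Carlo :", (samples @ f).mean())
print("closed form :", (theta / theta.sum()) @ f)
```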
  • Links paper slides Background (Dimensionality reduction, PCA) The main motivation for methods like SFA is to recover a more compact representation of high-dimensional data $X = \left(x_1, x_2, \dots, x_N \right), \quad (\forall i: x_i \in \mathbb{R}^n)$ in a so-called latent space, $Y = \left(y_1, y_2, \dots, y_N \right), \quad (\forall i: y_i \in \mathbb{R}^k),$ with $k < n$. Here $N$ denotes the sample size (the sample is independent and identically distributed, or i.i.d. for short). What is the goal of dimensionality reduction? The obvious answer is to decrease the dimensionality of the data. However, this answer is insufficient: what we want is to decrease dimensionality while preserving as much information as possible (a PCA baseline is sketched below).
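As a baseline for the PCA part of the background (my own sketch, not taken from the slides): project the data onto the top-$k$ principal directions and check how much of the variance the $k$-dimensional representation retains.

```python
import numpy as np

rng = np.random.default_rng(0)

# High-dimensional data X (N samples in R^n) that actually lives near a k-dim subspace
N, n, k = 1000, 10, 2
latent = rng.normal(size=(N, k))
mixing = rng.normal(size=(k, n))
X = latent @ mixing + 0.1 * rng.normal(size=(N, n))

# PCA via the eigendecomposition of the covariance matrix
Xc = X - X.mean(axis=0)
cov = Xc.T @ Xc / N
eigvals, eigvecs = np.linalg.eigh(cov)          # eigenvalues in ascending order
top = eigvecs[:, -k:]                           # top-k principal directions

Y = Xc @ top                                    # compact k-dimensional representation
explained = eigvals[-k:].sum() / eigvals.sum()
print(f"variance retained by the {k}-dim representation: {explained:.1%}")
```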
  • Some notation: $p_\mathcal{D}(x)$: data distribution; $q_\psi(z\vert x)$: representation distribution; $q_\psi(z) = \int p_\mathcal{D}(x)\,q_\psi(z\vert x)\,dx$: aggregate posterior, the marginal distribution of the representation $Z$; $q_\psi(x\vert z) = \frac{q_\psi(z\vert x)\,p_\mathcal{D}(x)}{q_\psi(z)}$: "inverted posterior" (these definitions are checked in the discrete toy example below). Setup We'll start from just the representation $q_\psi(z\vert x)$, with no generative model of the data. We'd like this representation to satisfy two properties:
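These definitions are easy to check in a discrete toy case (my own illustration, with made-up tables for $p_\mathcal{D}$ and $q_\psi(z\vert x)$):

```python
import numpy as np

# Discrete toy case: 3 possible observations x, 2 possible codes z
p_data = np.array([0.5, 0.3, 0.2])               # p_D(x)
q_z_given_x = np.array([[0.9, 0.1],              # q_psi(z | x), one row per x
                        [0.5, 0.5],
                        [0.2, 0.8]])

# Aggregate posterior: q_psi(z) = sum_x p_D(x) q_psi(z | x)
q_z = p_data @ q_z_given_x
print("aggregate posterior q(z):", q_z)

# "Inverted posterior": q_psi(x | z) = q_psi(z | x) p_D(x) / q_psi(z)
q_x_given_z = (q_z_given_x * p_data[:, None]) / q_z[None, :]
print("q(x | z) columns sum to 1:", q_x_given_z.sum(axis=0))
```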
  • Details Title: Auto-Encoding Variational Bayes Authors: Diederik P Kingma, Max Welling Link: https://arxiv.org/abs/1312.6114 Motivation What is the goal? The problem that the VAE tries to solve is unsupervised learning. We assume that our observations $x_1, \ldots, x_N$ come from some unknown distribution, $p_\mathcal{D}(x)$, and were generated independently of one another (i.i.d., independent and identically distributed). Our goal is to approximate the distribution of the data based on the observations, or to describe it with a model, $p_\theta(x)$. $p_\theta(x)$ is a probability distribution over the space of observations, described by some parameters $\theta$.
  • What is a partial model: In this paper, a model that predicts a future observation $y_T$ by conditioning only on the agent's initial state $s_0$ and an action sequence $a_{<T}$ is called a partial model. $$ q_\theta(y_T\vert a_{<T}, s_0) $$ This is contrasted with models that also condition explicitly on past observations, as reviewed in the intro of the paper. How to train your partial model?