Marc Lelarge

@mlelarge

https://github.com/mlelarge

Joined on Mar 23, 2021

  • written by @marc_lelarge after this thread. :::warning We always assume that the graph we consider is connected (and finite), so that there is always a shortest path between $s$ and $t$. ::: Recap: Dijkstra's shortest-path algorithm. See Chapter 4 of Algorithms by Sanjoy Dasgupta, Christos H. Papadimitriou, and Umesh V. Vazirani. A minimal sketch of the algorithm is given below.
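A minimal Python sketch of Dijkstra's algorithm with a binary heap, for a weighted graph given as an adjacency dictionary (the graph at the end is a made-up example, not taken from the note):

```python
import heapq

def dijkstra(graph, s):
    """Shortest-path distances from s in a graph with non-negative edge weights.

    graph: dict mapping each vertex to a list of (neighbor, weight) pairs.
    Returns a dict of distances from s.
    """
    dist = {s: 0}
    heap = [(0, s)]                      # (distance, vertex) priority queue
    while heap:
        d, u = heapq.heappop(heap)
        if d > dist.get(u, float("inf")):
            continue                     # stale entry, a shorter path was already found
        for v, w in graph[u]:
            if d + w < dist.get(v, float("inf")):
                dist[v] = d + w
                heapq.heappush(heap, (d + w, v))
    return dist

# Toy example (hypothetical graph, not from the note):
g = {"s": [("a", 1), ("b", 4)], "a": [("s", 1), ("b", 2), ("t", 6)],
     "b": [("s", 4), ("a", 2), ("t", 3)], "t": [("a", 6), ("b", 3)]}
print(dijkstra(g, "s"))  # {'s': 0, 'a': 1, 'b': 3, 't': 6}
```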
  • written by @marc_lelarge :::info Cost function: for a partition $C_1, \dots, C_k$ of $n$ observations into $k$ clusters, the quality of the clustering is given by the following cost function: \begin{eqnarray*} \sum_{j=1}^k \sum_{i\in C_j} |x_i-\mu_j|^2, \text{ where } \mu_j =\frac{1}{|C_j|}\sum_{i\in C_j} x_i. \end{eqnarray*} ::: A small numpy sketch of this cost is given below.
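A minimal numpy sketch of this cost function (the data and labels below are made up for illustration):

```python
import numpy as np

def kmeans_cost(X, labels):
    """Sum over clusters of squared distances to the cluster mean.

    X: (n, d) array of observations; labels: (n,) array of cluster indices.
    """
    cost = 0.0
    for j in np.unique(labels):
        Xj = X[labels == j]
        mu_j = Xj.mean(axis=0)            # centroid of cluster j
        cost += ((Xj - mu_j) ** 2).sum()  # squared distances to mu_j
    return cost

# Made-up example: two well-separated clusters in the plane.
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 0.1, (50, 2)), rng.normal(5, 0.1, (50, 2))])
labels = np.array([0] * 50 + [1] * 50)
print(kmeans_cost(X, labels))
```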
  • written by @marc_lelarge Probability recap. We start with real random variables (r.v.). 1- Why is variance positive? Recall that $\text{Var}(X) = \mathbb{E}[X^2] - \mathbb{E}[X]^2$, so $\text{Var}(X)\geq 0$ means that $\mathbb{E}[X^2] \geq \mathbb{E}[X]^2$. Answer: start with $\text{Var}(X) = \mathbb{E}\left[(X-\mathbb{E}[X])^2\right]$, which is the expectation of a non-negative random variable and is therefore non-negative; expanding the square gives the identity above (see the short derivation below).
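For completeness, the expansion written out in full:
\begin{align*}
\mathbb{E}\left[(X-\mathbb{E}[X])^2\right] &= \mathbb{E}\left[X^2 - 2X\,\mathbb{E}[X] + \mathbb{E}[X]^2\right]\\
&= \mathbb{E}[X^2] - 2\,\mathbb{E}[X]^2 + \mathbb{E}[X]^2\\
&= \mathbb{E}[X^2] - \mathbb{E}[X]^2 = \text{Var}(X),
\end{align*}
and since $(X-\mathbb{E}[X])^2 \geq 0$, its expectation is non-negative, hence $\text{Var}(X)\geq 0$ and $\mathbb{E}[X^2] \geq \mathbb{E}[X]^2$.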
  • written by @marc_lelarge (part of the deep learning course) Jacobian. Let ${\bf f}:\mathbb{R}^n\to \mathbb{R}^m$; we define its Jacobian as the $m\times n$ matrix \begin{align*} \frac{\partial {\bf f}}{\partial {\bf x}} = \left(\frac{\partial f_i}{\partial x_j}\right)_{i\in [m],\, j\in [n]}. \end{align*} A small autograd check of this definition is sketched below.
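A small numerical check of the definition using PyTorch's `torch.autograd.functional.jacobian` (the map ${\bf f}$ below is a made-up example, not taken from the note):

```python
import math
import torch

def f(x):
    # Made-up map f: R^3 -> R^2, f(x) = (x_0 * x_1, sin(x_2)).
    return torch.stack([x[0] * x[1], torch.sin(x[2])])

x = torch.tensor([1.0, 2.0, 3.0])
J = torch.autograd.functional.jacobian(f, x)      # shape (m, n) = (2, 3)

# Analytic Jacobian: [[x_1, x_0, 0], [0, 0, cos(x_2)]].
expected = torch.tensor([[2.0, 1.0, 0.0],
                         [0.0, 0.0, math.cos(3.0)]])
print(J)
print(torch.allclose(J, expected))                # True
```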
  • written by @marc_lelarge We are using the Named Tensor Notation from David Chiang, Alexander M. Rush and Boaz Barak to describe the basic (i.e. without position encoding and autoregressive mask) encoder block of Transformers defined in Attention Is All You Need by Vaswani et al. For a presentation of the attention mechanism and transformers, see Module 12 - Attention and Transformers. Basics about Named Tensor Notation: these notations should feel natural as they implement the elementwise operations, broadcasting, reductions and contractions that are natural in numpy and PyTorch. We should perhaps recall that functions from vectors to vectors lift to functions on tensors that operate along one axis but leave the tensor shape unchanged. For example, the softmax function defined by \begin{align*} \text{softmax}(x)_i = \frac{\exp(x_i)}{\sum_j \exp(x_j)} \end{align*} lifts in this way: applied along one axis of a tensor, it normalizes each slice along that axis and leaves the shape unchanged (a small PyTorch illustration is given below).
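A minimal PyTorch sketch of the lifting: softmax applied along one axis of a tensor leaves the shape unchanged and normalizes along that axis only (the tensor below is a made-up example):

```python
import torch

x = torch.randn(2, 3, 5)               # made-up tensor with three axes
y = torch.softmax(x, dim=-1)            # softmax lifted along the last axis

print(y.shape)                          # torch.Size([2, 3, 5]): shape unchanged
print(y.sum(dim=-1))                    # all ones: each slice along the last axis is normalized
print(torch.softmax(x[0, 1], dim=0))    # equals y[0, 1]: the lift acts slice by slice
```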
  • written by @marc_lelarge Consider a graph $G$ with $n$ vertices and symmetric adjacency matrix $A\in \mathbb{R}^{n\times n}$. For $i\in [n]$, we denote by $\lambda_i$ its eigenvalues and by $\psi_i$ the associated (orthonormal) eigenvectors. The empirical measure of $G$ is defined as: \begin{align*} \mu_G = \frac{1}{n}\sum_{i=1}^n \delta_{\lambda_i} \end{align*} The rooted spectral measure of $G$ at vertex $v\in [n]$ is defined by: \begin{align*} \mu_G^v = \sum_{i=1}^n \psi_i(v)^2\, \delta_{\lambda_i}. \end{align*} A small numpy illustration of both measures is given below.
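A small numpy sketch of these two measures, represented as (atoms, weights) pairs, on a made-up graph; it also checks that averaging the rooted measures over all vertices recovers the empirical measure:

```python
import numpy as np

# Made-up symmetric adjacency matrix (a 4-cycle).
A = np.array([[0, 1, 0, 1],
              [1, 0, 1, 0],
              [0, 1, 0, 1],
              [1, 0, 1, 0]], dtype=float)
n = A.shape[0]

lam, psi = np.linalg.eigh(A)        # eigenvalues lam[i], orthonormal eigenvectors psi[:, i]

# Empirical measure: each eigenvalue carries mass 1/n.
empirical_weights = np.full(n, 1.0 / n)

# Rooted spectral measure at vertex v: eigenvalue lam[i] carries mass psi[v, i]**2.
v = 0
rooted_weights = psi[v, :] ** 2
print(lam, rooted_weights, rooted_weights.sum())                # weights sum to 1

# Averaging the rooted measures over v gives back the empirical measure.
print(np.allclose((psi ** 2).mean(axis=0), empirical_weights))  # True
```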
  • written by @marc_lelarge We look at the simplest possible example showing that over-parametrization helps gradient descent. We start with a definition of gradient descent (to set the notation), then study gradient descent for ordinary least squares, and then the particular setting of over-parametrized models. Gradient descent setting: minimize $F:\mathbb{R}^d\to \mathbb{R}$ with the gradient descent (GD) algorithm: pick $\theta_0\in \mathbb{R}^d$ and for $t\geq 1$, let $$ \theta_{t} = \theta_{t-1} - \gamma F'(\theta_{t-1}), $$ where $\gamma>0$ is the step size. A numerical sketch of GD on least squares is given below.
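A minimal numpy sketch of GD applied to ordinary least squares, $F(\theta) = \frac{1}{2}\|X\theta - y\|^2$ (the data and step size below are made-up choices, not the note's setting):

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 50, 5
X = rng.normal(size=(n, d))
theta_star = rng.normal(size=d)
y = X @ theta_star                       # noiseless targets so GD can recover theta_star

def grad_F(theta):
    # Gradient of F(theta) = 0.5 * ||X theta - y||^2.
    return X.T @ (X @ theta - y)

gamma = 1.0 / np.linalg.norm(X, 2) ** 2  # step size 1/L with L the largest eigenvalue of X^T X
theta = np.zeros(d)
for t in range(2000):
    theta = theta - gamma * grad_F(theta)

print(np.linalg.norm(theta - theta_star))  # close to 0
```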
  • written by @marc_lelarge For an order-$k$ tensor $T\in \mathbb{R}^{n^k}$, we define for $\sigma \in \mathcal{S}_n$: \begin{eqnarray*} (\sigma \star T)_{\sigma(i_1),\dots, \sigma(i_k)} = T_{i_1,\dots, i_k}. \end{eqnarray*} Note that a permutation $\sigma\in \mathcal{S}_n$ can act through the operator $\star$ on any tensor $T\in \mathbb{R}^{n^k}$ for any value of $k\geq 1$ (i.e. what matters is that both $n$'s agree). A small numpy illustration for $k=2$ is given below.
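A small numpy sketch of this action for $k=2$ (an adjacency-matrix-like tensor), checking that it agrees with conjugation by the corresponding permutation matrix; the permutation and the tensor are made-up examples:

```python
import numpy as np

n = 5
rng = np.random.default_rng(0)
T = rng.integers(0, 2, size=(n, n))      # made-up order-2 tensor
sigma = rng.permutation(n)               # sigma[i] is the image of i

# (sigma * T)_{sigma(i), sigma(j)} = T_{i, j}
sT = np.empty_like(T)
sT[np.ix_(sigma, sigma)] = T

# Same action via the permutation matrix P with P[sigma[i], i] = 1: sigma * T = P T P^T.
P = np.zeros((n, n), dtype=int)
P[sigma, np.arange(n)] = 1
print(np.array_equal(sT, P @ T @ P.T))   # True
```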
  • written by @marc_lelarge Cost function: for a partition $C_1, \dots, C_k$ of the $n$ observations into $k$ clusters, the quality of the clustering is given by the cost function: \begin{eqnarray*} \sum_{j=1}^k \sum_{i\in C_j} |x_i-\mu_j|^2, \end{eqnarray*} where $\mu_j =\frac{1}{|C_j|}\sum_{i\in C_j} x_i$. The k-means algorithm alternates between assigning each observation to its nearest centroid and updating each centroid to the mean of its cluster; a minimal sketch follows below.
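A minimal numpy sketch of this alternation (the standard Lloyd iteration; the initialization and data below are made-up choices, and ties or empty clusters are not handled):

```python
import numpy as np

def kmeans(X, k, n_iter=20, seed=0):
    """Plain Lloyd iteration: assign points to the nearest centroid, then update centroids."""
    rng = np.random.default_rng(seed)
    centroids = X[rng.choice(len(X), size=k, replace=False)]     # init: k random data points
    for _ in range(n_iter):
        dists = ((X[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)      # (n, k)
        labels = dists.argmin(axis=1)                                           # assignment step
        centroids = np.stack([X[labels == j].mean(axis=0) for j in range(k)])   # update step
    return labels, centroids

# Made-up data: two separated blobs.
rng = np.random.default_rng(1)
X = np.vstack([rng.normal(0, 0.2, (30, 2)), rng.normal(3, 0.2, (30, 2))])
labels, centroids = kmeans(X, k=2)
print(centroids)   # typically close to (0, 0) and (3, 3), in some order
```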
  • Adapt your code to solve the following challenge. Some small modifications: First modification: we now generate points $(x_t,y_t)$ where $y_t= \exp(w^*\cos(x_t)+b^*)$, i.e. $y^*_t$ is obtained by applying a deterministic function to $x_t$ with parameters $w^*$ and $b^*$. Our goal is still to recover the parameters $w^*$ and $b^*$ from the observations $(x_t,y_t)$. Second modification: we now generate points $(x_t,y_t)$ where $y_t= \exp(w^*\cos(p^*x_t)+b^*)$, i.e. $y^*_t$ is obtained by applying a deterministic function to $x_t$ with parameters $p^*$, $w^*$ and $b^*$. Our goal is still to recover the parameters from the observations $(x_t,y_t)$. A sketch of one possible gradient-descent approach is given below.
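A minimal PyTorch sketch for the first modification, fitting $w$ and $b$ by gradient descent on the squared error in log space (the true parameters, learning rate and number of steps are made-up choices; this is one possible approach, not necessarily the intended solution):

```python
import torch

# Made-up ground truth for illustration.
w_star, b_star = 1.5, -0.3
x = torch.linspace(-3, 3, 200)
y = torch.exp(w_star * torch.cos(x) + b_star)

# Parameters to recover.
w = torch.zeros(1, requires_grad=True)
b = torch.zeros(1, requires_grad=True)
optimizer = torch.optim.SGD([w, b], lr=0.1)

for step in range(2000):
    optimizer.zero_grad()
    pred = torch.exp(w * torch.cos(x) + b)
    # Working in log space makes the objective quadratic in (w, b).
    loss = ((torch.log(pred) - torch.log(y)) ** 2).mean()
    loss.backward()
    optimizer.step()

print(w.item(), b.item())   # should be close to 1.5 and -0.3
```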
  • written by @marc_lelarge Used to check the Python prerequisites of the deep learning course dataflowr. Before reading this post, you should start with the quiz. If you are not familiar with binding (i.e. assignment) and mutation of variables in Python, you can have a look at Binding vs Mutation. 1.0 Basics on scope: before looking at the leaking problem, we illustrate basic properties of scope in Python with a function `print_11` (a minimal sketch of such an example is given below).
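A minimal sketch of a scope example along these lines (the body of `print_11` below is an assumption for illustration, not taken from the note):

```python
x = 11  # global binding

def print_11():
    # x is not assigned inside the function, so Python looks it up
    # in the enclosing (here: global) scope.
    print(x)

def print_12():
    x = 12  # local binding: shadows the global x inside this function only
    print(x)

print_11()  # 11
print_12()  # 12
print(x)    # 11: the global binding is untouched
```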
  • written by @marc_lelarge Some thoughts about Cracking nuts with a sledgehammer: when modern graph neural networks do worse than classical greedy algorithms by Maria Chiara Angelini and Federico Ricci-Tersenghi. Summary of the paper: focusing on the problem of Maximum Independent Sets (MIS) on $d$-regular random graphs, this paper shows that a very simple greedy algorithm performs much better than the GNN studied in Ref. [4], Combinatorial Optimization with Physics-Inspired Graph Neural Networks by Martin J. A. Schuetz, J. Kyle Brubaker and Helmut G. Katzgraber (a sketch of such a greedy baseline is given below). The authors conclude ($n$ is the number of vertices in the graph): "Obviously, a good optimization algorithm should run in a time polynomial (better if linear) in $n$, and the quality of solutions found should be better than simple existing algorithms and should not degrade increasing $n$. In our opinion, at present, optimizers based on neural networks (like the one presented in Ref. [4]) do not satisfy the above requirements and are not able to compete with simple standard algorithms to solve hard optimization problems. We believe it is of primary importance to bring forward the research in this direction to understand whether neural networks can become competitive in this fundamental task or whether there is any deeper reason for their failure."
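For concreteness, a minimal sketch of a degree-based greedy baseline for MIS (a generic greedy heuristic written for illustration; it is not claimed to be the exact algorithm analyzed in the paper):

```python
import random

def greedy_mis(adj):
    """Greedy independent set: repeatedly pick a minimum-degree vertex,
    add it to the set, and remove it together with its neighbors.

    adj: dict mapping each vertex to a set of neighbors.
    """
    adj = {u: set(nbrs) for u, nbrs in adj.items()}   # work on a copy
    independent = []
    while adj:
        u = min(adj, key=lambda v: len(adj[v]))       # minimum remaining degree
        independent.append(u)
        removed = adj[u] | {u}
        for v in removed:
            adj.pop(v, None)
        for v in adj:
            adj[v] -= removed
    return independent

# Made-up example: a small random graph on 10 vertices (not d-regular in general).
random.seed(0)
n = 10
adj = {v: set() for v in range(n)}
for _ in range(15):
    a, b = random.sample(range(n), 2)
    adj[a].add(b); adj[b].add(a)
print(greedy_mis(adj))
```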