--- tags: ZZArchived --- # Comprehensive exam ==Before delivery: proper formalisms all over the place, e.g. F -> F(a,b;c)== ==h instead of a?== <details closed> <summary>Define</summary> $\newcommand{\lagrgrad}{\begin{bmatrix} \nabla_\theta L \\ -H(\theta) \end{bmatrix}}$ $\newcommand{\exs}{\mathcal{X}}$ $\newcommand{\vip}{VI(\exs, F)}$ $\newcommand{\R}{\mathbb{R}}$ $\newcommand{\w}{w}$ $\newcommand{\ws}{W}$ $\newcommand{\D}{\nabla}$ $\newcommand{\cost}{\ell}$ $\newcommand{\y}{y}$ $\newcommand{\f}{f}$ $\newcommand{\x}{a}$ $\newcommand{\xs}{\mathbf{a}}$ $\newcommand{\p}{\theta}$ $\newcommand{\pspace}{\Theta}$ $\newcommand{\lagrp}{\begin{bmatrix} \p \\ \lambda \end{bmatrix}}$ $\newcommand{\lastcost}{\cost(\f_{L-1}(\x_{L-1}, w_{L-1}), \y)}$ $\newcommand{\hzero}{\x_{1} - \f_0(x_0, w_0)}$ $\newcommand{\hi}{\x_{i+1} - \f_i(\x_i, w_i)}$ $\newcommand{\hlast}{\x_{L-1} - \f_{L-2}(\x_{L-2}, w_{L-2})}$ $\newcommand{\tarhzero}{\hat\x_{1} = \f_i(x_0; w_0)}$ $\newcommand{\tarhi}{\hat\x_{i+1} = \f_i(\x_i; w_i)}$ $\newcommand{\tarhlast}{\hat\x_{L} = \f_{L-1}(\x_{L-1}, w_{L-1})}$ $\newcommand{\opthzero}{\x_{1} = \f_i(x_0; w_0)}$ $\newcommand{\opthi}{\x_{i+1} = \f_i(\x_i; w_i)}$ $\newcommand{\opthlast}{\x_{L} = \f_{L-1}(\x_{L-1}, w_{L-1})}$ $\newcommand{\sh}{\sum_{i=0}^{L-2} \hi}$ $\newcommand{\sqh}{||\sh||^2}$ $\newcommand{\forwardprop}{\f_{L-1}(\dots\f_i(\dots\f_0(x_0; \w_0)\dots;\w_i)\dots;\w_L)}$ $\newcommand{\bptarget}{\cost(\forwardprop, \y)}$ $\newcommand{\liftedmin}{\underset{(\ws, \xs)}{\operatorname{argmin}}}$ $\newcommand{\classicalmin}{\underset{\ws}{\operatorname{argmin}}}$ $$\newcommand{\bpprob}{ \begin{align} & \classicalmin & \bptarget \\ & \operatorname{given} & x=\x_0 \\ & \end{align}}$$ $$\newcommand{\bpprobb}{ \begin{align} & \classicalmin & \lastcost \\ & \operatorname{given} & \opthi\\ & \operatorname{given} & x=\x_0 \\ & \end{align}}$$ $$\newcommand{\constrprob}{ \begin{align} & \liftedmin \lastcost \\ & \text{s.t.} & \opthi\\ & \operatorname{given} & x_0=\x_0 \\ & \forall i \in {0,...,(L-1)} \end{align}}$$ $$\newcommand{\collocationprob}{ \begin{align} & \underset{\xs}{\operatorname{argmin}} & \lastcost \\ & \text{s.t.} & \phi(\x_i, \x_{i+1}) = \w_i\\ & \operatorname{given} & x=\x_0 \\ & \forall i \in {0,...,(L-1)} \end{align}}$$ $\newcommand{\lagrangian}{L(x_0;\w,\x,\lambda) = \lastcost + \lambda^T H(x_0; \w, \x)}$ $\newcommand{\old_lagrangian}{L(\p, \lambda)=\lastcost+\lambda\sh}$ </details> I am an x-Facebook employee. In the early days I used to be proud wearing Facebook swag. Later it bacame more and more an issue. People started making comments on feature X or policy Y they didn't like. I eventually didn't wear any swag in public, this is when I've realized I need to leave the company. <details closed> <summary>Guidelines</summary> ###### [the review report should be concise, but clear, and is typically between 12 pages and 15 pages in a single-spaced, 12 point font.](https://www.cs.mcgill.ca/academic/graduate/phd/) ![](https://i.imgur.com/UAluMCo.png) https://www.brainpickings.org/2015/05/27/william-zinsser-on-writing-well-science/ ### in BVP we also want to set the boundary gradient to zero, does it add a constraint? ## Review suggestion ### Fact1: The main concept [Lecunn] ### Fact2: splitting training ### Fact3: Solving ConstrNN ### Fact4: BioView Doina suggestions: - Define stuff clearly - Unifiy notation - Lookup domain of expertise of the committee - committe goal: nudge student in PhD direction lpha ## Constrained Optimization as an Alternative to Back-Propagation. 
Manuel Del Verme, Prof. Pierre-Luc Bacon and Prof. Doina Precup -DATE-

https://www.sciencedirect.com/science/article/abs/pii/0041555379900661
http://www.tricki.org/article/Dyadic_decomposition

PL version /home/esac/research/PhD Syllabus/comp.pdf

### Reviews
[Gori96](https://sci-hub.do/10.1016/0925-2312(95)00032-1)
[GoriBook](https://ipfs.io/ipfs/bafykbzaceampi5jlsi7hzx5n7ztqkvgswiixeblb2feota77wjlgdzpjmrj4u?filename=Marco%20Gori%20-%20Machine%20Learning.%20A%20Constraint-based%20Approach-Morgan%20Kaufmann%20_%20Elsevier%20%282018%29.pdf)
http://www.tricki.org/article/Divide_and_conquer

<!-- ODE splitting http://www.gicas.uji.es/Fernando/Other/enci-sp5.pdf -->

</details>

<details open>
<summary>Intro (0/1)</summary>

## Divide et Impera

Decomposing a problem into smaller, more manageable pieces that are solved independently and then merging the solutions has been a common motif behind many fundamental advancements (search, ODEs, Kepler, etc.). In this survey I present several approaches that divide the parameter-estimation problem for neural networks, classically solved by backpropagation, into simpler sub-problems.

## Feed-forward neural networks

With this goal of subdivision in mind, we now look at a feed-forward neural network
$$\forwardprop$$
and its loss function: $\cost(\w, x_0, y) = \bptarget$

TODO: talk about bias

We can specify the above transfer function layer-wise as:
$$\x_i = \f_i(\x_{i-1}; \w_{i-1}) \enspace \forall i \in (0\dots {L-1})$$

We can specify the network activations:
\begin{align}
\begin{cases}
\opthzero \\
...\\
\opthi \\
...\\
\opthlast\\
\end{cases}
\end{align}
\label{forwardprop}

## Backpropagation

Its loss Jacobian is
$\newcommand{\dfdw}[2]{\frac{\partial \f_{#1}(x_0; \w)}{\partial \w_{#2}}}$
\begin{align*}
J(\w, x_0) = \begin{bmatrix}
\dfdw{0}{0} & 0 & \dots &\\
\dfdw{1}{0} & \dfdw{1}{1} & \dots &\\
\dfdw{2}{0} & \dfdw{2}{1} & \dfdw{2}{2} & \dots &\\
\vdots && \ddots \\
\dfdw{L-1}{0} & ... & ... & ... & \dfdw{L-1}{L-1}
\end{bmatrix}
\end{align*}

TODO: say something about how the last columns are "nice" while the ones on the left are not, because of vanishing/exploding gradients.

<!-- compare with https://www.jasonosajima.com/backprop -->

# Constrained Formulation [LeCun, 1988]

[Lecun88] shows how to cleanly cast the unconstrained learning problem solved by backpropagation into a Lagrangian formulation. We can now start decoupling **(right word?)** the network by adding extra decision variables $\hat \x_i \ \forall i \in (1..L-1)$; we now have two sets of variables and a set of constraints.

We can specify the network activations:
$$
\left\{
\begin{array}{ll}
\tarhzero \\
... \\
\tarhi \\
... \\
\tarhlast \\
\end{array}
\right.
\text{subject to}
\left\{
\begin{array}{ll}
\hzero =0 \\
... \\
\hi =0\\
... \\
\hlast =0\\
\end{array}
\\
\right.
$$

[\ref forwardprop] can now be expressed in vector form:
\begin{align*}
\hat \x = F(x_0; \w, \x) = \begin{bmatrix}
\f_0(x_0; \w_0) \\
\vdots \\
\f_{L-1}(\w_{L-1}, \x_{L-1})\\
\end{bmatrix}
\end{align*}

And the constraints:
\begin{align*}
H(x_0; \w, \x) = \begin{bmatrix}
\hzero\\
\x_2 - \f_1(\x_1, \w_1)\\
\vdots\\
\hlast\\
0\\
\end{bmatrix}
\end{align*}

\begin{align*}
\D_\x H(x_0; \w, \x) = \begin{bmatrix}
1 & ... & &\\
-\D_{\x_1} \f_1(\x_1, \w_1)& 1 & ... & &\\
& \ddots &&\\
&-\D_{\x_{L-2}} \f_{L-2}(\x_{L-2}, \w_{L-2})& 1 \\
& & -\D_{\x_{L-1}} \f_{L-1}(\x_{L-1}, \w_{L-1})\\
& & 0\\
\end{bmatrix}
\end{align*}

**LeCun adds $a_L$ as a final state and then constrains it; that makes no sense, so I am not doing that. The 0 row is temporary, to remind me about this.**

With Lagrangian:
$$\lagrangian$$
<!-- \frac{\partial{1}}{\partial -->

We want to find ($\w, \x, \lambda$) s.t.
$$\nabla L(x_0; \w, \x, \lambda)=\boldsymbol{0}$$
\begin{align}
\D_\lambda L = 0\\
\D_\x L = 0\\
\D_\w L = 0\\
\end{align}

The interesting thing is that these three conditions yield, in order:
1. Forward propagation
1. The gradients (backward pass)
1. Optimality conditions for $\w$

Recovering forward propagation from $\D_\lambda L = H = 0$ is trivial. $\D_\x L = 0$ is more interesting
<!-- as for the last layer we get $D_\x \bigg[\lastcost + \lambda^T (\hlast)\bigg]$ -->
$$\D_\x L = \D_\x \ell + \lambda^T \D_\x H$$

For $\x_{L-1}$ we have:
$D_{\x_{L-1}}L=D_\x \lastcost + \lambda_{L-1} = 0$
$\lambda_{L-1} = -D_\x \lastcost$
which is the same gradient w.r.t. the last layer as in classical backpropagation.

For the previous layers we have: ==**something wrong somewhere, idk**==
$\lambda_{L-2}[\hlast]$
$D_{\x_{L-2}} L=\lambda_{L-2}( - D_{\x_{L-2}}f) = 0$
$\lambda_{L-2} D_{\x_{L-2}}(0 - D_{\x_{L-2}}f) = 0$

1) $J_aL$ blablalba
3) omg this is costates
3) omg^2 gradients are costates

</details>

<details open>
<summary>Hacks (1/3)</summary>

## Finding a solution, the hacks

While [Lecun88] sets up the problem and identifies sufficient conditions, it does not provide a way to solve the optimization problem.

### :heavy_check_mark: [Carreira-Perpiñán and Wang, 2014]

[CPnW] introduces a mathematical formulation (meta-algorithm) named *MAC* to divide the training of a general neural network into sub-problems. This is done by a layer-wise optimization introducing the "method of auxiliary coordinates" (MAC), where they learn $\w$ using the network activations $\x$ as "auxiliary coordinates". The search itself is done by using a penalty-based method over $(\w, \x)$.

They turn the original problem:
$$\bpprob$$
into a constrained equivalent one with auxiliary variables $\x_i$ as usual
$$\constrprob$$

This problem can now be solved via the quadratic penalty method, where the problem becomes unconstrained again:
\begin{align}
& L(\x, \w, \lambda) = \lastcost + \lambda_0 \sqh\\
& \operatorname{given} x=\x_0 \\
\end{align}
for a fixed $\lambda_0$; a sequence of such problems is then solved for $\lambda_0 \to \infty$ by alternating the minimization over $\w_i$ and $\x_i$.

### $\w$-step
The minimization over $\w_i$ is now equivalent to minimizing $L$ single-layer problems of the form:
$$ \min_{\w_{L-1}} \enspace \lastcost$$
and
$$ \min_{\w_i} \enspace \lambda || \hi ||^2 $$

### $\x$-step
The minimization over $\xs$ instead is:
$$ \min_{\xs} \enspace \lastcost + \lambda \sqh $$
which is a “generalized” proximal operator.

The problem is now optimized sequentially, $\x_i$ then $\w_i$, by Gauss-Newton for both $\ws$ and $\xs$. Once the problem (ref{L(a, w, lambda}) is considered solved, the auxiliary variables $\xs$ are set to $\opthi$ to guarantee feasibility.
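To make the alternating scheme concrete, below is a minimal NumPy sketch of the quadratic-penalty idea on a toy two-layer network. It uses plain gradient steps where the paper uses Gauss-Newton solves, and all names (`W0`, `W1`, `A1`, `mu`) are mine; read it as an illustration of the structure, not as the authors' algorithm.

```python
# Minimal sketch of a MAC-style quadratic-penalty scheme on a toy network
# a1 = tanh(W0 x), y_hat = W1 a1 with squared loss. Illustrative only.
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(50, 4))            # inputs  (N x d0)
Y = rng.normal(size=(50, 2))            # targets (N x d2)

W0 = rng.normal(scale=0.1, size=(4, 8))
W1 = rng.normal(scale=0.1, size=(8, 2))
A1 = np.tanh(X @ W0)                    # auxiliary coordinates, initialised by a forward pass

def penalty_objective(W0, W1, A1, mu):
    """Terminal loss + quadratic penalty on the broken constraint a1 = f0(x; W0)."""
    loss = 0.5 * np.sum((A1 @ W1 - Y) ** 2)
    constraint = A1 - np.tanh(X @ W0)
    return loss + mu * np.sum(constraint ** 2)

step = 1e-3
for mu in [1.0, 10.0, 100.0]:           # drive the penalty weight towards infinity
    for _ in range(200):
        # x-step: gradient step on the auxiliary activations A1
        g_A1 = (A1 @ W1 - Y) @ W1.T + 2 * mu * (A1 - np.tanh(X @ W0))
        A1 -= step * g_A1
        # w-step: each weight block sees an independent, single-layer problem
        pre = X @ W0
        g_W0 = -2 * mu * X.T @ ((A1 - np.tanh(pre)) * (1 - np.tanh(pre) ** 2))
        g_W1 = A1.T @ (A1 @ W1 - Y)
        W0 -= step * g_W0
        W1 -= step * g_W1
    print(mu, penalty_objective(W0, W1, A1, mu))

A1 = np.tanh(X @ W0)                    # post-processing: restore feasibility
```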
<!-- NOTE: K=Layers, N=Dataset Samples -->
<!-- MORE: MAC-constrained and nested problem equivalence (theorem A.1) ![From the paper](https://i.imgur.com/GRWsSy8.png) KKT conditions appendix:A.3 TODO Convergence of QuadraticPenalty TODO appendix:B -->

### [Betti et al., 2018]
notes: $\alpha_{\kappa, i}$ is a regularization factor

[Betti18] takes the [Lecun88] stationary conditions and defines a search approach in the $(w, x, \lambda)$ space by gradient descent ascent (GDA)
\begin{align}
& \underset{(w, x)}{\operatorname{argmin}} & \ell(x_N, y) + \sum_{i=0}^{N-1}?(x_i, w_i) \\
& \text{subject to} & x_{i+1}=f_i(x_i, u_i) \\
&\operatorname{given} & x_0 \\
\end{align}
with $x=(x_1, x_2, .. x_N), u=(u_0, u_1, .. , u_{N-1})$, and its Lagrangian
$$L(w, x, \lambda)=\ell(w, x)+\lambda \sum_{i=0}^{N-1}(x_{i+1} - f_i(x_i, u_i))$$

They now suggest to:
1. GDA over $x, \lambda$ ($l \times n$ times)
2. GD over $w$ ($n \times pa(i)$ times)
3. loop back

### [Gotmare et al., 2018]

This work introduces BlockProp, an algorithm that splits the unconstrained optimization problem into a constrained problem over blocks of one or more layers. The paper then recognizes that
$$\bpprob$$
if we consider $b_i$ a sequence of layers $f_i$, $b_i=\f_i(\w_i, \f(\dots(\f_j(\w_j,\x_j))))$, and $\w_{i:j}$ the weights pertaining to those layers $i..j$, can be reformulated as:
$\newcommand{\f}{b}$
$$\constrprob$$

As far as I know they are the first to propose a stochastic view of the constrained problem; the loss is now
$\newcommand{\x}{a^{[n]}}$
$$L(\x, \w, \lambda)= \lastcost + \sum_{i=0}^{L-1} \lambda_i ||\hi||^2$$
which is minimized by alternating optimization over $\xs$ then $\ws$. **TODO**: page 3. The optimization w.r.t. $\xs$ is done using SGD, then the optimization w.r.t. $\ws$ using SGD again (albeit with a different number of steps).
$\newcommand{\x}{a}$ $\newcommand{\f}{f}$

</details>

<details open>
<summary>Lagrangian (1.5/2)</summary>

## Finding a solution, borrowing from control systems.
<!-- Up to now all the methods formulate the constrained problem and then try to solve a minimization problem with the penalty approach. Here we are going to look at the field of control and how they solve control problems with long time horizon, the parallel between discrete control time horizon and long term dependencies in neural network comes from the view of a neural network as a multistage system where the signal propagates from the input layers to the output in a discrete time way. ### [Kalman, 2009] [Kalman, 2009] notes that we can find solution of the lagrangian as a saddle point optimization problem (https://sci-hub.do/https://doi.org/10.2307/1905259, would have been a better paper on the topic and https://www.rand.org/content/dam/rand/pubs/papers/2009/P223.pdf too) **ALTERNATIVE PAPER: harrow hurwitz 1952 algorithm** Using $\theta=(X, W)$ $$ \max_{\lambda} \min_{\theta} {L(\theta, \lambda)}\\ \operatorname{given} x_0 $$ Alas, in the general case and in particular in the case of deep neural networks we can not rely on the convex-concave assumption, this will mean we have to find locally optimal saddle points instead of extrema of the lagrangian and hence of the loss function **(important point)** ![](https://i.imgur.com/7UcTFMN.png) [kalman paper] that formulation is a minmax, we can solve minmax by looking at VIs and non linear problems.
-->

### Discrete time optimal control
[DCOPT](/E-oP3XSDTN-9QNsjDWL7DQ)

To better understand why this constrained optimization approach might be useful, we can look at fields trying to solve the same problem of long-term dependencies and coming up with similar solutions. The problem of minimizing a final cost function (and optionally a layer-wise one too) is found in DCOPT, where we consider the problem of finding a sequence of states $\x_i$ (usually $x$) and controls $\w_i$ (usually $u$) that minimizes the cost subject to the realizable dynamics. For example, for a robot we might want to fit a planned trajectory from $x_0$ to $y$ constrained by the laws of motion but avoiding obstacles.

The most general formulation (in discrete time) is:
\begin{align}
& \underset{(\w, \x)}{\operatorname{argmin}} & \lastcost + \sum_{i=0}^{L-1}g_i(\x_i, \w_i) \\
& \text{subject to} & \x_{i+1}=f_i(\x_i, \w_i) \\
&\operatorname{given} & x_0=\x_0 \\
\end{align}
Setting $g_{i<N}=0$, I will keep ignoring the $g_{i<N}$ costs, because ==[Bertsekas, Optimal Control vol. I, sec. 3.4]== tells us we can reduce any Bolza problem (running plus terminal cost) to a Mayer problem (terminal cost only) by propagating the running costs forward through an augmented state. With $g_{i<N}=0$ we recover the original formulation of [lecun88].

#### Hard Constraints (Backprop)
$$\bpprobb$$
In this setting classical backpropagation is akin to forward shooting, where we integrate the problem dynamics. Here we want to find the optimal set of weights (actions) that achieves the minimum cost.

![](https://i.imgur.com/J5oVDU2.png =300x)
(source, Pieter Abbeel)

As shown in the image above, any obstacle in the optimization space can increase optimization complexity, since the dynamics of the controlled system define the search space. The gradients estimated by backpropagation guarantee optimization paths contained in the feasible region, meaning we have to go through the narrow passage.

Known problems of forward shooting methods are:
- Brittle optimization trajectories
- Prone to local minima
- Strong dependency on initial guess
- Requires initial guess from demonstrations and randomization

<!-- ![](https://i.imgur.com/pL3ZwHy.png) P. Abbeal, problems in robotics -->

We have the same problems in supervised learning:
- ?
- Vanishing grads (deep), local minima (shallow)
- Reliance on engineered initializations
- Transfer learning aka good initial guesses

#### Soft Constraints (Lagrangian BackProp)
![](https://i.imgur.com/GRQqaMG.png =300x)

Lagrangian methods only softly enforce the constraints, allowing the optimization path to converge more robustly, and with fewer iterations, to the optimal solution.

## Collocation methods
<!-- ![](https://i.imgur.com/MdWtJnt.png) ![](https://i.imgur.com/MdWtJnt.png) Update rules: Backprop chain rule vs splitted backprop With invertible networks, work like bio inspired ones later explained -->

In collocation methods we add the state variables $\x_i$ as optimization variables:
$$\constrprob$$
which is identical to the original constrained formulation. One of the fundamental methods in collocation is direct collocation, where the problem takes the form:
$$\collocationprob$$
We are now trying to find the best activation sequence (trajectory) for our network, and we let $\phi$, representing an inverse layer dynamics, define the appropriate weights implicitly; we are going to see more of this in the bio-inspired algorithms.
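To make the role of $\phi$ concrete, here is a small NumPy sketch under the (strong) simplifying assumption of linear layers, where the inverse layer dynamics has a closed form: once an activation trajectory is fixed, the weights are implied by a least-squares solve. All names are mine and this is only an illustration of the idea, not an algorithm from the papers discussed here.

```python
# Toy collocation-style view with *linear* layers: fix an activation trajectory
# a0 -> a1 -> a2 and recover each weight matrix from the inverse dynamics phi.
import numpy as np

rng = np.random.default_rng(1)
A0 = rng.normal(size=(50, 4))           # input activations (fixed data)
A1 = rng.normal(size=(50, 8))           # a candidate hidden trajectory (decision variable)
A2 = rng.normal(size=(50, 2))           # desired outputs / targets

def phi(a_in, a_out):
    """Inverse layer dynamics for a linear layer: the W minimising ||a_in W - a_out||^2."""
    return np.linalg.lstsq(a_in, a_out, rcond=None)[0]

W0 = phi(A0, A1)                        # weights are implied by the chosen trajectory
W1 = phi(A1, A2)

residual = np.linalg.norm(A0 @ W0 - A1) + np.linalg.norm(A1 @ W1 - A2)
print("constraint violation of the implied weights:", residual)
```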
<!-- ![](https://i.imgur.com/8MdcPGs.png) In this new DTOCP view of backprop we are now talking about optimizing trajectories, *direct collocation* > discretizes the trajectory optimization problem itself, typically converting the original trajectory optimization problem into a nonlinear program (see section 1.6). This conversion process is known as transcription and it is why some people refer to direct collocation methods as direct transcription methods ![](https://i.imgur.com/wQbh4hr.png) Collocation problem for a continuous NN ![](https://i.imgur.com/7hBgiSs.png) policy(closed loop) vs trajectory(open loop) -->

### :heavy_check_mark: Direct collocation by sequential + orthogonal [Biegler, 1984]
==TODO: this is a global (single shooting) method.. right? clarify and make clear.==
<!-- I think the point is that we split on t but then fit an approx global solution (no local error), with zero error at selected ts. -->

[Biegler, 1984] uses orthogonal collocation: it defines an appropriate class of interpolating polynomials (Lagrange polynomials) and fits them at splitting points.

1. It defines a set of collocation points $\x_i$; this is equivalent to choosing the times $t$ at which the trajectory is pinned down. ![](https://i.imgur.com/txvjXbG.png) In the image above $t=\{0, 2, 5, 7\}$.
2. It defines an interpolating polynomial:
<!-- ![https://www.tu-ilmenau.de/fileadmin/media/simulation/Lehre/div/Lec_Slides5.pdf](https://i.imgur.com/f7sXNJ5.png)-->
$$\hat \x_i(t)=\sum_{k=0}^N \x_k^{(i)}L_k(t), \enspace i \in [1..n]$$
$$\hat \w_j(t)=\sum_{k=0}^N \w_k^{(j)}L_k(t), \enspace j \in [1..m]$$
with $L_k$ the N-th degree Lagrange polynomial:
$$L_k(t)=\prod^N_{l=0,l\neq k} \frac{t - t_l}{t_k-t_l}$$
which guarantees that $L_k(t_l)=1 \enspace \text{if} \enspace l=k, \enspace 0$ otherwise.
<!-- ![https://www.tu-ilmenau.de/fileadmin/media/simulation/Lehre/div/Lec_Slides5.pdf](https://i.imgur.com/9TfycJ8.png) -->
<!-- >Solves the discrete optimal control problem using orthogonal collocation in a simultaneous approach >In the latter case, also the states are fully discretized based on (orthogonal) polynomials. Hence, the minimal objective value has to be found while satisfying the discretized differential equations -->

### :heavy_check_mark: Multiple shooting [Bock and Plitt, 1985]

Multiple shooting splits the problem into subproblems by defining time intervals (similar to direct collocation) and solving each problem independently. In our case each subproblem has the form of the constraint $\hi$. We now have a system of non-linear equations
\begin{array}{}
H(\x) = \begin{cases}
\hzero& = 0\\
...\\
\hi & = 0 \\
...\\
\hlast & = 0 \\
\end{cases}
\end{array}
Here the dynamics of the system are fixed, hence the $\ws$ are not variables. The above system $H(\x)=0$ can be solved by Newton's method, finding iterates $\x^k$ using the Jacobian:
\begin{align*}
H^\prime(\x) = \begin{bmatrix}
-I & 0 & ... &\\
\frac{\partial \f_{0}(\x_{0}, \w_0)}{\partial \x_0} & -I & ... &\\
0 & \frac{\partial \f_{1}(\x_{1}, \w_1)}{\partial \x_1} & -I & ... &\\
\vdots && \ddots \\
0 & ... & 0 & \frac{\partial \f_{T-1}(\x_{T-1}, \w_{T-1})}{\partial \x_{T-1}} & -I
\end{bmatrix}
\end{align*}
Since $H$ splits by interval and $H^\prime$ is block-banded (lower block-bidiagonal), the interval subproblems can be evaluated independently, allowing for parallel computation.
==TODO: compare with the first section matrices==
<!-- > Here the time domain is partitioned into smaller time elements and the DAE models are integrated separately in each element [9–11].
Control variables are parametrized as in the sequential approach and gradient information is obtained for both the control variables as well as the initial conditions of the state variables in each element. Finally, equality constraints are added to the NLP to link the elements and ensure that the states are continuous across each element. As with the sequential approach, inequality constraints for states and controls can be imposed directly at the grid points. For piecewise constant or linear controls this approximation is accurate enough, but path constraints for the states may not be satisfied between grid points. >The total integration range is split into a finite number of intervals on which integration of the states is continuous. The value of the control and the initial value of the states in each interval are chosen by the NLP-solver in each iteration while trying to ensure the continuity of the states between the different intervals -->

</details>

<details open>
<summary>Optimization (2/4)</summary>

It is clear that being able to solve constrained optimization problems is crucial to finding good solutions to all the problems we have seen up to now. Classical approaches have focused on limiting the hypothesis space of the approximators, restricting the search to some well-behaved set of parameters and parametrizations. To deal with modern neural networks, where strong nonlinearities have proven to be necessary, we must look at ways to solve the constrained problem at a larger scale.

Because, from the constrained-optimization point of view, both the weights $\w_i$ and the activations $\x_i$ are equally parameters, let $\p:=(\w, \x)$, the concatenation of both vectors, be the variable of interest for this part.

## VI problem formulation.
<!-- More: https://youtu.be/vkVslJW0hg0?t=562 [2.3 explains (S)VIP](https://arxiv.org/pdf/1802.10551.pdf) [3 explains batch (S)VIP](https://arxiv.org/pdf/1802.10551.pdf) Fancy pictures https://gauthiergidel.github.io/slides/MAIS2018.pdf Why not GDA ~30min https://www.youtube.com/watch?v=VbclRsNjLzE&list=PLqrJF7iQbscALPio49SSil8XDbfNAlOTO -->
<!-- >Rockafellar: Please note here we define necessary/sufficient conditions but in modern optimization we have search processes where we compute solutions with some little error so now we need general conditions that could be readily found ![](https://i.imgur.com/9BmFU4o.png) -->

[Lecun88] gives us clear stationary conditions for a solution. We want to find stationary points $(\p^\star, \lambda^\star)$. ==**TODO:** sufficient? given that our problem is **convex**, not the case here!==

In the unconstrained case we want the gradient at $(\p^\star, \lambda)$ to be zero; for the penalty-based methods we look for something of the form $||\nabla L(\p^\star, \lambda)|| = 0$ with $\lambda\rightarrow\infty$.

[Gidel19] notes that since we have constraints, we are interested in a stationary point for which the cost is non-negative in every feasible direction.
In our setting this can be expressed as:
\begin{cases}
\nabla_\theta L(\p^\star, \lambda^\star)^T(\theta-\theta^\star) & \geq 0 \\
-\nabla_\lambda L(\p^\star, \lambda^\star)^T(\lambda-\lambda^\star) & \geq 0 \\
\end{cases}
$\forall (\theta,\lambda) \in (\Theta\times\Lambda)$

We can simplify the formulation $\ref{above}$: assuming $L$ is closed convex on $\pspace, \Lambda$, we can set
$F:=(\nabla_\p L, -\nabla_\lambda L)$
$x:=(\p, \lambda)$
and have:
$F(x^\star)^T(x-x^\star)\geq0, \forall x \in \mathcal{X}$
The problem $\ref{above}$ is also called a variational inequality problem, $\vip$.

$\newcommand{\projh}{(\phi(\p_\w), \p_\w)}$
It is interesting to note that the solution lies in $(\Theta^\star, \Lambda)$ where $\Theta^{\star}=\{\p \in \Theta:\p=P_h(\p)\}$, $P_h: \p \rightarrow (\x, \w)$, $P_h(\p) = \projh$, and $\phi(\w)$ is the forward-pass function.

==**Continue here**== https://arxiv.org/pdf/1802.10551.pdf [Rockafellar VIs](/W0EUlMukSneqpRk3kKqkJg)

### Projection method

One of the first methods to solve VIs in general was proposed by $\ref{LnS}$. Given a $\vip$ and the assumptions ==TODO: the above formulation is on $\exs$, below is for a closed convex subset C of $\exs$==
- $\exs$ is a Hilbert space
- $F$ is a continuous bilinear form on $\exs$
- $C$ is a closed convex subset of $\exs$
- $f$ is an element of the dual space of $\exs$

<!-- a -> L K -> C V -> X u -> y v -> x -->

Problem: find $x^\star \in \exs$ which solves the VI(C, F)
<!-- $F(x^\star, x-x^\star) \geq \enspace f \cdot (x-x^\star), \enspace \forall x \in C$ TODO: MORE TODO: $L(x^\star, x^\star) \geq \alpha ||x||^2$ then the problem has a solution -->

[Lions and Stampacchia, 1967] looks at $L(x, x) \geq 0, \enspace \forall x \in X$
>> https://link.springer.com/chapter/10.1007/0-387-24276-7_3
[book](https://ipfs.io/ipfs/bafykbzaceaoknhfi5ly7bipdz4cnfirz6hoczxad2clfk2gapihsyujwq6oy2?filename=Giannessi%20F.%2C%20Maugeri%20A.%20-%20Variational%20Analysis%20and%20Applications%20%282005%29.pdf)

[Lions and Stampacchia, 1967] studies the existence of at least one solution of:
![](https://i.imgur.com/6M7tqBq.png)
by replacing:
![](https://i.imgur.com/eGvvgoh.png)
with:
![](https://i.imgur.com/LhAoV1s.png)

The main results of $\ref{LnS}$ can be summarized as follows, the hypotheses still being that:
1. $F(u, v)$ is bilinear continuous on $V$
1. satisfies (3)
1. $C$ is a closed and convex set of $\xs$

>> 1. the set X of all solutions of $\vip$ is a (possibly empty) closed and convex set.
>> 2. approximation of the set X by regularizations: $\forall \varepsilon > 0$, let us consider the problem which consists in finding $u_\varepsilon \in K$ such that: ![](https://i.imgur.com/XpYBFBP.png)

And it shows that the simplest method to solve $\vip$ is:
$$ \begin{equation} x_{k+1}=P_C(x_k-\alpha F(x_k)) \end{equation} $$
where $P_C$ is an orthogonal projection onto $C$ and $\alpha>0$ is a positive step size. Note that $P_C$, in the case of constrained neural networks, is not the projection onto the constraint set $P_h(\p)$ defined before, since our $\vip$ encodes the unconstrained Lagrangian optimization.

### :heavy_check_mark: Extragradient [Korpelevich, 1976]
<!-- More: https://papers.nips.cc/paper/2019/file/4625d8e31dad7d1c4c83399a6eb62f0c-Paper.pdf -->

If we look inside the function we want to optimize, we see that it has the form of a two-player $(\p, \lambda)$ game, $F:=(\nabla_\p L, -\nabla_\lambda L)$, which is prone to oscillatory behavior; we are in fact interested in saddle points. Extragradient aims to dampen this behaviour, allowing for convergence to a stationary point.
This method does not require strong monotonicity, which makes it a better candidate for neural networks. ==This line is risky==

We are interested in finding a saddle point of the Lagrangian $\lagrangian$ via the $\vip$ with: $F = [1, -1]^T\nabla L$
<!-- , even in bilinear case (as it is the Lagrangian of a linear programming problem), GDA requires strong concavity-convexity and since we are interestinted in. -->

Extragradient follows the simple two-step update rule:
$\overline{x}=P_C(x^k-\alpha F(x^k))$
$x^{k+1}=P_C(x^k-\alpha F(\overline{x}))$
In our case the Euclidean projection $P_C$ is not necessary:
$\overline{x}=x^k-\alpha F(x^k)$
$x^{k+1}=x^k-\alpha F(\overline{x})$
And, with the function of interest in mind:
$\overline{x}=x^k-\alpha \lagrgrad(x^k)$
$x^{k+1}=x^k-\alpha \lagrgrad(\overline x)$

<!-- To to simplify at the cost of ambiguity: if we look at the gradients: $$\lambda\nabla_\theta h = \lambda\nabla_\theta(\sh) = \lambda-\lambda\nabla_\theta f_i$$ $\newcommand{\badlagrgrad}{\begin{bmatrix} \nabla_\theta \lastcost + \lambda \cdot \nabla_\theta \sum f_i \\ -h \end{bmatrix}}$ $\overline{x}=x^k−\alpha \badlagrgrad(x^k)$ $x^{k+1}=x^k−\alpha \badlagrgrad(\overline x)$ -->
<!-- #### Better intuition on extrapolation. First we note that for $\eqref{eq:proj}$ to be a solution of $\eqref{VI}$ we need to have $x^\star=P_C(x^\star−\alpha F(P_C(x^\star−αF(x^∗))))$ Then we want to update $x^{k+1}=P_C(x^k−\alpha F(\underbrace{P_C(x^k−αF(x^k))}_{\bar x^{k}}))$ This gives us the two step formulation of extragradient. Bla1 Bla2 https://www.youtube.com/watch?v=UvJvE26mDio @ 31min http://dm.unife.it/~tinti/Software/Extragradient/methods/vipsegm.pdf > the extragradient method(extrapolated gradient) does not requireany averaging to converge for monotone operators (in the batch setting), and can even converge atthe fasterO(1/t)rate (Nesterov, 2007). The idea of this method is to compute a lookahead step (seeintuition onextrapolationin §3.2) in order to compute a more stable direction to follow. §3.2 ![]#(https://i.imgur.com/mMrkfkg.png) ![]#(https://i.imgur.com/HJR3ZvR.png) -->

## Large scale NLP

### Side note, not a paper to review: MirrorDescent

GD: $x^{k+1}=x^k-\eta \nabla f(x^k)$
This can be seen, at every step, as the solution of the quadratic problem
$x^{k+1}= \operatorname{argmin}_x f(x^k)+\nabla f(x^k)^T(x-x^k) + \frac{1}{2\eta}||x - x^k||^2$,
i.e. a first-order Taylor expansion plus a proximity term; since $f(x^k)$ is constant,
$x^{k+1}= \operatorname{argmin}_x \nabla f(x^k)^T(x-x^k) + \frac{1}{2\eta}||x - x^k||^2$ at every step.
The proximity term imposes a homogeneous penalty, which may not be appropriate, e.g. if the gradients along one dimension are even just linearly scaled with respect to another (ellipses, etc.).
![](https://i.imgur.com/nsHCLeF.png)
Solution: replace the proximity term $\frac{1}{2\eta}||x - x^k||^2$ with a better prior $D_\varphi(x, x^k)$, called a Bregman divergence.

### Bregman proximal, MirrorProx [Nemirovski, 2005]
<!-- More: [Slides](http://www.princeton.edu/~yc5/ele522_optimization/lectures/mirror_descent.pdf) TODO: add proof trap (slide 5-21) [Blog](https://blogs.princeton.edu/imabandit/2013/04/23/orf523-mirror-prox/) -->

Given that we can define a Bregman divergence function $D_\varphi(x, x^k)$, MirrorProx can be seen as a generalization of extragradient. To recover extragradient, let us take $D_\varphi = \frac{1}{2}\langle z, z \rangle$, the Euclidean case; we find the original extragradient:
$\bar x=\Pi_X(x^k - \gamma \Phi(x^k))$
$x^{k+1}=\Pi_X(x^k - \gamma \Phi(\bar x))$
$\Pi_X(x) = \underset{x^\prime \in X}{\operatorname{argmin}}\langle x - x^\prime, x - x^\prime \rangle$
==Does not look like it?==
Here we are trying to adjust gradient updates to fit the problem geometry via the $D_\varphi(x^k, x)$ term between iterates.
<!-- Given a smooth function $L$ and a compact set $\exs$ and a smoothness measure under arbitrary norm $||\cdot||$. ![](https://i.imgur.com/fyQT73f.png) Mirror prox takes a step of mirror descent (**what is MD?**), and then applies the same gradient let $\Phi: D\rightarrow \mathbb{R}$ be a mirror map in $X$ and let $x_1 \in \operatorname{argmin}$ Mirror prox follows an extragradient like update replacing gradient descent with mirror descent. > In words the algorithm first makes a step of Mirror Descent to go from x_t to y_{t+1}, and then it makes a similar step to obtain x_{t+1}, starting again from x_t but this time using the gradient of f evaluated at y_{t+1} (instead of x_t). The following result justifies the procedure. $D_\varphi$, the bergman divergence defines a distance encoding the problem geometry and is in the form $Dφ(x,z) :=φ(x)−φ(z)−〈∇φ(z),x−z〉$ Sources: -->

### :heavy_check_mark: Extrapolation from the past [Gidel et al., 2018]
<!-- https://arxiv.org/pdf/1802.10551.pdf -->

This paper casts the GAN problem as a VIP $\ref{VIP}$ and then introduces an "extrapolation from the past" scheme similar to optimistic mirror descent with projections. (NAMEDROP OMD w/ no explanation, remove?) While the GAN problem is not equivalent to the Lagrangian saddle point, this paper formulates GANs as VI saddle-point optimization, and as long as we restrict ourselves to the batch setting the methods are similar, as discussed before.

They also introduce a stochastic VIP formulation where, instead of the gradient $F = [1, -1]^T\nabla L$, we only have access to an unbiased estimate of it, $F(x, \xi), \xi \sim P$ with $F(x) = \mathbb{E}_{\xi \sim P} [F(x,\xi )]$. The reason why we cannot apply the stochastic formulation to constrained neural networks is that we do not have access to such an estimate: our Lagrangian
$$\lagrangian$$
depends on $\x_i$; in the stochastic equivalent we would like to have only $x_0 \sim P_X$, as is the case here, but we also have $\x_i \sim P_\xs$. Ultimately more work is required to fully understand this last difference.

Nonetheless, "extrapolation from the past" can be applied in the batch case. The new update rule is:
$\bar{x}^{k+1}=\Pi_X(x^k - \gamma \Phi(\bar {x}^k))$
$x^{k+1}=\Pi_X(x^k - \gamma \Phi(\bar x^{k+1}))$
At each iteration $\bar x^{k+1}$ becomes the next $\bar x^{k}$.

<!-- >One issue with extrapolation is that the algorithm "wastes" a gradient. Indeed we need to compute the gradient at two different positions for every single update of the parameters. \citep{popov1980modification} proposed a similar technique that only requires a single gradient computation per update. The idea is to store and re-use the extrapolated gradient for the extrapolation: \begin{align} \text{Extrapolation from the past:} \; \;&\omega_{t+1/2} = P_\Omega[\omega_t - \eta F(\omega_{t-1/2})] \quad \\ \text{Perform update step:} \quad &\omega_{t+1} = P_\Omega[\omega_t - \eta F(\omega_{t+1/2})]\;\; \text{and store:} \; \;F(\omega_{t+1/2}) \end{align} >A similar update scheme was proposed by \citet[Alg. 1]{chiang2012online} in the context of online convex optimization and generalized by~\citet{rakhlin2013online} for general online learning.
Without projection, \eqref{eq:extrapolation_past} and~\eqref{eq:update_past} reduce to the optimistic mirror descent described by~\citet{daskalakis2017training}: \begin{equation} \text{Optimistic mirror descent (OMD):} \quad \omega_{t+1/2} = \omega_{t-1/2} - 2\eta F(\omega_{t-1/2}) + \eta F(\omega_{t-3/2}) \end{equation} OMD was proposed with similar motivation as ours, namely tackling oscillations due to the game formulation in GAN training, but with an online learning perspective. Using the VIP point of view, we are able to prove a linear convergence rate for extrapolation from the past. We also provide results on the averaged iterate for a stochastic version in \S\. In comparison to the convergence results from~\citet{daskalakis2017training} that hold for a bilinear objective, we provide a faster convergence rate (linear vs sublinear) on the last iterate for a general (strongly monotone) operator $F$ and any projection on a convex $\Omega$. One thing to notice is that the operator of a bilinear objective is \emph{not} strongly monotone, but in that case one can use the standard extrapolation method~\eqref{eq:extrapolation} which converges linearly for an unconstrained bilinear game~\citep[Cor.~3.3]{tseng1995linear} -->

</details>

<details open>
<summary>Brain (3/5)</summary>

# Credit assignment in the brain

These methods depart from the optimization view, and often from accurate optimization objectives (which are a proxy for the real generalization objective anyway), in favor of heuristically chosen auxiliary objectives.
==TODO: something from blake lessons==

> We are not sure the brain has such a system, but there are ideas about how something loosely similar to backpropagation could be implemented in the brain. The brain cannot have a global gradient, as the neurons would have to store both forward (primal) and backward (tangent) values.

## Target representations
review target prop https://arxiv.org/pdf/2006.14331.pdf

### :heavy_check_mark:[Krogh, 1989]
<!-- >Target propagation treats hidden states as explicit variables to be optimized over. -->

It solves:
$$\liftedmin \lagrangian$$
with $L=1$, where $\lambda$ is a fixed constant. Krogh et al. also show how to impose constraints on the hidden representations, to control what is learned, by adding additional terms dependent on the activation values $\x$.

### :heavy_check_mark:[Rohwer, 1990]

The moving-targets algorithm aims to increase learning speed with a per-layer cost function, allowing the hidden variables to solve non-stationary local problems. It differs from [Krogh 1989] in the parametrization of the $f_i(\x, \w)$ transfer functions.
<!-- >The moving target algorithm has two phases: >1. the hidden units' targets $a_i$ are improved to minimize the loss >2. the weights $w_i$ are improved to output $a_i$ >The error has two terms: $D(x_{t+1}, f(x_t))$ and $D(y, X_T)$ (not sure) This method decouples temporally distant weights but requires the targets $x_t$ to be stored. The primary disadvantage of this technique is that the $T\times \dim(x_0)$ intermediate states for each input $x_0$ have to be learned. -->

### :white_check_mark:[Lee et al., 2015a]

Deeply Supervised Networks (DSN) [\cite] consider multiple losses as a joint loss and compute gradients of the joint loss:
$$\classicalmin \sum_{i=1}^L \lambda_i \ell(f_i(..f_0(x_0; \w_0)..,\w_i), y)$$
where $\lambda_i$ is now an annealed constant.
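To make the joint objective concrete, here is a minimal PyTorch sketch of a deeply-supervised loss in the spirit of the formula above: each hidden layer gets its own linear readout and the per-depth losses are summed with weights $\lambda_i$. The layer sizes, the annealing schedule and all names are illustrative assumptions of mine, not the paper's configuration.

```python
# Minimal deeply-supervised joint loss: sum_i lam[i] * loss(readout_i(a_i), y).
import torch
import torch.nn as nn

layers = nn.ModuleList([nn.Sequential(nn.Linear(16, 32), nn.Tanh()),
                        nn.Sequential(nn.Linear(32, 32), nn.Tanh()),
                        nn.Linear(32, 10)])
readouts = nn.ModuleList([nn.Linear(32, 10), nn.Linear(32, 10), nn.Identity()])
criterion = nn.CrossEntropyLoss()

def dsn_loss(x, y, lam):
    """Weighted sum of per-depth losses, with a_i the activation after layer i."""
    loss, a = 0.0, x
    for layer, readout, w in zip(layers, readouts, lam):
        a = layer(a)
        loss = loss + w * criterion(readout(a), y)
    return loss

x = torch.randn(8, 16)
y = torch.randint(0, 10, (8,))
lam = [0.3, 0.3, 1.0]              # auxiliary weights, annealed towards [0, 0, 1] over training
dsn_loss(x, y, lam).backward()     # gradients of the joint loss reach every layer
```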
>Could show the gradients being lower triangular like the backprop ones; this means that the lower parts of the triangle are helping shape the gradients.

Similarly to [Rohwer and Krogh], this approach aims to provide proxy targets for the hidden layers.
<!-- >Deeply Supervised Network (DSN) (Lee et al., 2015) consider multiple losses as a joint loss and sum up the gradients from relevant losses into a joint one in backpropagation (BP). In DSN(Leeetal.,2015),each convolutionlayer is associated with a classifier. To avoid the training difficulty, DSN keeps the losses for a number of epochs and discard all but the final loss to finish the remaininge pochs. ![](https://i.imgur.com/uasnMvp.png) > DSN (Lee et al., 2015),considers a weighted sum of multiple losses as a joint one andupdates the model parameters with the joint gradients. >These terms are combined using weighting hyper parameter α, similar to the method discussed in Lee et al. (2014), Szegedy et al. The DSN had linear auxiliary softmax classifiers after the first and second pooling layers and α was decayed as proposed in Lee et al. (2014). The finetuning network’s weights were initialized using those of the CIFAR-10 teacher network and a linear readout was added. These terms are combined using weighting hyper parameter α, similar to the method discussed in Lee et al. (2014), Szegedy et al. (2015), and Wang et al. (2015). Similar to the method used in Lee et al. (2014), the parameters for the layers in the main network directly connected to auxiliary networks were updated using a linear combination of the backpropagated gradients from later layers and the auxiliary network. These terms are combined using weighting hyper parameter α, similar to the method discussed in Lee et al. (2014), Szegedy et al. (2015), and Wang et al. (2015). In RDL, The DSN had linear auxiliary softmax classifiers after the first and second pooling layers and α was decayed as proposed in Lee et al. (2014). As in Lee et al. (2014), Szegedy et al. (2015), andWang et al. (2015), we utilize auxiliary error functions to train internal layers directly in conjunction with the error from the output layer found via backpropagation. These terms are combined using weighting hyper parameter α, similar to the method discussed in Lee et al. (2014), Szegedy et al. (2015), and Wang et al. The deeply supervised network had linear auxiliary softmax classifiers placed after the max pooling layers and α was decayed using αt+ 1 = α ∗ t 0.1 ∗(1 − t/tmax), as proposed in Lee et al. (2014). -->

### :white_check_mark:[Lee et al., 2015b]
==TODO: in the paper they use $x$ everywhere but I'm not sure if they actually mean $\x_{i-1}$==

Global gradients require knowledge of the full network, which is not biologically plausible. [Lee 2015b] avoids them and instead relies on target values $a_i$ and approximate inverse dynamics $g_i$. This approach is similar to the collocation methods of \cite{Biegler}, where we have access to approximate dynamics. We now have local loss functions $||\hi||^2$ for the intermediate layers and a standard last-layer loss $\lastcost$.
<!-- The last layer is trained by standard gradient descent with gradient: $\hat\x_M = \x_M-\eta \frac{\partial \lastcost}{\partial \x_M}$-->

### Optimization

The last layer is trained by standard gradient descent. The interesting part of this approach is the optimization algorithm for the intermediate layers.
We use an approximate inverse $g_i(\cdot)$ where:
$f_i(g_i(\x_{i+1}))\approx \x_{i+1}$
$g_i(f_i(\x_i))\approx \x_i$
The target values are updated as:
$\hat{\x}_i = \x_i + g_i(\hat\x_{i+1}) - g_i(\x_{i+1})$
We can now update the weights of $g_i$ by minimizing:
$$\underset{W_g}{\operatorname{argmin}}||g_{i+1}(f_{i+1}(\x_i + \epsilon); W_g)- (\x_i + \epsilon)||^2$$
The $\epsilon$ term ensures the backward map is also estimated in a neighbourhood of $\x_i$. And $f_{i+1}$ by minimizing:
$$\underset{W_f}{\operatorname{argmin}}||f_{i+1}(\x_i)- \hat\x_{i+1}||^2$$

#### Models for forward and backward dynamics

DTP uses autoencoders to model the $f$ and $g$ functions. ==More?== Section 2.4 shows how to get the updates, idk if I should add it. The update direction of target prop does not deviate by more than 90 degrees from the gradient direction (the $\cos(\alpha)$ proof).
<!-- >compute targets rather than gradients using a local denoising auto-encoder at each layer and building both feedforward and feedback pathways. >More specifically, instead of gradients, feedback learning signals can be target values > Difference target-propagation [17] backpropagates targetvalues instead of error gradients. The target value is meant to be close to the activationvalue while being likely to have provided a smaller loss. With a reasonably good targetat hand, a layer-local training criterion can be applied to update each layer separately >[Source](https://qdata.github.io/deep2Read//MoreTalksTeam/Un17/Muthu-OptmTarget.pdf) Vanishing/exploding gradient problem in back prop Layer-local training criterion for updating each layer separately Instead of propagating error signals, assign each hidden layer a nearby value (h_i carrot) that leads to lower loss ![](https://i.imgur.com/gQrNM7Z.png) Define layer-local target loss and update using SGD○Avoids vanishing/exploding gradient by computing derivatives for only a single layer Use “approximate inverse” to define intermediate targets ![](https://i.imgur.com/7jqZYB2.png) now make this approx inverse an autoencoder: Train inverse mapping via additional autoencoder loss○Modify loss with noise injection so inverse corresponds to neighborhood -->
<!-- ![](https://i.imgur.com/OAAbaBv.png) ![](https://i.imgur.com/qEw8T4o.png) ![](https://i.imgur.com/vfInCE7.png) If input of ith layer becomes target, output of ith layer also gets closer to target ![](https://i.imgur.com/XlGJJiu.png) ![](https://i.imgur.com/4mrPnyZ.png) ![](https://i.imgur.com/bu4g0lJ.png) -->

### [Lillicrap et al., 2020]

[Lillicrap 2020] investigates the biological plausibility of target-propagation approaches and tries to establish a framework called "Neural Gradient Representation by Activity Differences" (NGRAD). They also introduce the NGRAD hypothesis, which suggests the human cortex uses an NGRAD-based method to approximate gradient descent. The main contribution of this paper is the suggestion that all these bio-inspired algorithms (which ones?) approximate backprop by implicitly representing gradients in terms of (spatial or temporal) differences in neural activity.
<!-- looks at how learning rules could be implemented in biological micro-circuits Lillicrap2020 [20] adds a feedback network as approximation for the learning signal.
with several recent studies focusing on one issue in particular, the fact that error is fed back using an exact copy of the forward weights (the “weight transport” or “weight symmetry” problem, Lillicrap et al One of the most promising ideas of how backpropagation-like algorithms could be implemented in the brain is based on using temporal difference in neuronal activity to approximate top-down error signals Given the autoencoder view of the inverse network, this work starts by training the network by local updates (as before) and later we can see this process as given that $g(\cdot) \approx f^{-1}(\cdot)$ the inverse approximation error is: $a = g(f(x_{t-1}, \theta_f)) - x_{t-1}$ The propagated target for the layer is: $b = g(x_{t}, \theta_g)$ Considering the inversion error the corrected target is: $a + b = $ ![](https://i.imgur.com/f3k763k.png) --> </details>