--- tags: ZZArchived --- # Comprehensive exam ==Before delivery: proper formalisms all over the place, e.g. F -> F(a,b;c)== ==h instead of a?== <details closed> <summary>Define</summary> $\newcommand{\lagrgrad}{\begin{bmatrix} \nabla_\theta L \\ -H(\theta) \end{bmatrix}}$ $\newcommand{\exs}{\mathcal{X}}$ $\newcommand{\vip}{VI(\exs, F)}$ $\newcommand{\R}{\mathbb{R}}$ $\newcommand{\w}{w}$ $\newcommand{\ws}{W}$ $\newcommand{\D}{\nabla}$ $\newcommand{\cost}{\ell}$ $\newcommand{\y}{y}$ $\newcommand{\f}{f}$ $\newcommand{\x}{a}$ $\newcommand{\xs}{\mathbf{a}}$ $\newcommand{\p}{\theta}$ $\newcommand{\pspace}{\Theta}$ $\newcommand{\lagrp}{\begin{bmatrix} \p \\ \lambda \end{bmatrix}}$ $\newcommand{\lastcost}{\cost(\f_{L-1}(\x_{L-1}, w_{L-1}), \y)}$ $\newcommand{\hzero}{\x_{1} - \f_0(x_0, w_0)}$ $\newcommand{\hi}{\x_{i+1} - \f_i(\x_i, w_i)}$ $\newcommand{\hlast}{\x_{L-1} - \f_{L-2}(\x_{L-2}, w_{L-2})}$ $\newcommand{\tarhzero}{\hat\x_{1} = \f_i(x_0; w_0)}$ $\newcommand{\tarhi}{\hat\x_{i+1} = \f_i(\x_i; w_i)}$ $\newcommand{\tarhlast}{\hat\x_{L} = \f_{L-1}(\x_{L-1}, w_{L-1})}$ $\newcommand{\opthzero}{\x_{1} = \f_i(x_0; w_0)}$ $\newcommand{\opthi}{\x_{i+1} = \f_i(\x_i; w_i)}$ $\newcommand{\opthlast}{\x_{L} = \f_{L-1}(\x_{L-1}, w_{L-1})}$ $\newcommand{\sh}{\sum_{i=0}^{L-2} \hi}$ $\newcommand{\sqh}{||\sh||^2}$ $\newcommand{\forwardprop}{\f_{L-1}(\dots\f_i(\dots\f_0(x_0; \w_0)\dots;\w_i)\dots;\w_L)}$ $\newcommand{\bptarget}{\cost(\forwardprop, \y)}$ $\newcommand{\liftedmin}{\underset{(\ws, \xs)}{\operatorname{argmin}}}$ $\newcommand{\classicalmin}{\underset{\ws}{\operatorname{argmin}}}$ $$\newcommand{\bpprob}{ \begin{align} & \classicalmin & \bptarget \\ & \operatorname{given} & x=\x_0 \\ & \end{align}}$$ $$\newcommand{\bpprobb}{ \begin{align} & \classicalmin & \lastcost \\ & \operatorname{given} & \opthi\\ & \operatorname{given} & x=\x_0 \\ & \end{align}}$$ $$\newcommand{\constrprob}{ \begin{align} & \liftedmin \lastcost \\ & \text{s.t.} & \opthi\\ & \operatorname{given} & x_0=\x_0 \\ & \forall i \in {0,...,(L-1)} \end{align}}$$ $$\newcommand{\collocationprob}{ \begin{align} & \underset{\xs}{\operatorname{argmin}} & \lastcost \\ & \text{s.t.} & \phi(\x_i, \x_{i+1}) = \w_i\\ & \operatorname{given} & x=\x_0 \\ & \forall i \in {0,...,(L-1)} \end{align}}$$ $\newcommand{\lagrangian}{L(x_0;\w,\x,\lambda) = \lastcost + \lambda^T H(x_0; \w, \x)}$ $\newcommand{\old_lagrangian}{L(\p, \lambda)=\lastcost+\lambda\sh}$ </details> I am an x-Facebook employee. In the early days I used to be proud wearing Facebook swag. Later it bacame more and more an issue. People started making comments on feature X or policy Y they didn't like. I eventually didn't wear any swag in public, this is when I've realized I need to leave the company. <details closed> <summary>Guidelines</summary> ###### [the review report should be concise, but clear, and is typically between 12 pages and 15 pages in a single-spaced, 12 point font.](https://www.cs.mcgill.ca/academic/graduate/phd/) ![](https://i.imgur.com/UAluMCo.png) https://www.brainpickings.org/2015/05/27/william-zinsser-on-writing-well-science/ ### in BVP we also want to set the boundary gradient to zero, does it add a constraint? ## Review suggestion ### Fact1: The main concept [Lecunn] ### Fact2: splitting training ### Fact3: Solving ConstrNN ### Fact4: BioView Doina suggestions: - Define stuff clearly - Unifiy notation - Lookup domain of expertise of the committee - committe goal: nudge student in PhD direction lpha ## Constrained Optimization as an Alternative to Back-Propagation. 
Manuel Del Verme, Prof. Pierre-Luc Bacon and Prof. Doina Precup -DATE-

https://www.sciencedirect.com/science/article/abs/pii/0041555379900661
http://www.tricki.org/article/Dyadic_decomposition

PL version /home/esac/research/PhD Syllabus/comp.pdf

### Reviews
[Gori96](https://sci-hub.do/10.1016/0925-2312(95)00032-1)
[GoriBook](https://ipfs.io/ipfs/bafykbzaceampi5jlsi7hzx5n7ztqkvgswiixeblb2feota77wjlgdzpjmrj4u?filename=Marco%20Gori%20-%20Machine%20Learning.%20A%20Constraint-based%20Approach-Morgan%20Kaufmann%20_%20Elsevier%20%282018%29.pdf)
http://www.tricki.org/article/Divide_and_conquer

<!-- ODE splitting http://www.gicas.uji.es/Fernando/Other/enci-sp5.pdf -->

</details>

<details open>
<summary>Intro (0/1)</summary>

## Divide et Impera

Decomposing a problem into smaller, more manageable pieces that are solved independently and then merging the solutions has been a common motif behind many fundamental advancements (search, ODEs, Kepler, etc.). In this survey I present several approaches that divide the parameter-estimation problem for neural networks, classically solved by backpropagation, into simpler sub-problems.

## Feed-forward neural networks

With this goal of subdivision in mind, we now look at a feed-forward neural network
$$\forwardprop$$
and its loss function: $\cost(\w, x_0, y) = \bptarget$

TODO: talk about bias

We can specify the above transfer function layer-wise as:
$$\x_i = \f_i(\x_{i-1}; \w_{i-1}) \enspace \forall i \in (0\dots {L-1})$$

We can specify the network activations:
\begin{align}
\begin{cases}
\opthzero \\
...\\
\opthi \\
...\\
\opthlast\\
\end{cases}
\end{align}
\label{forwardprop}

## Backpropagation

Its loss Jacobian is
$\newcommand{\dfdw}[2]{\frac{\partial \f_{#1}(x_0; \w)}{\partial \w_{#2}}}$
\begin{align*}
J(\w, x_0) = \begin{bmatrix}
\dfdw{0}{0} & 0 & \dots &\\
\dfdw{1}{0} & \dfdw{1}{1} & \dots &\\
\dfdw{2}{0} & \dfdw{2}{1} & \dfdw{2}{2} & \dots &\\
\vdots && \ddots \\
\dfdw{L-1}{0} & ... & ... & ... & \dfdw{L-1}{L-1}
\end{bmatrix}
\end{align*}

TODO: say something about how the last columns are "nice" while the ones on the left are not, because of vanishing/exploding gradients.

<!-- compare with https://www.jasonosajima.com/backprop -->

# Constrained Formulation [LeCun, 1988]

[Lecun88] shows how to cleanly cast the unconstrained learning problem solved by backpropagation into a Lagrangian formulation. We can now start decoupling **(right word?)** the network by adding extra decision variables $\hat \x_i \ \forall i \in (1..L-1)$; we now have two sets of variables and a set of constraints.

We can specify the network activations:
$$
\left\{
\begin{array}{ll}
\tarhzero \\
... \\
\tarhi \\
... \\
\tarhlast \\
\end{array}
\right.
\text{subject to}
\left\{
\begin{array}{ll}
\hzero =0 \\
... \\
\hi =0\\
... \\
\hlast =0\\
\end{array}
\\
\right.
$$

[\ref forwardprop] can now be expressed in vector form:
\begin{align*}
\hat \x = F(x_0; \w, \x) = \begin{bmatrix}
\f_0(x_0; \w_0) \\
\vdots \\
\f_{L-1}(\w_{L-1}, \x_{L-1})\\
\end{bmatrix}
\end{align*}

And the constraints:
\begin{align*}
H(x_0; \w, \x) = \begin{bmatrix}
\hzero\\
\x_2 - \f_1(\x_1, \w_1)\\
\vdots\\
\hlast\\
0\\
\end{bmatrix}
\end{align*}

\begin{align*}
\D_\x H(x_0; \w, \x) = \begin{bmatrix}
1 & ... & &\\
-\D_{\x_1} \f_1(\x_1, \w_1)& 1 & ... & &\\
& \ddots &&\\
&-\D_{\x_{L-2}} \f_{L-2}(\x_{L-2}, \w_{L-2})& 1 \\
& & -\D_{\x_{L-1}} \f_{L-1}(\x_{L-1}, \w_{L-1})\\
& & 0\\
\end{bmatrix}
\end{align*}

**LeCun adds $a_L$ as a final state and then constrains it; that makes no sense, so I am not doing that. The 0 row is temporary, to remind me about this.**

With Lagrangian:
$$\lagrangian$$
<!-- \frac{\partial{1}}{\partial -->

We want to find ($\w, \x, \lambda$) s.t.
$$\nabla L(x_0; \w, \x, \lambda)=\boldsymbol{0}$$
\begin{align}
\D_\lambda L = 0\\
\D_\x L = 0\\
\D_\w L = 0\\
\end{align}

The interesting thing is that these three conditions yield, in order:
1. Forward propagation
1. The gradients (backward pass)
1. Optimality conditions for $\w$

Recovering forward propagation from $\D_\lambda L = H = 0$ is trivial. $\D_\x L = 0$ is more interesting
<!-- as for the last layer we get $D_\x \bigg[\lastcost + \lambda^T (\hlast)\bigg]$ -->
$$\D_\x L = \D_\x \ell + \lambda^T \D_\x H$$

For $\x_{L-1}$ we have:
$D_{\x_{L-1}}L=D_\x \lastcost + \lambda_{L-1} = 0$
$\lambda_{L-1} = -D_\x \lastcost$
which is the same gradient w.r.t. the last layer as in classical backpropagation.

For the previous layers we have: ==**something wrong somewhere, idk**==
$\lambda_{L-2}[\hlast]$
$D_{\x_{L-2}} L=\lambda_{L-2}( - D_{\x_{L-2}}f) = 0$
$\lambda_{L-2} D_{\x_{L-2}}(0 - D_{\x_{L-2}}f) = 0$

1) $J_aL$ blablalba
3) omg this is costates
3) omg^2 gradients are costates

</details>

<details open>
<summary>Hacks (1/3)</summary>

## Finding a solution, the hacks

While [Lecun88] sets up the problem and identifies sufficient conditions, it does not provide a way to solve the optimization problem.

### :heavy_check_mark: [Carreira-Perpiñán and Wang, 2014]

[CPnW] introduces a mathematical formulation (meta-algorithm) named *MAC* to divide the training of a general neural network into sub-problems. This is done by a layer-wise optimization introducing the "method of auxiliary coordinates" (MAC), where they learn $\w$ using the network activations $\x$ as "auxiliary coordinates". The search itself is done by using a penalty-based method over $(\w, \x)$.

They turn the original problem:
$$\bpprob$$
into a constrained equivalent one with auxiliary variables $\x_i$ as usual
$$\constrprob$$

This problem can now be solved via the quadratic penalty method, where the problem becomes unconstrained again:
\begin{align}
& L(\x, \w, \lambda) = \lastcost + \lambda_0 \sqh\\
& \operatorname{given} x=\x_0 \\
\end{align}
for a fixed $\lambda_0$; a sequence of such problems is then solved for $\lambda_0 \to \infty$ by alternating the minimization over $\w_i$ and $\x_i$.

### $\w$-step
The minimization over $\w_i$ is now equivalent to minimizing $L$ single-layer problems of the form:
$$ \min_{\w_{L-1}} \enspace \lastcost$$
and
$$ \min_{\w_i} \enspace \lambda || \hi ||^2 $$

### $\x$-step
The minimization over $\xs$ instead is:
$$ \min_{\xs} \enspace \lastcost + \lambda \sqh $$
which is a “generalized” proximal operator.

The problem is now optimized sequentially, $\x_i$ then $\w_i$, by Gauss-Newton for both $\ws$ and $\xs$. Once the problem (ref{L(a, w, lambda}) is considered solved, the auxiliary variables $\xs$ are set to $\opthi$ to guarantee feasibility.
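To make the alternating scheme concrete, below is a minimal NumPy sketch of the quadratic-penalty idea on a toy two-layer network. It uses plain gradient steps where the paper uses Gauss-Newton solves, and all names (`W0`, `W1`, `A1`, `mu`) are mine; read it as an illustration of the structure, not as the authors' algorithm.

```python
# Minimal sketch of a MAC-style quadratic-penalty scheme on a toy network
# a1 = tanh(W0 x), y_hat = W1 a1 with squared loss. Illustrative only.
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(50, 4))            # inputs  (N x d0)
Y = rng.normal(size=(50, 2))            # targets (N x d2)

W0 = rng.normal(scale=0.1, size=(4, 8))
W1 = rng.normal(scale=0.1, size=(8, 2))
A1 = np.tanh(X @ W0)                    # auxiliary coordinates, initialised by a forward pass

def penalty_objective(W0, W1, A1, mu):
    """Terminal loss + quadratic penalty on the broken constraint a1 = f0(x; W0)."""
    loss = 0.5 * np.sum((A1 @ W1 - Y) ** 2)
    constraint = A1 - np.tanh(X @ W0)
    return loss + mu * np.sum(constraint ** 2)

step = 1e-3
for mu in [1.0, 10.0, 100.0]:           # drive the penalty weight towards infinity
    for _ in range(200):
        # x-step: gradient step on the auxiliary activations A1
        g_A1 = (A1 @ W1 - Y) @ W1.T + 2 * mu * (A1 - np.tanh(X @ W0))
        A1 -= step * g_A1
        # w-step: each weight block sees an independent, single-layer problem
        pre = X @ W0
        g_W0 = -2 * mu * X.T @ ((A1 - np.tanh(pre)) * (1 - np.tanh(pre) ** 2))
        g_W1 = A1.T @ (A1 @ W1 - Y)
        W0 -= step * g_W0
        W1 -= step * g_W1
    print(mu, penalty_objective(W0, W1, A1, mu))

A1 = np.tanh(X @ W0)                    # post-processing: restore feasibility
```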
<!-- NOTE: K=Layers, N=Dataset Samples -->
<!-- MORE: MAC-constrained and nested problem equivalence (theorem A.1) ![From the paper](https://i.imgur.com/GRWsSy8.png) KKT conditions appendix:A.3 TODO Convergence of QuadraticPenalty TODO appendix:B -->

### [Betti et al., 2018]
notes: $\alpha_{\kappa, i}$ is a regularization factor

[Betti18] takes the [Lecun88] stationary conditions and defines a search approach in the $(w, x, \lambda)$ space by gradient descent ascent (GDA)
\begin{align}
& \underset{(w, x)}{\operatorname{argmin}} & \ell(x_N, y) + \sum_{i=0}^{N-1}?(x_i, w_i) \\
& \text{subject to} & x_{i+1}=f_i(x_i, u_i) \\
&\operatorname{given} & x_0 \\
\end{align}
with $x=(x_1, x_2, .. x_N), u=(u_0, u_1, .. , u_{N-1})$, and its Lagrangian
$$L(w, x, \lambda)=\ell(w, x)+\lambda \sum_{i=0}^{N-1}(x_{i+1} - f_i(x_i, u_i))$$

They now suggest to:
1. GDA over $x, \lambda$ ($l \times n$ times)
2. GD over $w$ ($n \times pa(i)$ times)
3. loop back

### [Gotmare et al., 2018]

This work introduces BlockProp, an algorithm that splits the unconstrained optimization problem into a constrained problem over blocks of one or more layers. The paper then recognizes that
$$\bpprob$$
if we consider $b_i$ a sequence of layers $f_i$, $b_i=\f_i(\w_i, \f(\dots(\f_j(\w_j,\x_j))))$, and $\w_{i:j}$ the weights pertaining to those layers $i..j$, can be reformulated as:
$\newcommand{\f}{b}$
$$\constrprob$$

As far as I know they are the first to propose a stochastic view of the constrained problem; the loss is now
$\newcommand{\x}{a^{[n]}}$
$$L(\x, \w, \lambda)= \lastcost + \sum_{i=0}^{L-1} \lambda_i ||\hi||^2$$
which is minimized by alternating optimization over $\xs$ then $\ws$. **TODO**: page 3. The optimization w.r.t. $\xs$ is done using SGD, then the optimization w.r.t. $\ws$ using SGD again (albeit with a different number of steps).
$\newcommand{\x}{a}$ $\newcommand{\f}{f}$

</details>

<details open>
<summary>Lagrangian (1.5/2)</summary>

## Finding a solution, borrowing from control systems.
<!-- Up to now all the methods formulate the constrained problem and then try to solve a minimization problem with the penalty approach. Here we are going to look at the field of control and how they solve control problems with long time horizon, the parallel between discrete control time horizon and long term dependencies in neural network comes from the view of a neural network as a multistage system where the signal propagates from the input layers to the output in a discrete time way. ### [Kalman, 2009] [Kalman, 2009] notes that we can find solution of the lagrangian as a saddle point optimization problem (https://sci-hub.do/https://doi.org/10.2307/1905259, would have been a better paper on the topic and https://www.rand.org/content/dam/rand/pubs/papers/2009/P223.pdf too) **ALTERNATIVE PAPER: harrow hurwitz 1952 algorithm** Using $\theta=(X, W)$ $$ \max_{\lambda} \min_{\theta} {L(\theta, \lambda)}\\ \operatorname{given} x_0 $$ Alas, in the general case and in particular in the case of deep neural networks we can not rely on the convex-concave assumption, this will mean we have to find locally optimal saddle points instead of extrema of the lagrangian and hence of the loss function **(important point)** ![](https://i.imgur.com/7UcTFMN.png) [kalman paper] that formulation is a minmax, we can solve minmax by looking at VIs and non linear problems.
-->

### Discrete time optimal control
[DCOPT](/E-oP3XSDTN-9QNsjDWL7DQ)

To better understand why this constrained optimization approach might be useful, we can look at fields trying to solve the same problem of long-term dependencies and coming up with similar solutions. The problem of minimizing a final cost function (and optionally a layer-wise one too) is found in DCOPT, where we consider the problem of finding a sequence of states $\x_i$ (usually $x$) and controls $\w_i$ (usually $u$) that minimizes the cost subject to the realizable dynamics. For example, for a robot we might want to fit a planned trajectory from $x_0$ to $y$ constrained by the laws of motion but avoiding obstacles.

The most general formulation (in discrete time) is:
\begin{align}
& \underset{(\w, \x)}{\operatorname{argmin}} & \lastcost + \sum_{i=0}^{L-1}g_i(\x_i, \w_i) \\
& \text{subject to} & \x_{i+1}=f_i(\x_i, \w_i) \\
&\operatorname{given} & x_0=\x_0 \\
\end{align}
Setting $g_{i<N}=0$, I will keep ignoring the $g_{i<N}$ costs, because ==[Bertsekas, Optimal Control vol. I, sec. 3.4]== tells us we can reduce any Bolza problem (running plus terminal cost) to a Mayer problem (terminal cost only) by propagating the running costs forward through an augmented state. With $g_{i<N}=0$ we recover the original formulation of [lecun88].

#### Hard Constraints (Backprop)
$$\bpprobb$$
In this setting classical backpropagation is akin to forward shooting, where we integrate the problem dynamics. Here we want to find the optimal set of weights (actions) that achieves the minimum cost.

![](https://i.imgur.com/J5oVDU2.png =300x)
(source, Pieter Abbeel)

As shown in the image above, any obstacle in the optimization space can increase optimization complexity, since the dynamics of the controlled system define the search space. The gradients estimated by backpropagation guarantee optimization paths contained in the feasible region, meaning we have to go through the narrow passage.

Known problems of forward shooting methods are:
- Brittle optimization trajectories
- Prone to local minima
- Strong dependency on initial guess
- Requires initial guess from demonstrations and randomization

<!-- ![](https://i.imgur.com/pL3ZwHy.png) P. Abbeal, problems in robotics -->

We have the same problems in supervised learning:
- ?
- Vanishing grads (deep), local minima (shallow)
- Reliance on engineered initializations
- Transfer learning aka good initial guesses

#### Soft Constraints (Lagrangian BackProp)
![](https://i.imgur.com/GRQqaMG.png =300x)

Lagrangian methods only softly enforce the constraints, allowing the optimization path to converge more robustly, and with fewer iterations, to the optimal solution.

## Collocation methods
<!-- ![](https://i.imgur.com/MdWtJnt.png) ![](https://i.imgur.com/MdWtJnt.png) Update rules: Backprop chain rule vs splitted backprop With invertible networks, work like bio inspired ones later explained -->

In collocation methods we add the state variables $\x_i$ as optimization variables:
$$\constrprob$$
which is identical to the original constrained formulation. One of the fundamental methods in collocation is direct collocation, where the problem takes the form:
$$\collocationprob$$
We are now trying to find the best activation sequence (trajectory) for our network, and we let $\phi$, representing an inverse layer dynamics, define the appropriate weights implicitly; we are going to see more of this in the bio-inspired algorithms.
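To make the role of $\phi$ concrete, here is a small NumPy sketch under the (strong) simplifying assumption of linear layers, where the inverse layer dynamics has a closed form: once an activation trajectory is fixed, the weights are implied by a least-squares solve. All names are mine and this is only an illustration of the idea, not an algorithm from the papers discussed here.

```python
# Toy collocation-style view with *linear* layers: fix an activation trajectory
# a0 -> a1 -> a2 and recover each weight matrix from the inverse dynamics phi.
import numpy as np

rng = np.random.default_rng(1)
A0 = rng.normal(size=(50, 4))           # input activations (fixed data)
A1 = rng.normal(size=(50, 8))           # a candidate hidden trajectory (decision variable)
A2 = rng.normal(size=(50, 2))           # desired outputs / targets

def phi(a_in, a_out):
    """Inverse layer dynamics for a linear layer: the W minimising ||a_in W - a_out||^2."""
    return np.linalg.lstsq(a_in, a_out, rcond=None)[0]

W0 = phi(A0, A1)                        # weights are implied by the chosen trajectory
W1 = phi(A1, A2)

residual = np.linalg.norm(A0 @ W0 - A1) + np.linalg.norm(A1 @ W1 - A2)
print("constraint violation of the implied weights:", residual)
```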
<!-- ![](https://i.imgur.com/8MdcPGs.png) In this new DTOCP view of backprop we are now talking about optimizing trajectories, *direct collocation* > discretizes the trajectory optimization problem itself, typically converting the original trajectory optimization problem into a nonlinear program (see section 1.6). This conversion process is known as transcription and it is why some people refer to direct collocation methods as direct transcription methods ![](https://i.imgur.com/wQbh4hr.png) Collocation problem for a continuous NN ![](https://i.imgur.com/7hBgiSs.png) policy(closed loop) vs trajectory(open loop) -->

### :heavy_check_mark: Direct collocation by sequential + orthogonal [Biegler, 1984]
==TODO: this is a global (single shooting) method.. right? clarify and make clear.==
<!-- I think the point is that we split on t but then fit an approx global solution (no local error), with zero error at selected ts. -->

[Biegler, 1984] uses orthogonal collocation: it defines an appropriate class of interpolating polynomials (Lagrange polynomials) and fits them at splitting points.

1. It defines a set of collocation points $\x_i$; this is equivalent to choosing the times $t$ at which the trajectory is pinned down. ![](https://i.imgur.com/txvjXbG.png) In the image above $t=\{0, 2, 5, 7\}$.
2. It defines an interpolating polynomial:
<!-- ![https://www.tu-ilmenau.de/fileadmin/media/simulation/Lehre/div/Lec_Slides5.pdf](https://i.imgur.com/f7sXNJ5.png)-->
$$\hat \x_i(t)=\sum_{k=0}^N \x_k^{(i)}L_k(t), \enspace i \in [1..n]$$
$$\hat \w_j(t)=\sum_{k=0}^N \w_k^{(j)}L_k(t), \enspace j \in [1..m]$$
with $L_k$ the N-th degree Lagrange polynomial:
$$L_k(t)=\prod^N_{l=0,l\neq k} \frac{t - t_l}{t_k-t_l}$$
which guarantees that $L_k(t_l)=1 \enspace \text{if} \enspace l=k, \enspace 0$ otherwise.
<!-- ![https://www.tu-ilmenau.de/fileadmin/media/simulation/Lehre/div/Lec_Slides5.pdf](https://i.imgur.com/9TfycJ8.png) -->
<!-- >Solves the discrete optimal control problem using orthogonal collocation in a simultaneous approach >In the latter case, also the states are fully discretized based on (orthogonal) polynomials. Hence, the minimal objective value has to be found while satisfying the discretized differential equations -->

### :heavy_check_mark: Multiple shooting [Bock and Plitt, 1985]

Multiple shooting splits the problem into subproblems by defining time intervals (similar to direct collocation) and solving each problem independently. In our case each subproblem has the form of the constraint $\hi$. We now have a system of non-linear equations
\begin{array}{}
H(\x) = \begin{cases}
\hzero& = 0\\
...\\
\hi & = 0 \\
...\\
\hlast & = 0 \\
\end{cases}
\end{array}
Here the dynamics of the system are fixed, hence the $\ws$ are not variables. The above system $H(\x)=0$ can be solved by Newton's method, finding iterates $\x^k$ using the Jacobian:
\begin{align*}
H^\prime(\x) = \begin{bmatrix}
-I & 0 & ... &\\
\frac{\partial \f_{0}(\x_{0}, \w_0)}{\partial \x_0} & -I & ... &\\
0 & \frac{\partial \f_{1}(\x_{1}, \w_1)}{\partial \x_1} & -I & ... &\\
\vdots && \ddots \\
0 & ... & 0 & \frac{\partial \f_{T-1}(\x_{T-1}, \w_{T-1})}{\partial \x_{T-1}} & -I
\end{bmatrix}
\end{align*}
Since $H$ splits by interval and $H^\prime$ is block-banded (lower block-bidiagonal), the interval subproblems can be evaluated independently, allowing for parallel computation.
==TODO: compare with the first section matrices==
<!-- > Here the time domain is partitioned into smaller time elements and the DAE models are integrated separately in each element [9–11].
Control variables are parametrized as in the sequential approach and gradient information is obtained for both the control variables as well as the initial conditions of the state variables in each element. Finally, equality constraints are added to the NLP to link the elements and ensure that the states are continuous across each element. As with the sequential approach, inequality constraints for states and controls can be imposed directly at the grid points. For piecewise constant or linear controls this approximation is accurate enough, but path constraints for the states may not be satisfied between grid points. >The total integration range is split into a finite number of intervals on which integration of the states is continuous. The value of the control and the initial value of the states in each interval are chosen by the NLP-solver in each iteration while trying to ensure the continuity of the states between the different intervals -->

</details>

<details open>
<summary>Optimization (2/4)</summary>

It is clear that being able to solve constrained optimization problems is crucial to finding good solutions to all the problems we have seen up to now. Classical approaches have focused on limiting the hypothesis space of the approximators, restricting the search to some well-behaved set of parameters and parametrizations. To deal with modern neural networks, where strong nonlinearities have proven to be necessary, we must look at ways to solve the constrained problem at a larger scale.

Because, from the constrained-optimization point of view, both the weights $\w_i$ and the activations $\x_i$ are equally parameters, let $\p:=(\w, \x)$, the concatenation of both vectors, be the variable of interest for this part.

## VI problem formulation.
<!-- More: https://youtu.be/vkVslJW0hg0?t=562 [2.3 explains (S)VIP](https://arxiv.org/pdf/1802.10551.pdf) [3 explains batch (S)VIP](https://arxiv.org/pdf/1802.10551.pdf) Fancy pictures https://gauthiergidel.github.io/slides/MAIS2018.pdf Why not GDA ~30min https://www.youtube.com/watch?v=VbclRsNjLzE&list=PLqrJF7iQbscALPio49SSil8XDbfNAlOTO -->
<!-- >Rockafellar: Please note here we define necessary/sufficient conditions but in modern optimization we have search processes where we compute solutions with some little error so now we need general conditions that could be readily found ![](https://i.imgur.com/9BmFU4o.png) -->

[Lecun88] gives us clear stationary conditions for a solution. We want to find stationary points $(\p^\star, \lambda^\star)$. ==**TODO:** sufficient? given that our problem is **convex**, not the case here!==

In the unconstrained case we want the gradient at $(\p^\star, \lambda)$ to be zero; for the penalty-based methods we look for something of the form $||\nabla L(\p^\star, \lambda)|| = 0$ with $\lambda\rightarrow\infty$.

[Gidel19] notes that since we have constraints, we are interested in a stationary point for which the cost is non-negative in every feasible direction.
In our setting this can be expressed as:
\begin{cases}
\nabla_\theta L(\p^\star, \lambda^\star)^T(\theta-\theta^\star) & \geq 0 \\
-\nabla_\lambda L(\p^\star, \lambda^\star)^T(\lambda-\lambda^\star) & \geq 0 \\
\end{cases}
$\forall (\theta,\lambda) \in (\Theta\times\Lambda)$

We can simplify the formulation $\ref{above}$: assuming $L$ is closed convex on $\pspace, \Lambda$, we can set
$F:=(\nabla_\p L, -\nabla_\lambda L)$
$x:=(\p, \lambda)$
and have:
$F(x^\star)^T(x-x^\star)\geq0, \forall x \in \mathcal{X}$
The problem $\ref{above}$ is also called a variational inequality problem, $\vip$.

$\newcommand{\projh}{(\phi(\p_\w), \p_\w)}$
It is interesting to note that the solution lies in $(\Theta^\star, \Lambda)$ where $\Theta^{\star}=\{\p \in \Theta:\p=P_h(\p)\}$, $P_h: \p \rightarrow (\x, \w)$, $P_h(\p) = \projh$, and $\phi(\w)$ is the forward-pass function.

==**Continue here**== https://arxiv.org/pdf/1802.10551.pdf [Rockafellar VIs](/W0EUlMukSneqpRk3kKqkJg)

### Projection method

One of the first methods to solve VIs in general was proposed by $\ref{LnS}$. Given a $\vip$ and the assumptions ==TODO: the above formulation is on $\exs$, below is for a closed convex subset C of $\exs$==
- $\exs$ is a Hilbert space
- $F$ is a continuous bilinear form on $\exs$
- $C$ is a closed convex subset of $\exs$
- $f$ is an element of the dual space of $\exs$

<!-- a -> L K -> C V -> X u -> y v -> x -->

Problem: find $x^\star \in \exs$ which solves the VI(C, F)
<!-- $F(x^\star, x-x^\star) \geq \enspace f \cdot (x-x^\star), \enspace \forall x \in C$ TODO: MORE TODO: $L(x^\star, x^\star) \geq \alpha ||x||^2$ then the problem has a solution -->

[Lions and Stampacchia, 1967] looks at $L(x, x) \geq 0, \enspace \forall x \in X$
>> https://link.springer.com/chapter/10.1007/0-387-24276-7_3
[book](https://ipfs.io/ipfs/bafykbzaceaoknhfi5ly7bipdz4cnfirz6hoczxad2clfk2gapihsyujwq6oy2?filename=Giannessi%20F.%2C%20Maugeri%20A.%20-%20Variational%20Analysis%20and%20Applications%20%282005%29.pdf)

[Lions and Stampacchia, 1967] studies the existence of at least one solution of:
![](https://i.imgur.com/6M7tqBq.png)
by replacing:
![](https://i.imgur.com/eGvvgoh.png)
with:
![](https://i.imgur.com/LhAoV1s.png)

The main results of $\ref{LnS}$ can be summarized as follows, the hypotheses still being that:
1. $F(u, v)$ is bilinear continuous on $V$
1. satisfies (3)
1. $C$ is a closed and convex set of $\xs$

>> 1. the set X of all solutions of $\vip$ is a (possibly empty) closed and convex set.
>> 2. approximation of the set X by regularizations: $\forall \varepsilon > 0$, let us consider the problem which consists in finding $u_\varepsilon \in K$ such that: ![](https://i.imgur.com/XpYBFBP.png)

And it shows that the simplest method to solve $\vip$ is:
$$ \begin{equation} x_{k+1}=P_C(x_k-\alpha F(x_k)) \end{equation} $$
where $P_C$ is an orthogonal projection onto $C$ and $\alpha>0$ is a positive step size. Note that $P_C$, in the case of constrained neural networks, is not the projection onto the constraint set $P_h(\p)$ defined before, since our $\vip$ encodes the unconstrained Lagrangian optimization.

### :heavy_check_mark: Extragradient [Korpelevich, 1976]
<!-- More: https://papers.nips.cc/paper/2019/file/4625d8e31dad7d1c4c83399a6eb62f0c-Paper.pdf -->

If we look inside the function we want to optimize, we see that it has the form of a two-player $(\p, \lambda)$ game, $F:=(\nabla_\p L, -\nabla_\lambda L)$, which is prone to oscillatory behavior; we are in fact interested in saddle points. Extragradient aims to dampen this behaviour, allowing for convergence to a stationary point.
This method does not require strong monotonicity, which makes it a better candidate for neural networks. ==This line is risky==

We are interested in finding a saddle point of the Lagrangian $\lagrangian$ via the $\vip$ with: $F = [1, -1]^T\nabla L$
<!-- , even in bilinear case (as it is the Lagrangian of a linear programming problem), GDA requires strong concavity-convexity and since we are interestinted in. -->

Extragradient follows the simple two-step update rule:
$\overline{x}=P_C(x^k-\alpha F(x^k))$
$x^{k+1}=P_C(x^k-\alpha F(\overline{x}))$
In our case the Euclidean projection $P_C$ is not necessary:
$\overline{x}=x^k-\alpha F(x^k)$
$x^{k+1}=x^k-\alpha F(\overline{x})$
And, with the function of interest in mind:
$\overline{x}=x^k-\alpha \lagrgrad(x^k)$
$x^{k+1}=x^k-\alpha \lagrgrad(\overline x)$

<!-- To to simplify at the cost of ambiguity: if we look at the gradients: $$\lambda\nabla_\theta h = \lambda\nabla_\theta(\sh) = \lambda-\lambda\nabla_\theta f_i$$ $\newcommand{\badlagrgrad}{\begin{bmatrix} \nabla_\theta \lastcost + \lambda \cdot \nabla_\theta \sum f_i \\ -h \end{bmatrix}}$ $\overline{x}=x^k−\alpha \badlagrgrad(x^k)$ $x^{k+1}=x^k−\alpha \badlagrgrad(\overline x)$ -->
<!-- #### Better intuition on extrapolation. First we note that for $\eqref{eq:proj}$ to be a solution of $\eqref{VI}$ we need to have $x^\star=P_C(x^\star−\alpha F(P_C(x^\star−αF(x^∗))))$ Then we want to update $x^{k+1}=P_C(x^k−\alpha F(\underbrace{P_C(x^k−αF(x^k))}_{\bar x^{k}}))$ This gives us the two step formulation of extragradient. Bla1 Bla2 https://www.youtube.com/watch?v=UvJvE26mDio @ 31min http://dm.unife.it/~tinti/Software/Extragradient/methods/vipsegm.pdf > the extragradient method(extrapolated gradient) does not requireany averaging to converge for monotone operators (in the batch setting), and can even converge atthe fasterO(1/t)rate (Nesterov, 2007). The idea of this method is to compute a lookahead step (seeintuition onextrapolationin §3.2) in order to compute a more stable direction to follow. §3.2 ![]#(https://i.imgur.com/mMrkfkg.png) ![]#(https://i.imgur.com/HJR3ZvR.png) -->

## Large scale NLP

### Side note, not a paper to review: MirrorDescent

GD: $x^{k+1}=x^k-\eta \nabla f(x^k)$
This can be seen, at every step, as the solution of the quadratic problem
$x^{k+1}= \operatorname{argmin}_x f(x^k)+\nabla f(x^k)^T(x-x^k) + \frac{1}{2\eta}||x - x^k||^2$,
i.e. a first-order Taylor expansion plus a proximity term; since $f(x^k)$ is constant,
$x^{k+1}= \operatorname{argmin}_x \nabla f(x^k)^T(x-x^k) + \frac{1}{2\eta}||x - x^k||^2$ at every step.
The proximity term imposes a homogeneous penalty, which may not be appropriate, e.g. if the gradients along one dimension are even just linearly scaled with respect to another (ellipses, etc.).
![](https://i.imgur.com/nsHCLeF.png)
Solution: replace the proximity term $\frac{1}{2\eta}||x - x^k||^2$ with a better prior $D_\varphi(x, x^k)$, called a Bregman divergence.

### Bregman proximal, MirrorProx [Nemirovski, 2005]
<!-- More: [Slides](http://www.princeton.edu/~yc5/ele522_optimization/lectures/mirror_descent.pdf) TODO: add proof trap (slide 5-21) [Blog](https://blogs.princeton.edu/imabandit/2013/04/23/orf523-mirror-prox/) -->

Given that we can define a Bregman divergence function $D_\varphi(x, x^k)$, MirrorProx can be seen as a generalization of extragradient. To recover extragradient, let us take $D_\varphi = \frac{1}{2}\langle z, z \rangle$, the Euclidean case; we find the original extragradient:
$\bar x=\Pi_X(x^k - \gamma \Phi(x^k))$
$x^{k+1}=\Pi_X(x^k - \gamma \Phi(\bar x))$
$\Pi_X(x) = \underset{x^\prime \in X}{\operatorname{argmin}}\langle x - x^\prime, x - x^\prime \rangle$
==Does not look like it?==
Here we are trying to adjust gradient updates to fit the problem geometry via the $D_\varphi(x^k, x)$ term between iterates.
<!-- Given a smooth function $L$ and a compact set $\exs$ and a smoothness measure under arbitrary norm $||\cdot||$. ![](https://i.imgur.com/fyQT73f.png) Mirror prox takes a step of mirror descent (**what is MD?**), and then applies the same gradient let $\Phi: D\rightarrow \mathbb{R}$ be a mirror map in $X$ and let $x_1 \in \operatorname{argmin}$ Mirror prox follows an extragradient like update replacing gradient descent with mirror descent. > In words the algorithm first makes a step of Mirror Descent to go from x_t to y_{t+1}, and then it makes a similar step to obtain x_{t+1}, starting again from x_t but this time using the gradient of f evaluated at y_{t+1} (instead of x_t). The following result justifies the procedure. $D_\varphi$, the bergman divergence defines a distance encoding the problem geometry and is in the form $Dφ(x,z) :=φ(x)−φ(z)−〈∇φ(z),x−z〉$ Sources: -->

### :heavy_check_mark: Extrapolation from the past [Gidel et al., 2018]
<!-- https://arxiv.org/pdf/1802.10551.pdf -->

This paper casts the GAN problem as a VIP $\ref{VIP}$ and then introduces an "extrapolation from the past" scheme similar to optimistic mirror descent with projections. (NAMEDROP OMD w/ no explanation, remove?) While the GAN problem is not equivalent to the Lagrangian saddle point, this paper formulates GANs as VI saddle-point optimization, and as long as we restrict ourselves to the batch setting the methods are similar, as discussed before.

They also introduce a stochastic VIP formulation where, instead of the gradient $F = [1, -1]^T\nabla L$, we only have access to an unbiased estimate of it, $F(x, \xi), \xi \sim P$ with $F(x) = \mathbb{E}_{\xi \sim P} [F(x,\xi )]$. The reason why we cannot apply the stochastic formulation to constrained neural networks is that we do not have access to such an estimate: our Lagrangian
$$\lagrangian$$
depends on $\x_i$; in the stochastic equivalent we would like to have only $x_0 \sim P_X$, as is the case here, but we also have $\x_i \sim P_\xs$. Ultimately more work is required to fully understand this last difference.

Nonetheless, "extrapolation from the past" can be applied in the batch case. The new update rule is:
$\bar{x}^{k+1}=\Pi_X(x^k - \gamma \Phi(\bar {x}^k))$
$x^{k+1}=\Pi_X(x^k - \gamma \Phi(\bar x^{k+1}))$
At each iteration $\bar x^{k+1}$ becomes the next $\bar x^{k}$.

<!-- >One issue with extrapolation is that the algorithm "wastes" a gradient. Indeed we need to compute the gradient at two different positions for every single update of the parameters. \citep{popov1980modification} proposed a similar technique that only requires a single gradient computation per update. The idea is to store and re-use the extrapolated gradient for the extrapolation: \begin{align} \text{Extrapolation from the past:} \; \;&\omega_{t+1/2} = P_\Omega[\omega_t - \eta F(\omega_{t-1/2})] \quad \\ \text{Perform update step:} \quad &\omega_{t+1} = P_\Omega[\omega_t - \eta F(\omega_{t+1/2})]\;\; \text{and store:} \; \;F(\omega_{t+1/2}) \end{align} >A similar update scheme was proposed by \citet[Alg. 1]{chiang2012online} in the context of online convex optimization and generalized by~\citet{rakhlin2013online} for general online learning.
Without projection, \eqref{eq:extrapolation_past} and~\eqref{eq:update_past} reduce to the optimistic mirror descent described by~\citet{daskalakis2017training}: \begin{equation} \text{Optimistic mirror descent (OMD):} \quad \omega_{t+1/2} = \omega_{t-1/2} - 2\eta F(\omega_{t-1/2}) + \eta F(\omega_{t-3/2}) \end{equation} OMD was proposed with similar motivation as ours, namely tackling oscillations due to the game formulation in GAN training, but with an online learning perspective. Using the VIP point of view, we are able to prove a linear convergence rate for extrapolation from the past. We also provide results on the averaged iterate for a stochastic version in \S\. In comparison to the convergence results from~\citet{daskalakis2017training} that hold for a bilinear objective, we provide a faster convergence rate (linear vs sublinear) on the last iterate for a general (strongly monotone) operator $F$ and any projection on a convex $\Omega$. One thing to notice is that the operator of a bilinear objective is \emph{not} strongly monotone, but in that case one can use the standard extrapolation method~\eqref{eq:extrapolation} which converges linearly for an unconstrained bilinear game~\citep[Cor.~3.3]{tseng1995linear} -->

</details>

<details open>
<summary>Brain (3/5)</summary>

# Credit assignment in the brain

These methods depart from the optimization view, and often from accurate optimization objectives (which are a proxy for the real generalization objective anyway), in favor of heuristically chosen auxiliary objectives.
==TODO: something from blake lessons==

> We are not sure the brain has such a system, but there are ideas about how something loosely similar to backpropagation could be implemented in the brain. The brain cannot have a global gradient, as the neurons would have to store both forward (primal) and backward (tangent) values.

## Target representations
review target prop https://arxiv.org/pdf/2006.14331.pdf

### :heavy_check_mark:[Krogh, 1989]
<!-- >Target propagation treats hidden states as explicit variables to be optimized over. -->

It solves:
$$\liftedmin \lagrangian$$
with $L=1$, where $\lambda$ is a fixed constant. Krogh et al. also show how to impose constraints on the hidden representations, to control what is learned, by adding additional terms dependent on the activation values $\x$.

### :heavy_check_mark:[Rohwer, 1990]

The moving-targets algorithm aims to increase learning speed with a per-layer cost function, allowing the hidden variables to solve non-stationary local problems. It differs from [Krogh 1989] in the parametrization of the $f_i(\x, \w)$ transfer functions.
<!-- >The moving target algorithm has two phases: >1. the hidden units' targets $a_i$ are improved to minimize the loss >2. the weights $w_i$ are improved to output $a_i$ >The error has two terms: $D(x_{t+1}, f(x_t))$ and $D(y, X_T)$ (not sure) This method decouples temporally distant weights but requires the targets $x_t$ to be stored. The primary disadvantage of this technique is that the $T\times \dim(x_0)$ intermediate states for each input $x_0$ have to be learned. -->

### :white_check_mark:[Lee et al., 2015a]

Deeply Supervised Networks (DSN) [\cite] consider multiple losses as a joint loss and compute gradients of the joint loss:
$$\classicalmin \sum_{i=1}^L \lambda_i \ell(f_i(..f_0(x_0; \w_0)..,\w_i), y)$$
where $\lambda_i$ is now an annealed constant.
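To make the joint objective concrete, here is a minimal PyTorch sketch of a deeply-supervised loss in the spirit of the formula above: each hidden layer gets its own linear readout and the per-depth losses are summed with weights $\lambda_i$. The layer sizes, the annealing schedule and all names are illustrative assumptions of mine, not the paper's configuration.

```python
# Minimal deeply-supervised joint loss: sum_i lam[i] * loss(readout_i(a_i), y).
import torch
import torch.nn as nn

layers = nn.ModuleList([nn.Sequential(nn.Linear(16, 32), nn.Tanh()),
                        nn.Sequential(nn.Linear(32, 32), nn.Tanh()),
                        nn.Linear(32, 10)])
readouts = nn.ModuleList([nn.Linear(32, 10), nn.Linear(32, 10), nn.Identity()])
criterion = nn.CrossEntropyLoss()

def dsn_loss(x, y, lam):
    """Weighted sum of per-depth losses, with a_i the activation after layer i."""
    loss, a = 0.0, x
    for layer, readout, w in zip(layers, readouts, lam):
        a = layer(a)
        loss = loss + w * criterion(readout(a), y)
    return loss

x = torch.randn(8, 16)
y = torch.randint(0, 10, (8,))
lam = [0.3, 0.3, 1.0]              # auxiliary weights, annealed towards [0, 0, 1] over training
dsn_loss(x, y, lam).backward()     # gradients of the joint loss reach every layer
```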
>Could show the gradients being lower triangular like the backprop ones; this means that the lower parts of the triangle are helping shape the gradients.

Similarly to [Rohwer and Krogh], this approach aims to provide proxy targets for the hidden layers.
<!-- >Deeply Supervised Network (DSN) (Lee et al., 2015) consider multiple losses as a joint loss and sum up the gradients from relevant losses into a joint one in backpropagation (BP). In DSN(Leeetal.,2015),each convolutionlayer is associated with a classifier. To avoid the training difficulty, DSN keeps the losses for a number of epochs and discard all but the final loss to finish the remaininge pochs. ![](https://i.imgur.com/uasnMvp.png) > DSN (Lee et al., 2015),considers a weighted sum of multiple losses as a joint one andupdates the model parameters with the joint gradients. >These terms are combined using weighting hyper parameter α, similar to the method discussed in Lee et al. (2014), Szegedy et al. The DSN had linear auxiliary softmax classifiers after the first and second pooling layers and α was decayed as proposed in Lee et al. (2014). The finetuning network’s weights were initialized using those of the CIFAR-10 teacher network and a linear readout was added. These terms are combined using weighting hyper parameter α, similar to the method discussed in Lee et al. (2014), Szegedy et al. (2015), and Wang et al. (2015). Similar to the method used in Lee et al. (2014), the parameters for the layers in the main network directly connected to auxiliary networks were updated using a linear combination of the backpropagated gradients from later layers and the auxiliary network. These terms are combined using weighting hyper parameter α, similar to the method discussed in Lee et al. (2014), Szegedy et al. (2015), and Wang et al. (2015). In RDL, The DSN had linear auxiliary softmax classifiers after the first and second pooling layers and α was decayed as proposed in Lee et al. (2014). As in Lee et al. (2014), Szegedy et al. (2015), andWang et al. (2015), we utilize auxiliary error functions to train internal layers directly in conjunction with the error from the output layer found via backpropagation. These terms are combined using weighting hyper parameter α, similar to the method discussed in Lee et al. (2014), Szegedy et al. (2015), and Wang et al. The deeply supervised network had linear auxiliary softmax classifiers placed after the max pooling layers and α was decayed using αt+ 1 = α ∗ t 0.1 ∗(1 − t/tmax), as proposed in Lee et al. (2014). -->

### :white_check_mark:[Lee et al., 2015b]
==TODO: in the paper they use $x$ everywhere but I'm not sure if they actually mean $\x_{i-1}$==

Global gradients require knowledge of the full network, which is not biologically plausible. [Lee 2015b] avoids them and instead relies on target values $a_i$ and approximate inverse dynamics $g_i$. This approach is similar to the collocation methods of \cite{Biegler}, where we have access to approximate dynamics. We now have local loss functions $||\hi||^2$ for the intermediate layers and a standard last-layer loss $\lastcost$.
<!-- The last layer is trained by standard gradient descent with gradient: $\hat\x_M = \x_M-\eta \frac{\partial \lastcost}{\partial \x_M}$-->

### Optimization

The last layer is trained by standard gradient descent. The interesting part of this approach is the optimization algorithm for the intermediate layers.
We use an approximate inverse $g_i(\cdot)$ where:
$f_i(g_i(\x_{i+1}))\approx \x_{i+1}$
$g_i(f_i(\x_i))\approx \x_i$
The target values are updated as:
$\hat{\x}_i = \x_i + g_i(\hat\x_{i+1}) - g_i(\x_{i+1})$
We can now update the weights of $g_i$ by minimizing:
$$\underset{W_g}{\operatorname{argmin}}||g_{i+1}(f_{i+1}(\x_i + \epsilon); W_g)- (\x_i + \epsilon)||^2$$
The $\epsilon$ term ensures the backward map is also estimated in a neighbourhood of $\x_i$. And $f_{i+1}$ by minimizing:
$$\underset{W_f}{\operatorname{argmin}}||f_{i+1}(\x_i)- \hat\x_{i+1}||^2$$

#### Models for forward and backward dynamics

DTP uses autoencoders to model the $f$ and $g$ functions. ==More?== Section 2.4 shows how to get the updates, idk if I should add it. The update direction of target prop does not deviate by more than 90 degrees from the gradient direction (the $\cos(\alpha)$ proof).
<!-- >compute targets rather than gradients using a local denoising auto-encoder at each layer and building both feedforward and feedback pathways. >More specifically, instead of gradients, feedback learning signals can be target values > Difference target-propagation [17] backpropagates targetvalues instead of error gradients. The target value is meant to be close to the activationvalue while being likely to have provided a smaller loss. With a reasonably good targetat hand, a layer-local training criterion can be applied to update each layer separately >[Source](https://qdata.github.io/deep2Read//MoreTalksTeam/Un17/Muthu-OptmTarget.pdf) Vanishing/exploding gradient problem in back prop Layer-local training criterion for updating each layer separately Instead of propagating error signals, assign each hidden layer a nearby value (h_i carrot) that leads to lower loss ![](https://i.imgur.com/gQrNM7Z.png) Define layer-local target loss and update using SGD○Avoids vanishing/exploding gradient by computing derivatives for only a single layer Use “approximate inverse” to define intermediate targets ![](https://i.imgur.com/7jqZYB2.png) now make this approx inverse an autoencoder: Train inverse mapping via additional autoencoder loss○Modify loss with noise injection so inverse corresponds to neighborhood -->
<!-- ![](https://i.imgur.com/OAAbaBv.png) ![](https://i.imgur.com/qEw8T4o.png) ![](https://i.imgur.com/vfInCE7.png) If input of ith layer becomes target, output of ith layer also gets closer to target ![](https://i.imgur.com/XlGJJiu.png) ![](https://i.imgur.com/4mrPnyZ.png) ![](https://i.imgur.com/bu4g0lJ.png) -->

### [Lillicrap et al., 2020]

[Lillicrap 2020] investigates the biological plausibility of target-propagation approaches and tries to establish a framework called "Neural Gradient Representation by Activity Differences" (NGRAD). They also introduce the NGRAD hypothesis, which suggests the human cortex uses an NGRAD-based method to approximate gradient descent. The main contribution of this paper is the suggestion that all these bio-inspired algorithms (which ones?) approximate backprop by implicitly representing gradients in terms of (spatial or temporal) differences in neural activity.
<!-- looks at how learning rules could be implemented in biological micro-circuits Lillicrap2020 [20] adds a feedback network as approximation for the learning signal.
with several recent studies focusing on one issue in particular, the fact that error is fed back using an exact copy of the forward weights (the “weight transport” or “weight symmetry” problem, Lillicrap et al One of the most promising ideas of how backpropagation-like algorithms could be implemented in the brain is based on using temporal difference in neuronal activity to approximate top-down error signals Given the autoencoder view of the inverse network, this work starts by training the network by local updates (as before) and later we can see this process as given that $g(\cdot) \approx f^{-1}(\cdot)$ the inverse approximation error is: $a = g(f(x_{t-1}, \theta_f)) - x_{t-1}$ The propagated target for the layer is: $b = g(x_{t}, \theta_g)$ Considering the inversion error the corrected target is: $a + b = $ ![](https://i.imgur.com/f3k763k.png) --> </details>