# [MULTI-TASK LEARNING AS MULTI-OBJECTIVE OPTIMIZATION](https://arxiv.org/pdf/1810.04650.pdf)

###### `Tanmay Jain , Dr. Chandrajit Bajaj`

## :memo: SUMMARY

### :pushpin: **Abstract**

- Multi-task learning: solving multiple tasks jointly at the same time. **e.g.** training one network to both detect objects in an image and produce text descriptions of the detected objects, rather than training a separate network for each task
- Multi-task learning is a multi-objective problem: when tasks have competing objectives, improving on one task can degrade performance on another
- Earlier works used a proxy objective, a linearization **(weighted sum)** of the different objectives, and minimized this combined proxy
- This paper formulates multi-task learning as a multi-objective optimization problem with the aim of finding a ***Pareto optimal solution***

<img src="https://i.imgur.com/2KtD9AM.jpg" alt="notes" width="300" height="300"/>

:::info
:pushpin: **Pareto Optimal**: A design vector *Y* is Pareto optimal if it is **not dominated** by any other point.

:pushpin: **Dominance**: Design vector *Y^1^* dominates *Y^2^* if L~i~(Y^1^) <= L~i~(Y^2^) for all objectives, with strict inequality for at least one objective.
:::

### :pushpin: **Introduction**

- **Motivation**: *Stein's paradox*: it is better to estimate the means of three or more Gaussian random variables using samples from all of them than to estimate each one individually, even when the Gaussians are independent.
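The dominance and Pareto-optimality definitions in the box above translate directly to code. A toy sketch (the candidate loss vectors are made up; lower loss is better on every task):

```python
def dominates(l1, l2):
    """True if loss vector l1 dominates l2: no worse on every
    objective, and strictly better on at least one."""
    return all(a <= b for a, b in zip(l1, l2)) and any(a < b for a, b in zip(l1, l2))

def pareto_optimal(point, candidates):
    """A point is Pareto optimal if no candidate dominates it."""
    return not any(dominates(c, point) for c in candidates)

# Two-task loss vectors; (3.0, 3.0) is dominated by (2.0, 2.0).
candidates = [(1.0, 3.0), (2.0, 2.0), (3.0, 1.0), (3.0, 3.0)]
front = [p for p in candidates if pareto_optimal(p, candidates)]
print(front)  # the three non-dominated points
```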
- *Multi-task learning*: learn multiple tasks using data from all of them, in the hope of obtaining better performance than learning each task individually
- To achieve multi-task learning, a parameterized hypothesis class is set up with some parameters shared across the tasks
- Earlier works minimized a weighted sum of the empirical risks of the tasks, which is only valid when the objectives don't compete. For genuine MTL we instead look for solutions not dominated by any other
- One technique for solving multi-objective optimization is **MGDA** (Multiple Gradient Descent Algorithm), which converges to a point on the **Pareto front**

:::info
:pushpin: **Pareto Front**: The image of the Pareto optimal points in the objective space is called the **Pareto front**.
:::

- **MGDA** has two limitations :-1:
  1) It doesn't scale well to the large gradient dimensions involved in updating the shared parameters
  2) Computing the gradients with respect to the shared parameters requires one backward pass per task, so training time scales with the number of tasks
- This paper proposes a Frank-Wolfe-based solver for the high-dimensional gradient computations, and introduces an upper bound on the objective that requires backpropagating only once, with negligible overhead

### :pushpin: **MTL as a Multi-Objective Optimization Problem**

| In Words | Notation |
| ----------------- |:----------------------- |
| Input space | X |
| Collection of task spaces | {Y^t^}~t∈[T]~ |
| Total i.i.d. data points | N |
| Each data point i | {x~i~ , y~i~^1^ , y~i~^2^ , ... , y~i~^T^} |
| Label of the i^th^ data point on the t^th^ task | y~i~^t^ |

- **Parametric hypothesis class per task** - f^t^(x; θ^sh^, θ^t^) : X → Y^t^, such that some parameters (θ^sh^) are shared between tasks and some (θ^t^) are task-specific.
- **Task-specific loss functions** - L^t^(·, ·) : Y^t^ × Y^t^ → R^+^
- **Earlier works** - ![](https://i.imgur.com/MXH5yB3.png) where c^t^ are statically or dynamically computed per-task weights, and L^t^ is defined as ![](https://i.imgur.com/JW3G7af.png)
- **This work** - MTL as a multi-objective problem with conflicting objectives. The vector-valued loss L is defined as ![](https://i.imgur.com/Vg4mrKi.png)
- We want Pareto optimality for the multiple objectives defined below: ![](https://i.imgur.com/CazU0Zt.png)

### :pushpin: **MGDA**

**QUESTION?**

- Given the domain of design points D, a design point x~0~, and the gradients ∇f~j~(x~0~), can we find a non-zero vector d such that: ![](https://i.imgur.com/2lAg1Jx.png)
- If such a d exists, we can move along **-d** as a local descent direction common to all the objectives
- If x~0~ is Pareto optimal then no common local descent direction exists, because x~0~ is non-dominated, so no such d can be found
- **Two questions arise** - How can we tell whether x~0~ is Pareto optimal, and how do we find the vector d?
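Previewing the answer: the minimum-norm point w in the convex hull of the gradients gives a common descent direction -w. A toy numeric check (the two quadratic objectives and the point x0 are made up; the closed-form γ for two vectors is used):

```python
import numpy as np

# Two made-up quadratic objectives with different minimizers a and b.
a, b = np.array([1.0, 0.0]), np.array([0.0, 1.0])
def f1(x): return np.sum((x - a) ** 2)
def f2(x): return np.sum((x - b) ** 2)

x0 = np.array([2.0, 2.0])
u1, u2 = 2 * (x0 - a), 2 * (x0 - b)   # gradients of f1, f2 at x0

# Minimum-norm point of the convex hull of {u1, u2}, closed form:
gamma = np.clip(np.dot(u2 - u1, u2) / np.dot(u1 - u2, u1 - u2), 0.0, 1.0)
w = gamma * u1 + (1 - gamma) * u2

# Lemma property: <u_j, w> >= ||w||^2 for every objective.
assert np.dot(u1, w) >= np.dot(w, w) - 1e-9
assert np.dot(u2, w) >= np.dot(w, w) - 1e-9

# Stepping along -w decreases BOTH objectives at once.
x1 = x0 - 0.1 * w
print(f1(x1) < f1(x0), f2(x1) < f2(x0))
```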
**SOME CONCEPTS**

- **Convex hull** - Given a family of m vectors {u~j~} (u~j~ ∈ R^N^, j = 1, 2, ..., m), its convex hull is the set of all convex combinations of these vectors ![](https://i.imgur.com/7z8AJIC.png)

**Lemma 1 and Lemma 2** - ![](https://i.imgur.com/3RiIwWj.jpg) ![](https://i.imgur.com/nIj21vW.jpg)

**APPLICATION TO MULTI-OBJECTIVE OPTIMIZATION**

- Let the family {u~j~} be the gradients of the objective functions at some starting point x~0~: **u~j~ = ∇f~j~(x~0~) (j = 1, ..., m)**
- Identify the vector ω of Lemmas 1 and 2. Then (u~j~, ω) >= ||ω||^2^ >= 0
- Since ω has a non-negative inner product with the gradient of every objective, **-ω is a local descent direction common to all the objectives**
- The problem reduces to finding the **minimum-norm element of the convex hull of the gradients of all the objectives**

**MGDA** - ![](https://i.imgur.com/1Hv6mLO.png) where d is in the direction of ω and ρ > 0 is the step size

**Convergence** - The algorithm converges to a point where ω = d = 0, i.e. a **"Pareto-stationary point"**.
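A toy MGDA run makes the convergence behavior concrete (illustrative sketch only, not the paper's code; the two quadratic objectives are made up). The Pareto set is the segment between the two minimizers, and the iterates stop moving once the minimum-norm element ω of the gradients' convex hull vanishes:

```python
import numpy as np

# Two made-up quadratic objectives; their Pareto set is the segment [a, b].
a, b = np.array([1.0, 0.0]), np.array([0.0, 1.0])
def grad1(x): return 2 * (x - a)
def grad2(x): return 2 * (x - b)

def min_norm_2(u1, u2):
    """Minimum-norm point of the convex hull of {u1, u2} (closed form)."""
    gamma = np.clip(np.dot(u2 - u1, u2) / np.dot(u1 - u2, u1 - u2), 0.0, 1.0)
    return gamma * u1 + (1 - gamma) * u2

x, rho = np.array([2.0, 2.0]), 0.05
for _ in range(300):
    w = min_norm_2(grad1(x), grad2(x))
    x = x - rho * w               # descend along the common direction -w

print(x, np.linalg.norm(w))       # x near the Pareto set, ||w|| ~ 0
```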
At a Pareto-stationary point the KKT optimality conditions are satisfied, so we obtain an optimal solution **(I wasn't able to follow the proof in much detail)**

- Parameters at which the following conditions hold are called Pareto-stationary points and satisfy the KKT conditions; our goal is to reach such parameters - ![](https://i.imgur.com/nO1d5kq.png)

### :pushpin: **Objective Function Solved to Get the Descent Direction**

![](https://i.imgur.com/WGBq0b3.png)

**Solution** -
1) Either the solution is 0, in which case the corresponding design points are Pareto stationary and satisfy the KKT optimality conditions
2) Or the solution gives a local descent direction that moves the shared parameters toward Pareto stationarity

### :pushpin: **Solving the Equation**

- Solving this optimization problem amounts to finding the minimum-norm point in the convex hull of the gradients of the task-specific losses with respect to the shared parameters
- Consider first the 2-task case ![](https://i.imgur.com/bfX0pYo.png) ![](https://i.imgur.com/O7e19W3.jpg) ![](https://i.imgur.com/6VfNqIY.jpg)
- Using the analytical solution above, the **line-search subroutine** of **Frank-Wolfe** can be solved exactly

### :pushpin: **MTL Update Algorithm**

![](https://i.imgur.com/5daIVGi.png)

The update equations above can be understood as follows.
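For more than two tasks, the inner solver repeatedly applies this two-point analytic line search inside a Frank-Wolfe loop over the simplex of alphas. A minimal numpy sketch (my reading of the algorithm, not the authors' implementation; the toy gradient matrix G is made up, one row per task):

```python
import numpy as np

def frank_wolfe_min_norm(G, iters=100):
    """Find alpha on the simplex minimizing ||G^T alpha||
    (minimum-norm point in the convex hull of the rows of G)."""
    T = G.shape[0]
    M = G @ G.T                              # pairwise gradient inner products
    alpha = np.full(T, 1.0 / T)
    for _ in range(iters):
        t_hat = int(np.argmin(M @ alpha))    # linear minimization oracle
        w = G.T @ alpha                      # current point in the hull
        v = G[t_hat]                         # chosen vertex
        denom = np.dot(w - v, w - v)
        if denom < 1e-12:
            break
        # Analytic line search between w and v (same closed form as 2 tasks).
        gamma = np.clip(np.dot(w - v, w) / denom, 0.0, 1.0)
        alpha = (1 - gamma) * alpha
        alpha[t_hat] += gamma
    return alpha

G = np.array([[2.0, 4.0], [4.0, 2.0], [5.0, 5.0]])   # 3 toy task gradients
alpha = frank_wolfe_min_norm(G)
print(alpha, np.linalg.norm(G.T @ alpha))
```

Here the minimum-norm point lies on the edge between the first two gradients, so the third task receives zero weight.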
First, the task-specific parameters are updated for each task. Then the Frank-Wolfe solver computes the alphas that give the minimum-norm vector in the convex hull of the gradients, and the shared parameters are updated in the direction opposite to this vector, since it has a non-negative inner product with the gradients of all the loss functions ![](https://i.imgur.com/W2jYYLJ.png)

The equation above gives the value of alpha minimizing the objective defined earlier, with the line search toward the minimum-norm point solved analytically as derived above.

**This algorithm is not entirely clear to me. I know the basic Frank-Wolfe algorithm, but I am not seeing how the Frank-Wolfe variant proposed here relates to the standard one (the authors do not provide derivations).**

**Frank-Wolfe** - works by linearizing the objective via a Taylor expansion and choosing a point x^*^ in the convex set D such that **x^*^ = argmin~x~(∇f(x)^T^x)**. It then performs a line search on the segment joining **x and x^*^** and updates the iterate x accordingly.

### :pushpin: **Efficient Optimization for Encoder-Decoder Architectures**

- The algorithm above needs the gradients with respect to the shared parameters for every task t, i.e. one forward pass and T backward passes per update
- Training time therefore scales with the number of tasks, since the backward pass is the more compute-expensive one, limiting how many tasks we can train on
- The authors propose optimizing an upper bound of the objective instead, which requires only a single backward pass
- They prove that optimizing this upper bound still yields a Pareto optimal solution
- For efficiency, they consider encoder-decoder architectures with the following hypothesis - ![](https://i.imgur.com/soZkqXn.png) where g is the common representation shared by all of the tasks and the f^t^ are task-specific
functions that take this representation as input

- **Upper bound** - ![](https://i.imgur.com/hrEGsRA.jpg) ![](https://i.imgur.com/znCXVfg.jpg)
- The chain rule and standard matrix-norm properties yield this upper bound on the objective

:::info
:pushpin: **Properties of the upper bound**:
1) ∇~z~L^t^(Θ^sh^ , Θ^t^) can be computed in a single backward pass; no separate backward pass over the shared parameters is needed per task
2) ![](https://i.imgur.com/8EaQeND.png) is not a function of α^1^, α^2^, ..., α^T^ and can be dropped from the objective
:::

**Resulting optimization (MGDA-UB)** - ![](https://i.imgur.com/idxbAT6.png)

- MGDA-UB uses the gradients of the task losses with respect to the representations instead of the shared parameters: compute these gradients, solve for the alphas, and update the shared parameters exactly as in the algorithm above
- The authors state the theorem - ![](https://i.imgur.com/nxprHHb.png)

### :pushpin: **Experiments**

- MTL is evaluated on several problems: MultiMNIST; multi-label classification on CelebA; and scene understanding with instance segmentation, depth estimation, and semantic segmentation

| Dataset | Number of Tasks |
| ----------------- |:----------------------- |
| MultiMNIST | 2 |
| Multi-label classification (CelebA) | 40 |
| Scene understanding | 3 |

**Baselines** -

| Baseline | Formula |
| ----------------- |:----------------------- |
| Uniform scaling | (1/T) * (Σ~t~L^t^) |
| Single task | Training each task independently |
| Grid search | Various values of c^t^ , {c^t^ ∈ [0, 1] : Σ~t~c^t^ = 1}, optimizing (1/T)Σ~t~c^t^L^t^ |
| Kendall et al.
(2018) | Uncertainty-guided weighting of c^t^ |
| GradNorm | Gradient normalization |

### :pushpin: **MultiMNIST**

- **Dataset construction** - For each MNIST image, pick another image uniformly at random, place one in the top-left corner and the other in the bottom-right. The two tasks are classifying the top-left and the bottom-right digit ![](https://i.imgur.com/2OzOziP.png)
- **Architecture** - A LeNet architecture with its last layer as the shared representation, followed by two fully connected task-specific layers. *The loss is softmax cross-entropy* for each task ![](https://i.imgur.com/bTQ0p6V.png)
- **Results** - ![](https://i.imgur.com/KkUKayf.png) ![](https://i.imgur.com/OPK7GJf.png) The tasks compete for model capacity: in the grid search, improving the accuracy of one task decreases the accuracy of the other. Training each task individually therefore gives the best results, since there is no competition, and the authors claim their method matches this single-task performance. Even an extensive hyperparameter search over the scaling factors yields no comparable accuracy.

### :pushpin: **Multi-Label Classification on CelebA**

- **Dataset** - CelebA has 200K images and 40 labels.
Each label gives rise to a binary classification task, hence there are 40 tasks
- **Architecture** - ResNet-18 without its final layer serves as the shared representation function, with a fully connected layer added per task. Each task's two-dimensional output is passed through a 2-class softmax to get binary attribute probabilities, and cross-entropy is used as the loss ![](https://i.imgur.com/ttkwAvw.png)
- **Results** - ![](https://i.imgur.com/CJzgJLt.png) ![](https://i.imgur.com/OP40GFQ.png) This shows the error on the different attributes in multi-label classification. ![](https://i.imgur.com/JFKYFkr.png) The table above shows the mean error of the different MTL methods on CelebA. Grid search is excluded because searching over 40 tasks is computationally infeasible. This paper achieves the best results, showing the approach also scales to a large number of tasks. Note that uniform scaling, commonly used in multi-task learning, performs worse than single-task training.

### :pushpin: **Conclusion**

The paper casts multi-task learning as multi-objective optimization and describes an efficient algorithm for it, with specific approximations that yield a deep MTL method with almost no computational overhead.