# When do curricula work?
#### Author: [Sharath](https://sharathraparthy.github.io/)
## [Paper Link](https://openreview.net/pdf?id=tW4QEInpni)
## Overview
Curriculum learning is inspired by the way humans learn: examples are shown in increasing order of difficulty. More specifically, the network is exposed to easier examples in the early stages of training and then gradually to harder ones. This paper studies the benefits of presenting examples to the network in such a sequential order, and comments on when it works and when it doesn't.
### Contributions:
1. This paper introduces a phenomenon called implicit curricula.
2. One of its claims is that the three orderings (curriculum, anti-curriculum and random) perform almost the same in standard settings.
3. Curricula are beneficial when the training time budget is limited and in the noisy-label regime.
## Types of curricula
1. **Implicit Curricula:** There is some connection between the order in which examples are shown and the order in which the network learns to classify them. To understand this connection, the authors study the order in which a network learns examples under standard SGD training. This order is referred to as an *implicit curriculum*. The aim is to show whether examples are learned in a consistent order across different architectures (such as VGG, ResNet, Wide-ResNet, DenseNet and EfficientNet) when training with SGD with momentum. A metric called "learned iteration" is used to quantify this implicit curriculum. It is defined as the earliest epoch from which the model predicts the example correctly at that epoch and at all subsequent epochs. Mathematically, this is $\min_{t^\star} \{t^\star \mid \hat{y}_w(t)_i = y_i, \ \forall \, t^\star \leq t \leq T\}$.
2. **Explicit Curricula:** In this type of curriculum, the ordering is imposed or learned by an external agent via a scoring function.
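To make the "learned iteration" definition concrete, here is a minimal sketch in plain Python. It assumes we have already recorded, for a single example, whether the model predicted it correctly at each epoch. The function name `learned_iteration` and the list-of-booleans input are illustrative choices, not something taken from the paper.

```python
def learned_iteration(correct_per_epoch):
    """Return the earliest epoch t* (1-indexed) such that the example is
    predicted correctly at every epoch from t* to T, or None if the
    prediction at the final epoch is wrong.

    correct_per_epoch: list of booleans, one entry per epoch 1..T.
    """
    t_star = None
    # Walk backwards from the last epoch: t* is the start of the final
    # unbroken run of correct predictions.
    for epoch in range(len(correct_per_epoch), 0, -1):
        if not correct_per_epoch[epoch - 1]:
            break
        t_star = epoch
    return t_star


# Correct at epoch 2, forgotten at epoch 3, then correct from epoch 4 on.
history = [False, True, False, True, True]
print(learned_iteration(history))  # 4
```

Note that an example which is classified correctly early on but forgotten later only counts as learned from the start of its last unbroken run of correct predictions.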
## Explicit curricula through scoring and pacing functions
The curriculum is defined by three different ingredients:
1. **The scoring function:** The scoring function $s(x)$ measures the difficulty of a training example. It maps an input $x$ to a real number in $\mathbb{R}$.
2. **The pacing function:** The pacing function $g(t)$ determines the size of the training set to be used at step $t$. Training at step $t$ uses the $g(t)$ lowest-scored examples, and the training batches are sampled uniformly from this subset.
3. **The order:** The examples are sorted by score in ascending order (curriculum), descending order (anti-curriculum) or random order.
The training procedure is summarized in the algorithm below.
![](https://i.imgur.com/l1AyKOo.png)
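The way the three ingredients interact can be sketched roughly as follows. This is not the authors' implementation. The helper names `order_examples` and `sample_batch` are hypothetical, and the sketch assumes that the training pool at a given step is simply the first $g(t)$ examples of the ordered dataset (so for anti-curriculum, the highest-scored ones).

```python
import random


def order_examples(scores, order, seed=0):
    """Return example indices sorted according to the chosen ordering."""
    idx = list(range(len(scores)))
    if order == "curriculum":          # lowest score (easiest) first
        idx.sort(key=lambda i: scores[i])
    elif order == "anti-curriculum":   # highest score (hardest) first
        idx.sort(key=lambda i: scores[i], reverse=True)
    elif order == "random":            # scores are ignored
        random.Random(seed).shuffle(idx)
    else:
        raise ValueError(f"unknown order: {order}")
    return idx


def sample_batch(ordered_idx, pool_size, batch_size, rng):
    """Sample a batch uniformly from the first `pool_size` ordered examples."""
    pool = ordered_idx[:pool_size]
    return rng.sample(pool, min(batch_size, len(pool)))


scores = [0.9, 0.1, 0.5, 0.3]  # toy difficulty scores
print(order_examples(scores, "curriculum"))       # [1, 3, 2, 0]
print(order_examples(scores, "anti-curriculum"))  # [0, 2, 3, 1]

rng = random.Random(0)
ordered = order_examples(scores, "curriculum")
batch = sample_batch(ordered, pool_size=2, batch_size=2, rng=rng)
print(sorted(batch))  # [1, 3]: only the two easiest examples are available
```

In a real training loop, `pool_size` would be recomputed from the pacing function at every step, so the pool grows until it covers the full dataset.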
### Different types of score functions
The authors consider three scoring functions:
1. **Loss function**: The examples are scored by the real-valued loss of a reference network trained on the same data.
2. **Learned epoch/iteration:** This is the metric introduced in the implicit curricula section, $\min_{t^\star} \{t^\star \mid \hat{y}_w(t)_i = y_i, \ \forall \, t^\star \leq t \leq T\}$.
3. **$c$-score**: This metric captures how consistently a reference model predicts a particular example correctly when trained on IID draws of a fixed dataset that does not contain that example. Formally, $s(x_i, y_i) = \mathbb{E}_{D \sim \mathcal{D} \setminus \{(x_i, y_i)\}} \left[P(\hat{y}_{w,i} = y_i \mid D)\right]$. The loss-based $c$-score is similar, except that the probability of predicting the example correctly is replaced by the loss on that example: $s(x_i, y_i) = \mathbb{E}_{D \sim \mathcal{D} \setminus \{(x_i, y_i)\}} \left[\ell(x_i, y_i) \mid D\right]$.
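The expectation in the $c$-score can be approximated empirically by training several reference models on different subsets and averaging over the runs in which an example was held out. The sketch below assumes those runs have already been done and only shows the averaging step. The function name `estimate_c_score` and the two boolean matrices are hypothetical inputs, not the paper's code.

```python
def estimate_c_score(held_out, correct):
    """Monte Carlo estimate of the c-score for every example.

    held_out[r][i]: True if example i was excluded from run r's training set.
    correct[r][i]:  True if run r's model predicted example i correctly.
    Returns a list of scores in [0, 1], or None for an example that was
    never held out in any run.
    """
    n_runs = len(held_out)
    n_examples = len(held_out[0])
    scores = []
    for i in range(n_examples):
        # Only runs that did not train on example i are counted.
        outcomes = [correct[r][i] for r in range(n_runs) if held_out[r][i]]
        scores.append(sum(outcomes) / len(outcomes) if outcomes else None)
    return scores


held_out = [[True, False], [True, True], [False, True]]
correct = [[True, True], [True, True], [True, False]]
print(estimate_c_score(held_out, correct))  # [1.0, 0.5]
```

A high $c$-score means the example is consistently predicted correctly even when unseen, so under this score *higher* means easier, the opposite direction to a loss-based score.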
### Different types of pacing functions
As mentioned earlier, the pacing function determines the size of the training set to be used at step $t$. The authors consider a family of six pacing functions, summarised below.
![](https://i.imgur.com/q3hol24.png)
These pacing functions are parameterized by two parameters $(a, b)$, where $a$ denotes the fraction of training needed for the pacing function to reach the size of the full training set and $b$ denotes the fraction of the training set used at the start of training.
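As an illustration of how $(a, b)$ shape a pacing function, here is a sketch of a linear one. The exact functional forms used in the paper are given in the table above and are not reproduced here. The formula below is an assumed linear interpolation from $bN$ examples at step $0$ to $N$ examples at step $aT$, and the names `linear_pacing`, `total_steps` and `n_examples` are illustrative.

```python
def linear_pacing(t, total_steps, n_examples, a, b):
    """Number of examples available at step t under a linear pacing function.

    Starts with a fraction b of the data and grows linearly so that the
    full dataset is reached after a fraction a of the training steps.
    Assumes 0 < a <= 1 and 0 < b <= 1.
    """
    start = n_examples * b
    size = start + (n_examples - start) * t / (a * total_steps)
    # Never exceed the full dataset once step a * total_steps has passed.
    return int(min(n_examples, size))


# 1000 examples, 100 steps, full data reached halfway, starting from 25%.
for t in (0, 25, 50, 100):
    print(t, linear_pacing(t, total_steps=100, n_examples=1000, a=0.5, b=0.25))
# 0 250
# 25 625
# 50 1000
# 100 1000
```

Setting $b = 1$ recovers standard training, since the full dataset is available from the first step regardless of $a$.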
## Findings
### Marginal benefits in standard settings
#### Baselines
The authors conducted a large-scale study to assess the benefits of curricula in standard settings. They ran 540 standard training runs and constructed the following baselines:
1. Standard1 baseline: This is the mean of 540 standard runs
2. Standard2 baseline: All the 540 runs are split in 3 groups and the means are calculated for each group. Then the maximum of these three means is reported.
3. Standard3 baseline: This is the mean of top-3 values of all the 540 runs.
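The three baselines are simple statistics over the final accuracies of the standard runs. A small sketch is shown below, assuming the runs are split into contiguous, equally sized groups (the summary above does not say how the groups are formed). The function name `standard_baselines` and the toy accuracy values are made up for illustration.

```python
from statistics import mean


def standard_baselines(accuracies, n_groups=3, top_k=3):
    """Compute the Standard1, Standard2 and Standard3 baselines.

    accuracies: final accuracy of each standard training run.
    Assumes len(accuracies) is divisible by n_groups.
    """
    group_size = len(accuracies) // n_groups
    groups = [
        accuracies[g * group_size:(g + 1) * group_size]
        for g in range(n_groups)
    ]
    standard1 = mean(accuracies)                                 # mean of all runs
    standard2 = max(mean(group) for group in groups)             # best group mean
    standard3 = mean(sorted(accuracies, reverse=True)[:top_k])   # mean of top-k runs
    return standard1, standard2, standard3


runs = [90, 92, 91, 93, 89, 94]  # toy accuracies, not results from the paper
print(standard_baselines(runs))  # (91.5, 92, 93)
```

By construction Standard1 ≤ Standard2 ≤ Standard3, so the three baselines set progressively more demanding bars for a curriculum to beat.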
The results convey that:
1. Curricula provide little benefit for standard learning.
2. When all three orderings are compared to a less crippled baseline, one that accounts for the massive hyper-parameter sweep, none of the pacing functions or orderings shows a statistically significant improvement over standard learning.
3. Performance does not depend on which of the three orderings (curriculum, anti-curriculum or random) is used.
### Substantial benefits in case of limited time budget and noisy data
The authors further conducted experiments in which the training time is treated as a parameter. Under a limited time budget, curriculum ordering yields a large performance improvement over the baselines.
In the noisy-label regime, where artificial label noise is generated by randomly permuting the labels, curriculum learning also outperforms the baselines.
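One way to generate this kind of artificial label noise is sketched below. It assumes that a chosen fraction of the examples is selected at random and that their labels are permuted among themselves. The paper's exact procedure may differ, and `permute_labels` and `noise_fraction` are hypothetical names.

```python
import random


def permute_labels(labels, noise_fraction, seed=0):
    """Return a copy of `labels` where a random subset has its labels permuted.

    noise_fraction: fraction of examples whose labels take part in the
    permutation. Some of them may keep their original label by chance.
    """
    rng = random.Random(seed)
    n_noisy = int(round(noise_fraction * len(labels)))
    chosen = rng.sample(range(len(labels)), n_noisy)
    shuffled = [labels[i] for i in chosen]
    rng.shuffle(shuffled)
    noisy = list(labels)
    for i, new_label in zip(chosen, shuffled):
        noisy[i] = new_label
    return noisy


labels = [0, 1, 2, 3, 4, 5, 6, 7, 8, 9]
noisy = permute_labels(labels, noise_fraction=0.4)
changed = sum(a != b for a, b in zip(labels, noisy))
print(changed <= 4)                     # True: at most 4 labels can change
print(sorted(noisy) == sorted(labels))  # True: the label multiset is preserved
```

Because the labels are permuted rather than resampled, the overall class distribution stays the same. Only the pairing between inputs and labels is corrupted.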


Published on **HackMD**