# When do curricula work?

#### Author: [Sharath](https://sharathraparthy.github.io/)

## [Paper Link](https://openreview.net/pdf?id=tW4QEInpni)

## Overview

Curriculum learning is inspired by the way humans learn, where examples are shown in increasing order of difficulty. More specifically, the network is exposed to easier examples in the early stages of training and then gradually to harder ones. This paper studies the benefits of showing such sequentially ordered examples to the network and comments on when this works and when it doesn't.

### Contributions:

1. This paper introduces a phenomenon called implicit curricula.
2. One of its claims is that the three orderings (curriculum, anti-curriculum and random) perform almost the same in standard settings.
3. Curricula are beneficial when the training time budget is limited and in the noisy-label regime.

## Types of curricula

1. **Implicit Curricula:** There is some connection between the order in which examples are shown and the order in which the network learns to classify them. To understand this connection, the authors study the order in which a network learns examples under standard SGD training. This order is referred to as an *implicit curriculum*. The aim is to show whether examples are learned in a consistent order across different architectures (such as VGG, ResNet, Wide-ResNet, DenseNet and EfficientNet) and optimizers (such as SGD with momentum). A metric called "learned iteration" is used to quantify this implicit curriculum. It is defined as the first epoch from which the model predicts the sample correctly at that epoch and at all subsequent epochs. Mathematically, this is $\min_{t^\star} \{t^\star \mid \hat{y}_w(t)_i = y_i, \ \forall\, t^\star \leq t \leq T\}$.
2. **Explicit Curricula:** In this type of curriculum, the ordering is imposed or learned by an external agent via a scoring function.

## Explicit curricula through scoring and pacing functions

The curriculum is defined by three ingredients:

1. **The scoring function:** The scoring function $s(x)$ measures the difficulty of a training example; it maps an input $x$ to a real number in $\mathbb{R}$.
2. **The pacing function:** The pacing function $g(t)$ determines the size of the dataset to be used at epoch $t$. The training set at $t$ consists of the $g(t)$ lowest-scored examples, from which the training batches are sampled uniformly.
3. **The order:** This is either ascending order of difficulty (curriculum), descending order (anti-curriculum) or random order.

The training procedure is summarized in the algorithm below.

![](https://i.imgur.com/l1AyKOo.png)

### Different types of score functions

The authors consider three scoring functions:

1. **Loss function:** Examples are scored by the real-valued loss of a reference network that is trained on the same data.
2. **Learned epoch/iteration:** This is the metric introduced in the implicit curricula section, $\min_{t^\star} \{t^\star \mid \hat{y}_w(t)_i = y_i, \ \forall\, t^\star \leq t \leq T\}$.
3. **$c$-score:** This metric captures how consistently a reference model predicts a particular example correctly when it is trained on i.i.d. draws of a fixed-size dataset that doesn't contain that example. Formally, $s(x_i, y_i) = \mathbb{E}_{D \sim \mathcal{D} \setminus \{(x_i, y_i)\}} \left[P(\hat{y}_{w, i} = y_i \mid D)\right]$, where $D$ is a training set drawn from the data $\mathcal{D}$ with $(x_i, y_i)$ held out. The loss-based $c$-score is similar, except that the probability of predicting the example correctly is replaced by a loss function: $s(x_i, y_i) = \mathbb{E}_{D \sim \mathcal{D} \setminus \{(x_i, y_i)\}} \left[\ell(x_i, y_i) \mid D\right]$.

### Different types of pacing functions

As mentioned earlier, the pacing function determines the size of the training set to be used at epoch $t$. In this paper the authors consider a family of six pacing functions, which are summarised below.
![](https://i.imgur.com/q3hol24.png)

These pacing functions are parameterized by two parameters $(a, b)$, where $a$ denotes the fraction of training needed for the pacing function to reach the size of the full training set and $b$ denotes the fraction of the training set used at the start of training.

## Findings

### Marginal benefits in standard settings

#### Baselines

The authors conducted a large-scale study of the benefits of curricula in standard settings. They ran 540 standard training runs and constructed the following baselines:

1. Standard1 baseline: This is the mean of all 540 standard runs.
2. Standard2 baseline: The 540 runs are split into 3 groups and the mean of each group is calculated. The maximum of these three means is then reported.
3. Standard3 baseline: This is the mean of the top-3 values among all 540 runs.

The results convey that:

1. Curricula provide little benefit for standard learning.
2. When the three orderings are compared against a less crippled baseline, one that uses a massive hyper-parameter sweep, none of the pacing functions or orderings shows a statistically significant improvement over standard learning.
3. Performance does not depend on which of the three orderings (curriculum, anti-curriculum or random) is used.

### Substantial benefits in case of limited time budget and noisy data

The authors further conducted experiments in which the training time is a parameter. Under a limited time budget, curriculum ordering yields a drastic performance improvement over the baselines. Likewise, in the noisy-label regime, where artificial label noise is generated by randomly permuting the labels, curriculum learning outperforms the baselines.
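
To make the three ingredients of an explicit curriculum (score, order, pace) more concrete, here is a minimal NumPy sketch. It is not the authors' implementation. All function names are hypothetical. The linear ramp in `linear_pacing` is an assumed form that merely matches the description of $(a, b)$ above and is not necessarily one of the six functions in the figure. Assigning the sentinel value $T$ to examples that are still misclassified at the last epoch, and sampling batches with replacement, are also conventions chosen only for this illustration.

```python
import numpy as np


def learned_iteration(correct):
    """Earliest epoch from which each example stays correctly classified.

    correct: bool array of shape (T, N); correct[t, i] is True when example i
    is predicted correctly at epoch t. Examples that are wrong at the final
    epoch get the sentinel value T (a convention assumed here).
    """
    correct = np.asarray(correct, dtype=bool)
    # stays[t, i] is True iff example i is correct at every epoch >= t.
    stays = np.logical_and.accumulate(correct[::-1], axis=0)[::-1]
    return correct.shape[0] - stays.sum(axis=0)


def linear_pacing(t, total_steps, n_examples, a, b):
    """Assumed linear pacing: b*N examples at t=0, all N from step a*T on.

    Requires a > 0.
    """
    size = n_examples * (b + (1.0 - b) * t / (a * total_steps))
    return int(np.clip(round(size), 1, n_examples))


def ordered_indices(scores, order, rng):
    """Return example indices arranged according to one of the three orderings."""
    if order == "curriculum":          # easiest (lowest score) first
        return np.argsort(scores, kind="stable")
    if order == "anti-curriculum":     # hardest (highest score) first
        return np.argsort(scores, kind="stable")[::-1]
    if order == "random":              # scores are ignored
        return rng.permutation(len(scores))
    raise ValueError(f"unknown order: {order}")


def curriculum_batches(scores, order, total_steps, batch_size, a, b, seed=0):
    """Yield one batch of example indices per training step."""
    rng = np.random.default_rng(seed)
    idx = ordered_indices(np.asarray(scores), order, rng)
    for t in range(total_steps):
        # Only the first g(t) examples of the ordering are available at step t.
        pool = idx[: linear_pacing(t, total_steps, len(idx), a, b)]
        yield rng.choice(pool, size=batch_size, replace=True)


# Toy history: 4 epochs (rows) x 3 examples (columns).
correct = np.array([[0, 1, 0],
                    [1, 1, 0],
                    [0, 1, 1],
                    [1, 1, 0]], dtype=bool)
scores = learned_iteration(correct)
print(scores)  # [3 0 4]
```

In the toy history, the second example is correct at every epoch (score 0), the first only stays correct from the last epoch on (score 3), and the third is wrong at the final epoch and so receives the sentinel 4. Passing these scores to `curriculum_batches` with `order="curriculum"` draws the early batches from the lowest-scored examples only, and the pool grows to the full dataset once `t` reaches `a * total_steps`.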