Curriculum learning in reinforcement learning

# Curriculum learning in reinforcement learning #### Author: [Sharath Chandra](https://sharathraparthy.github.io/) ## Introduction 1. Curricula is seen everywhere in daily life and is key to how humans are trained to learn be it formal education, physical fitness etc. 2. If the quality of this curricula is not good then it hampers the educational/physical growth of person. And hence curricula is crucial in acheiving success. 3. Example of chess: The person is trained first with the simpler variant (subgames) of a chess and on gradually increasing variants. The final game is the complete game of chess. ![](https://i.imgur.com/PCpZUnT.png) 4. We can use this same idea of curricula for training artificial agents and the studies show that the order in which we show the training examples matters and the main result from this studies is that starting with an easy example and then increasing the difficulty can lead to faster convergence as well as increased performance on a task. 5. One of the natural extension to this type of training is to test the agents ability to perform well on an unseen environment once it is trained using curricula. Few studies have shown that this type of training helps reducing the convergence to optimal policy when training using curricula. 6. Given the advantages that curriculum learning provides, there has been a surge of interest in the community on how this sequence can be generated automatically without any human specification, otherwise called "Automatic curricula". 7. However what exactly constitutes a curriculum and what precisely qualifies an as being an example of curriculum learning is ill defined. 8. In this article, the authors provide a systematic overview of curriculum learning in RL setting. First they define "*what is curriculum?*" and then they talk about the method and evaluation of curriculum learning. Further they talk about the construction of tasks using curriculum and finally they talk about how an agent can transfer the knowledge to a new task as it learns a curriculum. ## Preliminaries ### Reinforcement Learning 1. RL considers a problem of how an agent should act in an environment so that the scalar return is maximized. And the interation with the environment is formalised using a mathematical framework called Markov Decision Processes (MDPs). 2. The agent acts according to a policy $\pi$ and we learn this $\pi$ to maximise the return. Three ways to learn this $\pi$ are: 1. Value based methods: We aim to learn a value function for each state which is used to derive policy $\pi$. Ex: Q-learning, SARSA 2. Policy based methods: Here we directly optimize for the policy by optimizing for the performace measure. Ex: Vanilla PGs, TRPO, PPO. 3. Actor-critic methods: These aim to learn both value(critic) and the policy (actor) where the actor updates based on the feedback from critic. Ex: Deterministic PGs  ## Curriculum learning **Definition of curriculum:** Let $\mathcal{T}$ be a set of tasks where each is a task $m_i = (S_i, A_i, p_i, r_i)$ in $\mathcal{T}$. Let $D^{\mathcal{T}}$ be set of all possible samples from tasks in $\mathcal{T}$: $D^{\mathcal{T}} = \{(s, a, r, s') \mid \exists m_i \in \mathcal{T} s.t. s \in S_i, a \in A_i, s' \sim p_i(\cdot | s, a), r \leftarrow r_i(s, a, s') \}$. A curriculum $C = (\mathcal{V}, \epsilon, g, \mathcal{T})$ is a DAG where $\mathcal{V}$ is the set of vertices, $\epsilon$ is a set of directed edgees, and $g: \mathcal{V} \rightarrow P(D^\mathcal{T})$ is a function that associates vertices to subsets of samples in $D^\mathcal{T}$, where $P(D^\mathcal{T})$ is the poewr set of $D^\mathcal{T}$. A directed edge $<v_i, v_k>$ in C indicates that samples associated with $v_j \in \mathcal{V}$ should be trained on before samples associated with $v_k \in \mathcal{V}$. All paths terminate on a single sink node $v_t$. 1. A curriculum can be created in an online fashion where the edges are dynamically created or it can be done completely offline. 2. But creating a graph at the sample level can be difficult for large tasks or large sets of tasks. So the authors try to simplify this. * **Single task curriculum:** All the samples used in curriculum comes from a single task. In orther words, the cadinality of the task set is 1 and hence there is only one MDP. Given single MDP, how to organise best and train on experience acquired from single task. * **Task-level curriculum** In this each vertex of a graph consists of samples from a single task. In RL, we (usually) acquire samples in an online manner and hence entire set of samples is usually not available ahead of time. And the task data collected from the previous task will influence the policy for the current task and hence the samples. Hence it is important to ask how to order these tasks such that the agent will collect some useful samples in the future tasks. Final simplification is the linear sequence where there is exactly one source and one sink node. #### Stopping criteria: When do we decide to stop the curricula? After convergence? Fixed number of epochs/episodes? ## Key elements of curriculum learning 1. **Task generation:** The quality of learning depends on the quality of tasks available at hand. Task generation is concerned with creating a good set of intermediate tasks. As discussed, these tasks may either be pre-specified or dynamically generated. 2. **Sequencing:** This is more concerend with learning the ordering than learning how to generate a new task. This sequence is sometimes manually specified or in somecases learned automatically. 3. **Transfer learning:** Usually in RL, the intermediate tasks are defined or constructed by altering either the reward function or the transition function or in some cases the both. Curriculum learning also in some sense comes under the umbrella where the agent must continuously learn how to transfer kowledge from one task to another when learning through curricula. Hence curricula is widely used in transfer learning scenarios to learn robust policies. ### How do we evaluate the curricula? 1. Comparing the agents performance on target task when trained with and without curricula in source domain. 2. In case of multiple target tasks, we can average the performance over all tasks (might not be always the best way to go with) or we can track the wall time to reach the assymptotic performance for all task and average them. 3. As we are dealing with curricula we can track the wall time of building this curricula as well. This can be a little bit difficult to track the time if the curricula is human designed. 4. Depends on the problem setting we are in. Are we doing zero-shot transfer to the target domain? Is it more critical to improve the initial performance, final performance or reaching a desired performance threshold? 5. Most of the times it depends on our baselines too. Are we aiming to propose a better curriculum learning algorithm? In that case we should probably compare with the baselines which also have some curriculum embedded in the algorithm. In the other case, where we just aim to show the benifits of curriculum for a new problem, we can just compare our approach with training w/o curriculum. ## Dimensions of categorization 1. **Intermediate task generation:** Finding a good curriculum depends on having a set of meaningful tasks. These tasks can either be human specified by domain experts or by can be learnt automatically. 2. **Curriculum representation:** Most general representation is by DAG. In single task scenario this can be ordering over samples and in multiple task scenario, this can be ordering over tasks (task-curricula). In some cases this can just be represented in terms of sequence. 3. **Transfer method:** The type of knowledge can either be policies, value function or the model itself or high level knowledge such as partial policies or shaping rewards. 4. **Curriculum sequencer:** Most of the literature on sequencing is on automatic generation, in many other works the sequencing is either done by oracle or human expert. This is similar to task generation dimension in terms of design. 5. **Curriculum adaptivity:** This is another design choice which is concerned with online or offline curriculum generation. 6. **Evaluation metric:** Similar to the discussion above in evaluation metric section 7. **Application area:** Toy problems? Sim2real in robotics? video games? ## Curriculum learning in RL Now we use the concepts discussed above and talk more concretely about their use in RL. ### How task generation is done in case of RL? 1. We should avoid "negative" transfer where using a task for transfer hurts a performance. 2. Many works assume that the domain can be parameterized using some kind of representation where the different instantiations result in different tasks. 3. **[Narvekar et al. (2016)](https://):** 4. **Object-Oriented MDPs:** Each task is assumed to have a class environment parameterized by a number of attributes. A human specified function creates simpler versions of the final task by just using the values of attributes that make simpler version of the final task so that it is easy to solve them. ([Da Silva and Reali Costa (2018)](https://)) 5. **Generating auxilliary intermediate tasks:** Some of the works consider how to create subproblems to aid in learning to solve dificult planning problems. ([Stone and veloso (1994)](https://)) 6. **Powerplay:** This work focuses how, in an unsupervised manner, to creat new problems to train more and more general problem. The system searches for both a new task and a modification of the current problem solver, such that the modifier solver can solve all previous tasks and the new one. [Schimdhuber (2013) ](https://) ### How to generate or learn a sequence: The goal is to order them (tasks or samples from tasks) in a way that facilitates learning.  #### Types of sequencing: 1. **Sample sequencing:** The ordering is done at the sample level without generating any intermediate tasks. In case of DQN example, where we store the samples in a replay buffer, we learn the Q-function by drawing samples from this buffer uniformly and then optimise for TD error. Now instead of uniformly drawing the samples, we can prioritize the samples based on the difficulty or usefullness. These are discussed in Prioritized Experience Replay paper and its variants. Another example is HER which is a widely popular technique used to learn efficiently in sparse reward settings. `HER forms a curriculum by taking advantage of the implicit curriculum present in exploration, where early episodes are likely to terminate on easy to reach states, and more difficult to reach states are found later in the training process.` An extension to HER, called curriculum-guided HER was prposed as an extension where the goals are given relative importance instead of equal importance. 2. **Co-learning:** This is a multi-agent approach to curriculum learning where the agents interact either co-operatively or adversarially with each other to generate the curriculum. Asymmetric self-play work proposed a teacher student framework to naturally build an automatic curriculum where the two agents optimize for their own reward. On the other hand, Robust Adversarial RL (RARL) work proposed an adversarial approach to build the curriculum where the RL agent leans to complete the original task while been robust to pertubations applied by the adversarial agent. Here the adversarial agent goal is to learn optimal sequence of destabilizing actions via a zero-sum game training procedure. The adversarial agent tries to identify the hardest conditions under which the RL agent may be required to act, increasing the agent robustness. This idea of colearning has also been explored in multi-agent setting where multiple agents cooperate to buld the curricula. (For example recent paper by openai about the emergent curricula) 3. **Task curricula by creating new MDPs for intermediate tasks:** One can create a new MDP by altering either the transition probabilities/reward function/state and action spaces. We can also play with the agent initial state distribution. One work explored this idea by creating a series of tasks by altering the agents initial state distribution in a way that is close to the goal state initially and then progressively shifting the distribution farther away from the goal state. Racaniere also cons approach to automtically generate a curriculum of goals but for complex goal-conditioned tasks. 4. **MDP sequencing:** The methods under this umbrella formulate curriculum generation as an interaction between two MDPs. The first is the standard MDP, which models learning-agent aiming to solve the task and the secon is a higher level meta-MDP for curriculum agent whose goal is to select tasks for the learning agent. Some works which have this formalism are "*Autonomous task sequencing for customized curriculum design in reinforcement learning*" and "*Teacher-student curriculum learning*". 5. **Combinatorial Optimization and Search:** This approch formalises the sequencing problem as a combinatorial optimization where we are interested in finding the permutation of tasks which lead to best curriculum from the given set of tasks. The notion of best curriculum is determined by some metrics. Few works on this include, "*An optimization framework for task sequencing in curriculum learning*", "*Curriculum learning for cumulative return maximization*", "*Faster reinforcement learning using active simulators*", etc. ## Research areas ### In RL: 1. RL in wild and real world suffers from sample complexity issues. While many papers have proposed some solutions to tackle this issues, each has it's own pros and cons. 2. Curriculum learning takes a different approach on this and aims at creating an ordered intermediate subtasks in a way that facilitates the learning. ### In supervised learning 1. Seminal paper by Bengio et al. who formalised the idea of curriculum in the context of ML. 2. Although setting maybe different the theory discussed earlier partially applies to this. ## Open questions: 1. How to create a fully automated tasks? Also the quality and quantity is also a important factor as for the large spaces we have to come up with intelligent ways to search efficiently for optimal sequence from that space. Few work has been done in generating the tasks on-fly and this is something important for the future research. 2. In what way the knowledge is being transferd from one task to another? Is is only through value function, models? Can't it be a combination of all depending on the requirement? 3. As we discussed there are three crucial parts in curriculum generation; a) task generation b) sequencing and 3) transfer learning. Most of the literature tackled the problem in pieces. So, the question becomes, can do task generation and sequencing simultaneously and construct an end-to-end solution? Also we need to keep in mind that while task generation, the tasks should be solvable, challenging but should not be beyond the agent capabilities.