Automatic curriculums - Proposed outline
===
###### tags: `projects`
# Feedback on current submission of "Learning of Sophisticated Curriculums by viewing them as Graphs over Tasks"
### 1- Introduction
The current introduction explains RL and why RL fails (prior knowledge) in layman's terms. It's nice, but the audience this paper targets expects something else. I propose a more generic introduction that covers the current breakthroughs in RL and its current limitations in some settings (e.g. sparse rewards), without any reference to Minigrid.
The part about curriculum learning contains enough information and can be kept almost as is.
However, we should highlight the contributions of the paper even more. I suggest one of two things:
- **CASE 1**: If we run enough experiments to show that the proposed distribution converter $\Delta^{\text{prop}}$ is better than those proposed by the initial paper (the experiments should include some of the same ones they used in their paper: Minecraft and addition with LSTMs), then yes, this can be a contribution of this paper.
- **CASE 2**: If not, we have to drop the "improve teacher-student algorithms" contribution, because it isn't one! The sole contribution would then be the mastering-rate-based algorithm.
### 2- Background
Current sections 2.1 and 2.2 are well written and can be kept as is. I would add some of the content of section 4.1 (recommendations) to section 2.2, simply by mentioning that some of the scenarios they suggest are particular cases of others. In my view, it makes no sense to suggest a new distribution converter $\Delta^{\text{prop}}$ and use it as a baseline.
In both **CASE 1** and **CASE 2**, the section should probably end here. If it's too short, we can think of merging it with related work (i.e. start with related work, then introduce the notations, etc.).
It doesn't make sense to advise against using the Boltzmann distribution just because one has to tune $\tau$, especially since the mastering-rate algo itself uses many hyperparameters. It may be hard to tune, but it outperforms mastering-rate-based algorithms.
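For reference, the Boltzmann converter in its standard softmax-with-temperature form (assuming, as in the teacher-student setting, that it maps per-task attention scores $a_c$ to a sampling distribution $d_c$ over tasks; the submission's exact definition may differ slightly):

$$
d_c \;=\; \frac{\exp(a_c / \tau)}{\sum_{c'} \exp(a_{c'} / \tau)}
$$

As $\tau \to 0$ this approaches greedy sampling of the highest-attention task, and as $\tau \to \infty$ it approaches the uniform distribution, which is exactly why $\tau$ needs tuning, but also why a well-tuned value can be competitive.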
I found the POMDP view of the teacher-student algorithm quite nice, and I think it's worth devoting a paragraph to it, then mentioning in the following paragraph that we currently use heuristics to solve the teacher's POMDP.
### 3- Section 3 (and possibly 4- Section 4)
- **CASE 1**: Section 3 should be about the improvement on the teacher-student algorithm. Give more intuition as to why this distribution converter is better than the others, or perhaps some math in simple tabular environments. End the section by mentioning that the experiments in section 5 will actually show how much better it is.
- **CASE 2**: The proposed distribution converter is no longer a contribution, and we should talk directly about the mastering-rate algo. Give more intuition as to why it makes more sense to use this heuristic rather than the previous one, without referring to one or two of your experiments. At the end of the section, mention that we will provide experimental evidence as to why it makes more sense.
To sum up, the current section 5, without its references to experiments, should become section 4 in **CASE 2** and section 3 in **CASE 1**.
A few things about the mastering-rate algo need more clarification:
- Is the name "min-max curriculum" something that already exists?
- Way more justification for every heuristic choice! It can be intuition, it can be a one-sentence justification, but we can't just drop things like "here is the minimum mean, the mean, and the maximum mean; define this ratio, and this ratio is amazing" (see the sketch after this list for the kind of statement that needs such a justification).
- Other things that can be discussed later
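To make the previous point concrete, here is one reading of the ratio in question, in hypothetical notation (the submission's own definitions take precedence): write $\bar{r}_c$ for the running mean return on task $c$, and $m_c$ and $M_c$ for the minimum and maximum means (plausibly the returns before any training and once the task is mastered). The mastering rate would then be

$$
\mathcal{M}_c \;=\; \frac{\bar{r}_c - m_c}{M_c - m_c} \;\in\; [0, 1],
$$

and the one-sentence justification asked for above is exactly a statement of why normalizing the mean return between these two extremes measures how "mastered" task $c$ is.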
### 5- Experiments
- Section 5.1 should be about the experimental setting. I suggest that, in addition to the Minigrid curricula, we add a supervised curriculum similar to the one used in the first paper (addition with LSTMs).
- Section 5.2: results (the current ones + the LSTM ones + Minecraft if possible).
- If time/space permits, we can do some simple ablation studies to show that all the choices are sound! For example, it might be that introducing $\gamma_{pred}$ and $\gamma_{succ}$ doesn't change much, and the algorithm would work just as well without them. It's always nice to prove experimentally that this is not the case.
To sum up, the experimental part is currently weak. Even though the mastering-rate algo outperforms the others by far on complicated tasks, this could be because those tasks are a failure mode of the others, and given other curricula (e.g. Minecraft and LSTM addition) learning progress could beat mastering rate; or it could be because what you considered as a baseline is not strong enough, in which case the baseline should be tuned better (e.g. tune the Boltzmann distribution converter's temperature $\tau$; a minimal tuning sketch follows).
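As a sketch of what "tuned better" could mean in practice: a small grid search over $\tau$, averaged over a few seeds, before quoting the Boltzmann baseline. This is only an illustration; `run_curriculum` is a hypothetical stand-in for the actual training entry point, and the grid and seed values are placeholders.

```python
import numpy as np

def boltzmann_converter(attention, tau):
    """Standard softmax with temperature: per-task attention scores -> sampling distribution."""
    a = np.asarray(attention, dtype=float) / tau
    a -= a.max()                      # for numerical stability
    p = np.exp(a)
    return p / p.sum()

def tune_tau(run_curriculum, taus=(0.01, 0.05, 0.1, 0.5, 1.0), seeds=(0, 1, 2)):
    """Pick the temperature that maximizes the (hypothetical) final score
    returned by `run_curriculum(converter, seed)`, averaged over seeds."""
    scores = {
        tau: np.mean([run_curriculum(lambda a, t=tau: boltzmann_converter(a, t), seed)
                      for seed in seeds])
        for tau in taus
    }
    best_tau = max(scores, key=scores.get)
    return best_tau, scores
```

Reporting the baseline at its best $\tau$ (per curriculum) would rule out the "weak baseline" explanation for the gap.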