# Paper summaries -- 2020
- [Lifelong Policy Gradient Learning of Factored Policies for Faster Training Without Forgetting](https://hackmd.io/ERDXu0InTvC7hTJDV3UZaQ?both#Lifelong-Policy-Gradient-Learning-of-Factored-Policies-for-Faster-TrainingWithout-Forgetting)
- [Efficient Continual Learning with Modular Networks and Task-Driven Priors](https://hackmd.io/ERDXu0InTvC7hTJDV3UZaQ?both#Efficient-Continual-Learning-with-Modular-Networks-and-Task-Driven-Priors)
- [SNR: Sub-Network Routing for Flexible Parameter Sharing in Multi-Task Learning](https://hackmd.io/ERDXu0InTvC7hTJDV3UZaQ?both#SNR-Sub-Network-Routing-for-Flexible-Parameter-Sharing-in-Multi-Task-Learning)
- [Meta-World: A Benchmark and Evaluation for Multi-Task and Meta Reinforcement Learning](https://hackmd.io/ERDXu0InTvC7hTJDV3UZaQ?both#Meta-World-A-Benchmark-and-Evaluation-for-Multi-Task-and-Meta-Reinforcement-Learning)
### Lifelong Policy Gradient Learning of Factored Policies for Faster Training Without Forgetting
Tags = RL, continual RL, lifelong RL
- TL;DR:
- Related Work:
- first class of lifelong RL (LRL) methods = single model
- e.g. PG + EWC
- second class of LRL = multiple models
- shared + task specific params
- Lifelong Learning Problem:
- Learn a sequence of MDPs and maximise the final performance on all of them
- Lifelong Policy gradient learning (LPG-FTW)
- main idea is $\theta^{(t)} \sim L s^{(t)}$ where $L \in \mathbb{R}^{d \times k}$ is a shared dictionary of policy factors and $s^{(t)} \in \mathbb{R}^{k}$ are the task-specific coefficients for task $t$.
- some kind of mixture of experts.
- $k$ = number of experts. controls the level of parameter-sharing
- first phase = training phase
- update $s^{(t)}$ on the current task while keeping $L$ fixed
- second phase = knowledge accumulation
- hold $s^{(t)}$ fixed and update $L$ with second-order gradients (see the sketch after this list)
- also:
- the idea of factorizing policies as $L s^{(t)}$ comes from PG-ELLA, but PG-ELLA doesn't use $L$ during the training phase
- natural gradients are important for learning $L$
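- A minimal sketch of the factorization and the two phases (not the authors' code): `policy_gradient` below is a toy stand-in for the natural policy-gradient estimate, and the second-order update of $L$ is replaced by a first-order one for brevity.
```python
import numpy as np

d, k = 1000, 5                      # policy dimension, number of factors ("experts")
L = np.random.randn(d, k) * 0.01    # shared dictionary of policy factors

def policy_gradient(theta, theta_star):
    # Toy stand-in for the real (natural) policy-gradient estimate:
    # gradient of a quadratic surrogate pulling theta toward a task optimum.
    return theta_star - theta

def train_on_task(task_theta_star, L, lr=1e-2, steps=100):
    # Phase 1: adapt the task-specific coefficients s_t while L stays fixed.
    s_t = np.zeros(k)
    for _ in range(steps):
        theta = L @ s_t                          # policy parameters for this task
        grad_theta = policy_gradient(theta, task_theta_star)
        s_t += lr * (L.T @ grad_theta)           # chain rule: d theta / d s_t = L
    return s_t

def accumulate_knowledge(task_theta_star, s_t, L, lr=1e-3, steps=10):
    # Phase 2: hold s_t fixed and update the shared dictionary L
    # (the paper uses second-order information here; this is only illustrative).
    for _ in range(steps):
        theta = L @ s_t
        grad_theta = policy_gradient(theta, task_theta_star)
        L += lr * np.outer(grad_theta, s_t)      # d theta / d L = grad_theta s_t^T
    return L

# Process tasks sequentially, keeping the per-task coefficients around.
coeffs = {}
for t in range(3):
    task = np.random.randn(d)                    # hypothetical "task"
    coeffs[t] = train_on_task(task, L)
    L = accumulate_knowledge(task, coeffs[t], L)
```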
- Experiments:
- baselines (all use Natural PG as the base learning method):
- single-task learning (STL)
- STL + EWC
- PG-ELLA
- ER
- benchmarks:
- Always uses task-ID
- 6 MuJoCo domains = continuous (gravity and body-part variations)
- HalfCheetah, Hopper, Walker-2D
- single-head
- Continual Meta-World (continual version of MT10 and MT48) = continuous
- multi-head
- results:
- SOTA (including > PG-ELLA, so their improvement works)
- notes:
- the best way to interpret it is as a mixture of experts: the policy for a task is a simple linear combination of the $k$ experts
### Efficient Continual Learning with Modular Networks and Task-Driven Priors
Tags = modularity, continual learning
- TL;DR:
- new benchmark (CTrL) to study not only final performance and forgetting, but also scaling and transfer
- nice set of experiments across the different metrics
- new method Modular Networks with Task-Driven Priors (MNTDP) which, at each task, spawns new modules and learns to wire them together (task-aware setting, so you can store the module routing for each task)
- Introduction:
- desideratum: CL methods whose memory and compute scale sublinearly with the number of tasks
- modularization:
- no forgetting if you freeze the modules
- transfer
- scales sublinearly if modules are shared across tasks
- Evaluating CL models
- Desirable properties:
- high **average accuracy**
- low **forgetting**
- high **transfer**, where transfer on a task = performance of the CL method on that task - performance of an independent model (see the metrics sketch after this list)
- **Direct transfer** on $S^{-} = (t_1^+, t_2, t_3, t_4, t_5, t_1^-)$
- **knowledge update** on $S^{+} = (t_1^-, t_2, t_3, t_4, t_5, t_1^+)$
- **Input/Output transfer** on $S^{in} = (t_1, t_2, t_3, t_4, t_5, t_1')$ and $S^{out} = (t_1^-, t_2, t_3, t_4, t_5, t_1'')$
- Nice link w/ OSAKA bc shift in target distribution
- **Plasticity** $S^{pl} = (t_1, t_2, t_3, t_4, t_5, t_1^+)$ (classic)
- sublinear **scaling** on $S^{long} = (t_1, t_2, ..., t_{100})$
- **Memory**
- **Compute** (FLOPS)
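- A minimal sketch of how the first three metrics can be computed, assuming an accuracy matrix `acc[i, j]` = accuracy on task $j$ after training on tasks $0..i$ and `indep[j]` = accuracy of an independently trained model on task $j$ (my notation, not the paper's):
```python
import numpy as np

def cl_metrics(acc, indep):
    T = acc.shape[0]
    final = acc[-1]                              # accuracies after the whole stream
    avg_accuracy = final.mean()
    # Forgetting: drop from the best accuracy ever reached on each earlier task.
    forgetting = np.mean([acc[:, j].max() - final[j] for j in range(T - 1)])
    # Transfer: accuracy right after learning task j minus the independent model's.
    transfer = np.mean([acc[j, j] - indep[j] for j in range(T)])
    return avg_accuracy, forgetting, transfer
```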
- Modular Networks w/ Task-Driven Priors (MNTDP)
- a module is essentially a sub-group of a layer, ie a group of neurons
- at each task, their method
- 1) search for the most similar past task and retain only the corresponding architecture (see Data-Driven prior; a sketch follows this list)
- 2) temporarily spawn new modules at each layer (so it makes the network wider)
- 3) train on the current task by learning both the way to combine modules and the new modules' parameters
- 4) **freeze** the architecture
- only branch out to the right (to make the search space smaller)
- Data-Driven prior:
- "We take the predictors from all the past tasks and select the path that yields the best nearest neighbor classification accuracy when feeding data from the current task using the features just before the classification head."
- Experiments:
- Took care not to use future data for the hyperparameter search
- MNTDP trades off computation for performance, but does so better than similar methods, e.g. PNN and HAT
### SNR: Sub-Network Routing for Flexible Parameter Sharing in Multi-Task Learning
Tags = modularity, multi-task, subnetworks
- TL;DR:
- multi-task
- can enable positive transfer
- as well as saving computation cost (in this case they mean through parameter sharing because you are performing different tasks on the same input)
- they propose Sub-Network Routing (SNR) which essentially learns a binary mask (as a latent variable) in the multi-task setting that enables flexible parameter sharing through sub-networks
- Approach
- SNR - transformation
- $h'_t = (W_t \odot Z_t) h_t$
- where $h_t$ is the output of the previous layer and $h'_t$ is the input to the next. $Z_t$ is a binary matrix that controls the connections of the subnetworks and $W_t$ is a transformation matrix controlling the weights.
- SNR - Average
- $h'_t = Z_t h_t$
- NOTE: because the last shared layer is only sparsely connected to the independent output heads, if you wanted to perform a single task, you could recursively find the minimal set of neurons that needs to be computed and run only that smaller subnetwork
- degenerate cases
- if $Z_{ij} = 1, \forall i,j$, then it degenerates to the shared-bottom architecture
- if $Z_{ii} = 1$ and $Z_{ij}=0$ for $i \neq j$, then it degenerates to small independent networks.
- ! the search space has size $2^{|Z|}$, so they learn $Z$
- $z \sim \text{Bern}(\pi)$; you could use REINFORCE, but...
- instead they use the hard concrete distribution. Essentially, you learn a latent variable that goes through a hard sigmoid, and you use variational inference and the reparametrization trick to optimize it (see the sketch after this list)
- Lastly, they use $L_0$ regularization on $Z$ to get sparse codes
- (some tricks to make it work w/ the hard concrete distribution)
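- A minimal PyTorch-style sketch of the hard concrete gate (following Louizos et al.'s $L_0$ formulation) and the SNR-Trans layer above; constants and class names are my assumptions, not the paper's code.
```python
import torch
import torch.nn as nn

class HardConcreteGate(nn.Module):
    """Relaxed binary gate Z: one learnable log_alpha per entry."""
    def __init__(self, shape, beta=2/3, gamma=-0.1, zeta=1.1):
        super().__init__()
        self.log_alpha = nn.Parameter(torch.zeros(shape))
        self.beta, self.gamma, self.zeta = beta, gamma, zeta

    def forward(self):
        if self.training:
            u = torch.rand_like(self.log_alpha).clamp(1e-6, 1 - 1e-6)
            s = torch.sigmoid((u.log() - (1 - u).log() + self.log_alpha) / self.beta)
        else:
            s = torch.sigmoid(self.log_alpha)
        # Stretch to (gamma, zeta) and clamp: gates frequently land exactly on 0 or 1.
        return (s * (self.zeta - self.gamma) + self.gamma).clamp(0.0, 1.0)

    def l0_penalty(self):
        # Expected number of non-zero gates (differentiable surrogate for ||Z||_0).
        shift = self.beta * torch.log(torch.tensor(-self.gamma / self.zeta))
        return torch.sigmoid(self.log_alpha - shift).sum()

class SNRTransLayer(nn.Module):
    """h' = (W ⊙ Z) h, with Z sampled from the hard concrete gate."""
    def __init__(self, d_in, d_out):
        super().__init__()
        self.W = nn.Parameter(torch.randn(d_out, d_in) * 0.01)
        self.gate = HardConcreteGate((d_out, d_in))

    def forward(self, h):
        Z = self.gate()
        return h @ (self.W * Z).t()
```
- During training you would add something like $\lambda \sum_\ell$ `gate.l0_penalty()` over the layers to the multi-task loss to push the routing matrices toward sparsity.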
- Experiments
- keeping compute and capacity fixed, SNR-Trans > SNR-Aver
- SNR is a good tool to reduce the amount of serving-time parameters.
### Meta-World: A Benchmark and Evaluation for Multi-Task and Meta Reinforcement Learning
Tags = RL, multi-task RL, meta RL,
- TL;DR:
- Proposes a new multi-task and meta-learning RL benchmark in which the task suite is:
- sufficiently broad to enable generalization to new behaviors,
- while containing sufficient shared structure for that generalization to be possible
- Current methods struggle in both benchmarks
- Multi-task vs Meta RL:
- Multi-task:
- Learn a single policy that can solve multiple tasks > learning the tasks individually
- Trying to achieve positive transfer
- In RL: $\pi(a|s,z)$ where $z$ is a task ID (see the sketch after this list)
- Evaluated on the training tasks
- Meta-learning
- optimizes for fast adaptation to new tasks
- interested in generalization to new behaviors
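- A minimal sketch of the task-conditioned policy $\pi(a|s,z)$, assuming the task ID is fed as a one-hot vector concatenated to the state; the architecture and the dimensions in the usage line are illustrative assumptions, not Meta-World's reference implementation.
```python
import torch
import torch.nn as nn

class TaskConditionedPolicy(nn.Module):
    """pi(a | s, z): the one-hot task ID z is concatenated to the state."""
    def __init__(self, state_dim, action_dim, num_tasks, hidden=256):
        super().__init__()
        self.num_tasks = num_tasks
        self.net = nn.Sequential(
            nn.Linear(state_dim + num_tasks, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, action_dim), nn.Tanh(),   # continuous actions in [-1, 1]
        )

    def forward(self, state, task_id):
        z = nn.functional.one_hot(task_id, self.num_tasks).float()
        return self.net(torch.cat([state, z], dim=-1))

# e.g. policy = TaskConditionedPolicy(state_dim=9, action_dim=4, num_tasks=10)
```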
- Meta-World
- High-level:
- to generalize, we need a broad range of meta-train tasks
- to get enough breadth, they argue that the differences between tasks can't be described by continuous parameters (like in previous Meta RL benchmarks)
- So the space of manipulation tasks exhibits non-parametric variability across tasks and parametric variability within tasks (to avoid memorization)
- Essentially, each task ~ a previous Meta RL benchmark
- Tasks structure:
- combinations of basic behavioral building blocks {reach, push, grasp} w/ an object
- objects have different shapes
- and different articulation properties (e.g. a door and a drawer have a different joint)
- more complex tasks require combinations of the building blocks, which must be executed in order.
- Actions, Observations, and Rewards
- simulated Sawyer robot for all tasks
- either manipulate one object with a variable goal position,
- Observations = 3-tuple of the 3D Cartesian positions of the end-effector, the object, and the goal (9 dim)
- or manipulate two objects with a fixed goal position.
- Observations = 3-tuple of the 3D Cartesian positions of the end-effector, the first object, and the second object (9 dim)
- Rewards:
- $R = R_{reach} + R_{grasp} + R_{place}$, or a subset of these for simpler tasks (a small sketch follows this list)
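- A minimal sketch of how the 9-dim observation and the composite reward fit together; the distance-based shaping terms are hypothetical stand-ins, not Meta-World's actual per-task reward functions.
```python
import numpy as np

def observation(end_effector_pos, object_pos, third_pos):
    # 9-dim observation: three 3D Cartesian positions concatenated
    # (third slot = goal position, or the second object for two-object tasks).
    return np.concatenate([end_effector_pos, object_pos, third_pos])

def reward(ee, obj, goal, grasped):
    # Hypothetical shaping terms standing in for R_reach, R_grasp, R_place.
    r_reach = -np.linalg.norm(ee - obj)
    r_grasp = 1.0 if grasped else 0.0
    r_place = -np.linalg.norm(obj - goal)
    return r_reach + r_grasp + r_place   # simpler tasks use only a subset of these
```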
- Evaluation protocol:
- ML1: few-shot adaptation to goal variation within one task
- similar to previous Meta RL benchmarks
- goal positions are not provided, forcing the meta-RL algos to adapt to the goal through trial and error
- MT10, MT50: learning one multi-task policy that generalizes to 10 and 50 training tasks
- task-aware setting i.e. task ID is provided
- positions of objects and goals are fixed throughout
- basically you are trying to overfit the training tasks.
- ML10, ML45: few-shot adaptation to new test tasks with 10 and 45 meta-training tasks.
- intentionally select training and test tasks with structural similarity
- Success metric:
- $||o-g||_2 < \epsilon$, i.e. is the object close enough to its goal position
- Experiments
- baselines:
- Multi-task (multi-task variants of the following):
- PPO
- TRPO
- SAC
- multi-head SAC
- TE (w/ task embedding)
- Meta-learning
- RL$^2$
- MAML
- PEARL (w/ inferred task embedding)
- ML1:
- even this benchmark is hard
- MT10, MT50:
- Multi-head seems important
- ML10, ML45:
- it's just hard.
- Also, explicit task inference (as in PEARL) may not be straightforward in this setting.