# Massimo's amzn internship

[experience journal and potential features](https://docs.google.com/spreadsheets/d/1360Y99xVZwQkk0QMNnx_VzDo2MvcxOADvKuDBUBrpqY/edit?usp=sharing)
[current write-up](https://hackmd.io/wBfk658HRla2veu6Hj2fkQ)
[current overleaf](https://www.overleaf.com/6698461249ywvztdfydtvy)
[Rasool's Quip](https://quip-amazon.com/FSCFAWyWG1Eu/Continual-learning-Massimos-Amzn-internship)

## current working memory
- merging 4.1 and 4.2
- hypotheses 1.3 and 2.2 should be merged together:
  - namely, "the RNN infers how a new task relates to the others, enabling skill reuse"
  - if it was only about subtask decomposition, then the MT-RNN would beat MT

## Meeting 18/02

#### current working memory
- TACRL motivation:
  - start from what our real motivation is, i.e. an agent in a humongous POMDP with non-stationary hidden states
  - standard episodic POMDP objective:
    - $\max_{\pi} \; \mathbb{E}_{s^h}\big[\mathbb{E}_{s_t, a_t, r_t \sim O, \pi}[\sum_{t=0}^{H-1} \gamma^t r_{t+1} \mid s_t]\big]$
    - where $s_t = [s_t^o, s_t^h] = [o_t, s_t^h]$
    - $s^h_{t+1} \sim p(\cdot \mid s^h_t)$
  - then explain how we simplify things a bit to create our benchmarks
    - i.e. a non-backtracking chain over $s^h$
- different eval protocol
  - if we just look at the non-stationary objective, online performance should be maximised
    - which is ultimately what we are interested in
  - but we believe that the accumulation of knowledge will be important for solving real-world complex tasks, so in TACRL we focus on the anytime global performance of an agent (a rough sketch of both metrics follows this list)
  - why the hell does global performance go down on the 50-task benchmark?
- toy experiments
  - particle moving in a 2D plane
    - probably fetching some things
    - maybe a maze is appropriate
  - trickiness stems from the RNN having to have an edge on the other methods
    - the reward function can't give it away in one timestep
      - add noise?
      - add a velocity reward?
        - task-aware will be at an advantage
      - add a maze?
      - positive reward only if you've made it closer to the target?
  - ultimate goal is having control over the complexity of the task, such that we can break task-aware methods by making them too complex
  - maybe we should study both kinds of generalization:
    - 1: from an HRL sense, i.e. can you sequentially combine skills
    - 2: from an OoD sense, i.e. can you do tasks that are interpolations of learned tasks
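A minimal sketch of the two eval protocols above (hypothetical helper, assuming we log a per-task eval return at every checkpoint): online performance only looks at the task currently being trained on, anytime global performance averages over all tasks at every checkpoint.

```python
import numpy as np

def online_and_global_performance(eval_returns, task_schedule):
    """eval_returns: [n_checkpoints, n_tasks] mean episodic return per task,
    task_schedule: index of the task being trained on at each checkpoint."""
    online = np.array([eval_returns[c, task_schedule[c]]
                       for c in range(len(task_schedule))])
    global_perf = eval_returns.mean(axis=1)   # anytime global performance
    return online, global_perf
```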
#### meeting takeaways
- SAC > DQN
- better off using an already known toy benchmark, even if discrete
- also:
  - the CORA paper could be interesting

## Meeting 23/01

#### results
- Ant-dir CL vs MTL
  - ![](https://i.imgur.com/RSv3wDT.png)
  - ![](https://i.imgur.com/gepftTd.png)
- Grad variance analysis for RNN
  - ![](https://i.imgur.com/5sdmwkS.png)
  - ![](https://i.imgur.com/PQpRTzu.png)
- gradient entropy
  - ![](https://i.imgur.com/4LfEG6x.png)
  - ![](https://i.imgur.com/WFE3nus.png)

## Meeting 20/01

#### discussion pts
- job dispatch

#### results
- ![](https://i.imgur.com/mp32DSh.png)

## Meeting 18/01

#### discussion pts
- hypothesis
- gradients
- TAMH

#### results
- CW10:
  - ![](https://i.imgur.com/JVMGBIE.png)
  - TAMH looking pretty good :o
- gradients
  - ![](https://i.imgur.com/8Rlyzmf.png)
  - ![](https://i.imgur.com/8O4hdoD.png)
  - looks like CL is becoming less and less conflicting
  - ![](https://i.imgur.com/rN2xsqw.png)

## 16/01 meeting w/ rasool
- abort the deadly triad hypothesis
- try g3 and p3 for faster training
- look at gradients instead of weight changes (to see if some modularity emerges; a small gradient-similarity sketch follows this list)
- also:
  - relaunch CW10 w/ bigger nets
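A minimal sketch of the kind of per-task gradient analysis discussed above (the per-task batches and the `task_loss` callable are assumptions, not our actual training code): compute one flattened gradient per task and look at the pairwise cosine similarities to spot conflict.

```python
import torch
import torch.nn.functional as F

def per_task_grad_cosine(model, task_batches, task_loss):
    """task_batches: dict task_id -> batch; task_loss(model, batch) -> scalar loss.
    Returns the [n_tasks, n_tasks] cosine-similarity matrix of per-task gradients."""
    grads = []
    for batch in task_batches.values():
        model.zero_grad()
        task_loss(model, batch).backward()
        g = torch.cat([p.grad.flatten() for p in model.parameters()
                       if p.grad is not None])
        grads.append(g)
    G = F.normalize(torch.stack(grads), dim=1)
    return G @ G.T
```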
## 14/01 meeting w/ alex
- try transformers :smirk:
- RNN could be better than baseline in CL bc it performs fast adaptation

## 13/01

#### takeaways
- COMPS made CL_ant-dir harder by changing the angle by at least 70 degrees at each consecutive task
- in our STL, it's a bit like MTL because the goal changes
- TODO: check what the Meta-World paper did for MT{10, 50}
  - maybe we can repeat the STL experiments on MT10 or MT50

#### results
- CL ant-dir 50t
  - ![](https://i.imgur.com/BySdb64.png)
  - seems like NM_ID is not so bad but seems overwhelmed at the end...
  - ![](https://i.imgur.com/QfTWdAI.png)
  - **performance on all tasks doesn't increase**
    - bad CRL benchmark
  - maximising variance doesn't help here either
- CL ant-dir 20t
  - ![](https://i.imgur.com/9Hzx9r8.png)
  - ![](https://i.imgur.com/zieOnHz.png)
  - some hints of global learning
  - ![](https://i.imgur.com/ddLm5OW.png)
  - methods w/ per-task parameters are highly unstable
    - maybe run `NM_ID`?
  - maximising variance doesn't help here either
- CL MT50 25M
  - csp 0.0
    - ![](https://i.imgur.com/XT942H9.png)
  - csp 0.2
    - ![](https://i.imgur.com/nRrCIfn.png)
  - RNNs perform well past the 20% threshold, equivalent to 50% on MT20
    - ![](https://i.imgur.com/sPBXU9e.png =400x)
    - warning: slight overestimation, as performance on unseen tasks isn't 0.0
  - csp 0.2 is doing better!
  - maybe a good place to try the low-replay regime
- CL MT50 50M
  - csp 0.0
    - ![](https://i.imgur.com/wMdYgwT.jpg)
  - csp 0.2
    - ![](https://i.imgur.com/WWFFpAf.png)
  - csp 0.2 better again!
- global insights
  - per-task params is not always good in all regimes

## 11/01

#### key takeaways
- if we cheer for `ER-NM_ID-ID`, we need to run `ER-ID` as a baseline
- do the no-replay setup
  - (I think we should do tiny replay instead)
- if we do an RNN paper, we might need
  - an RNN version that can't do dynamic policies
    - e.g. feed the average hidden state to the actor/critic
      - warning: the algo could find a workaround...
    - e.g. a penalty on the hidden variation
  - a transformer
    - could be faster to run

#### results
- CW10
  - ![](https://i.imgur.com/djAvqgQ.png)
  - ER-RNN wins
  - our non-RNN w/ MH > ER-MH
- CL ant-dir 20 tasks
  - ![](https://i.imgur.com/1v8FIhh.png)
  - our non-RNN method w/o MH is looking good
  - not enough data for our non-RNN method w/ MH
- CL ant-dir 50 tasks
  - ![](https://i.imgur.com/gvE804s.png)
  - our non-RNN w/o MH is looking good
  - not enough data for both MH methods
- CL MT50 25M timesteps
  - ![](https://i.imgur.com/CKLX4Ij.png)
  - ER-RNN looking good
  - not enough data for our non-RNN method w/ multi-head
- CL MT50 50M timesteps
  - csp 0.0
    - ![](https://i.imgur.com/7lZmA4g.png)
    - RNNs are looking very good, esp. compared to MT20
  - csp 0.2
    - ![](https://i.imgur.com/YX4Iat7.png)
- MTL MT20 (for reference)
  - ![](https://i.imgur.com/Dc2NIIQ.png)

## 06/01

#### takeaways
- reporting MTL results is problematic bc performance eventually drops...
  - but on the back burner for now
  - thus, focus on CL (and STL)
- if we go the RNN route, we need more analysis
  - transfer matrix
    - this can be computed a posteriori
  - forward transfer?
    - same
  - forgetting
    - same
  - representation analysis
    - disentangling task inference from HRL
      - or implicit MBRL
    - PCA

#### checkpoint
- what we had
  - in the paper
- what's new
  - gradient clipping experiments MT10 (w/ entropy tuning)
    - NM_ID
      - ![](https://i.imgur.com/axH40WP.jpg)
      - still some seed explosion under some seeds
        - there's no pattern...
        - maybe 8 seeds isn't enough for Meta-World...
      - w/o gradient clipping:
        - ![](https://i.imgur.com/7ufXilI.png)
    - RNN
      - ![](https://i.imgur.com/rLVUTT1.png)
      - w/o gradient clipping
        - ![](https://i.imgur.com/neHO54A.png)
    - NM_RNN
      - ![](https://i.imgur.com/F3GfIZH.jpg)
      - w/o grad clip ![](https://i.imgur.com/6KsdzBa.png)
      - ![](https://i.imgur.com/7dv1xH9.png)
- what is running
  - MTL_MT50 w/ grad clip
    - ![](https://i.imgur.com/RPMyGUL.png)
    - looking good for NM_ID
    - but methods are converging super quickly and to bad local minima compared to Meta-World MTSAC
      - only diff I see is the per-task alpha learning
  - CL_MT50
    - ![](https://i.imgur.com/tWmIagV.png)
  - CW10 (now w/ grad clip)
    - ![](https://i.imgur.com/xEbhB6G.png)
- insights
  - w/ gradient clipping, it is clear that the algo gets stuck in local optima
    - are there ways to escape them?
    - lr schedule?
  - CL > MTL (at some points?)
    - MTL
      - ![](https://i.imgur.com/9p1ybIJ.png)
    - CL
      - ![](https://i.imgur.com/r9lI10S.png)
  - sparsity analysis on MT50
    - ![](https://i.imgur.com/SUDyaka.png)
    - ![](https://i.imgur.com/FjpV0Qo.png)
    - seemed to be inversely correlated

## 04/01

#### new results
- 8 seeds is probably not enough in Meta-World (found out w/ the sort_batch variance)

#### takeaways
- double down on long MW experiments
- variance from seeds is fine (because you are collecting a different dataset for each trial, it makes sense)

## 22/12
- RNN might simply be better because the embedding is richer than feeding in the task ID?
  - No: ER-RNN > ER-EMB
- RNN is simply treated as an architecture choice...
  - I think that we can be smarter than this!
- RNN is better because the embedding is changing through time:
  - we should analyse this by monitoring something:
    - PCA to map the representations in 2D? then you can visualize (a quick sketch follows this entry)
      - you could plot different tasks in one plot, as well as different timesteps, to show the evolution
    - more simply, we could just monitor the standard deviation of the modules throughout an episode
      - as well as the mean, to differentiate the tasks?
    - we should repeat this for the modules and for the RNN's hidden state
      - maybe they have different jobs :open_mouth:
- we should test the 0-replay setting, maybe our modular method would work better there
  - makes sense, but we need a really modular method...
  - time to try the module variance maximization?
- for now, stick with Meta-World...
  - only if we don't have enough paradigms do we switch to CoMPS
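A quick sketch of the PCA idea above, assuming we have already collected hidden states with their task and timestep labels (all names are hypothetical):

```python
import numpy as np
from sklearn.decomposition import PCA
import matplotlib.pyplot as plt

def plot_hidden_pca(hiddens, task_ids, timesteps):
    """hiddens: [n_points, d] RNN hidden states gathered during eval episodes;
    task_ids / timesteps: [n_points] labels used for colour / marker size."""
    xy = PCA(n_components=2).fit_transform(hiddens)
    sizes = 5 + 20 * timesteps / timesteps.max()
    plt.scatter(xy[:, 0], xy[:, 1], c=task_ids, s=sizes, cmap="tab20", alpha=0.6)
    plt.xlabel("PC 1"); plt.ylabel("PC 2")
    plt.title("RNN hidden states by task (colour) and timestep (size)")
    plt.show()
```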
## 21/12

#### thoughts
- CL > MTL
  - maybe we can explore in which regime you should learn vs continually learn?
- RNNs are always on top in CW10 (CL or STL), maybe we should look into that

#### results
- MTL on MT10:
  - w/ automatic_entropy_tuning
    - ![](https://i.imgur.com/Smk0CvB.png)
  - w/o
    - ![](https://i.imgur.com/eZwH2Tp.png)
  - takeaway: our routing method learns MUCH faster than reported MTSAC and significantly better than ER-MH
  - reported multi-task entropy tuning:
    - ![](https://i.imgur.com/oGsHDhC.png)
- CW10
  - Full:
    - ![](https://i.imgur.com/dUM8mH7.png)
    - NM_RNN on top w/ ER-MH, but not significant
  - Half compute:
    - ![](https://i.imgur.com/RsFS1A5.png)
    - NM_RNN on top and maybe significant
  - full w/ fixed alpha
    - ![](https://i.imgur.com/6Pt3aAW.png)
    - not done, but similar results and as much variance
- STL
  - for the standard architecture:
    - nothing interesting
  - for small architectures
    - RNNs get destroyed
  - on Ant, we are experiencing the same Q-value explosions!
    - ![](https://i.imgur.com/FjlNAcg.png)

## 15/12

#### key takeaways
- we need to bring the variance down, everywhere
- we need proper STL results
- we need to run the SAC baseline w/ an equal amount of params
  - we already have 3-layer vanilla SAC
- we eventually should run soft-module
  - in STL and CL

#### thoughts
- no need to run a bigger architecture:
  - MW was actually running [400, 400]
- do we make the paper about why MoE is bad and Neuron Masking is better?
  - then, we can prioritize adding the MoE baseline
- seems like we might have a win in multi-task learning with the method that leverages the task ID...
  - maybe we can do a story w/ that?
  - in this case, we need to prioritize multi-task automatic entropy tuning (a per-task alpha sketch appears after the 06/12 entry below)
- for single-task mujoco experiments:
  - my hunch is, the easier the task is, the less our method can help
  - at least you need to find a regime where a dynamic policy can help.
  - monitor the total variation of the mask within an episode?

## 06/12

#### results
- good single-task learning results!
  - ![](https://i.imgur.com/mOyFt1s.png)
  - ![](https://i.imgur.com/3OmIgKI.png)
  - ![](https://i.imgur.com/WkwOhPg.png)
  - ![](https://i.imgur.com/s5eKuJc.png)
- for MTSAC w/o multi-head, we don't have variance!
  - ![](https://i.imgur.com/l9EYBuT.png)
  - ![](https://i.imgur.com/xMcheL1.png)
  - is there multi-head in Garage?
    - if so, maybe ours is badly initialized?
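A minimal sketch of what multi-task automatic entropy tuning could look like (one learnable temperature per task, indexed by the batch's task IDs; class and argument names are assumptions, not our codebase):

```python
import torch

class PerTaskAlpha(torch.nn.Module):
    """One SAC temperature per task, trained with the usual alpha loss."""
    def __init__(self, n_tasks, target_entropy):
        super().__init__()
        self.log_alpha = torch.nn.Parameter(torch.zeros(n_tasks))
        self.target_entropy = target_entropy   # typically -|A|

    def alpha(self, task_ids):                 # task_ids: [B] -> alphas: [B]
        return self.log_alpha[task_ids].exp()

    def loss(self, log_probs, task_ids):
        # J(alpha) = E[ -alpha * (log pi(a|s) + target_entropy) ], per task
        return -(self.log_alpha[task_ids]
                 * (log_probs + self.target_entropy).detach()).mean()
```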
## 09/12

#### key takeaways
- trying to find why some seeds fail:
  - try not learning alpha
  - test a smaller lr
  - grad clipping and/or l2 reg
  - seeds 10 to 20
  - different seed, same env (seed)
- also:
  - monitor the norm of the obs
  - look into the MW code that launched the exps they report
  - maybe multi-step TD learning could help reduce the bias in the updates

## 02/12

#### thoughts
- seems to be all about not having too many seed crashes
- maybeee we are better than the `RNN`
  - ![](https://i.imgur.com/yOHSmmH.png)
  - idk how I feel about this, given the `warm_up`'s impact on results
    - ![](https://i.imgur.com/QWyASZC.png)

## 30/11

#### key takeaways
- investigate the single-task regime
- maybe reduce burn_in after task 1
  - Jonas: you'd want to automatically start training more on the current task and converge to uniform before the end of the task
    - need a way to do this w/o introducing an hparam
    - maybe something like prioritized experience replay could do it, like: train more on the weakest tasks
- auto-tuning alpha could be weird in CL. maybe turn it off, or restart it at task boundaries

#### results
- runs we were cheering for last week (NM_RNNs) are having a hard time w/ the last task
  - csp = 0.5:
    - ![](https://i.imgur.com/TBd8Lmi.png)
  - csp = 0.0
    - ![](https://i.imgur.com/YmUApGb.png)
    - actually here we weren't leading
- forgetting
  - new architectures
    - ![](https://i.imgur.com/a7x9ySU.png)
    - doesn't help w/ the hard tasks
- data regimes comparisons
  - compute 2.0
  - compute 0.5
  - **Warning - preliminary, only 4 seeds**:
    - RNNs + csp 0.5
      - ![](https://i.imgur.com/xZ5fVGi.png)
      - again, a tumble on the last task
    - RNNs + csp 0.0
      - ![](https://i.imgur.com/H8k47wE.png)
      - much better
        - 7.5% gain w/ `NM_RNN-RNN`
        - 11.5% for `NM_RNN`
    - MLPs + csp 0.5
      - ![](https://i.imgur.com/eld8vql.png)
      - same stumble as the RNNs
    - MLPs + csp 0.0
      - ![](https://i.imgur.com/7jEjGHJ.png)
      - much better (again)
        - 7% gain for `NM_ID-ID`
  - key takeaways
    - too much compute is like not enough!
    - maybe we want a strategy that keeps sampling but stops training after a threshold?
- last task analysis
  - motivation:
    - methods take a stumble on the last task...
  - let's try single-task learning bc it seems like `NM_RNN` can outperform its "upper bound" `RNN-ID`
    - ![](https://i.imgur.com/H8FGRcz.png)
    - looking GOOOOOOD for `NM_RNN-RNN`
      - maybe our method transcends Continual Learning??
    - Also: why is `TrainFromScratch` suddenly learning??
      - hypothesis: bc it's (still) burning data

#### some discussion pts
- still unclear what the sparsity regularization should be in the RNN
  - also, what do we monitor
- proper forgetting analysis:
  - on the replay sampling tradeoff (the csp sampling rule is sketched after this entry):
    - oversampling the current task normally increases current-task performance, but! uniform sampling can improve its performance on past tasks
      - can also be observed when comparing `final_performance` w/ `cumulative_performance` or `avg_current_performance`
    - maybe it's not so important to fill the buffer with high-quality trajectories?
    - we are probably in a large compute regime
- a hard task can be disastrous for the policy, especially for `csp=0.5`
- seems like we are getting some HRL gains :o
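A minimal sketch of how the `csp` knob used throughout these experiments could work (the per-task buffer layout is an assumption): with probability `csp` a transition comes from the current task's buffer, otherwise it is drawn uniformly from all stored transitions, so `csp=0.0` recovers plain uniform experience replay.

```python
import random

def sample_batch(task_buffers, current_task, batch_size, csp=0.2):
    """task_buffers: dict task_id -> list of transitions (hypothetical layout)."""
    all_transitions = [tr for buf in task_buffers.values() for tr in buf]
    batch = []
    for _ in range(batch_size):
        if random.random() < csp:
            batch.append(random.choice(task_buffers[current_task]))
        else:
            batch.append(random.choice(all_transitions))
    return batch
```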
## 23/11

#### key takeaways
- on the CW underperformance situation
  - it's a go for multi-head
    - also, code the "choose most confident head" strategy
      - use `log_std`
  - look into their reward scaling
  - check the sequoia results on CW
  - look into their other [wrapper](https://github.com/awarelab/continual_world/blob/dbac45ef95cacf8f6632aa437794f9e914ab9cf3/continualworld/utils/wrappers.py#L89)
- on the regularization:
  - add `Hoyer` and try (a sketch of the regularizer follows this entry)
  - try to come up w/ a sparsity-inducing trick that doesn't require an extra hparam
- do some experiments where compute is changed
  - for the compute-constrained regime, maybe still keep the first task at 1M.

#### discussion pts
- the overarching question is where we should invest time and energy
  - iterating on our method
    - inducing sparsity
      - could be achieved through meta-continual learning (maybe)
    - modularity tricks
      - e.g. alternating optimization, custom learning rates, etc.
  - getting better general results in CW
    - check rewards, check sequoia results
  - adding new (COMPS) envs
  - getting baselines
    - probably the CW baselines rerun on the new Meta-World v2
  - writing
- maybe we need some tricks from Continual-World
  - like scaling the rewards
  - multi-head support
  - restarting the optimizer?
  - but essentially, `FineTuning` and `TrainingFromScratch` (referred to as `Reference`) should learn each task individually
    - this is not the case for us.
  - maybe fine-tuning w/ a new head every time would work well?
- maybe the baselines from CW are good enough now?
  - we have the results already, but I could rerun them w/ our exact `batch_size` and `learning rate`.
- sparsity regularization problem
  - doesn't make sense to apply regularization on the mask only
    - we should apply regularization to the activations directly
  - also maybe `Hoyer` > `l1`:
    - ![](https://i.imgur.com/5oYTwCB.png =300x)
  - what to monitor for our RNN method?
  - or maybe there's smth better?
  - should the regularization be different in the RNN case?
- hard routing to increase parameter count and keep compute fixed
  - (out of scope probably)

#### results
- non-RNN:
  - ![](https://i.imgur.com/d9XracU.png)
  - the good: we have a proper benchmark
  - the bad: our routing method doesn't outperform its relevant baseline
  - compared to Continual-World, we are really underperforming. We have a hard time learning all tasks:
    - ![](https://i.imgur.com/Mh8PeRl.png)
- RNNs
  - 5 seeds + **moving avg of 4 timesteps**:
    - ![](https://i.imgur.com/9MnVove.png)
  - first of all, it looks like the RNN baseline (task-agnostic) will outperform the task-aware non-RNN baselines (cool!)
  - encouraging to see that our task-agnostic RNN (NM_RNN) can be > the task-aware RNN baseline (RNN-ID)
  - looks like we are better than RNN in terms of data efficiency
    - ![](https://i.imgur.com/LklqgdS.png)
  - ofc results might not seem as good when the benchmark is finished and/or we have 3 more seeds
- other takeaways
  - sigmoid >> lrelu
  - the current sparsity regularization is flawed and leads to instability
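For reference, one common form of the Hoyer penalty is the squared ratio of the l1 to the l2 norm (the "Hoyer-Square" variant); a minimal sketch applied to activations rather than to the mask, as argued above:

```python
import torch

def hoyer_square(activations, eps=1e-8):
    """Hoyer-Square sparsity penalty on a batch of activations [B, H]:
    (||a||_1 / ||a||_2)^2 averaged over the batch. Unlike plain l1, it favours
    a few large entries without shrinking them toward zero."""
    l1 = activations.abs().sum(dim=1)
    l2 = activations.pow(2).sum(dim=1).sqrt()
    return ((l1 / (l2 + eps)) ** 2).mean()

# usage sketch: loss = sac_loss + reg_coef * hoyer_square(hidden_activations)
```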
## 16/11

#### key takeaways
- need to add support for an lr specific to module params vs routing params (Jonas)

#### (some) results on CW10
- NOTES:
  - before that set of experiments, nothing except the RNN was working.
  - the batch_size was 100 and the lr was 0.001
  - for this new set of experiments, batch_size is now 4096 and I'm testing a new lr (0.003)
  - still testing out 2 sampling strategies:
    - uniform sampling (csp=0.0)
    - oversample current task (csp=0.5)
- **Uniform sampling + lr=0.001**
  - ![](https://i.imgur.com/e1ZRAG2.png =500x)
  - ![](https://i.imgur.com/jkYIWMp.png =500x)
  - ![](https://i.imgur.com/DzoJQXD.png =500x)
  - takeaway: seems like learning happens but 100% CF
- **oversampling current + lr=0.001**
  - ![](https://i.imgur.com/mPS6BsC.png =500x)
  - ![](https://i.imgur.com/hJntxSD.png =500x)
  - ![](https://i.imgur.com/Flh2tSC.png =500x)
  - takeaway: same as above, oversampling the current task doesn't help
- **Uniform sampling + lr=0.0003**
  - ![](https://i.imgur.com/nzHeOSW.png =500x)
  - ![](https://i.imgur.com/RRNXCz3.png =500x)
  - ![](https://i.imgur.com/gq2ZAcT.png =500x)
  - takeaway: 0.0003 is the key!
- **oversampling current + lr=0.0003**
  - ![](https://i.imgur.com/QnD9nUW.png =500x)
  - ![](https://i.imgur.com/NxoZM8w.png =500x)
  - ![](https://i.imgur.com/ys2Dzop.png =500x)

#### code speed
- the bottleneck is now updating the model:
  - ![](https://i.imgur.com/B1l7WsF.png)
- we might get a quick win by having a pre-allocated tensor for the batch, instead of creating one at each update (a vectorized-sampling sketch follows this entry)
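A minimal sketch of a replay buffer backed by pre-allocated arrays (layout and names are assumptions): sampling becomes a single fancy-index per field instead of a Python loop over transitions, which is the kind of fix the `buffer.sample()` slowness calls for.

```python
import numpy as np

class ArrayReplayBuffer:
    """Pre-allocated storage; sample() is one vectorized index per field."""
    def __init__(self, capacity, obs_dim, act_dim):
        self.obs = np.zeros((capacity, obs_dim), dtype=np.float32)
        self.act = np.zeros((capacity, act_dim), dtype=np.float32)
        self.rew = np.zeros(capacity, dtype=np.float32)
        self.next_obs = np.zeros((capacity, obs_dim), dtype=np.float32)
        self.done = np.zeros(capacity, dtype=np.float32)
        self.capacity, self.ptr, self.size = capacity, 0, 0

    def add(self, o, a, r, o2, d):
        i = self.ptr
        self.obs[i], self.act[i], self.rew[i] = o, a, r
        self.next_obs[i], self.done[i] = o2, d
        self.ptr = (self.ptr + 1) % self.capacity
        self.size = min(self.size + 1, self.capacity)

    def sample(self, batch_size):
        idx = np.random.randint(0, self.size, size=batch_size)
        return (self.obs[idx], self.act[idx], self.rew[idx],
                self.next_obs[idx], self.done[idx])
```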
## 08/11

#### key takeaways
- on code performance
  - repair `buffer.sample()` asap
  - check if the code is as slow when running a single job on the cluster
  - maybe try the C5 CPU clusters
  - figure out why meta-world practitioners use >1 CPU
- on benchmarks
  - Rasool thinks our algos are too stupid to solve ant direction or goal with a single model, INCLUDING W/ THE RNN
  - get the tasks from the PearlDataloader in the MQL codebase
    - or try to find the COMPS code
  - check the COMPS Appendix code and try to figure out their setup and whether their SAC, i.e. PEARL, does Experience Replay
- also:
  - can't compare w/ CW right now cause they have multi-head support
    - also remember they are v1 (not v2)
  - would be nice to have a super-toy example to showcase the behavior of our method, e.g. like the cartoon in SupSup (fig 1)

#### discussion pts
- summary
  - "some" encouraging results in the mujoco-openai benchmarks (that matters)
  - code is really slow, which makes it hard to get meta-world results
  - discuss neuron masking vs MoE
  - discuss potential new benchmarks
- results:
  - **CW10**:
    - extremely slow, but at least the algos are learning (well, not in task 2...)
    - ![](https://i.imgur.com/rjkdYMK.png)
  - **Hopper-bodysize v2** (most interesting setting IMO)
    - when oversampling the current task
      - ![](https://i.imgur.com/qCI3wP0.png)
      - ![](https://i.imgur.com/fLcgDdk.png)
    - adding the baselines with uniform sampling
      - ![](https://i.imgur.com/0ZKjHYT.png)
      - ![](https://i.imgur.com/eGfiZrx.png)
    - sparsity regularization hurts performance or was too high
      - ![](https://i.imgur.com/nwz1ILp.png)
  - **Hopper-bodysize v0**
    - uniform sampling (makes more sense in this benchmark)
      - ![](https://i.imgur.com/eCyMFJC.png)
      - ![](https://i.imgur.com/PaECR1Z.png)
      - module analysis of a "working" solution
        - actor layer 0
          - ![](https://i.imgur.com/aKxSG5a.png)
        - actor layer 1
          - ![](https://i.imgur.com/TjXoLEG.png)
        - critic layer 0
          - ![](https://i.imgur.com/LNy0gn3.png)
        - critic layer 1
          - ![](https://i.imgur.com/FXmyeWZ.png)
        - observation:
          - the actor is more sparse, but the critic has more diversity
    - oversampling the current task
      - ![](https://i.imgur.com/OOsGeLu.png)
      - ![](https://i.imgur.com/pjUv89E.png)
      - somehow, this hurt Routing...
  - **HalfCheetah bodysize v0**
    - uniform sampling
      - ![](https://i.imgur.com/H53Whwm.png)
      - ![](https://i.imgur.com/6JVdpWB.png)
      - Routing is really bad now, but all tasks are pretty much the same,
        - e.g. look at the performance on the last task throughout training:
        - ![](https://i.imgur.com/e5FG0lA.png)
    - oversampling current task (more of the same)
      - ![](https://i.imgur.com/KQBjRbI.png)
      - ![](https://i.imgur.com/YWYQnct.png)
      - for what it's worth, here the sparsity reg shows some sign of relevance
        - ![](https://i.imgur.com/Dy6IUg4.png)
  - **Hopper gravity v0**
    - not enough runs to display
  - **Walker2d gravity v0**
    - not far enough yet
- final thoughts
  - lrelu > sigmoid > relu
    - so maybe lrelu really gives the best of both worlds, i.e. non-vanishing gradients (relu) and no dead neurons (sigmoid)
- code slowness:
  - for reference, Continual-World reports 0.018s for a full timestep loop
  - sampling data from the env
    - meta-world isn't really slower than mujoco-openai (here halfcheetah)
      - keep in mind here, meta-world episodes are 5x shorter in timesteps than mujoco-openai
    - sampling of one episode: ![](https://i.imgur.com/3himcfi.png)
  - buffer sampling
    - the for loop in `buffer.sample()` really slows down the code
      - ![](https://i.imgur.com/5C6YNpR.png)
      - compared to updating
        - ![](https://i.imgur.com/UE1iG07.png)
    - for the RNN however it's SUPPPERRRRR SLOW
      - ![](https://i.imgur.com/25gfJsi.png)
  - updating
    - for a batch_size of 100, CPU ~= GPU
      - ![](https://i.imgur.com/ARZnZNe.png)
      - **BUT, on my laptop it's 0.008**
  - data transfer
    - pretty insignificant
    - ![](https://i.imgur.com/yl0DWiD.png)
  - also:
    - we shouldn't use the GPU when sampling data:
    - ![](https://i.imgur.com/SG8yftg.png)
  - conclusion:
    - fix the buffer
    - find out why EC2 is so slow
    - add support for multi-processing
- new potential benchmarks
  - Ant + {goal, direction}
    - I think all tasks are the same...
  - Half-Cheetah + velocity
    - now we are talking

## 02/11

#### discussion pts
- code is really slow in meta-world
  - how come we are so slow compared to CW10?
    - differences are CPUs
    - v1 vs v2
  - might want to look into vectorization + getting more CPUs
- bc there's no forgetting in the current openai-mujocos, maybe we could change the reward function?

#### key takeaways
- the problem of no forgetting in mujoco-openai could be addressed by the benchmarks in CoMPS, which actually change the task (reward function) instead of the env (transition function)
- monitor different parts of the code to understand why it's slow
  - probably the for loop in `buffer.sample()`

## 28/10

#### key takeaways
- try the meta-world v2 hparams to help the learning in CW
  - we'll try a bigger batch size and have to monitor the instances to find the proper `runs_per_gpus`
- on the HRL idea
  - transformers might take too long to run on long sequences
    - maybe we can develop the idea w/ RNNs
    - however, the nice feature of the transformer is that the gradient path between any two timesteps is O(1), compared to O(T) for RNNs

## 26/10

#### thoughts
- gonna need support for custom learning rates per type of params (Jonas's trick; a param-group sketch follows this entry)
  - the lr for modules needs to be much bigger!
- maybe a good time to do coordinate descent
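PyTorch optimizers already support this through parameter groups; a minimal sketch that splits "routing" parameters from the rest by a hypothetical naming convention and gives them their own (larger) learning rate:

```python
import torch

def build_optimizer(model, module_lr=3e-4, routing_lr=3e-3):
    """Separate learning rates for module weights vs routing/mask params.
    The name-based split ('routing' in the parameter name) is an assumption."""
    routing_params = [p for n, p in model.named_parameters() if "routing" in n]
    module_params = [p for n, p in model.named_parameters() if "routing" not in n]
    return torch.optim.Adam([
        {"params": module_params, "lr": module_lr},
        {"params": routing_params, "lr": routing_lr},
    ])
```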
## 20/10 Alex

#### Key takeaways
- the goal is to achieve (multi-level) hierarchical temporal abstractions for RL, but in an **implicit** way
  - (as opposed to options, where the hierarchy is explicit)
  - think of hierarchical + latent + sequential variables
- we should take some inspiration from
  - U-net:
    - through *contraction/expansion* layers, U-net captures global context as well as local information
      - particularly useful for segmentation
    - we could re-use the same idea to capture different *resolutions* of the MDP and enable some HRL
    - should probably use the Transformer version
      - Transformers are good at figuring out where things start and end.
  - Neural Interpreter (Dynamic Inference w/ Neural Interpreter)
    - I have no idea what this is, but Alex has in mind a *reverse* Neural Interpreter
      - I guess I'll understand after reading the paper
    - should chat w/ Francesco
- keep in mind:
  - continuity between the levels should be built-in
  - might want to look into Transformers applied to time series
- for the current project:
  - could probably swap out the RNN for a transformer

## 19/10

#### key takeaways
- should add support for transfer matrix monitoring
- perfect Memory doesn't work really well in CW
  - Rasool thinks it's because of uniform sampling
    - should make sure that's what they do
  - will be interesting to see if oversampling the current task helps
- get to the RNN sooner rather than later
  - we want to build off smth that always works, and work in the task-agnostic setting
    - where we can treat `TaskID` as an upper bound

#### results
- HalfCheetah-Bodysize v1
  - ![](https://i.imgur.com/44ELbsb.png)
- HalfCheetah-Bodysize v2
  - ![](https://i.imgur.com/FE3jYdA.png)
- Hopper-Gravity v1
  - instance crashed for ~half the runs
- Hopper-Gravity v2 (8 seeds!)
  - ![](https://i.imgur.com/bZGR4lL.png)
- oversample current task or not
  - ![](https://i.imgur.com/PzfYh0Y.png)
  - gains seem to be correlated w/ the difficulty of the benchmark
  - bonus point for having better `avg_current_accuracy`
    - which is great for the deployment of CRL methods

## 15/10

#### key takeaways
- we might need a modularity baseline
  - soft-module might do it
- soft-modularization (a rough MoE-style routing sketch follows this entry):
  - differences:
    - module-specific attention
      - I think we need this, or else the modules' input is shifting throughout the tasks
    - softmax for the routing
      - this is a classic in the routing literature.
      - maybe we want it bc it introduces competition between the modules
        - but maybe it's too restrictive
    - hidden representations are weighted averages of the modules' outputs
      - the modules **are not** operating independently on their respective latent variables
        - not sure how I feel about this one
      - this might be why this kind of routing algo is often called a Mixture of Experts
    - the high-level controller changes the mask at every timestep
      - for us, this would defeat the purpose of using modularity for sparse updates...
      - but! there might be a nice balance where our RNN doesn't change the mask too often and we get an option-like behavior!
  - Rasool: hard routing works pretty badly in the Soft-Module paper
    - I think it's because they didn't align the training algo w/ that
  - Rasool: the improvement from soft-module could be 70% bc of increased capacity...
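Not the Soft-Module paper's exact architecture, just a minimal sketch of the MoE-style routing described above: each hidden representation is a softmax-weighted average of several module outputs, with routing weights produced from a context vector (task embedding or RNN state); names and sizes are assumptions.

```python
import torch
import torch.nn as nn

class SoftRoutedLayer(nn.Module):
    """Hidden repr = softmax-weighted average of K module outputs; the routing
    weights are predicted from a context vector."""
    def __init__(self, in_dim, out_dim, ctx_dim, n_modules=4):
        super().__init__()
        self.experts = nn.ModuleList(
            [nn.Linear(in_dim, out_dim) for _ in range(n_modules)])
        self.router = nn.Linear(ctx_dim, n_modules)

    def forward(self, x, ctx):
        weights = torch.softmax(self.router(ctx), dim=-1)          # [B, K]
        outs = torch.stack([torch.relu(m(x)) for m in self.experts], dim=1)  # [B, K, out]
        return (weights.unsqueeze(-1) * outs).sum(dim=1)           # [B, out]
```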
## 13/10

#### key takeaways
- separating the modules into `used and frozen` and `used and needs updates` is prob a good idea
- we have a problem w/ the results, variance is too high!
  - should increase the number of evals, see if that helps
- a module has to work w/ a shifting input (depending on the masked activation of the previous layer)
  - maybe this is bad, maybe we want an attention-like mechanism such that the module chooses which previous module to listen to
- Jonas: no reason to think that sparsity will emerge in our current `RoutingSigmoid` formulation, bc there isn't any competition between the modules and module routings are myopic (w.r.t. other tasks)
  - but maybe, once the model converges to an optimal single model doing all tasks, then the only way it can improve is if the modules start being turned off...
  - still, we should have smth to induce sparsity

#### results
- new constrained benchmarks:
  - ![](https://i.imgur.com/w7P2wO7.png)
  - 200 episodes
    - 75k updates
      - ![](https://i.imgur.com/QM4glzy.png)
    - 20k updates
      - ![](https://i.imgur.com/NzeGdke.png)
  - 100 episodes
    - 75k updates
      - ![](https://i.imgur.com/bXrfRI2.png)
    - 20k updates
      - ![](https://i.imgur.com/kmk6FG2.png)
- some takeaways
  - `TaskID` struggles in the low-data regime.
  - the 3 `RoutingSigmoid` configs are in the top 10 in the last 3 benchmarks (there is a total of 6*3=18 method configs)
  - `RoutingSigmoid` seems to work better w/ oversampling the current task.
    - could be noise however:
    - ![](https://i.imgur.com/IGn8n1T.png)

## 08/10

#### key takeaways
- start testing more envs
- do more num_evals (from 10 to 25)
- do 10 seeds per hparam config
  - 4 runs per GPU
- prioritized experience replay could be a good baseline...
  - but it introduces 2 hparams
- DART is upside-down for our stuff (glad Jonas also saw that)
- should group the activations together to get bigger modules

#### ways we can solve the codependent variable problem (architecture vs weights)
- EM
  - e.g. in the seminal [Hierarchical MoE](https://www.cs.toronto.edu/~hinton/absps/hme.pdf) paper
  - (a simple alternating-update sketch follows this list)
- DART
  - also, MAML, Reptile and other gradient-based bilevel optimization schemes
- probabilistic latent variable model?
- attention
- for discrete architectures, could look into RoutingNetworks (uses RL or stuff like RELAX)
- TODO: find in the literature what's the tradeoff between solving this and not solving it
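The cheapest of the options above is plain coordinate-descent-style alternation: update the module weights with the routing fixed on even steps, and the routing/mask parameters with the modules fixed on odd steps (this also matches the alternation in the 28/09 soft-modularization algo further down). A minimal sketch, assuming the two optimizers each hold only their own parameter group and `loss_fn` is a hypothetical actor/critic loss:

```python
def alternating_update(step, loss_fn, batch, module_opt, routing_opt):
    """Alternate which parameter group gets updated while the other stays fixed."""
    opt = module_opt if step % 2 == 0 else routing_opt
    opt.zero_grad()
    loss = loss_fn(batch)       # hypothetical: returns a scalar torch loss
    loss.backward()
    opt.step()
    return loss.item()
```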
## 07/10

#### Key takeaways
- Rasool: should use 10 seeds when we want to get definitive results
- tasks do interfere with each other
  - but maybe that's just in the `ExpReplay` baseline, which is trying to solve all tasks w/ the same function. Maybe this doesn't happen in `ExpReplayTaskID`?
    - it still does. look at the evolution of task 6
    - ![](https://i.imgur.com/Z9dLZtT.png)
  - it's important not to confuse forgetting w/ interference
- main takeaway:
  - Jonas: unclear if `Routing` can actually outperform `TaskID`, at least in a Perfect Replay setting.
    - this is still unclear to me. e.g. from the figure above, a frozen subnetwork wouldn't have suffered from the interference
    - in `Routing`, because the masks are continually trained, the parameters aren't necessarily frozen. Fixing the masks should be better at reducing forgetting.
      - I still think it's fine if the masks are changing, cause we are still achieving an orthogonalization of gradients (through sparsity).
    - Taesup: however, freezing the masks could work, the algo is just gonna try to work w/ these frozen masks.
  - **TODO** find a setting (either by restricting memory or compute) where `ExpReplay` can't be as good and `Routing` can shine.
- `RoutingRNN` idea
  - the RNN seems to help consistently. Unclear if it's because it's 1) providing a good task embedding or bc it's 2) providing a good summary of the history of the episode to the actor/critic.
  - if it's 2), then the `RoutingRNN` will produce dynamic masks, which won't help prevent forgetting/interference.
    - if we want a static mask, we can use a running average of the RNN's hidden states (a small sketch follows this entry)
    - however, if it's bc of 2) (and it probably is), then averaging the hidden states will remove the performance gain
- Rasool: memory is the bottleneck, not compute
  - :mindblown:, still can't wrap my head around it. I need to think about this more

#### Results
- legend:
  - ExpReplay = Experience Replay
  - TaskID = add task ID to observation space
  - RNN = add rnn's output to observation space
  - Routing = modulating the hidden activations of the actor and critic
    - Sigmoid = modulate w/ sigmoid
    - ReLU = modulate w/ relu
    - None = (ablation) no activations
- evolution of the global episodic reward (5 seeds)
  - the runs will end at 500*20=10k episodes
  - ![](https://i.imgur.com/Hy6DfWN.png)
  - TaskID has one really bad run, hence the big drop
  - `ExpReplayRNN` seems to be the method to beat (given TaskID's failed run).
  - looks like `RoutingSigmoid` can outperform `ExpReplay`, which is encouraging.
    - if that happens, it will probably not be significant however
    - we can hope that there is a correct way of mixing `Routing` and `RNN` (given the last two results), which is our ultimate goal
  - `RoutingReLU` is not so bad given that it doesn't have a way to revive dead neurons
    - hopefully we can make it as good as or better than sigmoid
  - **UPDATE**
    - ![](https://i.imgur.com/SgWqLpW.png)
    - one of the `RoutingSigmoid` runs crashed pretty hard :(
      - so it didn't beat the `ExpReplay` baseline
      - maybe 5 seeds is too few to draw decisive conclusions?
    - normally `TaskID` outperforms everything, but one crashed run completely made it look pretty bad
- controlling the old-data sampling vs current-task sampling
  - in this experiment, the `.XX` in the legend means that I sample `XX%` of the data from the current task and the remainder from the previous tasks.
    - setting it to `.00` collapses to uniform sampling (the normal behavior)
  - somehow, it only seems to help the worst method, i.e. `TaskEmb`
  - ![](https://i.imgur.com/SaSDUB4.png)
  - if you are wondering why the runs with controlled sampling are slower, it's because I didn't code the sampling efficiently
    - (I know how to make it as fast as before, if we want to keep that feature around)
- an observation
  - I understand that I talk a lot about modularity to motivate the `Routing` methods, and that the current benchmark might not really be modular (at least not in a Hierarchical RL way)
  - however, even though the performance of the different tasks seems to be correlated on average, there are still some tasks that interfere with each other, as seen in the next figure (I know it's messy, feel free to skip)
  - ![](https://i.imgur.com/UTANGi0.png)
    - (still averaged over 5 seeds btw)
  - in this figure, we see the evolution of the performance of 6 different tasks. On the x-axis, you see the task that was just completed.
    - learning the last task (19) increases the performance on task 19 a lot, as highlighted in the top-middle figure.
    - however, the performance on tasks 15 (top left) and 11 (top right) takes a hit.
  - this is essentially the gap we could fill with a properly working routing net
    - if a subnetwork can achieve good performance on a task at some point, keeping it untouched during future learning will solve the interference problem
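A tiny sketch of the running-average idea above (names and the choice of a sigmoid gate are assumptions): keep an exponential moving average of the RNN's hidden state and use the averaged state, rather than the instantaneous one, to build a (quasi-)static mask.

```python
import torch

class AveragedMask:
    """EMA of an RNN hidden state, turned into a (0,1) mask over activations."""
    def __init__(self, hidden_dim, momentum=0.99):
        self.avg = torch.zeros(hidden_dim)
        self.momentum = momentum

    def update(self, hidden):                   # hidden: [hidden_dim]
        self.avg = self.momentum * self.avg + (1 - self.momentum) * hidden.detach()
        return torch.sigmoid(self.avg)          # mask used to gate the policy's hidden units
```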
## 28/09

#### discussion pts
- ICLR, and how you could make it up to you
- would be quite easy to add an MBRL loss at this point
  - might help the RNN, which will help us w/ the soft modules
- should we monitor smth about the masks, or the learned task representations, or the RNN's hidden state?
- should we learn the soft modules simultaneously w/ the weights? Or instead use an EM-style approach?
  - ok, look in the literature (to not reinvent the wheel)
- making the buffer big enough to not discard any data; that completely skews how much compute is allocated per task

#### literature on modularity
- [Multi-Task Reinforcement Learning with Soft Modularization](https://arxiv.org/abs/2003.13661)
  - most relevant.
  - a high-level controller modulates a low-level controller w/ a soft mask
- [Are Neural Nets Modular? Inspecting Functional Modularity Through Differentiable Weight Masks](https://arxiv.org/abs/2010.02066)
  - never read. looks really relevant
- [Routing Networks and the Challenges of Modular and Compositional Computation](https://arxiv.org/abs/1904.12774)
  - never read

#### new results
- TaskID
  - seems like taskID could be helpful in the `standard` regime when the buffer size is appropriate for the benchmark
    - ![](https://i.imgur.com/Nped5Sh.png)
  - remember that it didn't help in the `low ressource` one:
    - ![](https://i.imgur.com/zQa7rbY.png)
- RNN
  - RNN could improve performance in the `low ressource` regime
    - ![](https://i.imgur.com/90v5dcq.png)
  - not as much in `standard`
    - ![](https://i.imgur.com/BWMcHf6.png)
- TrainFromScratch
  - adding much more compute could increase the performance, but it's really slow.
  - in the next figures, the x-axis is the number of samples (to keep things aligned)
    - in the `standard` regime
      - ![](https://i.imgur.com/Ai7okju.png)
    - in the `low ressource` regime
      - ![](https://i.imgur.com/wl34cTf.png)
  - Hypothesis: the other methods are enjoying an order of magnitude more data, so that might be why this baseline is still not an upper bound

#### soft-modularization algo
- using task ID
- starting from vanilla SAC: ![](https://i.imgur.com/O1mHibf.png)
- in line 1:
  - let $\pi^{(t)}$, $Q_1^{(t)}$ and $Q_2^{(t)}$ be the actor and critics for task $t$
  - $\pi^{(t)}$ is parametrized by $\theta \odot \sigma (M_\pi^{(t)})$ and the $Q^{(t)}$ are parametrized by $\phi \odot \sigma(M_Q^{(t)})$
    - $\odot$ is the element-wise product
    - $\sigma$ is either the sigmoid function or a relu
  - to achieve modularity, one mask per row (of the neural net's layers formed by $\theta$) can be applied, which is equivalent to masking the hidden activations (a minimal sketch follows this entry)
- in line 13:
  - if $j$ is odd:
    - update $\phi_i$ with gradient descent
  - else:
    - update both $M_{Q_i}^{(t)}$ with gradient descent
- in line 14:
  - if $j$ is odd:
    - update $\theta$ with gradient descent
  - else:
    - update $M_\pi^{(t)}$ with gradient descent
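A minimal sketch of the per-task masking described above: one learnable mask vector per task and per hidden layer, squashed through a sigmoid and multiplied onto the hidden activations of a shared MLP (masking rows of $\theta$ is the same as gating hidden units). Layer sizes and names are made up.

```python
import torch
import torch.nn as nn

class MaskedMLP(nn.Module):
    """Shared MLP whose hidden activations are gated by a per-task sigmoid mask,
    i.e. theta ⊙ sigma(M^(t)) applied row-wise."""
    def __init__(self, in_dim, hidden_dim, out_dim, n_tasks):
        super().__init__()
        self.fc1 = nn.Linear(in_dim, hidden_dim)
        self.fc2 = nn.Linear(hidden_dim, hidden_dim)
        self.head = nn.Linear(hidden_dim, out_dim)
        # one learnable mask per task and per hidden layer
        self.masks = nn.Parameter(torch.zeros(n_tasks, 2, hidden_dim))

    def forward(self, x, task_id):
        m1 = torch.sigmoid(self.masks[task_id, 0])
        m2 = torch.sigmoid(self.masks[task_id, 1])
        h = torch.relu(self.fc1(x)) * m1
        h = torch.relu(self.fc2(h)) * m2
        return self.head(h)
```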
## 23/09

### Results
- ![](https://i.imgur.com/fxl1V3z.png)
- `standard` > `best` hparams
  - makes sense bc I had a bug that crashed my runs after ~2.5M updates and `standard` has 3M
  - for context, `best` is [32 32] and `standard` [256 256]
- `ER` > `FineTuning`
  - and as expected, fine-tuning achieves better performance on the current task (`final/avg_current_performance`)
- `TaskID` doesn't help:
  - some hypotheses as to why:
    - hyp1: the tasks are all too similar, i.e. the cost of learning a task representation is higher than the benefit of being task-aware
      - how to test: ?
    - hyp2: the neural nets are not deep enough to learn a latent task embedding
      - how to test: Taesup suggests learning an explicit task embedding, i.e. outside of the actor/critic.
        - this would actually level the playing field w/ the context RNN, as both actor/critic would now enjoy the same capacity
    - hyp3: naively using Experience Replay to leverage the taskID is not enough
      - or maybe the buffer isn't balanced enough
        - TODO look into
      - how to test:
        - add code for multi-task learning
          - could also be useful later on when we are debugging the method and we don't know if it doesn't work bc of non-stationarity or smth else.
        - add code for multi-head
          - that's a good baseline, but could make it harder to reach the same performance w/ the task-agnostic method...
  - extra: it seems to help ER-standard on the current task however
- ![](https://i.imgur.com/GnIX0pA.png =600x)
  - y = current performance; x = final performance; z = total runtime
  - you want to be top right and dark
  - no surprises --> more compute, better results
- ![](https://i.imgur.com/BMmL55B.png)
  - `training_from_scratch` learning curves. seems like it's underfitting

#### random questions
- should we reset the optimizer in `train_from_scratch`?
  - should we also reset it in other methods?
- should we have separate task buffers (similarly to MQL)?

### notes
- the `reward_threshold` in the LPG repo was totally useless
- nice [library](https://github.com/facebookresearch/fvcore/blob/main/docs/flop_count.md) for counting flops

## 7/09

### key takeaways
- ok for the forward transfer definition. Also ok for memory and compute (total runtime for now)
- no need to focus on the Workshop for now (we don't want to give out our idea)

#### To discuss
- maybe we need a better benchmark, probably still within mujoco-openai, bc the half-cheetah gravity tasks are all too similar.
- Metrics:
  - Forward transfer definition = ![](https://i.imgur.com/8IHUWF9.png =200x)
  - memory = `replay_buffer_size` + `total_params` in bytes
  - compute:
    - total runtime.
    - total number of flops $\approx$ `batch_size` * `total_updates` * `flops_per_update` (a small accounting sketch follows this entry)
      - `batch_size` might overly penalize efficient models
    - theoretical total number of FLOPS
- ~~short notice but, if we can replicate our findings on MetaWorld, we could consider submitting to [this NeurIPS lifelong robotics workshop](http://www.robot-learning.ml/2021/). Deadline Sep 24.~~

MIR AQM

Data analysis, all on the *HalfCheetah w/ changing gravities* benchmark
- linear SAC
  - ![](https://i.imgur.com/oy9y3aM.png)
  - this is the evolution of the performance on the current task (blue) and the performance on all tasks (orange).
  - on the benchmark side:
    - maybe the tasks are too similar for this benchmark to be a good evaluator of CL methods.
    - so maybe we really only use this one for debugging and I try another of the mujoco-openai benchmarks for more robust eval.
  - on the method side:
    - these 3 dips are pretty scary, but maybe we will only get those w/ linear actors?
- Linear actor (but didn't sample the same tasks/gravities as LPG_FTW)
  - ![](https://i.imgur.com/wEIQNbu.png)
  - super high variance!
  - maybe some particular gravities are harder than the others...
- bonus: MLP actor w/ twice as much data
  - ![](https://i.imgur.com/PHbNNJ2.png)
  - was not computing global performance yet, but the agent quickly hits the ceiling (3800)
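A back-of-the-envelope sketch of the flops ≈ `batch_size` * `total_updates` * `flops_per_update` accounting for an MLP (counting ~2·in·out multiply-adds per linear layer and treating the backward pass as roughly 2x the forward pass; the fvcore library linked above can replace the per-forward count with a measured one). The example sizes are made up.

```python
def mlp_forward_flops(layer_sizes):
    """~2 * in * out multiply-adds per linear layer, one example, forward only."""
    return sum(2 * a * b for a, b in zip(layer_sizes[:-1], layer_sizes[1:]))

def training_flops(layer_sizes, batch_size, total_updates, backward_factor=3.0):
    """flops ≈ batch_size * total_updates * flops_per_update."""
    per_example = backward_factor * mlp_forward_flops(layer_sizes)
    return batch_size * total_updates * per_example

# e.g. a [39, 256, 256, 4] actor, batch 4096, 1M updates:
print(f"{training_flops([39, 256, 256, 4], 4096, 1_000_000):.3e} FLOPs")
```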
## 31/08
- pre-meeting notes:
  - bringing Dushyant onboard
  - brax will only increase sampling speed... Maybe not so useful for off-policy RL bc the buffer is on CPU anyways...
  - should we use frame stacking?
  - how to find the most efficient batch_size?
  - should we sample more recent trajectories?
    - or in proportion to their cumulative rewards?
  - on the benchmark
    - can't use halfcheetah-V0...
      - can't pass the `reward_threshold` argument bc we aren't registering the env
      - can I simply overwrite it?
  - Hparams search
    - updates_per_task
    - sampling_ratio
    - batch_size
      - or fix it to the most efficient value
    - learning rate?
    - hidden_size?
    - replay_size?
      - or fix to the most efficient value
    - experience replay
  - on the speed-related issue:
    - I was simply spending more time sampling than updating compared to the previous code, which had a 1-to-1 timesteps-vs-updates ratio
    - smarter sampling strategies?
      - e.g. replay successful trajectories
      - or more recent ones

### SAC outperformance vs LPG-FTW investigation
- setting:
  - halfcheetah-v0 vs -v3
  - `reward_threshold` can't be passed w/o registering the gym env
    - hardcoded it
- method
  - we are using an MLP actor [64, 64] whereas they have a linear actor

### LPG-FTW setup
- architecture:
  - Mujoco = $\pi$: linear, $V$: MLP
  - MetaWorld = $\pi$: MLP, $V$: MLP
  - MLP = (64, 64)
- benchmarks:
  - Half-cheetah
    - gravity
    - nb_tasks = 20
    - max_episode_steps=1000
    - reward_threshold=3800.0
    - max_episodes = 50*10
- methods:
  - common across:
    - multiple independent $V$s per task
  - LPG-FTW and PG-ELLA
    - task-aware (w/ task ID) $\pi$
  - EWC
    - task-agnostic in MUJOCO (cheating!)
    - task-aware in MetaWorld (multi-head)
  - ER
    - not run in Mujoco
    - task-aware in MW (multi-head)
  - other stuff
    - PG-ELLA
      - has access to pretrained $\pi$s and $V$s

## 26/08

### key takeaways
- fix trajectories throughout experiments
- compute is less important (for now). toggle w/
- brax is good :)

### to address
- brax?

## 24/08

### to address
- [LEO](https://arxiv.org/pdf/1807.05960.pdf)?

## 20/08

### key takeaways
- LPG-FTW:
  - really bad repo because
    - they copy-pasted all the libraries they've used
      - Rasool: this is quite common in RL for reproducibility.
      - start the project, snapshot your libraries and never change them
    - it has a million files to launch all the experiments
      - maybe that's the price to pay because all methods and all envs have their particularities
    - uses MetaWorld-V1
      - we'll cross that bridge when we get there
    - also, no parallelism
- Rasool's RL repos
  - can be segmented in 3
    - `Runner`: code for the agent interacting w/ the env
    - `algos/`: code for learning algorithms
    - `main.py`: keep the same for all methods (to minimize bugs)
- Amzn cluster setup:
  - the data can't leave Amzn
    - can't use W&B, unless we do [self-hosting](https://docs.wandb.ai/guides/self-hosted)
  - Docker:
    - create one when you have a working setup, fix your conda env (local) and image (remote) for the rest of the project
    - start w/ Rasool's RL image
  - for the EC2 instances, [Isengard](https://isengard.amazon.com/console-access)
    - always use `p2.xlarge`
    - you can have a development one, called `massimo_dev`, that will run indefinitely
    - security = Rule1
  - use `aws cli` for the command line instead of the browser [install](https://docs.aws.amazon.com/cli/latest/userguide/install-cliv2-mac.html#cliv2-mac-install-cmd)

## 19/08

### key takeaways
- MBRL can be hard even in mujoco
  - Humanoid and Hopper have 100+ features and they can be hard to predict
- the separate RNNs (instead of a shared one) in MQL are for convenience / ease of code
- a CL baseline we might want to play w/ is to simply keep the current representations close to the previous ones.
  - should work correctly even if we only use the previous model (instead of all the previous ones)
  - like an EWC but on activations (a small distillation-style sketch follows this list)
- the MQL and SAC codebases are quite similar, so we should be able to copy-paste RNN-related stuff from MQL to SAC
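A minimal sketch of that baseline, assuming we only keep the previous task's frozen model: penalize the distance between the current network's hidden representations and the frozen copy's on replayed observations (`feature_fn` is a hypothetical hook returning hidden activations).

```python
import copy
import torch

def make_anchor(model):
    """Freeze a copy of the model at a task boundary."""
    anchor = copy.deepcopy(model)
    for p in anchor.parameters():
        p.requires_grad_(False)
    return anchor

def representation_drift_loss(model, anchor, obs, feature_fn):
    """L2 between current and previous-task features on replayed observations."""
    with torch.no_grad():
        old_feats = feature_fn(anchor, obs)
    new_feats = feature_fn(model, obs)
    return ((new_feats - old_feats) ** 2).mean()

# usage sketch: total_loss = sac_loss + reg_coef * representation_drift_loss(...)
```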
### current TODOs
- [X] understand SAC https://spinningup.openai.com/en/latest/algorithms/sac.html
- [X] look into context-variable stuff (from Rasool's slack msg)
- [ ] write down the potential algos and updates

### To address
- the parameter codependency problem
- MBRL loss on the contextual RNN
  - bc the state space in both settings (mujoco, meta-world) is simple and low-dimensional (and probably doesn't require a CNN?)
    - might be hard for hopper/humanoids
- codebase w/ SAC + soft-modularization code (published) [here](https://github.com/RchalYang/Soft-Module)
- in MQL: why have 3 separate RNNs instead of a shared one?
  - easier to code

### potential algos
- baseline: context SAC
  - take the SAC training algo and condition $\pi$, $Q_1$ and $Q_2$ on their respective $z = GRU(\tau)$
    - as was done in MQL
    - maybe try a shared RNN
- compositional SAC
  - use $z$ to mask the activations of $\pi$, $Q_1$ and $Q_2$
  - unclear if $z$ will converge to a fixed mask / module composition
    - or if $z$ enables dynamic soft-modularization as in [this paper](https://arxiv.org/abs/2003.13661)

## 17/08

### key takeaways
- task inference w/ a context variable
  - maybe we learn it w/ an RNN
  - quite similar to [last year's brainstorm](https://hackmd.io/8OAUSKUKQYa81D-cRaPokg)
  - links w/ PEARL and LEO, should look into
- plan
  - build off SAC (a good continuous off-policy base learner)
    - TODO(read up on SAC and implementations)
  - trace a Pareto efficiency curve and try to beat it w/ task inference
    - TODO(find a good SAC implementation that is easy to work w/)
  - simplest algo version (baseline, no modularity):
    - feed the RNN-produced latent variable to the base policy
  - better version (modular):
    - simultaneously run an RNN and a policy where the RNN's hidden state is the mask on the policy's activations
    - (both versions are sketched right after this entry)
  - keep LPG-FTW as a reference bc it's the only published paper doing CRL on meta-world
- problem we (still) have to solve:
  - we have codependency problems: the latent variable (thus the RNN's weights) depends on the base policy and vice versa.
  - usual solutions: second-order gradients, MAML, EM

### to address
- LPG-FTW builds off Natural Policy Gradient, an on-policy method.
  - should we instead find another paper to build off of?
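A minimal sketch of the two variants above (all module names, shapes and the GRU input layout are assumptions): a GRU summarizes the recent (obs, action, reward) history into $z$; the "context" variant concatenates $z$ to the observation, the "compositional" variant turns $z$ into a sigmoid mask over the policy's hidden activations.

```python
import torch
import torch.nn as nn

class ContextEncoder(nn.Module):
    """GRU over (obs, action, reward) transitions -> context z."""
    def __init__(self, obs_dim, act_dim, z_dim):
        super().__init__()
        self.gru = nn.GRU(obs_dim + act_dim + 1, z_dim, batch_first=True)

    def forward(self, history):                 # history: [B, T, obs+act+1]
        _, h = self.gru(history)
        return h.squeeze(0)                     # z: [B, z_dim]

class ContextPolicy(nn.Module):
    """'context' mode: concat z to obs; 'compositional' mode: z gates hidden units."""
    def __init__(self, obs_dim, act_dim, z_dim, hidden=256, mode="compositional"):
        super().__init__()
        self.mode = mode
        in_dim = obs_dim + (z_dim if mode == "context" else 0)
        self.fc1 = nn.Linear(in_dim, hidden)
        self.fc2 = nn.Linear(hidden, hidden)
        self.out = nn.Linear(hidden, act_dim)
        self.mask_proj = nn.Linear(z_dim, hidden)

    def forward(self, obs, z):
        x = torch.cat([obs, z], dim=-1) if self.mode == "context" else obs
        h = torch.relu(self.fc1(x))
        if self.mode == "compositional":
            h = h * torch.sigmoid(self.mask_proj(z))   # z as an activation mask
        h = torch.relu(self.fc2(h))
        return torch.tanh(self.out(h))
```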
### prep notes
- thoughts on SupSup:
  - I like the general idea but ALL experiments are on MNIST, which I know is toy enough to rely on the model's confidence for task inference.
    - IIRC this doesn't even scale to CIFAR...
    - however, maybe there is smth interesting in the superfluous-neuron trick combined w/ $G$ for outlier detection
  - I like the task-inference and task-boundary detection algos
    - I suspect however that confidence will NOT be magically well-calibrated in harder settings, esp. in RL.
    - the latter is quite similar to a Chinese restaurant process (which seems like a good solution for CL in general)
  - not a fan of never training the weights, my hunch is that it doesn't scale well / is inefficient.
    - I understand that it however solves the codependency between the modules' params and their combinations (Taesup's point)
  - RL version:
    - it's probably insufficient to only look at the entropy of predicted actions in RL.
      - at least you would need to look at the entropy over a couple of steps
    - maybe the algo would be:
      - take a gradient step on the supermask superposition weights $\alpha$ at every timestep and take the action using the newly weighted SupSup, till convergence
    - maybe it would be easier to rely on a model of the env (MBRL), i.e. retrieve the mask that is most confident about its prediction of the world.
      - or even better (because we can't rely on the model's confidence to be magically well-calibrated), retrieve the model that can correctly predict the environment
    - maybe you could simply maximise the cumulative rewards w.r.t. the superposition weights
  - we probably want to be smarter about transfer
    - when the algo has to solve a new task that is e.g. a variation of an old one, the algo should quickly retrieve the closest task's model and build off of it.
    - if we apply masks on the activations (to achieve modularity), the smaller search space should help us retrieve the appropriate model faster
      - at the cost of now probably needing to learn the parameters, because of the decreased learning capacity
  - solution to SupSup not working on OoD data / not being able to perform systematic generalization:
    - maybe we can use the superposed supermask as the policy.
    - there is a link to be made w/ LPG-FTW, i.e. it would be a MoE
- thoughts on LPG-FTW and modularity
  - in a weird way, modularity could emerge in LPG-FTW ($\theta^{(t)} = L s^{(t)}$)
    - if there were A LOT of experts and they each encoded a module, then $s^{(t)}$ could in theory be a module combination
  - there is probably a cool way to extend their method to do modularity
    - instead of $\theta^{(t)} = L s^{(t)}$
      - where $L \in \mathbb{R}^{D \times K}$ are the $K$ experts
      - and $s^{(t)} \in \mathbb{R}^{K}$ are the mixture coefficients
    - do $\theta^{(t)} = s^{(t)} \odot l$
      - where $s^{(t)}$ is now a per-parameter binary mask
      - $L$ collapses to $l$, a backbone neural net
    - for modularity we can group the params/masks:
      - $\theta^{(t)} = M b^{(t)} \odot l$
      - where $M \in \mathbb{R}^{D \times K} = \mathbb{1}_{d=k}$
      - and $b^{(t)} \in \mathbb{R}^{K}$ are module combinations
- Sequoia's potential TODOs:
  - helping w/ the empirical section's analysis
  - adding offline/batch RL support
    - MQL
  - eventual NeurIPS competition

## 12/08

### key takeaways
- importance of striving for task-agnostic methods
- because replay will play a part in CRL, we need to build off off-policy algorithms
- look into SupSup
  - and trailing references
- before diving into meta-RL, start understanding the CRL benchmarks and then start iterating on simpler solutions
