# [Pipeline of paper](https://arxiv.org/pdf/1912.01603.pdf)
- [name=Author: Jeff]
- [time=Sat, August 1, 2020 17:30]

## Outline
[TOC]

## Important keywords
1. **Continuous control**
2. **Robotic arm** (FetchReach-v1)
3. **Goal-based task**
    - Different from most of the scenarios in the D2C or PlaNet papers
    - Taking "Walker-walk" and "Ant" as examples: those two agents are rewarded from their own body state, whereas the Fetch-series agents are rewarded by the relation between the arm and the object.
4. **Model-based deep reinforcement learning**
    - Builds a model of the environment
5. **Learning behaviors in latent space**
    - image input via an autoencoder
6. **"Planning", also called "imagination"**
    - via an RNN, LSTM, or GRU

## Build environment
### Simulator - [MuJoCo](http://www.mujoco.org/)
1) Apply for a MuJoCo license and get an account number from the confirmation mail.
2) Download the suitable [version](https://www.roboti.us/index.html).
3) Run "getid" to register your PC. After successful registration, an activation key is locked to your PC.

:::warning
- Access to the MuJoCo library and license key (see the official documentation)
```
export MJLIB_PATH=/home/{user_name}/.mujoco/mujoco200_linux/bin/libmujoco200.so
export MJKEY_PATH=/home/{user_name}/.mujoco/mujoco200_linux/bin/mjkey.txt
```
- Access to the MuJoCo shared libraries
```
export LD_LIBRARY_PATH=$LD_LIBRARY_PATH:/home/jeff/.mujoco/mujoco200_linux/bin
```
:::

### [OpenAI Gym](https://github.com/openai/gym)
```
pip install gym
```
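As a quick sanity check that the license, the environment variables above, and Gym are wired together, here is a minimal sketch (not from the original note; it assumes `mujoco-py` and the Gym robotics environments are installed and uses the 2020-era Gym API with the 4-tuple `step`):

```python
# Minimal smoke test: create the goal-based Fetch environment used in this note.
import gym

env = gym.make("FetchReach-v1")       # fails here if MuJoCo or the license key is not found
obs = env.reset()                     # dict with 'observation', 'achieved_goal', 'desired_goal'
obs, reward, done, info = env.step(env.action_space.sample())

# Dreamer learns from pixels, so also make sure offscreen rendering works.
frame = env.render(mode="rgb_array")  # H x W x 3 uint8 image
print(frame.shape, reward)
env.close()
```

If `gym.make` fails while importing `mujoco_py`, re-check the `MJKEY_PATH` and `LD_LIBRARY_PATH` exports above.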
### Python requirement toolkit

## Training
Training uses [a PyTorch implementation of Dream to Control](https://github.com/yusukeurakami/dreamer-pytorch).

### Preprocessing
- Set the reward strategy
- Select the camera position in the simulator
- Set the actor (exploration) noise

### Main loop (until convergence)
- Initialize the hyperparameters
```
* Common adjustments
    - action_repeat : 1 (repeat each action this many times in the environment)
    - action_noise : 0.15 (noise added to actions during data collection so the agent learns from real interaction instead of only from "imagined" rollouts)
    - batch_size : 50 (number of sequence chunks sampled per training step)
    - chunk_size : 50 (length of each sequence chunk sliced from an episode)
    - model_lr : 1e-3 (world-model learning rate)
    - actor_lr : 8e-5 (actor model learning rate)
    - value_lr : 8e-5 (value model learning rate)
    - adam_epsilon : 1e-7 (Adam optimizer epsilon value)
    - grad_clip_norm : 100 (gradient-clipping norm, to avoid exploding gradients)
    - planning_horizon : 15 (how far ahead the RSSM imagines)
    - discount : 0.99 (discount factor used to compute returns)
    - free_nats : 3 (free nats applied to the normalization and overshooting (but not global) KL losses before averaging over the elements of the distribution)
    - bit_depth : 5 (5-bit depth means only 32 intensity levels per image channel)
    - seed_episodes : 5 (number of seed episodes to pre-collect)
* Other settings:
    - max_episode_length : 1000 (maximum episode length; in the Fetch tasks an episode is 50 steps long)
    - experience_size : 500000 (capacity of the experience replay buffer)
    - cnn_activation_function : relu
    - dense_activation_function : elu
    - embedding_size : 1024 (dimension of the encoder embedding fed into the RSSM)
    - hidden_size : 200 (width of the hidden linear layers)
    - belief_size : 200 (belief/hidden size)
    - state_size : 30 (state/latent size)
    - collect_interval : 100 (number of training updates between episodes of data collection)
    - overshooting_distance : 50 (latent overshooting distance / latent overshooting weight for t = 1)
    - overshooting_kl_beta : 0 (latent overshooting KL weight for t > 1 (0 to disable))
    - overshooting_reward_scale : 0 (latent overshooting reward prediction weight for t > 1 (0 to disable))
    - global_kl_beta : 0 (global KL weight (0 to disable))
    - disclam : 0.95 (λ used when computing the λ-return)
    - optimisation_iter : 10 (planning optimisation iterations)
    - candidates : 1000 (candidate samples per iteration)
    - top_candidates : 100 (number of top candidates to fit)
    - worldmodel_LogProbLoss : True by default (use a log-probability loss when training observation_model and reward_model)
```
- Initialize the models
- Collect the seed episodes
- Learn from the episodes (a rough sketch of this loop follows below)
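The three bullets above follow Algorithm 1 of the Dreamer paper. Below is a rough structural sketch, not an excerpt from dreamer-pytorch: `world_model`, `actor`, `value`, and `buffer` are hypothetical objects with placeholder interfaces, and the `preprocess_observation` helper only illustrates how a `bit_depth : 5` setting quantises image observations.

```python
import numpy as np


def preprocess_observation(frame, bit_depth=5):
    """Quantise an RGB frame to 2**bit_depth levels per channel and rescale to [-0.5, 0.5].

    With bit_depth = 5 only 32 levels per channel remain; a little uniform noise
    hides the quantisation grid from the world model.
    """
    frame = np.floor(frame / 2 ** (8 - bit_depth)) / 2 ** bit_depth - 0.5
    frame = frame + np.random.uniform(0.0, 1.0 / 2 ** bit_depth, size=frame.shape)
    return frame.astype(np.float32)


def train_dreamer(env, world_model, actor, value, buffer, *,
                  train_iterations=1000, seed_episodes=5, collect_interval=100,
                  batch_size=50, chunk_size=50, planning_horizon=15,
                  action_noise=0.15, max_episode_length=50):
    """Structural outline of Dreamer's main loop (hypothetical interfaces, not repo code)."""

    def collect_episode(policy=None):
        env.reset()
        obs, episode = preprocess_observation(env.render(mode="rgb_array")), []
        for _ in range(max_episode_length):
            if policy is None:                                   # random actions for seeding
                action = env.action_space.sample()
            else:                                                # current actor + exploration noise
                action = policy.act(obs)
                action = action + np.random.normal(0.0, action_noise, size=np.shape(action))
            _, reward, done, _ = env.step(action)
            next_obs = preprocess_observation(env.render(mode="rgb_array"))
            episode.append((obs, action, reward, done))
            obs = next_obs
            if done:
                break
        return episode

    # 1. Collect seed episodes with random actions to fill the replay buffer.
    for _ in range(seed_episodes):
        buffer.add(collect_episode())

    for _ in range(train_iterations):
        for _ in range(collect_interval):
            # 2. Dynamics learning: fit the world model on sampled sequence chunks.
            batch = buffer.sample(batch_size, chunk_size)
            posteriors = world_model.update(batch)               # reconstruction + reward + KL losses

            # 3. Behaviour learning: imagine `planning_horizon` steps ahead in latent
            #    space and update the value and actor models on the lambda-returns.
            imagined = world_model.imagine(posteriors, actor, horizon=planning_horizon)
            value.update(imagined)
            actor.update(imagined)

        # 4. Environment interaction: collect one more episode with the current actor.
        buffer.add(collect_episode(policy=actor))
```

The hyperparameters listed earlier map onto this loop: `seed_episodes` and `collect_interval` control how often new data is gathered, while `batch_size`, `chunk_size`, and `planning_horizon` shape the sequences used for model and behaviour learning.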
## Testing

## Result

## Related work
- Concepts similar to Dreamer
    - [Dyna, an Integrated Architecture for Learning, Planning, and Reacting](https://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.48.6005&rep=rep1&type=pdf)
    - [MuJoCo](https://homes.cs.washington.edu/~todorov/papers/TodorovIROS12.pdf)
- Surveys
    - [A Brief Survey of Deep Reinforcement Learning](https://arxiv.org/pdf/1708.05866.pdf)
    - [Reinforcement Learning in Robotics: A Survey](https://www.ias.informatik.tu-darmstadt.de/uploads/Publications/Kober_IJRR_2013.pdf)
    - [Survey of Robotic Manipulation Studies Intending Practical Applications in Real Environments: Object Recognition, Soft Robot Hand, Challenge Program and Benchmarking](https://www.researchgate.net/publication/326401265_Survey_of_Robotic_Manipulation_Studies_Intending_Practical_Applications_in_Real_Environments_-Object_Recognition_Soft_Robot_Hand_Challenge_Program_and_Benchmarking-)
- Robotics
    - [Reacher](https://gym.openai.com/envs/Reacher-v2/) (OpenAI Gym)
    - [Fetch Robotics](https://fetchrobotics.com/)
    - [Multi-Goal Reinforcement Learning: Challenging Robotics Environments and Request for Research](https://arxiv.org/pdf/1802.09464.pdf)
    - [Asymmetric Actor Critic for Image-Based Robot Learning](https://arxiv.org/pdf/1710.06542.pdf)
    - [Temporal Difference Models: Model-Free Deep RL for Model-Based Control](https://arxiv.org/pdf/1802.09081.pdf)
    - [Pick and Place Without Geometric Object Models](https://arxiv.org/pdf/1707.05615.pdf)
- Deep reinforcement learning
    1) Model-based DRL that learns from latents (image-based)
        - [PlaNet](https://arxiv.org/pdf/1811.04551.pdf)
        - [I2A](https://arxiv.org/pdf/1707.06203.pdf)
    2) Model-based DRL that learns with planning
        - [MuZero](https://arxiv.org/pdf/1911.08265.pdf)
    3) Other model-based DRL
        - [BREMEN](https://arxiv.org/pdf/2006.03647.pdf)
    4) About rewards (dense and sparse)
        - Sparse reward:
            1) [HER](https://arxiv.org/pdf/1707.01495.pdf)
            2) [SHER](https://arxiv.org/pdf/2002.02089.pdf)
            3) [PlanGAN](https://arxiv.org/pdf/2006.00900.pdf)
    5) [KL divergence in DRL](https://arxiv.org/pdf/1905.01240.pdf)
- Dynamics learning
    - [Improving PILCO with Bayesian Neural Network Dynamics Models](http://mlg.eng.cam.ac.uk/yarin/PDFs/DeepPILCO.pdf)
    - [Deep Reinforcement Learning in a Handful of Trials Using Probabilistic Dynamics Models](https://arxiv.org/pdf/1805.12114.pdf)
- Robotics with DRL
    - Image-based model-based deep reinforcement learning
        - [SOLAR](https://arxiv.org/pdf/1808.09105.pdf)
        - [Deep Visual Foresight for Planning Robot Motion](https://arxiv.org/pdf/1610.00696.pdf)
        - [Visual Foresight: Model-Based Deep Reinforcement Learning for Vision-Based Robotic Control](https://arxiv.org/pdf/1812.00568.pdf)
        - [E2C](https://arxiv.org/pdf/1506.07365.pdf)
        - [RCE](https://arxiv.org/pdf/1710.05373.pdf)
    - Other
        - [Model-Based Planning with Discrete and Continuous Actions](https://arxiv.org/pdf/1705.07177.pdf)
        - [Leveraging Deep Reinforcement Learning for Reaching Robotic Tasks](https://ieeexplore.ieee.org/stamp/stamp.jsp?tp=&arnumber=8014805)
        - [Neural Network Dynamics for Model-Based Deep Reinforcement Learning with Model-Free Fine-Tuning](https://arxiv.org/pdf/1708.02596.pdf)
        - [Deep Reinforcement Learning for Vision-Based Robotic Grasping: A Simulated Comparative Evaluation of Off-Policy Methods](https://arxiv.org/pdf/1802.10264.pdf)
        - [Review of Deep Reinforcement Learning for Robot Manipulation](https://arxiv.org/pdf/1610.00633.pdf)
        - [Keep Doing What Worked: Behavior Modelling Priors for Offline Reinforcement Learning](https://arxiv.org/pdf/2002.08396.pdf)