If you have any questions, please contact chenwenze21@gmail.com.

## Second Time

### HIQL: Offline Goal-Conditioned RL with Latent States as Actions
Levine, 2023/7/24, 7pts
https://arxiv.org/pdf/2307.11949v1.pdf
Abstract: directly learning a goal-reaching agent is hard. The paper proposes a hierarchical method: a higher-level agent learns to generate sub-goals (taking states as actions), plus a traditional low-level RL agent.
Intro:
* train a value function with IQL
* use that value function to train two networks
  * state (input) -> sub-goal (output)
  * sub-goal (input) -> action (output)
* the advantage of this setting is that the sub-goal lives in a latent space, so we can use unlabeled data to pre-train the policy.
* the hierarchical structure is also better at learning long-horizon goals.

### PASTA: Pretrained Action-State Transformer Agents
InstaDeep, 2023/7/22, 4pts
https://arxiv.org/pdf/2307.10936v1.pdf
Abstract: a universal pre-trained model for RL tasks.
Intro:
* Tokenizing trajectories at the component level is a better choice. Component level means each state is treated as a vector of parts, e.g., the positions of individual legs.
* Pre-training the same model on datasets from multiple domains improves its generalization ability.
* Evaluated on many downstream tasks.

### Real Robot Challenge 2022: Learning Dexterous Manipulation from Offline Data in the Real World
collaboration, 2023/8/16, 4pts
https://arxiv.org/pdf/2308.07741.pdf
Abstract: describes the rules of the competition, presents the methods used by the winning teams, and compares their results with a benchmark of state-of-the-art offline RL algorithms on the challenge datasets. (The competition allowed participants to experiment remotely with a real robot.)
Content:
* It is interesting that the first-place team used naive BC as their algorithm. When data is collected by an expert, BC tends to beat offline RL algorithms.
* They use a self-supervised method to train a discriminator that separates expert from near-optimal data.
* One assumption they make is that a trajectory with higher return is not necessarily collected by an expert.
* Of course data augmentation helps RL training in real-world settings. However, in practice things may not be that symmetric in the real world; pay attention to this when using data augmentation.

### RT-2: Vision-Language-Action Models
Google, 2023/7/30, 8pts
https://robotics-transformer2.github.io
* Abstract: incorporate a large-scale vision-language model pre-trained on internet data into the RL training loop.
* Intro:
  * train vision, language, and robot action together in an end-to-end way.
  * discretize the action space.
  * co-fine-tune: fine-tune actions by interacting with the environment and fine-tune the vision-language model on web data, which makes the model more general.
  * output constraint: when training on vision-language tasks there is no constraint, while when training on robot tasks a constraint is added to the action space.
  * can generalize to unseen views and objects.

### Reinforce Data, Multiply Impact: Improved Model Accuracy and Robustness with Dataset Reinforcement
Apple, 2023/8/19, 4pts
https://arxiv.org/pdf/2303.08983v2.pdf
Abstract: designed a distillation- and data-augmentation-based method to enhance the quality of CV datasets (see the sketch below).
Intro:
* add data augmentation into the dataset.
* add distillation information (the teacher model's outputs) into the dataset.
* a teacher model with higher accuracy is not always the best teacher; one should check the resulting student's performance.
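The dataset-reinforcement recipe above can be made concrete with a small, hypothetical sketch: precompute teacher predictions for a few augmented variants of each sample and store only the augmentation seed plus the soft labels, so student training replays them without rerunning the teacher. The `augment` interface, the number of variants, and the loss weighting are assumptions, not the paper's exact recipe.

```python
# Hypothetical sketch of "dataset reinforcement": precompute teacher outputs for
# augmented samples once, then reuse them as soft targets when training students.
import torch

def reinforce_dataset(dataset, teacher, augment, num_variants=4):
    """dataset: iterable of (image, label); augment(image, seed) -> image tensor (assumed)."""
    teacher.eval()
    reinforced = []
    with torch.no_grad():
        for image, label in dataset:
            for seed in range(num_variants):
                aug = augment(image, seed)
                soft = teacher(aug.unsqueeze(0)).softmax(dim=-1).squeeze(0)
                # store only the seed + teacher prediction, not the augmented image itself
                reinforced.append({"image": image, "label": label,
                                   "aug_seed": seed, "teacher_probs": soft})
    return reinforced

def student_loss(student, sample, augment, alpha=0.9):
    aug = augment(sample["image"], sample["aug_seed"])   # replay the stored augmentation
    logp = student(aug.unsqueeze(0)).log_softmax(dim=-1)
    kd = -(sample["teacher_probs"] * logp).sum()         # distillation term (soft labels)
    ce = -logp[0, sample["label"]]                        # ground-truth term
    return alpha * kd + (1 - alpha) * ce
```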
### Scaling Up and Distilling Down: Language-Guided Robot Skill Acquisition
Shuran Song, 2023/8/1, 7pts
https://www.cs.columbia.edu/~huy/scalingup/
Abstract: combine multi-task RL and LLMs. Use an LLM for high-level planning. Use an RL planner to generate multiple paths. Use these paths to train a language-conditioned policy. Train a network to detect failure; reset and train again when it fails.
Intro:
* wants to achieve two goals
  * generate as many language-trajectory pairs as possible: the LLM outputs high-level plans, and a sampling-based policy outputs low-level control to produce the trajectories.
  * use these data to train a multi-task RL policy: use BC, but conditioned on language input and multi-task.
* steps:
  * step 1: goal -> sub-goals:
    * turn the whole task into multiple sub-goals under the guidance of recursive LLM calls, building a goal tree.
    * separate the task based on objects, i.e., "put A into B" should be separated into two parts.
    * pros: LLMs provide general knowledge but cannot complete the task on their own.
  * step 2: implementation:
    * use a sampling-based method to add randomness to the trajectory.
    * follow the structure of the tree to complete the whole task.
  * step 3: verify & retry
    * collect both successful and failed trajectories.
    * train an inferred success function to check whether a task succeeded or failed.
    * rerun the task with another seed, without reset, so the agent can learn to recover from failure.
  * step 4: language-conditioned policy distillation
    * use a diffusion policy + imitation learning to get a multi-task policy, by adding language conditioning.
    * transfers to the real world without domain randomization.
    * guess: imitation learning is good enough if the quality of the trajectories is good.
* some notes:
  * how to utilize failure trajectories in imitation learning?
  * is retrying really important to an RL agent? How can we mathematically formulate this retry behavior?
  * it seems a good general RL agent should take both image and language as input, so we had better know the state-of-the-art models for that kind of input-output.

### Language Reward Modulation for Pretraining Reinforcement Learning
Pieter Abbeel, 23/8, 7pts
https://arxiv.org/pdf/2308.12270v1.pdf
Abstract: instead of using a learned reward function as the signal for training the downstream policy, use Vision-Language Models to pre-train the RL policy without supervision.
Intro:
* learned reward functions are noisy; directly using them to train a downstream task may cause problems.
* learned reward functions do not need human effort to label them.
* the pre-training phase has a lower requirement on the precision of the reward function.
Method:
* reinforcement learning with a vision-language reward (see the sketch below):
  * visual representations $F_\varphi(o_i)$ and text representations $L(x)$
  * $r^{in}_t = D(F(o), L(x))$, where $D(\cdot)$ is a distance
* R3M:
  * learned from the large-scale Ego4D dataset.
  * trained to predict $G(F(o_1), F(o_i), L(x))$, which represents whether the agent completed task description $x$ between timesteps $1$ and $i$.
  * use $G$ as the reward model.
* generate instructions
  * query ChatGPT
  * instructions may be human-centric, robot-centric, or ambiguous
* training
  * pre-train: add an exploration reward (Plan2Explore)
  * fine-tune: fix the language for the whole task
* note:
  * "ego-centric" instructions may be important; otherwise there may be a domain shift between the pre-training and fine-tuning environments.
  * getting rewards rather than state representations from a large language/vision model is interesting (but one has to be aware of the noise of the generative reward model).
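A minimal sketch of the vision-language reward $r^{in}_t = D(F(o), L(x))$ described in the LAMP entry above, assuming a CLIP-style model exposing `encode_image` / `encode_text` (names are assumptions) and using cosine similarity as the distance:

```python
# Hypothetical sketch: use a CLIP-style encoder pair as a task-agnostic pretraining reward.
# `vlm.encode_image` / `vlm.encode_text` are assumed interfaces, not LAMP's exact API.
import torch
import torch.nn.functional as F

def vlm_reward(vlm, frame: torch.Tensor, instruction: str) -> float:
    """Reward = cosine similarity D(F(o), L(x)) between image and text embeddings."""
    with torch.no_grad():
        img_emb = vlm.encode_image(frame.unsqueeze(0))   # (1, d)
        txt_emb = vlm.encode_text([instruction])          # (1, d)
    return F.cosine_similarity(img_emb, txt_emb, dim=-1).item()

# During pre-training one would add an exploration bonus (e.g., Plan2Explore) on top:
# r_t = vlm_reward(vlm, o_t, x) + beta * r_explore(o_t)
```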
### Learning Dexterous Manipulation from Exemplar Object Trajectories and Pre-Grasps
Vikash Kumar, ICRA 2023, 5pts
https://ieeexplore.ieee.org/stamp/stamp.jsp?arnumber=10161147
Abstract: an algorithm that can grasp various kinds of objects without tuning hyper-parameters per task. Uses an exploration strategy induced by a surprisingly simple ingredient (a single pre-grasp pose).
Intro:
* PGDM: accelerating exploration with pre-grasps
  * observation: decompose dexterous tasks into a "reaching stage" and a "manipulation stage".
  * method: manually set the initial state of the system to the pre-grasp state.
  * implementation: use a scene-agnostic trajectory optimizer to reach the pre-grasp state first, then solve the rest with PPO.
* TCDM
  * 50 tasks, from (1) human MoCap recordings transferred to the robot via IK, (2) expert pre-grasps extracted from tele-op data, (3) manually labeled pre-grasps, and (4) learned pre-grasps generated conditioned on an object mesh.

### Real World Offline Reinforcement Learning with Realistic Data Source
Vikash Kumar, ICRA 2023, 4pts
https://ieeexplore.ieee.org/stamp/stamp.jsp?arnumber=10161474
Abstract: collect safe trajectories from different tasks. Use offline RL to train a multi-task agent. Use trajectories from other tasks as heterogeneous data.
Intro:
* problems:
  * simulator data: hardware noise, varying reset conditions.
  * sub-optimal data: usually obtained by adding noise to optimal data, which may be unsafe.
* method:
  * use trajectories from other tasks as heterogeneous data: it is safe and it is meaningful.
  * use an offline RL algorithm.
  * use out-of-domain data.
* results:
  * behavior cloning demonstrates strong robustness to varying representations and in-domain tasks.
  * offline RL can outperform BC for tasks where the initial state distribution changes during deployment.
  * IQL has similar performance to BC on in-domain tasks.
  * offline RL is good at utilizing heterogeneous data.
* note:
  * this work does not consider the safety problem it mentions in the introduction.
  * how to capture the knowledge from different domains?

### Internally Rewarded Reinforcement Learning
x, ICML 2023, 8pts
https://arxiv.org/pdf/2302.00270v3.pdf
Abstract: the noise of mutual-information-based reward functions may be non-negligible. Use a linear objective to replace the log one. Use a clipping method to stabilize the training process.
Intro:
* linear reward
  * use $q(z|s) - p(z)$ instead of $\log q(z|s) - \log p(z)$ as the reward.
  * has lower error (in terms of expectation and variance).
  * it is in fact a χ²-divergence (originally we use a KL divergence).
* clipped reward
  * $r = q(z|s) - p(z) \approx p(z|s) - p(z)$
  * $p(z|s)$ should in general be greater than $p(z)$.
  * use a clipping trick to enforce this constraint, i.e., use $\max(q(z|s), p(z)) \approx p(z|s)$.
* note: a very engineering-oriented paper.

### Neural Amortized Inference for Nested Multi-agent Reasoning
Stanford, 2023/8/29, 7pts
https://arxiv.org/pdf/2308.11071v1.pdf
Abstract: try to model the opponent and infer their behavior.
Intro:
* uses the concept of k-level thinking (see the sketch below).
* at level k, $a_i \sim P(a|b_j^{k-1}, o_i)$, where player j is player i's opponent and b is the belief.
* the belief is derived recursively.
Note:
* generally 2- to 3-level thinking performs well; can we use this as a prior when designing our systems?
* how to exploit the recurrent nature of this kind of network? Can we use something like fixed-point theory to solve this problem?
* how to ensure that you have a good model of the opponent?
* can we benefit from the symmetry of the game, i.e., both agents using the same policy?
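A generic, illustrative sketch of the k-level reasoning loop from the entry above; the policy and belief interfaces are invented for illustration, and the paper amortizes this recursion with neural networks rather than unrolling it explicitly:

```python
# Hypothetical sketch of k-level reasoning: a level-k agent conditions on a simulated
# level-(k-1) model of its opponent; level 0 falls back to acting without opponent modeling.
import random

def sample_action(policy, obs, predicted_opponent_action=None):
    """policy(obs, opp_action) -> dict {action: prob}; interface is assumed."""
    probs = policy(obs, predicted_opponent_action)
    actions, weights = zip(*probs.items())
    return random.choices(actions, weights=weights, k=1)[0]

def k_level_action(k, my_policy, opp_policy, my_obs, opp_obs):
    if k == 0:
        # level-0: act without modeling the opponent at all
        return sample_action(my_policy, my_obs)
    # simulate what a level-(k-1) opponent would do, then condition on that prediction
    opp_action = k_level_action(k - 1, opp_policy, my_policy, opp_obs, my_obs)
    return sample_action(my_policy, my_obs, predicted_opponent_action=opp_action)

# The note above observes that k = 2 or 3 is usually enough in practice.
```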
### Diffuser: Planning with Diffusion for Flexible Behavior Synthesis
ICML 2022, 7pts
https://arxiv.org/pdf/2205.09991.pdf
Abstract: use a diffusion model to train a decision-making agent, outside the classical RL setting.
Intro:
* problem:
  * recent model-based RL runs model prediction and decision making separately.
  * generally this turns into an adversarial game, because the RL agent wants to exploit the model.
  * use a unified framework to combine the two processes.
* diffusion model
  * use the diffusion model as a trajectory generator.
  * input: all trajectories in the dataset; output: the entire trajectory, predicted simultaneously.
* reward model:
  * given a trajectory we can compute the reward it obtains.
  * objective: $\tilde{p}(\tau) \propto p(\tau)h(\tau)$, where $p(\tau)$ is the probability that the trajectory is generated by the diffusion model.
* note
  * if we can generate a trajectory in one forward pass, then we don't need a value function; we only need the reward function and can compute the true return by walking through the trajectory.

### Adversarial Style Transfer for Robust Policy Optimization in Deep Reinforcement Learning
Purdue, 23/8, 7pts
https://arxiv.org/pdf/2308.15550v1.pdf
Abstract: views RL training as an adversarial game. The generator transfers style and maximizes the entropy of the policy; the discriminator maximizes the reward.
Intro:
* overview:
  * the generator changes the style of input images.
  * the discriminator (policy) should output the same action distribution for an image before and after the style transformation.
* generator:
  * a pre-trained model is used to reduce the data dimension.
  * a Gaussian Mixture Model is used to cluster the data into n clusters.
  * a GAN is used to train a generator and a discriminator, where the generator should fool the discriminator and the RL policy simultaneously.
* RL policy
  * minimize $KL(\pi_\theta(\cdot|x_t), \pi_\theta(\cdot|x'_t))$, where the two x's are the image before and after the style transformation.
* note:
  * viewing data augmentation as a style transformation is interesting.
  * they define styles by clustering observations in the dataset, which can be somewhat narrow. We could define styles as trajectories or solutions to a given problem, where the same problem may have multiple solutions (styles). In that case, what should be fixed is the target state, not the policy distribution. We could still leverage methods similar to this work to enhance stability.
  * convergence properties can be analyzed within the framework of game theory.
  * we have to pay attention to the adversarial relationship between the RL agent and the environment.

### Lifelike Agility and Play on Quadrupedal Robots using Reinforcement Learning and Generative Pre-trained Models
Tencent, 23/8, 8pts
https://arxiv.org/pdf/2308.15143v1.pdf
Abstract: use a large-scale pre-trained model to learn environment-agnostic knowledge from animal videos, then fine-tune on downstream tasks and deploy on quadruped robots.
Intro:
* overall
  * there are three training stages; each is independent, and they go from general to task-specific.
  * PMC learns a latent representation of actions from videos.
  * EPMC learns multiple environment-specific decoders by interacting with the simulator and uses multi-expert distillation to compress them into one model.
  * SEPMC
* PMC
  * use multiple cameras to capture a Labrador's locomotion and behavior.
  * use inverse kinematics + pose estimation to obtain the labeled dataset.
  * use a VAE-like structure to learn a policy
    * encoder: $P(z|s^p, s^f)$, where $z$ is a latent vector, $s^p$ is the current observation, and $s^f$ is the future trajectory in the video.
    * discrete latent space: $z^e = \arg\min_i ||z-z_i||$
    * decoder: $P(a|z^e, s^p)$, where $a$ is a desired joint position, executed by a PD controller.
    > why use a discrete latent space?
    > outputting positions rather than torques is better.
    > the observation should include history.
  * prioritized sampling: give behaviors that are rare in the dataset more weight.
* EPMC
  * flat terrain: use GAIL (imitation learning) to train the agent to follow the demonstration.
  * stairs: add a residual network on top of the original decoder.
  * others: use a hyper-network to output the weights of the latent code. The goal is to follow an average velocity.
  * multi-expert distillation: uniformly sample the tasks; if there is a new task, just train a task-specific network and distill it into the main network.
* SEPMC
  * a high-level network that takes the opponent's info and the map as input.
  * use Prioritized FSP (PFSP) to train in an adversarial game.
* note
  * the hierarchical structure is interesting, flexible, and reasonable. Should we split it in a different way?
  * multi-expert is a flexible design.
  * the lower-level network is frozen while training the high-level policy; can we optimize them all together?
  * learning from real animals is interesting; can we discover something by analysing the latent space from step 1? Why do they choose a discrete latent space? I guess the reason is that in step 2 it has to output a discrete distribution. Also, can we learn from wild data, like videos on YouTube?
  * the agents are able to move on intricate terrain even though the training dataset only contains trajectories on flat terrain, which is surprising. I think a network that outputs residual positions is necessary in this domain-shift case.
  * using a VAE instead of an imitation-learning + fine-tuning pipeline is interesting.

### DREAM TO CONTROL: LEARNING BEHAVIORS BY LATENT IMAGINATION
Google, ICLR 2020 oral, 6pts
https://arxiv.org/pdf/1912.01603.pdf
Abstract: learn a world model to tackle tasks with image input.
Intro:
* pipeline:
  * execute the policy in the world to obtain a dataset.
  * learn a world model from the dataset (see the loss sketch below).
  * train the policy/value function by querying the world model.
* world model
  * the following two networks are required for the step-3 training
    * transition model: $q(s_t|s_{t-1}, a_{t-1})$
    * reward model: $q(r_t|s_t)$
  * pros
    * the agent can make decisions without image observations, which is efficient.
    * an RNN implements the transition model, which handles long horizons.
  * representation model: $p(s_t|s_{t-1},a_t,o_t)$
    > $s_t$ is Markovian
* objective
  * $\max I(s_\tau;o_\tau,r_\tau|a)-\beta I(s_\tau;o_\tau)$
  * lower bound
    * $\max E[\sum_t \ln q(o_t|s_t)]$
    * $\max E[\sum_t \ln q(r_t|s_t)]$
    * $\min E[\sum_t D_{KL}(p(s_{t+1}|s_t,a_t,o_t)\,\|\,q(s_{t+1}|s_t,a_t))]$
  * meaning
    * the latent space encodes the whole trajectory rather than only the current observation.
    * try to use as little information from the observation as possible.
* note
  * $q(s_{t+1}|s_t,a_t)$ seems weird; it is open-loop control, and the state prediction will not be corrected by the current observation. I do not agree with this model-based setting: the error may accumulate if the trajectory is very long. But the point that information should not rely heavily on the current input is good.
  * will the gradient flow from the $s$ in $q(r|s)$ back to the encoder $q(s|o)$? How does it truly work?
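A minimal sketch of the three world-model loss terms listed in the Dreamer entry above (observation reconstruction, reward prediction, and the KL between posterior and prior). The decoder/posterior/prior interfaces are assumptions; Dreamer's actual RSSM implementation differs in many details:

```python
# Sketch of the world-model objective: max E[ln q(o|s)] + E[ln q(r|s)] - KL(p || q).
# Decoders are assumed to return torch distributions; posterior/prior are Normals.
import torch
import torch.distributions as D

def world_model_loss(obs_decoder, reward_decoder,
                     posterior: D.Normal, prior: D.Normal,
                     states, obs, rewards):
    """states: latent samples from the posterior p(s_{t}|s_{t-1}, a, o), shape (T, B, d)."""
    recon_loss = -obs_decoder(states).log_prob(obs).mean()       # max E[ln q(o_t|s_t)]
    reward_loss = -reward_decoder(states).log_prob(rewards).mean()  # max E[ln q(r_t|s_t)]
    kl_loss = D.kl_divergence(posterior, prior).mean()            # min KL(p(s'|s,a,o) || q(s'|s,a))
    return recon_loss + reward_loss + kl_loss
```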
### EFFICIENT RLHF: REDUCING THE MEMORY USAGE OF PPO
Microsoft, 23/9, 4pts
https://arxiv.org/pdf/2309.00754v1.pdf
Abstract: PPO uses 3x the memory of supervised learning. This work tries to minimize the memory usage while maintaining performance.
Intro:
* memory/computation cost reduction through model sharing between the Reference/Reward models and the Actor/Critic models.
* $\pi_\theta$ and $\pi_{ref}$:
  * they are both loaded from the same pre-trained model.
  * $\pi_\theta$ uses LoRA.
  * the pre-trained model is loaded once: turn LoRA on while training $\pi_\theta$, turn it off when using $\pi_{ref}$.
* the actor and critic can also share the pre-trained model by using different LoRA adapters.

### Double Clipping: Less-Biased Variance Reduction in Off-Policy Evaluation
Amazon, 23/9, 7pts
https://arxiv.org/pdf/2309.01120v1.pdf
Abstract: clipping (of importance weights) can reduce variance at the cost of increased bias, and the bias is always a downward bias. This work designs a method that compensates for the bias while keeping the variance fixed.
Intro:
* the clipping bias is defined as:
  * $b = E_{\pi}\big[\mathbb{I}\big(\tfrac{\pi}{\pi_0}>U\big)\big(\tfrac{U}{\pi/\pi_0}-1\big)E[r]\big]$
  * once $r\geq 0$, the bias is always negative.
* note
  * the clipping operator in PPO in fact makes the policy underestimate the value.
  * underestimation is a good property for a heuristic function in the A* algorithm.
  * this requires the reward function to be $\geq 0$.

### NEUROEVOLUTION IS A COMPETITIVE ALTERNATIVE TO REINFORCEMENT LEARNING FOR SKILL DISCOVERY
InstaDeep, 23/9, 7pts
https://arxiv.org/pdf/2210.03516v4.pdf
Intro:
* contribution: proposed three benchmarks to test QD algorithms.
* compares 4 mutual-information-based methods with 4 QD-based methods.
* no algorithm significantly outperforms the others.

### A Survey on Transformers in Reinforcement Learning
Tencent, 23/9, 7pts
https://arxiv.org/pdf/2301.03044v3.pdf
* representation learning
  * AlphaStar: multi-head dot-product attention
  * multi-entity
  * multi-modal
  * ViT
  * temporal sequence: encoding the trajectory
  * while the Transformer outperforms LSTM/RNN as the memory horizon grows and parameters scale, it suffers from poor data efficiency with RL signals.
* model learning
  * transformer-based world models are better than Dreamer's.
* sequential decision-making
  * offline RL; not suitable for online use at this moment.
* generalist agents
  * large-scale multi-task datasets
  * Prompt-based Decision Transformer: samples a sequence of transitions from the few-shot demonstration dataset as a prompt.
  * Gato, RT-1: large-scale multi-modal datasets.
  * it is beneficial to fine-tune DT with a Transformer pre-trained on language datasets or multi-modal datasets containing a language modality.
* perspectives
  * online sequential decision making
  * the transformer was originally designed for text sequences.
  * general agents, general world models
  * similarities/differences with diffusion models
* notes
  * adding a transformer in the right way can add inductive bias to the model.
  * that language-pre-trained models help sequential decision processes is interesting.
  * transformers fail to learn unstable objectives: the world model can be learned, but the value function cannot.
  * using a transformer to encode a sequence of partially observable observations into a global state is promising (see the sketch below).
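A minimal sketch of the last note above: encoding a history of partial observations into a single global state with a transformer encoder. Sizes and the last-token pooling are arbitrary choices, not taken from any particular paper:

```python
# Sketch: turn a sequence of partial observations into one "global state" vector.
import torch
import torch.nn as nn

class HistoryEncoder(nn.Module):
    def __init__(self, obs_dim: int, d_model: int = 128, n_layers: int = 2, n_heads: int = 4):
        super().__init__()
        self.embed = nn.Linear(obs_dim, d_model)
        layer = nn.TransformerEncoderLayer(d_model=d_model, nhead=n_heads, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=n_layers)

    def forward(self, obs_history: torch.Tensor) -> torch.Tensor:
        """obs_history: (batch, time, obs_dim) -> global state: (batch, d_model)."""
        h = self.encoder(self.embed(obs_history))
        return h[:, -1]        # use the last timestep's embedding as the global state

# usage: state = HistoryEncoder(obs_dim=17)(torch.randn(8, 32, 17)); then feed `state` to a policy
```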
### Human-Timescale Adaptation in an Open-Ended Task Space
DeepMind, 23/10, 6pts
https://arxiv.org/pdf/2301.07608.pdf
Abstract: uses auto-curriculum learning, meta-RL, distillation, and a large model size (transformer).
Detail:
* meta-RL
  * an RL^2-based method. The agent's memory is reset at each trial; the return is not truncated at the end of each episode.
* curriculum learning
  * select "interesting" tasks at the frontier of the agent's capabilities. There are two ways:
    * no-op: compare the current policy with a no-operation policy; choose the task once some condition is met.
    * Prioritised Level Replay: a fitness score that approximates the agent's regret for a given task.
* RL
  * use a transformer to output the next n tokens and update them.
* memory:
  * RNN with attention: stores a number of past activations, attends over them, and uses the current hidden state as the query.
  * Transformer-XL: a variation of the transformer that allows longer inputs. The sequence may be sub-sampled to further extend the effective context length.
* distillation
  * train a smaller teacher model.
  * the main model is treated as the student, but it is larger than the teacher and trained with the same hyper-parameters.
* experimental results
  * scaling the agent's network size / memory length (multiple episodes, i.e., the many-shot setting) improves performance.
  * scaling the task distribution and complexity improves performance.

### A survey of inverse reinforcement learning
x, 22/2, 6pts
https://link.springer.com/content/pdf/10.1007/s10462-021-10108-x.pdf

## First Time

### Beyond Black-Box Advice: Learning-Augmented Algorithms for MDPs with Q-Value Predictions
CUHK, 2023/7/22, 6pts
https://arxiv.org/pdf/2307.10524v1.pdf
Abstract: how to use additional advice information to guide the learning of Q-values.

### On the Convergence of Bounded Agents
DeepMind, 2023/7/22, 4pts
https://arxiv.org/pdf/2307.11044v1.pdf
Abstract: it is easy to define what it means for an environment to have converged, but how do we define convergence for an agent?

### A Definition of Continual Reinforcement Learning
DeepMind, 2023/7/22, 4pts
https://arxiv.org/pdf/2307.11046v1.pdf
Abstract: gives continual reinforcement learning a definition.

### Leveraging Offline Data in Online Reinforcement Learning
UW, 2023/7/22, 5pts
https://arxiv.org/pdf/2211.04974v2.pdf
Abstract: how to combine online and offline data to accelerate the training process.

### Offline Reinforcement Learning with Closed-Form Policy Improvement Operators
UCSB, 2023/7/24, 4pts
https://arxiv.org/pdf/2211.15956v3.pdf
Abstract: when using constrained optimization, the policy improvement step can be solved in closed form via a linear approximation.

### Provable Reset-free Reinforcement Learning by No-Regret Reduction
Microsoft, 2023/7/24, 6pts, ICML
https://arxiv.org/pdf/2301.02389v3.pdf
Abstract: formulates reset-free RL as a two-player zero-sum game so that the policy avoids resets while achieving optimal performance.

### Toward Efficient Gradient-Based Value Estimation
Sutton, 2023/7/24, 5pts
https://arxiv.org/pdf/2301.13757v3.pdf
Abstract: gradient-based RL algorithms are often slower than TD-based methods. The paper makes the value-function update approximately follow the Gauss-Newton direction. This keeps the condition number of the H matrix low and thus accelerates training.

### HINDSIGHT-DICE: STABLE CREDIT ASSIGNMENT FOR DEEP REINFORCEMENT LEARNING
Stanford, 2023/8/18, 5pts
Abstract: adapts existing importance-sampling ratio estimation techniques for off-policy evaluation to drastically improve the stability and efficiency of so-called hindsight policy methods. Focuses on credit assignment.
### Parallel Q-Learning: Scaling Off-policy Reinforcement Learning under Massively Parallel Simulation
MIT, 2023/7/24, 5pts
https://arxiv.org/pdf/2307.12983v1.pdf
Abstract: a large-scale off-policy RL framework.

### A Connection between One-Step RL and Critic Regularization in Reinforcement Learning
Levine, 2023/7/24, 5pts
https://arxiv.org/pdf/2307.12968v1.pdf
Abstract: theoretically shows that one-step RL is approximately equivalent to critic-regularized multi-step RL (as used in offline RL), and empirically verifies this observation.

### Pixel to policy: DQN Encoders for within & cross-game reinforcement learning
UCSD, 2023/8/1, 3pts
https://arxiv.org/pdf/2308.00318v1.pdf
Abstract: use limited data to train an agent that can play Atari games; uses transfer learning.

### MADIFF: Offline Multi-agent Learning with Diffusion Models
SJTU, 2023/8/15, 4pts
https://arxiv.org/pdf/2305.17330v2.pdf
Abstract: naively combines diffusion models and offline RL, plus some customized network structures.

### CaRT: Certified Safety and Robust Tracking in Learning-based Motion Planning for Multi-Agent Systems
CalTech, 2023/8/15, 5pts
https://arxiv.org/pdf/2307.08602v2.pdf
Abstract: designs a hierarchical model to handle safety in multi-agent path finding: either project the nonlinear system back to a safe linear system, or filter bad trajectories via the hierarchical structure.

### Model-Based Safe Reinforcement Learning with Time-Varying State and Control Constraints: An Application to Intelligent Vehicles
IEEE member, 2023/8/15, 3pts
Abstract: a barrier-force-based control policy structure for safety; a multi-step policy evaluation mechanism is employed for time-varying constraints.

### Generating Personas for Games with Multimodal Adversarial Imitation Learning
2023/8/16, 4pts
Abstract: multi-modal GAIL; trains multiple reward functions (discriminators) and uses RL to exploit them.

### Principles and Guidelines for Evaluating Social Robot Navigation Algorithms
collaboration, 2023/8/16, 6pts
https://arxiv.org/pdf/2306.16740.pdf
Abstract: social robot navigation means navigation in human-populated environments. Covers metrics (making it easier to compare results across simulators, robots, and datasets), the development of scenarios, benchmarks, datasets, and simulators.

### Deep Reinforcement Learning with Multitask Episodic Memory Based on Task-Conditioned Hypernetwork
Beijing University of Posts and Telecommunications, 2023/8/16, 4pts
https://arxiv.org/pdf/2306.10698.pdf
Abstract: selects the most relevant past experiences for the current task and integrates them into the decision network.

### Policy Regularization with Dataset Constraint for Offline Reinforcement Learning
ICML 2023, 2023/8/16, 6pts
https://arxiv.org/pdf/2306.06569.pdf
Abstract: offline RL is too conservative. Instead, regularize the policy towards the nearest state-action pair. It is a softer constraint but still keeps enough conservatism against out-of-distribution actions.
* Detail:
  * distance: $dist((s,a),D) = \min_{(s',a')\in D} \big(dist(s,s') + \beta\, dist(a,a')\big)$
  * loss: $\min_\theta L(\theta) = E_{s\sim D}\big[dist\big((s, \pi_\theta(s)), D\big)\big]$ (see the sketch below)
  * theoretical result: under a Lipschitz assumption, once the maximum distance is bounded, $\max dist((s,\pi(\cdot|s)),D)<\epsilon$, then $|Q(s, \pi(\cdot|s))-Q(s, \mu(\cdot|s))|<K\epsilon$ is bounded.
* note:
  * equivalent to $\max L(\theta) \text{ s.t. } dist(\theta, D)<\delta$ with a point-wise distance.
  * it changes the actor loss and explicitly constrains the policy. I think a distance on the policy is meaningless, while a distance on the critic is explainable; thus I do not like this method.
  * offline RL should encourage the agent to use some OOD (s,a); otherwise it is just doing a "re-pairing" operation within the original dataset.
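A minimal sketch of the dataset-constraint regularizer from the PRDC entry above: penalize the distance from (s, π(s)) to its nearest (s', a') in the offline dataset. A brute-force pairwise search is used here for clarity; the paper uses an efficient nearest-neighbor structure:

```python
# Sketch of the PRDC-style regularizer: E_s[ min_{(s',a') in D} (dist(s,s') + beta * dist(a,a')) ].
import torch

def dataset_constraint_loss(policy, states, data_s, data_a, beta: float = 2.0):
    """states: (B, ds); data_s: (N, ds); data_a: (N, da); returns a scalar loss."""
    actions = policy(states)                       # (B, da), differentiable w.r.t. the policy
    dist_s = torch.cdist(states, data_s)           # (B, N) state distances
    dist_a = torch.cdist(actions, data_a)          # (B, N) action distances
    combined = dist_s + beta * dist_a              # dist((s, pi(s)), (s', a')) as in the note
    return combined.min(dim=-1).values.mean()      # mean nearest-neighbor distance

# Typical use (assumed): actor_loss = -q_net(states, policy(states)).mean()
#                                     + lam * dataset_constraint_loss(policy, states, data_s, data_a)
```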
### CHALLENGES AND OPPORTUNITIES OF USING TRANSFORMER-BASED MULTI-TASK LEARNING IN NLP THROUGH ML LIFECYCLE: A SURVEY
doxray (company), 2023/8/17, 4pts
https://arxiv.org/pdf/2308.08234v1.pdf
Abstract: systematically explores multi-task NLP training and connects it to continual learning.

### RoboAgent: Towards Sample Efficient Robot Manipulation with Semantic Augmentations and Action Chunking
CMU + Meta, 2023/8/18, 8pts
https://robopen.github.io
Abstract: builds a large-scale robot system, MT-ACT, aiming to train a universal agent that can handle multiple tasks. By using semantic augmentation and an action-chunking representation, the agent is able to learn from only 7500 trajectories.

### WizardMath: Empowering Mathematical Reasoning for Large Language Models via Reinforced Evol-Instruct
x, 2023/8/18, 5pts
https://arxiv.org/pdf/2308.09583v1.pdf
Abstract: uses the Reinforcement Learning from Evol-Instruct Feedback (RLEIF) method to fine-tune Llama-2.

### CoMIX: A Multi-agent Reinforcement Learning Training Architecture for Efficient Decentralized Coordination and Independent Decision Making
UCL, 2023/8/19, 6pts
https://arxiv.org/pdf/2308.10721v1.pdf
Abstract: Co-QMIX: decentralized, flexible policies that allow agents to communicate with each other.

### DPMAC: Differentially Private Communication for Cooperative Multi-Agent Reinforcement Learning
SJTU, 2023/8/19, 4pts
https://arxiv.org/pdf/2308.09902v1.pdf
Abstract: teaches agents to collaborate while preserving private information; uses game theory to prove its effectiveness.

### Never Explore Repeatedly in Multi-Agent Reinforcement Learning
趙千川, 2023/8/19, 3pts
https://arxiv.org/pdf/2308.09909v1.pdf
Abstract: addresses the problem of revisitation, i.e., the agent repeatedly visiting the same area. Proposes a dynamic reward-scaling approach to stabilize fluctuations of intrinsic rewards in previously explored areas.
* Detail
  * reward = $r_{extrinsic} + \alpha r_{intrinsic}$, where $\alpha$ is dynamically adjusted (small in well-explored areas, large in unfamiliar areas).
  * store all visited observations in $D$; the uncertainty of observation $o$ is defined as $\min_{o'\in D} dist(o,o')$.
  * uses CDS-like intrinsic rewards.
* note
  * storing all observations seems impractical.
  * it looks like a combination of CDS and RND (re-weighting the CDS rewards by an exponential of RND rewards might give similar performance).

### Reinforced Self-Training (ReST) for Language Modeling
DeepMind, 2023/8/19, 5pts
https://arxiv.org/pdf/2308.08998v2.pdf
Abstract: generate data, then use it to train an offline policy. It is sample efficient because the data can be reused. Tested on LLM tasks.
* Detail (see the loop sketch further below):
  * Grow step: the policy generates a dataset.
  * Improve step: the filtered dataset is used to fine-tune the policy.
  * both steps are repeated; the Improve step is repeated more frequently to amortize the dataset-creation cost.
* note:
  * semi-supervised methods work when generating the supervision signal is much faster than running the simulation.
  * Dreamer-like model-based RL methods may also be viewed as semi-supervised methods.

### Continual Learning as Computationally Constrained Reinforcement Learning
Stanford, 2023/8/19, 4pts
https://arxiv.org/pdf/2307.04345v2.pdf
Abstract: an introduction to continual learning, viewing it as an RL task.
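A minimal sketch of the ReST Grow/Improve loop noted above. The `generate`, `reward_fn`, and `finetune` interfaces and the increasing filter threshold are assumptions about the shape of the algorithm, not the paper's exact code:

```python
# Hypothetical sketch of ReST: Grow (generate data with the policy), then several
# Improve steps (filter by reward and fine-tune), repeated.
def rest_training(policy, prompts, reward_fn, finetune,
                  n_grow=3, n_improve=4, base_threshold=0.0, step=0.25):
    for _ in range(n_grow):
        # Grow: the current policy generates a dataset of (prompt, sample, reward)
        dataset = [(p, s, reward_fn(p, s))
                   for p in prompts for s in policy.generate(p, n=4)]
        for i in range(n_improve):
            # Improve: keep only samples above an increasing reward threshold,
            # then fine-tune the policy on the filtered data (offline / supervised)
            threshold = base_threshold + step * i
            filtered = [(p, s) for p, s, r in dataset if r >= threshold]
            policy = finetune(policy, filtered)
    return policy
```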
### Some Supervision Required: Incorporating Oracle Policies in Reinforcement Learning via Epistemic Uncertainty Metrics
x, 2023/8/19, 6pts
https://arxiv.org/pdf/2208.10533v3.pdf
Abstract: proposes Critic Confidence Guided Exploration to incorporate an oracle policy into the RL model, using an epistemic uncertainty estimate. The agent takes the oracle policy's actions as suggestions and incorporates this information into the learning scheme when uncertainty is high.
* Detail: use a UCB method to get $Q_{UCB}$; if the potential improvement $\frac{|Q_{UCB}^{oracle}-Q_{UCB}^{\pi}|}{Q^\pi}$ is greater than a threshold, use the oracle action, otherwise use $a\sim \pi(\cdot|s)$.
* note
  * setting: online RL with expert data available.
  * incorporating UCB with imitation learning is interesting.
  * expert data gives us a biased Q-value approximation. During online fine-tuning we aim to explore the overestimated (s,a) pairs. With UCB we can filter out the trajectories with the highest approximate Q-values; without UCB, we may be interested in every trajectory whose approximate Q-value is higher than the expert data's, and thus deviate onto the wrong trajectory at the beginning of the episode.

### FoX: Formation-aware exploration in multi-agent reinforcement learning
x, 2023/8/29, 2pts
https://arxiv.org/pdf/2308.11272v1.pdf
Abstract: the exploration problem in MARL; relating states in the exploration space to previous states can reduce the exploration space. Experiments: poor experimental results!

### Active Exploration for Inverse Reinforcement Learning
ETH, 2023/8/29, 5pts
https://arxiv.org/pdf/2207.08645v4.pdf
Abstract: provides sample-complexity bounds for IRL that do not require a generative model of the environment.

### Lifelong Multi-Agent Path Finding in Large-Scale Warehouses
Jiaoyang Li, 2021
Intro:
* MAPF
  * Multi-Agent Path Finding (MAPF): moving a team of agents from their start locations to their goal locations while avoiding collisions.
  * lifelong MAPF: after an agent reaches its goal location, it is assigned a new goal location and is required to keep moving.
  * views lifelong MAPF as a rolling-window sequence of MAPF problems.

### Identifying Reaction-Aware Driving Styles of Stochastic Model Predictive Controlled Vehicles by Inverse Reinforcement Learning
Arizona, 23/8, 2pts
https://arxiv.org/pdf/2308.12069v1.pdf
Abstract: uses inverse RL to model the behavior pattern of the opponent; focuses too much on how to model the driving scenario.

### E(3)-Equivariant Actor-Critic Methods for Cooperative Multi-Agent Reinforcement Learning
USC, 23/8, 2pts
https://arxiv.org/pdf/2308.11842v1.pdf
Abstract: leverages the symmetric nature of MARL problems.

### MARLlib: A Scalable and Efficient Library For Multi-agent Reinforcement Learning
Yaodong Yang, 23/8, 6pts
https://arxiv.org/pdf/2210.13708v3.pdf
Abstract: 1) a standardized multi-agent environment wrapper, 2) agent-level algorithm implementations, and 3) a flexible policy-mapping strategy.
Intro:
* collect data at the agent level.
* train policies at the agent level.
* shared parameters / grouped parameters / independent parameters.

### An Efficient Distributed Multi-Agent Reinforcement Learning for EV Charging Network Control
x, 23/8, 2pts
https://arxiv.org/pdf/2308.12921v1.pdf
Abstract: uses CTDE to solve an electric-vehicle charging problem.

### DIFFUSION POLICIES AS AN EXPRESSIVE POLICY CLASS FOR OFFLINE REINFORCEMENT LEARNING
UT Austin, 23/8, 6pts
https://arxiv.org/pdf/2208.06193v3.pdf
Abstract: Diffusion Q-Learning. Previous methods are constrained by policy classes with limited expressiveness.
A diffusion model can learn multi-modal distributions effectively.
Intro:
* problem
  * policy classes are not expressive enough; most are Gaussian distributions.
  * offline datasets are often collected by a mixture of policies.
* use a diffusion model, which is expressive enough.

### Map-based experience replay: a memory-efficient solution to catastrophic forgetting in reinforcement learning
x, 23/8, 6pts
https://arxiv.org/pdf/2305.02054v2.pdf
Abstract: reduces the size of the replay memory by merging similar samples.

### BarlowRL: Barlow Twins for Data-Efficient Reinforcement Learning
x, 23/8, 5pts
https://arxiv.org/pdf/2308.04263v2.pdf
Abstract: combines Barlow Twins (an unsupervised method) with RL.

### Examining Policy Entropy of Reinforcement Learning Agents for Personalization Tasks
https://arxiv.org/pdf/2211.11869v3.pdf
Abstract: PG methods tend to end up with lower entropy, while Q-learning does not.

### Improving Reinforcement Learning Training Regimes for Social Robot Navigation
x, 23/8, 5pts
https://arxiv.org/pdf/2308.14947v1.pdf
Abstract: uses a curriculum-learning method to achieve better generalization performance.

### Statistically Efficient Variance Reduction with Double Policy Estimation for Off-Policy Evaluation in Sequence-Modeled Reinforcement Learning
x, 23/8, 6pts
https://arxiv.org/pdf/2308.14897v1.pdf
Abstract: implements the importance-sampling trick in the offline RL setting, by maintaining a behavior policy and letting the policy output a distribution.

### Cyclophobic Reinforcement Learning
x, 23/8, 6pts
https://arxiv.org/pdf/2308.15911v1.pdf
Abstract: adds an inductive bias to help exploration; does not reward novelty, but punishes redundancy by avoiding cycles.

### Policy composition in reinforcement learning via multi-objective policy optimization
DeepMind, 23/8, 6pts
https://arxiv.org/pdf/2308.15470v2.pdf
Abstract: learn from multiple well-trained teacher models. Formulated as a multi-objective problem in which the agent can select teacher models and decide whether to use them.

### RePo: Resilient Model-Based Reinforcement Learning by Regularizing Posterior Predictability
UW, 23/9, 7pts
https://arxiv.org/pdf/2309.00082.pdf
Abstract: vision-based RL agents are easily distracted by perturbations of the environment. Proposes a method to learn from dynamics and rewards rather than observations, plus a quick-adaptation scheme to handle significant domain shift.
Intro:
* spurious variation
  * spurious variation = task-irrelevant observation.
  * self-supervised pre-trained models do not know the downstream task, so it is hard for them to distinguish task-irrelevant observations.
* objective
  * $\max I(z,r|a)$, $\min I(z,o|a)$
  * an information bottleneck.
  * the action policy is conditioned on the latent state $z$ rather than the ground-truth state $s$.
* implementation details
  * $I(z_\tau,r_\tau|a_\tau)\geq E[\sum_t\log q(r_t|z_t)]$
    > encourages a latent representation that improves performance
  * $I(z_\tau,o_\tau|a_\tau)\leq E[\sum_t D_{KL}(p_{z_{t+1}}(\cdot|z_t,a_t,o_t)\,\|\,q_{z_{t+1}}(\cdot|z_t,a_t))]$
    > use $q$, which is not conditioned on observations, to generate the next $z$
* trick (see the sketch further below):
  * $D_{KL}(p \,\|\, q) = \alpha D_{KL}(\lfloor p\rfloor \,\|\, q) + (1-\alpha) D_{KL}(p \,\|\, \lfloor q\rfloor)$, where $\lfloor\cdot\rfloor$ denotes the stop-gradient operator.

### The Role of Diverse Replay for Generalisation in Reinforcement Learning
x, 23/9, 6pts
https://arxiv.org/pdf/2306.05727v2.pdf
Abstract: defines "reachable" states as those with $\rho_\pi(s)>0$, and analyses the relationship between reachability and the generality of the policy.
* note: a new quantitative way to analyse generality.
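A minimal sketch of the balanced-KL (stop-gradient) trick from the RePo entry above, with distributions represented as `torch.distributions` objects and α as the balancing weight:

```python
# Sketch: split KL(p || q) so the posterior p and the prior q are trained at different rates.
import torch
import torch.distributions as D

def balanced_kl(p: D.Normal, q: D.Normal, alpha: float = 0.8) -> torch.Tensor:
    p_sg = D.Normal(p.loc.detach(), p.scale.detach())   # stop-gradient copy of p
    q_sg = D.Normal(q.loc.detach(), q.scale.detach())   # stop-gradient copy of q
    # alpha * KL(sg(p) || q) trains the prior; (1 - alpha) * KL(p || sg(q)) trains the posterior
    return (alpha * D.kl_divergence(p_sg, q) + (1 - alpha) * D.kl_divergence(p, q_sg)).mean()
```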
### MULTI-OBJECTIVE DECISION TRANSFORMERS FOR OFFLINE REINFORCEMENT LEARNING
x, 23/9, 5pts
https://arxiv.org/pdf/2308.16379v1.pdf
Abstract: improves the Decision Transformer to deal with multi-objective problems.

### GNFactor: Multi-Task Real Robot Learning with Generalizable Neural Feature Fields
Xiaolong Wang, CoRL, 5pts
https://arxiv.org/pdf/2308.16891v2.pdf
Abstract: uses LLMs and voxel representations to help robot learning.

### RLAIF: Scaling Reinforcement Learning from Human Feedback with AI Feedback
Google, 23/9, 7pts
https://arxiv.org/pdf/2309.00267v1.pdf
Abstract: reinforcement learning from AI feedback.
Intro:
* position bias: the order of the choices may influence the result.
* the prompt ends with: "Consider the coherence, accuracy, coverage, and overall quality of each summary and explain which one is better. Rationale:"
* self-consistency: sampling multiple reasoning paths.
* train the RM with an AI labeler, a kind of distillation; one could bypass the RM entirely, but the RM is smaller than the labeler model.

### Task Aware Dreamer for Task Generalization in Reinforcement Learning
x, 23/9, 5pts
https://arxiv.org/pdf/2303.05092v2.pdf
Abstract: Task Aware Dreamer (TAD) lets the policy know about the task it is solving.

### Leveraging Prior Knowledge in Reinforcement Learning via Double-Sided Bounds on the Value Function
x, 23/9, 5pts
https://arxiv.org/pdf/2302.09676v2.pdf
Abstract: obtains useful properties by clipping the value function with double-sided bounds.

### Robust Quadrupedal Locomotion via Risk-Averse Policy Learning
x, 23/9, 5pts
https://arxiv.org/pdf/2308.09405v2.pdf
Abstract: measures the potential risk and quickly adapts to it.

### Learning Shared Safety Constraints from Multi-task Demonstrations
CMU, 23/9, 6pts
https://arxiv.org/pdf/2309.00711v1.pdf
Abstract: formulates the problem as a two-player zero-sum game: one player optimizes the reward subject to the constraint, the other outputs the constraint.
Intro:
* given an expert dataset and a reward function for multi-task RL, the general constraint may be implicitly present in the expert dataset; try to recover it.
* formulate the problem as a two-player zero-sum game: one player optimizes the reward subject to the constraint, while the other designs the constraint so that it is consistent with the expert data.
* extend to multi-task: one agent per task maximizes the reward function.
* IRL:
  * goal: learn a policy that performs as well as the expert's, no matter the true reward function.
  * objective: $\min_\pi\max_R J(\pi_{expert},R)-J(\pi, R)$
    > the only reward function that actually makes the expert optimal is zero everywhere.
* CRL
  * goal: given a reward function and a constraint function, maximize the reward while violating the constraint less than some threshold.
  * objective: $\max_\pi J(\pi, r) \text{ s.t. } J(\pi,c)<\delta$
* final algorithm
  * $\max_\pi J(\pi,r) \text{ s.t. } \max_c J(\pi,c)- J(\pi_{expert}, c)\leq 0$
  * meaning: for every constraint function, the learned policy is always safer than the expert policy.

### RL + Model-based Control: Using On-demand Optimal Control to Learn Versatile Legged Locomotion
ETHZ, 23/9, 6pts
https://arxiv.org/pdf/2305.17842v3.pdf
Abstract: uses optimal control to generate data and imitation learning to learn from it. Combined with an RL algorithm, this yields a simplified but robust model of quadruped locomotion.

### Efficient RL via Disentangled Environment and Agent Representations
Deepak, 23/9, 6pts
https://arxiv.org/pdf/2309.02435v1.pdf
Abstract: decouples RL-agent learning and world-model learning, from a CV perspective.
### Marginalized Importance Sampling for Off-Environment Policy Evaluation
x, 23/9, 6pts
https://arxiv.org/pdf/2309.02157v1.pdf
Abstract: combines online RL (in a simulator) with offline data collected from real robots.

### ORL-AUDITOR: Dataset Auditing in Offline Deep Reinforcement Learning
x, 23/9, 3pts
https://arxiv.org/pdf/2309.03081v1.pdf
Abstract: uses rewards to identify which dataset a trajectory in an offline dataset came from.

### Grounding Large Language Models in Interactive Environments with Online Reinforcement Learning
Hugging Face, 23/9, 5pts
https://arxiv.org/pdf/2302.02662v3.pdf
Abstract: uses LLMs to play an interactive fiction game; uses BabyAI-Text as the benchmark.

### Pre- and post-contact policy decomposition for non-prehensile manipulation with zero-shot sim-to-real transfer
KAIST, 23/9, 5pts
https://arxiv.org/pdf/2309.02754v1.pdf
Abstract: lets the robot learn non-grasping behaviors, such as using walls to flip objects; uses the simulator and the real robot to train simultaneously.

### SUBWORDS AS SKILLS: TOKENIZATION FOR SPARSE-REWARD REINFORCEMENT LEARNING
x, 23/9, 5pts
https://arxiv.org/pdf/2309.04459v1.pdf
Abstract: considers continuous-control tasks. First, discretize the action space. Second, prune subwords to shrink the action space; the resulting subwords are called skills. Lastly, plan with the skills.

### Improving Offline-to-Online Reinforcement Learning with Q-Ensembles
x, 23/9, 6pts
https://arxiv.org/pdf/2306.06871v2.pdf
Abstract: offline training + online fine-tuning is a promising recipe, but distribution shift is a big problem in this setting. Removing the conservative term from the offline RL objective does not help, because the shift lets the policy deviate in the wrong direction at the start of online fine-tuning. Keeping the conservative (pessimistic) term during online fine-tuning also hurts performance. This work tries to solve the problem by using multiple Q-networks (an ensemble?).

### Leveraging World Model Disentanglement in Value-Based Multi-Agent Reinforcement Learning
McGill, 23/9, 4pts
https://arxiv.org/pdf/2309.04615v1.pdf
Abstract: uses model-based RL to play StarCraft II; the trick is to decouple the world model into three parts.

### Massively Scalable Inverse Reinforcement Learning in Google Maps
Google, 23/9, 6pts
https://arxiv.org/pdf/2305.11290v3.pdf
Abstract: uses a large-scale IRL method to solve the path-finding task on Google Maps; achieves a 16-24% improvement in global route quality.

### Open Problems and Fundamental Limitations of Reinforcement Learning from Human Feedback
collaboration, 23/9, 6pts
https://arxiv.org/pdf/2307.15217v2.pdf
Abstract: lists some open problems of RLHF.

### Inverse Reinforcement Learning without Reinforcement Learning
CMU, 23/9, 6pts
https://arxiv.org/pdf/2303.14623v3.pdf
Abstract: uses a DP method to solve IRL (?). The main motivation is that it is not necessary to run RL in every iteration.

### Computationally Efficient Reinforcement Learning: Targeted Exploration leveraging Simple Rules
EPFL, 23/9, 6pts
https://arxiv.org/pdf/2211.16691v3.pdf
Abstract: gives certain constraints on the action space to reduce the explorable space.

### Robot Parkour Learning
Stanford, 23/9, 4pts
https://arxiv.org/pdf/2309.05665v2.pdf
Abstract: learns 5 locomotion skills and uses distillation to combine them.
### Investigating the Impact of Action Representations in Policy Gradient Algorithms
x, 23/9, 3pts
Abstract: tries to figure out a paradigm for finding the optimal action space, but fails to reach a conclusion.
note: I think there should be some relationship between the action space, state space, and reward space that affects performance.

### Reasoning with Latent Diffusion in Offline Reinforcement Learning
Jeff, 23/9, 7pts
https://arxiv.org/pdf/2309.06599v1.pdf
Abstract: aims to solve offline RL tasks. Uses a diffusion model to capture multi-modality and project behaviors into a latent space z, then constrains the policy to choose only within this latent space (of low-level actions), addressing the over-conservatism problem.

### Safe Reinforcement Learning with Dual Robustness
Tsinghua, 23/9, 6pts
https://arxiv.org/pdf/2309.06835v1.pdf
Abstract: unifies safe RL and robust RL.

### Self-Refined Large Language Model as Automated Reward Function Designer for Deep Reinforcement Learning in Robotics
x, 23/9, 5pts
https://arxiv.org/pdf/2309.06687v1.pdf
Abstract: uses an LLM to generate a reward model in a closed loop, where the feedback is a general description of the task. For instance, the goal is: 1. The quadruped robot should run forward straight as fast as possible. 2. The quadruped robot cannot fall over. The feedback would be: The robot's average linear velocity on the x-axis is [NUM].
note: good idea, but the feedback information is highly correlated with the goal of the task, so why not just use the goal directly?

### Equivariant Data Augmentation for Generalization in Offline Reinforcement Learning
x, 23/9, 5pts
https://arxiv.org/pdf/2309.07578v1.pdf
Abstract: addresses the domain-shift problem in offline RL; transforms the state space and checks whether the result is equivariant.

### backdoor detection in RL
x, 23/9, 6pts
https://arxiv.org/pdf/2210.04688v3.pdf
https://arxiv.org/pdf/2202.03609v5.pdf
Abstract: learned what a backdoor attack is.

### Your Diffusion Model is Secretly a Zero-Shot Classifier
Deepak, 23/9, 6pts
Deepak: https://arxiv.org/pdf/2303.16203.pdf
Google: https://arxiv.org/pdf/2303.15233.pdf
Abstract: a well-trained diffusion model can be used as a zero-shot classifier, where the inputs are the class descriptions and an image, and the output is the class.
Method
* Google
  * for every $x_t$, use the description to generate an image and calculate the similarity between the generated image and the original one.
  * get a series of scores and combine them with predefined weights.
  * impossible classes can be removed at the beginning of the process.
* Deepak
  * calculate the similarity between the original $\epsilon$ and the $\epsilon$ conditioned on the text input.
* note
  * a model trained on a generation task can be used as a classifier. In RL, the actor is a generation task and the critic is a classification/regression task.
  * naive example:
    * suppose there are two rewards, {1, -1}.
    * given an oracle actor $f(s,r) = a$.
    * to get a critic, we compute $f(s,r_1)=a_1, f(s,r_2)=a_2$, then measure the similarity between $a, a_1, a_2$ to decide between $r_1$ and $r_2$.

### Q-Transformer: Scalable Offline Reinforcement Learning via Autoregressive Q-Functions
Google, 23/9, 5pts
https://q-transformer.github.io/assets/q-transformer.pdf
* Abstract: uses a transformer to solve offline RL tasks. Discretizes the action space so that each action dimension corresponds to a transformer token, and adds conservatism by pushing the Q-values of actions not in the dataset toward zero.
* method (a sketch of this per-dimension update appears further below)
  * discretize the action space.
  * each action dimension corresponds to one transformer token, from $a_1,\dots,a_n$.
  * the update target for each intermediate dimension is $Q_{a_i} = \max_{a_{i+1}} Q_{a_{i+1}}$.
  * the last dimension uses the traditional Bellman target $Q = R + \gamma\max Q'$.
  * force the Q-values of actions not in the dataset toward zero.
  * use MC returns to accelerate: specifically, replace $Q$ with $\max\{MC, Q\}$ (generally $MC<Q^*$, so this does not affect convergence).

### Guiding Pretraining in Reinforcement Learning with Large Language Models
UCB, 23/9, 5pts
https://arxiv.org/pdf/2302.06692v2.pdf
* algorithm:
  * use LLMs to output sub-goals.
  * the policy is conditioned on the sub-goal: $\pi(a|o, g_{sub})$
  * intrinsic reward: $-Dist(E(o,a,o'), E(g_{sub}))$
  * needs a task description.
* note
  * how do we get the task description?
  * the output of LLMs is noisy; using it as an intrinsic/pre-training signal is better.

### CHAIN-OF-THOUGHT REASONING IS A POLICY IMPROVEMENT OPERATOR
Harvard, 23/9, 5pts
https://arxiv.org/pdf/2309.08589v1.pdf
Abstract: lets the model recursively generate new data and learn from it. This suffers from error avalanching, which can be mitigated by independently updating a set of models and stopping when their outputs diverge.

### PTDE: Personalized Training with Distillated Execution for Multi-Agent Reinforcement Learning
x, 22/10, 6pts
https://arxiv.org/pdf/2210.08872.pdf
Abstract: uses global information to train a teacher and distills the knowledge into the decentralized agents.

### DOMAIN: MilDly COnservative Model-BAsed OfflINe Reinforcement Learning
x, 23/9, 5pts
https://arxiv.org/pdf/2309.08925v1.pdf
Abstract: uses model-based RL + offline RL to cover a larger region than offline RL alone. Uses adaptive sampling (?) for the model-based part; the OOD Q-values are proved to be a lower bound of the true values.

### Contrastive Initial State Buffer for Reinforcement Learning
ETHZ, 23/9, 5pts
https://arxiv.org/pdf/2309.09752v1.pdf
Abstract: projects the dataset into a latent space and uses KNN to cluster similar skills, making it possible to learn from data collected long ago.

### Your Room is not Private: Gradient Inversion Attack on Reinforcement Learning
CMU, 23/9, 3pts
https://arxiv.org/pdf/2306.09273v2.pdf
Abstract: shows that the room can be reconstructed from the training gradients.

### STEERING: Stein Information Directed Exploration for Model-Based Reinforcement Learning
ICML 2023, 23/9, 5pts
https://arxiv.org/pdf/2301.12038v2.pdf
Abstract: the exploration problem in model-based RL; combines information gain and regret to get a better intrinsic reward for MBRL.

### Language to Rewards for Robotic Skill Synthesis
Google, 23/9, 5pts
https://language-to-reward.github.io/assets/l2r.pdf
* Abstract: leverages reward functions as an interface that bridges the gap between language and low-level robot actions. The input is the task description and the output is the reward function's code; a well-designed prompt makes this work. Tested on quadruped locomotion, dexterous manipulation, and real robots.

### TEXT2REWARD: AUTOMATED DENSE REWARD FUNCTION GENERATION FOR REINFORCEMENT LEARNING
HKU, 23/9, 5pts
https://arxiv.org/pdf/2309.11489v2.pdf
* Abstract: similar to the work above, but it also allows expert feedback to adjust the generated reward function in a closed-loop way. Moreover, it lets the expert abstract the environment at the beginning.

### Hierarchical reinforcement learning with natural language subgoals
DeepMind, 23/9, 4pts
https://arxiv.org/pdf/2309.11564v1.pdf
Abstract: uses LLMs to output high-level sub-goals; trained in two stages. In the first stage, the input is the observation plus a text sub-goal and the output is an action. In the second stage, the input is the observation and the output is the sub-goal (text).
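A minimal sketch of the per-dimension Q-Transformer targets referenced in the method bullets at the top of this block. `q_target(s, prefix)` is assumed to return a vector of Q-values over the next action dimension's bins; the conservative regularizer and the MC clipping are omitted for brevity:

```python
# Sketch of autoregressive Q targets: intermediate action dimensions bootstrap within the
# same timestep; only the last dimension performs the usual Bellman backup to the next state.
import torch

def q_transformer_targets(q_target, s, action_bins, s_next, r, gamma=0.99):
    """action_bins: list of n ints (the dataset action, one bin index per dimension)."""
    n = len(action_bins)
    targets = []
    for i in range(n):
        if i < n - 1:
            # target for dim i: max over the (i+1)-th dimension, conditioned on a_{1:i}
            targets.append(q_target(s, action_bins[:i + 1]).max())
        else:
            # last dim: standard Bellman backup onto the first dimension of the next state
            targets.append(r + gamma * q_target(s_next, []).max())
    return torch.stack(targets)
```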
### Training Diffusion Models with Reinforcement Learning
Sergey, 23/10, 5pts
https://rl-diffusion.github.io/files/paper.pdf
Abstract: uses RL to train a diffusion model, with a similarity-based language reward.

### Semantically Aligned Task Decomposition in Multi-Agent Reinforcement Learning
CUHK, 23/10, 5pts
https://arxiv.org/pdf/2305.10865v2.pdf
Abstract: uses a language model to assign a different task to each agent and chain-of-thought to correct mistakes.

### Train Hard, Fight Easy: Robust Meta Reinforcement Learning
NVIDIA, 23/10, 4pts
https://arxiv.org/pdf/2301.11147v2.pdf
Abstract: addresses the problems of biased gradients and data inefficiency. The general objective is the average return over the worst α quantiles.

### SPRINT: Semantic Policy Pre-training via Language Instruction Relabeling
USC, 23/10, 5pts
https://clvrai.github.io/sprint/
Abstract: the pre-training data contains trajectories and language instructions. Uses LLMs to relabel the instructions, possibly merging two sub-goals into a new goal.

### Open-ended learning leads to generally capable agents
DeepMind, 21, 6pts

### Muesli: Combining Improvements in Policy Optimization
DeepMind, 22/5, 6pts
https://arxiv.org/pdf/2104.06159.pdf
Abstract: a MuZero-like method with good empirical results.

### BLENDING IMITATION AND REINFORCEMENT LEARNING FOR ROBUST POLICY IMPROVEMENT
UChicago, 23/10, 4pts
https://arxiv.org/pdf/2310.01737v1.pdf
Abstract: combines imitation learning and reinforcement learning.

### A LONG WAY TO GO: INVESTIGATING LENGTH CORRELATIONS IN RLHF
Princeton, 23/10, 5pts
https://arxiv.org/pdf/2310.03716v1.pdf
Abstract: RLHF tends to lengthen the output. Furthermore, replacing the reward model with a purely length-based reward model does not decrease performance.

### LESSON: Learning to Integrate Exploration Strategies for Reinforcement Learning via an Option Framework
KAIST, 23/10, 6pts
https://arxiv.org/pdf/2310.03342v1.pdf
Abstract: investigates meta-learning from the perspective of exploration.

### Talk of Shuran Song
* notes:
  * action selection is important for learning dynamics.
  * instead of predicting the whole trajectory, we can learn how the gradient of the action affects the result.
  * the invariance in a robot system can be used as 1. data augmentation, or 2. a transformation. Specifically, the second one means that when you observe f(o), you should output action f(a). It differs from the first in that a change of input also causes a change of output (see the sketch further below).
  * how can we learn from an inaccurate simulator?

### ON DOUBLE-DESCENT IN REINFORCEMENT LEARNING WITH LSTD AND RANDOM FEATURES
x, 23/10, 4pts
https://arxiv.org/pdf/2310.05518v1.pdf
* Abstract: double descent: when N/m = 1, the test loss reaches a peak. When N (model size) < m (data size), the test error is U-shaped, representing the bias-variance trade-off; when N > m, the test error drops again.
* note:
  * how do we define data size in RL: the visited states? state-action pairs?
  * it could be used to choose the model size as well as the training time.

### Planning to Go Out-of-Distribution in Offline-to-Online Reinforcement Learning
University of Edinburgh, 23/10, 6pts
https://arxiv.org/pdf/2310.05723v1.pdf
* Abstract: offline pre-training and online fine-tuning. During the online adaptation phase, we have to let the agent explore the high-return region.
* note:
  * out-of-distribution scenes can be handled by online fine-tuning, just like out-of-distribution tasks can be handled by few-shot learning.
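A tiny sketch of the two uses of symmetry from the Shuran Song talk notes above, assuming a planar robot whose observations and actions contain 2-D vectors that rotate together; both functions are illustrative, not from the talk:

```python
# Sketch: invariance as data augmentation vs. equivariance as an input-output transformation.
import numpy as np

def rotate(vec: np.ndarray, theta: float) -> np.ndarray:
    c, s = np.cos(theta), np.sin(theta)
    return np.array([[c, -s], [s, c]]) @ vec

# 1) invariance as data augmentation: perturb the input, keep the same action label
def augment_invariant(obs: np.ndarray, act: np.ndarray, noise_scale: float = 0.01):
    return obs + noise_scale * np.random.randn(*obs.shape), act

# 2) equivariance as a transformation: a rotated observation must produce a rotated action
def equivariant_policy(policy, obs: np.ndarray, theta: float) -> np.ndarray:
    canonical_act = policy(rotate(obs, -theta))   # evaluate in a canonical frame
    return rotate(canonical_act, theta)           # f(o) -> f(a): the output changes with the input
```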
### Accelerate Multi-Agent Reinforcement Learning in Zero-Sum Games with Subgame Curriculum Learning
Yi Wu, 23/10, 6pts
https://arxiv.org/pdf/2310.04796v1.pdf
* Abstract: sets the initial state of each episode so that the Nash equilibrium can be found in linear time.

### Improving Reinforcement Learning Efficiency with Auxiliary Tasks in Non-Visual Environments: A Comparison
x, 23/10, 4pts
https://arxiv.org/pdf/2310.04241v2.pdf
* Abstract:
  * compares the effects of different auxiliary tasks.
  * representation learning with auxiliary tasks only provides performance gains in sufficiently complex environments.
  * learning environment dynamics is preferable to predicting rewards.
  * builds on `Can Increasing Input Dimensionality Improve Deep Reinforcement Learning`; decouples the auxiliary task from RL training.

### EFFICIENT DEEP REINFORCEMENT LEARNING REQUIRES REGULATING OVERFITTING
Sergey, 23/4, 5pts
https://arxiv.org/pdf/2304.10466.pdf
* Abstract: prevents the RL algorithm from overfitting, using a method similar to a validation set in supervised learning, but computing a validation TD error instead.

### Diffusion Model is an Effective Planner and Data Synthesizer for Multi-Task Reinforcement Learning
SJTU, 23/10, 5pts
* Abstract: transformer, offline, multi-task; uses diffusion to reconstruct transitions and actions.

### REVISITING PLASTICITY IN VISUAL REINFORCEMENT LEARNING: DATA, MODULES AND TRAINING STAGES
Tsinghua, 23/10, 4pts
https://arxiv.org/pdf/2310.07418v1.pdf
* Abstract: plasticity = performance increase per unit of new data. Reset and data augmentation can improve plasticity (reset = regularly resetting part of the parameters during training). Data augmentation > data augmentation + reset > reset > none. The critic's plasticity matters more than the actor's, and early-stage plasticity matters more than late-stage.

### RL3: Boosting Meta Reinforcement Learning via RL inside RL2

### MULTI-TIMESTEP MODELS FOR MODEL-BASED REINFORCEMENT LEARNING
Huawei, 23/10, 3pts
* Abstract: multi-step state reconstruction, like TD(λ).

### ReMax: A Simple, Effective, and Efficient Reinforcement Learning Method for Aligning Large Language Models
x, 23/10, 7pts
https://arxiv.org/pdf/2310.10505v2.pdf
* Abstract:
  * deals with RLHF.
  * discards the critic network to save about 50% of GPU memory consumption.
* Algorithm (see the sketch further below)
  * uses a sample-based method to estimate the return-to-go.
  * three reasons this works:
    * the dynamics of NLP tasks are deterministic.
    * completing a trajectory does not take a lot of time.
    * the reward is only received at the end of the trajectory.
  * use greedy decoding (sample=False) to collect a trajectory and use its return-to-go as the baseline value.

### Policy Optimization for Continuous Reinforcement Learning
Columbia, 23/10, 6pts
https://arxiv.org/pdf/2305.18901v4.pdf
Abstract: continuous-time RL.

### VISION-LANGUAGE MODELS ARE ZERO-SHOT REWARD MODELS FOR REINFORCEMENT LEARNING
x, 23/10, 7pts
Abstract: a zero-shot reward model, tested on the MuJoCo humanoid with the largest publicly available CLIP model and realistic textures.

### ABSOLUTE POLICY OPTIMIZATION
CMU, 23/10, 5pts
https://arxiv.org/pdf/2310.13230v1.pdf
Abstract: policy optimization should not be solely fixated on enhancing expected performance, but also on improving worst-case performance, i.e., $\max J(\pi) - \mathrm{variance}(\pi)$.
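A minimal sketch of the ReMax idea noted above: REINFORCE with a greedy-decoding baseline instead of a learned critic. The `generate`, `reward_fn`, and `log_prob` interfaces are assumptions, not the paper's implementation:

```python
# Sketch: advantage = reward(sampled response) - reward(greedy response); no critic network.
import torch

def remax_loss(model, reward_fn, prompt):
    sampled = model.generate(prompt, sample=True)        # stochastic rollout
    with torch.no_grad():
        greedy = model.generate(prompt, sample=False)    # greedy baseline rollout (no gradient)
        advantage = reward_fn(prompt, sampled) - reward_fn(prompt, greedy)
    # REINFORCE: maximize the advantage-weighted log-likelihood of the sampled response
    return -(advantage * model.log_prob(prompt, sampled))
```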
### CONTRASTIVE PREFERENCE LEARNING: LEARNING FROM HUMAN FEEDBACK WITHOUT RL
Stanford, 23/10, 4pts
https://arxiv.org/pdf/2310.13639v1.pdf
Abstract: instead of learning a reward function, learn from the optimal advantage function (negated regret). Contrastive Preference Learning uses a bijective mapping between advantage functions and policies.

### Contrastive Retrospection: honing in on critical steps for rapid learning and generalization in RL
mila, 23/10, 6pts
https://arxiv.org/pdf/2210.05845v6.pdf
Abstract: tackles credit assignment. Store trajectories in a buffer, use contrastive learning to learn an embedding, store some prototypes (critical steps), and finally use cosine distance to generate intrinsic rewards.

### The Primacy Bias in Model-based RL
2pts
Reset the parameters of the world model instead of the agent's parameters.

### Robot Fine-Tuning Made Easy: Pre-Training Rewards and Policies for Autonomous Real-World Reinforcement Learning
Stanford, 23/10, 6pts
https://arxiv.org/pdf/2310.15145v1.pdf
Abstract: pre-train a multi-task policy and fine-tune a pre-trained vision-language model (VLM) as a reward model, using diverse off-the-shelf offline datasets and a small number of target-task demonstrations. Then fine-tune the pre-trained policy online, reset-free, with the VLM reward model.

### UNLEASHING THE POWER OF PRE-TRAINED LANGUAGE MODELS FOR OFFLINE REINFORCEMENT LEARNING
Huazhe Xu, 23/11, 4pts
Abstract: LoRA + a pre-trained large language model, fine-tuned in the Decision Transformer framework, with a language task as an auxiliary loss. Uses GPT-2 only.

### Prioritized Level Replay
Facebook, ICML 2021, 6pts
https://arxiv.org/pdf/2010.03934.pdf
Algorithm (see the sketch below):
* with probability p, choose a new level (uniformly); with probability 1-p, choose among old levels.
* for old levels:
  * score = average TD error. Rank the levels by score and sample with a rank-based prioritisation: $p_s(i) = \frac{(1/\mathrm{rank}(i))^{1/\beta}}{\sum_j (1/\mathrm{rank}(j))^{1/\beta}}$
  * staleness: with total count $C$ and level $i$ chosen $c_i$ times, $p_c(i) = \frac{C - c_i}{\sum_j (C - c_j)}$
  * final probability = $\rho\, p_c + (1-\rho)\, p_s$

### DrM: Mastering Visual Reinforcement Learning through Dormant Ratio Minimization
Huazhe Xu, 23/10, 6pts
https://drm-rl.github.io
Abstract: dormant ratio. They find that in the early phase of training agents tend to exhibit inactivity, which limits their ability to explore. Tested on visual environments; adds a periodic "awaken" exploration scheduler.

### From Explicit Communication to Tacit Cooperation: A Novel Paradigm for Cooperative MARL
x, 23/4, 6pts
https://arxiv.org/pdf/2304.14656.pdf
Abstract: centralized training at first, gradually reducing the shared information, finally yielding a decentralized policy.

### Context Shift Reduction for Offline Meta-Reinforcement Learning
x, NeurIPS 2023, 7pts
https://arxiv.org/pdf/2311.03695v1.pdf
Abstract: maximize the mutual information between the context Z and the task T while minimizing the mutual information between Z and the behavior policy π in the offline setting, since the trajectory is influenced by both the behavior policy and the dynamics, and the former is unrelated to the task context.

### Offline Multi-Agent Reinforcement Learning with Implicit Global-to-Local Value Regularization
x, NeurIPS 23, 4pts
https://arxiv.org/pdf/2307.11620v2.pdf
Abstract: decompose the offline regularization term.
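A minimal sketch of the replay distribution described in the Prioritized Level Replay note above, using the paper's rank prioritisation $h(\mathrm{rank}) = 1/\mathrm{rank}$ with temperature $1/\beta$; the staleness term follows the note's count-based simplification, and the variable names are mine.

```python
import numpy as np

def plr_replay_probs(td_scores, counts, beta=0.1, rho=0.5):
    """Sampling probabilities over previously seen levels: a rank-prioritised TD-error
    term mixed with a staleness term that favours rarely chosen levels."""
    td_scores = np.asarray(td_scores, dtype=float)
    counts = np.asarray(counts, dtype=float)

    # rank 1 = largest average TD error
    ranks = np.empty_like(td_scores)
    ranks[np.argsort(-td_scores)] = np.arange(1, len(td_scores) + 1)

    h = (1.0 / ranks) ** (1.0 / beta)        # rank-based score prioritisation
    p_s = h / h.sum()

    remaining = counts.sum() - counts        # the note's C - c_i
    p_c = remaining / remaining.sum()

    return rho * p_c + (1.0 - rho) * p_s     # final sampling distribution
```

Sampling an old level is then e.g. `np.random.choice(len(td_scores), p=plr_replay_probs(td_scores, counts))`.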
### Survival Instinct in Offline Reinforcement Learning
x, UW, 7pts
https://arxiv.org/pdf/2306.03286v2.pdf
* interesting observations
  * offline agents trained with random/negative/zero rewards outperform BC and the behavior policy.
  * they sometimes even outperform agents trained with the original rewards.
  * they exhibit some safe behavior.
* intuition
  * large data coverage:
    * pros: improves the best policy that offline RL can learn with the true reward.
    * cons: more sensitive to imperfect rewards.
    * thus, large coverage might not be necessary or even helpful.
  * data bias
    * evaluate offline algorithms with wrong rewards to quantify data bias.

### TEA: Test-time Energy Adaptation
x, Nov23, 7pts
https://arxiv.org/pdf/2311.14402v1.pdf

### IMITATION BOOTSTRAPPED REINFORCEMENT LEARNING
Dorsa, Nov23, 4pts
https://arxiv.org/pdf/2311.02198v3.pdf
Abstract: imitation learning yields a pre-trained policy, which is then fine-tuned online with RL. Action selection: $a = \arg\max_{a \in \{a_{RL},\, a_{IL}\}} Q(s,a)$.

### Nearly Tight Bounds for the Continuum-Armed Bandit Problem
* some insight
  * its estimate of the cost function only needs to be accurate for strategies where the cost function is near its minimum.

### Towards a Standardised Performance Evaluation Protocol for Cooperative MARL
* NeurIPS 2022
* details to pay attention to when running MARL experiments.

### LiFT: Unsupervised Reinforcement Learning with Foundation Models as Teachers
USC, Dec 2023, 4pts
https://arxiv.org/pdf/2312.08958v1.pdf
* use a VLM to train a Minecraft agent without supervision. Prompts look like: `You see blocks of [objects]. You face entities of [entities]. You have a [item] in your hand. Target Skill:`. The VLM alignment score is the cosine distance in embedding space.

### Less is more - the dispatcher/executor principle for multi-task Reinforcement Learning
DeepMind, Dec 2023, 6pts
https://arxiv.org/pdf/2312.09120v1.pdf
* main idea: dispatcher + executor = agent
  * dispatcher: semantically understands the task and issues commands.
  * executor: produces the actual control signal; there may be several executors specialized in different skills.
  * communication channel: keeps the transmitted command information compositional and reduces the transferred information to a minimum.
* execution details
  * encode the information about the target object through a simple masking operation.
  * run the full image through an edge detector and provide the result as an additional input to the executor to avoid collisions.

### Video PreTraining (VPT): Learning to Act by Watching Unlabeled Online Videos
OpenAI, June 2022
https://arxiv.org/pdf/2206.11795.pdf
* collect 70k hours of unlabeled data and 2k hours of labeled data. Train an inverse dynamics model on the latter, use it to label the former, train an agent on the labeled videos, and fine-tune it with RL.

### STEVE-1: A Generative Model for Text-to-Behavior in Minecraft
UofT, June 2023
* use VPT + MineCLIP to train a command-following agent.
* DALL-E-like method
  * DALL-E
    * CLIP model = text encoder + image encoder
    * model 1: text prompt -> image embedding
    * model 2: image decoder
  * STEVE
    * MineCLIP = video encoder + text encoder
    * model 1: text prompt -> video embedding
    * model 2: video embedding -> VPT agent that is able to achieve the goal
* detail
  * policy loss = -log p(a|o,z), where the embedding z is taken from a randomly selected future timestep.
  * a CVAE takes the text embedding and a Gaussian prior as input and outputs the visual embedding, which becomes the input to VPT.
  * model 1 is trained on a dataset by minimizing a KL divergence.
  * some embeddings are dropped out during training, so the policy is no longer always conditioned on the text. At test time the policy is a mixture: lambda * policy_distribution_conditioned_on_text + (1 - lambda) * policy_distribution_unconditioned_on_text (see the sketch below).
* note
  * is lambda added to improve grounding ability?
  * the idea's advantage is that you already have a pre-trained decoder; the only thing that has to change is model 1.
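The test-time mixture in the STEVE-1 note above, read literally, is a one-liner. A minimal sketch; the `policy(obs, embedding)` interface returning action probabilities is assumed, and passing `None` stands in for the embedding-dropped (unconditioned) pass.

```python
import numpy as np

def mixed_action_probs(policy, obs, goal_embedding, lam=0.7):
    """Mix the goal-conditioned and unconditioned action distributions at test time.
    The unconditioned pass is meaningful because embeddings were dropped out in training."""
    p_cond = np.asarray(policy(obs, goal_embedding))  # conditioned on the text/visual goal
    p_uncond = np.asarray(policy(obs, None))          # goal embedding dropped ("prior" policy)
    probs = lam * p_cond + (1.0 - lam) * p_uncond
    return probs / probs.sum()                        # renormalise for numerical safety
```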
### Accelerate Multi-Agent Reinforcement Learning in Zero-Sum Games with Subgame Curriculum Learning
Yi Wu, AAAI 2024, 5pts
https://arxiv.org/pdf/2310.04796v2.pdf
* abstract: self-play. Instead of choosing an appropriate opponent for the current agent, choose a state to start from. Use the difference between the NE value and the current value as the metric; however, we can never know the NE value exactly, so it has to be approximated.

### XSkill: Cross Embodiment Skill Discovery
Shuran Song, CoRL 2023, 3pts
* abstract: learn from unlabeled human demonstrations; learn a skill discriminator and a policy conditioned on a skill embedding.

### FROM PLAY TO POLICY: CONDITIONAL BEHAVIOR GENERATION FROM UNCURATED ROBOT DATA
Lerrel Pinto, Dec 2022, 4pts
* abstract: learn offline a policy conditioned on a future state. But how is the future state obtained at evaluation time?

### PRE-TRAINING FOR ROBOTS: LEVERAGING DIVERSE MULTITASK DATA VIA OFFLINE RL
ICLR 2023, 5pts
* abstract: large-scale pre-training with offline RL, followed by fine-tuning on the target task with 10-15 demonstrations, is better than IRL, RL, or other methods.

### Diffusion Reward: Learning Rewards via Conditional Video Diffusion
Huazhe Xu, 6pts
https://diffusion-reward.github.io/resources/Diffusion_Reward_Learning_Rewards_via_Conditional_Video_Diffusion.pdf
* abstract: a diffusion model conditioned on past frames outputs the whole future trajectory. The reward r(s) is defined via the entropy of the model's output, which can also encourage exploration. The insight is that the diffusion model can generate diverse trajectories.

## ICLR2024
* RLIF: Interactive Imitation Learning as Reinforcement Learning
  * 8666
  * abstract: use RL to improve DAgger; the expert's decision to intervene is used as a negative reward signal.
* Efficient Offline Reinforcement Learning: The Critic is Critical
  * 555
  * abstract: use a Monte Carlo regression loss to pre-train the critic.
* SUBMODULAR REINFORCEMENT LEARNING
  * 8686
  * abstract: rewards may depend on the historical trajectory. Theoretically analyzes the algorithm's lower bound.
* Towards Principled Representation Learning from Videos for Reinforcement Learning
  * 8885
  * abstract: RL + video representation learning with i.i.d. noise or exogenous noise.
* Harnessing Discrete Representations for Continual Reinforcement Learning
  * 8655
  * abstract: use discrete state representations instead of continuous states.
* Stochastic Subgoal Representation for Hierarchical Reinforcement Learning
  * 8661
  * abstract: use a stochastic latent subgoal representation to improve long-term decision making.
* Discovering Temporally-Aware Reinforcement Learning Algorithms
  * 855
  * abstract: meta-learning is used to learn objective functions for different tasks; here the learned objective function depends on the time horizon, e.g., students may alter their studying techniques based on the proximity of exam deadlines and their self-assessed capabilities.
* Language Reward Modulation for Pretraining Reinforcement Learning
  * 6565
  * abstract: use the VLM's output as pre-training rewards; fine-tune on the downstream task (with sparse rewards).
* Training Diffusion Models with Reinforcement Learning
  * 5866
  * abstract: DDPO treats the reverse generative process as an MDP, where the reward is given only at the zeroth (final) denoising timestep.
* lambda-AC: Effective decision-aware reinforcement learning with latent models
  * 5368
  * abstract: analyzes MuZero and its alternatives.
* Exposing the Silent Hidden Impact of Certified Training in Reinforcement Learning
  * 565
  * abstract: adversarially trained value functions are shown to overestimate the optimal values.
* Time-Efficient Reinforcement Learning with Stochastic Stateful Policies
  * 586
  * abstract: POMDPs. The policy's internal state is represented as a stochastic variable that is sampled at each time step, circumventing the issues associated with backpropagation through time. (why?)
* Transformers as Decision Makers: Provable In-Context Reinforcement Learning via Supervised Pretraining
  * 5686
  * abstract: in-context learning can do the same thing as UCB, etc., and has better generalization ability than supervised fine-tuning.
* Goodhart's Law in Reinforcement Learning
  * 5668
  * abstract: gives a geometric explanation of how optimisation of a misspecified reward function can lead to worse performance beyond some threshold; proposes an early-stopping algorithm.
* Reasoning with Latent Diffusion in Offline Reinforcement Learning
  * 685
  * abstract: use a diffusion model with offline RL to learn a latent representation; multi-modal, conditioned on time.
* CPPO: Continual Learning for Reinforcement Learning with Human Feedback
  * 6685
  * abstract: RLHF + continual learning. Examples with high reward and low generation probability, or high generation probability and low reward, get a high policy-learning weight (new knowledge) and a low knowledge-retention weight (old knowledge).
* Revisiting Data Augmentation in Deep Reinforcement Learning
  * 6666
  * abstract: analyzes existing methods; includes a regularization term called tangent prop.
* Proximal Curriculum with Task Correlations for Deep Reinforcement Learning
  * 8355
  * abstract: multi-task; curriculum design based on the Zone of Proximal Development concept.
* Robust Adversarial Reinforcement Learning via Bounded Rationality Curricula
  * 8666
  * abstract: builds on robust adversarial reinforcement learning by adding entropy regularization to the players' objectives and annealing the temperature (curriculum learning).
* Subwords as Skills: Tokenization for Sparse-Reward Reinforcement Learning
  * 6553
  * abstract: discretize actions and tokenize skills (a series of actions).
* Proximal Policy Gradient Arborescence for Quality Diversity Reinforcement Learning
  * 8866
  * abstract: QD + PPO, very strong performance.
* Maximum Entropy Model Correction in Reinforcement Learning
  * 688
  * abstract: MBRL, max-entropy RL; uses an incorrect world model to speed up training.
* Value Factorization for Asynchronous Multi-Agent Reinforcement Learning
  * 6565
  * abstract: asynchronous value decomposition (what is that?)
* Policy Rehearsing: Training Generalizable Policies for Reinforcement Learning
  * 8886
  * abstract: use a somewhat OOD offline dataset for training rehearsal; use rewards and done flags as input to learn the dynamics.
* Robust Reinforcement Learning with Structured Adversarial Ensemble
  * 663
  * abstract: proposes an adversarial ensemble approach to address over-optimism, and optimizes average performance against the worst-k adversaries to mitigate over-pessimism.
* Multi-Task Reinforcement Learning with Mixture of Orthogonal Experts
  * 55666
  * abstract: use multiple models for multi-task RL; use Gram-Schmidt to ensure each model learns a different representation.
* Game-Theoretic Robust Reinforcement Learning Handles Temporally-Coupled Perturbations
  * 565
  * abstract: adversarial training based on temporally-coupled perturbations. (temporally?)
* Tactics of Robust Deep Reinforcement Learning with Randomized Smoothing
  * 5555
  * abstract: robustness of DRL. Randomized smoothing introduces a trade-off between utility and robustness; also introduces a more potent adversarial attack.
* Blending Imitation and Reinforcement Learning for Robust Policy Improvement
  * 8885
  * abstract: combines RL and IL, using IL to encourage exploration in RL.
* Compositional Instruction Following with Language Models and Reinforcement Learning
  * 5553
  * abstract: use an LLM to map a given natural-language specification to an expression representing a Boolean combination of primitive tasks.
* Privileged Sensing Scaffolds Reinforcement Learning
  * 8 8 8 10 !!!!!
  * abstract: MBRL, a Dreamer-like algorithm; uses privileged knowledge to learn a better world model.
* CAMMARL: Conformal Action Modeling in Multi Agent Reinforcement Learning
  * 5566
  * abstract: maintain a belief over your teammates' actions and condition your policy on this belief.
* Robust Model Based Reinforcement Learning Using Adaptive Control
  * 6668
  * abstract: the control input produced by the underlying MBRL policy is perturbed by an adaptive controller designed to enhance the system's robustness against uncertainties. (?)
* Decision Transformer is a Robust Contender for Offline Reinforcement Learning
  * 6666
  * abstract: DT requires more data than CQL, but is more robust to suboptimal data and sparse rewards. DT and BC do well on tasks with longer horizons or data collected from human demonstrations, while CQL does well on tasks with both high stochasticity and low data quality.
* Exploiting Causal Graph Priors with Posterior Sampling for Reinforcement Learning
  * 8886
  * abstract: a causal graph as the prior; use Bayes' rule to compute the posterior.
* PAE: Reinforcement Learning from External Knowledge for Efficient Exploration
  * 8666
  * abstract: incorporates a planner into the RL framework; takes natural language as input, mainly focusing on the exploration problem in long-horizon tasks.
* SRL: Scaling Distributed Reinforcement Learning to Over Ten Thousand Cores
  * 8886
  * abstract: an RL systems paper.
* Decoupling regularization from the action space
  * 665
  * abstract: the scale of the entropy term should not be proportional to the dimension of the action space.
  * intuition: changing the robot's acceleration unit from meters per second squared to feet per minute squared should not lead to a different optimal policy.
  * tune the beta in exp(Q/beta) based on dim(A); set the target entropy as (1 - alpha) * H(uniform) + alpha * H(deterministic) (see the toy sketch below).
* Decoupled Actor-Critic
  * 6665
  * abstract: use an optimism model for exploration (does not interact with the environment) and a pessimism model for exploitation.
* S2AC: Energy-Based Reinforcement Learning with Stein Soft Actor Critic
  * 3668665
  * abstract: compared with SQL/SAC; uses parameterized Stein Variational Gradient Descent (SVGD) to learn a max-entropy policy.
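The target-entropy mixing in the "Decoupling regularization from the action space" note invites a toy sketch. This is only an illustration under my own assumptions: per-dimension entropies of a uniform box and of a narrow Gaussian (a hypothetical `sigma_min` stands in for 'deterministic'), and the weight read as (1 - alpha); it is not the paper's exact recipe.

```python
import numpy as np

def target_entropy(action_low, action_high, alpha=0.5, sigma_min=1e-3):
    """Toy target entropy: interpolate between a uniform policy over the action box
    and a near-deterministic Gaussian policy (sigma_min is an arbitrary stand-in)."""
    low = np.asarray(action_low, dtype=float)
    high = np.asarray(action_high, dtype=float)
    dim = low.size

    h_uniform = float(np.sum(np.log(high - low)))                    # uniform over the box
    h_det = dim * 0.5 * np.log(2.0 * np.pi * np.e * sigma_min ** 2)  # near-deterministic Gaussian
    return (1.0 - alpha) * h_uniform + alpha * h_det
```

Such a value would replace the usual -dim(A) entropy target in a SAC-style temperature update.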