If you have any questions, please contact chenwenze21@gmail.com.
## Second Time
### HIQL: Offline Goal-Conditioned RL with Latent States as Actions
levine 2023/7/24 7pts
https://arxiv.org/pdf/2307.11949v1.pdf
Abstract: directly learning a goal-reaching agent is hard. The paper proposes a hierarchical method: a high-level agent learns to generate sub-goals (treating states as actions), paired with a traditional low-level RL agent.
Intro:
* train a value function with IQL
* use that value function to train two networks
* high-level: state (input), sub-goal (output)
* low-level: sub-goal (input), action (output)
* the advantage of this setup is that the sub-goal lives in a latent space, so unlabeled data can be used to pre-train the policy.
* the hierarchical structure also makes long-horizon goals easier to learn.
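The extraction step above can be made concrete. Below is a minimal sketch, assuming a goal-conditioned IQL value function $V(s,g)$ has already been trained; the advantage-weighted regression form and all names (`pi_high`, `pi_low`, `k`) are illustrative assumptions, not the authors' code.

```python
# Minimal sketch (not the paper's code): hierarchical policy extraction on top of
# an already-trained goal-conditioned IQL value function V(s, g).
import torch

def awr_loss(log_prob, advantage, beta=3.0):
    # Advantage-weighted regression: weight log-likelihood by exp(A / beta).
    weight = torch.clamp(torch.exp(advantage / beta), max=100.0)
    return -(weight.detach() * log_prob).mean()

def hiql_losses(V, pi_high, pi_low, batch, k=25):
    s, g, s_k, s_next, a = batch["s"], batch["g"], batch["s_k"], batch["s_next"], batch["a"]
    # High level: propose the k-step-ahead state as a sub-goal, weighted by how much
    # it improves the value toward the final goal g.
    adv_high = V(s_k, g) - V(s, g)
    loss_high = awr_loss(pi_high(s, g).log_prob(s_k), adv_high)
    # Low level: imitate the dataset action, weighted by progress toward the sub-goal s_k.
    adv_low = V(s_next, s_k) - V(s, s_k)
    loss_low = awr_loss(pi_low(s, s_k).log_prob(a), adv_low)
    return loss_high, loss_low
```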
### PASTA: Pretrained action-state transformer agents
InstaDeep 2023/7/22 4pts
https://arxiv.org/pdf/2307.10936v1.pdf
Abstract: a universal pre-trained model for RL tasks.
Intro:
* Tokenizing trajectories at the component level works better; "component level" means each state is treated as a vector of components, e.g., the positions of individual legs.
* Pre-training the same model on datasets from multiple domains improves generalization.
* The model is evaluated on many downstream tasks.
### Real Robot Challenge 2022: Learning Dexterous Manipulation from Offline Data in the Real World
collaborate, 2023/8/16 4pts
https://arxiv.org/pdf/2308.07741.pdf
Abstract: describes the rules of the competition, presents the methods used by the winning teams, and compares their results with a benchmark of state-of-the-art offline RL algorithms on the challenge datasets. (The competition allowed participants to experiment remotely with a real robot.)
Content:
* It is interesting that the first-place team used naive BC as their algorithm. When the data is collected by experts, BC can be a better choice than offline RL algorithms.
* They use a self-supervised method to train a discriminator that separates expert data from near-optimal data.
* An assumption they rely on is that a trajectory with higher return is not necessarily collected by an expert.
* Of course data augmentation helps RL training in the real-world setting. However, in practice, the real world may not be that symmetric; pay attention to this when using data augmentation.
### RT-2: Vision-Language-Action Models
Google, 2023/7/30 8pts
https://robotics-transformer2.github.io
* Abstract: incorporate a pre-trained large-scale vision-language model, trained on internet data, into the RL training loop.
* Intro: train vision, language, and robot actions all together in an end-to-end way.
* discretize the action space
* co-fine-tune: fine-tune the action output by interacting with the environment and fine-tune the vision-language model on web data, which makes the model more general
* output constraint: when training on vision-language tasks there is no constraint, while when training on robot tasks a constraint is added so the output lies in the action space.
* can generalize to unseen views and objects.
### Reinforce Data, Multiply Impact: Improved Model Accuracy and Robustness with Dataset Reinforcement
Apple, 2023/8/19, 4pts
https://arxiv.org/pdf/2303.08983v2.pdf
Abstract: designed a distillation- and data-augmentation-based method to enhance the quality of CV datasets.
Intro:
* add data augmentation to the dataset
* add distillation information (the teacher model's outputs) to the dataset.
* a teacher model with higher accuracy is not always the best teacher; the student's performance should be checked directly.
### Scaling Up and Distilling Down: Language-Guided Robot Skill Acquisition
Shuran Song, 2023/8/1 7pts
https://www.cs.columbia.edu/~huy/scalingup/
Abstract: combine multi-task RL and an LLM. Use the LLM to give high-level plans. Use a sampling-based planner to generate multiple trajectories. Use these trajectories to train a language-conditioned policy. Train a network to detect failure; reset and retry when it fails.
Intro:
* want to achieve two goals
* generate as much language-trajectory pair data as possible: the LLM outputs high-level plans, and a sampling-based policy outputs low-level control to produce the trajectories.
* use these data to train a multi-task RL policy: use BC, but conditioned on language input and multi-task.
* step:
* step1: goal -> sub-goal:
* split the whole task into multiple sub-goals under the guidance of recursive LLM calls, building a goal tree.
* separate the task based on objects, e.g., "put A into B" should be separated into two parts.
* pros: LLMs provide general knowledge but cannot complete the task on their own
* step2: implement:
* use a sampling-based method to add randomness to the trajectory.
* follow the structure of the tree to complete the whole task.
* step3: verify & retry
* collect both successful and failed episodes
* train an inferred success function to check whether a task succeeded or failed.
* rerun the task with another seed, without resetting, so the agent can learn to recover from failure.
* step4: Language-conditioned Policy Distillation
* use a diffusion policy + imitation learning to get a multi-task policy, by adding language conditioning
* the policy transfers to the real world without domain randomization
* guess: imitation learning is good enough if the quality of the trajectories is good
* some note:
* how can failure trajectories be utilized in imitation learning?
* is retrying really important for an RL agent? How can this retry behavior be formulated mathematically?
* it seems a good general RL agent should take both images and language as input, so we had better know the state-of-the-art models for that kind of input-output.
### Language Reward Modulation for Pretraining Reinforcement Learning
Peter Abbeel, 23/8, 7pts
https://arxiv.org/pdf/2308.12270v1.pdf
Abstract: instead of using learned reward functions as the training signal for the downstream task, use Vision-Language Models to pre-train the RL policy in an unsupervised way.
Intro:
* learned reward functions are noisy; using them directly to train a downstream task may cause problems.
* learned reward functions do not need human effort to label them
* the pre-training phase has a lower requirement on the precision of the reward function.
Method:
* Reinforcement learning with vision-language reward:
* visual representations $F_φ(o_i)$ and text representations $L(x)$
* $r^{in}_t = D (F(o), L(x))$, $D(\cdot)$ means distance
* R3M:
* learn from large scale Ego4D dataset.
* trained to predict $G(F(o_1),F(o_i),L(x))$, representing whether the agent completed task description $x$ between timesteps $1$ and $i$
* use $G$ as reward model
* generate instructions
* query ChatGPT
* instructions may be human-centric, robot-centric, or ambiguous
* training
* pre-train: add an exploration reward(Plan2Explore)
* fine-tune: fix the language for the whole task
* note:
* "ego-centric" instructions may be important; otherwise there may be a domain shift between the pre-training and fine-tuning environments
* getting rewards rather than state representations from large language/vision models is interesting (but one has to be aware of noise from the generative reward model)
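A minimal sketch of the VLM-based reward described above: the reward is an alignment score $D(F(o_t), L(x))$ between the visual embedding and the text embedding. Cosine similarity as the choice of $D$, and the encoder interfaces, are my assumptions.

```python
# Minimal sketch (assumption, not the paper's exact code): a VLM-alignment reward
# r_t = D(F(o_t), L(x)), with cosine similarity standing in for the score D.
import torch
import torch.nn.functional as F

def vlm_reward(image_encoder, text_encoder, obs_batch, instruction_tokens):
    with torch.no_grad():
        img_emb = image_encoder(obs_batch)            # F(o): (B, d)
        txt_emb = text_encoder(instruction_tokens)    # L(x): (1, d)
    img_emb = F.normalize(img_emb, dim=-1)
    txt_emb = F.normalize(txt_emb, dim=-1)
    # Higher alignment between observation and instruction -> higher intrinsic reward.
    return (img_emb * txt_emb).sum(dim=-1)            # (B,)
```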
### Learning Dexterous Manipulation from Exemplar Object Trajectories and Pre-Grasps
Vikash Kumar, ICRA2023, 5pts
https://ieeexplore.ieee.org/stamp/stamp.jsp?arnumber=10161147
Abstract: an algorithm that can grasp various kinds of objects without per-task hyper-parameter tuning. The exploration strategy is induced by a surprisingly simple ingredient (a single pre-grasp pose).
Intro:
* PGDM: Accelerating Exploration w/ Pre-Grasps
* observation: decompose dexterous tasks into a “reaching stage” and a “manipulation stage”
* method: manually set the initial state of the system to the pre-grasp state.
* implementation: use a scene-agnostic trajectory optimizer to reach the pre-grasp state first, then solve the rest with PPO.
* TCDM
* 50 tasks, from (1) human MoCap recordings transferred to the robot via IK, (2) expert pre-grasps extracted from tele-op data, (3) manually labeled pre-grasps, and (4) learned pre-grasps generated conditioned on the object mesh
### Real World Offline Reinforcement Learning with Realistic Data Source
Vikash Kumar, ICRA2023, 4pts
https://ieeexplore.ieee.org/stamp/stamp.jsp?arnumber=10161474
Abstract: collect safe trajectories from different tasks. Use offline RL to train a multi-task agent. Use trajectories from other tasks as heterogeneous data.
Intro:
* problem:
* simulator data:
* hardware noises, varying reset conditions
* sub-optimal data:
* source: add noise to optimal data
* may be unsafe
* method:
* use trajectories from other tasks as heterogeneous data
* it is safe
* it is meaningful
* use offline RL algorithm
* use out-of-domain data
* result
* general
* behavior cloning demonstrates strong robustness to varying representations and in-domain tasks.
* Offline RL could outperform BC for tasks where the initial state distributions change during deployment.
* IQL has similar performance to BC on in-domain tasks
* Offline RL is good at utilizing heterogeneous data.
* note
* this work does not consider the safety problem it mentions in the introduction.
* how to capture the knowledge from different domains?
### Internally Rewarded Reinforcement Learning
x, ICML2023, 8pts
https://arxiv.org/pdf/2302.00270v3.pdf
Abstract: the noise of mutual-information-based reward functions may be non-negligible. Use a linear objective to replace the log one, and use a clipping method to stabilize the training process.
Intro:
* Linear Reward
* use $q(z|s) - p(z)$ instead of $\log q(z|s) -\log p(z)$ as the reward
* has lower error (in terms of expectation and variance)
* it in fact corresponds to the $\chi^2$-divergence (the original objective uses the KL-divergence)
* Clipped Reward
* $r = q(z|s) - p(z) \approx p(z|s)-p(z)$
* $p(z|s)$ should in general be greater than $p(z)$
* use a clipping trick to enforce this constraint, i.e., use $\max(q(z|s), p(z))\approx p(z|s)$
* note: a very engineering-oriented paper
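The two tricks above are simple enough to sketch directly; the snippet below just implements the formulas from the notes (illustrative, not the authors' code).

```python
# Minimal sketch of the two tricks above: replace the log-ratio reward with a
# linear one, and clip q(z|s) from below by p(z).
import torch

def log_ratio_reward(q_z_given_s, p_z):
    return torch.log(q_z_given_s) - torch.log(p_z)   # original KL-style reward

def linear_clipped_reward(q_z_given_s, p_z):
    # Clip: enforce q(z|s) >= p(z), since p(z|s) should generally exceed p(z).
    q_clipped = torch.maximum(q_z_given_s, p_z)
    # Linear (chi-squared flavoured) objective instead of the log one.
    return q_clipped - p_z
```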
### Neural Amortized Inference for Nested Multi-agent Reasoning
stanford, 2023/8/29, 7pts
https://arxiv.org/pdf/2308.11071v1.pdf
Abstract: try to model the opponent and infer their behavior.
Intro:
* use the concept of k-level thinking.
* at level k, $a_i\sim P(a|b_j^{k-1},o_i)$, where player j is player i's opponent, and b is the belief.
* the belief is derived in a recurrent way.
note:
* generally 2- to 3-level thinking performs well; can we use this as a prior to design our systems?
* how to utilize the recurrent nature of this kind of network; can we use something like fixed-point theory to solve this problem?
* how to ensure that you have a good model of the opponent? Can we benefit from the symmetry of the game, i.e., both agents using the same policy?
### Diffuser: Planning with Diffusion for Flexible Behavior Synthesis
ICML2022, 7pts
https://arxiv.org/pdf/2205.09991.pdf
Abstract: use a diffusion model to train a decision-making agent, outside the classical RL setting.
Intro:
* problem:
* recently, model-based RL runs model prediction and decision making separately.
* generally, this works like an adversarial game, because the RL agent wants to exploit the model.
* use a unified framework to combine the two processes.
* diffusion model
* use the diffusion model as a trajectory generator.
* input: all trajectories in the dataset; output: the entire trajectory, predicted simultaneously
* reward model:
* given a trajectory, we can compute the return it obtains.
* objective: $\tilde p(\tau)\propto p(\tau)h(\tau)$, where $p(\tau)$ is the probability that the trajectory is generated by the diffusion model
* note
* if we can generate a trajectory in one forward pass, then we don't need a value function; instead, we only need the reward function and can compute the true return by walking through the trajectory.
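One way to sample from $p(\tau)h(\tau)$ is return-guided denoising. The sketch below is illustrative only: `denoiser.p_mean_std` and the return predictor `J` are assumed interfaces, not the paper's code.

```python
# Minimal sketch: nudge each reverse-diffusion step with the gradient of a learned
# return predictor J(tau), so samples are biased toward high-return trajectories.
import torch

@torch.no_grad()
def guided_sample(denoiser, J, tau, timesteps, guide_scale=0.1):
    for t in reversed(range(timesteps)):
        mean, std = denoiser.p_mean_std(tau, t)        # standard reverse-diffusion step
        with torch.enable_grad():
            tau_req = tau.detach().requires_grad_(True)
            grad = torch.autograd.grad(J(tau_req).sum(), tau_req)[0]
        # Shift the mean toward higher predicted return (h(tau) ~ exp(J(tau))).
        tau = mean + guide_scale * (std ** 2) * grad + std * torch.randn_like(tau)
    return tau
```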
### Adversarial Style Transfer for Robust Policy Optimization in Deep Reinforcement Learning
Purdue, 23/8, 7pts
https://arxiv.org/pdf/2308.15550v1.pdf
Abstract: views RL training as an adversarial game: the generator transfers style and maximizes the entropy of the policy, while the discriminator (the policy) maximizes the reward.
Intro:
* overview:
* the generator changes the style of input images
* the discriminator (policy) should output the same action distribution for images before and after style transformation.
* generator:
* a pre-trained model is used to reduce data dimensionality.
* a Gaussian Mixture Model is used to cluster the data into n clusters.
* use a GAN to train a generator and a discriminator, where the generator should fool the discriminator and the RL policy simultaneously
* RL policy
* minimize $KL(\pi_\theta(\cdot|x_t), \pi_\theta(\cdot|x'_t))$, where $x_t$ and $x'_t$ are the image before and after style transformation (a minimal sketch appears after the notes below).
* note:
* viewing data augmentation as a style transformation is interesting.
* They define styles by clustering observations in the dataset, which can be somewhat narrow. We can define styles as trajectories or solutions for a given problem, where the same problem may have multiple solutions (styles). In this case, what should be fixed is the target state, not the policy distribution. In this way, we can also leverage methods similar to this work to enhance stability.
* Convergence properties can be analyzed within the framework of game theory.
* we have to pay attention to the adversarial relationship between the RL agent and the environment.
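The consistency objective referenced above is simple to write down. In the sketch below, `policy` returning a torch distribution and `stylize` (the generator) are assumed components.

```python
# Minimal sketch: KL between the policy's action distributions on an observation
# and on its style-transferred counterpart.
import torch
from torch.distributions import kl_divergence

def style_consistency_loss(policy, stylize, obs):
    dist_orig = policy(obs)                       # pi_theta(. | x_t)
    with torch.no_grad():
        obs_styled = stylize(obs)                 # generator output x'_t
    dist_styled = policy(obs_styled)              # pi_theta(. | x'_t)
    return kl_divergence(dist_orig, dist_styled).mean()
```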
### Lifelike Agility and Play on Quadrupedal Robots using Reinforcement Learning and Generative Pre-trained Models
Tencent, 23/8, 8pts
https://arxiv.org/pdf/2308.15143v1.pdf
Abstract: use a large-scale pre-trained model to learn environment-agnostic knowledge from animal videos, then fine-tune on downstream tasks and deploy on quadruped robots.
Intro:
* overall
* there are three training stages; each is independent, and they go from general to task-specific.
* PMC learns a latent representation of actions from videos.
* EPMC learns multiple environment-specific decoders by interacting with the simulator and uses multi-expert distillation to compress them into one model
* SEPMC adds a high-level strategy policy on top (details below)
* PMC
* use multiple cameras to capture a Labrador's locomotion and behavior
* use inverse kinematics + pose estimation to obtain the labeled dataset
* use a VAE-like structure to learn a policy
* encoder: $P(z|s^p, s^f)$, where $z$ is a latent vector, $s^p$ is the current observation, and $s^f$ is the future trajectory in the video
* discrete latent space: $z^e = \arg\min_i \|z-z_i\|$
* decoder: $P(a|z^e,s^p)$, where $a$ is a desired joint position, executed by a PD controller.
> why use a discrete latent space?
> outputting positions rather than torques is better.
> the observation should include history
* prioritized sampling: give behaviors that are rare in the dataset more weight
* EPMC
* flat terrain: use GAIL (imitation learning) to train the agent to follow the demonstrations
* stairs: add a residual network on top of the original decoder
* others: use a hyper-network to output the weights of the latent code; the goal is to follow an average velocity
* multi-expert distillation: uniformly sample tasks; if a new task arrives, train a task-specific network and distill it into the main network
* SEPMC
* a high-level network takes the opponent's info and the map as input
* use Prioritized FSP (PFSP) to train on the adversarial game.
* note
* the hierarchical structure is interesting, flexible, and reasonable. Should we split it in a different way?
* multi-expert is a flexible design.
* the lower-level network is fixed while training the high-level policy; can we optimize them all together?
* learning from real animals is interesting; can we discover something by analysing the latent space in step 1? Why do they choose a discrete latent space? I guess the reason is that, in step 2, it has to output a discrete distribution. Also, can we learn from wild data, like videos on YouTube?
* The agents are able to move on unseen terrain even though the training dataset only contains trajectories on flat terrain, which is surprising. I think a network that outputs residual positions is necessary in this domain-shift case.
* using a VAE instead of an imitation learning + fine-tuning pipeline is interesting.
### DREAM TO CONTROL: LEARNING BEHAVIORS BY LATENT IMAGINATION
google, ICLR2020 oral, 6pts
https://arxiv.org/pdf/1912.01603.pdf
Abstract: learn a world model to tackle tasks with image input.
Intro:
* pipeline:
* execute policy in the world to obtain a dataset.
* learn world model from the dataset
* train policy/value function by querying world model
* world model
* The following two networks are required for the step-3 training
* Transition model: $q(s_t|s_{t-1}, a_{t-1})$
* Reward model: $q(r_t|s_t)$
* pros
* the agent can make decisions without image observations, which is efficient
* an RNN implements the transition model, enabling long horizons
* x: Representation model: $p(s_t|s_{t-1},a_t,o_t)$
> $s_t$ are Markovian
* objective
* $\max I(s_\tau;o_\tau,r_\tau|a)-\beta I(s_\tau;o_\tau)$
* lower bound
* $\max E[\sum_t \ln q(o_t|s_t)]$
* $\max E[\sum_t \ln q(r_t|s_t)]$
* $\min E[\sum_t D_{KL}(p(s_{t+1}|s_t,a_t,o_{t+1})\,\|\,q(s_{t+1}|s_t,a_t))]$
* meaning
* the latent space encodes the whole trajectory rather than just the current observation.
* try to use as little information from the observation as possible.
* note
* $q(s_{t+1}|s_t,a_t)$ seems weird: it is open-loop control, so the state prediction is not corrected by the current observation. I do not agree with this model-based setting; the error may accumulate if the trajectory is very long. But the point that the representation should not rely heavily on the current input is good.
* Will the gradient flow from the $s$ in $q(r|s)$ back to the encoder $q(s|o)$? How does it truly work?
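The three lower-bound terms listed above can be written as one world-model loss. The sketch below is illustrative, assuming the decoders and the posterior/prior are torch distributions; it is not Dreamer's actual implementation.

```python
# Minimal sketch: observation reconstruction + reward prediction + KL between the
# posterior p(s_{t+1}|s_t,a_t,o_{t+1}) and the prior q(s_{t+1}|s_t,a_t).
import torch
from torch.distributions import kl_divergence

def world_model_loss(posterior, prior, obs_decoder, reward_decoder,
                     states, obs, rewards, beta=1.0):
    # states are sampled from the posterior along the sequence (time-major here).
    recon_ll = obs_decoder(states).log_prob(obs).sum(dim=0).mean()        # E[sum_t ln q(o_t|s_t)]
    reward_ll = reward_decoder(states).log_prob(rewards).sum(dim=0).mean()  # E[sum_t ln q(r_t|s_t)]
    kl = kl_divergence(posterior, prior).sum(dim=0).mean()                # E[sum_t KL(p || q)]
    return -(recon_ll + reward_ll) + beta * kl
```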
### EFFICIENT RLHF: REDUCING THE MEMORY USAGE OF PPO
microsoft, 23/9, 4pts
https://arxiv.org/pdf/2309.00754v1.pdf
Abstract: PPO uses 3x the memory of supervised learning. In this work, they try to minimize memory usage while maintaining performance.
Intro:
* reduce memory/computation cost by sharing models between the Reference/Reward models and the Actor/Critic models.
* $\pi_\theta$ and $\pi_{ref}$:
* they are both loaded from the same pre-trained model
* $\pi_\theta$ uses LoRA
* the pre-trained model can be loaded once: turn LoRA on while training $\pi_\theta$, and turn it off when using $\pi_{ref}$
* the actor and critic can also share the pre-trained model by using different LoRA modules.
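A toy sketch of the sharing trick above: one frozen pre-trained weight serves both $\pi_\theta$ and $\pi_{ref}$, and a LoRA branch is switched on for the actor path and off for the reference path. This is my own illustration, not the paper's implementation.

```python
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    def __init__(self, base: nn.Linear, rank: int = 8):
        super().__init__()
        self.base = base
        self.base.weight.requires_grad_(False)         # shared, frozen pre-trained weight
        self.A = nn.Parameter(torch.randn(rank, base.in_features) * 0.01)
        self.B = nn.Parameter(torch.zeros(base.out_features, rank))  # zero init -> no-op at start
        self.lora_enabled = True

    def forward(self, x):
        out = self.base(x)
        if self.lora_enabled:                           # actor path (pi_theta)
            out = out + x @ self.A.t() @ self.B.t()
        return out                                      # reference path (pi_ref) when disabled
```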
### Double Clipping: Less-Biased Variance Reduction in Off-Policy Evaluation
amazon, 23/9, 7pts
https://arxiv.org/pdf/2309.01120v1.pdf
Abstract: clipping importance weights reduces variance at the cost of increased bias, and the bias is always downward. This work designs a method to compensate for the bias while keeping the variance low.
Intro:
* the clipping bias is defined as:
* $b = E_{\pi}[\mathbb{I}(\frac{\pi}{\pi_0}>U)(\frac{U}{\pi/\pi_0}-1)E[r]]$
* once $r\geq 0$, the bias is always negative.
* note
* the clipping operator in PPO in fact makes the policy underestimate the value.
* underestimation is a good property for the heuristic function in the A* algorithm.
* this requires the reward function to be $\geq 0$
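A small numeric sketch of the point above (my own illustration): clipping importance weights at $U$ lowers variance but, when rewards are non-negative, biases the estimate downward.

```python
import numpy as np

def ips_estimates(pi_ratio, rewards, U=10.0):
    """pi_ratio = pi(a|x) / pi_0(a|x) for logged (x, a); rewards assumed >= 0."""
    vanilla = np.mean(pi_ratio * rewards)                    # unbiased, high variance
    clipped = np.mean(np.minimum(pi_ratio, U) * rewards)     # lower variance, biased down
    # Empirical downward bias introduced by clipping:
    bias = np.mean((np.minimum(pi_ratio, U) - pi_ratio) * rewards)
    return vanilla, clipped, bias
```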
### NEUROEVOLUTION IS A COMPETITIVE ALTERNATIVE TO REINFORCEMENT LEARNING FOR SKILL DISCOVERY
instadeep, 23/9, 7pts
https://arxiv.org/pdf/2210.03516v4.pdf
Intro:
* contribution: proposed three benchmarks to test QD algorithms.
* compares 4 mutual-information-based methods with 4 QD-based methods.
* no algorithm significantly outperforms the others.
### A Survey on Transformers in Reinforcement Learning
Tencent, 23/9, 7pts
https://arxiv.org/pdf/2301.03044v3.pdf
* representation learning
* AlphaStar: multi-head dot-product attention
* multi-entity
* multi-modal
* ViT
* temporal sequence
* encoding the trajectory
* While Transformer outperforms LSTM/RNN as the memory horizon grows and parameter scales, it suffers from poor data efficiency with RL signals
* model learning
* transformer based world model better than Dreamer's
* sequential decision-making
* offline RL
* not for online at this moment
* generalist agents
* large scale multi-task Dataset
* Prompt-based Decision Transformer: samples a sequence of transitions from the few-shot demonstration dataset as prompt
* Gato, RT-1: large-scale multi-modal datasets
* it is beneficial to fine-tune DT with a Transformer pre-trained on language datasets or multi-modal datasets containing a language modality.
* perspectives
* online sequential decision making
* transformer is originally designed for the text sequence.
* general agent, general world model
* similarity/difference with diffusion model
* notes
* adding a transformer in the right way can add inductive bias to the model
* that language-pretrained models help sequential decision processes is interesting.
* transformers fail to learn unstable objectives: the world model can be learned but the value function cannot.
* using a transformer to encode a sequence of partially observable observations into a global state is promising.
### Human-Timescale Adaptation in an Open-Ended Task Space
Deepmind, 23/10, 6pts
https://arxiv.org/pdf/2301.07608.pdf?trk=public_post_comment-text
Abstract: uses auto-curriculum learning, meta-RL, distillation, and a large model size (transformer).
Detail:
* META RL
* RL²-based method. The agent's memory is reset each trial; the return is not truncated at the end of each episode.
* curriculum learning
* select “interesting” tasks at the frontier of the agent’s capabilities. There are two ways:
* no-op: compare the current policy with a no-operation policy; select the task once some condition is met.
* Prioritised Level Replay: a fitness score that approximates the agent’s regret for a given task
* RL
* use the transformer to output the next n tokens and update them.
* Memory:
* RNN with attention: stores a number of past activations and attends over them, using the current hidden state as the query.
* Transformer-XL: a variant of the transformer that allows longer inputs; sub-sampling the sequence can further extend the context length.
* Distillation
* train a smaller teacher model
* the main model is the student; it is larger than the teacher and is trained with the same hyper-parameters.
* Exp result
* scaling the agent's network size/memory length (multiple episodes, i.e., the many-shot setting) improves performance
* scaling task distribution and complexity improves performance
### A survey of inverse reinforcement learning
x, 22/2, 6pts
https://link.springer.com/content/pdf/10.1007/s10462-021-10108-x.pdf
## First Time
### Beyond Black-Box Advice: Learning-Augmented Algorithms for MDPs with Q-Value Predictions
CUHK 2023/7/22 6pts
https://arxiv.org/pdf/2307.10524v1.pdf
Abstract: how to use additional information to advise the training of Q-value.
### On the Convergence of Bounded Agents
DeepMind 2023/7/22 4pts
https://arxiv.org/pdf/2307.11044v1.pdf
Abstract: It is easy to define convergence for an environment, but how should convergence of an agent be defined?
### A Definition of Continual Reinforcement Learning
DeepMind 2023/7/22 4pts
https://arxiv.org/pdf/2307.11046v1.pdf
Abstract: give continual reinforcement learning a definition.
### Leveraging Offline Data in Online Reinforcement Learning
UW 2023/7/22 5pts
https://arxiv.org/pdf/2211.04974v2.pdf
Abstract: how to combine online and offline data to accelerate training process
### Offline Reinforcement Learning with Closed-Form Policy Improvement Operators
UCSB 2023/7/24 4pts
https://arxiv.org/pdf/2211.15956v3.pdf
Abstract: when using constrained optimization, a closed-form policy improvement operator can be derived via a linear (first-order) approximation of the policy objective.
### Provable Reset-free Reinforcement Learning by No-Regret Reduction
Micorsoft 2023/7/24 6pts ICML
https://arxiv.org/pdf/2301.02389v3.pdf
Abstract: formulates reset-free RL as a two-player zero-sum game to ensure that the policy avoids resets while achieving optimal performance.
### Toward Efficient Gradient-Based Value Estimation
Sutton 2023/7/24 5pts
https://arxiv.org/pdf/2301.13757v3.pdf
Abstract: gradient-based RL algorithms are often slower than TD-based methods. The paper makes the value-function update approximately follow the Gauss-Newton direction; this keeps the condition number of the Hessian low and thus accelerates training.
### HINDSIGHT-DICE: STABLE CREDIT ASSIGNMENT FOR DEEP REINFORCEMENT LEARNING
Stanford, 2023/8/18, 5pts
Abstract: adapts existing importance-sampling ratio estimation techniques from off-policy evaluation to drastically improve the stability and efficiency of so-called hindsight policy methods; focuses on credit assignment.
### Parallel Q-Learning: Scaling Off-policy Reinforcement Learning under Massively Parallel Simulation
MIT 2023/7/24 5pts
https://arxiv.org/pdf/2307.12983v1.pdf
Abstract: large scale off-policy RL framework
### A Connection between One-Step RL and Critic Regularization in Reinforcement Learning
levine 2023/7/24 5pts
https://arxiv.org/pdf/2307.12968v1.pdf
Abstract: theoretically shows that one-step RL is roughly equivalent to critic-regularized multi-step RL (as used in offline RL), and empirically supports the observation.
### Pixel to policy: DQN Encoders for within & cross-game reinforcement learning
UCSD 2023/8/1 3pts
https://arxiv.org/pdf/2308.00318v1.pdf
Abstract: uses limited data and transfer learning to train an agent that can play Atari games.
### MADIFF: Offline Multi-agent Learning with Diffusion Models
SJTU 2023/8/15 4pts
https://arxiv.org/pdf/2305.17330v2.pdf
Abstract: naively combines diffusion models with offline multi-agent RL; also designs some customized network structures.
### CaRT: Certified Safety and Robust Tracking in Learning-based Motion Planning for Multi-Agent Systems
CalTech 2023/8/15 5pts
https://arxiv.org/pdf/2307.08602v2.pdf
Abstract: designs a hierarchical model to deal with safety in multi-agent path finding; it either projects the nonlinear system back to a safe linear system or filters out bad trajectories via the hierarchical structure.
### Model-Based Safe Reinforcement Learning with Time-Varying State and Control Constraints: An Application to Intelligent Vehicles
IEEE member 2023/8/15 3pts
Abstract: a barrier-force-based control policy structure for safety; a multi-step policy evaluation mechanism is employed for the time-varying constraints.
### Generating Personas for Games with Multimodal Adversarial Imitation Learning
2023/8/16 4pts
Abstract: multi-modal GAIL; trains multiple reward functions (discriminators) and uses RL to exploit them
### Principles and Guidelines for Evaluating Social Robot Navigation Algorithms
collaborate, 2023/8/16 6pts
https://arxiv.org/pdf/2306.16740.pdf
Abstract: social robot navigation means navigation in human-populated environments. Contributions: metrics (making it easier to compare results across different simulators, robots, and datasets) and the development of scenarios, benchmarks, datasets, and simulators.
### Deep Reinforcement Learning with Multitask Episodic Memory Based on Task-Conditioned Hypernetwork
Beijing Post and tele, 2023/8/16, 4pts
https://arxiv.org/pdf/2306.10698.pdf
Abstract: selecting the most relevant past experiences for the current task, and integrate such experiences into the decision network
### Policy Regularization with Dataset Constraint for Offline Reinforcement Learning
ICML2023 2023/8/16 6pts
https://arxiv.org/pdf/2306.06569.pdf
Abstract: offline RL is too conservative. Instead, regularize the policy towards the nearest state-action pair; this is a softer constraint but still keeps enough conservatism against out-of-distribution actions.
* Detail:
* distance: $dist((s,a),D) = \min_{(s',a')\in D} dist(s,s') + \beta dist(a,a')$
* loss: $\min_\theta L(\theta) = E_{s\sim D}[dist\big((s, \pi_\theta(s)), D\big)]$
* theoretical result: with Lipschitz assumption, once max distance of (s,a) is bounded $\max dist((s,\pi(\cdot |s)),D)<\epsilon$, then $|Q(s, \pi(\cdot|s))-Q(s, \mu(\cdot|s))|<K\epsilon$ is bounded.
* note:
* $\max L(\theta) \text{ s.t. } dist(\theta, D)<\delta$, with a point-wise distance.
* it changes the actor loss to explicitly constrain the policy. I think a distance between policies is meaningless, while a critic distance is explainable; thus, I do not like this method.
* offline RL should encourage the agent to use some OOD (s,a); otherwise it just has to do a "re-pairing" operation within the original dataset.
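The distance and loss above can be implemented with a nearest-neighbour lookup over the dataset. The KD-tree choice and all names below are my assumptions, not the paper's code.

```python
# Minimal sketch: penalize the distance from (s, pi(s)) to its nearest neighbour in
# the dataset, searched over the concatenated (s, beta*a) space.
import numpy as np
import torch
from scipy.spatial import cKDTree

def build_index(states, actions, beta=1.0):
    return cKDTree(np.concatenate([states, beta * actions], axis=1))

def dataset_constraint_loss(tree, s, a_pi, beta=1.0):
    query = torch.cat([s, beta * a_pi], dim=1).detach().cpu().numpy()
    _, idx = tree.query(query)                     # nearest (s', a') in the dataset
    neighbours = torch.as_tensor(tree.data[idx], dtype=a_pi.dtype, device=a_pi.device)
    # Gradient flows through the differentiable distance to the retrieved neighbour;
    # the tree itself is only used for the lookup.
    return ((torch.cat([s, beta * a_pi], dim=1) - neighbours) ** 2).sum(dim=1).mean()
```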
### CHALLENGES AND OPPORTUNITIES OF USING TRANSFORMER-BASED MULTI-TASK LEARNING IN NLP THROUGH ML LIFECYCLE: A SURVEY
doxray(company) 2023/8/17 4pts
https://arxiv.org/pdf/2308.08234v1.pdf
Abstract: systematically explores multi-task NLP training and connects it to continual learning.
### RoboAgent: Towards Sample Efficient Robot Manipulation with Semantic Augmentations and Action Chunking
CMU+Meta 2023/8/18 8pts
https://robopen.github.io
Abstract: built a large-scale robot system, MT-ACT, aiming to train a universal agent that can handle multiple tasks. By using semantic augmentation and action chunking as the action representation, the agent is able to learn from only 7,500 trajectories.
### WizardMath: Empowering Mathematical Reasoning for Large Language Models via Reinforced Evol-Instruct
x, 2023/8/18, 5pts
https://arxiv.org/pdf/2308.09583v1.pdf
Abstract: uses the Reinforcement Learning from Evol-Instruct Feedback (RLEIF) method to fine-tune LLaMA-2
### CoMIX: A Multi-agent Reinforcement Learning Training Architecture for Efficient Decentralized Coordination and Independent Decision Making
UCL, 2023/8/19, 6pts
https://arxiv.org/pdf/2308.10721v1.pdf
Abstract: Co-Qmix. decentralized, flexible policies. Allow agents to communicate with each other.
### DPMAC: Differentially Private Communication for Cooperative Multi-Agent Reinforcement Learning
SJTU, 2023/8/19, 4pts
https://arxiv.org/pdf/2308.09902v1.pdf
Abstract: teaches agents to collaborate while preserving private info; uses game theory to prove its effectiveness.
### Never Explore Repeatedly in Multi-Agent Reinforcement Learning
趙千川, 2023/8/19, 3pts
https://arxiv.org/pdf/2308.09909v1.pdf
Abstract: addresses the problem of revisitation, where an agent repeatedly visits an area. Proposes a dynamic reward-scaling approach to stabilize fluctuations in intrinsic rewards in previously explored areas.
* Detail
* reward = $r_{extrinsic} + \alpha r_{intrinsic}$, where $\alpha$ should be dynamically adjusted (small in well-explored areas, large in unfamiliar areas)
* store all the visited observations in $D$; the uncertainty of observation $o$ is defined as $\min_{o'\in D} dist(o,o')$
* use CDS-like intrinsic rewards
* note
* storing all observations seems impractical.
* like a combination of CDS and RND (re-weighting the CDS rewards by the exponential of the RND reward may give similar performance)
### Reinforced Self-Training (ReST) for Language Modeling
Deepmind, 2023/8/19, 5pts
https://arxiv.org/pdf/2308.08998v2.pdf
Abstract: generate data and use it to train an offline policy. It is sample-efficient because the data can be reused. Tested on LLM tasks.
* Detail:
* Grow step, a policy generates a dataset.
* Improve step, the filtered dataset is used to fine-tune the policy.
* both steps are repeated; the Improve step is repeated more frequently to amortize the dataset-creation cost.
* note:
* the semi-supervised approach works when generating the supervised signal is much faster than running the simulation.
* Dreamer-like model-based RL methods may be viewed as semi-supervised methods.
### Continual Learning as Computationally Constrained Reinforcement Learning
Stanford, 2023/8/19, 4pts
https://arxiv.org/pdf/2307.04345v2.pdf
Abstract: an introduction to continual learning, viewing it as an RL task.
### Some Supervision Required: Incorporating Oracle Policies in Reinforcement Learning via Epistemic Uncertainty Metrics
x, 2023/8/19, 6pts
https://arxiv.org/pdf/2208.10533v3.pdf
Abstract: proposes Critic Confidence Guided Exploration to incorporate an oracle policy into the RL model, using an epistemic uncertainty estimate. The agent takes the oracle policy's actions as suggestions and incorporates this information into the learning scheme when uncertainty is high.
* Detail: use a UCB method to get $Q_{UCB}$; if the potential improvement $\frac{|Q_{UCB}^{oracle}-Q_{UCB}^{\pi}|}{Q^\pi}$ is greater than a threshold, use the oracle action, otherwise use $a\sim \pi(\cdot|s)$
* note
* setting: online RL with expert data available.
* incorporating UCB with imitation learning is interesting.
* expert data gives us a biased Q-value approximation. During online fine-tuning we aim to explore those overestimated (s,a). With UCB we can filter for the trajectories with the highest upper-bound Q-values; without UCB, we may be interested in every trajectory whose approximate Q-value is higher than the expert data's, and thus deviate onto the wrong trajectory at the beginning of the episode.
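A minimal sketch of the gating rule described in the Detail bullet above: take the oracle's suggestion only when the UCB-estimated potential improvement exceeds a threshold. The ensemble interface `q_ensemble` returning (mean, std) is an assumption.

```python
import torch

def select_action(q_ensemble, policy, oracle, s, kappa=1.0, threshold=0.1):
    a_pi = policy(s)
    a_oracle = oracle(s)
    def q_ucb(a):
        mean, std = q_ensemble(s, a)              # epistemic uncertainty from the ensemble
        return mean + kappa * std
    q_pi = q_ucb(a_pi)
    improvement = (q_ucb(a_oracle) - q_pi).abs() / q_pi.abs().clamp(min=1e-6)
    # Follow the oracle's suggestion only where the estimated improvement is large.
    return torch.where(improvement.unsqueeze(-1) > threshold, a_oracle, a_pi)
```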
### FoX: Formation-aware exploration in multi-agent reinforcement learning
x, 2023/8/29, 2pts
https://arxiv.org/pdf/2308.11272v1.pdf
Abstract: the exploration problem in MARL; relating a state to previous states in the exploration space can reduce the exploration space.
Exp: poor experimental results!
### Active Exploration for Inverse Reinforcement Learning
ETH, 2023/8/29, 5pts
https://arxiv.org/pdf/2207.08645v4.pdf
Abstract: provides sample-complexity bounds for IRL that do not require a generative model of the environment.
### Lifelong Multi-Agent Path Finding in Large-Scale Warehouses
Jiaoyang Li, 2021,
Intro:
* MAPF
* Multi-Agent Path Finding (MAPF): moving a team of agents from their start locations to their goal locations while avoiding collisions.
* lifelong MAPF: after an agent reaches its goal location, it is assigned a new goal location and required to keep moving
* views lifelong MAPF as a rolling-window version of MAPF.
### Identifying Reaction-Aware Driving Styles of Stochastic Model Predictive Controlled Vehicles by Inverse Reinforcement Learning
Arizona, 23/8, 2pts
https://arxiv.org/pdf/2308.12069v1.pdf
Abstract: uses inverse RL to model the behavior pattern of the opponent; focuses too much on how to model the driving scenario.
### E(3)-Equivariant Actor-Critic Methods for Cooperative Multi-Agent Reinforcement Learning
USC, 23/8, 2pts
https://arxiv.org/pdf/2308.11842v1.pdf
Abstract: leverages the symmetric nature of MARL problems.
### MARLlib: A Scalable and Efficient Library For Multi-agent Reinforcement Learning
Yaodong Yang, 23/8, 6pts
https://arxiv.org/pdf/2210.13708v3.pdf
Abstract: 1) a standardized multi-agent environment wrapper, 2) an agent-level algorithm implementation, and 3) a flexible policy mapping strategy
Intro:
* collect data at agent level
* train policy at agent level
* share parameter/ group parameter/ independent parameter
### An Efficient Distributed Multi-Agent Reinforcement Learning for EV Charging Network Control
x, 23/8, 2pts
https://arxiv.org/pdf/2308.12921v1.pdf
Abstract: uses CTDE to solve an electric-vehicle charging problem
### DIFFUSION POLICIES AS AN EXPRESSIVE POLICY CLASS FOR OFFLINE REINFORCEMENT LEARNING
UTAustin, 23/8, 6pts
https://arxiv.org/pdf/2208.06193v3.pdf
Abstract: Diffusion Q-learning. Previous methods are constrained by policy classes with limited expressiveness; a diffusion model can learn multi-modal distributions effectively.
Intro:
* problem
* policy classes are not expressive enough; most are Gaussian distributions
* offline datasets are often collected by a mixture of policies
* use a diffusion model, which is expressive enough
### Map-based experience replay: a memory-efficient solution to catastrophic forgetting in reinforcement learning
x, 23/8, 6pts
https://arxiv.org/pdf/2305.02054v2.pdf
Abstract: reduce the size of memory by merging similar samples
### BarlowRL: Barlow Twins for Data-Efficient Reinforcement Learning
x, 23/8, 5pts
https://arxiv.org/pdf/2308.04263v2.pdf
Abstract: combines Barlow Twins (a self-supervised method) with RL
### Examining Policy Entropy of Reinforcement Learning Agents for Personalization Tasks
https://arxiv.org/pdf/2211.11869v3.pdf
Abstract: PG tends to end up with lower entropy, while Q-learning does not.
### Improving Reinforcement Learning Training Regimes for Social Robot Navigation
x, 23/8, 5pts
https://arxiv.org/pdf/2308.14947v1.pdf
Abstract: uses a curriculum-learning method to achieve better generalization performance.
### Statistically Efficient Variance Reduction with Double Policy Estimation for Off-Policy Evaluation in Sequence-Modeled Reinforcement Learning
x, 23/8, 6pts
https://arxiv.org/pdf/2308.14897v1.pdf
Abstract: implements the importance-sampling trick in the offline RL setting, by maintaining a behavior policy and letting the policy output a distribution.
### Cyclophobic Reinforcement Learning
x, 23/8, 6pts
https://arxiv.org/pdf/2308.15911v1.pdf
Abstract: adds inductive bias to help exploration; it does not reward novelty but punishes redundancy by avoiding cycles
### Policy composition in reinforcement learning via multi-objective policy optimization
Deepmind, 23/8, 6pts
https://arxiv.org/pdf/2308.15470v2.pdf
Abstract: learn from multiple well-trained teacher models. Formulated as a multi-objective problem, where the agent can select teacher models and decide whether to use them.
### RePo: Resilient Model-Based Reinforcement Learning by Regularizing Posterior Predictability
UW, 23/9, 7pts
https://arxiv.org/pdf/2309.00082.pdf
Abstract: vision-based RL agents are easily distracted by perturbations of the environment. Proposes a method to learn from dynamics and reward rather than observations. Also designs a method for quick adaptation to handle significant domain shift.
Intro:
* spurious variance
* spurious variance = task-irrelevant observations
* self-supervised pre-trained models do not know the downstream task, so it is hard for them to distinguish task-irrelevant observations
* objective
* $\max I(z,r|a)$, $\min I(z,o|a)$
* information bottleneck
* action policy conditioned on latent state $z$ rather than ground truth state $s$
* implementation details
* $I(z_\tau,r_\tau|a_\tau)\geq E[\sum_t\log q(r_t|z_t)]$
> encourages a latent representation that improves performance
* $I(z_\tau,o_\tau|a_\tau)\leq E[\sum_t D_{KL}(p_{z_{t+1}}(\cdot|z_t,a_t,o_t)||q_{z_{t+1}}(\cdot|z_t,a_t))]$
> use $q$, which is not conditioned on observations, to generate the next $z$
* trick:
* $D_{KL}(p \| q) = \alpha D_{KL}(\lfloor p\rfloor \| q) + (1-\alpha) D_{KL}(p \| \lfloor q\rfloor)$, where $\lfloor\cdot\rfloor$ denotes the stop-gradient operator
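A minimal sketch of the stop-gradient (KL-balancing) trick above, assuming Gaussian posterior/prior parameters; this is an illustration of the formula, not the paper's code.

```python
import torch
from torch.distributions import Normal, kl_divergence

def balanced_kl(post_mean, post_std, prior_mean, prior_std, alpha=0.8):
    post = Normal(post_mean, post_std)
    prior = Normal(prior_mean, prior_std)
    post_sg = Normal(post_mean.detach(), post_std.detach())     # ⌊p⌋
    prior_sg = Normal(prior_mean.detach(), prior_std.detach())  # ⌊q⌋
    # alpha * KL(⌊p⌋ || q) trains the prior; (1 - alpha) * KL(p || ⌊q⌋) trains the posterior.
    return (alpha * kl_divergence(post_sg, prior)
            + (1 - alpha) * kl_divergence(post, prior_sg)).mean()
```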
### The Role of Diverse Replay for Generalisation in Reinforcement Learning
x, 23/9, 6pts
https://arxiv.org/pdf/2306.05727v2.pdf
Abstract: defines "reachable" as a state with $\rho_\pi(s)>0$, and analyses the relationship between reachability and the generality of the policy.
* note: a new quantitative way to analyse generality.
### MULTI-OBJECTIVE DECISION TRANSFORMERS FOR OFFLINE REINFORCEMENT LEARNING
x, 23/9, 5pts
https://arxiv.org/pdf/2308.16379v1.pdf
Abstract: improve decision transformer to deal with multi-objective problems.
### GNFactor: Multi-Task Real Robot Learning with Generalizable Neural Feature Fields
xiaolong wang, corl, 5pts
https://arxiv.org/pdf/2308.16891v2.pdf
Abstract: use LLMs and voxel to help robot learning.
### RLAIF: Scaling Reinforcement Learning from Human Feedback with AI Feedback
google, 23/9, 7pts
https://arxiv.org/pdf/2309.00267v1.pdf
Abstract: reinforcement learning from ai feedback.
Intro:
* Position bias: the order of the choices may influence the result.
* prompt ends with: "Consider the coherence, accuracy, coverage, and overall quality of each summary and explain which one is better. Rationale:"
* self-consistency: sampling multiple reasoning paths
* train the RM with an AI labeler; it is a kind of distillation. One could bypass the RM, but the RM is smaller than the labeler model.
### Task Aware Dreamer for Task Generalization in Reinforcement Learning
x, 23/9, 5pts
https://arxiv.org/pdf/2303.05092v2.pdf
Abstract: Task Aware Dreamer (TAD), let the policy know about the task it is solving.
### Leveraging Prior Knowledge in Reinforcement Learning via Double-Sided Bounds on the Value Function
x, 23/9, 5pts
https://arxiv.org/pdf/2302.09676v2.pdf
Abstract: get some good property by clipping the value function.
### Robust Quadrupedal Locomotion via Risk-Averse Policy Learning
x, 23/9, 5pts
https://arxiv.org/pdf/2308.09405v2.pdf
Abstract: measure the potential risk and quick adapt to it.
### Learning Shared Safety Constraints from Multi-task Demonstrations
CMU, 23/9, 6pts
https://arxiv.org/pdf/2309.00711v1.pdf
Abstract: formulates the problem as a two-player zero-sum game: one player optimizes rewards subject to the constraint, the other outputs the constraint.
Intro:
* given an expert dataset and a reward function for multi-task RL, the shared constraint may be implicitly present in the expert dataset; try to recover it.
* formulate the problem as a two-player zero-sum game: one player optimizes rewards subject to the constraint, while the other designs a constraint that is consistent with the expert data.
* extended to multi-task: one agent per task maximizes its reward function
* IRL:
* goal: learn a policy that performs as well as the expert's, no matter the true reward function
* objective: $\min_\pi\max_R J(\pi,R)-J(\pi_{expert}, R)$
> the only reward function that actually makes the expert optimal is zero everywhere.
* CRL
* goal: given a reward function and a constraint function, maximize the reward subject to violating the constraint less than some threshold.
* objective: $\max_\pi J(\pi, r) \text{ s.t. } J(\pi,c)<\delta$
* final algorithm
* $\max_\pi J(\pi,r) \text{ s.t. } \max_c J(\pi,c)- J(\pi_{expert}, c)\leq 0$
* meaning: for every constraint function, the learned policy is always safer than the expert policy.
### RL + Model-based Control: Using On-demand Optimal Control to Learn Versatile Legged Locomotion
ETHZ, 23/9, 6pts
https://arxiv.org/pdf/2305.17842v3.pdf
Abstract: uses optimal control to generate data and imitation learning to learn from it; combines this with RL to obtain a simplified but robust model of quadruped locomotion.
### Efficient RL via Disentangled Environment and Agent Representations
deepak, 23/9, 6pts
https://arxiv.org/pdf/2309.02435v1.pdf
Abstract: decouples RL-agent learning and world-model learning, from a CV perspective.
### Marginalized Importance Sampling for Off-Environment Policy Evaluation
x, 23/9, 6pts
https://arxiv.org/pdf/2309.02157v1.pdf
Abstract: combines online RL (in a simulator) with offline data collected from real robots.
### ORL-AUDITOR: Dataset Auditing in Offline Deep Reinforcement Learning
x,23/9,3pts
https://arxiv.org/pdf/2309.03081v1.pdf
Abstract: uses rewards to determine which dataset a trajectory in an offline dataset came from.
### Grounding Large Language Models in Interactive Environments with Online Reinforcement Learning
Hugging face, 23/9, 5pts
https://arxiv.org/pdf/2302.02662v3.pdf
Abstract: uses LLMs to play an interactive-fiction game; uses BabyAI-Text as the benchmark.
### Pre- and post-contact policy decomposition for non-prehensile manipulation with zero-shot sim-to-real transfer
KAIST, 23/9, 5pts
https://arxiv.org/pdf/2309.02754v1.pdf
Abstract: lets the robot learn non-grasping behaviors, such as using walls to flip objects; uses the simulator and the real robot to train simultaneously.
### SUBWORDS AS SKILLS: TOKENIZATION FOR SPARSE-REWARD REINFORCEMENT LEARNING
x, 23/9, 5pts
https://arxiv.org/pdf/2309.04459v1.pdf
Abstract: considers a continuous-control task. First, discretize the action space; second, prune subwords to minimize the action space (the resulting subwords are called skills); finally, plan with the skills.
### Improving Offline-to-Online Reinforcement Learning with Q-Ensembles
X, 23/9, 6pts
https://arxiv.org/pdf/2306.06871v2.pdf
Abstract: offline training + online fine-tuning is a promising recipe, but distribution shift is a major problem in this setting. Removing the conservative term from the offline RL objective does not help, because the shift lets the policy deviate in the wrong direction at the beginning of online fine-tuning; keeping the conservative term during online fine-tuning also hurts performance. This work tries to solve the problem with an ensemble of Q networks.
### Leveraging World Model Disentanglement in Value-Based Multi-Agent Reinforcement Learning
McGill, 23/9, 4pts
https://arxiv.org/pdf/2309.04615v1.pdf
Abstract: uses model-based RL to play StarCraft II; the trick is to decouple the world model into three parts.
### Massively Scalable Inverse Reinforcement Learning in Google Maps
google, 23/9, 6pts
https://arxiv.org/pdf/2305.11290v3.pdf
Abstract: uses a large-scale IRL method to solve route-finding on Google Maps; achieves a 16-24% improvement in global route quality.
### Open Problems and Fundamental Limitations of Reinforcement Learning from Human Feedback
collaborate, 23/9, 6pts
https://arxiv.org/pdf/2307.15217v2.pdf
Abstract: list some open problems of the RLHF.
### Inverse Reinforcement Learning without Reinforcement Learning
CMU, 23/9, 6pts
https://arxiv.org/pdf/2303.14623v3.pdf
Abstract: uses a DP-style method to solve IRL; the main motivation is that it is not necessary to run full RL in each epoch.
### Computationally Efficient Reinforcement Learning: Targeted Exploration leveraging Simple Rules
EPFL, 23/9, 6pts
https://arxiv.org/pdf/2211.16691v3.pdf
Abstract: Give certain constraints to the action space to reduce the explorable space.
### Robot Parkour Learning
Stanford, 23/9, 4pts
https://arxiv.org/pdf/2309.05665v2.pdf
Abstract: learns 5 locomotion skills for the robot and uses distillation to combine them into one policy.
### Investigating the Impact of Action Representations in Policy Gradient Algorithms
x, 23/9, 3pts
Abstract: tries to figure out a paradigm for finding the optimal action space, but fails to reach a conclusion.
note: I think there should be some relationship between the action space, state space, and reward space that affects performance
### Reasoning with Latent Diffusion in Offline Reinforcement Learning
Jeff, 23/9, 7pts
https://arxiv.org/pdf/2309.06599v1.pdf
Abstract: aims to solve offline RL tasks. Uses a diffusion model to capture multi-modality and project behaviors into a latent space z, then constrains the behavior by choosing only within that latent space (of low-level actions), which addresses the over-conservatism problem.
### Safe Reinforcement Learning with Dual Robustness
tsinghua, 23/9, 6pts
https://arxiv.org/pdf/2309.06835v1.pdf
Abstract: unify safe RL and robust RL.
### Self-Refined Large Language Model as Automated Reward Function Designer for Deep Reinforcement Learning in Robotics
x, 23/9, 5pts
https://arxiv.org/pdf/2309.06687v1.pdf
Abstract: uses an LLM to generate a reward model in a closed loop, where the feedback is a general description of the task. For instance, the goal is: 1. the quadruped robot should run forward straight as fast as possible; 2. the quadruped robot cannot fall over. The feedback would be: the robot's average linear velocity on the x-axis is [NUM].
note: good idea, but the feedback info is highly correlated with the goal of the task, so why not just use it directly?
### Equivariant Data Augmentation for Generalization in Offline Reinforcement Learning
x, 23/9, 5pts
https://arxiv.org/pdf/2309.07578v1.pdf
abstract: addresses the domain-shift problem in offline RL; applies transformations in state space and checks whether they are equivariant.
### backdoor detection in RL
x, 23/9, 6pts
https://arxiv.org/pdf/2210.04688v3.pdf
https://arxiv.org/pdf/2202.03609v5.pdf
Abstract: learn what a backdoor attack is.
### Your Diffusion Model is Secretly a Zero-Shot Classifier
deepak, 23/9, 6pts
deepak: https://arxiv.org/pdf/2303.16203.pdf
google: https://arxiv.org/pdf/2303.15233.pdf
Abstract: a well-trained diffusion model can be used as a zero-shot classifier, where the inputs are the class descriptions and the image, and the output is the class.
Method
* google
* for every $x_t$, use the description to generate an image and calculate the similarity between the generated image and the original one.
* this gives a series of scores; combine them with predefined weights
* impossible classes can be removed at the beginning of the process
* deepak
* calculate the similarity between the original $\epsilon$ (the noise actually added) and the $\epsilon$ predicted conditioned on the text input.
* note
* a model trained on a generation task can be used as a classifier. In RL, the actor is a generation task and the critic is a classification/regression task.
* naive example:
* suppose there are two rewards {1, -1}
* given an oracle actor $f(s,r) = a$
* to get a critic, we compute $f(s,r_1)=a_1, f(s,r_2)=a_2$, then calculate the similarity between $a, a_1, a_2$ to decide whether to choose $r_1$ or $r_2$
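A minimal sketch of the $\epsilon$-comparison idea above: score each candidate class by how well the text-conditioned noise prediction matches the noise actually added, then pick the class with the smallest error. `eps_model` and its `q_sample` method are assumed interfaces.

```python
import torch

@torch.no_grad()
def diffusion_classify(eps_model, x0, class_texts, n_trials=32, T=1000):
    scores = torch.zeros(len(class_texts))
    for _ in range(n_trials):
        t = torch.randint(0, T, (1,))
        eps = torch.randn_like(x0)
        x_t = eps_model.q_sample(x0, t, eps)                  # forward-noise x0 to step t
        for c, text in enumerate(class_texts):
            eps_hat = eps_model(x_t, t, text)                 # epsilon conditioned on the text
            scores[c] += ((eps_hat - eps) ** 2).mean()
    return int(scores.argmin())                               # lowest error -> predicted class
```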
### Q-Transformer: Scalable Offline Reinforcement Learning via Autoregressive Q-Functions
Google, 23/9, 5pts
https://q-transformer.github.io/assets/q-transformer.pdf
* Abstract: use a transformer to solve offline RL tasks. Discretize the action space, map each action dimension to one transformer token, and add conservatism by pushing the Q-values of actions not in the dataset toward zero.
* method
* discretize the action space
* each action dimension corresponds to one transformer token, from $a_1,\dots,a_n$
* the update rule for each intermediate dimension is $Q_{a_i} = \max_{a_{i+1}} Q_{a_{i+1}}$
* the last dimension uses the traditional Bellman backup $Q = R + \gamma\max Q$
* force the Q-values of actions not in the dataset toward zero
* use MC returns to accelerate: specifically, use $\max\{MC, Q\}$ in place of $Q$ (generally $MC\leq Q^*$, so this does not affect convergence)
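A minimal sketch of the per-dimension targets above (illustrative, names are my own): intermediate action dimensions bootstrap from the max over the next dimension's bins at the same timestep, the last dimension uses the usual Bellman backup, and a Monte-Carlo return lower-bounds the target.

```python
import torch

def q_targets(q_next_dim_bins, q_next_step_bins, reward, mc_return,
              gamma=0.99, last_dim=False):
    if last_dim:
        target = reward + gamma * q_next_step_bins.max(dim=-1).values   # Q = R + gamma * max Q
    else:
        target = q_next_dim_bins.max(dim=-1).values                     # Q_{a_i} = max_{a_{i+1}} Q
    return torch.maximum(target, mc_return)                             # max{MC, Q} acceleration
```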
### Guiding Pretraining in Reinforcement Learning with Large Language Models
UCB, 23/9, 5pts
https://arxiv.org/pdf/2302.06692v2.pdf
* algorithm:
* use LLMs to output sub-goals.
* policy conditioned on sub-goal $\pi(a|o, g_{sub})$
* intrinsic rewards: $-Dist(E(o,a,o'), E(g_{sub}))$
* need a task description
* note
* how to get the task description?
* LLM outputs are noisy; using them as an intrinsic/pre-training signal is better.
### CHAIN-OF-THOUGHT REASONING IS A POLICY IM- PROVEMENT OPERATOR
Harvard, 23/9, 5pts
https://arxiv.org/pdf/2309.08589v1.pdf
Abstract: lets the model recursively generate new data and learn from it. This has the problem of error avalanching, which can be solved by independently updating a set of models and stopping when their outputs diverge.
### PTDE: Personalized Training with Distillated Execution for Multi-Agent Reinforcement Learning
x, 22/10, 6pts
https://arxiv.org/pdf/2210.08872.pdf
Abstract: uses global info to train a teacher and distills the knowledge into the decentralized agents.
### DOMAIN: MilDly COnservative Model-BAsed OfflINe Reinforcement Learning
x, 23/9, 5pts
https://arxiv.org/pdf/2309.08925v1.pdf
Abstract: uses model-based RL + offline RL to cover more of the domain than offline RL alone. Uses adaptive sampling for the model rollouts; the OOD Q-value is proved to be a lower bound of the true value.
### Contrastive Initial State Buffer for Reinforcement Learning
ETHz, 23/9, 5pts
https://arxiv.org/pdf/2309.09752v1.pdf
Abstract: projects the dataset into a latent space and uses KNN to cluster similar skills; the agent is able to learn from data collected long ago.
### Your Room is not Private: Gradient Inversion Attack on Reinforcement Learning
CMU, 23/9, 3pts
https://arxiv.org/pdf/2306.09273v2.pdf
Abstract: the room can be reconstructed from the training data.
### STEERING: Stein Information Directed Exploration for Model-Based Reinforcement Learning
ICML2023, 23/9, 5pts
https://arxiv.org/pdf/2301.12038v2.pdf
Abstract: the exploration problem in model-based RL; combines information gain and regret to get a better intrinsic reward for MBRL.
### Language to Rewards for Robotic Skill Synthesis
Google, 23/9, 5pts
https://language-to-reward.github.io/assets/l2r.pdf
* Abstract: leverages reward functions as an interface that bridges the gap between language and low-level robot actions. The input is the task description and the output is the reward function's code; a well-designed prompt implements this. Tested on quadruped locomotion, dexterous manipulation, and real robots.
### TEXT2REWARD: AUTOMATED DENSE REWARD FUNC- TION GENERATION FOR REINFORCEMENT LEARNING
HKU, 23/9, 5pts
https://arxiv.org/pdf/2309.11489v2.pdf
* Abstract: similar to the above work; it also allows feedback from experts to adjust the generated reward function in a closed-loop way. Moreover, it allows the expert to abstract the environment at the beginning.
### Hierarchical reinforcement learning with natural language subgoals
Deepmind, 23/9, 4pts
https://arxiv.org/pdf/2309.11564v1.pdf
Abstract: uses LLMs to output high-level sub-goals and trains in two stages. In the first stage, the input is the observation and the text sub-goal, and the output is the action. In the second stage, the input is the observation and the output is the sub-goal (text).
### Training Diffusion Models with Reinforcement Learning
Sergey, 23/10, 5pts
https://rl-diffusion.github.io/files/paper.pdf
abstract: uses RL to train a diffusion model; uses a similarity-based language reward.
### Semantically Aligned Task Decomposition in Multi-Agent Reinforcement Learning
CUHK, 23/10, 5pts
https://arxiv.org/pdf/2305.10865v2.pdf
abstract: uses a language model to assign a different task to each agent; uses chain-of-thought to correct mistakes.
### Train Hard, Fight Easy: Robust Meta Reinforcement Learning
nvidia, 23/10, 4pts
https://arxiv.org/pdf/2301.11147v2.pdf
Abstract: solve the problem of biased gradients and data inefficiency. The general objective is the average return over the worst α quantiles.
### SPRINT: Semantic Policy Pre-training via Language Instruction Relabeling
USC, 23/10, 5pts
https://clvrai.github.io/sprint/
Abstract: the pre-training data contains trajectories and language instructions. Uses LLMs to relabel the instructions, possibly merging two sub-goals into a new goal.
### Open-ended learning leads to generally capable agents
Deepmind, 21, 6pts
### Muesli: Combining Improvements in Policy Optimization
Deepmind, 22/5, 6pts
https://arxiv.org/pdf/2104.06159.pdf
Abstract: a MuZero-like method with good empirical results.
### BLENDING IMITATION AND REINFORCEMENT LEARN- ING FOR ROBUST POLICY IMPROVEMENT
UChicago, 23/10, 4pts
https://arxiv.org/pdf/2310.01737v1.pdf
Abstract: combine imitation learning and reinforcement learning.
### A LONG WAY TO GO: INVESTIGATING LENGTH CORRELATIONS IN RLHF
Princeton, 23/10, 5pts
https://arxiv.org/pdf/2310.03716v1.pdf
Abstract: RLHF tends to increase the length of the output. Furthermore, replacing the reward model with a purely length-based reward model does not decrease performance.
### LESSON: Learning to Integrate Exploration Strategies for Reinforcement Learning via an Option Framework
KAIST, 23/10, 6pts
https://arxiv.org/pdf/2310.03342v1.pdf
Abstract: from the perspective of exploration, investigate the problem of meta learning.
### Talk of Shuran Song
* note:
* action selection is important for learning dynamics.
* instead of predicting the whole trajectory, we can learn how the gradient of the action affects the result.
* the invariances in a robot system can be used as 1. data augmentation, 2. transformation. Specifically, the second one means that when you get observation f(o), you should output action f(a). It differs from the first in that the change of input also causes a change of output.
* how can we learn from the inaccurate simulator?
### ON DOUBLE-DESCENT IN REINFORCEMENT LEARNING WITH LSTD AND RANDOM FEATURES
x, 23/10, 4pts
https://arxiv.org/pdf/2310.05518v1.pdf
* Abstract: double descent: when N/m=1, the testing loss reaches a peak. When N (model size) < m (data size), the testing error is U-shaped, representing the bias-variance trade-off; when N > m, the testing error drops again.
* note:
* how should data size be defined in RL: the number of visited states? state-action pairs?
* use it to choose the model size as well as the training time.
### Planning to Go Out-of-Distribution in Offline-to-Online Reinforcement Learning
University of Edinburgh, 23/10, 6pts
https://arxiv.org/pdf/2310.05723v1.pdf
* Abstract: offline pre-training and online fine-tuning. During the online adaptation phase, we have to let the agent explore the high-return region.
* note:
* out-of-distribution scenes can be handled by online fine-tuning, just like out-of-distribution tasks can be handled by few-shot learning.
### Accelerate Multi-Agent Reinforcement Learning in Zero-Sum Games with Subgame Curriculum Learning
YiWu, 23/10, 6pts
https://arxiv.org/pdf/2310.04796v1.pdf
* Abstract: sets the initial state of each episode so that the Nash equilibrium can be found in linear time.
### Improving Reinforcement Learning Efficiency with Auxiliary Tasks in Non-Visual Environments: A Comparison
x, 23/10, 4pts
https://arxiv.org/pdf/2310.04241v2.pdf
* Abstract:
* compares the effects of different auxiliary tasks.
* representation learning with auxiliary tasks only provides performance gains in sufficiently complex environments
* learning environment dynamics is preferable to predicting rewards.
* builds on `Can Increasing Input Dimensionality Improve Deep Reinforcement Learning`, decoupling the auxiliary task from RL training.
### EFFICIENT DEEP REINFORCEMENT LEARNING REQUIRES REGULATING OVERFITTING
Sergey, 23/4, 5pts
https://arxiv.org/pdf/2304.10466.pdf
* Abstract: prevents RL algorithms from overfitting. Uses a method similar to a validation set in supervised learning, but computes a validation TD error instead.
### Diffusion Model is an Effective Planner and Data Synthesizer for Multi-Task Reinforcement Learning
SJTU, 23/10, 5pts
* Abstract: transformer, offline, multi-task; uses diffusion to reconstruct transitions and actions.
### REVISITING PLASTICITY IN VISUAL REINFORCEMENT LEARNING: DATA, MODULES AND TRAINING STAGES
Tsinghua, 23/10, 4pts
https://arxiv.org/pdf/2310.07418v1.pdf
* Abstract: plasticity = performance increase per unit of new data. Reset and data augmentation can improve plasticity. Reset = regularly resetting part of the parameters during training. Data augmentation > data augmentation + reset > reset > none. Critic's plasticity > actor's plasticity; early-stage plasticity > late-stage plasticity.
### RL3: Boosting Meta Reinforcement Learning via RL inside RL2
### MULTI-TIMESTEP MODELS FOR MODEL-BASED REINFORCEMENT LEARNING
Huawei, 23/10, 3pts
* Abstract: multi-step state reconstruction, similar to TD(λ).
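A hedged sketch of one way to weight multi-step model predictions TD(λ)-style; the `model` interface, the geometric weighting, and the loss form are my assumptions for illustration, not the paper's exact objective.

```python
import torch

def multistep_model_loss(model, states, actions, lam=0.9, horizon=3):
    """Weighted sum of k-step prediction errors, weights ~ lam**(k-1).

    states:  [T+1, state_dim] observed states, actions: [T, act_dim].
    model(s, a) -> predicted next state (assumed one-step dynamics model).
    """
    total, weight_sum = 0.0, 0.0
    T = actions.shape[0]
    for k in range(1, horizon + 1):
        w = lam ** (k - 1)
        for t in range(T - k + 1):
            s_hat = states[t]
            for i in range(k):                      # roll the model out k steps
                s_hat = model(s_hat, actions[t + i])
            total = total + w * torch.mean((s_hat - states[t + k]) ** 2)
        weight_sum += w
    return total / weight_sum
```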
### ReMax: A Simple, Effective, and Efficient Reinforcement Learning Method for Aligning Large Language Models
x, 23/10, 7pts
https://arxiv.org/pdf/2310.10505v2.pdf
* Abstract:
* Deals with RLHF.
* Discards the critic network to save 50% of GPU consumption.
* Algorithm
* Uses a sample-based method to estimate the return-to-go.
* 3 reasons:
* an NLP task's dynamics are deterministic
* completing a trajectory does not take a lot of time.
* the reward is only received at the end of the trajectory.
* collect a trajectory with greedy decoding (sample=False) and use its received return-to-go as the value-function baseline (sketched below).
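A rough sketch of the note's description (hedged: the `policy.generate` and `reward_model` interfaces are placeholders, and KL penalties and batching are omitted) — REINFORCE where the reward of the greedily decoded response serves as the baseline instead of a learned critic:

```python
import torch

def remax_loss(policy, reward_model, prompt):
    """One-prompt ReMax-style loss sketch.

    policy.generate(prompt, sample=...) is assumed to return the generated
    response plus the sum of its token log-probabilities under the policy.
    reward_model(prompt, response) returns a scalar terminal reward, so the
    return-to-go of a response equals that single reward.
    """
    # Sampled trajectory: used for the policy gradient.
    response, sum_logprob = policy.generate(prompt, sample=True)
    # Greedy trajectory: its reward is used as the baseline "value".
    with torch.no_grad():
        greedy_response, _ = policy.generate(prompt, sample=False)
        baseline = reward_model(prompt, greedy_response)
        advantage = reward_model(prompt, response) - baseline
    # REINFORCE with the greedy-decode baseline; no critic network needed.
    return -(advantage * sum_logprob)
```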
### Policy Optimization for Continuous Reinforcement Learning
columbia, 23/10, 6pts
https://arxiv.org/pdf/2305.18901v4.pdf
Abstract: RL in continuous time.
### VISION-LANGUAGE MODELS ARE ZERO-SHOT REWARD MODELS FOR REINFORCEMENT LEARNING
x, 23/10, 7pts
Abstract: zero-shot reward model. Tested on the MuJoCo humanoid with the largest publicly available CLIP model and realistic textures.
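A hedged sketch of a CLIP-style zero-shot reward (the checkpoint name, preprocessing, and prompt are my assumptions, not the paper's exact setup): the reward is the cosine similarity between the rendered frame's embedding and the task description's embedding.

```python
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

# Assumed checkpoint for illustration; the paper uses the largest public CLIP.
model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

@torch.no_grad()
def vlm_reward(frame: Image.Image, goal_text: str) -> float:
    """Zero-shot reward: cosine similarity between frame and goal description."""
    inputs = processor(text=[goal_text], images=frame,
                       return_tensors="pt", padding=True)
    image_emb = model.get_image_features(pixel_values=inputs["pixel_values"])
    text_emb = model.get_text_features(input_ids=inputs["input_ids"],
                                       attention_mask=inputs["attention_mask"])
    image_emb = image_emb / image_emb.norm(dim=-1, keepdim=True)
    text_emb = text_emb / text_emb.norm(dim=-1, keepdim=True)
    return float((image_emb * text_emb).sum())

# Usage sketch (hypothetical prompt): vlm_reward(env_frame, "a humanoid robot kneeling")
```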
### ABSOLUTE POLICY OPTIMIZATION
CMU, 23/10, 5pts
https://arxiv.org/pdf/2310.13230v1.pdf
Abstract: policy optimization should not be solely fixated on enhancing expected performance but also on improving worst-case performance; that is, maximize $J(\pi) - \mathrm{Var}(\pi)$.
### CONTRASTIVE PREFERENCE LEARNING: LEARNING FROM HUMAN FEEDBACK WITHOUT RL
Stanford, 23/10, 4pts
https://arxiv.org/pdf/2310.13639v1.pdf
Abstract: instead of learning from a reward function, learn from the optimal advantage function (negated regret). Uses Contrastive Preference Learning, which exploits a bijective mapping between advantage functions and policies.
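A hedged sketch of my reading of that contrastive objective (the α scaling, discounting, and segment format are assumptions): the discounted sum of log-probabilities over a segment stands in for its advantage, and a logistic loss prefers the chosen segment.

```python
import torch

def cpl_style_loss(policy_logprob, preferred_seg, rejected_seg,
                   alpha=0.1, gamma=0.99):
    """Contrastive preference loss over two trajectory segments.

    policy_logprob(states, actions) -> per-step log pi(a|s), shape [T].
    Each segment is a (states, actions) pair of equal length.
    """
    def segment_score(seg):
        logp = policy_logprob(*seg)                              # [T]
        discounts = gamma ** torch.arange(logp.shape[0], dtype=logp.dtype)
        return alpha * (discounts * logp).sum()                  # advantage proxy

    s_pos = segment_score(preferred_seg)
    s_neg = segment_score(rejected_seg)
    # Bradley-Terry / logistic loss: push the preferred segment's score up.
    return -torch.nn.functional.logsigmoid(s_pos - s_neg)
```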
### Contrastive Retrospection: honing in on critical steps for rapid learning and generalization in RL
mila, 23/10, 6pts
https://arxiv.org/pdf/2210.05845v6.pdf
Abstract: tackles credit assignment. Stores trajectories in a buffer, uses contrastive learning to learn an embedding, stores some prototypes (critical steps), and finally uses cosine distance to generate intrinsic rewards.
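A minimal sketch of that prototype-based intrinsic reward (the encoder, the prototype store, and the thresholding are assumptions): the bonus is the maximum cosine similarity between the current state's embedding and the stored prototypes of critical steps.

```python
import torch
import torch.nn.functional as F

def intrinsic_reward(encoder, prototypes, state, threshold=0.6):
    """Cosine-similarity bonus w.r.t. stored prototypes of critical steps.

    encoder(state) -> embedding [d]; prototypes: tensor [K, d].
    """
    with torch.no_grad():
        z = F.normalize(encoder(state), dim=-1)
        protos = F.normalize(prototypes, dim=-1)
        sims = protos @ z                      # [K] cosine similarities
        best = sims.max()
    # Only reward states that clearly match some prototype (assumed threshold).
    return float(best) if best > threshold else 0.0
```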
### The Primacy Bias in Model-based RL
2pts
resets the parameters of the world model instead of the agent's parameters.
### Robot Fine-Tuning Made Easy: Pre-Training Rewards and Policies for Autonomous Real-World Reinforcement Learning
Stanford, 23/10, 6pts
https://arxiv.org/pdf/2310.15145v1.pdf
Abstract: we pre-train a multi-task policy and fine-tune a pre-trained Vision-Language Model (VLM) as a reward model using diverse off-the-shelf offline datasets and a small amount of target-task demonstrations. Then, we fine-tune the pre-trained policy online, reset-free, with the VLM reward model.
### UNLEASHING THE POWER OF PRE-TRAINED LANGUAGE MODELS FOR OFFLINE REINFORCEMENT LEARNING
Huazhe xu, 23/11, 4pts
Abstract: uses LoRA + a pre-trained large language model. Fine-tunes based on Decision Transformer. Uses a language task as an auxiliary loss. Uses GPT-2 only.
### Prioritized Level Replay
Facebook, ICML2021, 6pts
https://arxiv.org/pdf/2010.03934.pdf
algorithm:
* with probability p choose a new task (uniformly); with probability 1-p choose among old tasks.
* for old tasks
* score = average TD error; rank the TD errors. $p_s(i) = \frac{\exp(\mathrm{rank}(i))^\beta}{\sum_j \exp(\mathrm{rank}(j))^\beta}$
* total count = $C$, task $i$ has been chosen $c_i$ times, $p_c(i) = \frac{C - c_i}{\sum_j (C - c_j)}$
* final prob = $\rho\, p_c + (1-\rho)\, p_s$ (sketched in code below)
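A sketch implementing the sampling rule as written in the notes above (the rank convention — higher average TD error gets a larger rank value — and the `p_new`/`num_tasks` arguments are my assumptions):

```python
import numpy as np

def plr_sample(task_scores, task_counts, total_count, rng,
               p_new=0.1, beta=1.0, rho=0.5, num_tasks=100):
    """Sample a task/level index.

    task_scores: {task_id: average TD error}, task_counts: {task_id: times chosen}.
    """
    seen = list(task_scores.keys())
    unseen = [t for t in range(num_tasks) if t not in task_scores]
    # With probability p_new pick a new (unseen) task uniformly.
    if unseen and (not seen or rng.random() < p_new):
        return int(rng.choice(unseen))

    # Score-based term: softmax over TD-error ranks (higher error -> larger rank).
    scores = np.array([task_scores[t] for t in seen])
    ranks = scores.argsort().argsort() + 1
    logits = beta * ranks
    p_s = np.exp(logits - logits.max())
    p_s /= p_s.sum()

    # Staleness term: tasks not chosen for a while get more probability.
    counts = np.array([task_counts[t] for t in seen], dtype=float)
    stale = total_count - counts
    p_c = stale / stale.sum() if stale.sum() > 0 else np.ones_like(stale) / len(stale)

    p = rho * p_c + (1.0 - rho) * p_s
    return int(rng.choice(seen, p=p))

# Usage sketch: task = plr_sample(scores, counts, C, np.random.default_rng(0))
```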
### DrM: Mastering Visual Reinforcement Learning through Dormant Ratio Minimization
Huazhe xu, 23/10, 6pts
https://drm-rl.github.io
Abstract: Dormant Ratio (fraction of inactive neurons). They found that in the beginning phase of training the agent tends to exhibit inactivity, which limits its ability to explore. Tested on visual environments. Adds a periodic awakening exploration scheduler.
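A hedged sketch of how such a dormant ratio could be computed, assuming the common definition (a neuron counts as dormant when its mean activation, normalized by the layer average, falls below a threshold τ):

```python
import torch

@torch.no_grad()
def dormant_ratio(activations_per_layer, tau=0.025):
    """Fraction of 'dormant' neurons across the given layers.

    activations_per_layer: list of tensors [batch, num_neurons], the
    post-activation outputs of each hidden layer on a batch of inputs.
    """
    dormant, total = 0, 0
    for acts in activations_per_layer:
        per_neuron = acts.abs().mean(dim=0)                 # [num_neurons]
        layer_mean = per_neuron.mean().clamp_min(1e-8)
        normalized = per_neuron / layer_mean
        dormant += int((normalized <= tau).sum())
        total += per_neuron.numel()
    return dormant / total
```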
### From Explicit Communication to Tacit Cooperation: A Novel Paradigm for Cooperative MARL
x, 23/4, 6pts
https://arxiv.org/pdf/2304.14656.pdf
Abstract: centralized training at first, gradually reducing the shared information, finally obtaining a decentralized policy.
### Context Shift Reduction for Offline Meta-Reinforcement Learning
x, Nips 2023, 7pts
https://arxiv.org/pdf/2311.03695v1.pdf
Abstract: maximize the mutual information between the context Z and the task T while minimizing the mutual information between Z and the behavior policy π in the offline setting, since the trajectory is influenced by both the behavior policy and the dynamics, and the former is unrelated to the task context.
### Offline Multi-Agent Reinforcement Learning with Implicit Global-to-Local Value Regularization
x, Nips 23, 4pts
https://arxiv.org/pdf/2307.11620v2.pdf
Abstract: decomposes the offline regularization term (global-to-local value regularization).
### Survival Instinct in Offline Reinforcement Learning
x, UW, 7pts
https://arxiv.org/pdf/2306.03286v2.pdf
* interesting observations
* offline agents trained with random/negative/zero rewards outperform BC and the behavior policy.
* they sometimes even outperform agents trained with the original rewards.
* they perform some safe behaviors
* intuition
* large data coverage:
* pros: improves the best policy that can be learned by offline RL with the true reward
* cons: more sensitive to imperfect rewards.
* thus, it might not be necessary or helpful
* data bias
* evaluate offline algorithms with a wrong reward to quantify data bias
### TEA: Test-time Energy Adaptation
x, Nov23, 7pts
https://arxiv.org/pdf/2311.14402v1.pdf
### IMITATION BOOTSTRAPPED REINFORCEMENT LEARNING
Dorsa, Nov23, 4pts
https://arxiv.org/pdf/2311.02198v3.pdf
Abstract: imitation learning gives a pre-trained model; then fine-tune it online using RL. Choose the action as $\arg\max_{a \in \{a_{RL}, a_{IL}\}} Q(s,a)$ (sketched below).
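A minimal sketch of that action-selection rule (the policy/critic interfaces are assumptions): propose one action from the IL policy and one from the RL policy, then act with whichever the Q-function scores higher.

```python
import torch

@torch.no_grad()
def select_action(q_net, rl_policy, il_policy, state):
    """Pick argmax over {a_RL, a_IL} of Q(s, a)."""
    a_rl = rl_policy(state)          # RL policy's proposal
    a_il = il_policy(state)          # IL (pre-trained) policy's proposal
    q_rl = q_net(state, a_rl)
    q_il = q_net(state, a_il)
    return a_rl if q_rl >= q_il else a_il
```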
### Nearly Tight Bounds for the Continuum-Armed Bandit Problem
* some insight
* its estimate of the cost function only needs to be accurate for strategies where the cost function is near its minimum.
### Towards a Standardised Performance Evaluation Protocol for Cooperative MARL
* Nips 2022
* some details to pay attention to when doing MARL experiments.
### LiFT: Unsupervised Reinforcement Learning with Foundation Models as Teachers
USC, Dec 2023, 4pts
https://arxiv.org/pdf/2312.08958v1.pdf
* uses a VLM to train a Minecraft agent without supervision. Prompts look like: `You see blocks of [objects]. You face entities of [entities]. You have a [item] in your hand. Target Skill:`. The VLM alignment score is the cosine distance in embedding space.
### Less is more - the dispatcher/ executor principle for multi-task Reinforcement Learning
DeepMind, Dec 2023, 6pts
https://arxiv.org/pdf/2312.09120v1.pdf
* main idea: dispatcher + executor = Agent
* dispatcher: semantically understanding the task, make commands.
* executor: the actual control signal; there may be several executors, each specialized in realizing different skills.
* communication channel: compositionality of the transmitted command information; reduce the transferred information to the minimum.
* execution detail
* the information about the target object is encoded through a simple masking operation
* the full image is run through an edge detector and the result is provided as a further argument to the executor to avoid collisions.
### Video PreTraining (VPT): Learning to Act by Watching Unlabeled Online Videos
OpenAI June2022
https://arxiv.org/pdf/2206.11795.pdf
* collect 70k hours of unlabeled data and 2k hours of labeled data. Train an inverse dynamics model on the latter dataset, use it to label the former, train an agent on the pseudo-labeled data, then fine-tune it using RL.
### STEVE-1: A Generative Model for Text-to-Behavior in Minecraft
UofT June2023
* uses VPT + MineCLIP to train a command-following agent.
* a DALL-E-like method
* DALLE
* CLIP model = text encoder + image encoder
* model1: text prompt -> image embedding
* model2: image decoder
* STEVE
* MineCLIP = video encoder + text encoder
* model1: text prompt -> video embedding
* model2: video embedding -> VPT agent that is able to achieve the goal
* detail
* policy loss = $-\log p(a \mid o, z)$, where the image embedding z is randomly selected from a future timestep.
* a CVAE takes the text embedding and a Gaussian prior as input and outputs the image embedding, which becomes the input of the VPT agent.
* there is a dataset to train model1, by minimizing a KL divergence.
* some image embeddings are dropped out during training, which makes the policy no longer conditioned on the text. At test time: lambda * policy_distribution_conditioned_on_text + (1 - lambda) * policy_distribution_unconditioned_on_text (see the sketch after this list).
* note
* does adding lambda improve the grounding ability?
* the advantage of this idea is that you already have a pre-trained decoder; the only thing that has to be changed is model1.
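A minimal sketch of the test-time mixing written in the note above (the `policy(obs, z)` interface is an assumption; the actual paper may combine the two distributions differently, e.g. on logits):

```python
import torch

@torch.no_grad()
def mixed_action_distribution(policy, obs, goal_embedding, lam=0.7):
    """Test-time mixture: lam * p(a|o, z_goal) + (1 - lam) * p(a|o).

    policy(obs, z) is assumed to return action logits; z=None stands for the
    unconditioned case (as produced by the training-time embedding dropout).
    """
    p_cond = torch.softmax(policy(obs, goal_embedding), dim=-1)
    p_uncond = torch.softmax(policy(obs, None), dim=-1)
    mixed = lam * p_cond + (1.0 - lam) * p_uncond
    return mixed / mixed.sum(dim=-1, keepdim=True)   # renormalize for safety
```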
### Accelerate Multi-Agent Reinforcement Learning in Zero-Sum Games with Subgame Curriculum Learning
Yi Wu, aaai2024, 5pts
https://arxiv.org/pdf/2310.04796v2.pdf
* abstract: self-play. Instead of choosing an appropriate opponent for the current agent, choose a state to start from. Uses the difference between the NE and the current state as the metric. However, we can never know the value of the NE, so some approximation has to be done.
### XSkill: Cross Embodiment Skill Discovery
Shuran Song, corl2023, 3pts
* abstract: learns from unlabeled human demonstrations. Learns a skill discriminator and a policy conditioned on the skill embedding.
### FROM PLAY TO POLICY: CONDITIONAL BEHAVIOR GENERATION FROM UNCURATED ROBOT DATA
Lerrel Pinto, Dec2022, 4pts
* abstract: offline-learn a policy conditioned on a future state. But how do we get the future state at eval time?
### PRE-TRAINING FOR ROBOTS: LEVERAGING DIVERSE MULTITASK DATA VIA OFFLINE RL
ICLR2023, 5pts
* abstract: large-scale pre-training with offline RL + fine-tuning on the target task with 10-15 demonstrations is better than IRL, RL, or other methods.
### Diffusion Reward: Learning Rewards via Conditional Video Diffusion
Huazhe Xu, 6pts
https://diffusion-reward.github.io/resources/Diffusion_Reward_Learning_Rewards_via_Conditional_Video_Diffusion.pdf
* abstract: a diffusion model conditioned on historical images outputs the whole trajectory. The reward r(s) is defined via the entropy of the model's output, which can encourage exploration. The insight is that the diffusion model can generate diverse trajectories.
## ICLR2024
* RLIF: Interactive Imitation Learning as Reinforcement Learning
* 8666
* abstract: use RL to improve DAgger, uses the expert’s decision to intervene as a negative reward signal
* Efficient Offline Reinforcement Learning: The Critic is Critical
* 555
* abstract: use an MC regression loss to pre-train the critic.
* SUBMODULAR REINFORCEMENT LEARNING
* 8686
* abstract: rewards may depend on historical trajectories. Theoretically analyzes the lower bound of the algorithm.
* Towards Principled Representation Learning from Videos for Reinforcement Learning
* 8885
* abstract: RL+video representation learning w/ iid noise or exogenous noise.
* Harnessing Discrete Representations for Continual Reinforcement Learning
* 8655
* abstract: use discrete state representation instead of continuous state.
* Stochastic Subgoal Representation for Hierarchical Reinforcement Learning
* 8661
* abstract: uses a stochastic latent representation (subgoal) to improve long-term decision making.
* Discovering Temporally-Aware Reinforcement Learning Algorithms
* 855
* abstract: meta learning is used to learn objective functions for different tasks. In this work, the learned objective function depends on the time horizon; e.g., students may alter their studying techniques based on the proximity of exam deadlines and their self-assessed capabilities.
* Language Reward Modulation for Pretraining Reinforcement Learning
* 6565
* abstract: use VLM's output as pre-training rewards. Fine-tune on the downstream task(w/ sparse rewards)
* Training Diffusion Models with Reinforcement Learning
* 5866
* abstract: DDPO considers the reverse generative process as MDP, where the reward is only given at the zeroth timestep.
* lambda-AC: Effective decision-aware reinforcement learning with latent models
* 5368
* abstract: analyze MuZero and its alternative.
* Exposing the Silent Hidden Impact of Certified Training in Reinforcement Learning
* 565
* adversarially trained value functions are shown to overestimate the optimal values
* Time-Efficient Reinforcement Learning with Stochastic Stateful Policies
* 586
* abstract: POMDP. The model's internal state is represented as a stochastic variable that is sampled at each time step, circumventing the issues associated with backpropagation through time. (why?)
* Transformers as Decision Makers: Provable In-Context Reinforcement Learning via Supervised Pretraining
* 5686
* abstract: in-context learning can do the same thing as UCB, etc. It has better generalization ability than supervised fine-tuning.
* Goodhart's Law in Reinforcement Learning
* 5668
* abstract: Goodhart's law. Gives a geometric explanation of how optimization of a misspecified reward function can lead to worse performance beyond some threshold. Proposes an early-stopping algorithm.
* Reasoning with Latent Diffusion in Offline Reinforcement Learning
* 685
* abstract: use diffusion model w/ offline RL to learn latent representation. multi-modal, conditioned on time.
* CPPO: Continual Learning for Reinforcement Learning with Human Feedback
* 6685
* abstract: RLHF + continual learning. Examples with high reward and low generation probability, or high generation probability and low reward, get a high policy-learning weight (new knowledge) and a low knowledge-retention weight (old knowledge).
* Revisiting Data Augmentation in Deep Reinforcement Learning
* 6666
* abstract: analyze existing methods. include a regularization term called tangent prop.
* Proximal Curriculum with Task Correlations for Deep Reinforcement Learning
* 8355
* abstract: multi-task; curriculum design based on the Zone of Proximal Development concept.
* Robust Adversarial Reinforcement Learning via Bounded Rationality Curricula
* 8666
* abstract: builds on robust adversarial reinforcement learning by adding entropy regularization to the players' objectives and annealing the temperature (curriculum learning).
* Subwords as Skills: Tokenization for Sparse-Reward Reinforcement Learning
* 6553
* abstract: discretize action and tokenize skill(a series of actions)
* Proximal Policy Gradient Arborescence for Quality Diversity Reinforcement Learning
* 8866
* abstract: QD+PPO, super good performance.
* Maximum Entropy Model Correction in Reinforcement Learning
* 688
* abstract: MBRL, max-entropy RL; uses an incorrect world model to improve training speed.
* Value Factorization for Asynchronous Multi-Agent Reinforcement Learning
* 6565
* abstract: asynchronous value decomposition (what is that?)
* Policy Rehearsing: Training Generalizable Policies for Reinforcement Learning
* 8886
* abstract: uses a somewhat OOD offline dataset for training rehearsal; uses rewards and done flags as input to learn the dynamics.
* Robust Reinforcement Learning with Structured Adversarial Ensemble
* 663
* abstract: propose an adversarial ensemble approach to address over-optimism and optimize average performance against the worst-k adversaries to mitigate over-pessimism.
* Multi-Task Reinforcement Learning with Mixture of Orthogonal Experts
* 55666
* abstract: uses multiple models for multi-task RL; uses Gram-Schmidt to make sure each model learns a different representation.
* Game-Theoretic Robust Reinforcement Learning Handles Temporally-Coupled Perturbations
* 565
* abstract: adversarial training based on temporally-coupled perturbations. (temporally?)
* Tactics of Robust Deep Reinforcement Learning with Randomized Smoothing
* 5555
* abstract: robustness of DRL. Randomized smoothing introduces a trade-off between utility and robustness; the paper introduces a more potent adversarial attack.
* Blending Imitation and Reinforcement Learning for Robust Policy Improvement
* 8885
* abstract: combine RL and IL, use IL to encourage exploration in RL.
* Compositional Instruction Following with Language Models and Reinforcement Learning
* 5553
* abstract: use LLM to map a given natural language specification to an expression representing a Boolean combination of primitive tasks
* Privileged Sensing Scaffolds Reinforcement Learning
* 8 8 8 10 !!!!!
* abstract: MBRL, a Dreamer-like algorithm. Uses privileged knowledge to learn a better world model.
* CAMMARL: Conformal Action Modeling in Multi Agent Reinforcement Learning
* 5566
* abstract: maintain a belief over your teammates' actions; condition your policy on this belief.
* Robust Model Based Reinforcement Learning Using Adaptive Control
* 6668
* abstract: control input produced by the underlying MBRL is perturbed by the adaptive control, which is designed to enhance the robustness of the system against uncertainties. (?)
* Decision Transformer is a Robust Contender for Offline Reinforcement Learning
* 6666
* abstract: DT requires more data than CQL, but exhibits higher robustness to suboptimal data, sparse reward. DT and BC good at tasks with longer horizons or data collected from human demonstrations, while CQL good at tasks with both high stochasticity and lower data quality.
* Exploiting Causal Graph Priors with Posterior Sampling for Reinforcement Learning
* 8886
* abstract: causal graph as a prior; uses Bayes' rule to compute the posterior.
* PAE: Reinforcement Learning from External Knowledge for Efficient Exploration
* 8666
* abstract: incorporates a planner into the RL framework. Takes natural language as input, mainly focusing on the exploration problem of long-horizon tasks.
* SRL: Scaling Distributed Reinforcement Learning to Over Ten Thousand Cores
* 8886
* abstract: RL system.
* Decoupling regularization from the action space
* 665
* abstract: the scale of the entropy term should not be proportional to the dimension of the action space.
* intuition: changing the robot’s acceleration unit from meters per second squared to feet per minute squared should not lead to a different optimal policy
* tune the beta in exp(Q/beta) based on dim(A). Set the target entropy to (1-alpha)*H(uniform) + alpha*H(deterministic) (a small sketch appears at the end of this list).
* Decoupled Actor-Critic
* 6665
* abstract: uses an optimism model for exploration (it does not interact with the environment) and pessimism for exploitation.
* S2AC: Energy-Based Reinforcement Learning with Stein Soft Actor Critic
* 3668665
* abstract: compare with SQL/SAC. uses parameterized Stein Variational Gradient Descent (SVGD) to learn a max entropy policy.
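For the `Decoupling regularization from the action space` item above, a tiny sketch of the target-entropy interpolation as I read it (assuming a discrete action space, so H(uniform) = log|A| and H(deterministic) = 0; the weighting is my reading of the note):

```python
import math

def target_entropy(num_actions: int, alpha: float) -> float:
    """Interpolated target entropy, decoupled from the action-space size.

    alpha = 0 -> entropy of the uniform policy, alpha = 1 -> deterministic (0).
    Assumes a discrete action space: H(uniform) = log(num_actions).
    """
    h_uniform = math.log(num_actions)
    h_deterministic = 0.0
    return (1.0 - alpha) * h_uniform + alpha * h_deterministic

# e.g. target_entropy(4, 0.5) == 0.5 * log(4), independent of how the action
# space is later re-scaled or re-parameterized.
```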