---
title: Inverse RL Algorithms Progress Report 6/26/2019
tags: Inverse RL
description: Progress Report on Inverse RL
---
# Inverse RL Algorithms survey
## IRL versus RL
In RL, the agent is provided with a reward function. This reward function is used to obtain an optimal policy, one for which the expected future reward is maximal.
In IRL, the setting is inverted. We are given an agent's policy or a history of behavior (trajectories), and we try to find a reward function that explains the given behavior.
A **Markov decision process (MDP)** is defined as a tuple **<S,A,R,T,γ>**
* **S** is the set of n environment states
* **A** is the set of k actions
* **R: S→ℝ** is the reward function, mapping each state to a real-valued reward
* **T:S×A×S→[0,1]** contains the transition probabilities, i.e. T(s′|s,a) is the probability of landing in state s′ after performing action a in state s
* **γ** is the discount factor for future reward
The value of a state under policy π is denoted as V^π(s). By the Bellman equation, this value is defined as:

$$V^\pi(s) = R(s) + \gamma \sum_{s' \in S} T(s' \mid s, \pi(s))\, V^\pi(s')$$
The value of state-action pairs under some policy π is denoted as Q^π(s,a)
and is defined as:

$$Q^\pi(s,a) = R(s) + \gamma \sum_{s' \in S} T(s' \mid s, a)\, V^\pi(s')$$
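To make the two definitions concrete, here is a minimal sketch that evaluates V^π and Q^π for a small tabular MDP by solving the Bellman system directly; the toy transition tensor `T`, reward vector `R`, and deterministic `policy` are made-up illustrations, not taken from any of the referenced papers.

```python
import numpy as np

np.random.seed(0)
n, k, gamma = 3, 2, 0.9                            # n states, k actions, discount
T = np.random.dirichlet(np.ones(n), size=(n, k))   # T[s, a, s'] = P(s' | s, a)
R = np.array([0.0, 0.0, 1.0])                      # state rewards, R: S -> reals
policy = np.array([1, 0, 1])                       # deterministic policy pi(s)

# Transition matrix induced by the policy: P_pi[s, s'] = T[s, pi(s), s'].
P_pi = T[np.arange(n), policy]

# Bellman equation in matrix form: V = R + gamma * P_pi V  =>  V = (I - gamma P_pi)^{-1} R
V = np.linalg.solve(np.eye(n) - gamma * P_pi, R)

# Q^pi(s, a) = R(s) + gamma * sum_{s'} T(s' | s, a) V(s')
Q = R[:, None] + gamma * T @ V
print(V, Q, sep="\n")
```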
## Inverse RL Scenarios (classic algorithms)
**Goal:** Find the optimal reward function (all scenarios).
**1. Optimal policy π is known. Small state space.**
We require the expected value of the policy action π(s) to be greater than or equal to the expected value of any other action, for all states:

$$Q^\pi(s, \pi(s)) \ \ge\ Q^\pi(s, a) \qquad \forall s \in S,\ \forall a \in A$$
Reformulating the above condition in terms of matrices and vectors (with P_a the n×n matrix of transition probabilities under action a and P_π the transition matrix under the policy actions), it becomes:

$$(\mathbf{P}_{\pi} - \mathbf{P}_{a})\,(\mathbf{I} - \gamma \mathbf{P}_{\pi})^{-1}\mathbf{R} \ \succeq\ 0 \qquad \forall a \in A$$

Since **R** = 0 trivially satisfies these constraints, Ng & Russell additionally maximize the margin between the policy action and the next-best action and penalize large rewards, which turns the search for **R** into a linear program.
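A minimal sketch of how these matrix constraints could be checked numerically for a candidate reward vector; the helper name `irl_constraints_hold` and the array layout are my own choices, and the margin and penalty terms of the full linear program are omitted.

```python
import numpy as np

def irl_constraints_hold(T, policy, R, gamma, tol=1e-9):
    """Check (P_pi - P_a)(I - gamma P_pi)^{-1} R >= 0 for every non-policy action a."""
    n, k, _ = T.shape
    P_pi = T[np.arange(n), policy]                    # n x n transitions under pi
    V = np.linalg.solve(np.eye(n) - gamma * P_pi, R)  # (I - gamma P_pi)^{-1} R
    for a in range(k):
        P_a = T[:, a, :]
        mask = policy != a                            # only compare non-policy actions
        if np.any(((P_pi - P_a) @ V)[mask] < -tol):
            return False
    return True
```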
**2. Optimal policy π is known. Large or infinite state space**
* Infinite state space S = ℝ^n (the n-dimensional real domain)
* Reward R: ℝ^n → ℝ (from the n-dimensional real domain to a single real value)

While this is possible in theory, it is very hard to handle algorithmically.
Instead, we approximate the reward function by a linear combination of d fixed, known, and bounded basis functions ϕ_i:

$$R(s) = \alpha_1 \phi_1(s) + \alpha_2 \phi_2(s) + \dots + \alpha_d \phi_d(s)$$
Because the value function is linear in the reward, V^π can be expressed in terms of the per-basis value functions V_i^π (the value function obtained when ϕ_i alone is used as the reward):

$$V^\pi(s) = \alpha_1 V_1^\pi(s) + \alpha_2 V_2^\pi(s) + \dots + \alpha_d V_d^\pi(s)$$
We then estimate the expected values on a sampled subset of states S_0 ⊂ S and enforce that the policy action leads to a higher expected value than any non-policy action:

$$\mathbb{E}_{s' \sim T(\cdot \mid s, \pi(s))}\big[V^\pi(s')\big] \ \ge\ \mathbb{E}_{s' \sim T(\cdot \mid s, a)}\big[V^\pi(s')\big] \qquad \forall s \in S_0,\ \forall a \ne \pi(s)$$
The real reward function might not be representable as a linear combination of our selected basis functions. To resolve this issue, we relax the linear constraints and simply penalize whenever they are violated:

$$\max_{\alpha}\ \sum_{s \in S_0} \min_{a \ne \pi(s)}\ p\Big(\mathbb{E}_{s' \sim T(\cdot \mid s, \pi(s))}\big[V^\pi(s')\big] - \mathbb{E}_{s' \sim T(\cdot \mid s, a)}\big[V^\pi(s')\big]\Big)$$

Here p(x) = x if x ≥ 0 and p(x) = 2x otherwise, so violated constraints are penalized more heavily.
We can see that the formulation above maximizes the smallest difference it can find, i.e. in each state it selects the best non-policy action and tries to maximize the margin between the policy action and that action.
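As a sketch of how the relaxed objective could be evaluated, assume the per-basis expected values E[V_i^π(s′)] have already been estimated (e.g. by Monte Carlo rollouts); the array names `EV_policy` and `EV_other` are hypothetical placeholders rather than anything from the original algorithm.

```python
import numpy as np

def penalty(x):
    # p(x) = x if x >= 0, else 2x: violated constraints are penalized twice as hard.
    return np.where(x >= 0, x, 2.0 * x)

def relaxed_objective(alpha, EV_policy, EV_other):
    """Penalized IRL objective over the sampled states S0.

    alpha:     (d,)             reward weights, R(s) = sum_i alpha_i * phi_i(s)
    EV_policy: (|S0|, d)        expected basis values after taking the policy action
    EV_other:  (|S0|, k-1, d)   expected basis values after each non-policy action
    """
    v_policy = EV_policy @ alpha              # (|S0|,)
    v_other = EV_other @ alpha                # (|S0|, k-1)
    diffs = v_policy[:, None] - v_other       # policy minus non-policy values
    worst = diffs.min(axis=1)                 # best competing non-policy action per state
    return penalty(worst).sum()
```

Maximizing this objective over α (with the coefficients bounded, e.g. |α_i| ≤ 1) can then be done with a linear program or a simple gradient method.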
**3. Optimal policy π is unknown. Behavior trajectories ζ are given**
**Scenario:** We don't know the exact policy of the agent, but we are able to observe trajectories ζ of states and actions generated by some optimal policy based on the unknown reward function.
**Goal:** Find the optimal reward function
**The algorithm works in the following steps:**
1. Compute the reward of a trajectory (its overall empirical value). The empirical value of a trajectory tells us how much reward that trajectory actually yielded.
* calculate the corresponding value estimate for each basis function ϕ_i:

$$\hat{V}_i(\zeta) = \sum_{t=0}^{\infty} \gamma^{t}\, \phi_i(s_t)$$
* the overall empirical value of a trajectory is

$$\hat{V}(\zeta) = \alpha_1 \hat{V}_1(\zeta) + \alpha_2 \hat{V}_2(\zeta) + \dots + \alpha_d \hat{V}_d(\zeta)$$
* Let's call the trajectory generated by the expert **ζπ∗**. Our goal is now to find parameters **αi** such that the expert trajectory **ζπ∗** yields a higher empirical value than trajectories generated by other policies:

$$\hat{V}(\zeta_{\pi^*}) \ \ge\ \hat{V}(\zeta_{\pi_i}) \qquad \text{for every competing policy } \pi_i$$
2. Find a policy πi which is optimal for the current reward function estimate.
3. For each new policy πi generated this way, generate a corresponding trajectory ζπi.
* In the first iteration our set of competing trajectories consists only of ζπ1. After m iterations of the algorithm the constraints have grown to

$$\hat{V}(\zeta_{\pi^*}) \ \ge\ \hat{V}(\zeta_{\pi_i}) \qquad i = 1, \dots, m$$
This algorithm then runs for some large number of iterations until a satisfactory reward function estimate is found. For this algorithm to work we have to make two important assumptions:
* We can generate trajectories for any given policy
* Given any reward function, we can generate a policy which is optimal for this reward function
Final objective function:

$$\max_{\alpha}\ \sum_{i=1}^{m} p\Big(\hat{V}(\zeta_{\pi^*}) - \hat{V}(\zeta_{\pi_i})\Big)$$

with the same penalty function p as in scenario 2.
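The loop below is a rough sketch of the whole procedure under the two assumptions above; `optimal_policy`, `rollout`, and the feature map `phi` are hypothetical helpers standing in for the planner and the simulator, and a simple projected gradient step replaces the linear program for fitting α.

```python
import numpy as np

def empirical_basis_values(traj, phi, gamma):
    """hat{V}_i(zeta) = sum_t gamma^t * phi_i(s_t) for each basis function i."""
    return sum(gamma ** t * phi(s) for t, s in enumerate(traj))

def trajectory_irl(expert_traj, phi, gamma, optimal_policy, rollout, n_iters=20, lr=0.1):
    d = len(phi(expert_traj[0]))
    f_expert = empirical_basis_values(expert_traj, phi, gamma)
    f_competitors = []                     # hat{V}(zeta_{pi_i}) feature vectors
    alpha = np.zeros(d)

    for _ in range(n_iters):
        # 1. Generate a competing trajectory from a policy optimal for the current reward.
        pi_i = optimal_policy(lambda s: alpha @ phi(s))
        f_competitors.append(empirical_basis_values(rollout(pi_i), phi, gamma))

        # 2. Re-fit alpha so the expert trajectory beats all competitors (penalized margin).
        grads = np.array([f_expert - f for f in f_competitors])
        diffs = grads @ alpha
        weights = np.where(diffs >= 0, 1.0, 2.0)      # p(x): violations count double
        alpha += lr * (weights[:, None] * grads).sum(axis=0)
        alpha = np.clip(alpha, -1.0, 1.0)             # bounded coefficients
    return alpha
```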
### Challenges so far:
* IRL is an under-determined problem, which means that there can be many reward functions that fit the observed demonstrations
* Observed demonstrations may not be precisely optimal
## Inverse RL (modern algorithms)
1. **Maximum Entropy Inverse RL**
**Goal**: handle the ambiguity (which reward function is the right one?) by using a probabilistic model of behavior.

Under the maximum entropy model, trajectories with higher reward are exponentially more likely:

$$p(\zeta \mid \theta) = \frac{1}{Z(\theta)} \exp\big(R_\theta(\zeta)\big), \qquad Z(\theta) = \sum_{\zeta} \exp\big(R_\theta(\zeta)\big)$$

**Maximum Entropy IRL Optimization**

The reward parameters θ are chosen to maximize the log-likelihood of the expert demonstrations:

$$\theta^* = \arg\max_{\theta} \sum_{\tilde\zeta \in \text{demos}} \log p(\tilde\zeta \mid \theta) = \arg\max_{\theta} \sum_{\tilde\zeta \in \text{demos}} \Big(R_\theta(\tilde\zeta) - \log Z(\theta)\Big)$$

With a linear reward R_θ(ζ) = θᵀf_ζ, the gradient of this objective is the difference between the empirical feature counts of the demonstrations and the expected feature counts under the current model, which can be computed from the state visitation frequencies D_s:

$$\nabla_\theta \mathcal{L} = \tilde{f} - \sum_{s} D_{s}\, f_{s}$$
**Applicable in settings**:
* where the environment dynamics (the transition model) are known
* where the state and action spaces are low-dimensional, so that we can compute the optimal policy and the state visitation frequencies (probabilities)

**Key note**: the algorithm solves an MDP in the inner loop
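The following is a minimal tabular sketch of that inner loop for the case of known dynamics: a soft value iteration (backward pass) gives the maximum-entropy policy, a forward pass gives the state visitation frequencies D_s, and the gradient is the difference of feature expectations. The function name, horizon handling, and normalization details are my own simplifications.

```python
import numpy as np
from scipy.special import logsumexp

def maxent_irl_gradient(features, T, expert_svf, theta, gamma=0.99, horizon=50, p0=None):
    """One gradient of the MaxEnt IRL log-likelihood with known dynamics (tabular sketch).

    features:   (n, d)     state feature matrix phi(s)
    T:          (n, k, n)  transition probabilities T[s, a, s']
    expert_svf: (n,)       empirical state visitation counts of the demonstrations
    theta:      (d,)       current reward parameters, r(s) = phi(s) . theta
    """
    n, k, _ = T.shape
    r = features @ theta
    p0 = np.ones(n) / n if p0 is None else p0

    # Backward pass: soft value iteration (this is the MDP solve in the inner loop).
    V = np.zeros(n)
    for _ in range(horizon):
        Q = r[:, None] + gamma * T @ V          # (n, k)
        V = logsumexp(Q, axis=1)
    policy = np.exp(Q - V[:, None])             # MaxEnt policy pi(a | s)

    # Forward pass: expected state visitation frequencies D_s under that policy.
    D = p0.copy()
    svf = np.zeros(n)
    for _ in range(horizon):
        svf += D
        D = np.einsum("s,sa,sap->p", D, policy, T)

    # Gradient: expert feature expectations minus model feature expectations.
    return features.T @ expert_svf - features.T @ svf
```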
**How can we**:
* handle unknown dynamics?
  * We cannot compute the partition function Z if the environment dynamics are unknown.
* avoid solving the MDP in the inner loop?

*******************************************************
* Use sampling to estimate the partition function Z
  * What distribution should we sample from?
  * How should we sample in order to get a good estimate of Z?
* Sample adaptively to estimate Z
  * As we get a better estimate of the reward function, we can construct a sampling distribution that is proportional to the exponential of the reward.
  * We use this sampling distribution as a policy and sample trajectories from it.
* This is called the **Guided Cost Learning** algorithm (a sketch of the estimator follows below).
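A minimal sketch of the importance-sampled estimate of Z, assuming we can evaluate the current reward R_θ(ζ) on sampled trajectories and know the log-density log q(ζ) of the sampling policy that generated them; the names below are placeholders.

```python
import numpy as np

def estimate_log_Z(reward_fn, samples, sample_log_q):
    """Importance-sampled estimate of log Z = log sum_zeta exp(R_theta(zeta)).

    samples:      trajectories drawn from the current sampling policy q
    sample_log_q: log q(zeta) for each sampled trajectory
    """
    # Z  ~  (1/N) * sum_i exp(R(zeta_i)) / q(zeta_i), computed stably in log space.
    log_w = np.array([reward_fn(z) for z in samples]) - np.asarray(sample_log_q)
    return log_w.max() + np.log(np.mean(np.exp(log_w - log_w.max())))
```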
****************************************************************
2. **Guided Cost Learning algorithm**

GCL alternates between two steps: it updates the reward (cost) parameters using the expert demonstrations together with samples drawn from the current policy, where the samples provide the importance-sampled estimate of the partition function Z, and it then improves the sampling policy with a policy optimization step so that its trajectory distribution moves closer to the exponential of the current reward. A high-level sketch follows below.
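A skeleton of that alternation, with `sample_trajectories`, `reward_grad_step`, and `policy_opt_step` as placeholders for the sampling, reward-update, and policy-optimization components described in the paper; this shows only the control flow, not the actual update rules.

```python
def guided_cost_learning(demos, init_policy, init_reward,
                         sample_trajectories, reward_grad_step, policy_opt_step,
                         n_iters=100):
    """Skeleton of the GCL loop: alternate reward updates and policy updates."""
    policy, reward = init_policy, init_reward
    for _ in range(n_iters):
        # 1. Generate trajectories from the current policy; they double as
        #    importance samples for estimating the partition function Z.
        samples = sample_trajectories(policy)

        # 2. Reward update: increase the reward of demonstrations and decrease the
        #    importance-weighted reward of samples (the gradient of the -log Z term).
        reward = reward_grad_step(reward, demos, samples)

        # 3. Policy update: one policy-optimization step against the current reward,
        #    moving the sampler toward the distribution proportional to exp(reward).
        policy = policy_opt_step(policy, reward)
    return reward, policy
```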
**Connection between GCL (Inverse RL) and GANs:**

The GCL sampler plays the role of a GAN generator and the reward (cost) function plays the role of the discriminator: the policy is trained to produce trajectories that look like the demonstrations, while the cost is trained to distinguish expert trajectories from sampled ones. In this view the discriminator takes a special form that depends on the current cost c_θ and the sampler's trajectory density q(τ):

$$D_\theta(\tau) = \frac{\exp(-c_\theta(\tau))}{\exp(-c_\theta(\tau)) + q(\tau)}$$

Generative Adversarial Imitation Learning (GAIL) exploits the same connection, but directly recovers an imitating policy rather than an explicit reward function.
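A tiny sketch of that discriminator form, computed in log space for numerical stability; `cost_tau` and `log_q_tau` are placeholders for the learned cost and the sampler's log-density of a trajectory.

```python
import numpy as np

def discriminator(cost_tau, log_q_tau):
    """D(tau) = exp(-c(tau)) / (exp(-c(tau)) + q(tau)) as a sigmoid of -(c + log q)."""
    return 1.0 / (1.0 + np.exp(cost_tau + log_q_tau))
```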
## References
1. https://thinkingwires.com/posts/2018-02-13-irl-tutorial-1.html
2. Algorithms for Inverse Reinforcement Learning. Andrew Y. Ng, Stuart Russell, ICML 2000
3. Generative Adversarial Imitation Learning. Jonathan Ho, Stefano Ermon, NeurIPS 2016
4. Guided Cost Learning: Deep Inverse Optimal Control via Policy Optimization. Chelsea Finn, Sergey Levine, Pieter Abbeel, ICML 2016
5. https://people.eecs.berkeley.edu/~cbfinn/_files/bootcamp_inverserl.pdf
6. https://web.stanford.edu/class/cs234/slides/2017/cs234_guest_lecture_maxent_invrl.pdf