Raj-Ghugare

@Raj-Ghugare

Joined on Apr 6, 2020

  • Reproduced results Details about the results, ablations, methodology, and hyper-parameters can be found in the following report. Report on reproducibility Introduction: In this paper the authors argue that the limited representational resources of model-based RL agents are better used to build models that are directly useful for value-based planning, rather than to predict the transition probabilities as accurately as possible. Their major contribution is the following theorem. Equivalence theorem: two models are value equivalent with respect to a set of functions and policies if they yield the same Bellman updates.
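A minimal sketch of what "yielding the same Bellman updates" means, with notation assumed rather than taken verbatim from the paper: two models $m$ and $\tilde{m}$ are value equivalent with respect to a set of policies $\Pi$ and functions $\mathcal{V}$ if the Bellman operators they induce agree on those sets.

```latex
% Sketch of the value-equivalence condition (notation assumed):
% m and \tilde{m} are value equivalent w.r.t. \Pi and \mathcal{V} iff
%   \mathcal{T}^{m}_{\pi} v = \mathcal{T}^{\tilde{m}}_{\pi} v
%   for all \pi \in \Pi and v \in \mathcal{V},
% where the Bellman operator induced by a model m is
\[
(\mathcal{T}^{m}_{\pi} v)(s) \;=\; \mathbb{E}_{a \sim \pi(\cdot \mid s)}
\Big[ r(s,a) + \gamma \sum_{s'} m(s' \mid s,a)\, v(s') \Big].
\]
```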
  • Author Raj Ghugare Multi-Armed-Bandit problem Introduction:
  • Author Raj Ghugare Introduction: The proposed algorithm promises iterative monotonic improvement for large non-linear policy parameterizations such as neural networks. Preliminaries: Every MDP is described by the tuple $(S,A,P,r,P_{0},\gamma)$ where $S \rightarrow$ set of all states (finite)
  • Author Raj Ghugare Note: These notes only cover the advantage estimation part. Other parts of the paper about previous state-of-the-art algorithms like TRPO are not reviewed. A brief outline: The paper addresses the problems of vanilla policy gradient methods, chiefly high variance in the gradient estimates.
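A minimal sketch of generalized advantage estimation for a single episode, assuming reward and value arrays and the usual $(\gamma, \lambda)$ weighting; this is an illustration, not the paper's exact pseudocode.

```python
import numpy as np

def gae_advantages(rewards, values, gamma=0.99, lam=0.95):
    """Sketch of generalized advantage estimation (GAE) for one episode.

    rewards: r_0 .. r_{T-1}
    values:  V(s_0) .. V(s_T)  (length T + 1, last entry is the bootstrap value)
    """
    T = len(rewards)
    advantages = np.zeros(T)
    gae = 0.0
    # Walk backwards, accumulating exponentially weighted TD errors.
    for t in reversed(range(T)):
        delta = rewards[t] + gamma * values[t + 1] - values[t]  # TD(0) error
        gae = delta + gamma * lam * gae
        advantages[t] = gae
    return advantages
```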
  • Author Raj Ghugare Preliminaries: This paper assumes access to a simulator of the MDP; this is what they mean by a restarting distribution. The definition of an MDP is standard. Every MDP is described by the tuple $(S,A,P,r,P_{0},\gamma)$ where $S \rightarrow$ set of all states (finite) $A \rightarrow$ set of all actions (finite)
  • Author Raj Ghugare A brief introduction: Prediction problems are often tackled by averaging the outputs of an ensemble of large neural networks. This provides high accuracy but carries computational requirements too high to be deployment friendly. This paper further develops an approach to store the "knowledge" of such larger models in a small model which can be deployed to a large number of users. About the simplest approach: The first obvious thing that comes up when thinking about the knowledge of a neural network is its parameters, which makes it difficult to think of ways to transfer knowledge to a smaller model, because the smaller model can have only so many parameters. But the real objective of a neural network is to generalize well on previously unseen data. If a smaller neural network is taught how to generalize in the same way as these large learned ensembles of models, it will perform better than if it were trained directly on the same dataset as the larger model.
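One common way to make this concrete is the soft-target loss from the distillation literature: the small model is trained to match the large model's tempered softmax as well as the true labels. A minimal sketch assuming PyTorch; the function and argument names are illustrative.

```python
import torch
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, labels, T=2.0, alpha=0.5):
    """Sketch of training a small model on a large model's soft targets."""
    # Soft term: KL divergence between tempered distributions, scaled by T^2
    # so gradient magnitudes stay comparable across temperatures.
    soft = F.kl_div(
        F.log_softmax(student_logits / T, dim=-1),
        F.softmax(teacher_logits / T, dim=-1),
        reduction="batchmean",
    ) * (T * T)
    # Hard term: standard cross-entropy against the ground-truth labels.
    hard = F.cross_entropy(student_logits, labels)
    return alpha * soft + (1.0 - alpha) * hard
```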
  • Author Raj Ghugare Introduction: The main reason behind the success of Deep Q-Networks was their ability to learn directly from pixels. Before the emergence of DQN it was considered difficult to approximate high-dimensional non-linear policies. The core ideas behind DQN were: Off-policy learning: the model learns from experiences randomly sampled from a replay memory. It uses a target network for temporal-difference-like updates.
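A minimal sketch of how the replay memory and target network fit together in the loss, assuming PyTorch and illustrative names for the batch fields and networks.

```python
import torch
import torch.nn.functional as F

def dqn_loss(batch, q_net, target_net, gamma=0.99):
    """Sketch of the DQN temporal-difference loss on a replayed batch."""
    states, actions, rewards, next_states, dones = batch
    # Q-values of the actions actually taken, from the online network.
    q_taken = q_net(states).gather(1, actions.unsqueeze(1)).squeeze(1)
    # Bootstrapped targets come from the frozen target network.
    with torch.no_grad():
        next_q = target_net(next_states).max(dim=1).values
        targets = rewards + gamma * (1.0 - dones) * next_q
    return F.mse_loss(q_taken, targets)
```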
  • Author Raj Ghugare Table Of Contents: Introduction Appendix Proof of Bellman Equation Existence and Uniqueness of Value Function
  • Author Raj Ghugare What have we learnt till now: In the value-function space we know that the operator $L$ has a unique fixed point, $V^*$, which we also call the true value of the states. Using Banach's fixed point theorem, the completeness of the value-function space, and the fact that $L$ is a contraction, we can say that the sequence $v^{N+1} = L^{N+1}v_0$ converges to $V^*$ as $N$ becomes very large, where $L$ is: $$Lv = \max_\pi\,[r_\pi + \gamma P^{\pi} v]$$
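A minimal sketch of iterating $L$ numerically, assuming a finite MDP given as arrays `P[a, s, s']` and `r[s, a]` (illustrative names, not from the notes).

```python
import numpy as np

def value_iteration(P, r, gamma=0.9, tol=1e-8):
    """Sketch of value iteration: repeatedly apply Lv = max_a [r + gamma * P v].

    By Banach's fixed point theorem the iterates converge to V*.
    """
    n_actions, n_states, _ = P.shape
    v = np.zeros(n_states)
    while True:
        # q[s, a] = r[s, a] + gamma * sum_{s'} P[a, s, s'] * v[s']
        q = r + gamma * np.einsum("ast,t->sa", P, v)
        v_new = q.max(axis=1)
        if np.max(np.abs(v_new - v)) < tol:
            return v_new
        v = v_new
```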
  • Author Raj Ghugare Introduction: Bellman equations were one of the first ways to formalize MDPs and have been the backbone of theoretical reinforcement learning ever since. Most of you who have seen the first few David Silver lectures or read introductory RL blogs will be familiar with the intuition behind Bellman equations and why they seem to work or converge. I want to go through the mathematical details of the proofs of the basic algorithms. I am going to assume that the reader is familiar with the basic concepts of Value functions - $V(s)$
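For reference, a sketch of the Bellman expectation equation in the tuple notation $(S,A,P,r,P_{0},\gamma)$ used in these notes, assuming rewards of the form $r(s,a)$.

```latex
\[
V^{\pi}(s) \;=\; \sum_{a \in A} \pi(a \mid s)
\Big[ r(s,a) + \gamma \sum_{s' \in S} P(s' \mid s,a)\, V^{\pi}(s') \Big]
\]
```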
  • Author Raj Ghugare Chapter 4 - Dynamic Programming: Exercise 4.1: $q_{\pi}(11,\text{down}) = -1 + V_{T} = -1$ $q_{\pi}(7,\text{down}) = -1 + V_{\pi}(11)$
  • Author Raj Ghugare Chapter 2 - Multi-armed Bandits: Exercise 2.1: The probability of the greedy action being selected in this case is $0.75$. Exercise 2.2:
  • Author Raj Ghugare Finite Markov Decision Processes Exercise 3.4: s a
  • Author Raj Ghugare Chapter 5 - Monte Carlo Methods: Exercise 5.1: Figure 5.1 corresponds to the value function under the policy that sticks only if the agent has a sum of 20 or 21. If the agent has 20 or 21 and sticks, the majority of the possible outcomes lead to a reward of +1 for the agent, hence the values in the last two rows are near 1. For any player sum, the value of the states falls off slightly where the dealer is showing an ace. This is because the dealer can use the ace as a usable ace, and this increases the number of outcomes that go in the dealer's favour. For example, even if the player sticks at a sum of 20 and the dealer is showing an ace, the dealer wins if his second (hidden) card is a 10, jack, queen, or king, which would not have been the case otherwise.
  • Author Raj Ghugare Research Paper: Policy Gradient Methods for Reinforcement Learning with Function Approximation Review and research notes Introduction Policy-based algorithms are the ones where we parameterize our policy and optimize the parameters based on an objective function (the expected return of the policy).
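A sketch of the paper's central result, the policy gradient theorem, in commonly used notation (the paper's own symbols may differ slightly): the gradient of the performance objective $\rho$ with respect to the policy parameters $\theta$ involves no gradient of the state distribution $d^{\pi}$.

```latex
\[
\frac{\partial \rho}{\partial \theta}
\;=\; \sum_{s} d^{\pi}(s) \sum_{a}
\frac{\partial \pi(s,a)}{\partial \theta}\, Q^{\pi}(s,a)
\]
```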
  • Author Raj Ghugare Temporal Difference Learning Exercise 6.1: $V_t$ is the array of state values used at time $t$ in the TD error and the TD update. Derivation:
  • Author Raj Ghugare Review and research notes Introduction Policy gradient methods are directly aligned with the goal of RL problems, i.e. finding a policy that obtains the highest expected returns. The main idea behind policy gradient methods is to parameterize a policy in different ways and then optimize those parameters to maximize the expected return.
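A minimal sketch of the simplest instance of this idea, a REINFORCE-style (score-function) loss, assuming PyTorch and a `policy` that returns a `torch.distributions.Categorical` over actions (illustrative names).

```python
import torch

def reinforce_loss(policy, states, actions, returns):
    """Sketch of the REINFORCE objective: weight log-probabilities of the
    actions taken by the returns that followed them."""
    dist = policy(states)                 # Categorical over actions
    log_probs = dist.log_prob(actions)
    # Minimizing this loss ascends an unbiased estimate of the policy gradient.
    return -(returns * log_probs).mean()
```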
  • Author Raj Ghugare Introduction: Multi-armed bandits is a rich area of research on its own, but it is often treated by the RL community as an immediate RL problem. It scales down the more intricate details of full RL problems while maintaining the same gist. The basic setup of a K-armed bandit problem is as follows: there are $K$ arms, and $A = \{1, 2, \ldots, K\}$ is the arm set. Each time an arm is selected, a reward is generated, sampled from a distribution specific to that arm; this means that every arm has its own reward distribution.
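A minimal sketch of this setup with sample-average value estimates and an $\epsilon$-greedy policy, assuming Gaussian reward distributions purely for illustration.

```python
import numpy as np

def epsilon_greedy_bandit(true_means, steps=1000, epsilon=0.1, seed=0):
    """Sketch of a K-armed bandit: each arm has its own reward distribution,
    and the agent learns incremental sample-average value estimates."""
    rng = np.random.default_rng(seed)
    k = len(true_means)
    q_est = np.zeros(k)   # estimated value of each arm
    counts = np.zeros(k)  # number of times each arm has been pulled
    for _ in range(steps):
        # Explore with probability epsilon, otherwise exploit the best estimate.
        a = rng.integers(k) if rng.random() < epsilon else int(np.argmax(q_est))
        reward = rng.normal(true_means[a], 1.0)  # arm-specific reward sample
        counts[a] += 1
        q_est[a] += (reward - q_est[a]) / counts[a]  # incremental average
    return q_est
```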
  • Author Raj Ghugare Once you've understood the proof of convergence of value iteration it is fairly easy to understand why policy iteration also converges. Let's dive right into the algorithm: set $n$ to 0 and select an arbitrary deterministic policy $\pi_n$. (Policy Evaluation) Evaluate $v_{\pi_n}$ by applying $L_{\pi_n}$ iteratively to any point in the value-function space; as the number of iterations becomes large we converge to $v_{\pi_n}$. We can also find $v_{\pi_n}$ using the equation:
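A minimal sketch of the full loop, assuming a finite MDP given as arrays `P[a, s, s']` and `r[s, a]` (illustrative names): iterative evaluation with $L_{\pi_n}$ followed by greedy improvement, repeated until the policy stops changing.

```python
import numpy as np

def policy_iteration(P, r, gamma=0.9, eval_tol=1e-8):
    """Sketch of policy iteration: evaluate the current policy, then improve
    it greedily, until it no longer changes."""
    n_actions, n_states, _ = P.shape
    pi = np.zeros(n_states, dtype=int)  # arbitrary deterministic policy
    while True:
        # Policy evaluation: iterate v <- r_pi + gamma * P_pi v.
        v = np.zeros(n_states)
        while True:
            r_pi = r[np.arange(n_states), pi]   # r_pi[s] = r[s, pi(s)]
            p_pi = P[pi, np.arange(n_states)]   # P_pi[s, s'] = P[pi(s), s, s']
            v_new = r_pi + gamma * p_pi @ v
            done = np.max(np.abs(v_new - v)) < eval_tol
            v = v_new
            if done:
                break
        # Policy improvement: act greedily w.r.t. the evaluated value function.
        q = r + gamma * np.einsum("ast,t->sa", P, v)
        pi_new = q.argmax(axis=1)
        if np.array_equal(pi_new, pi):
            return pi, v
        pi = pi_new
```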
  • Author Raj Ghugare Topic of discussion: PAC Bounds for Multi-Armed Bandit and Markov Decision Process Review and research notes Introduction