# Reinforcement learning Udacity course

###### tags: honours, reinforcement learning

# Lesson 2

**Reinforcement learning:** Similar to supervised learning, there is a stream of pairs of data, but instead of $(x, y)$ pairs with $y = f(x)$, here we get $(x, z)$ pairs.

The world: gridworld, a nice approximation of the entire universe. Actions: up, down, left, right.

* If nothing can go wrong, we just work out in advance what to do.
* In the stochastic version there is an 80% chance of going in the intended direction and a 20% chance of going at a right angle to it (10% each side).
* We have to re-plan, or incorporate the uncertainty into the probabilities.

## Markov Decision Process

* State: a set of different ways the world can be.
* Action: a thing that you can do in a particular state (up, down, left, right); the available actions can also depend on the state.
* Model: $T(state_1, action, state_2)$ — the rules of the game, which don't change. The probability of going from $state_1$ to $state_2$ when taking that action (see the sketch at the end of these notes).

**Markovian property:** only the present matters; the probability that you will end up in $state_2$ depends only on $state_1$ (and the action).
* The current state can be made to remember everything that you did before (by folding history into the state).
* The world is stationary: the rules don't change, the physics don't change.

* Reward: if you are in a state, you get a reward based on the usefulness of being in that state. You can also get a reward for being in a state and taking an action.

**Solution to an MDP:**
* Policy: for every state, it tells you what action to take. $\pi^*$ is the optimal policy, the one under which you get the most reward.
* The problem includes states, actions and rewards. From that, we need to find the optimal policy.
* "If I'm in this state, what is the best next action I can take?"

### Rewards

* Delayed reward: minor changes matter; take actions that set you up for other actions, which set you up for yet other actions.
  * Kind of like chess: early moves that seem reasonable can turn out to be putting you at a disadvantage, and you don't really know until much later.

**Credit assignment problem:** rewards arise from a sequence of events over time, so the problem is temporal — which earlier actions deserve credit (or blame) for the reward you get now?

* Small negative rewards at each step encourage you to end the game.
* They can also make it worth going the long way around, to avoid getting into a position that could land you a big negative reward (due to stochasticity).

## Sequence of rewards

The number of timesteps you have matters (*finite horizon*): it changes the calculation (the expected reward), and thus it changes the policy. Even in the same state, with a different number of steps left you might choose to go somewhere else.

**Utility of sequences (stationarity of preferences):**
$$U(state_1, state_2, state_3, \ldots) > U(state_1, state_2', state_3', \ldots) \implies U(state_2, state_3, \ldots) > U(state_2', state_3', \ldots)$$
This holds when the world is stationary, or with infinite horizons.

Existential dilemma of living forever: if you only get positive rewards and simply add them up, the utility of any infinite sequence is infinite, so it doesn't matter where we go.

## Assumptions

Discounting fixes this:
$$U(state_1, state_2, state_3, \ldots) = \sum_{t=1}^{\infty} \gamma^{t-1} R(state_t) \le \sum_{t=1}^{\infty} \gamma^{t-1} R_{max} = \frac{R_{max}}{1-\gamma}, \qquad 0 \le \gamma < 1$$
so the utility of an infinite sequence stays finite (a short numerical sketch is at the end of these notes).

* Policies

Reward: immediate feedback.
Utility: long-term feedback, how good it is in the long term.
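
---

A quick sketch (not from the course) of the gridworld transition model $T(state, action, state')$ described above, assuming an 80% chance of moving as intended and 10% each of slipping to either perpendicular direction. The grid size, blocked cell and `transition` helper are illustrative choices, not the course's exact setup.

```python
import random

ACTIONS = {
    "up": (-1, 0),
    "down": (1, 0),
    "left": (0, -1),
    "right": (0, 1),
}

# Each action can slip to the two directions at a right angle to it.
PERPENDICULAR = {
    "up": ["left", "right"],
    "down": ["left", "right"],
    "left": ["up", "down"],
    "right": ["up", "down"],
}

ROWS, COLS = 3, 4
WALLS = {(1, 1)}  # assumed blocked cell, as in the classic 3x4 gridworld


def transition(state, action):
    """Sample a next state with probabilities 0.8 intended / 0.1 / 0.1 slips."""
    roll = random.random()
    if roll < 0.8:
        actual = action
    elif roll < 0.9:
        actual = PERPENDICULAR[action][0]
    else:
        actual = PERPENDICULAR[action][1]

    dr, dc = ACTIONS[actual]
    r, c = state[0] + dr, state[1] + dc
    # Bumping into a wall or the grid edge leaves you where you are.
    if not (0 <= r < ROWS and 0 <= c < COLS) or (r, c) in WALLS:
        return state
    return (r, c)


if __name__ == "__main__":
    # From (2, 0), trying to go "up" usually works, but sometimes slips sideways.
    counts = {}
    for _ in range(10_000):
        s = transition((2, 0), "up")
        counts[s] = counts.get(s, 0) + 1
    print(counts)  # roughly 80% (1, 0), ~10% stay put (left bump), ~10% (2, 1)
```

Because the slip probabilities are fixed properties of the world, this model is stationary and Markovian: the sampled next state depends only on the current state and the chosen action.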
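A small numerical sketch (not from the course) of the discounted-utility formula above, assuming a constant reward stream; the values of `gamma` and `r_max` are arbitrary.

```python
def discounted_utility(rewards, gamma):
    """Sum of gamma^t * R(state_t) for t = 0, 1, 2, ... over a finite prefix."""
    return sum(gamma ** t * r for t, r in enumerate(rewards))


gamma = 0.9
r_max = 1.0

# A long truncated sum of a constant reward stream approaches R_max / (1 - gamma).
approx = discounted_utility([r_max] * 1000, gamma)
bound = r_max / (1 - gamma)
print(approx, bound)  # both close to 10.0
```

This is just the geometric series: with $0 \le \gamma < 1$ the discounted sum of even an infinite, maximally rewarding sequence is bounded by $R_{max}/(1-\gamma)$, which resolves the "living forever" dilemma.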