# Reinforcement learning Udacity course
###### tags: honours, reinforcement learning
# Lesson 2
**Reinforcement learning:** Similar to supervised learning, there is a stream of data pairs, but instead of $(x, y)$ pairs with $y = f(x)$ we get $(x, z)$ pairs, where $z$ is a reinforcement signal rather than the correct label.
The world: gridworld, a nice approximation of the entire universe.
Actions: up, down, left, right
* If nothing can go wrong, we can just work out the whole plan in advance and execute it
* Here actions are stochastic: an 80% chance of executing the intended move correctly, and a 20% chance of moving at a right angle to it
* So we either have to re-plan when things go wrong, or incorporate the uncertainty into the plan from the start (see the sketch after this list)
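A minimal sketch of that stochastic action model. The even 10%/10% split of the "right angle" probability between the two perpendicular moves is an assumption here (the notes only say 20% at a right angle), and the coordinate convention is made up for illustration:

```python
import random

# Action effects on a (row, col) grid; the grid itself is left abstract here.
ACTIONS = {
    "up":    (-1, 0),
    "down":  ( 1, 0),
    "left":  ( 0, -1),
    "right": ( 0, 1),
}

# For each intended action, the two actions at a right angle to it.
PERPENDICULAR = {
    "up":    ["left", "right"],
    "down":  ["left", "right"],
    "left":  ["up", "down"],
    "right": ["up", "down"],
}

def sample_outcome(intended, p_intended=0.8):
    """Return the action that actually gets executed.

    With probability 0.8 the intended move happens; the remaining 0.2 is
    split evenly between the two perpendicular moves (assumed split).
    """
    roll = random.random()
    if roll < p_intended:
        return intended
    side = PERPENDICULAR[intended]
    return side[0] if roll < p_intended + (1 - p_intended) / 2 else side[1]

# The same command can produce different moves, so a fixed open-loop plan
# is not enough: we need something that reacts to the state we end up in.
print([sample_outcome("up") for _ in range(10)])
```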
## Markov Decision Process
* States: the set of different configurations the world can be in.
* Actions: the things you can do in a particular state (up, down, left, right); in general the set of available actions can depend on the state, $A(s)$.
* Model: $T(s, a, s') = P(s' \mid s, a)$, the rules of the game (they don't change): the probability of ending up in state $s'$ when taking action $a$ in state $s$.
**Markovian property:** only the present matters: the probability of ending up in $s'$ depends only on the current state $s$ and the action taken, not on how you got there.
* you can always fold history into the current state, so the state can "remember" everything relevant that happened before
* the world is stationary: the rules and the physics don't change over time
* Reward: $R(s)$, the reward you get for being in a state, reflecting how useful it is to be there. There are also variants that depend on the action taken, $R(s, a)$, or on the transition, $R(s, a, s')$. (A toy sketch follows this list.)
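A hedged sketch of how those pieces fit together for a tiny, made-up example (the two-state chain and all its numbers are invented for illustration, not from the course):

```python
# A toy two-state MDP written out explicitly.
STATES = ["s0", "s1"]
ACTIONS = ["stay", "go"]

# Model T(s, a, s') = P(s' | s, a): a distribution over next states
# for every (state, action) pair. These are the fixed "rules of the game".
T = {
    ("s0", "stay"): {"s0": 1.0},
    ("s0", "go"):   {"s0": 0.2, "s1": 0.8},  # stochastic, like the gridworld moves
    ("s1", "stay"): {"s1": 1.0},
    ("s1", "go"):   {"s1": 0.2, "s0": 0.8},
}

# Reward R(s): how useful it is to be in each state.
R = {"s0": 0.0, "s1": 1.0}

# The Markov property is exactly what this table encodes: the distribution
# over s' depends only on (s, a), never on how the agent arrived at s.
assert all(abs(sum(dist.values()) - 1.0) < 1e-9 for dist in T.values())
```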
**Solution to an MDP:**
* Policy: a function $\pi(s)$ that, for every state, tells you which action to take. $\pi^*$ is the optimal policy: the one that maximises the reward you collect in the long run.
* What we observe is states, the actions taken and the rewards received; from that we need to find the optimal policy.
* A policy answers: if I'm in this state, what is the next best action I can take? (A minimal sketch follows this list.)
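A minimal sketch of a policy as a lookup from states to actions (the particular cells and moves are made up for illustration):

```python
# A deterministic policy: for every state, the action to take there.
policy = {
    (0, 0): "right",
    (0, 1): "right",
    (0, 2): "up",
}

def act(state):
    """The action the agent takes in `state` under this policy.

    Note it only needs the current state, not the history -- which is
    all the Markov property requires.
    """
    return policy[state]

print(act((0, 1)))  # -> "right"
```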
### Rewards
* Delayed reward: minor changes matter; you take actions that set you up for other actions, which set you up for yet more actions.
* Kind of like chess: early moves that seem reasonable can turn out to have put you at a disadvantage much later.
* At the time, you don't really know which move was responsible.
**Credit assignment problem:** given a sequence of events over time (hence *temporal* credit assignment), work out which earlier actions deserve the credit or blame for the reward you eventually receive.
* Small negative rewards on every step encourage you to end the game sooner rather than later.
* Conversely, it can be worth going the long way around to avoid positions where the stochasticity could land you in a large negative reward (see the worked example below).
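A hypothetical worked comparison, with all numbers invented for illustration: suppose each step costs $-0.04$, the short route takes 3 steps but passes a square with a 20% chance of slipping into a $-1$ terminal state, while the long route takes 5 safe steps to the $+1$ goal. Ignoring discounting,

$$U_{\text{short}} \approx 3(-0.04) + 0.8(+1) + 0.2(-1) = 0.48, \qquad U_{\text{long}} = 5(-0.04) + 1 = 0.80,$$

so the longer, safer route is worth it despite the extra step penalties.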
## Sequence of rewards
The number of timesteps you have matters (*finite horizon*): it changes the calculation of the expected reward, and therefore the policy; even from the same state, with less time left you might choose to go somewhere else.
**Utility of sequences:**
$$U(s_0, s_1, s_2, \ldots) > U(s_0, s_1', s_2', \ldots) \Rightarrow U(s_1, s_2, \ldots) > U(s_1', s_2', \ldots)$$
This property (stationarity of preferences) holds when the world is stationary / the horizon is infinite.
Existential dilemma of living forever: if you only ever receive positive rewards, the utility of any infinite sequence is infinite, so every choice looks equally good and it doesn't matter where we go.
## Assumptions
$$U(s_0, s_1, s_2, \ldots) = \sum_{t=0}^{\infty} \gamma^t R(s_t), \qquad 0 \leq \gamma < 1$$
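With every single-step reward bounded by some $R_{\max}$, discounting turns the sum into a geometric series, so it stays finite; this is what resolves the infinite-horizon dilemma above:

$$\sum_{t=0}^{\infty} \gamma^t R(s_t) \;\leq\; \sum_{t=0}^{\infty} \gamma^t R_{\max} = \frac{R_{\max}}{1-\gamma}$$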
### Policies
* Reward: the immediate feedback for being in a state.
* Utility: the long-term feedback; how good things are in the long run, accumulating the (discounted) rewards from here onwards (spelled out in symbols below).
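In symbols (the standard definitions, stated here to tie the pieces together): the utility of a state under a policy $\pi$ is the expected discounted sum of rewards starting from that state, and the optimal policy is the one that maximises it:

$$U^{\pi}(s) = E\left[\sum_{t=0}^{\infty} \gamma^t R(s_t) \,\Big|\, \pi, s_0 = s\right], \qquad \pi^* = \arg\max_{\pi} U^{\pi}(s)$$

Reward $R(s)$ is the immediate feedback; utility $U(s)$ folds in everything that follows.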