# RL Midterm Notes
## Quick Notes for self
* Deterministic or not? "Randomly choosing between two policies w/ identical max value" is NOT deterministic
* Proof that an algo converges to the true state or state-action values
* Planning: figuring out what to do with a model, without needing to try actions in the real environment
* With/without model: tradeoff is on how much data and how much compute each method uses
* L'Hospital --> Don't take derivatives all the way to the end, just do it enough times for the numerator/denominator to make sense!
* For off policy (normally) the starting timestamp for V --> t
* For off policy (normally) the starting timestamp for Q --> t + 1
* The direction of progression of $t$ and the state labels is sometimes reversed!
## Chapter 1
:)
## Chapter 2: Bandits
### Incremental Implementation of Bandits

update rule:

$Q_{n+1} = Q_n + \frac{1}{n}\left(R_n - Q_n\right)$
### A simple bandit algorithm

* The update accounts for how often each action has appeared: the step size for action $a$ is $1/N(a)$ (see the sketch below)
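A minimal sketch of the simple bandit loop as described here, assuming a hypothetical 3-armed testbed (`bandit`, the arm probabilities, and all variable names are illustrative):

```python
import random

def bandit(a):
    # Hypothetical 3-armed testbed: arm a pays 1 with a fixed (unknown) probability.
    return 1.0 if random.random() < [0.1, 0.5, 0.8][a] else 0.0

k, eps, steps = 3, 0.1, 1000
Q = [0.0] * k   # value estimates
N = [0] * k     # how many times each action has been taken

for _ in range(steps):
    # epsilon-greedy action selection
    a = random.randrange(k) if random.random() < eps else max(range(k), key=lambda i: Q[i])
    r = bandit(a)
    N[a] += 1
    Q[a] += (r - Q[a]) / N[a]   # incremental sample-average update, step size 1/N(a)

print(Q)   # estimates should drift toward [0.1, 0.5, 0.8]
```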
### Exponential recency-weighted average

* use a constant step-size parameter to give more weight to recent rewards than to long-past rewards
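The only change for the nonstationary case is swapping the $1/N(a)$ step size for a constant $\alpha$; a sketch of just that update (names illustrative):

```python
def update(q, reward, alpha=0.1):
    # Constant step size: the weight on a reward n steps old decays as (1 - alpha)**n,
    # so recent rewards dominate (useful for nonstationary bandits).
    return q + alpha * (reward - q)

# usage inside the bandit loop: Q[a] = update(Q[a], r)
```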
### UCB

* UCB adds an exploration bonus to account for the uncertainty (variance) in action $a$'s value estimate (see the sketch below)
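A sketch of UCB action selection, assuming the same `Q`, `N`, and time step `t` as in the bandit loop above (illustrative names):

```python
import math

def ucb_action(Q, N, t, c=2.0):
    # Any untried arm first: its uncertainty bonus is effectively infinite.
    for a, n in enumerate(N):
        if n == 0:
            return a
    # Otherwise maximize estimate plus uncertainty bonus c * sqrt(ln t / N(a)).
    return max(range(len(Q)), key=lambda a: Q[a] + c * math.sqrt(math.log(t) / N[a]))
```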
### Gradient Bandit
soft-max distribution:

$\pi_t(a) \doteq \Pr\{A_t = a\} = \frac{e^{H_t(a)}}{\sum_{b=1}^{k} e^{H_t(b)}}$

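A sketch of the gradient-bandit preference update under this soft-max policy, using a baseline such as the average reward so far (variable names are illustrative):

```python
import math

def softmax(H):
    exps = [math.exp(h) for h in H]      # action preferences -> probabilities
    total = sum(exps)
    return [e / total for e in exps]

def gradient_bandit_update(H, a, reward, baseline, alpha=0.1):
    """Nudge preferences after taking action a; baseline is e.g. the average reward so far."""
    pi = softmax(H)
    for b in range(len(H)):
        if b == a:
            H[b] += alpha * (reward - baseline) * (1 - pi[b])
        else:
            H[b] -= alpha * (reward - baseline) * pi[b]
```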
## Chapter 3: MDPs
### Returns across different timestamps
* General equation:

$G_t \doteq R_{t+1} + \gamma R_{t+2} + \gamma^2 R_{t+3} + \dots = \sum_{k=0}^{\infty} \gamma^k R_{t+k+1}$

* Infinite timestamps (if every reward is 1):

$G_t = \sum_{k=0}^{\infty} \gamma^k = \frac{1}{1-\gamma}$
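A tiny check of the return definition, computing $G_0$ for a made-up reward list and comparing the all-ones case against $1/(1-\gamma)$:

```python
gamma = 0.9
rewards = [1.0, 1.0, 1.0, 1.0, 1.0]          # R_1, R_2, ... (made-up)

G0 = sum(gamma**k * r for k, r in enumerate(rewards))
print(G0)                # finite-horizon discounted return
print(1 / (1 - gamma))   # limit for an endless stream of reward 1: 10.0
```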
### Value function

$v_\pi(s) \doteq \mathbb{E}_\pi\left[G_t \mid S_t = s\right]$
### Action-value function

$q_\pi(s, a) \doteq \mathbb{E}_\pi\left[G_t \mid S_t = s, A_t = a\right]$
### The Bellman Equation

$v_\pi(s) = \sum_a \pi(a \mid s) \sum_{s', r} p(s', r \mid s, a)\left[r + \gamma v_\pi(s')\right]$
### Bellman Optimality Equation

$v_*(s) = \max_a \sum_{s', r} p(s', r \mid s, a)\left[r + \gamma v_*(s')\right]$

$q_*(s, a) = \sum_{s', r} p(s', r \mid s, a)\left[r + \gamma \max_{a'} q_*(s', a')\right]$
## Chapter 4 (DP)
### Policy Prediction

$V(s) \leftarrow \sum_a \pi(a \mid s) \sum_{s', r} p(s', r \mid s, a)\left[r + \gamma V(s')\right]$

^^ For this to work you need the model (i.e. transitions, rewards, states) of the environment
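A sketch of iterative policy evaluation on a toy MDP, assuming the model is given as `P[s][a] = [(prob, next_state, reward), ...]` and the policy as `pi[s][a]` (all names and the toy MDP are illustrative):

```python
def policy_evaluation(P, pi, gamma=0.9, theta=1e-6):
    """Iterative policy evaluation; needs the full model P[s][a] = [(prob, s_next, reward), ...]."""
    V = {s: 0.0 for s in P}
    while True:
        delta = 0.0
        for s in P:
            v = sum(pi[s][a] * sum(p * (r + gamma * V[s2]) for p, s2, r in P[s][a])
                    for a in P[s])
            delta = max(delta, abs(v - V[s]))
            V[s] = v
        if delta < theta:
            return V

# Toy 2-state MDP: action 0 stays (reward 0), action 1 switches states (reward 1).
P = {0: {0: [(1.0, 0, 0.0)], 1: [(1.0, 1, 1.0)]},
     1: {0: [(1.0, 1, 0.0)], 1: [(1.0, 0, 1.0)]}}
pi = {s: {0: 0.5, 1: 0.5} for s in P}
print(policy_evaluation(P, pi))   # both states evaluate to 0.5 / (1 - 0.9) = 5.0
```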
### Policy improvement

$\pi'(s) \doteq \arg\max_a q_\pi(s, a) = \arg\max_a \sum_{s', r} p(s', r \mid s, a)\left[r + \gamma v_\pi(s')\right]$
### Policy Iteration and Value Iteration


### Asynchronous DP
Asynchronous DP algorithms are in-place iterative DP algorithms that are not organized in terms of systematic sweeps of the state set. These algorithms update the values of states in any order whatsoever, using whatever values of other states happen to be available.
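A sketch of value iteration using in-place sweeps, in the spirit of the asynchronous DP note above, reusing the same hypothetical model format `P[s][a] = [(prob, next_state, reward), ...]`:

```python
def value_iteration(P, gamma=0.9, theta=1e-6):
    """In-place sweeps: each state's backup immediately uses the newest values of other states."""
    V = {s: 0.0 for s in P}
    while True:
        delta = 0.0
        for s in P:   # any ordering of states works, per asynchronous DP
            v = max(sum(p * (r + gamma * V[s2]) for p, s2, r in P[s][a]) for a in P[s])
            delta = max(delta, abs(v - V[s]))
            V[s] = v
        if delta < theta:
            break
    # extract the greedy policy from the converged values
    return V, {s: max(P[s], key=lambda a: sum(p * (r + gamma * V[s2]) for p, s2, r in P[s][a]))
               for s in P}
```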
## Chapter 5: Monte Carlo methods
### Monte Carlo Prediction

* Note that "first visit" == the first time step at which a state appears in the episode
* But the return $G$ is accumulated BACKWARDS through the episode (see the sketch below)
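A sketch of first-visit MC prediction for one episode, showing the backwards return computation; `episode` is a list of `(state, reward)` pairs where each reward follows its state (the format and names are illustrative):

```python
from collections import defaultdict

def first_visit_mc_update(episode, V, returns, gamma=0.9):
    """episode: [(S_0, R_1), (S_1, R_2), ...] where each reward follows its state."""
    first_visit = {}
    for t, (s, _) in enumerate(episode):
        first_visit.setdefault(s, t)          # remember the earliest index of each state
    G = 0.0
    for t in reversed(range(len(episode))):   # accumulate the return BACKWARDS
        s, r = episode[t]
        G = gamma * G + r
        if first_visit[s] == t:               # only the first visit contributes
            returns[s].append(G)
            V[s] = sum(returns[s]) / len(returns[s])

V, returns = defaultdict(float), defaultdict(list)
first_visit_mc_update([("A", 1.0), ("B", 0.0), ("A", 2.0)], V, returns)
print(dict(V))
```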
### Monte-Carlo ES

### On-Policy MC

### Importance Sampling
importance-sampling ratio:

$\rho_{t:T-1} \doteq \prod_{k=t}^{T-1} \frac{\pi(A_k \mid S_k)}{b(A_k \mid S_k)}$
### Off-Policy w/ weighted sampling:
* Reduces variance but becomes biased
* The scaling factor on any single return is guaranteed to be between 0 and 1



### Ordinary importance sampling:

Ordinary importance sampling is unbiased whereas weighted importance sampling is biased (though the bias converges asymptotically to zero). On the other hand, the variance of ordinary importance sampling is in general unbounded because the variance of the ratios can be unbounded, whereas in the weighted estimator the largest weight on any single return is one.
Intuition: how much more often would this trajectory have happened under $\pi'$ as opposed to under $\pi$?
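A sketch contrasting ordinary and weighted importance-sampling estimates from a batch of returns and their ratios $\rho$ (the inputs are made up):

```python
def ordinary_is(returns, rhos):
    # Unbiased, but the variance can blow up because the rhos themselves are unbounded.
    return sum(rho * g for rho, g in zip(rhos, returns)) / len(returns)

def weighted_is(returns, rhos):
    # Biased (bias -> 0 with more data), but each return's weight rho_i / sum(rho) is in [0, 1].
    total = sum(rhos)
    return sum(rho * g for rho, g in zip(rhos, returns)) / total if total > 0 else 0.0

returns, rhos = [10.0, 2.0, 4.0], [3.0, 0.5, 1.0]   # made-up returns and IS ratios
print(ordinary_is(returns, rhos), weighted_is(returns, rhos))
```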
## Chapter 6: Temporal Difference Methods
### TD(0) Prediction

* (Definition) A Markov reward process, or MRP, is a Markov decision process without actions. We will often use MRPs when focusing on the prediction problem, in which there is no need to distinguish the dynamics due to the environment from those due to the agent.
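A sketch of the tabular TD(0) update for a single transition (names illustrative, `V` assumed to be a dict of state values):

```python
def td0_update(V, s, r, s_next, done, alpha=0.1, gamma=0.9):
    # Bootstrapped target: reward plus the *current estimate* of the next state's value.
    target = r if done else r + gamma * V[s_next]
    V[s] += alpha * (target - V[s])
```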
### Sarsa (On-policy TD control)

### Q-Learning
* Is off-policy because the target policy is greedy (the highlighted *max_a*), while the behaviour policy is epsilon-greedy (the one in the box); see the sketch below


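A sketch of the Q-learning update, assuming `Q[s]` is a dict mapping actions to values (illustrative):

```python
def q_learning_update(Q, s, a, r, s_next, done, alpha=0.1, gamma=0.9):
    # The target policy is greedy (the max), no matter what the epsilon-greedy
    # behaviour policy actually does in s_next: that is what makes it off-policy.
    target = r if done else r + gamma * max(Q[s_next].values())
    Q[s][a] += alpha * (target - Q[s][a])
```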
### Expected Sarsa
#### --> a.k.a. Q-learning that uses expected values to update Q values
Consider the learning algorithm that is just like Q-learning except that instead of the maximum over next state–action pairs it uses the **expected value**, taking into account how likely each action is under the current policy.

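A sketch of the Expected Sarsa update under the same assumed `Q[s]` dict format, with a hypothetical `policy_probs` helper supplying the current policy's action probabilities:

```python
def expected_sarsa_update(Q, s, a, r, s_next, done, policy_probs, alpha=0.1, gamma=0.9):
    """policy_probs(state) -> {action: probability} under the current (e.g. epsilon-greedy) policy."""
    if done:
        target = r
    else:
        probs = policy_probs(s_next)
        # expectation over next actions instead of Q-learning's max
        target = r + gamma * sum(probs[b] * Q[s_next][b] for b in Q[s_next])
    Q[s][a] += alpha * (target - Q[s][a])
```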
### Double Q-Learning

## Chapter 7: n-step Bootstrapping
### n-step TD prediction

$G_{t:t+n} \doteq R_{t+1} + \gamma R_{t+2} + \dots + \gamma^{n-1} R_{t+n} + \gamma^{n} V_{t+n-1}(S_{t+n})$


### n-step SARSA

### n-step off-policy learning



### n-step tree backup
* No importance sampling!


* Recursive definition:

$G_{t:t+n} = R_{t+1} + \gamma \sum_{a \neq A_{t+1}} \pi(a \mid S_{t+1}) Q_{t+n-1}(S_{t+1}, a) + \gamma\, \pi(A_{t+1} \mid S_{t+1})\, G_{t+1:t+n}$
## Chapter 8: Planning and Learning with Tabular Methods



### DYNA-Q+
This agent keeps track for each state–action pair of how many time steps have elapsed since the pair was last tried in a real interaction with the environment. **The more time that has elapsed, the greater (we might presume) the chance that the dynamics of this pair has changed and that the model of it is incorrect.**
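A sketch of the Dyna-Q+ twist during planning: an exploration bonus $\kappa\sqrt{\tau}$ is added to the modelled reward, where $\tau$ is the time since the pair was last tried for real (names illustrative):

```python
import math

def planning_reward(model_reward, tau, kappa=1e-3):
    # tau = time steps since this (state, action) was last tried for real;
    # the bonus grows with tau, nudging planning toward stale pairs.
    return model_reward + kappa * math.sqrt(tau)
```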
### Prioritized sweeping
A queue is maintained of every state–action pair whose estimated value would change nontrivially if updated, prioritized by the size of the change. When the top pair in the queue is updated, the effect on each of its predecessor pairs is computed. If the effect is greater than some small threshold, then the pair is inserted in the queue with the new priority (if there is a previous entry of the pair in the queue, then insertion results in only the higher priority entry remaining in the queue). In this way the effects of changes are efficiently propagated backward until quiescence.
* Prioritizing transitions experienced in an episode could update value estimates that are more relevant to the policy being used, leading to information propagating more quickly.

#### TL;DR: No value in updating 0-->0 state transitions
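A partial sketch of the prioritized-sweeping bookkeeping: compute the magnitude of the would-be backup and queue the pair only if it crosses a threshold (hypothetical helper; `Q` is the same dict-of-dicts format assumed above, and the model/predecessor handling is omitted):

```python
import heapq

def queue_if_significant(pqueue, Q, s, a, r, s_next, theta=1e-4, gamma=0.9):
    # Priority = magnitude of the change this backup would cause.
    priority = abs(r + gamma * max(Q[s_next].values()) - Q[s][a])
    if priority > theta:
        # heapq is a min-heap, so store the negative priority to pop the largest first.
        heapq.heappush(pqueue, (-priority, s, a))
```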
### MCTS



## Chapter 9: On-policy prediction with approximation
### Prediction Error

### SGD



* Recall that your value function estimate is just the dot product between the weights and features. Therefore, since the gradient is the derivative with respect to each weight, **the gradient vector will be identical to the feature vector** in this instance.
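A sketch of semi-gradient TD(0) with a linear value function, where the gradient of $\hat v(s, \mathbf{w}) = \mathbf{w}^\top \mathbf{x}(s)$ with respect to $\mathbf{w}$ is just the feature vector (feature vectors are assumed given; names illustrative):

```python
import numpy as np

def semi_gradient_td0(w, x_s, r, x_next, done, alpha=0.01, gamma=0.9):
    """w: weight vector; x_s, x_next: feature vectors of the current and next state."""
    v_next = 0.0 if done else w @ x_next
    delta = r + gamma * v_next - w @ x_s    # TD error
    # linear case: the gradient of v_hat(s, w) w.r.t. w is exactly x(s)
    return w + alpha * delta * x_s
```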
### Linear Methods
Since the semi-gradient TD(0) update with linear features is

$\mathbf{w}_{t+1} = \mathbf{w}_t + \alpha\left(R_{t+1} + \gamma \mathbf{w}_t^\top \mathbf{x}_{t+1} - \mathbf{w}_t^\top \mathbf{x}_t\right)\mathbf{x}_t$

and that, in expectation,

$\mathbb{E}\left[\mathbf{w}_{t+1} \mid \mathbf{w}_t\right] = \mathbf{w}_t + \alpha\left(\mathbf{b} - \mathbf{A}\mathbf{w}_t\right), \quad \text{where } \mathbf{b} \doteq \mathbb{E}\left[R_{t+1}\mathbf{x}_t\right], \; \mathbf{A} \doteq \mathbb{E}\left[\mathbf{x}_t\left(\mathbf{x}_t - \gamma\mathbf{x}_{t+1}\right)^\top\right]$

therefore the TD fixed point is

$\mathbf{w}_{TD} \doteq \mathbf{A}^{-1}\mathbf{b}$
### Memory-based function approximation
These methods simply save training examples in memory as they arrive (or at least save a subset of the examples) without updating any parameters. Then, whenever a query state’s value estimate is needed, a set of examples is retrieved from memory and used to compute a value estimate for the query state.
## Chapter 10: On-policy control with approximation
### Extend to Action-value


### Average rewards for continuing tasks
* Discounting is problematic for continuing tasks when using function approximation, as we can no longer guarantee that improving the discounted return leads to a better policy!

* Returns are defined as the difference between the reward and the average reward (the differential return):

$G_t \doteq \left(R_{t+1} - r(\pi)\right) + \left(R_{t+2} - r(\pi)\right) + \left(R_{t+3} - r(\pi)\right) + \dots$

* Ergodicity Assumption: Where the MDP starts or any early decision made by the agent can have only a temporary effect; in the long run the expectation of being in a state depends only on the policy and the MDP transition probabilities.
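A sketch of the differential (average-reward) Sarsa update, where the TD error uses $R - \bar R$ in place of discounting and the average-reward estimate is itself updated (linear action-value features assumed; names illustrative):

```python
import numpy as np

def differential_sarsa_update(w, avg_r, x_sa, r, x_next_sa, alpha=0.01, beta=0.01):
    """w: weights of a linear q-hat; x_sa, x_next_sa: features of (S, A) and (S', A')."""
    delta = r - avg_r + w @ x_next_sa - w @ x_sa   # differential TD error
    avg_r = avg_r + beta * delta                   # track the average reward
    w = w + alpha * delta * x_sa                   # semi-gradient step (gradient = features)
    return w, avg_r
```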
## Chapter 11: Off-policy methods w/ approximation

### The deadly triad
* The triad: **function approximation**, **bootstrapping**, and **off-policy training**
* All 3 present == Uh oh oh no (instability and divergence become possible)

### Baird’s Counterexample


## Chapter 12: Eligibility Traces
### The Lambda return

* Offline: you do the updates **after** the episode ends
* Online: you do the updates **during** the episode


\----------------------------------------------------------------------------


\----------------------------------------------------------------------------
* Recursive definition of the $\lambda$ return

### Offline TD-Lambda


* Drawback: the $\lambda$ return is not known until the end of the episode
### TD-Lambda (online)






* In TD($\lambda$), the eligibility trace vector is initialized to zero at the beginning of the episode, is incremented on each time step by the value gradient, and then fades away by $\lambda \gamma$.
* The eligibility trace keeps track of which components of the weight vector have contributed, positively or negatively, to recent state valuations, where “recent” is defined in terms of $\gamma\lambda$.
* Intuition: Which states are more worthy of getting the update? (backwards view)
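A sketch of one step of linear TD($\lambda$) with an accumulating eligibility trace (feature vectors assumed given; names illustrative):

```python
import numpy as np

def td_lambda_step(w, z, x_s, r, x_next, done, alpha=0.01, gamma=0.9, lam=0.9):
    """z: eligibility trace vector, reset to zeros at the start of every episode."""
    z = gamma * lam * z + x_s                 # fade the old trace, add the value gradient x(s)
    v_next = 0.0 if done else w @ x_next
    delta = r + gamma * v_next - w @ x_s      # TD error
    w = w + alpha * delta * z                 # credit recently active features
    return w, z
```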

\----------------------------------------------------------------------------

### n-step truncated $\lambda$ return methods

\---------------------------------------------------------------------------------

\---------------------------------------------------------------------------------

\---------------------------------------------------------------------------------

\---------------------------------------------------------------------------------

### True online TD-lambda

\---------------------------------------------------------------------------------


## Chapter 13: Policy Gradients
* Directly learn how to take actions without consulting action-value estimates.
* I.e. directly learning a "parameterized" ***policy***.
* Methods that learn approximations to both policy and value functions are often called actor–critic methods, where ‘actor’ is a reference to the learned policy, and ‘critic’ refers to the learned value function, usually a state-value function.

\----------------------------------------------------------------------------

\----------------------------------------------------------------------------

### REINFORCE

\----------------------------------------------------------------------------

\----------------------------------------------------------------------------



### REINFORCE with baseline
* In general, the baseline leaves the expected value of the update unchanged, but it can have a large effect on its variance.
* One natural choice for the baseline is an estimate of the state value, v(St,w).
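A sketch of the per-time-step REINFORCE-with-baseline update for a softmax policy over discrete actions with linear preferences and a linear state-value baseline (everything here, including the feature layout, is illustrative):

```python
import numpy as np

def softmax_probs(theta, x_sa):
    """x_sa: one row of action features per action; theta: policy parameters."""
    prefs = x_sa @ theta
    prefs = prefs - prefs.max()        # numerical stability
    e = np.exp(prefs)
    return e / e.sum()

def reinforce_baseline_step(theta, w, x_s, x_sa, a, G, gamma_t,
                            alpha_theta=0.01, alpha_w=0.01):
    """One time step t of an episode: G = return from t onward, gamma_t = gamma**t."""
    delta = G - w @ x_s                          # advantage over the state-value baseline
    w = w + alpha_w * delta * x_s                # update the baseline (critic-like) weights
    pi = softmax_probs(theta, x_sa)
    grad_ln_pi = x_sa[a] - pi @ x_sa             # grad of ln pi(a|s) for a softmax-linear policy
    theta = theta + alpha_theta * gamma_t * delta * grad_ln_pi
    return theta, w
```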

## Post-Midterm Thoughts
- There are 20 True/False questions and 7 other questions (each with a number of sub-parts).
- The 7 questions are from various topics - Bandits, MC, MCTS, TD, Policy Gradient, Eligibility Traces, Q-Learning, DYNA, etc.
- The eligibility traces and policy gradient questions are especially calculation-heavy.
- Take care of the alphas and gammas in the equations, as it is easy to get them wrong.
- The time was just enough for me, even though I had to look up a lot of topics in the textbook (like Baird's counterexample and the REINFORCE algorithm).
- Some questions were exactly the same as the homework questions, so do look over the HW questions once.