Homework 5 Conceptual

## Question Setup

For these conceptual questions, you will be working with an MDP with three discrete states $S=\{s_1, s_2, s_3\}$ and two discrete actions $A=\{a_1, a_2\}$. You are using a policy estimated by a logistic regression for each state. The parameters of the logistic regressions are $\theta = [[0.5, -0.5], [0.2, 0.2], [0.1, 0.7]]$, where row $i$ holds the parameters for state $s_i$. The probability of taking an action is $\pi(a | s)=\text{softmax}(\theta_{s})_a$. For $s_1$, the probabilities of taking $a_1$ and $a_2$ are given by the softmax of $[0.5, -0.5]$. For $s_2$, they are given by the softmax of $[0.2, 0.2]$, and so on. You are using a discount factor of $\gamma=0.9$.

You have collected three trajectories so far. Here, we'll write each trajectory as a list of (state, action, reward) tuples.

1. $\tau_1 = [(s_1, a_1, +2), (s_2, a_2, +1), (s_3, a_1, +3)]$
2. $\tau_2 = [(s_1, a_2, -1), (s_3, a_2, +2)]$
3. $\tau_3 = [(s_2, a_1, +1), (s_3, a_2, +2)]$

### Questions

1. Calculate the probabilities of taking each action from each state. (6 total probabilities)
2. Calculate the discounted returns $G_t$ for each timestep in each trajectory. (Compute 7 $G_t$ total)
3. Calculate a value estimate of each state, $V^\pi(s)$. $V^\pi(s)$ can be estimated by calculating the average discounted return for each state. (One value for each state)
4. Compute the gradient of the log-policy, $\nabla \ln \pi(a|s, \theta)$, for each (state, action) pair in the collected trajectories. You will only compute the gradient for (state, action) pairs that appear in your trajectories, which will result in 7 values total.
5. Calculate the REINFORCE gradient estimate $\nabla J(\theta)$ for each trajectory. (One gradient for each trajectory)
6. Compute the mean gradient estimate across all trajectories. (For each element of the gradient, take the mean over all 3 trajectories)
7. Compute the variance of the gradient estimates across all trajectories. (For each element of the gradient, take the variance over all 3 trajectories)
8. Recompute the gradient estimate for each trajectory with REINFORCE using a baseline function $b(s) = V^\pi(s)$, which you estimated in question 3.
9. Compute the mean and variance of these gradient estimates and compare your answer to the result of question 7.
10. What is the advantage of using a baseline function in REINFORCE?
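
The quantities in questions 1, 2, and 4 each follow from a short formula. The Python sketch below is one possible way to check hand calculations, not a reference solution. It assumes that the reward in each tuple is received at that timestep, so $G_t = r_t + \gamma G_{t+1}$, and that each state's parameters act directly as softmax logits, which gives $\nabla_{\theta_s} \ln \pi(a|s) = \text{onehot}(a) - \pi(\cdot|s)$. The function names and example inputs are illustrative and deliberately differ from the values in the assignment.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)  # subtract the max so exp() cannot overflow
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def discounted_returns(rewards, gamma):
    """Return G_t for every timestep, using G_t = r_t + gamma * G_{t+1}.

    Assumption: rewards[t] is the reward received at timestep t.
    """
    returns = [0.0] * len(rewards)
    g = 0.0
    for t in reversed(range(len(rewards))):  # accumulate from the last step backwards
        g = rewards[t] + gamma * g
        returns[t] = g
    return returns


def grad_log_softmax(logits, action):
    """Gradient of ln pi(action | s) with respect to that state's logits.

    For a softmax policy this is one_hot(action) - pi(. | s).
    """
    probs = softmax(logits)
    return [(1.0 if i == action else 0.0) - p for i, p in enumerate(probs)]


if __name__ == "__main__":
    # Toy inputs (not the assignment's values).
    print(softmax([1.0, 1.0]))                        # [0.5, 0.5]
    print(discounted_returns([1.0, 1.0, 1.0], 0.5))   # [1.75, 1.5, 1.0]
    print(grad_log_softmax([1.0, 1.0], 0))            # [0.5, -0.5]
```

Note that the gradient with respect to the parameters of any state other than the one visited is zero, so each per-timestep gradient only touches one row of $\theta$.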