# HW9 drafty draft
### Topics
1. Reminder about Bayes filter
2. Belief-space process is an MDP
3. Belief-space intuition (simplex)
4. Expectimax finds policy tree
5. Worst-case complexity of exact solution: formulate
6. MLS, including example of bad behavior
7. Q_MDP, including example of bad behavior
8. Determinize (using MLO or all-outcomes)
### Exact solution
A policy tree of depth $h$ has $n = \frac{|O|^h - 1}{|O| - 1}$ nodes (one per observation history of length less than $h$); each node specifies an action, so there are $|A|^n$ distinct policy trees---doubly exponential in the horizon.
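To make the blow-up concrete (illustrative numbers; with $K = 3$ the charger problem below has roughly $|A| = 10$ and $|O| = 2$):

$$
n = \frac{|O|^h - 1}{|O| - 1}, \qquad \#\text{policy trees} = |A|^{\,n}; \quad \text{e.g. } |A| = 10,\ |O| = 2,\ h = 4 \;\Rightarrow\; n = 15,\ |A|^n = 10^{15}.
$$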
### Robot intro story
Our robot has lost its charger! The robot lives in a world with $K$ locations but it needs to have the charger in location $0$ in order to plug it in and charge.
The low battery means that its sensors aren't working very well. The robot can execute a *look(loc)* action, which returns an observation in $\{0, 1\}$. If the charger is in location *loc*, it gets observation $1$ with probability 0.9; if the charger is not in location *loc*, it gets observation $1$ with probability 0.4.
The robot can also execute a *move-charger(loc1, loc2)* action. If the charger is actually in *loc1*, then with probability 0.8 the charger will be moved to *loc2*; otherwise it stays in *loc1*. If the charger is not in *loc1*, nothing happens.
The robot has one more action, *charge*. If it executes that action when the charger is in location 0, it gets a reward of 10 and the game terminates. If it executes that action in any other state, it gets reward -100 and the game terminates. Every *look* action gets reward $-x$ and every *move-charger* action gets reward $-y$.
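For reference while working the later parts, here is a minimal sketch of one way the model above could be encoded. The names (`obs_prob`, `move_dist`, `reward`), the tuple encoding of actions, and the placeholder costs `X`, `Y` are all illustrative choices, not the course codebase.

```python
# Minimal sketch of the charger POMDP (illustrative encoding, not the course code).
# The state is the charger's location, 0 .. K-1.  Actions are tuples such as
# ('look', (0,)), ('move', (1, 0)), ('charge', ()).  X and Y stand in for the
# unspecified costs x and y.

K = 3
X = 1.0   # cost of a look action (placeholder for x)
Y = 2.0   # cost of a move-charger action (placeholder for y)

def obs_prob(action, state, obs):
    """P(obs | state, action); only look(loc) is informative."""
    kind, args = action
    if kind == 'look':
        (loc,) = args
        p1 = 0.9 if state == loc else 0.4   # probability of observing 1
        return p1 if obs == 1 else 1.0 - p1
    # move-charger and charge carry no information: the observation is always 0
    return 1.0 if obs == 0 else 0.0

def move_dist(action, state):
    """Distribution over next charger locations, as {next_state: prob}."""
    kind, args = action
    if kind == 'move' and state == args[0]:
        l1, l2 = args
        return {l2: 0.8, l1: 0.2}   # the move succeeds with probability 0.8
    return {state: 1.0}             # look, charge, or a move from the wrong location

def reward(action, state):
    """Immediate reward (termination after charge is not modeled in this sketch)."""
    kind, _ = action
    if kind == 'charge':
        return 10.0 if state == 0 else -100.0
    return -X if kind == 'look' else -Y
```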
### Bayes filter
Assume a world in which $K = 3$. The robot starts out with a uniform distribution over where its charger is. It executes the following sequence of actions and receives the noted observations. What is its belief about the charger location after each step? (Feel free to write a program to solve this---we would!)
* A = *look(0)*
* O = 0
* A = *look(1)*
* O = 1
* A = *move(1, 0)*
* A = *look(0)*
* O = 0
* A = *look(0)*
* O = 0
```
init: DDist{0: 0.33333, 1: 0.33333, 2: 0.33333}
action l0 : DDist{0: 0.33333, 1: 0.33333, 2: 0.33333}
obs 0 : DDist{0: 0.07692, 1: 0.46154, 2: 0.46154}
action l1 : DDist{0: 0.07692, 1: 0.46154, 2: 0.46154}
obs 1 : DDist{0: 0.04878, 1: 0.65854, 2: 0.29268}
action m10 : DDist{0: 0.57561, 1: 0.13171, 2: 0.29268}
obs 0 : DDist{0: 0.57561, 1: 0.13171, 2: 0.29268}
action l0 : DDist{0: 0.57561, 1: 0.13171, 2: 0.29268}
obs 0 : DDist{0: 0.18438, 1: 0.25312, 2: 0.56250}
action l0 : DDist{0: 0.18438, 1: 0.25312, 2: 0.56250}
obs 0 : DDist{0: 0.03631, 1: 0.29908, 2: 0.66462}
```
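Since we'd write a program ourselves, here's a sketch of a filter that reproduces the trace above, reusing `obs_prob` and `move_dist` from the model sketch earlier (again illustrative, not the official solution):

```python
def transition_update(b, action):
    """Push the belief through the action's transition model."""
    b_new = {s: 0.0 for s in b}
    for s, p in b.items():
        for s2, q in move_dist(action, s).items():
            b_new[s2] = b_new.get(s2, 0.0) + p * q
    return b_new

def observation_update(b, action, obs):
    """Condition on the received observation and renormalize."""
    unnorm = {s: p * obs_prob(action, s, obs) for s, p in b.items()}
    z = sum(unnorm.values())
    return {s: p / z for s, p in unnorm.items()}

def bf(b, action, obs=None):
    """One Bayes-filter step: transition update, then (if any) observation update."""
    b = transition_update(b, action)
    return b if obs is None else observation_update(b, action, obs)

# Reproduce the trace above: K = 3, uniform prior.
b = {s: 1.0 / 3 for s in range(3)}
steps = [(('look', (0,)), 0), (('look', (1,)), 1), (('move', (1, 0)), None),
         (('look', (0,)), 0), (('look', (0,)), 0)]
for action, obs in steps:
    b = bf(b, action, obs)
    print(action, obs, {s: round(p, 5) for s, p in b.items()})
```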
Maybe a final question asking why it's pretty sure the charger is in location 2?
**Should we ask them to code a method in the POMDP class and then use it to do this?**
### Belief-space process
Let's construct the MDP that corresponds to our simple robot POMDP.
* State space: the simplex of beliefs over the three charger locations, i.e., the set of vectors of three non-negative numbers that add up to 1 (a 2-dimensional surface embedded in $\mathbb{R}^3$)
* Action space: as before
Assume that you have a Bayes filter function $\textit{bf}(b, a, o)$ that takes a belief state $b$, action $a$, and observation $o$, and returns a new belief state $b'$ as you computed in the previous problem.
1. What is the probability of receiving observation $1$ given belief state *b* and action *look(l)*?
Ans: $0.9\,b[l] + 0.4\,(1 - b[l])$
2. Transition model: We want to express the transition model of the belief-space process.
* If the robot executes action *look(l)*, how many possible resulting belief states are there?
Ans: 2
* What are they?
Ans: bf(b, look(l), 1) with probability P(O = 1 | b, look(l)) and bf(b, look(l), 0) with probability P(O = 0 | b, look(l)) . Not sure how best to ask for this. Could be some form of multiple choice or typing in formulae?
* If the robot executes action *move(l1, l2)*, how many possible resulting belief states are there?
Ans: 1
* What is it?
Ans: bf(b, move(l1, l2), None)
3. Reward
What is the reward value of taking action *charge* in belief state b?
Ans: $10\,b[0] - 100\,(1 - b[0])$ (see the sketch just below)
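Putting the three answers together, here's a sketch of the belief-space transition and reward models, reusing `obs_prob`, `reward`, `transition_update`, and `bf` from the earlier sketches; the return-value shapes are a guess at what expectimax will want.

```python
def obs_dist(b, action):
    """P(O = o | b, a): push the belief through the transition, then marginalize."""
    b_pred = transition_update(b, action)
    return {o: sum(p * obs_prob(action, s, o) for s, p in b_pred.items())
            for o in (0, 1)}

def belief_transition_dist(b, action):
    """Possible next beliefs as (next_belief, probability) pairs:
    two successors for look(l), exactly one for move and charge."""
    kind, _ = action
    if kind == 'look':
        d = obs_dist(b, action)
        return [(bf(b, action, o), d[o]) for o in (0, 1) if d[o] > 0]
    return [(bf(b, action, None), 1.0)]

def belief_reward(b, action):
    """Expected immediate reward; for charge this is 10*b[0] - 100*(1 - b[0])."""
    return sum(p * reward(action, s) for s, p in b.items())
```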
### Simplex
The set of possible belief states over $K$ world states is a *simplex*, which is a $(K-1)$-dimensional manifold embedded in $K$-dimensional space. Here is a picture of the simplex for $K = 3$.
1. For each of the following conditions on the belief state, explain what part of the simplex it corresponds to. (Could do this by matching plots or matching descriptions.)
* b(s = 0) = 1 (Ans: a vertex)
* b(s = 0) = 0 (Ans: an edge)
* b(s = 0) > p (Ans: a triangle)
* b(s = 0) = b(s=1) (Ans: a line from the b(s=2) vertex to the midpoint of the opposite side)
2. Which of the following plots corresponds to belief states in which the value of charging > 0?




3. If our robot has no discounting and a horizon of 2 (so it can take two actions), what is its optimal strategy?
Ans: (I think) move($l^*$, 0), where $l^*$ is whichever of locations 1 and 2 has the higher belief, and then charge if the posterior belief in location 0 is greater than 100/110. (The threshold is where charging has positive expected value: $10\,b[0] - 100\,(1 - b[0]) > 0$ exactly when $b[0] > 100/110$.)
(In the case where the posterior after the move doesn't satisfy the requirement for charging on the last step, it doesn't matter what the robot does.)
### Optimal strategy (skippable)
Might be interesting (for us) to compute an optimal strategy (discounted or, say, 5 steps) and then ask them some questions about it.
### Belief MDP
Ask them to implement a function that takes a POMDP as input and constructs a belief MDP (at least, enough methods so that expectimax will work). *I have ugly code for this*.
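For what it's worth, one possible shape for that wrapper, just as a hint at the interface expectimax needs; the POMDP attribute names here (`actions`, `observations`, `obs_prob`, `transition_dist`, `bayes_filter`, `reward`) are guesses, not the actual class.

```python
class BeliefMDP:
    """Wrap a POMDP as an MDP over belief states (sketch; attribute names are assumed)."""

    def __init__(self, pomdp):
        self.pomdp = pomdp

    def actions(self, b):
        # Same action set as the underlying POMDP, regardless of belief.
        return self.pomdp.actions

    def transition_dist(self, b, a):
        """Return a list of (next_belief, probability) pairs, one per observation."""
        # Predicted belief over next states, before any observation arrives.
        b_pred = {}
        for s, p in b.items():
            for s2, q in self.pomdp.transition_dist(s, a).items():
                b_pred[s2] = b_pred.get(s2, 0.0) + p * q
        result = []
        for o in self.pomdp.observations:
            p_o = sum(p * self.pomdp.obs_prob(a, s2, o) for s2, p in b_pred.items())
            if p_o > 0:
                result.append((self.pomdp.bayes_filter(b, a, o), p_o))
        return result

    def reward(self, b, a):
        # Expected immediate reward under the belief.
        return sum(p * self.pomdp.reward(a, s) for s, p in b.items())
```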
### Expectimax
All kinds of weird and interesting stuff starts to happen here! My code, which is a miserable amalgamation of some of my old utilities that I'm familiar with, with our MDP and expectimax code, and some hacks, is here: [colab](https://colab.research.google.com/drive/1mcHz866JV-Yzdwb7QGBN7GyMvJ5JTlCn?usp=sharing)
I may still have some bugs, but maybe not.
Ask them to find the optimal horizon 4 policy tree for some initial belief?
Or, maybe, make a version that will write out a .dot file?
*Here is a really interesting question, and I think it might be related to something that one of the old POMDP crowd called "infinitely delayed splurge".*
Consider the situation when we are doing receding horizon control with horizon = 4, and belief state (0.02, 0.0, 0.98). The policy just keeps picking look actions. Why?
Because it has a plan to eventually do move(2, 0) followed by charge. But because move(2, 0) is more expensive than looking, it decides to do a cheap look first; when it replans at the next step the horizon has moved forward, so it postpones the move again, and the splurge is delayed forever.
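As a reminder of what "receding horizon control" means in this loop, here is a hedged sketch; `plan` and `env_step` are stand-ins for whatever the colab provides, and `bf` is the filter from the earlier sketch.

```python
def receding_horizon_control(plan, env_step, b, horizon=4, max_steps=20):
    """plan(b, h) -> action (e.g. the root action of the expectimax tree);
    env_step(a) -> (obs, done) simulates one step of the real, hidden-state world."""
    for _ in range(max_steps):
        a = plan(b, horizon)       # replan from scratch at every step...
        obs, done = env_step(a)    # ...but execute only the first action
        b = bf(b, a, obs)          # fold the actual observation into the belief
        if done:
            break
    return b
```

The failure mode above lives in this loop: the expensive move stays inside every plan but never makes it to the root.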
### Most likely state
Consider a situation in which:
* the charger is actually in location 1
* the robot's prior is (0.4, 0.3, 0.3)
What action will the MLS strategy take?
Ans: charge (the most likely state under (0.4, 0.3, 0.3) is location 0, and charge is the MDP-optimal action there)
What is the reward of taking this action?
Ans: -100 (the charger is actually in location 1, so the game terminates with the penalty)
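For reference, the MLS rule in a couple of lines (assumes a table `mdp_policy` mapping each world state to the optimal MDP action; illustrative only):

```python
def mls_action(b, mdp_policy):
    """Most-likely-state strategy: commit to the single most probable state and
    act as if it were the true state (all remaining uncertainty is ignored)."""
    s_star = max(b, key=b.get)    # with prior (0.4, 0.3, 0.3) this is state 0
    return mdp_policy[s_star]     # and the MDP-optimal action in state 0 is charge
```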
### QMDP
Consider the same situation.
For QMDP, we start by solving the MDP.
*Tell them the Q values of the underlying MDP*
What action will the Q_MDP strategy take?
*Would be nice if it just waits around---I guess we could give it a "nop" action and see that it picks that instead of looking.*
Could skip this if too long
Could compare actual execution reward vs optimal
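And the corresponding Q_MDP rule for contrast (assumes a table `Q[s][a]` of Q-values of the underlying MDP, as given to the students above; illustrative only):

```python
def qmdp_action(b, Q, actions):
    """Q_MDP strategy: score each action by belief-weighted MDP Q-values.  It accounts
    for current-state uncertainty but assumes it vanishes after one step, so it never
    takes an action purely to gather information (hence a nop can beat a look)."""
    return max(actions, key=lambda a: sum(p * Q[s][a] for s, p in b.items()))
```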
### Most likely observation
I implemented a version of this. It would be good to construct an example where it works well (it seems to be good in this problem) and ask them to say what path it will get.
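A sketch of what determinizing with the most likely observation could look like operationally, reusing `obs_dist` and `bf` from the earlier sketches (this may not match the version implemented above):

```python
def mlo_successor(b, action):
    """Deterministic successor belief: assume the single most likely observation
    under the current belief will be the one received."""
    d = obs_dist(b, action)
    o_star = max(d, key=d.get)
    return bf(b, action, o_star)
```

Planning with `mlo_successor` turns the belief MDP into a deterministic search problem, which is what makes it cheap.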
### All outcomes determinization
I think we're all tired by now ;-)
**Jiayuan Note:** add a thinking question about a robot with a battery state variable.