Actions are mapped to indices from 0 to 4, defined as follows:
0 -> Stay, 1 -> Left, 2 -> Right, 3 -> Up, 4 -> Down
States are mapped to indices from 0 to 127.
A state's index determines the player position, target position and call status of that state, using the following formulae:
player position = index//16
target position = (index%16)//2
call status = index%2
Player and target positions map to cells of the grid as follows:
0,1,2,3
4,5,6,7
A call status of 1 means that the call is active, while 0 means the call is inactive.
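
The index formulae above can be sketched in Python as follows. The function names (`decode_state`, `encode_state`, `position_to_cell`) are illustrative and not part of the assignment code; the (row, column) convention assumes the 2x4 grid layout listed above.

```python
def decode_state(index):
    """Split a state index (0..127) into (player, target, call)."""
    if not 0 <= index < 128:
        raise ValueError("state index must be in 0..127")
    player = index // 16         # 16 states per player position
    target = (index % 16) // 2   # 2 states per target position
    call = index % 2             # 1 = call active, 0 = inactive
    return player, target, call


def encode_state(player, target, call):
    """Inverse of decode_state."""
    return player * 16 + target * 2 + call


def position_to_cell(position):
    """Map a position 0..7 to (row, column) on the 2x4 grid (assumed convention)."""
    return divmod(position, 4)


print(decode_state(89))     # (5, 4, 1)
print(position_to_cell(4))  # (1, 0)
```

Encoding and decoding are inverses of each other, so `encode_state(*decode_state(i)) == i` holds for every valid index.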
# Question 1
The target is in cell (1,0) and the observation is o6.
[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0.1, 0.1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0.1, 0.1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0.1, 0.1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0.1, 0.1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0.1, 0.1, 0, 0, 0, 0, 0, 0]
Any state with player position 1, 2, 3, 6 or 7, target position 4, and call status 0 or 1 has a probability of 0.1; all other states have a probability of 0.
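
As a rough check, the vector above can be rebuilt from this description. This is a minimal sketch that only uses the index formula (index = player*16 + target*2 + call); the variable names are illustrative.

```python
players = [1, 2, 3, 6, 7]  # player positions with non-zero probability
target = 4                 # cell (1,0)

belief = [0.0] * 128
for player in players:
    for call in (0, 1):
        belief[player * 16 + target * 2 + call] = 0.1

nonzero = [i for i, p in enumerate(belief) if p > 0]
print(nonzero)  # [24, 25, 40, 41, 56, 57, 104, 105, 120, 121]
```

The ten non-zero entries each carry 0.1, so the belief sums to 1 up to floating-point rounding.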
# Question 2
[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0.25, 0, 0, 0, 0, 0, 0.25, 0, 0.25, 0, 0.25, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]
Any state with player position 5, target position 1, 4, 5 or 6, and call status 0 has a probability of 0.25; all other states have a probability of 0.
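
The same kind of check for this belief, again as a sketch with illustrative names, lists the indices that should hold 0.25:

```python
player = 5
targets = [1, 4, 5, 6]
call = 0  # call inactive

# Indices of the states that carry probability mass.
support = [player * 16 + t * 2 + call for t in targets]
belief_q2 = [0.25 if i in support else 0.0 for i in range(128)]

print(support)  # [82, 88, 90, 92]
```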
# Question 3
Expected reward for Q1 is 9.06
Expected reward for Q2 is 20.13
# Question 4
We use the formula n = |A|^N
where N = 265,288,703,664,880,029,479,731 and |A| = 5
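
The 24-digit count quoted above happens to equal the geometric sum (|O|^T - 1)/(|O| - 1) for |O| = 6 and T = 31, which is the usual node count of a policy tree with |O| branches per node and depth T. Both values are assumptions here, since neither is stated in this answer; the sketch below only checks the arithmetic.

```python
import math

N = 265_288_703_664_880_029_479_731  # count quoted in the answer
num_actions = 5                      # |A|, from the answer
num_obs = 6                          # assumption: six observations o1..o6
horizon = 31                         # assumption: depth that reproduces N

# Nodes in a full tree with num_obs children per node and `horizon` levels.
geometric_sum = (num_obs**horizon - 1) // (num_obs - 1)
print(geometric_sum == N)  # True

# num_actions**N is far too large to materialize, so report its size instead.
decimal_digits = N * math.log10(num_actions)
print(f"5**N has roughly {decimal_digits:.3e} decimal digits")
```

Python integers are arbitrary precision, so the comparison is exact rather than a floating-point approximation.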
# Question 5
P(o2) = 0.1, as there is only one case in which o2 is observed: the agent is at (0,0) and the target is at (0,1).
P(o4) = 0.15, as there is only one case in which o4 is observed: the agent is at (1,3) and the target is at (1,2).
P(o6) = 0.75, as it is observed in all other cases. It is the most likely observation.
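
A quick sanity check on these three values, as a sketch that takes the probabilities above as given, confirms that they form a distribution and picks out the most likely observation:

```python
import math

# Observation probabilities as stated in the answer.
obs_probs = {"o2": 0.1, "o4": 0.15, "o6": 0.75}

total = sum(obs_probs.values())
most_likely = max(obs_probs, key=obs_probs.get)

print(math.isclose(total, 1.0))  # True
print(most_likely)               # o6
```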