# CS188 MT Study Guide
Last updated 10/11/19, created by Matthew Tang
# Uninformed Search
* A type of planning problem
* Search problem:
* State Space:
* World state contains all details of environment
* Search state contains bare minimum for planning
* Successor function:
* Start State
* Goal Test
| Search | Description | Advantages | Disadvantages |
| -------- | -------- | -------- | -------- |
| Depth-First Search | Expand deepest nodes first | Better space | Can get stuck searching deep nodes |
| Breadth-First Search | Expand shallowest nodes first | Better time, optimal if edge weights are 1 | Worse space |
| Iterative Deepening | Do depth limited search, increasing depth each time | DFS space advantage with BFS time | Still uninformed search |
| Uniform Cost Search | Use cumulative cost as priority for queue | Optimality guarantee | Explores in every direction, no info about goal |
| Search | Fringe | Complete | Optimal | Time | Space |
| -------- | -------- | -------- | -------- | -------- | -------- |
| Depth-First Search | Stack (LIFO) | If no cycles | No | $\mathcal{O}(b^d)$ | $\mathcal{O}(bd)$ |
| Breadth-First Search | Queue (FIFO) | Yes | Only if all edge costs are equal | $\mathcal{O}(b^d)$ | $\mathcal{O}(b^d)$ |
| Iterative Deepening | Stack (LIFO) | Yes | Only if all edge costs are equal | $\mathcal{O}(b^d)$ | $\mathcal{O}(bd)$ |
| Uniform Cost Search | Priority Queue | Yes if positive edge weights | Yes | $\mathcal{O}(b^{C^* / \epsilon})$ | $\mathcal{O}(b^{C^* / \epsilon})$ |
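The fringe column above is essentially the only difference between these algorithms. A minimal sketch of generic graph search, assuming a `successors(state)` function that yields `(next_state, step_cost)` pairs and hashable states (all names here are illustrative, not course code):

```python
import heapq
import itertools
from collections import deque

def graph_search(start, is_goal, successors, strategy="bfs"):
    """Generic graph search: the choice of fringe determines the algorithm.
    `successors(state)` is assumed to yield (next_state, step_cost) pairs."""
    counter = itertools.count()                  # tie-breaker so heapq never compares states
    if strategy == "ucs":
        fringe = [(0, next(counter), start, [start])]
        pop = lambda: heapq.heappop(fringe)
        push = lambda c, s, p: heapq.heappush(fringe, (c, next(counter), s, p))
    else:                                        # "bfs" -> queue (FIFO), "dfs" -> stack (LIFO)
        fringe = deque([(0, 0, start, [start])])
        pop = fringe.popleft if strategy == "bfs" else fringe.pop
        push = lambda c, s, p: fringe.append((c, 0, s, p))

    expanded = set()                             # graph search: never expand a state twice
    while fringe:
        cost, _, state, path = pop()
        if is_goal(state):
            return path, cost
        if state in expanded:
            continue
        expanded.add(state)
        for nxt, step_cost in successors(state):
            if nxt not in expanded:
                push(cost + step_cost, nxt, path + [nxt])
    return None, float("inf")
```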
# Informed Search
* Heuristics
* Estimate how close a state is to a goal
* Ex. Manhattan distance (ignore walls), euclidean distance
* Greedy Search
* Expand the node that seems closest (heuristic)
* Commonly leads you down a suboptimal route
* Graph Search
* Don't expand a state twice
* Tree search + a set of expanded states
* Heuristic conditions for A* optimality:
* With admissible heuristic, Tree A* is optimal
* With consistent heuristic, Graph A* is optimal
## $A^*$
* Combining UCS and Greedy
* UCS orders by path cost, backwards cost - $g(n)$
* Greedy orders by goal proximity, forwards cost - $h(n)$
* Add $g(n) + h(n)$
* Stop when goal is dequeued
* Heuristic estimates must be at most the actual costs (never overestimate)
* **Admissible**: $0 \leq h(n) \leq h^*(n)$, where $h^*(n)$ is the true cost to the nearest goal
* **Consistent**: $\forall A, C, h(A) - h(C) \leq \text{cost}(A,C)$
* Stronger condition than admissible
* Heuristic "arc" cost $\leq$ actual cost for each arc
* Consistent $\implies$ admissible
* **Dominance**: $h_a$ dominates $h_b$ if $\forall n: h_a(n) \geq h_b(n)$; prefer the larger (dominant) admissible heuristic
* Often admissible heuristics are solutions to relaxed problems where new actions are available
* Example heuristics:
* Manhattan distance (horiz + vert dist)
* Euclidean distance (true dist)
* Max/min of things
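A minimal sketch of A* graph search on a grid with the Manhattan-distance heuristic; the `neighbors(state)` function, `(x, y)` tuple states, and the tiny grid are illustrative assumptions, not course code:

```python
import heapq

def a_star(start, goal, neighbors, h):
    """A* graph search: order the fringe by f(n) = g(n) + h(n).
    `neighbors(state)` is assumed to yield (next_state, step_cost) pairs and
    `h(state, goal)` to be an admissible (ideally consistent) heuristic."""
    fringe = [(h(start, goal), 0, start, [start])]    # entries: (f, g, state, path)
    expanded = set()
    while fringe:
        f, g, state, path = heapq.heappop(fringe)
        if state == goal:
            return path, g                            # stop when the goal is dequeued
        if state in expanded:
            continue
        expanded.add(state)
        for nxt, cost in neighbors(state):
            if nxt not in expanded:
                heapq.heappush(fringe, (g + cost + h(nxt, goal), g + cost, nxt, path + [nxt]))
    return None, float("inf")

def manhattan(a, b):
    """Manhattan distance (horizontal + vertical), ignoring walls."""
    return abs(a[0] - b[0]) + abs(a[1] - b[1])

# Example: 3x3 grid with one wall, 4-connected moves of cost 1.
walls = {(1, 1)}
def neighbors(s):
    x, y = s
    for nxt in [(x + 1, y), (x - 1, y), (x, y + 1), (x, y - 1)]:
        if nxt not in walls and 0 <= nxt[0] < 3 and 0 <= nxt[1] < 3:
            yield nxt, 1

path, cost = a_star((0, 0), (2, 2), neighbors, manhattan)
```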
# Constraint Satisfaction Problems (CSPs)
* A type of identification problem
* Problem formulation
* Variables: N variables
* Domain: A set $\{x_1 \ldots x_d\}$ of all possible domain values
* Constraints: Restrictions on values of variables
* **Unary Constraints**: Involve a single variable in the CSP.
* Not represented in constraint graphs, only prune domain of 1 variable
* **Binary constraints**: Involve two variables
* An edge in a constraint graph
* Constraint graph - edge between 2 variables in a binary constraint
## Solving CSPs
### Backtracking
* An optimization on DFS
* 1) Fix an ordering for variables, and select values for variables in this order.
* 2) Only select values for variables that don’t conflict with any previously assigned values.
* If no such values exist, backtrack and return to the previous variable
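A minimal sketch of backtracking search; the `constraints(var, value, assignment)` callable and the map-coloring example are assumed interfaces for illustration:

```python
def backtracking_search(variables, domains, constraints):
    """Plain backtracking: assign variables in a fixed order and only try values
    that don't conflict with previously assigned values."""
    def backtrack(assignment, i):
        if i == len(variables):
            return assignment                     # all variables assigned
        var = variables[i]
        for value in domains[var]:
            if constraints(var, value, assignment):
                assignment[var] = value
                result = backtrack(assignment, i + 1)
                if result is not None:
                    return result
                del assignment[var]               # undo and backtrack
        return None                               # no value works -> back up to previous variable
    return backtrack({}, 0)

# Example: color a 3-node triangle graph so adjacent nodes differ.
adjacent = {"A": ["B", "C"], "B": ["A", "C"], "C": ["A", "B"]}
ok = lambda var, val, asg: all(asg.get(n) != val for n in adjacent[var])
solution = backtracking_search(["A", "B", "C"], {v: ["R", "G", "B"] for v in "ABC"}, ok)
```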
### Filtering
* **Forward-checking**: If this assignment leads to no possible assignments for future variables, prune the domain here.
* **Arc consistency (AC-3, 2-consistent)**: Check all arcs (directed edges) in constraint graph
* 1) Store all arcs in a queue $Q$
* 2) For an arc $X_i \rightarrow X_j$, remove domain values from the tail $X_i$ that produce 0 valid values for $X_j$
* 3) If a value is removed from $X_i \rightarrow X_j$, queue $X_k \rightarrow X_i$ (unless already queued)
* 4) Repeat until $Q$ is empty; if any domain becomes empty, backtrack (sketched in code at the end of this section)
* Time: $\mathcal{O}(ed^3)$
* $e$: number of arcs, $d$: size of largest domain
* **Strong k-consistency**: Any subset of $k$ nodes is $k, k-1, \ldots 1$ consistent
* Ordering
* **Minimum Remaining Values (MRV)** - Choose the variable with the fewest valid remaining values to assign next
* The most constrained variable is the most likely to run out of values, so assign it sooner (fail fast)
* **Least Constraining Value (LCV)** - After choosing a variable, select the value that prunes the fewest values from the domains of the remaining unassigned variables.
* Requires additional computation (e.g., re-running filtering such as forward checking or arc consistency)
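A minimal sketch of the AC-3 procedure described in the Filtering bullets above; the `arcs` dict mapping each directed arc to an `allowed(vi, vj)` predicate is an assumed interface:

```python
from collections import deque

def ac3(domains, arcs):
    """AC-3 arc consistency. `domains` maps variable -> set of values; `arcs` maps
    each directed arc (Xi, Xj) -> a binary predicate allowed(vi, vj).
    Returns False if some domain is emptied (backtrack), True otherwise."""
    queue = deque(arcs.keys())                      # 1) start with every directed arc
    while queue:
        xi, xj = queue.popleft()
        allowed = arcs[(xi, xj)]
        # 2) remove tail values with no supporting value in the head's domain
        removed = {vi for vi in domains[xi]
                   if not any(allowed(vi, vj) for vj in domains[xj])}
        if removed:
            domains[xi] -= removed
            if not domains[xi]:
                return False                        # 4) empty domain -> backtrack
            # 3) re-enqueue arcs pointing into Xi, since their support may be gone
            for (xk, xl) in arcs:
                if xl == xi and (xk, xl) not in queue:
                    queue.append((xk, xl))
    return True
```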
### Structure
* Tree-structured CSP: (no loops) runtime is $\mathcal{O}(nd^2)$ instead of $\mathcal{O}(d^n)$
* **Algorithm**
* 1) Pick any node to be root
* 2) Convert undirected edges to be directed away from root
* 3) Linearize (topological sort) the DAG
* 4) Do a backwards pass (from $i=n$ to $i=2$) enforcing arc consistency from $\text{Parent}(X_i) \rightarrow X_i$
* 5) Do a forward assignment (from $X_1$ to $X_n$)
* **Cutset conditioning**: Find the smallest subset of variables to remove to leave a tree
* Time $\mathcal{O}(d^c (n-c)d^2)$, good for small cutset size $c$
## Local Search
* **Min-conflicts heuristic**: select a random conflicted variable and reassign its value to the one that violates the fewest constraints
* A form of iterative improvement
* Improve a single option until you can’t make it better (no fringe)
* Better time and memory, but incomplete and suboptimal
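A minimal min-conflicts sketch; the `conflicts(var, value, assignment)` counter is an assumed interface, not course code:

```python
import random

def min_conflicts(variables, domains, conflicts, max_steps=1000):
    """Min-conflicts local search. `conflicts(var, value, assignment)` is assumed
    to count how many constraints var = value violates under the assignment.
    Starts from a random complete assignment and iteratively repairs it."""
    assignment = {v: random.choice(list(domains[v])) for v in variables}
    for _ in range(max_steps):
        conflicted = [v for v in variables if conflicts(v, assignment[v], assignment) > 0]
        if not conflicted:
            return assignment                        # no violated constraints -> solution
        var = random.choice(conflicted)              # pick a random conflicted variable
        # reassign it to the value that violates the fewest constraints
        assignment[var] = min(domains[var], key=lambda val: conflicts(var, val, assignment))
    return None                                      # incomplete: may fail within max_steps
```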
# Games
* **Games**: Adversarial search problems
* The solution is a policy: the best move to make from each state
* **Minimax**: Assume the opponent behaves optimally, performing the move that is worst for us
* **Game tree**: Successor states = children
* **Utility**: The score an agent receives
* Terminal utility is a known value
* Minimax does a postorder traversal of game tree
* **Expectimax**: chance nodes compute the expected utility/value
* **Multi-agent utilities**: Each agent maximizes its own utility
## Pruning
* **Alpha-beta pruning** (minimax)
* At a minimizer: if the value found so far drops below the best value the maximizer already has from an explored branch ($\alpha$), prune the remaining children at this level
* At a maximizer: if the value found so far rises above the best value the minimizer already has from an explored branch ($\beta$), prune the remaining children at this level
* Best case: $\mathcal{O}(b^{m/2})$
* **Evaluation Functions**: take in a state and output an estimate of the true minimax value of that node
* Ex. $\text{Eval}(s) = w_1f_1(s) + w_2f_2(s) + \ldots w_nf_n(s)$
* **Depth-limited minimax**: Treat non-terminal nodes at max depth as terminal nodes with utilities from evaluation functions
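A minimal sketch of depth-limited minimax with alpha-beta pruning, combining the pruning rule and evaluation-function cutoff above; the `game` object with `is_terminal`, `evaluate`, and `successors` methods is an assumed interface:

```python
def alpha_beta(state, depth, alpha, beta, game, maximizing):
    """Depth-limited minimax with alpha-beta pruning.
    Root call: alpha_beta(start, max_depth, float("-inf"), float("inf"), game, True)."""
    if depth == 0 or game.is_terminal(state):
        return game.evaluate(state)                  # treat the depth cutoff as terminal
    if maximizing:
        value = float("-inf")
        for child in game.successors(state):
            value = max(value, alpha_beta(child, depth - 1, alpha, beta, game, False))
            if value > beta:
                return value                         # the minimizer above will never allow this
            alpha = max(alpha, value)
    else:
        value = float("inf")
        for child in game.successors(state):
            value = min(value, alpha_beta(child, depth - 1, alpha, beta, game, True))
            if value < alpha:
                return value                         # the maximizer above will never allow this
            beta = min(beta, value)
    return value
```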
## Utilities
* **Principle of maximum expected utility** - rational agents must select the action that maximizes their expected utility
* Agent prefers A to B: $A \succ B$
* Agent indifferent between A and B: $A \sim B$
* Lottery with probability $p$ of getting $A$, $(1-p)$ of getting $B$: $L = [p, A; (1-p), B]$
* Axioms of rationality
* **Orderability**: $(A \succ B) \lor (B \succ A) \lor (A \sim B)$
* Must prefer A, prefer B, or be indifferent between them
* **Transitivity**: $((A \succ B)\land (B \succ C)) \implies (A \succ C)$
* If prefers A to B and B to C, then it prefers A to C.
* **Continuity**: $(A \succ B \succ C) \implies \exists p\ [p, A; (1-p), C] \sim B$
* If prefers A to B and B to C, it's possible to construct a lottery between A and C such that the agent is indifferent between the lottery and B
* **Substitutability**: $(A \sim B) \implies [p, A; (1-p), C] \sim [p, B; (1-p), C]$
* If indifferent between A and B, the agent is also indifferent between two lotteries that differ only in substituting B for A
* **Monotonicity**: $(A \succ B) \implies (p \geq q \Leftrightarrow [p, A; (1-p), B] \succeq [q, A; (1-q), B])$
* If prefers A over B, then given a choice between lotteries involving only A and B, the agent prefers the lottery assigning the highest probability to A.
* **Utility function**: the utility of a lottery is its expected utility, $U([p, A; (1-p), B]) = pU(A) + (1-p)U(B)$
* Risk:
* **Risk-averse**: Avoids risk (concave down function)
* **Risk-neutral**: Indifferent (linear, 0 concavity)
* **Risk-seeking**: Prefers risk (concave up function)
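A small worked example of these risk attitudes (numbers chosen purely for illustration): with $U(x) = \sqrt{x}$ and the lottery $L = [0.5, \$100; 0.5, \$0]$, $EU(L) = 0.5\sqrt{100} + 0.5\sqrt{0} = 5$, while the guaranteed expected monetary value gives $U(\$50) = \sqrt{50} \approx 7.07 > 5$. So a concave (risk-averse) agent prefers the sure \$50 over the lottery, even though both have the same expected monetary value.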
# Non-Deterministic Search: MDPs
* Problem formulation:
* **States**: A set of states $s \in S$
* **Actions**: A set of actions $a \in A$
* **Transition function**: $T(s,a,s')$
* Probability that taking action $a$ in $s$ leads to $s'$
* $P(s'|s, a)$
* Aka Model/dynamics
* **Reward function**: $R(s, a, s')$
* Typically small living reward and large terminal reward
* **Discount factor**: $0 \leq \gamma \leq 1$
* $\gamma = 0$ is a reflex agent
* **Policy**: choice of action for each state
* Goal is to find the optimal policy $\pi^*: S \rightarrow A$
* **Utility**: sum of (discounted) rewards
* Q-states (state-action pairs) are the intermediate nodes in the expectimax-like "search trees"
* Markov property/memoryless property: Given the present state, future states do not depend on past states
## Bellman Equation
* $Q^*(s,a) = \sum\limits_{s'} T(s,a,s')[R(s,a,s')+ \gamma V^*(s')]$
* Best possible reward starting in state $s$ and taking action $a$
* $V^*(s) = \max\limits_a \sum\limits_{s'} T(s,a,s')[R(s,a,s')+ \gamma V^*(s')]$
* Recursive definition of value: $V^*(s) = \max\limits_a Q^*(s,a)$
* Best possible reward of all possible actions to take
* The Bellman equation is a necessary and sufficient condition for optimality: a value function is optimal iff it satisfies the Bellman equation
* Geometric series sum (finite) = $a_1\left( \frac{1-r^n}{1-r}\right)$
* Geometric series sum (infinite) = $\frac{a_1}{1-r}$
* $1 + x + x^2 + x^3 \ldots = \frac{1}{1-x}$
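For example, with $\gamma = 0.9$ and a reward of 1 at every time step, the total discounted utility is $1 + 0.9 + 0.9^2 + \ldots = \frac{1}{1-0.9} = 10$; more generally, rewards bounded by $R_{max}$ give utilities bounded by $\frac{R_{max}}{1-\gamma}$.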
## Value Iteration
* Define $V_k(s)$ to be optimal value of $s$ if game ends in $k$ more time steps
* Equivalent to depth-k expectimax
* **Algorithm**
* $\forall s \in S, V_0(s) = 0$ (b/c no time steps left so no reward)
* $\forall s \in S, V_{k+1}(s) = \max\limits_a \sum\limits_{s'} T(s,a,s')[R(s,a,s')+ \gamma V_k(s')]$
* Repeat until convergence: $V_k(s) = V_{k+1}(s) = V^*(s)$
* Policy extraction: $\pi^*(s) = \underset{a}{\mathrm{argmax}}\ Q^*(s,a)$
* Problems:
* Time per iteration: $\mathcal{O}(|S|^2|A|)$
* Max at each state rarely changes
* Policy often converges before values do
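A minimal value-iteration sketch, assuming `states` is a list, `actions(s)` returns a list of legal actions, `T(s, a)` returns `(next_state, probability)` pairs, and `R(s, a, s2)` returns the reward; these names are illustrative, not from any course codebase:

```python
def value_iteration(states, actions, T, R, gamma, tol=1e-6):
    """Value iteration: repeatedly apply the Bellman update until values converge."""
    V = {s: 0.0 for s in states}                     # V_0(s) = 0 for all states
    while True:
        new_V = {}
        for s in states:
            q_values = [sum(p * (R(s, a, s2) + gamma * V[s2]) for s2, p in T(s, a))
                        for a in actions(s)]
            new_V[s] = max(q_values) if q_values else 0.0
        if max(abs(new_V[s] - V[s]) for s in states) < tol:
            return new_V                             # converged: V_k is approximately V*
        V = new_V

def extract_policy(states, actions, T, R, gamma, V):
    """Policy extraction: pi*(s) = argmax_a Q*(s, a)."""
    return {s: max(actions(s),
                   key=lambda a: sum(p * (R(s, a, s2) + gamma * V[s2]) for s2, p in T(s, a)))
            for s in states if actions(s)}
```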
## Policy Iteration
* $V^\pi(s) \triangleq \sum\limits_{s'} T(s,\pi(s),s')[R(s,\pi(s),s')+ \gamma V^\pi(s')]$
* **Algorithm**
* 1) Define an initial policy $\pi_0$ (it can be arbitrary; the algorithm still converges)
* 2) Policy evaluation: $\forall s \in S, V^{\pi_i}_{k+1}(s) \leftarrow \sum\limits_{s'} T(s,\pi_i(s),s')[R(s,\pi_i(s),s')+ \gamma V^{\pi_i}_k(s')]$
* Or solve system of equations to update
* 3) Policy improvement: $\pi_{i+1}(s) = \underset{a}{\mathrm{argmax}} \sum\limits_{s'} T(s,a,s')[R(s,a,s')+ \gamma V^{\pi_i}(s')]$
* 4) Repeat until convergence: $\pi_{i+1} = \pi_i = \pi^*$
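A minimal policy-iteration sketch under the same `T`, `R`, `actions` assumptions as the value-iteration sketch above, using iterative policy evaluation rather than solving the linear system:

```python
def policy_iteration(states, actions, T, R, gamma, eval_iters=50):
    """Policy iteration: alternate policy evaluation and greedy policy improvement."""
    policy = {s: next(iter(actions(s)), None) for s in states}   # arbitrary initial policy
    while True:
        # policy evaluation: compute (an approximation of) V^pi for the fixed policy
        V = {s: 0.0 for s in states}
        for _ in range(eval_iters):
            V = {s: sum(p * (R(s, policy[s], s2) + gamma * V[s2])
                        for s2, p in T(s, policy[s])) if policy[s] is not None else 0.0
                 for s in states}
        # policy improvement: act greedily with respect to V^pi
        new_policy = {s: max(actions(s),
                             key=lambda a: sum(p * (R(s, a, s2) + gamma * V[s2])
                                               for s2, p in T(s, a)))
                      if actions(s) else None
                      for s in states}
        if new_policy == policy:
            return policy, V                         # converged: pi_{i+1} = pi_i = pi*
        policy = new_policy
```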
# Reinforcement Learning
* **Sample**: $(s,a, s', r)$ tuple
* **Episode**: collection of samples
* **Sampling**: Have to try actions multiple times because outcomes are stochastic
* Looking for a policy $\pi(s)$ but we don't know $T$ or $R$
## Model-Based Learning
* Generate approximations $\hat{T}(s,a,s')$ and $\hat{R}(s,a,s')$ from experience
* Count which $s'$ each $(s, a)$ ends up in across samples, then normalize to get probabilities
* By the law of large numbers, $\hat{T}, \hat{R}$ converge
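A minimal count-and-normalize sketch; the `(s, a, s', r)` sample tuples match the definition above, and averaging observed rewards per transition is an assumption of this sketch:

```python
from collections import Counter, defaultdict

def estimate_model(samples):
    """Estimate T-hat and R-hat from (s, a, s', r) samples by counting and normalizing."""
    counts = defaultdict(Counter)                 # (s, a) -> Counter over observed s'
    reward_sums = Counter()                       # (s, a, s') -> sum of observed rewards
    for s, a, s2, r in samples:
        counts[(s, a)][s2] += 1
        reward_sums[(s, a, s2)] += r
    T_hat = {sa: {s2: n / sum(ctr.values()) for s2, n in ctr.items()}
             for sa, ctr in counts.items()}
    R_hat = {(s, a, s2): reward_sums[(s, a, s2)] / counts[(s, a)][s2]
             for (s, a, s2) in reward_sums}
    return T_hat, R_hat
```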
## Model-Free Learning
### Direct evaluation
* Average the total discounted reward (playing until the end of the episode) after performing action $a$ in state $s$
* Eventually learns state values but can be slow because it ignores information about state transitions (each state is learned separately)
### Temporal Difference Learning
* Take an exponential moving average
* Converges much faster
* **Algorithm**
* 1) $\forall s, V^\pi(s) = 0$
* 2) Obtain a $\text{sample} = R(s, \pi(s), s') + \gamma V^\pi (s')$
* 3) Update $V^\pi_k(s) \leftarrow (1- \alpha)V^\pi_{k-1}(s) + (\alpha)\text{sample}_k$
* Learning rate: $0 \leq \alpha \leq 1$
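A minimal sketch of a single TD update; the dict-based value table `V` is an assumed representation:

```python
def td_update(V, s, s2, reward, gamma=0.9, alpha=0.1):
    """One temporal-difference update of V^pi from a single (s, pi(s), s', r) sample.
    V is a dict mapping state -> current value estimate."""
    sample = reward + gamma * V.get(s2, 0.0)              # sample = R(s, pi(s), s') + gamma * V(s')
    V[s] = (1 - alpha) * V.get(s, 0.0) + alpha * sample   # exponential moving average
    return V[s]
```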
### Q-Learning
* Learn Q-values directly without knowing $T$ or $R$
* Off-policy learning
* **Algorithm**
* 1) Start with $Q(s,a) = 0$
* Q-value iteration (if $T$ and $R$ were known): $Q_{k+1}(s,a) \leftarrow \sum\limits_{s'}T(s,a,s')[R(s,a,s') + \gamma \max\limits_{a'}Q_k(s',a')]$
* 2) Without the model, use each observed sample: $\text{sample} = R(s, a, s') + \gamma \max\limits_{a'}Q_k(s',a')$
* 3) Update $Q(s,a) \leftarrow (1- \alpha)Q(s,a) + \alpha \cdot \text{sample}$
* Equivalently, $Q(s,a) \leftarrow Q(s,a) + \alpha[\text{difference}]$ where $\text{difference} = \text{sample} - Q(s,a)$
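A minimal sketch of one tabular Q-learning update; the dict-based `Q` table and the `actions(s2)` callable are assumed interfaces, not course code:

```python
def q_update(Q, s, a, s2, reward, actions, gamma=0.9, alpha=0.1):
    """One Q-learning update from a single (s, a, s', r) sample. Q maps
    (state, action) -> value; actions(s2) lists actions available in s'."""
    best_next = max((Q.get((s2, a2), 0.0) for a2 in actions(s2)), default=0.0)
    sample = reward + gamma * best_next              # off-policy: use the max over a'
    Q[(s, a)] = (1 - alpha) * Q.get((s, a), 0.0) + alpha * sample
    return Q[(s, a)]
```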
### Approximate Q-Learning
* Use feature-based representations to represent states as a feature vector
* Advantage: experience is summed up in a few powerful numbers
* Disadvantage: States can share features but actually be very different in value
* $V(s) = w_1f_1(s) + w_2f_2(s) + \ldots w_nf_n(s)$
* **Algorithm**
* 1) $Q(s, a) = w_1f_1(s, a) + w_2f_2(s, a) + \ldots w_nf_n(s, a)$
* 2) $\text{difference} \triangleq [R(s,a,s') + \gamma \max\limits_{a'} Q(s', a')] - Q(s,a)$
* 3) $w_i \leftarrow w_i + \alpha[\text{difference}]f_i(s,a)$
* Adjust weights of active features
* If something bad happens, blame the features that were present
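A minimal sketch of one approximate Q-learning update; the `features(s, a)` dict interface and `actions(s)` callable are illustrative assumptions:

```python
def approx_q_update(weights, features, s, a, s2, reward, actions, gamma=0.9, alpha=0.1):
    """One approximate Q-learning update. features(s, a) is assumed to return a
    dict mapping feature name -> value; weights maps feature name -> weight."""
    def q(state, action):
        return sum(weights.get(f, 0.0) * v for f, v in features(state, action).items())

    best_next = max((q(s2, a2) for a2 in actions(s2)), default=0.0)
    difference = (reward + gamma * best_next) - q(s, a)
    for f, v in features(s, a).items():
        weights[f] = weights.get(f, 0.0) + alpha * difference * v   # adjust active features
    return weights
```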
## Exploration
* $\epsilon$ greedy - Act randomly with a small probability $0 < \epsilon < 1$, exploit with large probability $1-\epsilon$
* Problem: keeps acting randomly (thrashing) even once learning is done
* Solution: Lower $\epsilon$ over time or have exploration functions
* Exploration functions
* Explore areas that we aren't that certain are bad, but eventually stop
* $\text{sample} = R(s, a, s') + \gamma \max\limits_{a'}f(s',a')$
* Take a value estimate $u$ and a visit count $n$ and return an optimistic utility, e.g. $f(u, n) = u + \frac{k}{n}$; here $f(s,a) = Q(s,a) + \dfrac{k}{N(s,a)}$, where $N(s,a)$ is the number of times the Q-state has been visited
* Modified Q updates propagate "bonus" back to states that lead to unknown states
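Minimal sketches of $\epsilon$-greedy action selection and an exploration function of the form above; the dict-based `Q` and `N` tables and the `actions(s)` callable are assumed interfaces:

```python
import random

def epsilon_greedy(Q, s, actions, epsilon=0.1):
    """Epsilon-greedy: explore with probability epsilon, otherwise exploit the
    current Q-value estimates."""
    if random.random() < epsilon:
        return random.choice(list(actions(s)))                    # explore: random action
    return max(actions(s), key=lambda a: Q.get((s, a), 0.0))      # exploit: greedy action

def exploration_value(Q, N, s, a, k=1.0):
    """Optimistic utility f(s, a) = Q(s, a) + k / N(s, a); an unvisited q-state
    gets an unbounded bonus so it is tried at least once."""
    visits = N.get((s, a), 0)
    return float("inf") if visits == 0 else Q.get((s, a), 0.0) + k / visits
```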
## Regret
* Measure of total mistake cost: difference between your rewards (including old suboptimality) and optimal rewards
* Minimizing regret requires optimally learning to be optimal
* Random exploration tends to have high regret
* Can never have 0 regret unless given a known optimal policy
* Or could define regret relative to the best of a class of learning algorithms