# CS188 MT Study Guide

Last updated 10/11/19, created by Matthew Tang

# Uninformed Search

* A type of planning problem
* Search problem:
    * State space:
        * World state contains all details of the environment
        * Search state contains the bare minimum needed for planning
    * Successor function
    * Start state
    * Goal test

| Search | Description | Advantages | Disadvantages |
| -------- | -------- | -------- | -------- |
| Depth-First Search | Expand deepest nodes first | Better space | Can get stuck searching deep nodes |
| Breadth-First Search | Expand shallowest nodes first | Better time, optimal if edge weights are 1 | Worse space |
| Iterative Deepening | Do depth-limited search, increasing the depth limit each time | DFS space advantage with BFS time | Still uninformed search |
| Uniform Cost Search | Use cumulative path cost as the priority for the queue | Optimality guarantee | Explores in every direction, no info about the goal |

| Search | Fringe | Complete | Optimal | Time | Space |
| -------- | -------- | -------- | -------- | -------- | -------- |
| Depth-First Search | Stack (LIFO) | If no cycles | No | $\mathcal{O}(b^d)$ | $\mathcal{O}(bd)$ |
| Breadth-First Search | Queue (FIFO) | Yes | No | $\mathcal{O}(b^d)$ | $\mathcal{O}(b^d)$ |
| Iterative Deepening | Stack (LIFO) | Yes | No | $\mathcal{O}(b^d)$ | $\mathcal{O}(bd)$ |
| Uniform Cost Search | Priority Queue | Yes if positive edge weights | Yes | $\mathcal{O}(b^{C^* / \epsilon})$ | $\mathcal{O}(b^{C^* / \epsilon})$ |

# Informed Search

* Heuristics
    * Estimate how close a state is to a goal
    * Ex. Manhattan distance (ignoring walls), Euclidean distance
* Greedy Search
    * Expand the node that seems closest according to the heuristic
    * Commonly takes a suboptimal route
* Graph Search
    * Don't expand a state twice
    * Tree search + a set of expanded states
    * Consistency of heuristic:
        * With an admissible heuristic, tree search A* is optimal
        * With a consistent heuristic, graph search A* is optimal

## $A^*$

* Combines UCS and Greedy
    * UCS orders by path cost, the backwards cost $g(n)$
    * Greedy orders by goal proximity, the forwards cost $h(n)$
    * $A^*$ orders by the sum $g(n) + h(n)$
* Stop when a goal is dequeued
* Heuristic estimates must not exceed the actual costs
    * **Admissible**: $0 \leq h(n) \leq h^*(n)$, where $h^*(n)$ is the true cost to the nearest goal
    * **Consistent**: $\forall A, C:\; h(A) - h(C) \leq \text{cost}(A,C)$
        * Stronger condition than admissible
        * Heuristic "arc" cost $\leq$ actual cost for each arc
        * Consistent $\implies$ admissible
    * **Dominance**: $h_a$ dominates $h_b$ if $\forall n: h_a(n) \geq h_b(n)$
* Often admissible heuristics are solutions to relaxed problems where new actions are available
* Example heuristics:
    * Manhattan distance (horizontal + vertical distance)
    * Euclidean distance (true distance)
    * Max/min of other heuristics

# Constraint Satisfaction Problems (CSPs)

* A type of identification problem
* Problem formulation
    * Variables: $N$ variables
    * Domain: A set $\{x_1, \ldots, x_d\}$ of all possible domain values
    * Constraints: Restrictions on the values of variables
        * **Unary constraints**: Involve a single variable in the CSP
            * Not represented in constraint graphs; only prune the domain of one variable
        * **Binary constraints**: Involve two variables
            * An edge in a constraint graph
    * Constraint graph: an edge between two variables in a binary constraint

## Solving CSPs

### Backtracking

* An optimization on DFS (see the sketch below)
* 1) Fix an ordering for variables, and select values for variables in this order.
* 2) Only select values for variables that don't conflict with any previously assigned values.
    * If no such values exist, backtrack and return to the previous variable
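A minimal sketch of recursive backtracking search for a CSP. The representation is an assumption made for illustration (not from the guide): `domains` maps each variable to its candidate values, and `constraints` is a list of `(var1, var2, predicate)` binary constraints.

```python
def backtracking_search(domains, constraints, assignment=None):
    """Recursive backtracking: assign variables one at a time, only trying
    values consistent with everything assigned so far; undo on failure.

    domains: {variable: [candidate values]}           (hypothetical encoding)
    constraints: [(var1, var2, predicate), ...]       (hypothetical encoding)
    """
    if assignment is None:
        assignment = {}
    if len(assignment) == len(domains):            # every variable assigned -> solution
        return assignment

    # Fixed ordering: pick the first unassigned variable
    # (an MRV heuristic from the Filtering section could go here instead).
    var = next(v for v in domains if v not in assignment)

    for value in domains[var]:
        # Check the candidate value against all previously assigned neighbors.
        consistent = all(
            pred(value, assignment[v2]) if v1 == var else pred(assignment[v1], value)
            for (v1, v2, pred) in constraints
            if (v1 == var and v2 in assignment) or (v2 == var and v1 in assignment)
        )
        if consistent:
            assignment[var] = value
            result = backtracking_search(domains, constraints, assignment)
            if result is not None:
                return result
            del assignment[var]                    # backtrack: undo and try the next value

    return None                                    # no value works -> fail up to the caller


# Example: 3-coloring a triangle graph (every pair of variables must differ).
domains = {"A": ["R", "G", "B"], "B": ["R", "G", "B"], "C": ["R", "G", "B"]}
neq = lambda x, y: x != y
constraints = [("A", "B", neq), ("B", "C", neq), ("A", "C", neq)]
print(backtracking_search(domains, constraints))
```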
### Filtering

* **Forward checking**: Whenever a variable is assigned, prune conflicting values from the domains of unassigned neighbors; backtrack if any domain becomes empty
* **Arc consistency (AC-3, 2-consistency)**: Check all arcs (directed edges) in the constraint graph
    * 1) Store all arcs in a queue $Q$
    * 2) For an arc $X_i \rightarrow X_j$, remove domain values from the tail $X_i$ that leave no valid value for the head $X_j$
    * 3) If a value is removed while processing $X_i \rightarrow X_j$, enqueue every arc $X_k \rightarrow X_i$ (unless already queued)
    * 4) Repeat until $Q$ is empty; if any domain becomes empty, backtrack
    * Time: $\mathcal{O}(ed^3)$
        * $e$: number of arcs, $d$: size of the largest domain
* **Strong $k$-consistency**: Any subset of $k$ nodes is $k, k-1, \ldots, 1$ consistent
* Ordering
    * **Minimum Remaining Values (MRV)**: Choose the variable with the fewest valid remaining values to assign next
        * The most constrained variable is the most likely to run out of assignments, so handle it sooner
    * **Least Constraining Value (LCV)**: After choosing a variable, select the value that prunes the fewest values from the domains of the remaining unassigned variables
        * Requires additional computation, e.g. rerunning arc consistency

### Structure

* Tree-structured CSP (no loops): runtime is $\mathcal{O}(nd^2)$ instead of $\mathcal{O}(d^n)$
* **Algorithm**
    * 1) Pick any node to be the root
    * 2) Convert undirected edges to directed edges pointing away from the root
    * 3) Linearize (topologically sort) the DAG
    * 4) Do a backwards pass (from $i=n$ to $i=2$) enforcing arc consistency on $\text{Parent}(X_i) \rightarrow X_i$
    * 5) Do a forward assignment pass (from $X_1$ to $X_n$)
* **Cutset conditioning**: Find the smallest subset of variables whose removal leaves a tree
    * Time $\mathcal{O}(d^c (n-c)d^2)$, good for small cutset size $c$

## Local Search

* **Min-conflicts heuristic**: select a random conflicted variable and reassign it the value that violates the fewest constraints
* A form of iterative improvement
    * Improve a single option until you can't make it better (no fringe)
    * Better time and memory, but incomplete and suboptimal

# Games

* **Games**: Adversarial search problems
    * The solution is a policy: the best move from each state
* **Minimax**: Assume the opponent behaves optimally, performing the move that is worst for us
* **Game tree**: Successor states = children
* **Utility**: The score an agent receives
    * Terminal utility is a known value
* Minimax does a postorder traversal of the game tree
* **Expectimax**: chance nodes compute the expected utility/value
* **Multi-agent utilities**: Agents maximize their own utility

## Pruning

* **Alpha-beta pruning** (minimax, sketched below)
    * If the current level is a minimizer and it finds a value less than the best option already available to the maximizer above, prune its remaining children
    * If the current level is a maximizer and it finds a value greater than the best option already available to the minimizer above, prune its remaining children
    * Best case: $\mathcal{O}(b^{m/2})$
* **Evaluation functions**: take in a state and output an estimate of the true minimax value of that node
    * Ex. $\text{Eval}(s) = w_1f_1(s) + w_2f_2(s) + \ldots + w_nf_n(s)$
* **Depth-limited minimax**: Treat non-terminal nodes at the maximum depth as terminal nodes, with utilities given by an evaluation function
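A minimal sketch of depth-limited minimax with alpha-beta pruning. The `successors`, `is_terminal`, and `evaluate` callbacks are assumed problem-specific helpers (invented here for illustration, not from the guide):

```python
import math

def alpha_beta(state, depth, alpha, beta, maximizing,
               successors, is_terminal, evaluate):
    """Depth-limited minimax value of `state` with alpha-beta pruning.

    alpha: best value the maximizer can already guarantee on this path
    beta:  best value the minimizer can already guarantee on this path
    `successors`, `is_terminal`, `evaluate` are assumed game-specific callbacks.
    """
    if depth == 0 or is_terminal(state):
        return evaluate(state)        # depth-limited: evaluation function at the frontier

    if maximizing:
        value = -math.inf
        for child in successors(state, maximizing=True):
            value = max(value, alpha_beta(child, depth - 1, alpha, beta, False,
                                          successors, is_terminal, evaluate))
            if value > beta:          # minimizer above would never allow this branch
                return value          # prune the remaining children
            alpha = max(alpha, value)
        return value
    else:
        value = math.inf
        for child in successors(state, maximizing=False):
            value = min(value, alpha_beta(child, depth - 1, alpha, beta, True,
                                          successors, is_terminal, evaluate))
            if value < alpha:         # maximizer above would never allow this branch
                return value          # prune the remaining children
            beta = min(beta, value)
        return value
```

The root call would be `alpha_beta(start, depth, -math.inf, math.inf, True, ...)`; with good move ordering this pruning is what yields the $\mathcal{O}(b^{m/2})$ best case noted above.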
## Utilities

* **Principle of maximum expected utility**: rational agents must select the action that maximizes their expected utility
* Agent prefers A to B: $A \succ B$
* Agent is indifferent between A and B: $A \sim B$
* Lottery with probability $p$ of getting $A$ and $(1-p)$ of getting $B$: $L = [p, A; (1-p), B]$
* Axioms of rationality
    * **Orderability**: $(A \succ B) \lor (B \succ A) \lor (A \sim B)$
        * Must prefer A, prefer B, or be indifferent
    * **Transitivity**: $((A \succ B)\land (B \succ C)) \implies (A \succ C)$
        * If the agent prefers A to B and B to C, then it prefers A to C
    * **Continuity**: $(A \succ B \succ C) \implies \exists p\; [p, A; (1-p), C] \sim B$
        * If the agent prefers A to B but B to C, it's possible to construct a lottery between A and C such that the agent is indifferent between the lottery and B
    * **Substitutability**: $(A \sim B) \implies [p, A; (1-p), C] \sim [p, B; (1-p), C]$
        * If the agent is indifferent between A and B, it is also indifferent between two lotteries that differ only by swapping A for B
    * **Monotonicity**: $(A \succ B)\implies (p \geq q \Leftrightarrow [p, A; (1-p), B] \succeq [q, A; (1-q), B])$
        * If the agent prefers A over B, then given a choice between lotteries involving only A and B, it prefers the lottery assigning the higher probability to A
* **Utility function**: The utility of a lottery is the expected utility of its outcomes
* Risk:
    * **Risk-averse**: Avoids risk (concave-down utility function)
    * **Risk-neutral**: Indifferent to risk (linear, zero concavity)
    * **Risk-seeking**: Prefers risk (concave-up utility function)

# Non-Deterministic Search: MDPs

* Problem formulation:
    * **States**: A set of states $s \in S$
    * **Actions**: A set of actions $a \in A$
    * **Transition function**: $T(s,a,s')$
        * Probability that taking $a$ from $s$ leads to $s'$
        * $P(s'|s, a)$
        * Aka the model/dynamics
    * **Reward function**: $R(s, a, s')$
        * Typically a small living reward and a large terminal reward
    * **Discount factor**: $0 \leq \gamma \leq 1$
        * $\gamma = 0$ gives a reflex agent (only immediate reward matters)
* **Policy**: a choice of action for each state
    * Goal is to find the optimal policy $\pi^*: S \rightarrow A$
* **Utility**: sum of (discounted) rewards
* Q-states (action states) are nodes in the "search trees"
* Markov (memoryless) property: future states do not depend on past states, only the current state

## Bellman Equation

* $Q^*(s,a) = \sum\limits_{s'} T(s,a,s')[R(s,a,s')+ \gamma V^*(s')]$
    * Best possible reward starting in state $s$ and taking action $a$
* $V^*(s) = \max\limits_a \sum\limits_{s'} T(s,a,s')[R(s,a,s')+ \gamma V^*(s')]$
    * Recursive definition of value: $V^*(s) = \max\limits_a Q^*(s,a)$
    * Best possible reward over all possible actions to take
* The Bellman equation is a condition for optimality: if it holds for a value function, that value function is optimal
* Geometric series sum (finite): $a_1\left( \frac{1-r^n}{1-r}\right)$
* Geometric series sum (infinite, $|r|<1$): $\frac{a_1}{1-r}$
    * $1 + x + x^2 + x^3 + \ldots = \frac{1}{1-x}$

## Value Iteration

* Define $V_k(s)$ to be the optimal value of $s$ if the game ends in $k$ more time steps
    * Equivalent to depth-$k$ expectimax
* **Algorithm** (sketched below)
    * $\forall s \in S,\; V_0(s) = 0$ (no time steps left, so no reward)
    * $\forall s \in S,\; V_{k+1}(s) = \max\limits_a \sum\limits_{s'} T(s,a,s')[R(s,a,s')+ \gamma V_k(s')]$
    * Repeat until convergence: $V_{k+1}(s) = V_k(s) = V^*(s)$
* Policy extraction: $\pi^*(s) = \underset{a}{\mathrm{argmax}}\; Q^*(s,a)$
* Problems:
    * Time per iteration: $\mathcal{O}(S^2A)$
    * The max at each state rarely changes
    * The policy often converges before the values do
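A minimal sketch of value iteration with policy extraction, assuming a hypothetical dictionary encoding of the MDP (not from the guide): `T[(s, a)]` is a list of `(next_state, probability)` pairs and `R[(s, a, next_state)]` is the reward.

```python
def value_iteration(states, actions, T, R, gamma=0.9, tol=1e-6):
    """Iterate V_{k+1}(s) = max_a sum_{s'} T(s,a,s')[R(s,a,s') + gamma * V_k(s')]
    until the values stop changing, then extract the greedy policy.

    T[(s, a)] -> [(next_state, probability), ...]   (hypothetical encoding)
    R[(s, a, next_state)] -> reward                 (hypothetical encoding)
    """
    V = {s: 0.0 for s in states}                    # V_0(s) = 0 for all states

    def q_value(s, a, V):
        # Q(s, a) = expected immediate reward plus discounted value of the successor.
        return sum(p * (R[(s, a, s2)] + gamma * V[s2]) for s2, p in T[(s, a)])

    while True:
        new_V = {}
        for s in states:
            qs = [q_value(s, a, V) for a in actions if (s, a) in T]
            new_V[s] = max(qs) if qs else 0.0       # terminal states keep value 0
        if max(abs(new_V[s] - V[s]) for s in states) < tol:
            V = new_V
            break                                   # values have converged
        V = new_V

    # Policy extraction: pi*(s) = argmax_a Q*(s, a)
    policy = {}
    for s in states:
        candidates = [a for a in actions if (s, a) in T]
        if candidates:
            policy[s] = max(candidates, key=lambda a: q_value(s, a, V))
    return V, policy
```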
## Policy Iteration

* $V^\pi(s) \triangleq \sum\limits_{s'} T(s,\pi(s),s')[R(s,\pi(s),s')+ \gamma V^\pi(s')]$
* **Algorithm**
    * 1) Define an initial policy $\pi_0$; it can be arbitrary and the algorithm will still converge
    * 2) Policy evaluation: $\forall s \in S,\; V^{\pi_i}_{k+1}(s) \leftarrow \sum\limits_{s'} T(s,\pi_i(s),s')[R(s,\pi_i(s),s')+ \gamma V^{\pi_i}_k(s')]$
        * Or solve the resulting system of linear equations directly
    * 3) Policy improvement: $\pi_{i+1}(s) = \underset{a}{\mathrm{argmax}} \sum\limits_{s'} T(s,a,s')[R(s,a,s')+ \gamma V^{\pi_i}(s')]$
    * 4) Repeat until convergence: $\pi_{i+1} = \pi_i = \pi^*$

# Reinforcement Learning

* **Sample**: a $(s, a, s', r)$ tuple
* **Episode**: a collection of samples
* **Sampling**: Have to try things multiple times because of chance
* Looking for a policy $\pi(s)$, but we don't know $T$ or $R$

## Model-Based Learning

* Generate an approximation $\hat{T}(s,a,s')$ (and $\hat{R}(s,a,s')$)
    * Count and normalize over samples which $s'$ each $(s, a)$ ends up in
* By the law of large numbers, $\hat{T}, \hat{R}$ converge

## Model-Free Learning

### Direct Evaluation

* For each state visited, average the total discounted reward received from that visit until the end of the episode
* Eventually learns state values, but can be slow because it ignores information about how states connect (transitions)

### Temporal Difference Learning

* Take an exponential moving average of samples
* Converges much faster than direct evaluation
* **Algorithm**
    * 1) $\forall s,\; V^\pi(s) = 0$
    * 2) Obtain a $\text{sample} = R(s, \pi(s), s') + \gamma V^\pi (s')$
    * 3) Update $V^\pi_k(s) \leftarrow (1- \alpha)V^\pi_{k-1}(s) + \alpha\, \text{sample}_k$
* Learning rate: $0 \leq \alpha \leq 1$

### Q-Learning

* Learn Q-values directly, without $T$ or $R$
* Off-policy learning: converges to the optimal Q-values even while acting suboptimally
* **Algorithm**
    * 1) Start with $Q(s,a) = 0$
    * 2) Q-value iteration update (needs $T$, $R$): $Q_{k+1}(s,a) \leftarrow \sum\limits_{s'}T(s,a,s')[R(s,a,s') + \gamma \max\limits_{a'}Q_k(s',a')]$
    * Q-learning approximates this with samples: $\text{sample} = R(s, a, s') + \gamma \max\limits_{a'}Q_k(s',a')$
    * $Q(s,a) \leftarrow (1- \alpha)Q(s,a) + \alpha\, \text{sample}$
    * Or $Q(s,a) \leftarrow Q(s,a) + \alpha[\text{difference}]$, where $\text{difference} = \text{sample} - Q(s,a)$

### Approximate Q-Learning

* Use feature-based representations to represent states as a feature vector
* Advantage: experience is summed up in a few powerful numbers
* Disadvantage: states can share features but actually be very different in value
* $V(s) = w_1f_1(s) + w_2f_2(s) + \ldots + w_nf_n(s)$
* **Algorithm**
    * 1) $Q(s, a) = w_1f_1(s, a) + w_2f_2(s, a) + \ldots + w_nf_n(s, a)$
    * 2) $\text{difference} \triangleq [R(s,a,s') + \gamma \max\limits_{a'} Q(s', a')] - Q(s,a)$
    * 3) $w_i \leftarrow w_i + \alpha[\text{difference}]f_i(s,a)$
* Adjust the weights of active features
    * If something bad happens, blame the features that were present

## Exploration

* **$\epsilon$-greedy**: Act randomly with a small probability $0 < \epsilon < 1$, exploit with large probability $1-\epsilon$ (see the Q-learning sketch below)
    * Problem: keeps thrashing once learning is done
    * Solution: lower $\epsilon$ over time, or use exploration functions
* Exploration functions
    * Explore areas that we aren't yet certain are bad, but eventually stop
    * $\text{sample} = R(s, a, s') + \gamma \max\limits_{a'}f(s',a')$
    * Take a value estimate $u$ and a visit count $n$ and return an optimistic utility, e.g. $f(s,a) = Q(s,a) + \dfrac{k}{N(s,a)}$, where $N(s,a)$ is the number of times the Q-state has been visited
    * The modified Q-update propagates the exploration "bonus" back to states that lead to unknown states
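A minimal sketch of tabular Q-learning with $\epsilon$-greedy exploration. The `env` interface (`reset()` and `step(state, action) -> (next_state, reward, done)`) is an assumption invented for this example, not part of the guide.

```python
import random
from collections import defaultdict

def q_learning(env, actions, episodes=1000, alpha=0.1, gamma=0.9, epsilon=0.1):
    """Tabular Q-learning with epsilon-greedy exploration.

    `env` is an assumed interface: env.reset() -> start state,
    env.step(state, action) -> (next_state, reward, done).
    """
    Q = defaultdict(float)                          # Q(s, a) = 0 initially

    for _ in range(episodes):
        state = env.reset()
        done = False
        while not done:
            # Epsilon-greedy: explore with probability epsilon, otherwise exploit.
            if random.random() < epsilon:
                action = random.choice(actions)
            else:
                action = max(actions, key=lambda a: Q[(state, a)])

            next_state, reward, done = env.step(state, action)

            # sample = r + gamma * max_a' Q(s', a'); terminal states have no future value.
            future = 0.0 if done else max(Q[(next_state, a)] for a in actions)
            sample = reward + gamma * future

            # Exponential moving average update (equivalently Q += alpha * difference).
            Q[(state, action)] = (1 - alpha) * Q[(state, action)] + alpha * sample

            state = next_state

    # Greedy policy extraction from the learned Q-values.
    policy = {s: max(actions, key=lambda a: Q[(s, a)]) for (s, _a) in Q}
    return Q, policy
```

Lowering `epsilon` over the episodes, or replacing the greedy value with an exploration function $f(s,a)$, corresponds to the fixes discussed in the Exploration bullets above.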
## Regret

* A measure of total mistake cost: the difference between the rewards you actually accumulated (including early suboptimality while learning) and the optimal rewards
* Minimizing regret requires not just learning to be optimal, but optimally learning to be optimal
* Random exploration tends to have high regret
* Can never have 0 regret unless given a known optimal policy
* Alternatively, regret can be defined relative to the best of a class of learning algorithms