# CS188 Notes Part 1 Final notes can be found [here](https://hackmd.io/@matthewtang/188-final) # Lecture 1: 8/29/19 ## Designing Rational Agents * Agent - entity that perceives and acts * Rational agent- select actions to max utility * Percepts, environment, action space dictate techniques for rational actions # Lecture 2: 9/3/19 * Reflex agents - choose action based on current precept (maybe memory) * Do not consider future consequences of actions * Consider how the world **is** * Planning agents - make decisions based on hypothesized consequences of actions * Needs model of how world evolves in response to actions * Formulate a goal * Consider how world **would be** * Optimal vs complete planing * Completeness - tell us if no solution exists * Planning vs replanning * Tradeoff of time for more complicated planning agents ## Search problems * State space * Successor function (with actions, costs) * A start state and a goal test * A solution is a sequence of actions (a plan) which transforms the start state to a goal state ### Example Traveling in Romania * State space: cities * Don't want too many states * Successor function: raods - go to adjacent city. cost = distance * Start state: Arad * Goal test: Is state == Bucharest? * Solution: TBD ### Example: State Space Sizes * World state: * Agent positions: 120 * Food count: 30 * Ghost positions: 12 * Agent facing: NSEW * Total world states: $120(2^{30})(12^2)4$ * Total states for pathing: $120$ * Only need to keep track of agent positions * States for eat all dots: $120(2^{30})$ * Depending on the problem, your state space will be a subset of world state ### Example: Safe Passage * Boolean consumption for big and small dots * Remaining time for scared ghosts * Goal test must work on state ## State Space Graphs * Mathematical representation of a search problem * Nodes are world configurations * Arrows are successors (action results) * Goal test is a set of goal nodes * In a search graph, each state only occurs once * Takes too much memory to explicitly construct this ## Search Trees * A "what if" tree and their outcomes * Root node: start state * Children: successors * Nodes corresponds to plans that achieve those states * Can't build the whole tree for problems ## General Tree Search * Fringe, expansion, exploration strategy * Main idea: which fringe nodes to explore? # Lecture 3: 9/5/19 ## Search Algorithm Properties * Complete: Guaranteed to find a solution if one exists? * Return in finite time if not? * Optimal: Guaranteed to find least cost path? ### Depth First Search * Stack (LIFO) * Time complexity * Total number of nodes: $O(b^m)$ nodes where b is branching factor, m is the depth * Space complexity * $O(bm)$ * Complete if no cycles * Not optimal ### Breadth First Search * Fringe/queue (FIFO) * Processes all nodes above shallowest solution * Time: $O(b^s)$ where b is branching factor, s is depth of shallowest solution * Space: $O(b^s)$ * Complete: Yes, guaranteed to return a solution if exists * Optimal: No ### Iterative Deepening * Combine BFS and DFS. * Get DFS space advantage with BFS time * Run DFS with depth limit 1. If no solution, increase limit by 1. * Early depths have redundant searches but it's not too bad since the last levels take the most time (exponential) ### Uniform Cost Search * Fringe is a priority queue (priority is cumulative cost) * If solution costs $C^*$ and arcs cost at least $\epsilon$, effective depth is $\frac{C^*}{\epsilon}$ * Time: $O(B^{C^* / \epsilon})$ * Space: $O(B^{C^* / \epsilon})$ * Complete: If solution has a finite cost and minimum cost arc is positive, yes * Optimal: yes * Downsides: * Explores options in every "direction" * No information about end goal ## Informed Search * Heuristics * Estimate how close a state is to a goal * Ex. Manhattan distance (ignore walls), euclidean distance * Greedy Search * Expand the node that seems closest * Commonly takes you in a suboptimal route * $A^*$ * Combining UCS and Greedy * UCS orders by path cost, backwards cost - $g(n)$ * Greedy orders by goal proximity, forwards cost - $h(n)$ * Add $g(n) + h(n)$ * Stop when goal is dequeued * Heuristic estimates need to be less than the actual costs * Inadmissible(pessimistic) heuristics overestimate * Admissible(optimistic) if $o \leq h(n) \leq h^*(n)$ where $h^*(n)$ is the true cost to ta nearest goal * Often admissible heuristics are solutions to relaxed problems where new actions are available * Inadmissible heuristics are useful too to find a solution faster * Ex. For 8 puzzle * Heuristic: number of misplaced tiles (terrible) * Better heuristic: Manhattan distance of each tile * Graph Search ### Graph Search * Don't expand a state twice * Tree search + a set of expanded states * Consistency of heuristic - heuristic "arc" cost $\leq$ actual cost for each arc * With admissible heuristic, Tree A* is optimal * With consistent heuristic, Graph A* is optimal # Lecture 4: 9/10/19 ## Constraint satisfaction problems (CSP) * $N$ variables, domain $D$ of all colors * Constraints - restrictions on values of variables in regard to other variables * Planning: for planning a sequence of actions (path) to a goal * Paths have various costs, depths * Difficult to plan low cost path, goal is easy to determine * Identification: Assignments to variables * Goal itself is important, not the path * All paths at the same depth (for some formulations) * CSPs are specialized for identification problems * State is a "black box"/ arbitrary data structure * Goal test can be a function over states * Successor can be anything * CSP * A special subset of search problems * State is defined by variables $X_i$ with values from a domain D (sometimes D depends on i) * Goal test is a set of constraints ### Example: Map Coloring * Variables: WA, NT, Q, NSW, V, SA, T * Domains: $D = \{red, green, blue\}$ * Constraints: Adjacent regions must have diff colors * Implicit: $WA \neq NT$ * Explicit: $(WA, NT) \in \{ (red, green), (red, blue), \ldots \}$ ## Backtracking Search * BFS is no good, DFS is better but still not good * Idea: Incremental goal tests - check constraints along the way * Idea: Fix one variable and then do dfs * Backtracking = DFS + variable-ordering + fail-on-violation * Filtering - Keep track of domains for unassigned variables, cross off bad options * Forward checking: Cross off values that violate a constraing when adding new assignment * Constraint Propagation: Propagates info from assigned to unassigned variables. * Thinks about effects at deeper levels * Arc consistency: Property such that every x in the tail has a y that could be assigned without violating a constraint * Delete from the tail to make consistent # Lecture 5: 9/12/19 ## Filtering * Forward checking * Cross off values that violate a constraint when added to the existing assignment * K consistency - for every k nodes, there exists a consistent assignment * 1-Consistency: Check each single node * 2-Consistency: Arc consistency * Strong K-consistency: Also means $k-1, \ldots, 2, 1$ consistent * Strong N-consistent means no need for backtracking ## Ordering * Variable Ordering: * Minimum Remaining Value (MRV) - Choose the variable with the fewest legal left values in its domain * Aka "most constrained" * Value Ordering: * Least constrained value - Given a chosen variable, choose an assignment that is least constraining ## Iterative Improvement * While not solved: * Variable selection: randomly select any conflicted variable * Value selection: min-conflicts heuristic * Choose a value that violates the fewest constraints * Very good performance except for a golden ratio R ## Structure * Independent subproblems are identifiable as connected components of constraint graph * If the constraint graph is a tree (no loops), can be solved in $O(nd^2)$ as opposed to $O(d^n)$ * $d^2$ comes from checking everything at head and tail ### Tree CSPs * Choose a root variable, order variables so that parents precede children * 1. Do a single pass backwards from n to the 2nd node to ensure the arcs pointing to those nodes are consistent * All root to leaf arcs are consistent * Each X-> Y was consistent at one point and Y's children were always processed first so Y's domain cannot be reduced later * 2. Assign forward: for 1 to n, assign consistently * No need to backtrack ### Nearly Trees * Conditioning - Instantiate a variable, prune neighbors domains * Cutset conditioning - Pick a node, assign it all possible ways, solve the remaining tree * Cutset size c, total time $O(d^c (n-c)d^2)$ * Very fast for small c * Tree Decomposition * Idea: create a tree-structured graph of mega-variables * Each mega-variable encodes part of the original CSP * Subproblems overlap to ensure consistent solutions ## CSP Summary * Basic solution: Backtracking search * Improvements: * Ordering * Filtering * Structure * Iterative min-conflicts is often most practical ## Local Search * Tree search keeps unexplored alternatives on the fringe (ensures completeness) * Local search: improve a single option until you canâ€™t make it better (no fringe!) * Successor function is local changes now * Better time and memory, but incomplete and suboptimal * Idea: Start wherever, move to best neighboring state * If no better neighbors, quit ### Simulated Annealing * Escape local maxima by allowing downhill movement * But more rare as time goes on * $T$, a "temperature" controls the probability of downward steps * If $T$ is decreased slow enough, will converge on optimal solution ### Genetic Algorithms * Natural selection strategy * Keep best N hypotheses at each step (selection) based on a fitness function * Pairwise crossover operators, with optional mutation to give variety # Lecture 6: 9/17/19 ## Search with other agents * Types of games: * Deterministic or stochastic? * 1, 2, or more players? * 0 sum? * Perfect information (can you see the state?) * Sometimes have to estimate the state from sensor input * Do we know the objectives/utilities of the other agents? * For now, assume that we know their objectives exactly. ## Types of games * General games * Agents have independent utilities * Cooperation, indifference, competition are possible * Zero-sum games * Agents have opposite utilities * One maximizes, other minimizes * Adversarial, pure competition ## Deterministic Games with Terminal Utilities * States: $S$ * Players: $P = \{1 \ldots N\}$ (usually take turns) * Actions: $A$ (can depend on player/state) * Transition function: $SxA \implies S$ * Terminal Test: $S \implies \{t,f\}$ * Terminal Utilities: $SxP \implies R$ * Solution for a player is a policy $S \implies A$ * No longer minimizing cost, agent tries to maximize score/utility * Value of a state: best achievable outcome/utility from that state * Non-terminal states: $V(x) = \max(V(x'))$ ## Minimax Adversarial Search * Deterministic, zero-sum games * One player maximizes result, other minimizes result * Minimax search * State-space search tree * Players alternate turns * Compute each node's minimax value- best achievable utility against a rational, optimal adversary python= def value(state): if state is terminal: return state utility if next agent is MAX: return max-value(state) if next agent is MIN: return min-value(state)  * Time: $O(b^m)$, Space: $O(bm)$ * But we don't need to explore the whole tree ## Alpha-Beta Pruning * Node $n$ has children * Stop considering options better/worse than one already found (depending on who is pruning) * Good child ordering improves pruning effectiveness * Perfect ordering - Time: $O(b^{m/2})$ * Doubles solvable depth * Metareasoning - computing what to compute # Lecture 7: 9/19/19 ## Resource Limits * Cannot realistically search to leaves * Solution: Depth limited search * Evaluation function - estimate/score non-terminals in depth limited search * Ex. Weighted linear sum of features * Depth matters: * The deeper in the tree the eval function is, quality matters less * Tradeoff between complexity of features and computation ## General Sum Games * "Maximax" games where terminals have utility tuples * Each player maximizes its own component * Sometimes you get competition, sometimes collaboration ## Expectimax Search - Uncertain Outcomes * Explicit randomness, unpredictable opponents/humans * Reflect average-case instead of worst case * Compute average score under optimal play * Calculate expected utilities (weighted average of children) * Aka Markov Decision Process (MDP) * Cannot prune * Probabilities can be uniform distribution or sophisticated # Lecture 8: 9/24/19 ## Markov Decision Processes * Use when world is stochastic or you don't know the whole world in your model * Ex. Noisy movement in a grid world * Agent receives rewards at each time step * Small living reward at each step (can be negative) * Big rewards at the end (good or bad) * Goal: maximize sum of rewards * Expectimax is a naive tool to solve * Given the present state, future and past are independent * Action outcomes depend on only current state ## MDP Formulation * A set of states $s \in S$ * A set of actions $a \in A$ * A transition function $T(s,a,s')$ * Probability that $a$ from $s$ leads to $s'$ * $P(s'|s, a)$ * Aka Model/dynamics * Reward function $R(s, a, s')$ * Sometimes just $R(s)$ or $R(s')$ * Start state * Terminal state * Policy = choice of action for each state * Utility = sum of (discounted) rewards ## Policies * In deterministic single agent search, we have an optimal plan * For MDPs we want an optimal policy $\pi^*: S \rightarrow A$ * Policy $\pi$ gives action for each state * Optimal policy maximizes expected utility * Explicit policy = reflex agent * Too small of a living cost in reward function can cause agent to get stuck in local minima * Too large can cause agent to "end suffering" ## Discounting * Maximize sum of rewards but prefer rewards earlier * Solution: Exponential decay of value later on * Discount factor $\gamma$ * Think of it as a gamma chance of ending the process at each step * Helps algorithms converge ## Infinite Utilities * Finite horizon (similar to depth-limited search) * Terminates after fixed T steps (life) * Give nonstationary policies ($\pi$ depends on time left) * Discounting factor $0 < \gamma < 1$ * Converging geometric series * $\gamma = 0$ only cares about current state reward (not practical) * $\gamma > 1$ makes things in the future worth more * Smaller $\gamma$ means smaller "horizon"/focus # Lecture 9: 9/26/19 ## Bellman Equation * Q values have a value for every possible action * Recursive definition of value: $V^* = \max\limits_a Q^*(s,a)$ * Best possible reward of all possible actions to take * $Q^*(s,a) = \sum\limits_{s'} T(s,a,s')[R(s,a,s')+ \gamma V^*(s')]$ * Best possible reward given an action at current state * $V^*(s) = \max\limits_a \sum\limits_{s'} T(s,a,s')[R(s,a,s')+ \gamma V^*(s')]$ ## Time-Limited Values * Define $V_k(s)$ to be optimal value of s if game ends in k more time steps * Equivalent to depth-k expectimax ## Value Iteration * $V_0(s) = 0$ b/c no time steps left so no reward * $V_{k+1}(s) = \max\limits_a \sum\limits_{s'} T(s,a,s')[R(s,a,s')+ \gamma V_k(s')]$ * Repeat until convergence * Problems: * Complexity of iterations: $O(S^2A)$ * Max at each state rarely changes * Policy often converges before values do ### Convergence * If tree has maximum depth M, $V_M$ holds actual untruncated values * If $\gamma < 1$, the values converge as k increases ### Computing Actions from Values * Do a mini-expectimax of 1 step * $\pi^*(s) = \text{argmax}_a \sum\limits_{s'} T(s,a,s')[R(s,a,s')+ \gamma V_k(s')]$ ## Policy Iteration * 1) Policy evaluation - calculate utilities for a fixed policy until convergence * 2) Policy improvement - update policy using 1 step look ahead with resulting converged (but not optimal) utilitieis as future values * Repeat until policy converges * If we fixed a policy $\pi(s)$, only 1 action per state * Compute utility of state s under a fixed (usually not optimal) policy * Define utility of state s under fixed policy $\pi$ as: * $V^\pi(s) =$ expected total discounted rewards starting in s, following $\pi$ * Recursive relation doesn't need to use max since we use $\pi(s)$ instead of optimal action * Evaluation: Find values with policy evaluation and iterate until values converge * Improvement: Get a better policy using policy extraction ## Comparison * Both value and policy iteration compute same thing * Value iteration: * Every iteration updates both values and implicity policy * Don't track plicy but taking max over actions implicitly recomputes it * Policy Iteration * Several passes that update utilities with fixed policy * After policy is evaluated, a new policy is chosen * New policy improves # Lecture 10: 10/1/19 ## Reinforcement Learning * Missing information about MDP so you can't make an optimal policy * Exploitation: Use what you know eventually * Regret: Even if you learn intelligently, you make mistakes * Sampling: Have to try things repeatedly b/c of chance * Difficulty: Learning is much harder than solving a known MDP * Exploration: you have to try unknown actions to get information * Looking for a policy $\pi(s)$ * We don't know T or R * Don't know which states are good or what actions do * Can no longer take actions offline since we don't know T or R * Not a POMDP since you know the output (is observable) ## Passive Reinforcement Learning * Model-Based Idea * Idea: We eventually learn the right model * Learn an approximate model based on experiences * Solve for values as if learned model was correct * 1) Learn empirical MDP model * Count outcomes $s'$ for each action, normalize * Discover rewards from actions * 2) Solve the learned MDP (Ex. via value iteration) ## Model-Free * Idea: Samples appear with the right frequencies * Simplified task: policy evaluation * Input a fixed policy $\pi(s)$, don't know T or R * Goal: Learn state values * Learner is "along for the ride" * No choice about what actions to take * Execute policy and learn from the experience ### Direct Evaluation * Compute values for each state under $\pi$ * Average together observed sample values * Instead of trying to learn a transition function, just find values for states * Pros: Don't need to understand T or R, eventually computes correct average values * Cons: Wastes info about state connectivity, takes a while to learn each state ## Temporal Difference Learning * Update $V(s)$ each time we experience a transition * Likely outcomes $s'$ contribute updates more often * Sample: $R(s, \pi(s), s') + \gamma V^\pi (s')$ * Update: $V^\pi(s) \leftarrow (1- \alpha)V^\pi(s) + (\alpha)sample$ * Aka $V^\pi(s) \leftarrow V^\pi(s) + \alpha(sample - V^\pi(s))$ * Exponential moving average * We forget about the past since more recent is more representative * Problem: We can't turn values into a new policy * Solution: Learn Q-values, not values ## Q-Value Iteration * Value Iteration (from before): Find successive (depth limited) values * Start with $V_0(s) = 0$ * Given $V_k$, calculate depth $k+1$ values for all states * Q-Value Iteration: * Start with $Q(s,a) = 0$ * $Q_{k+1}(s,a) = \sum\limits_{s'}T(s,a,s')[R(s,a,s') + \gamma \max\limits_{a'}Q_k(s',a')]$ ## Q-Learning * Sample based Q-value iteration * Learn $Q(s,a)$ values as you go * Receive a sample $(s,a,s',r)$ * Incorporate new estimate into a running average * Converges to optimal policy even if you act suboptimally * Have to explore enough * Make learning rate small but not decrease too quickly # Lecture 11: 10/3/19 ## Summary * Known MDP - Offline solution * Compute $V^*, Q^*, \pi^*$ - Value/policy iteration * Evaluate fixed policy $\pi$ - Policy iteration * Unknown MDP - Model-based * Compute $V^*, Q^*, \pi^*$ - Value/policy iteration on approx MDP * Evaluate fixed policy $\pi$ - Policy iteration on approx MDP * Unknown MDP - Model-free * Compute $V^*, Q^*, \pi^*$ - Q-Learning * Evaluate fixed policy $\pi$ - Value Learning * Active Reinforcement Learning * Act according to current optimal (based on Q-Values) but also explore * Use the feedback it receives to iteratively update its policy while learning until eventually determining the optimal policy after sufficient exploration ## Exploration vs Exploitation * $\epsilon$ greedy - With a small probability $\epsilon$ act randomly, with large probability $1-\epsilon$ * Problems: Keep thrashing once learning is done * Solution: Lower $\epsilon$ over time or have exploration functions * Exploration functions * Explore areas that we aren't that certain are bad, but eventually stop * Take a value estimate $u$ and visit count $n$ and return an optimistic utility like $f(u,n) = u + k/n$ * Modified Q updates propagate "bonus" back to states that lead to unknown states ## Regret * Measure of total mistake cost: difference between your rewards (including old suboptimality) and optimal rewards * Minimizing regret requires optimally learning to be optimal * Random exploration tends to have high regret * Can never have 0 regret unless given a known optimal policy * Or could define regret relative to the best of a class of learning algorithms # Lecture 12: 10/8/19 ## Approximate Q-Learning * Basic Q-Learning keeps a table of q-values * Too many states to visit them all in training and hold in memory * Want to generalize learning from small number of training states to new similar states * Fundamental idea in ML * Solution: Describe a state using a vector of properties * Features: functions from states to real numbers (often 1 or 0) to capture important properties of states * Ex. Dist to closest ghost, food, num ghosts, 1/dist^2 * Linear Value functions * $V(s) = w_1f_1(s) + w_2f_2(s) + \ldots w_nf_n(s)$ * Advantage: experience is summed up in a few powerful nums * Disadvantage: States can share features but actually be very diff in value * Approximate Q-Learning * $Q(s, a) = w_1f_1(s, a) + w_2f_2(s, a) + \ldots w_nf_n(s, a)$ * difference = $[r + \gamma \max\limits_{a'} Q(s', a')] - Q(s,a)$ * $Q(s,a) \leftarrow Q(s,a) + \alpha[\text{difference}]$ * $w_i \leftarrow w_i + \alpha[\text{difference}]f_i(s,a)$ * Adjust weights of active features. * If something bad happens, blame the features that were present ## Least Squares * Total error $= \sum\limits_i (y_i - \hat{y_i})^2 = \sum\limits_i \left(y_i - \sum\limits_k w_kf_k(x_i)\right)^2$ * $error(w) = \frac{1}{2}\left(y - \sum\limits_k w_kf_k(x_i)\right)^2$ * $\frac{\partial{error(w)}}{\partial{w_m}} = -\left(y - \sum\limits_k w_kf_k(x_i)\right)f_m(x)$ * By chain rule * $w_m \leftarrow w_m + \alpha \left(y - \sum\limits_k w_kf_k(x_i)\right)f_m(x)$ * $w_m \leftarrow w_m + \alpha[r + \gamma \max\limits_{a'} Q(s', a') - Q(s,a)]f_m(s,a)$ * target - prediction ## Policy Search * Problem: Feature-based policies that work well don't approximate V/Q best * Q learning priority: get Q-values close (modeling) * Action selection priority: Get ordering of Q-values right (prediction) * Solution: learn policies that maximize rewards, not the values that predict them * Policy search: Start with an ok solution (e.g. Q-learning) then fine-tune by hill climbing on feature weights ## Model-based vs Model-free * Model-based * Model is nice to learn thru reward functions * Learning values is more complicated than learning dynamics * Model-free * Good if dynamics are complicated