# CS188 Notes Part 1
Final notes can be found [here](https://hackmd.io/@matthewtang/188-final)
# Lecture 1: 8/29/19
## Designing Rational Agents
* Agent - entity that perceives and acts
* Rational agent - selects actions to maximize utility
* Percepts, environment, action space dictate techniques for rational actions
# Lecture 2: 9/3/19
* Reflex agents - choose action based on current percept (maybe memory)
* Do not consider future consequences of actions
* Consider how the world **is**
* Planning agents - make decisions based on hypothesized consequences of actions
* Needs model of how world evolves in response to actions
* Formulate a goal
* Consider how world **would be**
* Optimal vs. complete planning
* Completeness - tell us if no solution exists
* Planning vs replanning
* Tradeoff of time for more complicated planning agents
## Search problems
* State space
* Successor function (with actions, costs)
* A start state and a goal test
* A solution is a sequence of actions (a plan) which transforms the start state to a goal state
### Example Traveling in Romania
* State space: cities
* Don't want too many states
* Successor function: roads - go to adjacent city, cost = distance
* Start state: Arad
* Goal test: Is state == Bucharest?
* Solution: TBD
### Example: State Space Sizes
* World state:
* Agent positions: 120
* Food count: 30
* Ghost positions: 12
* Agent facing: NSEW
* Total world states: $120(2^{30})(12^2)4$
* Total states for pathing: $120$
* Only need to keep track of agent positions
* States for eat all dots: $120(2^{30})$
* Depending on the problem, your state space will be a subset of world state
### Example: Safe Passage
* Boolean consumption for big and small dots
* Remaining time for scared ghosts
* Goal test must work on state
## State Space Graphs
* Mathematical representation of a search problem
* Nodes are world configurations
* Arrows are successors (action results)
* Goal test is a set of goal nodes
* In a search graph, each state only occurs once
* Takes too much memory to explicitly construct this
## Search Trees
* A "what if" tree and their outcomes
* Root node: start state
* Children: successors
* Nodes correspond to plans that achieve those states
* Can't build the whole tree for problems
## General Tree Search
* Fringe, expansion, exploration strategy
* Main idea: which fringe nodes to explore?
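A minimal sketch of this general tree-search loop, assuming a hypothetical problem interface (`start_state`, `is_goal(s)`, and `successors(s)` yielding (action, next_state) pairs) that is not from the notes:
```python=
from collections import deque

def tree_search(problem):
    """Generic tree search: the fringe data structure fixes the exploration strategy."""
    fringe = deque([(problem.start_state, [])])        # node = (state, plan so far)
    while fringe:
        state, plan = fringe.popleft()                 # FIFO queue -> BFS; use pop() for DFS
        if problem.is_goal(state):
            return plan                                # a plan is a sequence of actions
        for action, next_state in problem.successors(state):
            fringe.append((next_state, plan + [action]))
    return None                                        # fringe exhausted, no solution
```
Swapping `popleft()` for `pop()` turns the same loop into DFS, and a priority queue keyed by cumulative cost gives UCS (Lecture 3).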
# Lecture 3: 9/5/19
## Search Algorithm Properties
* Complete: Guaranteed to find a solution if one exists?
* Return in finite time if not?
* Optimal: Guaranteed to find least cost path?
### Depth First Search
* Stack (LIFO)
* Time complexity
* Total number of nodes: $O(b^m)$ nodes where b is branching factor, m is the depth
* Space complexity
* $O(bm)$
* Complete only if we prevent cycles (i.e., m is finite)
* Not optimal
### Breadth First Search
* Fringe/queue (FIFO)
* Processes all nodes above shallowest solution
* Time: $O(b^s)$ where b is branching factor, s is depth of shallowest solution
* Space: $O(b^s)$
* Complete: Yes, guaranteed to return a solution if exists
* Optimal: Only if all step costs are equal (otherwise no)
### Iterative Deepening
* Combine BFS and DFS.
* Get DFS space advantage with BFS time
* Run DFS with depth limit 1. If no solution, increase limit by 1.
* Early depths have redundant searches but it's not too bad since the last levels take the most time (exponential)
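A sketch of iterative deepening, under the same hypothetical problem interface as the tree-search sketch above:
```python=
def depth_limited_dfs(problem, state, limit, plan=()):
    """DFS that refuses to go deeper than `limit` actions."""
    if problem.is_goal(state):
        return list(plan)
    if limit == 0:
        return None
    for action, next_state in problem.successors(state):
        result = depth_limited_dfs(problem, next_state, limit - 1, plan + (action,))
        if result is not None:
            return result
    return None

def iterative_deepening(problem, max_depth=100):
    """Run depth-limited DFS with limits 1, 2, 3, ... until a solution is found."""
    for limit in range(1, max_depth + 1):
        result = depth_limited_dfs(problem, problem.start_state, limit)
        if result is not None:
            return result
    return None
```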
### Uniform Cost Search
* Fringe is a priority queue (priority is cumulative cost)
* If solution costs $C^*$ and arcs cost at least $\epsilon$, effective depth is $\frac{C^*}{\epsilon}$
* Time: $O(b^{C^* / \epsilon})$
* Space: $O(b^{C^* / \epsilon})$
* Complete: If solution has a finite cost and minimum cost arc is positive, yes
* Optimal: yes
* Downsides:
* Explores options in every "direction"
* No information about end goal
## Informed Search
* Heuristics
* Estimate how close a state is to a goal
* Ex. Manhattan distance (ignore walls), euclidean distance
* Greedy Search
* Expand the node that seems closest
* Commonly takes a suboptimal route
* $A^*$
* Combining UCS and Greedy
* UCS orders by path cost, backwards cost - $g(n)$
* Greedy orders by goal proximity, forwards cost - $h(n)$
* Order by the sum $f(n) = g(n) + h(n)$
* Stop when goal is dequeued
* Heuristic estimates need to be less than or equal to the actual costs
* Inadmissible (pessimistic) heuristics overestimate
* Admissible (optimistic) if $0 \leq h(n) \leq h^*(n)$ where $h^*(n)$ is the true cost to the nearest goal
* Often admissible heuristics are solutions to relaxed problems where new actions are available
* Inadmissible heuristics are useful too to find a solution faster
* Ex. For 8 puzzle
* Heuristic: number of misplaced tiles (terrible)
* Better heuristic: Manhattan distance of each tile
### Graph Search
* Don't expand a state twice
* Tree search + a set of expanded states
* Consistency of heuristic - heuristic "arc" cost $\leq$ actual cost for each arc
* With admissible heuristic, Tree A* is optimal
* With consistent heuristic, Graph A* is optimal
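A sketch of A* graph search, assuming the hypothetical problem interface from the earlier search sketches except that `successors(s)` here also yields a step cost; with `heuristic = lambda s: 0` this reduces to UCS:
```python=
import heapq
import itertools

def astar_graph_search(problem, heuristic):
    """A* graph search: priority f(n) = g(n) + h(n), with a closed set of expanded states."""
    counter = itertools.count()                        # tie-breaker so heapq never compares states
    start = problem.start_state
    fringe = [(heuristic(start), next(counter), 0.0, start, [])]
    closed = set()                                     # states already expanded
    while fringe:
        _, _, g, state, plan = heapq.heappop(fringe)
        if problem.is_goal(state):                     # only stop when a goal is dequeued
            return plan
        if state in closed:
            continue
        closed.add(state)
        for action, next_state, step_cost in problem.successors(state):
            new_g = g + step_cost
            heapq.heappush(fringe, (new_g + heuristic(next_state), next(counter),
                                    new_g, next_state, plan + [action]))
    return None
```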
# Lecture 4: 9/10/19
## Constraint satisfaction problems (CSP)
* $N$ variables, each with a domain $D$ of possible values (e.g., colors)
* Constraints - restrictions on values of variables in regard to other variables
* Planning: for planning a sequence of actions (path) to a goal
* Paths have various costs, depths
* Difficult to plan low cost path, goal is easy to determine
* Identification: Assignments to variables
* Goal itself is important, not the path
* All paths at the same depth (for some formulations)
* CSPs are specialized for identification problems
* State is a "black box"/ arbitrary data structure
* Goal test can be a function over states
* Successor can be anything
* CSP
* A special subset of search problems
* State is defined by variables $X_i$ with values from a domain D (sometimes D depends on i)
* Goal test is a set of constraints
### Example: Map Coloring
* Variables: WA, NT, Q, NSW, V, SA, T
* Domains: $D = \{red, green, blue\}$
* Constraints: Adjacent regions must have diff colors
* Implicit: $WA \neq NT$
* Explicit: $(WA, NT) \in \{ (red, green), (red, blue), \ldots \}$
## Backtracking Search
* BFS is no good, DFS is better but still not good
* Idea: Incremental goal tests - check constraints along the way
* Idea: Fix one variable and then do dfs
* Backtracking = DFS + variable-ordering + fail-on-violation
* Filtering - Keep track of domains for unassigned variables, cross off bad options
* Forward checking: Cross off values that violate a constraint when adding a new assignment
* Constraint Propagation: Propagates info from assigned to unassigned variables.
* Thinks about effects at deeper levels
* Arc consistency: for every value x in the tail's domain, there is some value y in the head's domain that could be assigned without violating a constraint
* Delete from the tail to make consistent
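A minimal sketch of backtracking with variable ordering and forward checking, assuming a hypothetical CSP representation: `domains` maps each variable to a list of candidate values, `neighbors` maps each variable to the variables it shares constraints with, and `constraint_ok(x, vx, y, vy)` says whether a pair of values is allowed (for map coloring it would just check `vx != vy` for adjacent regions):
```python=
import copy

def backtrack(domains, neighbors, constraint_ok, assignment=None):
    """Backtracking = DFS + variable ordering + fail-on-violation, with forward checking."""
    assignment = assignment or {}
    if len(assignment) == len(domains):
        return assignment                              # every variable assigned consistently
    # MRV ordering: pick the unassigned variable with the fewest legal values left
    var = min((v for v in domains if v not in assignment),
              key=lambda v: len(domains[v]))
    for value in domains[var]:
        if not all(constraint_ok(var, value, n, assignment[n])
                   for n in neighbors[var] if n in assignment):
            continue                                   # incremental goal test: fail on violation
        # Forward checking: cross off neighbor values that conflict with this choice
        new_domains = copy.deepcopy(domains)
        new_domains[var] = [value]
        for n in neighbors[var]:
            if n not in assignment:
                new_domains[n] = [x for x in new_domains[n]
                                  if constraint_ok(var, value, n, x)]
        if all(new_domains.values()):                  # no domain wiped out -> recurse
            result = backtrack(new_domains, neighbors, constraint_ok,
                               {**assignment, var: value})
            if result is not None:
                return result
    return None
```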
# Lecture 5: 9/12/19
## Filtering
* Forward checking
* Cross off values that violate a constraint when added to the existing assignment
* K consistency - for every k nodes, there exists a consistent assignment
* 1-Consistency: Check each single node
* 2-Consistency: Arc consistency
* Strong K-consistency: Also means $k-1, \ldots, 2, 1$ consistent
* Strong N-consistent means no need for backtracking
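A sketch of enforcing arc consistency in the style of AC-3, using the same hypothetical CSP representation as the backtracking sketch above:
```python=
from collections import deque

def enforce_arc_consistency(domains, neighbors, constraint_ok):
    """Make every arc tail -> head consistent; return False if a domain is wiped out."""
    queue = deque((tail, head) for tail in domains for head in neighbors[tail])
    while queue:
        tail, head = queue.popleft()
        # Delete from the tail: keep x only if some y in the head's domain is compatible
        kept = [x for x in domains[tail]
                if any(constraint_ok(tail, x, head, y) for y in domains[head])]
        if len(kept) < len(domains[tail]):
            domains[tail] = kept
            if not kept:
                return False                           # empty domain -> no solution from here
            # The tail's domain shrank, so re-check arcs pointing into the tail
            for other in neighbors[tail]:
                if other != head:
                    queue.append((other, tail))
    return True
```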
## Ordering
* Variable Ordering:
* Minimum Remaining Values (MRV) - Choose the variable with the fewest legal values left in its domain
* Aka "most constrained"
* Value Ordering:
* Least constraining value - Given a chosen variable, choose the value that rules out the fewest values in the remaining variables' domains
## Iterative Improvement
* While not solved:
* Variable selection: randomly select any conflicted variable
* Value selection: min-conflicts heuristic
* Choose a value that violates the fewest constraints
* Very good performance in general, except in a critical range of the ratio R of constraints to variables
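A sketch of the min-conflicts loop, assuming a hypothetical `conflicts(var, value, assignment)` helper that counts constraints the value would violate:
```python=
import random

def min_conflicts(variables, domains, conflicts, max_steps=100_000):
    """Iterative improvement: start with a complete random assignment, then repair."""
    assignment = {v: random.choice(domains[v]) for v in variables}
    for _ in range(max_steps):
        conflicted = [v for v in variables if conflicts(v, assignment[v], assignment) > 0]
        if not conflicted:
            return assignment                          # no violated constraints -> solved
        var = random.choice(conflicted)                # variable selection: random conflicted var
        assignment[var] = min(domains[var],            # value selection: min-conflicts value
                              key=lambda val: conflicts(var, val, assignment))
    return None                                        # give up after max_steps
```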
## Structure
* Independent subproblems are identifiable as connected components of constraint graph
* If the constraint graph is a tree (no loops), can be solved in $O(nd^2)$ as opposed to $O(d^n)$
* $d^2$ comes from checking everything at head and tail
### Tree CSPs
* Choose a root variable, order variables so that parents precede children
* 1. Remove backward: do a single pass from node n down to node 2, making each arc Parent($X_i$) $\rightarrow X_i$ consistent
* All root to leaf arcs are consistent
* Each $X \rightarrow Y$ was consistent at one point, and Y's children were always processed first, so Y's domain cannot be reduced later
* 2. Assign forward: for 1 to n, assign consistently
* No need to backtrack
### Nearly Trees
* Conditioning - Instantiate a variable, prune neighbors domains
* Cutset conditioning - Pick a node, assign it all possible ways, solve the remaining tree
* Cutset size c, total time $O(d^c (n-c)d^2)$
* Very fast for small c
* Tree Decomposition
* Idea: create a tree-structured graph of mega-variables
* Each mega-variable encodes part of the original CSP
* Subproblems overlap to ensure consistent solutions
## CSP Summary
* Basic solution: Backtracking search
* Improvements:
* Ordering
* Filtering
* Structure
* Iterative min-conflicts is often most practical
## Local Search
* Tree search keeps unexplored alternatives on the fringe (ensures completeness)
* Local search: improve a single option until you can’t make it better (no fringe!)
* Successor function is local changes now
* Better time and memory, but incomplete and suboptimal
* Idea: Start wherever, move to best neighboring state
* If no better neighbors, quit
### Simulated Annealing
* Escape local maxima by allowing downhill movement
* But more rare as time goes on
* $T$, a "temperature" controls the probability of downward steps
* If $T$ is decreased slowly enough, it will converge to the optimal state
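A sketch of simulated annealing; `neighbors`, `value`, and the cooling `schedule` are all hypothetical placeholders:
```python=
import math
import random

def simulated_annealing(start, neighbors, value, schedule, max_steps=10**6):
    """Hill climbing that sometimes moves downhill: accept a worse neighbor with
    probability exp(delta / T), which shrinks as the temperature T decreases."""
    current = start
    for t in range(1, max_steps + 1):
        T = schedule(t)                                # e.g. schedule = lambda t: 100 * 0.99**t
        if T <= 0:
            break
        candidate = random.choice(neighbors(current))
        delta = value(candidate) - value(current)      # > 0 means the candidate is better
        if delta > 0 or random.random() < math.exp(delta / T):
            current = candidate                        # downhill moves get rarer over time
    return current
```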
### Genetic Algorithms
* Natural selection strategy
* Keep best N hypotheses at each step (selection) based on a fitness function
* Pairwise crossover operators, with optional mutation to give variety
# Lecture 6: 9/17/19
## Search with other agents
* Types of games:
* Deterministic or stochastic?
* 1, 2, or more players?
* 0 sum?
* Perfect information (can you see the state?)
* Sometimes have to estimate the state from sensor input
* Do we know the objectives/utilities of the other agents?
* For now, assume that we know their objectives exactly.
## Types of games
* General games
* Agents have independent utilities
* Cooperation, indifference, competition are possible
* Zero-sum games
* Agents have opposite utilities
* One maximizes, other minimizes
* Adversarial, pure competition
## Deterministic Games with Terminal Utilities
* States: $S$
* Players: $P = \{1 \ldots N\}$ (usually take turns)
* Actions: $A$ (can depend on player/state)
* Transition function: $S \times A \rightarrow S$
* Terminal Test: $S \rightarrow \{t,f\}$
* Terminal Utilities: $S \times P \rightarrow \mathbb{R}$
* Solution for a player is a policy $S \rightarrow A$
* No longer minimizing cost, agent tries to maximize score/utility
* Value of a state: best achievable outcome/utility from that state
* Non-terminal states: $V(s) = \max\limits_{s' \in \text{successors}(s)} V(s')$
## Minimax Adversarial Search
* Deterministic, zero-sum games
* One player maximizes result, other minimizes result
* Minimax search
* State-space search tree
* Players alternate turns
* Compute each node's minimax value- best achievable utility against a rational, optimal adversary
```python=
def value(state):
    # Dispatch on whose turn it is at this node (hypothetical state API)
    if state.is_terminal():
        return state.utility()
    if state.next_agent() == "MAX":
        return max_value(state)
    return min_value(state)
```
* Time: $O(b^m)$, Space: $O(bm)$
* But we don't need to explore the whole tree
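A minimal sketch of the two helpers used in the pseudocode above, assuming the same hypothetical state API (`successors()`, `next_agent()`, etc.):
```python=
def max_value(state):
    """MAX node: best achievable utility is the max over children's minimax values."""
    return max(value(successor) for successor in state.successors())

def min_value(state):
    """MIN node: an optimal adversary picks the child with the lowest minimax value."""
    return min(value(successor) for successor in state.successors())
```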
## Alpha-Beta Pruning
* While computing the value of a node $n$, track the best value found so far for MAX ($\alpha$) and for MIN ($\beta$) along the path
* Stop considering a node's remaining children once it is already worse than an option found elsewhere (depending on who is pruning)
* Good child ordering improves pruning effectiveness
* Perfect ordering - Time: $O(b^{m/2})$
* Doubles solvable depth
* Metareasoning - computing what to compute
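A sketch of minimax with alpha-beta pruning under the same hypothetical state API; $\alpha$ is the best value MAX can guarantee so far on the path, $\beta$ the best value MIN can guarantee:
```python=
def alpha_beta_value(state, alpha=float("-inf"), beta=float("inf")):
    """Minimax value with alpha-beta pruning."""
    if state.is_terminal():
        return state.utility()
    if state.next_agent() == "MAX":
        v = float("-inf")
        for successor in state.successors():
            v = max(v, alpha_beta_value(successor, alpha, beta))
            if v >= beta:
                return v                               # MIN above would never allow this branch
            alpha = max(alpha, v)
        return v
    v = float("inf")
    for successor in state.successors():
        v = min(v, alpha_beta_value(successor, alpha, beta))
        if v <= alpha:
            return v                                   # MAX above would never allow this branch
        beta = min(beta, v)
    return v
```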
# Lecture 7: 9/19/19
## Resource Limits
* Cannot realistically search to leaves
* Solution: Depth limited search
* Evaluation function - estimate/score non-terminals in depth limited search
* Ex. Weighted linear sum of features
* Depth matters:
* The deeper in the tree the evaluation function is buried, the less its quality matters
* Tradeoff between complexity of features and computation
## General Sum Games
* "Maximax" games where terminals have utility tuples
* Each player maximizes its own component
* Sometimes you get competition, sometimes collaboration
## Expectimax Search - Uncertain Outcomes
* Explicit randomness, unpredictable opponents/humans
* Reflect average-case instead of worst case
* Compute average score under optimal play
* Calculate expected utilities (weighted average of children)
* Closely related to Markov Decision Processes (MDPs)
* Cannot prune
* Probabilities can be uniform distribution or sophisticated
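A sketch of the expectimax recursion, again on a hypothetical state API; `successors_with_probs()` is an assumed helper that yields (probability, successor) pairs for chance nodes:
```python=
def expectimax_value(state):
    """MAX nodes take the max; chance nodes take a probability-weighted average."""
    if state.is_terminal():
        return state.utility()
    if state.next_agent() == "MAX":
        return max(expectimax_value(s) for s in state.successors())
    # Chance node: expected utility; probabilities may be uniform or from a model
    return sum(p * expectimax_value(s) for p, s in state.successors_with_probs())
```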
# Lecture 8: 9/24/19
## Markov Decision Processes
* Use when world is stochastic or you don't know the whole world in your model
* Ex. Noisy movement in a grid world
* Agent receives rewards at each time step
* Small living reward at each step (can be negative)
* Big rewards at the end (good or bad)
* Goal: maximize sum of rewards
* Expectimax is a naive tool to solve
* Given the present state, future and past are independent
* Action outcomes depend on only current state
## MDP Formulation
* A set of states $s \in S$
* A set of actions $a \in A$
* A transition function $T(s,a,s')$
* Probability that $a$ from $s$ leads to $s'$
* $P(s'|s, a)$
* Aka Model/dynamics
* Reward function $R(s, a, s')$
* Sometimes just $R(s)$ or $R(s')$
* Start state
* Terminal state
* Policy = choice of action for each state
* Utility = sum of (discounted) rewards
## Policies
* In deterministic single agent search, we have an optimal plan
* For MDPs we want an optimal policy $\pi^*: S \rightarrow A$
* Policy $\pi$ gives action for each state
* Optimal policy maximizes expected utility
* Explicit policy = reflex agent
* Too small a living cost in the reward function can cause the agent to get stuck (no urgency to reach the goal)
* Too large a cost can cause the agent to "end its suffering" at the nearest exit, even a bad one
## Discounting
* Maximize sum of rewards but prefer rewards earlier
* Solution: Exponential decay of value later on
* Discount factor $\gamma$
* Think of it as a $1-\gamma$ chance of the process ending at each step
* Helps algorithms converge
## Infinite Utilities
* Finite horizon (similar to depth-limited search)
* Terminates after fixed T steps (life)
* Gives nonstationary policies ($\pi$ depends on time left)
* Discounting factor $0 < \gamma < 1$
* Converging geometric series
* $\gamma = 0$ only cares about current state reward (not practical)
* $\gamma > 1$ makes things in the future worth more
* Smaller $\gamma$ means smaller "horizon"/focus
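To see why $0 < \gamma < 1$ keeps utilities finite, suppose every reward is bounded by some $R_{max}$ (a symbol introduced here only for illustration); then the discounted return is a convergent geometric series:

$$\sum\limits_{t=0}^{\infty} \gamma^t R(s_t, a_t, s_{t+1}) \leq \sum\limits_{t=0}^{\infty} \gamma^t R_{max} = \frac{R_{max}}{1-\gamma}$$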
# Lecture 9: 9/26/19
## Bellman Equation
* Q-values: there is a value $Q^*(s,a)$ for every possible action from a state
* Recursive definition of value: $V^*(s) = \max\limits_a Q^*(s,a)$
* Best possible reward of all possible actions to take
* $Q^*(s,a) = \sum\limits_{s'} T(s,a,s')[R(s,a,s')+ \gamma V^*(s')]$
* Best possible reward given an action at current state
* $V^*(s) = \max\limits_a \sum\limits_{s'} T(s,a,s')[R(s,a,s')+ \gamma V^*(s')]$
## Time-Limited Values
* Define $V_k(s)$ to be optimal value of s if game ends in k more time steps
* Equivalent to depth-k expectimax
## Value Iteration
* $V_0(s) = 0$ b/c no time steps left so no reward
* $V_{k+1}(s) = \max\limits_a \sum\limits_{s'} T(s,a,s')[R(s,a,s')+ \gamma V_k(s')]$
* Repeat until convergence
* Problems:
* Complexity of iterations: $O(S^2A)$
* Max at each state rarely changes
* Policy often converges before values do
### Convergence
* If tree has maximum depth M, $V_M$ holds actual untruncated values
* If $\gamma < 1$, the values converge as k increases
### Computing Actions from Values
* Do a mini-expectimax of 1 step
* $\pi^*(s) = \text{argmax}_a \sum\limits_{s'} T(s,a,s')[R(s,a,s')+ \gamma V_k(s')]$
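A sketch of value iteration plus policy extraction, assuming a hypothetical MDP object with `states`, `actions(s)`, and `transitions(s, a)` yielding $(s', T(s,a,s'), R(s,a,s'))$ triples:
```python=
def value_iteration(mdp, gamma, iterations=1000):
    """Compute V_k by repeatedly applying the Bellman update to every state."""
    V = {s: 0.0 for s in mdp.states}                   # V_0(s) = 0
    for _ in range(iterations):
        V = {s: max((sum(prob * (reward + gamma * V[s2])
                         for s2, prob, reward in mdp.transitions(s, a))
                     for a in mdp.actions(s)), default=0.0)
             for s in mdp.states}
    return V

def extract_policy(mdp, V, gamma):
    """One-step lookahead (mini-expectimax) to turn values into a policy."""
    return {s: max(mdp.actions(s),
                   key=lambda a: sum(prob * (reward + gamma * V[s2])
                                     for s2, prob, reward in mdp.transitions(s, a)))
            for s in mdp.states if mdp.actions(s)}
```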
## Policy Iteration
* 1) Policy evaluation - calculate utilities for a fixed policy until convergence
* 2) Policy improvement - update policy using one-step lookahead with the resulting converged (but not optimal) utilities as future values
* Repeat until policy converges
* If we fixed a policy $\pi(s)$, only 1 action per state
* Compute utility of state s under a fixed (usually not optimal) policy
* Define utility of state s under fixed policy $\pi$ as:
* $V^\pi(s) =$ expected total discounted rewards starting in s, following $\pi$
* Recursive relation doesn't need to use max since we use $\pi(s)$ instead of optimal action
* Evaluation: Find values with policy evaluation and iterate until values converge
* Improvement: Get a better policy using policy extraction
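A sketch of policy iteration, reusing `extract_policy` from the value-iteration sketch above and the same hypothetical MDP interface:
```python=
def policy_evaluation(mdp, policy, gamma, sweeps=100):
    """Evaluate a fixed policy: Bellman updates with pi(s) instead of a max over actions."""
    V = {s: 0.0 for s in mdp.states}
    for _ in range(sweeps):
        V = {s: (sum(prob * (reward + gamma * V[s2])
                     for s2, prob, reward in mdp.transitions(s, policy[s]))
                 if s in policy else 0.0)
             for s in mdp.states}
    return V

def policy_iteration(mdp, gamma):
    """Alternate evaluation and one-step-lookahead improvement until the policy is stable."""
    policy = {s: mdp.actions(s)[0] for s in mdp.states if mdp.actions(s)}
    while True:
        V = policy_evaluation(mdp, policy, gamma)      # step 1: evaluate the current policy
        new_policy = extract_policy(mdp, V, gamma)     # step 2: policy improvement
        if new_policy == policy:
            return policy, V                           # converged: improvement changed nothing
        policy = new_policy
```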
## Comparison
* Both value and policy iteration compute same thing
* Value iteration:
* Every iteration updates both values and (implicitly) the policy
* We don't track the policy, but taking the max over actions implicitly recomputes it
* Policy Iteration
* Several passes that update utilities with fixed policy
* After policy is evaluated, a new policy is chosen
* New policy improves
# Lecture 10: 10/1/19
## Reinforcement Learning
* Missing information about MDP so you can't make an optimal policy
* Exploitation: Use what you know eventually
* Regret: Even if you learn intelligently, you make mistakes
* Sampling: Have to try things repeatedly b/c of chance
* Difficulty: Learning is much harder than solving a known MDP
* Exploration: you have to try unknown actions to get information
* Looking for a policy $\pi(s)$
* We don't know T or R
* Don't know which states are good or what actions do
* Can no longer solve everything offline; we must actually try actions since we don't know T or R
* Not a POMDP since the state is fully observable
## Passive Reinforcement Learning
* Model-Based Idea
* Idea: We eventually learn the right model
* Learn an approximate model based on experiences
* Solve for values as if learned model was correct
* 1) Learn empirical MDP model
* Count outcomes $s'$ for each $(s, a)$ and normalize to estimate $T(s,a,s')$
* Discover rewards from actions
* 2) Solve the learned MDP (Ex. via value iteration)
## Model-Free
* Idea: Samples appear with the right frequencies
* Simplified task: policy evaluation
* Input a fixed policy $\pi(s)$, don't know T or R
* Goal: Learn state values
* Learner is "along for the ride"
* No choice about what actions to take
* Execute policy and learn from the experience
### Direct Evaluation
* Compute values for each state under $\pi$
* Average together observed sample values
* Instead of trying to learn a transition function, just find values for states
* Pros: Don't need to understand T or R, eventually computes correct average values
* Cons: Wastes info about state connectivity, takes a while to learn each state
## Temporal Difference Learning
* Update $V(s)$ each time we experience a transition
* Likely outcomes $s'$ contribute updates more often
* Sample: $R(s, \pi(s), s') + \gamma V^\pi (s')$
* Update: $V^\pi(s) \leftarrow (1- \alpha)V^\pi(s) + \alpha \cdot \text{sample}$
* Aka $V^\pi(s) \leftarrow V^\pi(s) + \alpha(\text{sample} - V^\pi(s))$
* Exponential moving average
* We forget about the past since more recent is more representative
* Problem: We can't turn values into a new policy
* Solution: Learn Q-values, not values
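A sketch of a single temporal-difference update; `V` is a plain dict of running estimates and `alpha` is the learning rate (names are illustrative):
```python=
def td_update(V, s, reward, s_next, gamma, alpha):
    """Update V(s) toward the one-sample estimate observed for this transition."""
    sample = reward + gamma * V.get(s_next, 0.0)       # R(s, pi(s), s') + gamma * V(s')
    V[s] = (1 - alpha) * V.get(s, 0.0) + alpha * sample
    # equivalently: V[s] += alpha * (sample - V[s])
    return V
```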
## Q-Value Iteration
* Value Iteration (from before): Find successive (depth limited) values
* Start with $V_0(s) = 0$
* Given $V_k$, calculate depth $k+1$ values for all states
* Q-Value Iteration:
* Start with $Q(s,a) = 0$
* $Q_{k+1}(s,a) = \sum\limits_{s'}T(s,a,s')[R(s,a,s') + \gamma \max\limits_{a'}Q_k(s',a')]$
## Q-Learning
* Sample based Q-value iteration
* Learn $Q(s,a)$ values as you go
* Receive a sample $(s,a,s',r)$
* Incorporate new estimate into a running average
* Converges to optimal policy even if you act suboptimally
* Have to explore enough
* Have to eventually make the learning rate small enough, but not decrease it too quickly
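A sketch of the tabular Q-learning update from one $(s,a,s',r)$ sample; `Q` is a dict keyed by (state, action) and `actions_next` lists the legal actions in $s'$ (hypothetical names):
```python=
def q_update(Q, s, a, reward, s_next, actions_next, gamma, alpha):
    """Fold one sample into the running average for Q(s, a)."""
    best_next = max((Q.get((s_next, a2), 0.0) for a2 in actions_next), default=0.0)
    sample = reward + gamma * best_next                # r + gamma * max_a' Q(s', a')
    Q[(s, a)] = (1 - alpha) * Q.get((s, a), 0.0) + alpha * sample
    return Q
```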
# Lecture 11: 10/3/19
## Summary
* Known MDP - Offline solution
* Compute $V^*, Q^*, \pi^*$ - Value/policy iteration
* Evaluate fixed policy $\pi$ - Policy evaluation
* Unknown MDP - Model-based
* Compute $V^*, Q^*, \pi^*$ - Value/policy iteration on approx MDP
* Evaluate fixed policy $\pi$ - Policy evaluation on approx MDP
* Unknown MDP - Model-free
* Compute $V^*, Q^*, \pi^*$ - Q-Learning
* Evaluate fixed policy $\pi$ - Value Learning
* Active Reinforcement Learning
* Act according to current optimal (based on Q-Values) but also explore
* Use the feedback it receives to iteratively update its policy while learning until eventually determining the optimal policy after sufficient exploration
## Exploration vs Exploitation
* $\epsilon$-greedy - With a small probability $\epsilon$ act randomly, with large probability $1-\epsilon$ act on the current (greedy) policy
* Problem: Keeps thrashing around randomly even once learning is done
* Solution: Lower $\epsilon$ over time or have exploration functions
* Exploration functions
* Explore areas that we aren't that certain are bad, but eventually stop
* Take a value estimate $u$ and visit count $n$ and return an optimistic utility like $f(u,n) = u + k/n$
* Modified Q updates propagate "bonus" back to states that lead to unknown states
## Regret
* Measure of total mistake cost: difference between your rewards (including old suboptimality) and optimal rewards
* Minimizing regret requires optimally learning to be optimal
* Random exploration tends to have high regret
* Can never have 0 regret unless given a known optimal policy
* Or could define regret relative to the best of a class of learning algorithms
# Lecture 12: 10/8/19
## Approximate Q-Learning
* Basic Q-Learning keeps a table of q-values
* Too many states to visit them all in training and hold in memory
* Want to generalize learning from a small number of training states to new, similar states
* Fundamental idea in ML
* Solution: Describe a state using a vector of properties
* Features: functions from states to real numbers (often 1 or 0) to capture important properties of states
* Ex. Dist to closest ghost, food, num ghosts, 1/dist^2
* Linear Value functions
* $V(s) = w_1f_1(s) + w_2f_2(s) + \ldots w_nf_n(s)$
* Advantage: experience is summed up in a few powerful nums
* Disadvantage: States can share features but actually be very diff in value
* Approximate Q-Learning
* $Q(s, a) = w_1f_1(s, a) + w_2f_2(s, a) + \ldots w_nf_n(s, a)$
* difference = $[r + \gamma \max\limits_{a'} Q(s', a')] - Q(s,a)$
* $Q(s,a) \leftarrow Q(s,a) + \alpha[\text{difference}]$
* $w_i \leftarrow w_i + \alpha[\text{difference}]f_i(s,a)$
* Adjust weights of active features.
* If something bad happens, blame the features that were present
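A sketch of the approximate Q-learning weight update; `features(s, a)` is an assumed function returning a dict of feature values, and `weights` is a dict over the same keys:
```python=
def approx_q_update(weights, features, s, a, reward, s_next, actions_next, gamma, alpha):
    """Q(s, a) = w . f(s, a); after each sample, nudge the weights of active features."""
    def q(state, action):
        return sum(weights.get(f, 0.0) * v for f, v in features(state, action).items())

    best_next = max((q(s_next, a2) for a2 in actions_next), default=0.0)
    difference = (reward + gamma * best_next) - q(s, a)
    for f, v in features(s, a).items():                # adjust weights of active features
        weights[f] = weights.get(f, 0.0) + alpha * difference * v
    return weights
```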
## Least Squares
* Total error $= \sum\limits_i (y_i - \hat{y_i})^2 = \sum\limits_i \left(y_i - \sum\limits_k w_kf_k(x_i)\right)^2$
* $error(w) = \frac{1}{2}\left(y - \sum\limits_k w_kf_k(x)\right)^2$
* $\frac{\partial{error(w)}}{\partial{w_m}} = -\left(y - \sum\limits_k w_kf_k(x)\right)f_m(x)$
* By chain rule
* $w_m \leftarrow w_m + \alpha \left(y - \sum\limits_k w_kf_k(x)\right)f_m(x)$
* $w_m \leftarrow w_m + \alpha[r + \gamma \max\limits_{a'} Q(s', a') - Q(s,a)]f_m(s,a)$
* target - prediction
## Policy Search
* Problem: Feature-based policies that work well aren't necessarily the ones that approximate V/Q best
* Q learning priority: get Q-values close (modeling)
* Action selection priority: Get ordering of Q-values right (prediction)
* Solution: learn policies that maximize rewards, not the values that predict them
* Policy search: Start with an ok solution (e.g. Q-learning) then fine-tune by hill climbing on feature weights
## Model-based vs Model-free
* Model-based
* The model (dynamics and rewards) is relatively easy to learn from experience
* Learning values is more complicated than learning dynamics
* Model-free
* Good if dynamics are complicated