# CS188 Notes Part 1
Final notes can be found [here](https://hackmd.io/@matthewtang/188-final)
# Lecture 1: 8/29/19
## Designing Rational Agents
* Agent - entity that perceives and acts
* Rational agent- select actions to max utility
* Percepts, environment, action space dictate techniques for rational actions
# Lecture 2: 9/3/19
* Reflex agents - choose action based on current precept (maybe memory)
* Do not consider future consequences of actions
* Consider how the world **is**
* Planning agents - make decisions based on hypothesized consequences of actions
* Needs model of how world evolves in response to actions
* Formulate a goal
* Consider how world **would be**
* Optimal vs complete planing
* Completeness - tell us if no solution exists
* Planning vs replanning
* Tradeoff of time for more complicated planning agents
## Search problems
* State space
* Successor function (with actions, costs)
* A start state and a goal test
* A solution is a sequence of actions (a plan) which transforms the start state to a goal state
### Example Traveling in Romania
* State space: cities
* Don't want too many states
* Successor function: raods - go to adjacent city. cost = distance
* Start state: Arad
* Goal test: Is state == Bucharest?
* Solution: TBD
### Example: State Space Sizes
* World state:
* Agent positions: 120
* Food count: 30
* Ghost positions: 12
* Agent facing: NSEW
* Total world states: $120(2^{30})(12^2)4$
* Total states for pathing: $120$
* Only need to keep track of agent positions
* States for eat all dots: $120(2^{30})$
* Depending on the problem, your state space will be a subset of world state
### Example: Safe Passage
* Boolean consumption for big and small dots
* Remaining time for scared ghosts
* Goal test must work on state
## State Space Graphs
* Mathematical representation of a search problem
* Nodes are world configurations
* Arrows are successors (action results)
* Goal test is a set of goal nodes
* In a search graph, each state only occurs once
* Takes too much memory to explicitly construct this
## Search Trees
* A "what if" tree and their outcomes
* Root node: start state
* Children: successors
* Nodes corresponds to plans that achieve those states
* Can't build the whole tree for problems
## General Tree Search
* Fringe, expansion, exploration strategy
* Main idea: which fringe nodes to explore?
# Lecture 3: 9/5/19
## Search Algorithm Properties
* Complete: Guaranteed to find a solution if one exists?
* Return in finite time if not?
* Optimal: Guaranteed to find least cost path?
### Depth First Search
* Stack (LIFO)
* Time complexity
* Total number of nodes: $O(b^m)$ nodes where b is branching factor, m is the depth
* Space complexity
* $O(bm)$
* Complete if no cycles
* Not optimal
### Breadth First Search
* Fringe/queue (FIFO)
* Processes all nodes above shallowest solution
* Time: $O(b^s)$ where b is branching factor, s is depth of shallowest solution
* Space: $O(b^s)$
* Complete: Yes, guaranteed to return a solution if exists
* Optimal: No
### Iterative Deepening
* Combine BFS and DFS.
* Get DFS space advantage with BFS time
* Run DFS with depth limit 1. If no solution, increase limit by 1.
* Early depths have redundant searches but it's not too bad since the last levels take the most time (exponential)
### Uniform Cost Search
* Fringe is a priority queue (priority is cumulative cost)
* If solution costs $C^*$ and arcs cost at least $\epsilon$, effective depth is $\frac{C^*}{\epsilon}$
* Time: $O(B^{C^* / \epsilon})$
* Space: $O(B^{C^* / \epsilon})$
* Complete: If solution has a finite cost and minimum cost arc is positive, yes
* Optimal: yes
* Downsides:
* Explores options in every "direction"
* No information about end goal
## Informed Search
* Heuristics
* Estimate how close a state is to a goal
* Ex. Manhattan distance (ignore walls), euclidean distance
* Greedy Search
* Expand the node that seems closest
* Commonly takes you in a suboptimal route
* $A^*$
* Combining UCS and Greedy
* UCS orders by path cost, backwards cost - $g(n)$
* Greedy orders by goal proximity, forwards cost - $h(n)$
* Add $g(n) + h(n)$
* Stop when goal is dequeued
* Heuristic estimates need to be less than the actual costs
* Inadmissible(pessimistic) heuristics overestimate
* Admissible(optimistic) if $o \leq h(n) \leq h^*(n)$ where $h^*(n)$ is the true cost to ta nearest goal
* Often admissible heuristics are solutions to relaxed problems where new actions are available
* Inadmissible heuristics are useful too to find a solution faster
* Ex. For 8 puzzle
* Heuristic: number of misplaced tiles (terrible)
* Better heuristic: Manhattan distance of each tile
* Graph Search
### Graph Search
* Don't expand a state twice
* Tree search + a set of expanded states
* Consistency of heuristic - heuristic "arc" cost $\leq$ actual cost for each arc
* With admissible heuristic, Tree A* is optimal
* With consistent heuristic, Graph A* is optimal
# Lecture 4: 9/10/19
## Constraint satisfaction problems (CSP)
* $N$ variables, domain $D$ of all colors
* Constraints - restrictions on values of variables in regard to other variables
* Planning: for planning a sequence of actions (path) to a goal
* Paths have various costs, depths
* Difficult to plan low cost path, goal is easy to determine
* Identification: Assignments to variables
* Goal itself is important, not the path
* All paths at the same depth (for some formulations)
* CSPs are specialized for identification problems
* State is a "black box"/ arbitrary data structure
* Goal test can be a function over states
* Successor can be anything
* CSP
* A special subset of search problems
* State is defined by variables $X_i$ with values from a domain D (sometimes D depends on i)
* Goal test is a set of constraints
### Example: Map Coloring
* Variables: WA, NT, Q, NSW, V, SA, T
* Domains: $D = \{red, green, blue\}$
* Constraints: Adjacent regions must have diff colors
* Implicit: $WA \neq NT$
* Explicit: $(WA, NT) \in \{ (red, green), (red, blue), \ldots \}$
## Backtracking Search
* BFS is no good, DFS is better but still not good
* Idea: Incremental goal tests - check constraints along the way
* Idea: Fix one variable and then do dfs
* Backtracking = DFS + variable-ordering + fail-on-violation
* Filtering - Keep track of domains for unassigned variables, cross off bad options
* Forward checking: Cross off values that violate a constraing when adding new assignment
* Constraint Propagation: Propagates info from assigned to unassigned variables.
* Thinks about effects at deeper levels
* Arc consistency: Property such that every x in the tail has a y that could be assigned without violating a constraint
* Delete from the tail to make consistent
# Lecture 5: 9/12/19
## Filtering
* Forward checking
* Cross off values that violate a constraint when added to the existing assignment
* K consistency - for every k nodes, there exists a consistent assignment
* 1-Consistency: Check each single node
* 2-Consistency: Arc consistency
* Strong K-consistency: Also means $k-1, \ldots, 2, 1$ consistent
* Strong N-consistent means no need for backtracking
## Ordering
* Variable Ordering:
* Minimum Remaining Value (MRV) - Choose the variable with the fewest legal left values in its domain
* Aka "most constrained"
* Value Ordering:
* Least constrained value - Given a chosen variable, choose an assignment that is least constraining
## Iterative Improvement
* While not solved:
* Variable selection: randomly select any conflicted variable
* Value selection: min-conflicts heuristic
* Choose a value that violates the fewest constraints
* Very good performance except for a golden ratio R
## Structure
* Independent subproblems are identifiable as connected components of constraint graph
* If the constraint graph is a tree (no loops), can be solved in $O(nd^2)$ as opposed to $O(d^n)$
* $d^2$ comes from checking everything at head and tail
### Tree CSPs
* Choose a root variable, order variables so that parents precede children
* 1. Do a single pass backwards from n to the 2nd node to ensure the arcs pointing to those nodes are consistent
* All root to leaf arcs are consistent
* Each X-> Y was consistent at one point and Y's children were always processed first so Y's domain cannot be reduced later
* 2. Assign forward: for 1 to n, assign consistently
* No need to backtrack
### Nearly Trees
* Conditioning - Instantiate a variable, prune neighbors domains
* Cutset conditioning - Pick a node, assign it all possible ways, solve the remaining tree
* Cutset size c, total time $O(d^c (n-c)d^2)$
* Very fast for small c
* Tree Decomposition
* Idea: create a tree-structured graph of mega-variables
* Each mega-variable encodes part of the original CSP
* Subproblems overlap to ensure consistent solutions
## CSP Summary
* Basic solution: Backtracking search
* Improvements:
* Ordering
* Filtering
* Structure
* Iterative min-conflicts is often most practical
## Local Search
* Tree search keeps unexplored alternatives on the fringe (ensures completeness)
* Local search: improve a single option until you canâ€™t make it better (no fringe!)
* Successor function is local changes now
* Better time and memory, but incomplete and suboptimal
* Idea: Start wherever, move to best neighboring state
* If no better neighbors, quit
### Simulated Annealing
* Escape local maxima by allowing downhill movement
* But more rare as time goes on
* $T$, a "temperature" controls the probability of downward steps
* If $T$ is decreased slow enough, will converge on optimal solution
### Genetic Algorithms
* Natural selection strategy
* Keep best N hypotheses at each step (selection) based on a fitness function
* Pairwise crossover operators, with optional mutation to give variety
# Lecture 6: 9/17/19
## Search with other agents
* Types of games:
* Deterministic or stochastic?
* 1, 2, or more players?
* 0 sum?
* Perfect information (can you see the state?)
* Sometimes have to estimate the state from sensor input
* Do we know the objectives/utilities of the other agents?
* For now, assume that we know their objectives exactly.
## Types of games
* General games
* Agents have independent utilities
* Cooperation, indifference, competition are possible
* Zero-sum games
* Agents have opposite utilities
* One maximizes, other minimizes
* Adversarial, pure competition
## Deterministic Games with Terminal Utilities
* States: $S$
* Players: $P = \{1 \ldots N\}$ (usually take turns)
* Actions: $A$ (can depend on player/state)
* Transition function: $SxA \implies S$
* Terminal Test: $S \implies \{t,f\}$
* Terminal Utilities: $SxP \implies R$
* Solution for a player is a policy $S \implies A$
* No longer minimizing cost, agent tries to maximize score/utility
* Value of a state: best achievable outcome/utility from that state
* Non-terminal states: $V(x) = \max(V(x'))$
## Minimax Adversarial Search
* Deterministic, zero-sum games
* One player maximizes result, other minimizes result
* Minimax search
* State-space search tree
* Players alternate turns
* Compute each node's minimax value- best achievable utility against a rational, optimal adversary
```python=
def value(state):
if state is terminal: return state utility
if next agent is MAX: return max-value(state)
if next agent is MIN: return min-value(state)
```
* Time: $O(b^m)$, Space: $O(bm)$
* But we don't need to explore the whole tree
## Alpha-Beta Pruning
* Node $n$ has children
* Stop considering options better/worse than one already found (depending on who is pruning)
* Good child ordering improves pruning effectiveness
* Perfect ordering - Time: $O(b^{m/2})$
* Doubles solvable depth
* Metareasoning - computing what to compute
# Lecture 7: 9/19/19
## Resource Limits
* Cannot realistically search to leaves
* Solution: Depth limited search
* Evaluation function - estimate/score non-terminals in depth limited search
* Ex. Weighted linear sum of features
* Depth matters:
* The deeper in the tree the eval function is, quality matters less
* Tradeoff between complexity of features and computation
## General Sum Games
* "Maximax" games where terminals have utility tuples
* Each player maximizes its own component
* Sometimes you get competition, sometimes collaboration
## Expectimax Search - Uncertain Outcomes
* Explicit randomness, unpredictable opponents/humans
* Reflect average-case instead of worst case
* Compute average score under optimal play
* Calculate expected utilities (weighted average of children)
* Aka Markov Decision Process (MDP)
* Cannot prune
* Probabilities can be uniform distribution or sophisticated
# Lecture 8: 9/24/19
## Markov Decision Processes
* Use when world is stochastic or you don't know the whole world in your model
* Ex. Noisy movement in a grid world
* Agent receives rewards at each time step
* Small living reward at each step (can be negative)
* Big rewards at the end (good or bad)
* Goal: maximize sum of rewards
* Expectimax is a naive tool to solve
* Given the present state, future and past are independent
* Action outcomes depend on only current state
## MDP Formulation
* A set of states $s \in S$
* A set of actions $a \in A$
* A transition function $T(s,a,s')$
* Probability that $a$ from $s$ leads to $s'$
* $P(s'|s, a)$
* Aka Model/dynamics
* Reward function $R(s, a, s')$
* Sometimes just $R(s)$ or $R(s')$
* Start state
* Terminal state
* Policy = choice of action for each state
* Utility = sum of (discounted) rewards
## Policies
* In deterministic single agent search, we have an optimal plan
* For MDPs we want an optimal policy $\pi^*: S \rightarrow A$
* Policy $\pi$ gives action for each state
* Optimal policy maximizes expected utility
* Explicit policy = reflex agent
* Too small of a living cost in reward function can cause agent to get stuck in local minima
* Too large can cause agent to "end suffering"
## Discounting
* Maximize sum of rewards but prefer rewards earlier
* Solution: Exponential decay of value later on
* Discount factor $\gamma$
* Think of it as a gamma chance of ending the process at each step
* Helps algorithms converge
## Infinite Utilities
* Finite horizon (similar to depth-limited search)
* Terminates after fixed T steps (life)
* Give nonstationary policies ($\pi$ depends on time left)
* Discounting factor $0 < \gamma < 1$
* Converging geometric series
* $\gamma = 0$ only cares about current state reward (not practical)
* $\gamma > 1$ makes things in the future worth more
* Smaller $\gamma$ means smaller "horizon"/focus
# Lecture 9: 9/26/19
## Bellman Equation
* Q values have a value for every possible action
* Recursive definition of value: $V^* = \max\limits_a Q^*(s,a)$
* Best possible reward of all possible actions to take
* $Q^*(s,a) = \sum\limits_{s'} T(s,a,s')[R(s,a,s')+ \gamma V^*(s')]$
* Best possible reward given an action at current state
* $V^*(s) = \max\limits_a \sum\limits_{s'} T(s,a,s')[R(s,a,s')+ \gamma V^*(s')]$
## Time-Limited Values
* Define $V_k(s)$ to be optimal value of s if game ends in k more time steps
* Equivalent to depth-k expectimax
## Value Iteration
* $V_0(s) = 0$ b/c no time steps left so no reward
* $V_{k+1}(s) = \max\limits_a \sum\limits_{s'} T(s,a,s')[R(s,a,s')+ \gamma V_k(s')]$
* Repeat until convergence
* Problems:
* Complexity of iterations: $O(S^2A)$
* Max at each state rarely changes
* Policy often converges before values do
### Convergence
* If tree has maximum depth M, $V_M$ holds actual untruncated values
* If $\gamma < 1$, the values converge as k increases
### Computing Actions from Values
* Do a mini-expectimax of 1 step
* $\pi^*(s) = \text{argmax}_a \sum\limits_{s'} T(s,a,s')[R(s,a,s')+ \gamma V_k(s')]$
## Policy Iteration
* 1) Policy evaluation - calculate utilities for a fixed policy until convergence
* 2) Policy improvement - update policy using 1 step look ahead with resulting converged (but not optimal) utilitieis as future values
* Repeat until policy converges
* If we fixed a policy $\pi(s)$, only 1 action per state
* Compute utility of state s under a fixed (usually not optimal) policy
* Define utility of state s under fixed policy $\pi$ as:
* $V^\pi(s) =$ expected total discounted rewards starting in s, following $\pi$
* Recursive relation doesn't need to use max since we use $\pi(s)$ instead of optimal action
* Evaluation: Find values with policy evaluation and iterate until values converge
* Improvement: Get a better policy using policy extraction
## Comparison
* Both value and policy iteration compute same thing
* Value iteration:
* Every iteration updates both values and implicity policy
* Don't track plicy but taking max over actions implicitly recomputes it
* Policy Iteration
* Several passes that update utilities with fixed policy
* After policy is evaluated, a new policy is chosen
* New policy improves
# Lecture 10: 10/1/19
## Reinforcement Learning
* Missing information about MDP so you can't make an optimal policy
* Exploitation: Use what you know eventually
* Regret: Even if you learn intelligently, you make mistakes
* Sampling: Have to try things repeatedly b/c of chance
* Difficulty: Learning is much harder than solving a known MDP
* Exploration: you have to try unknown actions to get information
* Looking for a policy $\pi(s)$
* We don't know T or R
* Don't know which states are good or what actions do
* Can no longer take actions offline since we don't know T or R
* Not a POMDP since you know the output (is observable)
## Passive Reinforcement Learning
* Model-Based Idea
* Idea: We eventually learn the right model
* Learn an approximate model based on experiences
* Solve for values as if learned model was correct
* 1) Learn empirical MDP model
* Count outcomes $s'$ for each action, normalize
* Discover rewards from actions
* 2) Solve the learned MDP (Ex. via value iteration)
## Model-Free
* Idea: Samples appear with the right frequencies
* Simplified task: policy evaluation
* Input a fixed policy $\pi(s)$, don't know T or R
* Goal: Learn state values
* Learner is "along for the ride"
* No choice about what actions to take
* Execute policy and learn from the experience
### Direct Evaluation
* Compute values for each state under $\pi$
* Average together observed sample values
* Instead of trying to learn a transition function, just find values for states
* Pros: Don't need to understand T or R, eventually computes correct average values
* Cons: Wastes info about state connectivity, takes a while to learn each state
## Temporal Difference Learning
* Update $V(s)$ each time we experience a transition
* Likely outcomes $s'$ contribute updates more often
* Sample: $R(s, \pi(s), s') + \gamma V^\pi (s')$
* Update: $V^\pi(s) \leftarrow (1- \alpha)V^\pi(s) + (\alpha)sample$
* Aka $V^\pi(s) \leftarrow V^\pi(s) + \alpha(sample - V^\pi(s))$
* Exponential moving average
* We forget about the past since more recent is more representative
* Problem: We can't turn values into a new policy
* Solution: Learn Q-values, not values
## Q-Value Iteration
* Value Iteration (from before): Find successive (depth limited) values
* Start with $V_0(s) = 0$
* Given $V_k$, calculate depth $k+1$ values for all states
* Q-Value Iteration:
* Start with $Q(s,a) = 0$
* $Q_{k+1}(s,a) = \sum\limits_{s'}T(s,a,s')[R(s,a,s') + \gamma \max\limits_{a'}Q_k(s',a')]$
## Q-Learning
* Sample based Q-value iteration
* Learn $Q(s,a)$ values as you go
* Receive a sample $(s,a,s',r)$
* Incorporate new estimate into a running average
* Converges to optimal policy even if you act suboptimally
* Have to explore enough
* Make learning rate small but not decrease too quickly
# Lecture 11: 10/3/19
## Summary
* Known MDP - Offline solution
* Compute $V^*, Q^*, \pi^*$ - Value/policy iteration
* Evaluate fixed policy $\pi$ - Policy iteration
* Unknown MDP - Model-based
* Compute $V^*, Q^*, \pi^*$ - Value/policy iteration on approx MDP
* Evaluate fixed policy $\pi$ - Policy iteration on approx MDP
* Unknown MDP - Model-free
* Compute $V^*, Q^*, \pi^*$ - Q-Learning
* Evaluate fixed policy $\pi$ - Value Learning
* Active Reinforcement Learning
* Act according to current optimal (based on Q-Values) but also explore
* Use the feedback it receives to iteratively update its policy while learning until eventually determining the optimal policy after sufficient exploration
## Exploration vs Exploitation
* $\epsilon$ greedy - With a small probability $\epsilon$ act randomly, with large probability $1-\epsilon$
* Problems: Keep thrashing once learning is done
* Solution: Lower $\epsilon$ over time or have exploration functions
* Exploration functions
* Explore areas that we aren't that certain are bad, but eventually stop
* Take a value estimate $u$ and visit count $n$ and return an optimistic utility like $f(u,n) = u + k/n$
* Modified Q updates propagate "bonus" back to states that lead to unknown states
## Regret
* Measure of total mistake cost: difference between your rewards (including old suboptimality) and optimal rewards
* Minimizing regret requires optimally learning to be optimal
* Random exploration tends to have high regret
* Can never have 0 regret unless given a known optimal policy
* Or could define regret relative to the best of a class of learning algorithms
# Lecture 12: 10/8/19
## Approximate Q-Learning
* Basic Q-Learning keeps a table of q-values
* Too many states to visit them all in training and hold in memory
* Want to generalize learning from small number of training states to new similar states
* Fundamental idea in ML
* Solution: Describe a state using a vector of properties
* Features: functions from states to real numbers (often 1 or 0) to capture important properties of states
* Ex. Dist to closest ghost, food, num ghosts, 1/dist^2
* Linear Value functions
* $V(s) = w_1f_1(s) + w_2f_2(s) + \ldots w_nf_n(s)$
* Advantage: experience is summed up in a few powerful nums
* Disadvantage: States can share features but actually be very diff in value
* Approximate Q-Learning
* $Q(s, a) = w_1f_1(s, a) + w_2f_2(s, a) + \ldots w_nf_n(s, a)$
* difference = $[r + \gamma \max\limits_{a'} Q(s', a')] - Q(s,a)$
* $Q(s,a) \leftarrow Q(s,a) + \alpha[\text{difference}]$
* $w_i \leftarrow w_i + \alpha[\text{difference}]f_i(s,a)$
* Adjust weights of active features.
* If something bad happens, blame the features that were present
## Least Squares
* Total error $= \sum\limits_i (y_i - \hat{y_i})^2 = \sum\limits_i \left(y_i - \sum\limits_k w_kf_k(x_i)\right)^2$
* $error(w) = \frac{1}{2}\left(y - \sum\limits_k w_kf_k(x_i)\right)^2$
* $\frac{\partial{error(w)}}{\partial{w_m}} = -\left(y - \sum\limits_k w_kf_k(x_i)\right)f_m(x)$
* By chain rule
* $w_m \leftarrow w_m + \alpha \left(y - \sum\limits_k w_kf_k(x_i)\right)f_m(x)$
* $w_m \leftarrow w_m + \alpha[r + \gamma \max\limits_{a'} Q(s', a') - Q(s,a)]f_m(s,a)$
* target - prediction
## Policy Search
* Problem: Feature-based policies that work well don't approximate V/Q best
* Q learning priority: get Q-values close (modeling)
* Action selection priority: Get ordering of Q-values right (prediction)
* Solution: learn policies that maximize rewards, not the values that predict them
* Policy search: Start with an ok solution (e.g. Q-learning) then fine-tune by hill climbing on feature weights
## Model-based vs Model-free
* Model-based
* Model is nice to learn thru reward functions
* Learning values is more complicated than learning dynamics
* Model-free
* Good if dynamics are complicated

Last updated 5/7/23 by Emily Hsi, created by Matthew Tang and Christianna Xu Asymptotics Runtime $\Theta$: there exists positive constants k1 and k2 such that $k_1f_1(N)\leq R(N)\leq k_2f_2(N)$ Important sums:Sum of first $n$ natural numbers = $\frac{n(n+1)}{2} \implies \Theta(n^2)$ Sum of first $n$ powers of 2 = $2Q -1 \implies \Theta(n)$ $\Sigma_{i=0}^{N-1} 2^i \implies \Theta(2^n)$ $1 + ... (\frac{n}{2})^2 + n^2 \implies \Theta(n^2)$ $1 + 2 + 4 + 8 + ... N^2 \implies \Theta(n^2)$

5/8/2023#1: Tutorial Part A: Convert 20-1-16-5 using letters to numbers to get TAPE (The title The Amazon Code clues A-Z) Part B: The flavortext insightful and title clues you to think of braille. It spells out EYES Part C: You are given a string eRkeU3_KF1w which is a YouTube URL (As clued by the title). The video is the puzzle recap from 2021, and in the description it says: If you're here from the 2022 puzzle game, the word is "mmbernie" giving MMBERNIE Part D: We are given C2, B3, A1, A2. The letter indicates which part to take the answer from and the number is the index of the answer. This gives us META Answer: META Answer explanation: Part D is a meta puzzle Title explanation: This is a tutorial puzzle

1/31/2022Originally played on 3/19/21 Overall I think the game turned out fine. Time allotments seemed good for the difficulty of the structures Wide variety in final builds BB8: Rubric seemed very fair Starship:

3/20/2021MT1 Notes can be found here MT2 Notes can be found here Final Notes can be found here Lecture 1: 8/28/19 Cycles Two pointers, advance p1 twice, advance p2 once. If they ever end up at the same place, report cycle. If no cycle, slow pointer never catches fast cycle If cycle, pointers will enter cycle at some time and dist will decrease between the pointers

3/20/2021
Published on ** HackMD**

or

By clicking below, you agree to our terms of service.

Sign in via Facebook
Sign in via Twitter
Sign in via GitHub
Sign in via Dropbox
Sign in with Wallet

Wallet
(
)

Connect another wallet
New to HackMD? Sign up