# CS170 Final Notes
1. MT1 Notes can be found [here](https://hackmd.io/@matthewtang/B1js2XPrS)
2. MT2 Notes can be found [here](https://hackmd.io/@matthewtang/B1NCuHWdB)
3. Final Notes can be found [here](https://hackmd.io/@matthewtang/r1PACERcH)
# Lecture 26: 11/4/19
## NP and P Problems
* Classes of computational problems
* P - Class of problems for which you can find a solution in polynomial time
* NP - Class of problems for which you can verify the solution in polynomial time
* Example: 3COL
* Input: Graph $G(V,E)$
* Solution: A coloring $c: V \rightarrow \{R,B,G\}$ such that the endpoints of every edge get different colors
* Verifying a proposed coloring just checks all edges (sketch below)
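* A minimal polynomial-time verifier sketch for 3COL (the names `edges` and `coloring` are illustrative):

```python
# Hypothetical sketch: verify a proposed 3-coloring in O(|E|) time.
def verify_3col(edges, coloring):
    """coloring maps each vertex to one of 'R', 'B', 'G'."""
    return all(coloring[u] != coloring[v] for (u, v) in edges)

# Example: a triangle is 3-colorable.
assert verify_3col([(0, 1), (1, 2), (0, 2)], {0: 'R', 1: 'B', 2: 'G'})
```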
* $P \subseteq NP$
* If we can compute the solution efficiently, we can verify the solution efficiently
* The MST problem can be both solved and verified in polynomial time
* To verify a proposed tree, just run the MST algorithm again and check that the costs match
* Not known to be in NP
* Counting the number of 3-colorings of a graph
* Games: does white/black have a winning strategy?
* In NP
* Hamiltonian cycle:
* Input: Graph $G(V,E)$
* Sol: A cycle containing all vertices exactly once
## TSP Variants
* Min TSP (Optimization) - Not in NP
* Input: Graph $G(V,E)$, weights $\{w_e\}_{e \in E}$
* Sol: A tour of minimum cost (goes through every vertex once)
* Budget TSP (search): In NP
* Input: Graph $G(V,E)$, weights $\{w_e\}_{e \in E}$, budget B
* Sol: A tour with cost $\leq B$
* If Min TSP can be solved efficiently, budget TSP can be solved efficiently
* Decision-Budget TSP (Decision): In NP
* Input: Graph $G(V,E)$, weights $\{w_e\}_{e \in E}$, budget B
* Sol: Yes/No: is there a tour of cost $\leq B$?
## Breaking RSA
* Alice sends public key (PK), Bob encrypts message with PK
* Input: Public key PK, ciphertext Enc(PK,m)
* Sol: Message m
* An NP problem: we can just encrypt the candidate message m and check that it matches the ciphertext
* P = NP??
# Lecture 27: 11/8/19
* Factorization is believed not to be in P
## Reductions
* Problem $A \leq_P B$
* Problem A reduces in polynomial time to problem B if you can use an algorithm for B to solve A
* B is at least as hard a problem as A
* A: Hamiltonian Cycle
* B: Half Cycle: find a cycle that visits half the nodes ($|V|/2$)
* Use algorithm for B as a black box to solve A
* Use a reduction algorithm to convert input of A to input of type B
* Then recover the solution from B to solution of A
* Reductions
* 1) Describe reduction algorithm
* 2) Prove: a solution to B can be converted efficiently into a solution to A
* 3) Prove: no solution to B $\implies$ no solution to A
* Half cycle and Hamiltonian cycle
* Double the number of vertices: add $|V|$ isolated dummy vertices; a cycle can't use an isolated vertex, so a cycle through half of the new graph's $2|V|$ vertices must visit all original vertices (sketch below)
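* A minimal sketch, assuming the Half Cycle instance may contain isolated vertices:

```python
# Sketch: Hamiltonian Cycle <=_P Half Cycle. Pad the graph with |V| isolated
# dummy vertices; since a cycle can never include an isolated vertex, any
# cycle through half of the new graph's 2|V| vertices visits all |V| originals.
def reduce_ham_to_half(n, edges):
    # vertices 0..n-1 are the originals; n..2n-1 are isolated dummies
    return 2 * n, list(edges)
```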
## NP Complete Problems
* A is NP-complete if A is in NP and every problem in NP reduces to A
* All NP complete problems are reducible to one another
* Hardest problems within NP
* A polytime algo for 1 gives polytime algo for all of NP
# Lecture 28: 11/13/19
* NP-hard = at least as hard as every problem in NP
* Ex. Halting problem (unsolvable)
* Can't even verify the solution
* NP-complete is inside NP but harder than everything else in NP
## Circuit SAT
* Input: A boolean circuit C
* Sol: Input assignment to satisfy the circuit
* Circuit SAT is NP complete
* Need to prove that every problem in NP reduces to CircuitSAT
* $A \leq_P$ CircuitSAT
* Let A be the factorization problem to find factors p,q of N
* Create a circuit that takes N, p, q and outputs 1 iff p, q is a correct factorization of N
* Send this circuit to CircuitSAT and that will output solutions of p and q to satisfy it or none if none exists
* For any problem in NP, make the verification procedure into a circuit, feed that into CircuitSAT
* Solving any problem in NP is the same as solving CircuitSAT: the polynomial-time verifier becomes a circuit with polynomially many gates
## 3-SAT
* Input: 3-SAT formula on variables $x_1 \ldots x_n \in \{0,1\}$ with $m$ clauses
* Output: An assignment that satisfies all clauses
* Prove that 3-SAT is NP-Complete:
* 1) 3SAT $\in$ NP
* 2) Every problem in NP reduces to 3SAT
* We just proved CircuitSAT is NP-complete, so if CircuitSAT reduces to 3SAT, 3SAT is NP-complete too
* If $A \leq_P B, B \leq_P C, \implies A \leq_p C$
* Reduce CircuitSAT to 3SAT
* 3-SAT formula:
* Variables $\{x_1 \ldots x_n\}$
* Wires $\{w_1 \ldots w_m\}$
* Convert gates into clauses
* AND: $w_k = w_i \land w_j$ convert to clause
* Impossible to have $(w_i, w_j, w_k) = (0,0,1)$; rule it out with the clause $(w_i \lor w_j \lor \lnot w_k)$
* Do this for the 3 other impossible configurations, join with $\land$
* Encode the constraint that the output wire $w_m = 1$ as $(w_m \lor w_m \lor w_m)$ (to fit the format of 3SAT)
* Circuit C has a satisfying assignment iff the 3SAT formula does (gate-to-clause sketch below)
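* A small sketch of the gate-to-clause conversion for an AND gate, using signed integers for literals (a DIMACS-style convention, not the lecture's notation):

```python
# Sketch: clauses for w_k = w_i AND w_j, one clause per impossible row of the
# truth table. Literal +v means w_v; literal -v means NOT w_v.
def and_gate_clauses(i, j, k):
    return [
        (i, j, -k),    # rules out (w_i, w_j, w_k) = (0, 0, 1)
        (i, -j, -k),   # rules out (0, 1, 1)
        (-i, j, -k),   # rules out (1, 0, 1)
        (-i, -j, k),   # rules out (1, 1, 0)
    ]
```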
## Independent Set
* Goal: Find independent set of size $k$
* Create graph from 3SAT
* Create an undirected triangle for each clause, with one vertex per literal
* Can only pick at most 1 vertex in any triangle
* Can't pick $y$ and $\lnot y$, so draw edge from every variable to its negation
* $k$ is the number of clauses
* In every clause, there must be at least 1 true literal
* Ind. set: pick any true literal in every triangle
* If $S \subseteq V$ is indep set, S contains 1 vertex from every triangle
* $|S| = m$, num of clauses
* If $x_i \in S, x_i = TRUE$
* If $\bar x_i \in S, x_i = FALSE$
* else: Assign $x_i = TRUE$ (arbitrary)
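* A sketch of the whole construction, assuming clauses are given as triples of signed integers for literals:

```python
# Sketch: build the Independent Set instance from a 3SAT formula. Each clause
# becomes a triangle (one vertex per literal); every literal occurrence is
# joined by an edge to every occurrence of its negation.
def threesat_to_indset(clauses):
    nodes, edges = [], []                 # a node is (clause_index, literal)
    for c, clause in enumerate(clauses):
        tri = [(c, lit) for lit in clause]
        nodes += tri
        edges += [(tri[0], tri[1]), (tri[1], tri[2]), (tri[0], tri[2])]
    for u in nodes:                       # literal-vs-negation edges
        for v in nodes:
            if u < v and u[1] == -v[1]:
                edges.append((u, v))
    return nodes, edges, len(clauses)     # seek an independent set of size m
```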
# Lecture 29: 11/15/19
* NP-complete proofs
* Show that a known NP-complete problem reduces to A, and that A is in NP
## Integer Programming
* An LP whose variables may take only integer values
* Input: Set of linear constraints on variables $x_1 \ldots x_n$
* Solution: Integer assignment satisfying constraints
* No objective function, just need to satisfy constraints
* Reduce Independent Set to Integer Programming
* $x_i \in \{0,1\}$ indicates if $x_i$ in indep set
* $\sum\limits_{i=1}^n x_i = k$, $0 \leq x_i \leq 1$
* $x_i + x_j \leq 1, \forall (i,j) \in E$ (edge constraint)
* Very versatile problem to reduce to
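* A sketch of that reduction as code, emitting the constraints as strings (purely illustrative):

```python
# Sketch: Independent Set (n vertices, edge list, target k) -> Integer Program.
def indset_to_ip(n, edges, k):
    constraints = [f"x{i} + x{j} <= 1" for (i, j) in edges]    # edge constraints
    constraints.append(" + ".join(f"x{i}" for i in range(n)) + f" = {k}")
    constraints += [f"0 <= x{i} <= 1" for i in range(n)]       # x_i integer in {0,1}
    return constraints
```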
## Rudrata Cycle
* Reduce 3SAT to Rudrata Cycle (directed graph)
* Map 3SAT sols to Rudrata cycle sols
* For 1 gadget, 0 is R to L, 1 is L to R
* Can only traverse in 1 direction
* $2^n$ different Hamiltonian cycles through the variable gadgets, one for each choice of direction in each row
* Create a vertex C for the clause $x_1 \lor \bar x_2 \lor x_n$
* (L to R) for True, (R to L) for False
* Direct edges to C such that only possible from L to R for $x_1$
* The Rudrata cycle can cover vertex C only if it walks $x_1$'s row L to R, or some other variable's row in the direction matching its literal in the clause
* Size of graph: the length of each variable's row is proportional to its number of occurrences, so $O(n+m)$ total
# Lecture 30: 11/18/19
## Greedy Vertex Cover Approximation
* Pick a maximal matching M (cannot add any more edges)
* Add edges to M until you can't
* Output S: all endpoints of edges in M
* Prove S is a vertex cover
* Suppose an edge (u,v) is NOT covered by S
* Add edge (u,v) to matching M
* But M is maximal, this is a contradiction
* $|S| = 2|M| \leq 2 \cdot OPT$
* Optimal vertex cover $\geq$ size of maximal matching $|M|$
* For each edge $(u,v) \in M$, any cover must pick u or v, and matching edges share no endpoints
* Linear time
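* A minimal sketch of the matching-based 2-approximation:

```python
# Sketch: greedy maximal matching gives a 2-approximate vertex cover in O(|E|).
def greedy_vertex_cover(edges):
    cover = set()
    for u, v in edges:
        if u not in cover and v not in cover:   # edge is unmatched: take it
            cover |= {u, v}                     # add both endpoints to S
    return cover
```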
## LP Vertex Cover Approximation
* Variables: $x_i = 1$ if in vertex cover, 0 otherwise
* Objective: $\min \sum\limits_{i=1}^n x_i$
* Constraints: $0 \leq x_i \leq 1$
* $x_i + x_j \geq 1, \forall (i,j) \in E$
* LP-OPT = $\sum_i x_i^*$
* May not be an integer solution
* LP-OPT $\leq$ OPT-VC
* If S is an optimal vertex cover, it is a feasible LP solution
* If $x_i \geq \frac{1}{2}$, round to 1. Otherwise, round down
* For any edge $(i,j)$: since $x_i^* + x_j^* \geq 1$, at least one of $x_i^*, x_j^*$ is $\geq \frac{1}{2}$
* So the rounded set covers every edge
* Rounded VC $\leq 2 \cdot$ LP-OPT $\leq 2 \cdot$ OPT-VC
* For each vertex rounded to 1, the vertex cover pays 1 while the LP pays $x_i^* \geq \frac{1}{2}$
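* A sketch of the LP-rounding algorithm, assuming `scipy` is available (`linprog` expects $Ax \leq b$, so each $x_i + x_j \geq 1$ is negated):

```python
import numpy as np
from scipy.optimize import linprog

# Sketch: solve the vertex-cover LP relaxation, then round at 1/2.
def lp_round_vertex_cover(n, edges):
    c = np.ones(n)                              # minimize sum of x_i
    A = np.zeros((len(edges), n))
    for row, (i, j) in enumerate(edges):
        A[row, i] = A[row, j] = -1              # -x_i - x_j <= -1
    b = -np.ones(len(edges))
    res = linprog(c, A_ub=A, b_ub=b, bounds=[(0, 1)] * n)
    return [i for i in range(n) if res.x[i] >= 0.5]
```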
## Metric TSP
* Input: $n$ cities with distances $d_{ij}$
* Output: A tour visiting all cities exactly once and returning to the starting point; the tour may start from any city, and no repetition is allowed
* Assumption: Distances satisfy triangle inequality
* $d_{ij} + d_{jk} \geq d_{ik} \forall i,j,k$
* Algorithm
* 1) Find an MST T on $d_{ij}$
* Cost(T) $\leq$ Cost(OPT-TSP): deleting one edge from the optimal tour leaves a spanning tree, which costs at least the MST
* 2) DFS Traversal of tree T
* Cost(DFS) = 2Cost(Tree T) $\leq$ 2Cost(OPT-TSP)
* 3) Drop repeated vertices from traversal
* Cost(Output) $\leq$ cost(DFS) $\leq$ 2Cost(OPT-TSP)
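* A sketch of the full algorithm, assuming `dist` is a symmetric matrix satisfying the triangle inequality:

```python
# Sketch: metric TSP 2-approximation via MST (Prim's) + preorder DFS.
def tsp_2approx(dist):
    n = len(dist)
    children = {i: [] for i in range(n)}
    best = {i: (dist[0][i], 0) for i in range(1, n)}   # (cost to tree, parent)
    while best:                                         # Prim's algorithm
        u = min(best, key=lambda i: best[i][0])
        _, parent = best.pop(u)
        children[parent].append(u)
        for w in best:
            if dist[u][w] < best[w][0]:
                best[w] = (dist[u][w], u)
    tour, stack = [], [0]                               # preorder DFS: visiting
    while stack:                                        # each vertex once is the
        v = stack.pop()                                 # "drop repeats" shortcut,
        tour.append(v)                                  # valid by triangle ineq.
        stack.extend(reversed(children[v]))
    return tour + [0]                                   # return to the start
```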
# Lecture 31: 11/20/19
* Min TSP without triangle inequality has no fixed approximation factor
* NP hard to approximate any fixed factor
* If you could approximate Min TSP within any fixed factor, you could solve Rudrata cycle
* Give the edges of G weight 1 and all non-edges a huge weight: G has a Rudrata cycle iff there is a tour of cost $n$, and the cost gap lets any fixed-factor approximation distinguish the two cases
## K-Cluster
* Input: $n$ points with distance d
* Output: Partition the points into $k$ clusters $C_1 \ldots C_k$
* Diameter = max dist between two points in the same cluster
* Cost = $\max\limits_i diam(C_i)$; the goal is to minimize this
* Greedy Algorithm (Factor 2):
* Find $k$ cluster centers
* Set $\mu_1$ arbitrarily
* For $2 \ldots k$: $\mu_i =$ farthest$(\{\mu_1 \ldots \mu_{i-1}\})$
* Farthest = $\arg\max\limits_x \min\limits_j dist(x,\mu_j)$
* Output clustering: assign each point to its nearest center
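* A sketch of the farthest-first greedy, assuming `points` are hashable and `dist` is a metric:

```python
# Sketch: factor-2 greedy k-clustering (farthest-first traversal).
def k_cluster(points, k, dist):
    centers = [points[0]]                         # mu_1: arbitrary
    while len(centers) < k:
        # next center: the point farthest from its nearest chosen center
        centers.append(max(points, key=lambda x: min(dist(x, m) for m in centers)))
    # cluster: assign each point to its nearest center
    return {x: min(centers, key=lambda m: dist(x, m)) for x in points}
```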
# Lecture 32: 11/22/19
## Streaming
* Input is a stream: $s_1, s_2 \ldots s_n$, each $s_i \in \{1 \ldots N\}$
* Goal: compute function $f(s_1, \ldots s_n)$
* Restrictions:
* 1) Memory $\approx poly(\log n, \log N)$
* 2) You see each input only once in sequential order
* Analogous to reading a book with a little sheet to take notes, compute word statistics
## Random Samples
* Take a random sample of inputs
* 300 voters who vote 0 or 1
* Estimate num of voters who are 1:
* Sample $k$ voters, output average of that
* With $k$ samples, estimate is $\frac{1}{k} \sum x_i$ with accuracy $\epsilon$, probability $1 - \delta$
* Runtime of algorithm is independent of input size
## Chernoff Bound
* Input is a stream: $s_1, s_2 \ldots s_n$
* Let $x_1, \ldots x_k$ uniform random samples from stream
* Output: $\frac{1}{k} \sum x_i$
* $E[x_i] = p$
* $Pr(|\frac{1}{k} \sum x_i - p | > \epsilon) \leq 2e^{-2\epsilon^2 k}$
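* Solving $2e^{-2\epsilon^2 k} \leq \delta$ for $k$ gives the sample size needed; a small illustrative helper:

```python
import math

# From 2*exp(-2*eps^2 * k) <= delta, solve: k >= ln(2/delta) / (2*eps^2).
def samples_needed(eps, delta):
    return math.ceil(math.log(2 / delta) / (2 * eps ** 2))

print(samples_needed(0.01, 0.05))   # ~18445 samples, independent of n
```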
## Reservoir Sampling
* Input is a stream: $s_1, s_2 \ldots s_n$
* Output: uniformly random element from stream
* The catch is that you don't know when the stream ends
* Algorithm:
* Reservoir = $s_1$
* For $i = 2 \ldots$:
* Choose random number $r \in \{ 1 \ldots i\}$
* If $r = 1$: Add $s_i$ to reservoir
* Else: ignore $s_i$
* Output reservoir
* Main challenge: produce a uniform sample without knowing when the stream ends; by induction, after step $i$ each of $s_1 \ldots s_i$ is in the reservoir with probability $\frac{1}{i}$
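* A minimal sketch of the size-1 reservoir:

```python
import random

# Sketch: keep each s_i with probability 1/i; the survivor is uniform.
def reservoir_sample(stream):
    reservoir = None
    for i, s in enumerate(stream, start=1):
        if random.randrange(i) == 0:     # happens with probability 1/i
            reservoir = s
    return reservoir
```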
# Lecture 33: 11/25/19
## Distinct Elements
* Input is a stream: $s_1, s_2 \ldots s_n$
* Output: Estimate # of distinct elements in stream
* Algorithm:
* 1) Pick a random hash function $h: \{ 1 \ldots N\} \rightarrow [0,1]$
* $k$ distinct elements in stream means $k$ distinct real numbers in $[0,1]$
* Uniformly random over the interval
* Smallest element should be $\approx \frac{1}{k}$
* 2) Compute $\alpha = \min \{h(s_1), h(s_2), \ldots h(s_n)\} \approx \frac{1}{k}$
* Does not require much memory/space
* 3) Output $\frac{1}{\alpha} - 1$
* $E[\min \{h(s_1), h(s_2), \ldots h(s_n)\}] = \frac{1}{k+1}$
* Where $k$ is num of distinct elements in stream, h is a random hash function to $[0,1]$
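* A sketch of the estimator; memoizing `random.random()` per key stands in for a truly random hash function (a real streaming version stores only $\alpha$ and a short hash description, not this dict):

```python
import random

# Sketch: distinct-elements estimate via the minimum hash value.
def estimate_distinct(stream):
    h, alpha = {}, 1.0
    for s in stream:
        if s not in h:
            h[s] = random.random()       # idealized hash h(s) ~ Uniform[0,1]
        alpha = min(alpha, h[s])
    return 1 / alpha - 1                 # since E[min] = 1/(k+1)
```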
### Probability bounding
* For $k$ distinct elements in the stream, $P(\min \{h(s_1), h(s_2), \ldots h(s_k)\} \leq \frac{1}{4k}) \leq \frac{1}{4}$
* $= P\big((h(s_1) \leq \frac{1}{4k}) \lor \ldots \lor (h(s_k) \leq \frac{1}{4k})\big)$
* Union bound: $\leq \sum\limits_{i=1}^k P(h(s_i) \leq \frac{1}{4k}) = k \cdot \frac{1}{4k} = \frac{1}{4}$
* For $k$ distinct elements in the stream, $P(\min \{h(s_1), h(s_2), \ldots h(s_k)\} > \frac{4}{k}) \leq e^{-4}$
* i.e., for a truly random function, the estimate falls below $\frac{k}{4}$ only with small probability
* $= P\big((h(s_1) > \frac{4}{k}) \land \ldots \land (h(s_k) > \frac{4}{k})\big)$
* Assuming full independence of the random function: $\prod\limits_{i=1}^k P(h(s_i) > \frac{4}{k}) = (1 - \frac{4}{k})^k \approx e^{-4}$
* But this proof is bad since we assume truly random hash function
## Hash family
* $\mathcal{H} = \{ h_1 \ldots h_m: \{ 1 \ldots N \} \rightarrow \mathbb{R}\}$
* If $\mathcal{H} = $ set of all possible functions
* To remember $h \in \mathcal{H}$, need $\log |\mathcal{H}| $ bits
* Want $\mathcal{H}$ to be small yet random
## Pairwise Independent Hash Family
* A hash family $\mathcal{H}$ is pairwise independent if $\forall x, y$ with $x \neq y$ in the domain $\{1 \ldots N\}$
* and $\forall \alpha, \beta$ in the range $\{1 \ldots R\}$:
* $P_{h \in \mathcal{H}}\big((h(x) = \alpha) \land (h(y) = \beta)\big) = \frac{1}{R^2}$
* Example: $\mathcal{Z}_p = \{0, 1, \ldots p-1\}$ integers modulo p
* $\mathcal{H} = \{h_{a,b}(x) = ax + b \bmod p : a, b \in \mathcal{Z}_p \}$, $|\mathcal{H}| = p^2$
* Once you know 2 values of the hash function, you can calculate a,b and predict any future ones
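* A sketch of sampling from this family; storing $h$ only requires the two numbers $a, b$, i.e. $O(\log p)$ bits:

```python
import random

# Sketch: the pairwise-independent family h_{a,b}(x) = (a*x + b) mod p.
def sample_hash(p):                      # p should be prime
    a, b = random.randrange(p), random.randrange(p)
    return lambda x: (a * x + b) % p
```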
# Lecture 34: 12/2/19
## Heavy Hitters
* Output: List of all elements that appear $\geq n\beta $ times, $0 \leq \beta \leq 1$
* Data structure: "Count min sketch"
* List all heavy hitters
* Given any element, tell if frequency of element is $\geq \beta$
* Table $M[i,j]$ initialized to 0, with $l = 10 \log n$ rows and $B$ buckets per row
* Hash functions $h_1 \ldots h_l \colon \{1 \ldots N\} \rightarrow \{1 \ldots B\}$
* If you see element a:
* Increment(element a): for each hash function $h_i$, $M[i, h_i(a)]$++
* Query(element a): approximate frequency
* $M[i, h_i(a)] \geq f_a$ for every row $i$; all are over-estimates, so return the minimum over rows
* $M[i,j]$ is num of elements in stream that map to bucket j under hash function $h_i$, so its an over-estimate due to hash collisions
### Upper bound
* Fix i and element a.
* $E_{h_i}\big[M[i, h_i(a)]\big] = f_a + \sum\limits_{b \neq a} f_b \cdot P(h_i(b) = h_i(a)) \leq f_a + \frac{n}{B}$
* $P(h_i(b) = h_i(a)) = \frac{1}{B}$ for a fixed $b \neq a$
* Markov's Inequality
* $P(X >t) \leq \frac{E[X]}{t}$
* $P(M[i, h_i(a)] - f_a \geq \frac{2n}{B} ) \leq \frac{1}{2}$
* $P(\min_i M[i, h_i(a)] \geq f_a + \frac{2n}{B} )$
* Since hash functions are chosen independently, multiply all probabilities $P(M[i, h_i(a)] \geq f_a + \frac{2n}{B} )$
* Each estimate is at worst $\frac{1}{2}$ so total is $\frac{1}{2^l} = \frac{1}{2^{10 \log n}} \approx \frac{1}{n^{10}}$
## Final Algorithm
* for i = 1 to n:
* increment($s_i$)
* If query$(s_i) \geq \beta n$, add $s_i$ to the list of $\beta$-heavy hitters
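* A sketch of the count-min sketch plus the heavy-hitters pass; the hash functions reuse the $(ax+b) \bmod p$ family, and the parameter choices are illustrative:

```python
import math, random

# Sketch: count-min sketch with l rows and B buckets per row.
class CountMinSketch:
    def __init__(self, l, B, p=2**61 - 1):          # p: a large prime
        self.l, self.B, self.p = l, B, p
        self.ab = [(random.randrange(1, p), random.randrange(p)) for _ in range(l)]
        self.M = [[0] * B for _ in range(l)]

    def _h(self, i, x):
        a, b = self.ab[i]
        return ((a * x + b) % self.p) % self.B

    def increment(self, x):
        for i in range(self.l):
            self.M[i][self._h(i, x)] += 1

    def query(self, x):                              # every row over-estimates
        return min(self.M[i][self._h(i, x)] for i in range(self.l))

def heavy_hitters(stream, beta, n, B=1000):
    cms = CountMinSketch(l=max(1, math.ceil(10 * math.log2(n))), B=B)
    hitters = set()
    for s in stream:
        cms.increment(s)
        if cms.query(s) >= beta * n:
            hitters.add(s)
    return hitters
```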
# Lecture 35: 12/4/19
## Proofs
* Completeness: true statements are provable; soundness: false statements cannot be proved