# CS170 Final Notes

1. MT1 Notes can be found [here](https://hackmd.io/@matthewtang/B1js2XPrS)
2. MT2 Notes can be found [here](https://hackmd.io/@matthewtang/B1NCuHWdB)
3. Final Notes can be found [here](https://hackmd.io/@matthewtang/r1PACERcH)

# Lecture 26: 11/4/19

## NP and P Problems
* Classes of computational problems
    * P - class of problems for which you can find a solution in polynomial time
    * NP - class of problems for which you can verify a solution in polynomial time
* Example: 3COL
    * Input: Graph $G(V,E)$
    * Solution: A coloring $c: V \rightarrow \{R,B,G\}$ such that each edge gets 2 different colors
    * Verifying a proposed coloring just requires checking all edges
* $P \subseteq NP$
    * If we can compute a solution efficiently, we can verify a solution efficiently
    * Example: MST can be solved and verified in polynomial time
        * Just run the MST algorithm again and check if the cost matches
* Not in NP (no short certificate to verify)
    * Counting the number of 3-colorings
    * Games: does white/black have a winning strategy?
* In NP
    * Hamiltonian cycle:
        * Input: Graph $G(V,E)$
        * Solution: A cycle containing all vertices exactly once

## TSP Variants
* Min TSP (optimization) - not in NP
    * Input: Graph $G(V,E)$, weights $\{w_e\}_{e \in E}$
    * Solution: A tour of minimum cost (goes through every vertex exactly once)
* Budget TSP (search) - in NP
    * Input: Graph $G(V,E)$, weights $\{w_e\}_{e \in E}$, budget $B$
    * Solution: A tour with cost $\leq B$
    * If Min TSP can be solved efficiently, then Budget TSP can be solved efficiently
* Decision-Budget TSP (decision) - in NP
    * Input: Graph $G(V,E)$, weights $\{w_e\}_{e \in E}$, budget $B$
    * Question: Is there a tour of cost $\leq B$?

## Breaking RSA
* Alice publishes her public key (PK); Bob encrypts his message with PK
* Input: Public key PK, ciphertext Enc(PK, m)
* Solution: Message $m$
* An NP problem, since we can just encrypt a candidate $m$ and check that it matches the ciphertext
* P = NP??

# Lecture 27: 11/8/19
* Factorization is not known to be in P

## Reductions
* Problem $A \leq_P B$
    * Problem A reduces in polynomial time to problem B if you can use an algorithm for B to solve A
    * B is at least as hard a problem as A
* Example:
    * A: Hamiltonian Cycle
    * B: Half Cycle - find a cycle that visits half the nodes ($|V|/2$)
    * Use an algorithm for B as a black box to solve A
        * Use a reduction algorithm to convert an input of A into an input of type B
        * Then recover a solution to A from the solution to B
* Reduction proofs:
    * 1) Describe the reduction algorithm
    * 2) Prove that a solution to B $\implies$ a solution to A, recoverable efficiently
    * 3) No solution to B $\implies$ no solution to A
* Half Cycle and Hamiltonian Cycle: double the number of vertices (a sketch follows below)
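A minimal sketch of that reduction, assuming a hypothetical black-box `half_cycle_solver(G)` that returns a cycle visiting exactly half of `G`'s vertices (or `None` if none exists). Padding the graph with $n$ isolated dummy vertices is one way to "double the vertices": any half cycle of the padded graph must then be a Hamiltonian cycle of the original.

```python
# Sketch: reduce Hamiltonian Cycle to Half Cycle via a hypothetical oracle.
# Graphs are adjacency maps: {vertex: set(neighbors)}.

def hamiltonian_via_half_cycle(graph, half_cycle_solver):
    """Solve Hamiltonian Cycle on `graph` using a Half Cycle black box.

    half_cycle_solver(G) is assumed to return a cycle visiting exactly
    |V(G)|/2 vertices, or None if no such cycle exists.
    """
    n = len(graph)
    # Build G': the original graph plus n isolated dummy vertices.
    g_prime = {v: set(nbrs) for v, nbrs in graph.items()}
    for i in range(n):
        g_prime[("dummy", i)] = set()
    # An isolated vertex can never lie on a cycle, so a cycle through
    # half of G' (n of its 2n vertices) must visit all n original
    # vertices: it is exactly a Hamiltonian cycle of the input graph.
    return half_cycle_solver(g_prime)
```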
## NP Complete Problems
* A is NP-complete if every problem in NP reduces to A
* All NP-complete problems are reducible to one another
* These are the hardest problems within NP
* A polynomial-time algorithm for one gives a polynomial-time algorithm for all of NP

# Lecture 28: 11/13/19
* NP-hard = at least as hard as every problem in NP
    * Ex. Halting problem (unsolvable)
        * Can't even verify a solution
* NP-complete = inside NP, but at least as hard as everything else in NP

## Circuit SAT
* Input: A boolean circuit C
* Solution: An input assignment that satisfies the circuit
* Circuit SAT is NP-complete
    * Need to prove that every problem in NP reduces to CircuitSAT: $A \leq_P$ CircuitSAT
    * Example: let A be the factorization problem of finding factors $p, q$ of $N$
        * Create a circuit that takes $N, p, q$ and outputs 1 if $p \cdot q = N$ is a correct factorization
        * Send this circuit to a CircuitSAT solver, which outputs values of $p$ and $q$ that satisfy it, or reports that none exist
    * For any problem in NP, turn its verification procedure into a circuit and feed that into CircuitSAT
    * Since verification runs in polynomial time, the circuit has a polynomial number of gates, so solving CircuitSAT solves any problem in NP

## 3-SAT
* Input: A 3-SAT formula on variables $x_1 \ldots x_n \in \{0,1\}$ with $m$ clauses
* Output: An assignment that satisfies all clauses
* Prove that 3-SAT is NP-complete:
    * 1) 3SAT $\in$ NP
    * 2) Every problem in NP reduces to 3SAT
        * We just proved CircuitSAT is NP-complete, so if CircuitSAT reduces to 3SAT, then 3SAT is NP-complete too
        * If $A \leq_P B$ and $B \leq_P C$, then $A \leq_P C$
* Reduce CircuitSAT to 3SAT
    * 3-SAT formula:
        * Variables $\{x_1 \ldots x_n\}$
        * Wires $\{w_1 \ldots w_m\}$
    * Convert gates into clauses (a code sketch follows at the end of this lecture's notes)
        * AND: convert $w_k = w_i \land w_j$ into clauses
        * It is impossible to have $(w_i, w_j, w_k) = (0,0,1)$, so add the clause $(w_i \lor w_j \lor \lnot w_k)$
        * Do this for the 3 other impossible configurations, and join the clauses with $\land$
    * Encode the constraint that the output wire $w_m = 1$ with the clause $(w_m \lor w_m \lor w_m)$ (to fit the format of 3SAT)
    * If circuit C has a satisfying assignment, there must exist a satisfying assignment to the 3SAT formula

## Independent Set
* Goal: Find an independent set of size $k$
* Create a graph from 3SAT (sketched below):
    * Create an undirected triangle from each clause
        * Can only pick at most 1 vertex in any triangle
    * Can't pick both $y$ and $\lnot y$, so draw an edge from every literal to its negation
    * $k$ is the number of clauses
* In every satisfied clause, there must be at least 1 true literal
    * Independent set: pick any true literal in every triangle
* If $S \subseteq V$ is an independent set with $|S| = m$ (the number of clauses), then S contains exactly 1 vertex from every triangle
    * If $x_i \in S$: assign $x_i = $ TRUE
    * If $\bar x_i \in S$: assign $x_i = $ FALSE
    * Else: assign $x_i = $ TRUE (arbitrary)
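Two small sketches to make the reductions above concrete. First, the gate conversion from the CircuitSAT $\to$ 3SAT reduction: each impossible row of a gate's truth table becomes one clause ruling it out. The `(wire, value)` literal encoding is an illustrative choice, not anything fixed by the notes.

```python
# Sketch: 3-CNF clauses enforcing an AND gate w_k = w_i AND w_j.
# A literal is (wire, value): (k, False) means "NOT w_k".

def and_gate_clauses(i, j, k):
    """Clauses forbidding the 4 impossible rows of the AND truth table."""
    impossible = [(0, 0, 1), (0, 1, 1), (1, 0, 1), (1, 1, 0)]
    clauses = []
    for vi, vj, vk in impossible:
        # To rule out the row, at least one wire must differ from it,
        # so each literal asserts the opposite of the row's value.
        clauses.append(((i, not vi), (j, not vj), (k, not vk)))
    return clauses
```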
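Second, the 3SAT $\to$ Independent Set construction: one triangle per clause plus conflict edges between opposite literals. Here a clause is a tuple of three nonzero integers, with `-i` standing for $\bar x_i$ (again an assumed encoding).

```python
# Sketch: build the Independent Set instance for a 3-SAT formula.

def sat_to_independent_set(clauses):
    """Return (vertices, edges, k): the formula is satisfiable iff the
    graph has an independent set of size k = number of clauses."""
    vertices, edges = [], set()
    for c_idx, clause in enumerate(clauses):
        # One vertex per literal occurrence; each clause forms a triangle,
        # so at most one vertex per clause can be picked.
        tri = [(c_idx, pos, lit) for pos, lit in enumerate(clause)]
        vertices.extend(tri)
        edges.update({(tri[0], tri[1]), (tri[1], tri[2]), (tri[0], tri[2])})
    # Conflict edges: x_i and NOT x_i can never both be picked.
    for u in vertices:
        for v in vertices:
            if u < v and u[2] == -v[2]:
                edges.add((u, v))
    return vertices, edges, len(clauses)
```

For example, `sat_to_independent_set([(1, -2, 3), (-1, 2, 3)])` asks for an independent set of size 2 in a 6-vertex graph.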
# Lecture 29: 11/15/19
* NP-completeness proofs for a problem A:
    * Show that a known NP-complete problem reduces to A, and that A is in NP

## Integer Programming
* LP where the variables take only integer values
* Input: A set of linear constraints over variables $x_1 \ldots x_n$
* Solution: An integer assignment satisfying the constraints
* No objective function, just need to satisfy the constraints
* Reduce Independent Set to Integer Programming:
    * $x_i \in \{0,1\}$ indicates whether vertex $i$ is in the independent set
    * $\sum\limits_{i=1}^n x_i = k$, $0 \leq x_i \leq 1$
    * $x_i + x_j \leq 1, \forall (i,j) \in E$ (edge constraint)
* Very versatile problem to reduce to

## Rudrata Cycle
* Reduce 3SAT to Rudrata Cycle (directed graph)
* Map 3SAT solutions to Rudrata cycle solutions
* One gadget (row) per variable: traversing it R to L means 0, L to R means 1
    * Each row can only be traversed in 1 direction
    * $2^n$ different Hamiltonian cycles, one for each choice of direction in each row
* Create a vertex C for the clause $x_1 \lor \bar x_2 \lor x_n$
    * (L to R) for a positive literal, (R to L) for a negated literal
    * Direct edges to C such that visiting C is only possible while walking $x_1$'s row L to R (and similarly for the clause's other literals)
    * The Rudrata cycle must walk $x_1$ L to R, or the matching direction for one of the other variables in the clause, to cover the vertex C
* Size of graph: the length of each row is the number of occurrences of that variable, so $O(n+m)$ total

# Lecture 30: 11/18/19

## Greedy Vertex Cover Approximation
* Pick a maximal matching M (one to which no more edges can be added)
    * Add edges to M until you can't
* Output S: all endpoints of edges in M
* Prove S is a vertex cover:
    * Suppose an edge (u,v) is NOT covered by S
    * Then (u,v) could be added to the matching M
    * But M is maximal - contradiction
* $|S| = 2|M| \leq 2 \cdot OPT$
    * Optimal vertex cover $\geq$ size of a maximal matching: to cover each edge $(u,v) \in M$ we have to pick u or v, and the edges of M are disjoint
* Linear time (a code sketch appears after the Lecture 32 notes below)

## LP Vertex Cover Approximation
* Variables: $x_i = 1$ if vertex $i$ is in the vertex cover, 0 otherwise
* Objective: $\min \sum\limits_{i=1}^n x_i$
* Constraints: $0 \leq x_i \leq 1$
    * $x_i + x_j \geq 1, \forall (i,j) \in E$
* LP-OPT = $\sum_i x_i^*$
    * May not be an integer solution
* LP-OPT $\leq$ OPT-VC
    * If S is an optimal vertex cover, its indicator vector is LP-feasible
* Rounding: if $x_i^* \geq \frac{1}{2}$, round up to 1; otherwise, round down to 0
    * For any edge, at least one endpoint must satisfy $x_i^* \geq \frac{1}{2}$ (since $x_i^* + x_j^* \geq 1$)
    * So the rounded solution covers all edges
* Rounded VC $\leq 2 \cdot$ LP-OPT $\leq 2 \cdot$ OPT-VC
    * Wherever the rounded vertex cover pays 1, the LP pays $x_i^* \geq \frac{1}{2}$

## Metric TSP
* Input: $n$ cities with distances $d_{ij}$
* Output: A tour visiting all cities and returning to the starting point
    * Can start from any city; no repetition allowed
* Assumption: distances satisfy the triangle inequality
    * $d_{ij} + d_{jk} \geq d_{ik}, \forall i,j,k$
* Algorithm (sketched after the Lecture 32 notes below):
    * 1) Find an MST T on $d_{ij}$
        * Cost(T) $\leq$ Cost(OPT-TSP)
    * 2) DFS traversal of the tree T
        * Cost(DFS) = 2 Cost(T) $\leq$ 2 Cost(OPT-TSP)
    * 3) Drop repeated vertices from the traversal
        * By the triangle inequality, Cost(Output) $\leq$ Cost(DFS) $\leq$ 2 Cost(OPT-TSP)

# Lecture 31: 11/20/19
* Min TSP without the triangle inequality has no fixed approximation factor
    * NP-hard to approximate within any fixed factor
    * If you could approximate it within a fixed factor, you could solve Rudrata Cycle
    * Build a TSP instance where edges of G cost 1 and non-edges cost something huge: if G has a Rudrata cycle, it has a TSP tour of cost n, and otherwise every tour is far more expensive

## K-Cluster
* Input: $n$ points with distances d
* Output: A partition of the points into $k$ clusters
    * Diameter = max distance between two points in a cluster
    * Cost = $\max\limits_i diam(C_i)$, to be minimized
* Greedy algorithm (factor 2), sketched after the Lecture 32 notes below:
    * Find k cluster centers
    * Set $\mu_1$ arbitrarily
    * For $i = 2 \ldots k$: $\mu_i = $ farthest$(\{\mu_1 \ldots \mu_{i-1}\})$
        * farthest = $\arg\max\limits_x \min\limits_j dist(x,\mu_j)$
    * Output the clustering around these centers

# Lecture 32: 11/22/19

## Streaming
* Input is a stream: $s_1, s_2 \ldots s_n$, each $s_i \in \{1 \ldots N\}$
* Goal: compute a function $f(s_1, \ldots s_n)$
* Restrictions:
    * 1) Memory $\approx poly(\log n, \log N)$
    * 2) You see each input only once, in sequential order
* Analogous to reading a book with a little sheet to take notes on, while computing word statistics

## Random Samples
* Take a random sample of the inputs
* Example: 300 voters who each vote 0 or 1
    * Estimate the number of voters who vote 1
    * Sample $k$ voters, output their average
* With $k$ samples, the estimate $\frac{1}{k} \sum x_i$ has accuracy $\epsilon$ with probability $1 - \delta$
* Runtime of the algorithm is independent of the input size

## Chernoff Bound
* Input is a stream: $s_1, s_2 \ldots s_n$
* Let $x_1, \ldots x_k$ be uniform random samples from the stream
* Output: $\frac{1}{k} \sum x_i$
* $E[x_i] = p$
* $Pr(|\frac{1}{k} \sum x_i - p| > \epsilon) \leq 2e^{-2\epsilon^2 k}$

## Reservoir Sampling
* Input is a stream: $s_1, s_2 \ldots s_n$
* Output: a uniformly random element from the stream
* The catch is that you don't know when the stream ends
* Algorithm (sketched below):
    * Reservoir = $s_1$
    * For $i = 2 \ldots$:
        * Choose a random number $r \in \{1 \ldots i\}$
        * If $r = 1$: replace the reservoir with $s_i$
        * Else: ignore $s_i$
    * Output the reservoir
* By induction, after step $i$ each of $s_1 \ldots s_i$ is in the reservoir with probability $\frac{1}{i}$, even though the stream order is not assumed to be random
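A direct sketch of the reservoir-sampling loop above; `stream` can be any iterable, so the length never needs to be known in advance.

```python
import random

def reservoir_sample(stream):
    """Return a uniformly random element of a stream of unknown length."""
    reservoir = None
    for i, s in enumerate(stream, start=1):
        # Keep s_i with probability 1/i; inductively, every element seen
        # so far then survives with probability exactly 1/i.
        if random.randrange(i) == 0:
            reservoir = s
    return reservoir
```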
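Looking back at Lecture 30, the maximal-matching vertex cover is only a few lines; edges are assumed to come as pairs of hashable vertices.

```python
def greedy_vertex_cover(edges):
    """2-approximate vertex cover: all endpoints of a maximal matching."""
    cover = set()  # endpoints of the matching edges picked so far
    for u, v in edges:
        # Greedily grow the matching: take the edge iff both ends are free.
        if u not in cover and v not in cover:
            cover.update((u, v))
    return cover
```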
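Lecture 30's Metric TSP 2-approximation, sketched for Euclidean points in the plane (an assumed metric; Prim's algorithm here stands in for any MST routine).

```python
import math
from collections import defaultdict

def metric_tsp_2approx(points):
    """MST + DFS-preorder tour: cost <= 2 * OPT under triangle inequality."""
    n = len(points)
    def d(i, j):
        return math.dist(points[i], points[j])
    # Prim's algorithm for an MST of the complete graph on the points.
    in_tree, best, parent = [False] * n, [math.inf] * n, [None] * n
    best[0] = 0.0
    children = defaultdict(list)
    for _ in range(n):
        u = min((i for i in range(n) if not in_tree[i]), key=lambda i: best[i])
        in_tree[u] = True
        if parent[u] is not None:
            children[parent[u]].append(u)
        for v in range(n):
            if not in_tree[v] and d(u, v) < best[v]:
                best[v], parent[v] = d(u, v), u
    # DFS preorder = full traversal with repeated vertices dropped, which
    # only shortens the tour when distances satisfy the triangle inequality.
    tour, stack = [], [0]
    while stack:
        u = stack.pop()
        tour.append(u)
        stack.extend(reversed(children[u]))
    return tour + [0]  # close the tour back at the start
```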
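And Lecture 31's factor-2 greedy clustering (farthest-first traversal), with a caller-supplied distance function assumed to satisfy the triangle inequality.

```python
def farthest_first(points, k, dist):
    """Greedy k-center: each new center is the point farthest from all
    previously chosen centers."""
    centers = [points[0]]  # mu_1 is arbitrary
    while len(centers) < k:
        centers.append(max(points, key=lambda x: min(dist(x, c) for c in centers)))
    # Assign every point to its nearest center to form the clusters.
    assignment = [min(range(k), key=lambda i: dist(x, centers[i])) for x in points]
    return centers, assignment
```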
# Lecture 33: 11/25/19

## Distinct Elements
* Input is a stream: $s_1, s_2 \ldots s_n$
* Output: an estimate of the number of distinct elements in the stream
* Algorithm (a simulation sketch appears at the end of these notes):
    * 1) Pick a random hash function $h: \{1 \ldots N\} \rightarrow [0,1]$
        * $k$ distinct elements in the stream map to $k$ distinct real numbers in $[0,1]$
        * These are uniformly random over the interval
        * So the smallest one should be $\approx \frac{1}{k}$
    * 2) Compute $\alpha = \min \{h(s_1), h(s_2), \ldots h(s_n)\} \approx \frac{1}{k}$
        * Does not require much memory/space
    * 3) Output $\frac{1}{\alpha} - 1$
* $E[\min \{h(s_1), h(s_2), \ldots h(s_n)\}] = \frac{1}{k+1}$
    * Where $k$ is the number of distinct elements in the stream and $h$ is a truly random hash function into $[0,1]$

### Probability bounding
* For $k$ distinct elements in the stream, $P(\min \{h(s_1), \ldots h(s_k)\} \leq \frac{1}{4k}) \leq \frac{1}{4}$
    * This is the event that some $h(s_i) \leq \frac{1}{4k}$
    * Union bound: $\leq \sum\limits_{i=1}^k P(h(s_i) \leq \frac{1}{4k}) = k \cdot \frac{1}{4k} = \frac{1}{4}$
* For $k$ distinct elements in the stream, $P(\min \{h(s_1), \ldots h(s_k)\} > \frac{4}{k}) \leq e^{-4}$
    * This is the event that the estimate is $< \frac{k}{4}$
    * $= P((h(s_1) > \frac{4}{k}) \land \ldots \land (h(s_k) > \frac{4}{k}))$
    * Assuming a truly random (fully independent) hash function: $\prod\limits_{i=1}^k P(h(s_i) > \frac{4}{k}) = (1 - \frac{4}{k})^k \approx e^{-4}$
* But this argument is flawed, since we assumed a truly random hash function

## Hash Family
* $\mathcal{H} = \{h_1 \ldots h_m : \{1 \ldots N\} \rightarrow Range\}$
* If $\mathcal{H}$ = the set of all possible functions, it is as random as possible
* But to remember $h \in \mathcal{H}$, you need $\log |\mathcal{H}|$ bits
* Want $\mathcal{H}$ to be small yet random enough

## Pairwise Independent Hash Family
* A hash family $\mathcal{H}$ is pairwise independent if $\forall x, y \in \{1 \ldots N\}$ with $x \neq y$
    * and $\forall \alpha, \beta \in \{1 \ldots R\}$ (the range):
    * $P_{h \in \mathcal{H}}((h(x) = \alpha) \land (h(y) = \beta)) = \frac{1}{R^2}$
* Example: $\mathbb{Z}_p = \{0, 1, \ldots p-1\}$, the integers modulo a prime $p$
    * $\mathcal{H} = \{h_{a,b}(x) = ax + b \bmod p : a, b \in \mathbb{Z}_p\}$, so $|\mathcal{H}| = p^2$ (a sketch appears at the end of these notes)
    * Once you know 2 values of the hash function, you can solve for $a, b$ and predict every other value

# Lecture 34: 12/2/19

## Heavy Hitters
* Output: a list of all elements that appear $\geq \beta n$ times, $0 \leq \beta \leq 1$
* Data structure: "count-min sketch" (sketched at the end of these notes)
    * Lists all heavy hitters
    * Given any element, tells whether the frequency of the element is $\geq \beta n$
* Table $M[i,j] = 0$, with $l = 10 \log n$ rows and $B$ buckets per row
* Hash functions $h_1 \ldots h_l \colon \{1 \ldots N\} \rightarrow \{1 \ldots B\}$
* Increment(element a): for each hash function $h_i$, $M[i, h_i(a)]{+}{+}$
* Query(element a): approximate the frequency $f_a$
    * The entries $M[i, h_i(a)]$ for $i = 1 \ldots l$ are all over-estimates of $f_a$, so return their minimum
    * $M[i,j]$ counts every element in the stream that maps to bucket $j$ under hash function $h_i$, so it over-estimates due to hash collisions

### Upper bound
* Fix a row $i$ and an element a
* $E_{h_i}[M[i, h_i(a)]] = f_a + \sum\limits_{b \neq a} f_b \cdot P(h_i(b) = h_i(a)) \leq f_a + \frac{n}{B}$
    * Since $P(h_i(b) = h_i(a)) = \frac{1}{B}$ for any fixed $b \neq a$, and $\sum_b f_b \leq n$
* Markov's inequality: $P(X > t) \leq \frac{E[X]}{t}$
    * $P(M[i, h_i(a)] - f_a \geq \frac{2n}{B}) \leq \frac{1}{2}$
* $P(\min_i M[i, h_i(a)] \geq f_a + \frac{2n}{B})$:
    * Since the hash functions are chosen independently, multiply the probabilities $P(M[i, h_i(a)] \geq f_a + \frac{2n}{B})$ over all rows
    * Each row fails with probability at most $\frac{1}{2}$, so the total is $\frac{1}{2^l} = \frac{1}{2^{10 \log n}} \approx \frac{1}{n^{10}}$

## Final Algorithm
* for i = 1 to n:
    * increment($s_i$)
    * If query($s_i$) $\geq \beta n$: add $s_i$ to the list of $\beta$-heavy hitters

# Lecture 35: 12/4/19

## Proofs
* True statements are provable; false statements cannot be proved
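Returning to Lecture 33's distinct-elements estimator, a minimal simulation. The dictionary of fresh uniform draws stands in for an idealized random hash $h: \{1 \ldots N\} \rightarrow [0,1]$, so this version is not actually low-memory; a real sketch would store only $\alpha$ and a small hash family.

```python
import random

def distinct_elements_estimate(stream):
    """Estimate the number of distinct elements via the minimum hash value."""
    h = {}        # simulates a truly random hash function
    alpha = 1.0   # the only state a real sketch would keep
    for s in stream:
        if s not in h:
            h[s] = random.random()
        alpha = min(alpha, h[s])
    # E[min] = 1/(k+1) for k distinct elements, so invert the minimum.
    return 1 / alpha - 1
```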
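Lecture 33's pairwise independent family $h_{a,b}(x) = ax + b \bmod p$ as code, with $p$ assumed prime; storing a function costs only the pair $(a, b)$, i.e. about $2 \log p$ bits.

```python
import random

def make_pairwise_hash(p):
    """Draw h(x) = ax + b mod p uniformly from the family of size p^2."""
    a, b = random.randrange(p), random.randrange(p)
    # For any x != y, (h(x), h(y)) is uniform over all pairs: the map
    # (a, b) -> (h(x), h(y)) is invertible modulo a prime p.
    return lambda x: (a * x + b) % p
```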
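Finally, a sketch of Lecture 34's count-min sketch, reusing the $ax + b \bmod p$ family for the row hashes and mirroring the notes' parameters ($l \approx 10 \log n$ rows, $B$ buckets); the choice of prime and the integer-key assumption are details of this sketch, not the notes.

```python
import math
import random

class CountMinSketch:
    """l rows of B buckets; every cell only over-counts, so the minimum
    over a key's l cells is the tightest over-estimate of its frequency."""

    def __init__(self, n, B, p=2_147_483_647):  # p: a large prime
        self.l = max(1, round(10 * math.log2(n)))  # ~10 log n rows
        self.B, self.p = B, p
        self.M = [[0] * B for _ in range(self.l)]
        # One independently drawn hash h_i(x) = (ax + b mod p) mod B per row.
        self.ab = [(random.randrange(1, p), random.randrange(p))
                   for _ in range(self.l)]

    def _h(self, i, x):
        a, b = self.ab[i]
        return ((a * x + b) % self.p) % self.B

    def increment(self, x):
        for i in range(self.l):
            self.M[i][self._h(i, x)] += 1

    def query(self, x):
        # Each row's cell counts x plus all colliding elements, so take
        # the minimum across rows as the frequency estimate.
        return min(self.M[i][self._h(i, x)] for i in range(self.l))
```

The heavy-hitters pass is then exactly the final algorithm above: increment each $s_i$ and report it whenever query($s_i$) $\geq \beta n$.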