Group Assignment 2.4-2.5

# Group Assignment 2.4-2.5 ## 2.4 (A-Question) Consider a problem when the input is a stream of $m$ inequalities of the form “$ai ≠ aj$” for some $1 ≤ i < j ≤ n$. Design an analyze an algorithm that assigns value $0$ or $1$ to each of the variables $a_1, . . . , a_n$ so that the number of inequalities that are satisfied are at least $m/2$ in expectation. For example, if the input stream is ($a_1 ≠ a_2$, $a_1 ≠ a_3$, $a_2 ≠ a_3$), then assigning $a_1 = a_2 = 1$ and $a_3 = 0$ will satisfy two inequalities (all but the first inequality). Your algorithm should use $O(poly log(n) + poly log(m))$ space. You can assume that n and m are known in advance. After the algorithm finishes reading the stream, it should output the values of $a_1, . . . , a_n$ and the number of satisfied inequalities. Hint: Try to solve the problem when there is no space limit first. How can you adapt your solution to use less space? ### Solution **Algorithm:** Create a Cater-Wegman hash function by generating a random prime $p \in \{m,...,2m\}$, hash to a 1 bit number. Assign values by assigning each value the hash of their index: $a_i = h(i)$ for all $1 \le i \le m$. Assign a counter to keep count of satisfied equalities. For each inequality, check if it is satisfied. If it is: add 1 to counter. Return assignment and assigned values. **Input:** A stream of inequalities $s$, two integers $m$ and $n$. **Output:** An assignment of values to $a_1,...,a_n$ and the number of inequalities satisfied. ```python h(a,b,p,x): # Cater-Wegman hash return ((ax+b)mod p)mod 2 Inequalities(s,m,n): p <- random prime from {m,...,2m} a <- random integer from {1,...,p-1} b <- random integer from {0,...,p-1} counter <- 0 for each inequality (i,j) in s: if h(a,b,p,i) ≠ h(a,b,p,j): counter <- counter + 1 for each value i from 1 to m: output h(a,b,p,i) #print the assginment a_i = h(i) return counter ``` **Lemma 1:** The probability of an inequality being satisfied is at least 0.5. **Proof:** The inequality $a_i \ne a_j$ is equivalent to $h_{ab}(i) \ne h_{ab}(j)$ after our assignment. Notice $i \ne j$. The hash function used is universal which means $p(h_{ab}(i) = h_{ab}(j)) \le \frac{1}{n}$ where $n$ is the bound on the generated number. In this case we use $n=2$ since we want either 0 or 1. $$p(h_{ab}(i) = h_{ab}(j)) \le \frac{1}{2} \implies p(h_{ab}(i) \ne h_{ab}(j)) \ge \frac{1}{2} \implies p(a_i \ne a_j) \ge \frac{1}{2}$$ **Lemma 2:** $E(\text{# of satisfied inequalities}) \ge 0.5m$ **Proof:** By lemma 2, each inequality has at least probability 0.5 of being satisfied. Let $X_i$ be 1 if inequality $i$ is satisfied and 0 otherwise, for all $1 \le i \le m$. $$E(X_i) \ge 0.5$$ $$E(\text{# of satisfied inequalities}) = E(\sum_{i=1}^m X_i) = \sum_{i=1}^m E(X_i) \ge 0.5m$$ **Lemma 3:** The space required is $O(polylog(n)+polylog(m))$ **Proof:** The counter will maximum be $m$, if all inequalities are satisfied. This requires $\log m$ bits. $p$, $a$, and $b$, are all bounded by $2m$. They therefore require $\log 2m$ bits. In total we require $\log m + 3 \log 2m \in O(polylog(n)+polylog(m))$ bits. ## 2.5 (D-level or E-level) In this problem you will work out the details behind the last slide on the Misra-Gries algorithm (in the lecture on Nov 14, where we covered the Misra-Gries algorithm, we skipped this). In particular, the objective is to show that the Misra-Gries algorithm can be adjusted to also output estimates of the count of each item. Recall that the basic Misra-Gries algorithm outputs a list of $k$ values $x_1, . . . , x_k$ and is guaranteed to contain all values that appear more than $m/(k+1)$ times in the input. Show how to adjust the algorithm to also output $k$ frequency estimates $g_1, . . . , g_k$ such that it is guaranteed that $f_i − m/(k + 1) ≤ g_i ≤ f_i$ where $f_i$ is the exact number of times that $x_i$ appears in the input stream. ### Solution **Input:** A positive integer $k$ and a finite sequence $s$ with values $x_1,x_2,...,x_m$ for a positive integer $m$ **Output:** An associative array $G$ where $G[i]$ is the frequency estimate for the value $x_i$ ($1 \le i \le m$) **Pseudocode:** ``` ModifiedMisraGries(s, k) = let G be hashmap for each x in s: if x is in keys(G): G[x] <- G[x] + 1 else if |keys(G)| < k: G[x] <- 1 else: for each y in keys(G): G[y] <- G[y] - 1 if G[y] = 0: remove y from keys(G) return G ``` ^[Inspired by Wikipedia pseudocode. https://en.wikipedia.org/wiki/Misra%E2%80%93Gries_summary.] **Claim:** $f_i − \frac{m}{k + 1} \le G[i] \le f_i$ **Proof:** The statement is equivalent to proving that any outputted counter $G[a]$ can at most have been decremented or not incremented when it *should have* at most $\frac{m}{k + 1}$ times. To clarify: $G[a]$ can only be decremented when the current element of the iteration $x_i \neq a$ is being considered or *not incremented* when $x_i=a$ is considered (but $a$ is not in $G$ yet) at most $\frac{m}{k + 1}$ times, since then its recorded frequency $G[a]$ will be at most $f_i - \frac{m}{k + 1}$ where $f_i$ is its actual frequency. To not increment $G[a]$ when $x_i=a$ is considered or to decrement $G[a]$ when $x_i\neq a$ is considered is equivalent to executing the last else statement (independent of whether or not $x_i=a$), otherwise some counter will be incremented by one. The last else statement is only executed if $|keys(G)| \geq k$. This only holds when each element $G[i] \geq 1$, therefore we find that it can only be executed if $\sum G[i] \geq k$. Otherwise if $|keys(G)| < k$ we find that for each iteration of the loop it increments $\sum G[i]$ with one. We can look at this as if $\sum G[i]$ is always incremented by one after reading an element; let us call this the *default* case. If the last else statement is executed then we can see it as $\sum G[i]$ will be decremented by $k+1$, as it decrements each value in $G$ (which has a size of $k$) and does not add a value (therefore decrementing by $k+1$ compared to the *default* case). The final else statement can at most be executed every $k+1$ iterations, since we need to add at least $k$ elements before we can remove any of them ($\sum G[i]\geq k$ in order to reach the last else statement). Due to there being $m$ iterations we have the following: $$ c \cdot (k + 1) \leq m \Leftrightarrow c \leq \frac{m}{k + 1} $$ where $c$ is the number of total calls to the else function, and thus also the upper bound for the number of decrementations and missed incrementations of any one counter $G[i]$. We also know the upper bound for the counted frequency. If it is not decremented, or not *not added*, any times then the counted frequency will be equal to its real frequency $f_i$. The lower bound for $G[i]$ is thus $f_i - \frac{m}{k+1}$ since we decrement it or miss an incrementation a maximum of $\frac{m}{k + 1}$ times. Therefore, we have the following: $$ f_i - \frac{m}{k+1} \leq G[i] \leq f_i $$ QED.