# Group Assignment 3.1-3.2
# 3.1 Approximate sum: C-question
Imagine you have an array $A[1...n]$. Each value in the array is an integer between $1$ and $M$. Consider the following algorithm for finding the approximate sum of the values in the array.
## Algorithm
$r ← 1/ε^2$
$s ← 0$
Repeat $r$ times: $s ← s + A[i]$, where $i ∈[1 ...n]$ is picked randomly
Return $n · s / r$
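For concreteness, here is a minimal Python sketch of this procedure (the function name and the use of `random.choice` are our own choices, not part of the assignment):
```python
import random

def approx_sum(A, eps):
    # The boss's estimator: average r = 1/eps^2 uniform samples, scale by n.
    n = len(A)
    r = int(1 / eps**2)
    s = sum(random.choice(A) for _ in range(r))  # r samples with replacement
    return n * s / r
```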
Your boss claims that this is a good algorithm, and gives the following argument.
## Proof
Let $x_i$ be the value of the $i$th random sample, and $X = r^{-1} \sum x_i$ be their average. Let $A = n^{-1} \sum A[i]$, i.e., $A$ is the average value of the array (the quantity we want to find, divided by $n$). We know that $E[X] = A$. Then, using the Chernoff bound, we know that:
$$Pr[|X - A| \geq \varepsilon \cdot A] = Pr[|X - E[X]| \geq \varepsilon \cdot A] \leq 2e^{-2r(\varepsilon A)^2}$$
We know that $A \geq 1$, so by our choice of $r = 1/\varepsilon^2$ we conclude
$$2e^{-2r(\varepsilon A)^2} \leq 2e^{-2r\varepsilon^2} = 2e^{-2}$$
Thus the probability of error is ≤ 1/3. In other words, with probability at least $2/3$ we have $|X − A|< ε \cdot A$, which implies that $|n \cdot X − n \cdot A|< ε \cdot n \cdot A$, i.e.
$$nA(1 − ε) ≤ n \cdot X ≤ nA(1 + ε)$$
Note that the estimate returned in the algorithm is $n \cdot X$ and that the true sum is $n \cdot A$. We thus conclude that the algorithm returns an estimate which is within a factor $1 \pm ε$ of the actual sum.
What is wrong with this algorithm and proof? Can you give an example of an array A where this algorithm will almost certainly give the wrong answer?
Note: to pass the problem it is not sufficient to only give an example A, you need to also point out what is wrong in the above proof. But finding a “bad” A which breaks the algorithm can be helpful for finding the error.
## Counterexample
Assume that we construct an array as follows:
$$
A[1..n] = [\underbrace{1, \dots, 1}_{n-1 \text{ times}}, M], \quad \text{where } r \ll n \ll M
$$
Then the true mean is $A=\frac{M+n-1}{n} \gtrapprox M/n$, but the sampled mean produced by the algorithm depends on whether the element of value $M$ is among the samples.
If $M$ occurs $k = 0$ times in the sample, the sampled mean is $X|_{k=0} = r / r = 1$.
If $M$ occurs $k \geq 1$ time(s) in the sample, the sampled mean is $X|_{k\geq 1} = (r-k + k \cdot M)/ r \geq M / r$.
Let's say we have $\varepsilon = 0.1$, $r=100$, $n=10^5$, $M=10^{10}$.
According to the proof, $|n \cdot X - n \cdot A| < \varepsilon \cdot n \cdot A$ should hold with probability at least $2/3$.
But this never holds in our case:
For $k = 0$:
$$
\mathrm{LHS}|_{k=0} = n\cdot A - n \cdot X|_{k=0} \geq n \cdot \frac{M}{n} - n \cdot 1 = M - n = 10^{10} - 10^5
$$
$$
\mathrm{RHS} = \varepsilon \cdot n \cdot A = \varepsilon \cdot n \cdot \frac{M + n - 1}{n} = 10^{-1} \cdot (10^{10} + 10^5 - 1) \leq 2 \cdot 10^9
$$
$$
\implies \mathrm{LHS} \geq \mathrm{RHS}
$$
and for $k\geq 1$:
$$
\mathrm{LHS}|_{k\geq 1} = n \cdot X|_{k\geq1} - n\cdot A \geq n \cdot \frac{M}{r} - n \cdot \frac{M + n - 1}{n} \geq \frac{10^5 \cdot 10^{10}}{10^2} - 10^{10} - 10^5 \geq 10^{12}
$$
And once again: $\mathrm{LHS} \geq \mathrm{RHS}$.
Therefore, since the $M$-element is sampled either $0$ or $\geq 1$ times, no outcome of the sampling can produce an estimate within the bound, so the claimed bound is incorrect.
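To see this concretely, here is a small simulation of the counterexample (our own sketch; it redefines the estimator to stay self-contained and uses the parameters $\varepsilon = 0.1$, $n = 10^5$, $M = 10^{10}$ from above):
```python
import random

def approx_sum(A, eps):
    # The boss's estimator from 3.1.
    n, r = len(A), int(1 / eps**2)
    s = sum(random.choice(A) for _ in range(r))
    return n * s / r

eps, n, M = 0.1, 10**5, 10**10
A = [1] * (n - 1) + [M]      # n-1 ones and a single huge value M
true_sum = sum(A)            # = M + n - 1
est = approx_sum(A, eps)
# With probability (1 - 1/n)^r, about 99.9% here, M is never sampled,
# so est = n = 10^5 while the true sum is ~10^10: off by a factor ~10^5.
print(est, true_sum, abs(est - true_sum) < eps * true_sum)
```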
## What is wrong with the proof then?
The proof applies the Chernoff bound incorrectly. The bound in the form used requires each sample $x_i$ to take values in $[0,1]$ (or $\{0,1\}$), but here $x_i \in [1, M]$ and $M$ can be arbitrarily large. Because this precondition fails, the inequality in the proof is not valid, and the conclusion does not follow.
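For comparison, the form of the bound that does apply to samples from a general range $[a, b]$ (Hoeffding's inequality) is
$$
Pr[|X - E[X]| \geq t] \leq 2e^{-2rt^2/(b-a)^2},
$$
so with $a = 1$, $b = M$ and $t = \varepsilon A$ the exponent shrinks by a factor $(M-1)^2$, and the fixed choice $r = 1/\varepsilon^2$ is far too few samples once $M$ is large.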
---
# 3.2 E-part (some streaming algorithm, presumably the one mentioned in the most recent lecture)
Consider the following game. Two people, Alice and Bob, each have $m$ (possibly duplicated) integers from the range $\{1, \dots, n\}$. Let $(a_1,...,a_m)$ denote the numbers that Alice has and $(b_1,...,b_m)$ denote the numbers that Bob has. They want to figure out whether there is a number $x$ that appears strictly more than m times among the $2m$ numbers that they have (i.e., among $a_1,...,a_m$, $b_1,...,b_m$).
We allow Alice to send a message of $log^{O(1)}(n + m)$ bits to Bob (so, sending all $a_1,...,a_m$ is not possible). Then, Bob can reply back with a message of $log^{O(1)}(n + m)$ bits. Then Alice has to either output the number that appears more than m times or determine that such a number does not exist. Describe and analyze a deterministic algorithm for Alice and Bob to do so.
Example: If Alice has $(1,1,3)$ and Bob has $(3,1,1)$, then Alice should answer $1$. If Alice has $(1,1,3)$ and Bob has $(3,3,1)$, then Alice has to answer “no majority”.
Remarks:
1. You may assume that every player knows $m$ and $n$ in the beginning.
2. We do not care about the running time of players. Thus, they can take as much time to compute their messages as they want. We only care about the size of their messages. So, you do not have to analyze their running time.
## Idea
Alice remembers two numbers:
1. A candidate $ID \in \{a_1, \dots, a_m\}$
2. A counter $c \leq m$, initially $c = 0$
Alice looks at one item at a time from $a_1, \dots, a_m$.
For each item $a_i$:
- If $c = 0$, then: $ID \leftarrow a_i$, $c \leftarrow 1$
- Else if $a_i = ID$, then: $c \leftarrow c + 1$
- Else: $c \leftarrow c - 1$
After Alice has looked at $a_1, \dots, a_m$, she sends $c$ and $ID$ to Bob.
Bob runs the same procedure, but starts from Alice's final values of $c$ and $ID$. The message consists of $O(\log n)$ bits for $ID$ and $O(\log m)$ bits for $c$, so $O(\log n) + O(\log m) = O(\log(nm))$ bits in total.
Since $\log(nm) = \log n + \log m \leq 2\log(n + m)$, this fits within the allowed $\log^{O(1)}(n + m)$ bits.
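A short Python sketch of this one-counter pass (our naming; `majority_pass` is not part of the assignment), written so that Bob can resume from Alice's final $(ID, c)$:
```python
def majority_pass(stream, ID=None, c=0):
    # Boyer-Moore-style single counter; (ID, c) may come from a previous pass.
    for x in stream:
        if c == 0:
            ID, c = x, 1
        elif x == ID:
            c += 1
        else:
            c -= 1
    return ID, c

# Alice's pass, then Bob resumes with her state:
a_ID, a_c = majority_pass([1, 1, 3])
b_ID, b_c = majority_pass([3, 1, 1], a_ID, a_c)
```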
---
## Algorithm
**Input** Alice's array $A$ and Bob's array $B$
**Output** An integer which signals which number occurs strictly more than $m$ times across both Alice's and Bob's numbers, or "no majority" if no such integer exists.
```
# help function
occurrences(ID, A):
return the number of occurrences of value ID in A
# help function
find_majority(A):
    counts <- new array of size n with all values initialized to 0
    for i = 1 to m:
        counts[A[i]] <- counts[A[i]] + 1
    if some number x occurs strictly more than m/2 times then:
        maj <- x
        occ <- counts[x]
    else:
        maj <- 1
        occ <- 0  # zero occurrences signals no majority
    return maj, occ
# main algorithm
cross_majority(A, B):
Alice sends A_maj, A_occ = find_majority(A) to Bob # sends two numbers, maj and occ to Bob
if occurrences(A_maj, B) + A_occ > m:
Bob sends A_maj, occurrences(A_maj, B) to Alice
else:
Bob sends B_maj, B_occ = find_majority(B) to Alice
    # the number
    A_rec_maj <- first number received from Bob
    # its number of occurrences in Bob's array
    A_rec_occ <- second number received from Bob
    if A_rec_occ + occurrences(A_rec_maj, A) > m:
        return A_rec_maj
else:
return "no majority"
```
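For concreteness, here is a runnable Python version of the same protocol (a sketch on our part; the function names simply mirror the pseudocode):
```python
from collections import Counter

def occurrences(ID, A):
    # Exact number of occurrences of ID in A.
    return A.count(ID)

def find_majority(A):
    # Return (maj, occ): the value occurring strictly more than len(A)/2
    # times and its count, or (1, 0) if no such value exists.
    val, cnt = Counter(A).most_common(1)[0]
    return (val, cnt) if cnt > len(A) / 2 else (1, 0)

def cross_majority(A, B):
    m = len(A)
    # Alice -> Bob: her majority candidate and its count in A.
    a_maj, a_occ = find_majority(A)
    # Bob -> Alice: whichever candidate can still cross m.
    if occurrences(a_maj, B) + a_occ > m:
        rec_maj, rec_occ = a_maj, occurrences(a_maj, B)
    else:
        rec_maj, rec_occ = find_majority(B)
    # Alice verifies the received candidate against her own array.
    if rec_occ + occurrences(rec_maj, A) > m:
        return rec_maj
    return "no majority"

print(cross_majority([1, 1, 3], [3, 1, 1]))  # -> 1
print(cross_majority([1, 1, 3], [3, 3, 1]))  # -> "no majority"
```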
**Lemma 1:** The algorithm returns a number that appears strictly more than $m$ times, or "no majority" if no such number exists.
**Proof:** For a number to appear strictly more than $m$ times among the $2m$ numbers, it must occur strictly more than $m/2$ times in at least one of the two arrays: if it occurred at most $m/2$ times in each, it would occur at most $m$ times in total. Such a number is exactly what `find_majority` reports, so it is one of the candidates Alice and Bob send each other. Bob forwards Alice's candidate only if its exact counts in both arrays already sum to more than $m$, in which case it is the answer; otherwise any number appearing more than $m$ times must be the majority of Bob's array, which is what he sends instead. Finally, Alice adds the received count to the candidate's exact count in her own array and returns the candidate only if the total exceeds $m$ (the last if statement), so a wrong candidate is never output.
**Lemma 2:** The algorithm only sends messages of $\log^{O(1)}(n + m)$ bits between Alice and Bob.
**Proof:** Alice and Bob each send only two numbers. The first is their array's majority candidate (at most $n$) and the second is how many times it occurs in their array (at most $m$). The two sent numbers therefore take $O(\log n)$ and $O(\log m)$ bits respectively, which is bounded by $\log^{O(1)}(n + m)$.
```
algorithm find_majority(A, ID, C):
    Input:  A: array of size m, with numbers ranging over 1..n
            ID, C: initialization parameters; the counter of ID starts at C.
    Output: ID, C: the ID with the highest count.
    Counter <- associative array
    Counter[ID] <- C
    for i = 1..m:
        if A[i] in Counter:
            Counter[A[i]] <- Counter[A[i]] + 1
        else:
            Counter[A[i]] <- 1
    maxID, maxCounter <- key and value with the highest value in Counter
    return maxID, maxCounter
```
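The same helper in runnable Python (our sketch; `find_majority_seeded` is a made-up name to avoid clashing with the earlier `find_majority`):
```python
def find_majority_seeded(A, ID=None, C=0):
    # Exact counting over A, seeded with a prior (ID, C) pair so that
    # Bob can fold Alice's partial count into his own.
    counter = {}
    if ID is not None:
        counter[ID] = C
    for x in A:
        counter[x] = counter.get(x, 0) + 1
    max_id = max(counter, key=counter.get)
    return max_id, counter[max_id]
```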
```
algorithm Cross_Majority(A, B):
    Input: Alice's array and Bob's array.
    Output: True if there is a majority, False if not.
    // Alice's message:
    Alice_to_Bob = ()
    // Bob's message:
    Bob_to_Alice = ((), ())
    // After Alice receives Bob's message:
```
**Alice 1**
ID, c = find_majority(Alice's array)
return ID, c
**Bob**
Alice's initialization = Alice_1()
ID1, c1 = find_majority(Bob's array)
ID2, c2 = find_majority(Bob's array, Alice's initialization)
return (ID1, c1), (ID2, c2)
**Alice 2**
ID1, c = find_majority(Alice's array, Bob's initialization)
ID2, c = Bob()
if ID1 or ID2:
    return True
else:
    return False
Alice sends the $ID$ with the highest $c$ to Bob.
Bob sends the $ID$ with the highest $c$ to Alice, AND
Bob sends the $ID$ from Misra-Gries initialized with Alice's message.
Alice checks that the $ID$ from Bob's Misra-Gries equals the $ID$ from Misra-Gries initialized with Bob's max-$c$ $ID$.
---
1. Alice runs M-G (Misra-Gries).
2. Alice sends $ID_{maxA}, c_{maxA}, ID_{MG}, c_{MG}$ to Bob.
3. Bob runs M-G with $ID_{MG}$ and $c_{MG}$.
4. Bob sends $ID_{maxBob}, c_{maxBob}$ and his Misra-Gries result back to Alice.