# Group Assignment 3.1-3.2
# 3.1 Approximate sum: C-question
Imagine you have an array $A[1...n]$. Each value in the array is an integer between $1$ and $M$. Consider the following algorithm for finding the approximate sum of the values in the array.
## Algorithm
$r ← 1/ε^2$
$s ← 0$
Repeat $r$ times: $s ← s + A[i]$, where $i ∈[1 ...n]$ is picked randomly
Return $n · s / r$
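For concreteness, here is a minimal Python sketch of this procedure (the function name and the use of `random.choice` are our own choices, not part of the assignment):
```python
import random

def approx_sum(A, eps):
    # The boss's estimator: average r = 1/eps^2 uniform samples, scale by n.
    n = len(A)
    r = int(1 / eps**2)
    s = sum(random.choice(A) for _ in range(r))  # r samples with replacement
    return n * s / r
```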
Your boss claims that this is a good algorithm, and gives the following argument.
## Proof
Let $x_i$ be the value of the $i$th random sample, and $X = r^{-1} \sum x_i$ be their average. Let $A = n^{-1} \sum A[i]$, i.e., $A$ is the average value of the array (the quantity we want to find, divided by $n$). We know that $E[X] = A$. Then, using the Chernoff bound, we know that:
$$Pr[|X - A| \geq \varepsilon \cdot A] = Pr[|X - E[X]| \geq \varepsilon \cdot A] \leq 2e^{-2r(\varepsilon A)^2}$$
We know that $A \geq 1$, so by our choice of $r = 1/\varepsilon^2$ we conclude
$$2e^{-2r(\varepsilon A)^2} \leq 2e^{-2r\varepsilon^2} = 2e^{-2}$$
Thus the probability of error is ≤ 1/3. In other words, with probability at least $2/3$ we have $|X − A|< ε \cdot A$, which implies that $|n \cdot X − n \cdot A|< ε \cdot n \cdot A$, i.e.
$$nA(1 − ε) ≤ n \cdot X ≤ nA(1 + ε)$$
Note that the estimate returned in the algorithm is $n \cdot X$ and that the true sum is $n \cdot A$. We thus conclude that the algorithm returns an estimate which is within a factor $1 \pm ε$ of the actual sum.
What is wrong with this algorithm and proof? Can you give an example of an array A where this algorithm will almost certainly give the wrong answer?
Note: to pass the problem it is not sufficient to only give an example A, you need to also point out what is wrong in the above proof. But finding a “bad” A which breaks the algorithm can be helpful for finding the error.
## Counterexample
Assume that we construct an array as follows:
$$
A[1..n] = [\underbrace{1, \dots, 1}_{n-1 \text{ times}}, M], \quad \text{where } r \ll n \ll M
$$
Then the true mean is $A=\frac{M+n-1}{n} \gtrapprox M/n$, but the sampled mean produced by the algorithm depends on whether the element of value $M$ is among the samples.
If $M$ occurs $k = 0$ times in the sample, the sampled mean is $X|_{k=0} = r / r = 1$.
If $M$ occurs $k \geq 1$ time(s) in the sample, the sampled mean is $X|_{k\geq 1} = (r-k + k \cdot M)/ r \geq M / r$.
Let's say we have $\varepsilon = 0.1$, $r=100$, $n=10^5$, $M=10^{10}$.
According to the proof, $|n \cdot X - n \cdot A| < \varepsilon \cdot n \cdot A$ should hold with probability at least $2/3$.
But this never holds in our case:
For $k = 0$:
$$
\mathrm{LHS}|_{k=0} = n\cdot A - n \cdot X|_{k=0} \geq n \cdot \frac{M}{n} - n \cdot 1 = M - n = 10^{10} - 10^5
$$
$$
\mathrm{RHS} = \varepsilon \cdot n \cdot A = \varepsilon \cdot n \cdot \frac{M + n - 1}{n} = 10^{-1} \cdot (10^{10} + 10^5 - 1) \leq 2 \cdot 10^9
$$
$$
\implies \mathrm{LHS} \geq \mathrm{RHS}
$$
and for $k\geq 1$:
$$
\mathrm{LHS}|_{k\geq 1} = n \cdot X|_{k\geq1} - n\cdot A \geq n \cdot \frac{M}{r} - n \cdot \frac{M + n - 1}{n} \geq \frac{10^5 \cdot 10^{10}}{10^2} - 10^{10} - 10^5 \geq 10^{12}
$$
And once again: $\mathrm{LHS} \geq \mathrm{RHS}$.
Therefore, since the $M$-element is sampled either $0$ or $\geq 1$ times, no outcome of the sampling can produce an estimate within the bound, so the claimed bound is incorrect.
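To see this concretely, here is a small simulation of the counterexample (our own sketch; it redefines the estimator to stay self-contained and uses the parameters $\varepsilon = 0.1$, $n = 10^5$, $M = 10^{10}$ from above):
```python
import random

def approx_sum(A, eps):
    # The boss's estimator from 3.1.
    n, r = len(A), int(1 / eps**2)
    s = sum(random.choice(A) for _ in range(r))
    return n * s / r

eps, n, M = 0.1, 10**5, 10**10
A = [1] * (n - 1) + [M]      # n-1 ones and a single huge value M
true_sum = sum(A)            # = M + n - 1
est = approx_sum(A, eps)
# With probability (1 - 1/n)^r, about 99.9% here, M is never sampled,
# so est = n = 10^5 while the true sum is ~10^10: off by a factor ~10^5.
print(est, true_sum, abs(est - true_sum) < eps * true_sum)
```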
## What is wrong with the proof then?
The proof applies the Chernoff bound incorrectly. The bound in the form used requires each sample $x_i$ to take values in $[0,1]$ (or $\{0,1\}$), but here $x_i \in [1, M]$ and $M$ can be arbitrarily large. Because this precondition fails, the inequality in the proof is not valid, and the conclusion does not follow.
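For comparison, the form of the bound that does apply to samples from a general range $[a, b]$ (Hoeffding's inequality) is
$$
Pr[|X - E[X]| \geq t] \leq 2e^{-2rt^2/(b-a)^2},
$$
so with $a = 1$, $b = M$ and $t = \varepsilon A$ the exponent shrinks by a factor $(M-1)^2$, and the fixed choice $r = 1/\varepsilon^2$ is far too few samples once $M$ is large.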
---
# 3.2 E-part (some streaming algorithm, presumably the one mentioned in the most recent lecture)
Consider the following game. Two people, Alice and Bob, each have $m$ (possibly duplicated) integers from the range $\{1, \dots, n\}$. Let $(a_1,...,a_m)$ denote the numbers that Alice has and $(b_1,...,b_m)$ denote the numbers that Bob has. They want to figure out whether there is a number $x$ that appears strictly more than m times among the $2m$ numbers that they have (i.e., among $a_1,...,a_m$, $b_1,...,b_m$).
We allow Alice to send a message of $log^{O(1)}(n + m)$ bits to Bob (so, sending all $a_1,...,a_m$ is not possible). Then, Bob can reply back with a message of $log^{O(1)}(n + m)$ bits. Then Alice has to either output the number that appears more than m times or determine that such a number does not exist. Describe and analyze a deterministic algorithm for Alice and Bob to do so.
Example: If Alice has $(1,1,3)$ and Bob has $(3,1,1)$, then Alice should answer $1$. If Alice has $(1,1,3)$ and Bob has $(3,3,1)$, then Alice has to answer “no majority”.
Remarks:
1. You may assume that every player knows $m$ and $n$ in the beginning.
2. We do not care about the running time of players. Thus, they can take as much time to compute their messages as they want. We only care about the size of their messages. So, you do not have to analyze their running time.
## Idea
Alice remembers two numbers:
1. A candidate $ID \in \{a_1, \dots, a_m\}$
2. A counter $c \leq m$, initially $c = 0$
Alice looks at one item at a time from $a_1, \dots, a_m$.
For each item $a_i$:
- If $c = 0$, then: $ID \leftarrow a_i$, $c \leftarrow 1$
- Else if $a_i = ID$, then: $c \leftarrow c + 1$
- Else: $c \leftarrow c - 1$
After Alice has looked at $a_1, \dots, a_m$, she sends $c$ and $ID$ to Bob.
Bob runs the same procedure, but starts from Alice's final values of $c$ and $ID$. The message consists of $O(\log n)$ bits for $ID$ and $O(\log m)$ bits for $c$, so $O(\log n) + O(\log m) = O(\log(nm))$ bits in total.
Since $\log(nm) = \log n + \log m \leq 2\log(n + m)$, this fits within the allowed $\log^{O(1)}(n + m)$ bits.
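A short Python sketch of this one-counter pass (our naming; `majority_pass` is not part of the assignment), written so that Bob can resume from Alice's final $(ID, c)$:
```python
def majority_pass(stream, ID=None, c=0):
    # Boyer-Moore-style single counter; (ID, c) may come from a previous pass.
    for x in stream:
        if c == 0:
            ID, c = x, 1
        elif x == ID:
            c += 1
        else:
            c -= 1
    return ID, c

# Alice's pass, then Bob resumes with her state:
a_ID, a_c = majority_pass([1, 1, 3])
b_ID, b_c = majority_pass([3, 1, 1], a_ID, a_c)
```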
---
## Algorithm
**Input** Alice's array $A$ and Bob's array $B$
**Output** An integer which signals which number occurs strictly more than $m$ times across both Alice's and Bob's numbers, or "no majority" if no such integer exists.
```
# help function
occurrences(ID, A):
return the number of occurrences of value ID in A
# help function
find_majority(A):
    counts <- new array of size n with all values initialized to 0
    for i = 1 to m:
        counts[A[i]] <- counts[A[i]] + 1
    if some number x occurs strictly more than m/2 times then:
        maj <- x
        occ <- counts[x]
    else:
        maj <- 1
        occ <- 0  # zero occurrences signals no majority
    return maj, occ
# main algorithm
cross_majority(A, B):
Alice sends A_maj, A_occ = find_majority(A) to Bob # sends two numbers, maj and occ to Bob
if occurrences(A_maj, B) + A_occ > m:
Bob sends A_maj, occurrences(A_maj, B) to Alice
else:
Bob sends B_maj, B_occ = find_majority(B) to Alice
    # the number
    A_rec_maj <- first number received from Bob
    # its number of occurrences in Bob's array
    A_rec_occ <- second number received from Bob
    if A_rec_occ + occurrences(A_rec_maj, A) > m:
        return A_rec_maj
else:
return "no majority"
```
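For concreteness, here is a runnable Python version of the same protocol (a sketch on our part; the function names simply mirror the pseudocode):
```python
from collections import Counter

def occurrences(ID, A):
    # Exact number of occurrences of ID in A.
    return A.count(ID)

def find_majority(A):
    # Return (maj, occ): the value occurring strictly more than len(A)/2
    # times and its count, or (1, 0) if no such value exists.
    val, cnt = Counter(A).most_common(1)[0]
    return (val, cnt) if cnt > len(A) / 2 else (1, 0)

def cross_majority(A, B):
    m = len(A)
    # Alice -> Bob: her majority candidate and its count in A.
    a_maj, a_occ = find_majority(A)
    # Bob -> Alice: whichever candidate can still cross m.
    if occurrences(a_maj, B) + a_occ > m:
        rec_maj, rec_occ = a_maj, occurrences(a_maj, B)
    else:
        rec_maj, rec_occ = find_majority(B)
    # Alice verifies the received candidate against her own array.
    if rec_occ + occurrences(rec_maj, A) > m:
        return rec_maj
    return "no majority"

print(cross_majority([1, 1, 3], [3, 1, 1]))  # -> 1
print(cross_majority([1, 1, 3], [3, 3, 1]))  # -> "no majority"
```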
**Lemma 1:** The algorithm returns a number that appears strictly more than $m$ times, or "no majority" if no such number exists.
**Proof:** For a number to appear strictly more than $m$ times among the $2m$ numbers, it must occur strictly more than $m/2$ times in at least one of the two arrays: if it occurred at most $m/2$ times in each, it would occur at most $m$ times in total. Such a number is exactly what `find_majority` reports, so it is one of the candidates Alice and Bob send each other. Bob forwards Alice's candidate only if its exact counts in both arrays already sum to more than $m$, in which case it is the answer; otherwise any number appearing more than $m$ times must be the majority of Bob's array, which is what he sends instead. Finally, Alice adds the received count to the candidate's exact count in her own array and returns the candidate only if the total exceeds $m$ (the last if statement), so a wrong candidate is never output.
**Lemma 2:** The algorithm only sends messages of $\log^{O(1)}(n + m)$ bits between Alice and Bob.
**Proof:** Alice and Bob each send only two numbers. The first is their array's majority candidate (at most $n$) and the second is how many times it occurs in their array (at most $m$). The two sent numbers therefore take $O(\log n)$ and $O(\log m)$ bits respectively, which is bounded by $\log^{O(1)}(n + m)$.
```
algorithm find_majority(A, ID, C):
    Input:  A: array of size m, with numbers ranging over 1..n
            ID, C: initialization parameters; the counter of ID starts at C.
    Output: ID, C: the ID with the highest count.
    Counter <- associative array
    Counter[ID] <- C
    for i = 1..m:
        if A[i] in Counter:
            Counter[A[i]] <- Counter[A[i]] + 1
        else:
            Counter[A[i]] <- 1
    maxID, maxCounter <- key and value with the highest value in Counter
    return maxID, maxCounter
```
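The same helper in runnable Python (our sketch; `find_majority_seeded` is a made-up name to avoid clashing with the earlier `find_majority`):
```python
def find_majority_seeded(A, ID=None, C=0):
    # Exact counting over A, seeded with a prior (ID, C) pair so that
    # Bob can fold Alice's partial count into his own.
    counter = {}
    if ID is not None:
        counter[ID] = C
    for x in A:
        counter[x] = counter.get(x, 0) + 1
    max_id = max(counter, key=counter.get)
    return max_id, counter[max_id]
```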
```
algorithm Cross_Majority(A, B):
    Input: Alice's array and Bob's array.
    Output: True if there is a majority, False if not.
    // Alice's message:
    Alice_to_Bob = ()
    // Bob's message:
    Bob_to_Alice = ((), ())
    // After Alice receives Bob's message:
```
**Alice 1**
ID, c = find_majority(Alice's array)
return ID, c
**Bob**
Alice's initialization = Alice_1()
ID1, c1 = find_majority(Bob's array)
ID2, c2 = find_majority(Bob's array, Alice's initialization)
return (ID1, c1), (ID2, c2)
**Alice 2**
ID1, c = find_majority(Alice's array, Bob's initialization)
ID2, c = Bob()
if ID1 or ID2:
    return True
else:
    return False
Alice sends the $ID$ with the highest $c$ to Bob.
Bob sends the $ID$ with the highest $c$ to Alice, AND
Bob sends the $ID$ from Misra-Gries initialized with Alice's message.
Alice checks that the $ID$ from Bob's Misra-Gries equals the $ID$ from Misra-Gries initialized with Bob's max-$c$ $ID$.
---
1. Alice runs M-G (Misra-Gries).
2. Alice sends $ID_{maxA}, c_{maxA}, ID_{MG}, c_{MG}$ to Bob.
3. Bob runs M-G with $ID_{MG}$ and $c_{MG}$.
4. Bob sends $ID_{maxBob}, c_{maxBob}$ and his Misra-Gries result back to Alice.