# Sampling Algorithm ## Problem Suppose we want to know the ratio of supporters of some political party to the whole people, or how many supporters. It is often expensive to ask everyone and sum up the data. Instead, we randomly ask a few people, and take the ratio in the selected people as our answer. Intuitively, the more people we ask, the more accurate our result is close to the real answer. ## Problem Formulation Let $U$ be the universe and $G$ be the target group. ### Assume we are given 1. $|U|$ 2. can sample uniformly at random from $U$ 3. given any sample $x \in U$, can check if $x \in G$ ### Goal The goal is to estimate $|G|$. Specifically, given constant $\epsilon, \delta > 0$, want $$ \text{Pr}[(1-\epsilon)|G| \leq \text{output} \leq (1+\epsilon)|G|] \geq 1-\delta $$ We want the the esitimate to be close as enough to the real answer (within the tolerance factor $\epsilon$), with high probability $1-\delta$. ## Algorithm Choose $N$ samples uniformly at random from $U$ with replacement. ## Analysis **Def** For $i = 1,...,N$ $$ Y_i = \begin{cases} 1 & \text{if }X_i\text{ in }G \\ 0 & \text{else} \\ \end{cases} $$ $$ Y = \sum_{i=1}^{N}{Y_i} $$ Obviously, $$ E[Y] = \sum_{i=1}^{N}{E[Y_i]} = \sum_{i=1}^{N}{\text{Pr}[Y_i=1]} = \sum_{i=1}^{N}{\frac{|G|}{|U|}} = N\frac{|G|}{|U|} $$ So the algorithm should output $Y\frac{|U|}{N}$ as the estimation of $|G|$. **The challenge is how big $N$ should be to achieve our goal.** From the goal probability we have $$\begin{align} \text{Pr}[(1-\epsilon)|G| \leq \text{output} \leq (1+\epsilon)|G|] &= \text{Pr}[(1-\epsilon)E[Y] \leq Y \leq (1+\epsilon)E[Y]] \\ &\geq 1-\text{Pr}[Y < (1-\epsilon)E[Y]] - \text{Pr}[Y >(1+\epsilon)E[Y]] \\ &\geq 1-e^{\frac{-\epsilon^2 E[Y]}{2}}-e^{\frac{-\epsilon^2 E[Y]}{3}} \text{ (By Chernoff's Bound)} \\ &\geq1-2e^{\frac{-\epsilon^2 E[Y]}{3}} \\ &=1-2e^{\frac{-\epsilon^2}{3}(N\frac{|G|}{|U|})} \\ \end{align} $$ Also our goal is to make this probability greater than $1-\delta$, so we have $$ 1-2e^{\frac{-\epsilon^2}{3}(N\frac{|G|}{|U|})} \geq 1-\delta $$ Solving the inequality gives us $$ N \geq \frac{3}{\epsilon^2}\frac{|U|}{|G|}\ln\frac{2}{\delta} $$ ###### tags: `Algorithm` `抽樣演算法` `Sampling`