Lecture notes for Approximation and Online Algorithms. Expect typos, grammatical errors, and, rarely, false information.
u-tokyo, course-notes, AaOA
This course aims to relate theories to non-theories:

| Theories | Non-Theories |
|---|---|
| Complexity, Data Structures, Discrete Math | Databases, Machine Learning, Operating Systems |
An online algorithm is one whose outputs are required before all inputs are known. For example, the Secretary Problem.
An optimization model allows us to solve problems using theories.
By answering the following questions with mathematical formulations, we obtain the information necessary to construct an optimization model —
Suppose there are only 2 foods available to survive on; we need to decide how much of each food to get. The four parts can be stated as —
To solve the above model, we can write it in matrix form —
\[
\text{Minimize} \begin{bmatrix}
c_{_G} & c_{_M}
\end{bmatrix} \begin{bmatrix}
x_{_G} \\ x_{_M}
\end{bmatrix}, \text{Subject to} \begin{bmatrix}
a_{_{P, G}} & a_{_{P, M}} \\
a_{_{C, G}} & a_{_{C, M}}
\end{bmatrix}\begin{bmatrix}
x_{_G} \\ x_{_M}
\end{bmatrix} \geq \begin{bmatrix}
r_{_P} \\ r_{_C}
\end{bmatrix}
\]
We can see this becomes a linear programming problem if we multiply both sides of the constraint by \(-1\;\) —
Given matrix \(A\) and vectors \(b\) and \(c\;\),
find vector \(x\) minimizing \(cx\) Subject to \(Ax \leq b\).
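For the diet model above, multiplying the constraint by \(-1\) indeed yields this form:
\[
\begin{bmatrix}
-a_{_{P, G}} & -a_{_{P, M}} \\
-a_{_{C, G}} & -a_{_{C, M}}
\end{bmatrix}\begin{bmatrix}
x_{_G} \\ x_{_M}
\end{bmatrix} \leq \begin{bmatrix}
-r_{_P} \\ -r_{_C}
\end{bmatrix}
\]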
By constructing the optimization model for the problem, we find that algorithms for LP can be used to solve it —
The definition for a problem to be "solvable" is —
Given an input of \(n\) bits,
the running time is bounded by some polynomial in \(n\).
The program could be terribly inefficient, but the problem is still solvable as long as such a program exists.
To formally prove that a problem cannot be solved, we try to prove that —
The new problem is not simpler than an old problem widely agreed to be unsolvable.
The approach to prove this proposition is as follows:
// Assume a program solving the new problem exists:
Output NewProblem(Input1, Input2, ..., Inputk);

// Using NewProblem(), implement a solver for the old problem:
Output OldProblem(Input1, Input2, ..., Inputk){
    // Code using NewProblem()
    // Still in polynomial time
}
If we are able to construct such a program, it means the old problem is also solvable; however, we know the old problem is unsolvable. This contradiction proves our assumption wrong, indicating that the new problem is also unsolvable.
The satisfiability problem, widely known to be NP-Hard, is defined as follows —
Given a Boolean formula in Conjunctive Normal Form (CNF),
does there exist an assignment of the inputs that makes the formula output `true`?
Then, we define our problem —
Given a Boolean formula in CNF,
find an assignment of the inputs that makes the formula output `true`.
If such an assignment exists, output it; if not, output any assignment.
Assume there exists a program solving our problem, `Boolean[] ourProblem(Formula f);`. We are able to use it to solve `satisfiability()` —
boolean satisfiability(Formula f){
    // Run the assumed solver for our problem
    Boolean[] set = ourProblem(f);
    // f is satisfiable iff the returned assignment satisfies it
    if(output_of(f, set) == true){
        return true;
    }
    return false;
}
However, we should not be able to do this, since the satisfiability problem is NP-Hard. Thus, we can conclude that `Boolean[] ourProblem(Formula f);` does not exist, and our problem is not simpler than the satisfiability problem.
Given an undirected graph \(G = \{V, E\}\) with no multi-edges and an integer \(k\), we want to find \(S\), a subgraph of \(G\) with \(k\) vertices that contains the maximum number of edges.
We can establish the optimization model of the densest subgraph problem as follows —
To prove that the problem is NP-Hard, we utilize another problem, k-CLIQUE, which is known to be NP-Hard —
// Assume a solver for the densest subgraph problem exists:
Set densestSubgraph(Set V, Set E, int k);

boolean kClique(Set V, Set E, int k){
    Set S = densestSubgraph(V, E, k);
    // G has a k-clique iff every pair of vertices in the densest
    // k-vertex subgraph is joined by an edge
    for(each pair {u, v} of vertices in S){
        if(contains(E, {u, v}) == false){
            return false;
        }
    }
    return true;
}
As above, if we could solve the densest subgraph problem, we would be able to develop a program for the k-clique problem; however, k-CLIQUE should be unsolvable, thus the densest subgraph problem is also NP-Hard.
There is a series of products to select from for production, each with its own properties; some products are produced by competitors; and there is a set of customers, each with their own preferences, who will purchase the cheapest product satisfying them. Due to a given budget, we can only select a limited number of products to produce, and we want to maximize the number of customers served.
We can establish the optimization model for the problem —
To prove the product selection problem is NP-Complete, we utilize the K-most representative skyline operator problem:
For the Product Selection problem, in the situation where \(m = 0\), no \(f^{'}_i\) can be found, so whenever \(f_i\) exists we choose it. Thus, we can rewrite the objective function as follows, showing that the two problems are equivalent —
\[
\Big\vert \{ i \in \{1,2,\ldots, d\} : \exists j \in S \;\;\text{s.t.}\;\; C_i \leq P_j\} \Big\vert
\]
This supports the construction of the following program —
// Assume a solver for the Product Selection problem exists:
Set productSelection(
    int n, Vector[] P, double[] p,
    int m, Vector[] Q, double[] q,
    int d, Vector[] C, int k
);

Set kMostRepresentativeSkyline(int n, Vector[] P, int d, Vector[] C, int k){
    // With m = 0 there are no competitor products, so our prices are irrelevant
    Double[] p = any(Double[]);
    return productSelection(n, P, p, 0, [], [], d, C, k);
}
Thus, we can conclude that the Product Selection problem is an NP-Hard problem.
An NP-Complete problem is as difficult as satisfiability. To prove a problem is NP-Complete, we must prove both directions —
It doesn't have to be satisfiability; any problem in the NP-Complete class works. A library of NP-Complete problems can be found Here, or from
Computers and Intractability: A Guide to the Theory of NP-Completeness
Given some objects and their weights, we pick only some of them to pack, since we have a limited capacity, and try to maximize the packed weight. The optimization model of the problem can be established as —
Assume \(w\)'s are in increasing order
The problem is known to be NP-Hard, even in this simplified form.
A greedy algorithm can be used to approach the problem —
However, this algorithm does not always obtain the optimal solution.
\[ \begin{align} \texttt{1:} &\;\;\;\; S \leftarrow \varnothing \\[1pt] \texttt{2:} &\;\;\;\; \texttt{For}\;\; j = 1 \;\;\texttt{to} \;\;n \\[1pt] \texttt{3:} &\;\;\;\;\;\;\;\;\;\;\;\; \texttt{If}\;\; \textstyle \sum_{i \in S} \;\; w_i + w_j \leq W \\ \texttt{4:} &\;\;\;\;\;\;\;\;\;\;\;\;\;\;\;\;\;\;\;\; S \leftarrow S \cup \{j\} \\[1pt] \texttt{5:} &\;\;\;\;\;\;\;\;\;\;\;\; \texttt{Else:} \\[2pt] \texttt{6:} &\;\;\;\;\;\;\;\;\;\;\;\;\;\;\;\;\;\;\;\; \texttt{If}\;\; w_j \geq \textstyle\sum_{i \in S}\;\;w_i\\ \texttt{7:} &\;\;\;\;\;\;\;\;\;\;\;\;\;\;\;\;\;\;\;\;\;\;\;\;\;\;\;\; S \leftarrow \{j\}\\ \texttt{8:} &\;\;\;\;\;\;\;\;\;\;\;\;\;\;\;\;\;\;\;\; \texttt{break}\\ \end{align} \]
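As a concrete sketch of the greedy procedure above (my own Java rendering, not lecture code), assuming the weights `w` are already given in increasing order as the notes assume:

```java
import java.util.ArrayList;
import java.util.List;

// Greedy algorithm for the simplified knapsack problem (weights only).
// Assumes w is sorted in increasing order, as stated in the notes.
static List<Integer> greedyKnapsack(double[] w, double W) {
    List<Integer> S = new ArrayList<>();
    double total = 0;                    // current total weight of S
    for (int j = 0; j < w.length; j++) {
        if (total + w[j] <= W) {         // lines 3-4: the item still fits
            S.add(j);
            total += w[j];
        } else {                         // lines 5-8: it no longer fits
            if (w[j] >= total) {         // the single item alone outweighs S
                S.clear();
                S.add(j);
            }
            break;
        }
    }
    return S;
}
```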
We define the following 2 terms:
If, for all inputs of a problem, some algorithm satisfies —
\[
\mathrm{SOL} \gtreqqless \alpha \cdot \mathrm{OPT}
\]
We call it an \(\alpha\)-approximation Algorithm.
We can see that the greedy approach to the Simplified Knapsack Problem is at least a 0.5-approximation algorithm.
At the last step, we split a collection of objects whose total weight exceeds \(W\) into 2 parts and choose the heavier one. This means the solution weighs at least \(W \over 2\).
Since OPT cannot be heavier than \(W\), we know the solution weighs at least \(\mathrm{OPT} \over 2\).
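Combining the two observations:
\[
\mathrm{SOL} \geq \frac{W}{2} \geq \frac{\mathrm{OPT}}{2}
\]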
In the original form of the Knapsack problem, each object is associated with a value. Instead of maximizing the sum of weights, we try to maximize the sum of values.
The optimization model of the problem can be established as —
A greedy algorithm can be used to approach the problem; the difference from the previous algorithm is that, when we run out of capacity, we compare total values instead of total weights.
This algorithm is also proven to be a 0.5-approximation algorithm.
\[ \begin{align} \texttt{1:} &\;\;\;\; S \leftarrow \varnothing \\[1pt] \texttt{2:} &\;\;\;\; \texttt{For}\;\; j = 1 \;\;\texttt{to} \;\;n \\[1pt] \texttt{3:} &\;\;\;\;\;\;\;\;\;\;\;\; \texttt{If}\;\; \textstyle \sum_{i \in S} \;\; w_i + w_j \leq W \\ \texttt{4:} &\;\;\;\;\;\;\;\;\;\;\;\;\;\;\;\;\;\;\;\; S \leftarrow S \cup \{j\} \\[1pt] \texttt{5:} &\;\;\;\;\;\;\;\;\;\;\;\; \texttt{Else:} \\[2pt] \texttt{6:} &\;\;\;\;\;\;\;\;\;\;\;\;\;\;\;\;\;\;\;\; \texttt{If}\;\; v_j \geq \textstyle\sum_{i \in S}\;\;v_i\\ \texttt{7:} &\;\;\;\;\;\;\;\;\;\;\;\;\;\;\;\;\;\;\;\;\;\;\;\;\;\;\;\; S \leftarrow \{j\}\\ \texttt{8:} &\;\;\;\;\;\;\;\;\;\;\;\;\;\;\;\;\;\;\;\; \texttt{break}\\ \end{align} \]
Consider a 2-layered database system: when you access the system to obtain some data, the first layer is checked before querying the second, more expensive layer.
We call the first layer the Filter.
Suppose the total number of possible data items, \(w\), is extremely huge. It is not possible to have an array `boolean arr[w]` with one entry per possible item. In this case, we hash items into an array `arr` of size \(m \ll w\). We set `arr[f(i)] = true` only when some data item `i` is added into the database.
When checking a data item's presence, we check whether `arr[f(i)]` is `true`.
However, a false positive might occur when two data items `i`, `j` have the same hash value (`f(i) == f(j)`), causing the system to enter the second layer unnecessarily.
Suppose \(n\) data items are already present; the probability that some absent data item `i` triggers a false positive is:
\[ 1 - \left( 1 - \frac{1}{m} \right)^n \]
As \(n\) grows, this gets unbearably close to 1.
Instead of using a single hash function, we use multiple: `f_1(), f_2(), ..., f_p()`. When some data item `i` is added into the database, all of `arr[f_1(i)], arr[f_2(i)], ..., arr[f_p(i)]` are set to `true`.
When checking a data item's presence, we check whether all of `arr[f_1(i)], arr[f_2(i)], ..., arr[f_p(i)]` are `true`.
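As an illustrative sketch (not lecture code), the multi-hash filter can be written in Java roughly as follows; simulating the \(p\) hash functions by salting a single hash with the function index is my own simplification:

```java
import java.util.BitSet;

// A minimal multi-hash filter sketch: an array of m bits, p hash functions.
class SimpleFilter {
    private final BitSet arr;
    private final int m;   // array size
    private final int p;   // number of hash functions

    SimpleFilter(int m, int p) {
        this.arr = new BitSet(m);
        this.m = m;
        this.p = p;
    }

    // The j-th hash function f_j(), simulated by mixing j into the hash code.
    private int hash(String item, int j) {
        int h = (item.hashCode() * 31 + j) % m;
        return h < 0 ? h + m : h;
    }

    // When an item is added, set arr[f_1(i)], ..., arr[f_p(i)] to true.
    void add(String item) {
        for (int j = 0; j < p; j++) {
            arr.set(hash(item, j));
        }
    }

    // An item may be present only if all arr[f_j(i)] are true.
    boolean mightContain(String item) {
        for (int j = 0; j < p; j++) {
            if (!arr.get(hash(item, j))) {
                return false;   // definitely not present
            }
        }
        return true;            // possibly present (may be a false positive)
    }
}
```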
Suppose \(n\) data items are already present and \(p\) different hash functions are used. For some absent data item `i` and some hash function `f_j()`, the probability of `arr[f_j(i)]` being wrongly set to `true` is:
\[ 1 - \left( 1 - \frac{1}{m} \right)^{np} \]
In total, the action "set some bit to `true`" has occurred \(n \cdot p\) times.
Then, the probability that a false positive occurs is:
\[
\left( 1 - \left( 1 - \frac{1}{m} \right)^{np} \right)^p
\]
All \(p\) hash functions must fail.
Then, we find that the probability of a false positive is minimized when
\[
p = \displaystyle\frac{m}{n} \ln{2}
\]
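For illustration with hypothetical numbers: if the array is ten times larger than the number of stored items, i.e. \(m = 10n\), then
\[
p = \frac{m}{n}\ln 2 = 10 \ln 2 \approx 6.93,
\]
so roughly \(7\) hash functions should be used.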
Assume we know:
- \(P_i\): the probability of data item `i` being present in the database
- \(p_i\): the number of bits set to `true` when data item `i` is added

Let \(S\) be the set of all data items present; then the probability of some `arr[j]` not being set is:
being not set is:
\[
\left( 1 - \frac{1}{m} \right)^{\sum_{i\in S}p_i}
\]
According to Optimizing data popularity conscious bloom filters (PODC '08), the probability of a false positive is minimized when, for all `j`, the probability of `arr[j] == true` is \(0.5\).
Then, we can get:
\[ \color{white}{\Rightarrow} \left( 1 - \frac{1}{m} \right)^{\sum_{i\in S}p_i} = 0.5 \Rightarrow \left( \displaystyle\sum_{i \in S}p_i \right) \cdot \ln{\left( 1 - \frac{1}{m} \right)} = \ln{0.5} \]
Since as \(x \rightarrow 0\), we get \(1 - x \approx e^{-x}\), implying
\[
\begin{align}
&\color{white}{\Rightarrow} \left( \displaystyle\sum_{i \in S}p_i \right) \cdot \ln{\left(e^{-\frac{1}{m}}\right)} = \ln{0.5} \\
&\Rightarrow \left( \displaystyle\sum_{i \in S}p_i \right) \cdot \left( -{1 \over m} \right) = \ln{0.5} \Rightarrow {1 \over m} \displaystyle\sum_{i \in S} p_i = -\ln{0.5} = \ln{2} \\
&\Rightarrow \displaystyle\sum_{i\in S} p_i = m\ln 2
\end{align}
\]
Since \(S\) is unknown, we cannot know the precise sum of all \(p_i\); however, we do know the probability of \(i \in S\), so we aim for the expected value:
\[
\displaystyle\sum_{i} P_i \cdot p_i = m \ln 2
\]
For some `i` not in \(S\), the probability that all of its \(p_i\) bits are `true` is \(0.5^{p_i}\). If we check every data item once, the expected number of false positives is:
\[
\displaystyle\sum_{i} 0.5^{p_i}
\]
Accordingly, we can form the optimization model, with:
- \(P_i\): the probability of \(i \in S\)
- \(p_i\): the number of bits to set to `true` when `i` enters the database

To deal with the exponentials in the objective function, we can break them down into multiple terms. Let there be new variables \(p_i^{(1)}, p_i^{(2)}, p_i^{(3)}, \ldots\), where:
\[
p_i^{(j)} = \begin{cases}
1, &j \leq p_i \\
0, &\text{otherwise}
\end{cases}
\]
Effectively, it counts up until \(p_i\). For example, if \(p_5 = 3\), then
\[
p_5^{(1)} = p_5^{(2)} = p_5^{(3)} = 1 \\ p_5^{(4)} = p_5^{(5)} = \ldots = 0
\]
Then we could break the objective function into:
\[
\begin{align}
\displaystyle\sum_i (0.5)^{p_i} &= \displaystyle\sum_i\left( 1 - \displaystyle\sum_j 0.5^j\cdot p_i^{(j)} \right) \\
&= \displaystyle\sum_i 1 - \displaystyle\sum_i\displaystyle\sum_j0.5^j\cdot p_i^{(j)}
\end{align}
\]
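As a quick check with the earlier example \(p_5 = 3\):
\[
1 - \sum_j 0.5^j \cdot p_5^{(j)} = 1 - (0.5 + 0.25 + 0.125) = 0.125 = 0.5^{3}
\]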
Since the total number of data items \(N = \sum_i 1\) is a constant, minimizing the objective function is equivalent to maximizing:
\[
\displaystyle\sum_i\displaystyle\sum_j0.5^j\cdot p_i^{(j)}
\]
Furthermore, let there be \(w_i^{(j)} = P_i\), \(h_i^{(j)} = 0.5^j\), and \(Q = \{\, p_i^{(j)} : p_i^{(j)} = 1 \,\}\), the set of selected variables.
Since \(\sum_j p_i^{(j)} = p_i\), we have
\[
\begin{align}
\displaystyle\sum_i P_i \cdot p_i &= \displaystyle\sum_i P_i \cdot \left( \displaystyle\sum_j p_i^{(j)} \right) = \displaystyle\sum_i \left( \displaystyle\sum_j P_i \cdot p_i^{(j)} \right) \\ &= \displaystyle\sum_i \displaystyle\sum_j w_i^{(j)} \cdot p_i^{(j)} \\ &= \displaystyle\sum_{p_i^{(j)} \in Q} w_i^{(j)}
\end{align}
\]
And we can further modify the objective function into:
\[
\displaystyle\sum_i\displaystyle\sum_j0.5^j\cdot p_i^{(j)}
= \displaystyle\sum_{p_i^{(j)} \in Q} h_i^{(j)}
\]
Now, let \(W = m \cdot \ln{2}\). We now have the following optimization model:
Notice that the constraint has changed from \(=\) to \(\leq\). This is because it is unlikely we will end up with a sum less than \(W\): since we aim to select as many items as possible, the sum becomes \(W\) anyway.
Now that the problem has been transformed into the KNAPSACK PROBLEM, we can apply the 0.5-approximation algorithm to decide how to pick the numbers \(p_i\).
Given a graph \(G = \left\{ V, E \right\}\) where \(E \subseteq \left\{ \{u, v\} : u, v \in V \right\}\), we want to select a set of vertices that covers all edges.
The optimization model can be described as:
Let there be variables \(x_1, x_2 \ldots x_{\left|V\right|}\), where
\[
x_u = \begin{cases}
1: u \in S \\
0: u \notin S
\end{cases}
\]
Then, we can change the output to \(\mathbf{x} = \left[ x_1, x_2 \ldots x_{\left|V\right|}\right]^\mathrm{T}\) instead of \(S\). As for \(\left|S\right|\), we know \(\left|S\right| = \sum_ix_i\).
Since we know each \(x_i\) is either \(1\) or \(0\), we can instead write it as the following, where \(\mathbf{c} = \left[ 1, 1 \ldots 1 \right]^\mathrm{T}\):
\[
\left| S \right| = \mathbf{c}^\mathrm{T}\mathbf{x}
\]
For some edge \(\left\{ u, v \right\}\) to be covered, it is equivalent to requiring that \(x_u + x_v \geq 1\). That is, we can rewrite the constraints in the form \(A\mathbf{x} \geq \mathbf{b}\):
\[
\begin{bmatrix}
1 & 1 & \cdots & 0 \\ 1 & 0 & \cdots & 0 \\ 0 & 1 & \cdots & 0 \\ \vdots & \vdots & \ddots & \vdots \\ 0 & 0 & \cdots & 1
\end{bmatrix}
\begin{bmatrix}
x_1 \\ x_2 \\ \vdots \\ x_{\left| V \right|}
\end{bmatrix} \geq
\begin{bmatrix}
1 \\ 1 \\ \vdots \\ 1
\end{bmatrix}
\]
Each row has exactly two entries set to \(1\) and the rest set to \(0\), with one row for each edge \(\{u, v\} \in E\).
Thus, we can rewrite the optimization model as the following:
The previous optimization model is almost in linear programming form, but since \(x_u \in \{0, 1\}\), we cannot simply feed it to a solver library like CPLEX.
Consider another problem, the FRACTIONAL VERTEX COVER problem, where the only difference is that we allow \(x_u\) to be any real number between \(0\) and \(1\):
We can use this problem to help develop an approximation algorithm for the VERTEX COVER problem.
Consider the following algorithm for VERTEX COVER:
\[ \begin{align} &\texttt{1.} \qquad \mathbf{x} = \texttt{solveFractionalVertexCover()} \\ &\texttt{2.} \qquad \texttt{forall}\quad x_i \quad\texttt{in}\quad \mathbf{x} \\ &\texttt{3.} \qquad\quad \texttt{if}\quad x_i \geq0.5 \\ &\texttt{4.} \qquad\quad\quad y_i \leftarrow 1 \\ &\texttt{5.} \qquad\quad \texttt{else} \\ &\texttt{6.} \qquad\quad\quad y_i \leftarrow 0 \\ &\texttt{7.} \qquad \texttt{return} \quad \mathbf{y} \end{align} \]
1. The output satisfies the constraint of VERTEX COVER. The constraint of FRACTIONAL VERTEX COVER states that:
\(\forall \{u, v\} \in E,\)
\[
\begin{align}
&\color{white}{\Rightarrow\quad }x_u + x_v \geq 1 \\
&\rightarrow \quad x_u \geq 0.5 \lor x_v \geq 0.5 \\
&\rightarrow \quad y_u = 1 \lor y_v = 1 \\
&\rightarrow \quad y_u + y_v \geq 1 \\
\end{align}
\]
This is exactly the constraint of VERTEX COVER.
2. For any input, \(\mathrm{SOL} \leq 2 \cdot \mathrm{OPT}\).
Let \(\mathbf{x}\) be the optimal solution of FRACTIONAL VERTEX COVER and \(\mathbf{x}^*\) be the optimal solution of VERTEX COVER.
Since FRACTIONAL VERTEX COVER gives more freedom to minimize the objective function, we know \(\mathbf{c}^\mathrm{T}\mathbf{x} \leq \mathbf{c}^\mathrm{T}\mathbf{x}^*\).
By the rounding in lines 3 - 6 of the algorithm, we have \(\forall i : y_i \leq 2\cdot x_i\). Thus,
\[ \begin{cases} \mathrm{SOL} = \mathbf{c}^\mathrm{T}\mathbf{y} = \displaystyle\sum_i y_i \\ \displaystyle\sum_i y_i \leq 2 \cdot \displaystyle\sum_i x_i \\ 2 \cdot \displaystyle\sum_i x_i = 2 \cdot \mathbf{c}^\mathrm{T}\mathbf{x} \\ 2 \cdot \mathbf{c}^\mathrm{T}\mathbf{x} \leq 2 \cdot \mathbf{c}^\mathrm{T}\mathbf{x}^* \\ 2 \cdot \mathbf{c}^\mathrm{T}\mathbf{x}^* = 2 \cdot \mathrm{OPT} \end{cases} \Rightarrow \mathrm{SOL} \leq 2 \cdot \mathrm{OPT} \]
This proves that the algorithm is a 2-approximation algorithm for VERTEX COVER.
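As a small illustration (my own, not from the lecture): on a triangle graph, the fractional optimum sets every \(x_u = 0.5\) for a cost of \(1.5\); rounding then selects all three vertices, so \(\mathrm{SOL} = 3\), while the integral optimum needs \(2\) vertices, and indeed \(3 \leq 2 \cdot 2\).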
It is similar to VERTEX COVER, where we change the set of edges into a set of subsets of \(V\). We can still transform it into the same matrix form.
This time, for each set \(E_i = \{e_{i,1}, e_{i,2}, \ldots\} \in E\), the corresponding row of \(A\) has those entries set to \(1\) and the others set to \(0\).
\[ \begin{cases} \left\vert V \right\vert &= 5 \\ E &= \{\{1, 3, 4\}, \{1, 2\}, \{2, 5\}, \{3, 4, 5\}\} \end{cases} \\[15px] \Downarrow \\[15px] A = \begin{bmatrix} 1 & 0 & 1 & 1 & 0 \\ 1 & 1 & 0 & 0 & 0 \\ 0 & 1 & 0 & 0 & 1 \\ 0 & 0 & 1 & 1 & 1 \end{bmatrix} \]
Thus, the optimization model can instead be written as on the RHS.
It is not enough to simply reuse VERTEX COVER's algorithm. Say a constraint only requires \(x_1 + x_2 + x_3 \geq 1\); then the solution might be \(\left[ \frac{1}{3}, \frac{1}{3}, \frac{1}{3} \right]^\mathrm{T}\), which rounds to \([0, 0, 0]^\mathrm{T}\), not satisfying the constraints of SET COVER.
Let the largest set in \(E\) have size \(k\). Using the fact that:
\[
x_{u_1} + x_{u_2} + \cdots + x_{u_m} \geq 1 \geq \frac{m}{k} \implies \exists i : x_{u_i} \geq \frac{1}{k}
\]
We then modify the algorithm so that, when rounding the solution of FRACTIONAL SET COVER, we set \(y_u = 1\) when \(x_u \geq \frac{1}{k}\). By this rounding, we have \(\forall i : y_i \leq k \cdot x_i\). Thus,
\[ \begin{cases} \mathrm{SOL} = \mathbf{c}^\mathrm{T}\mathbf{y} = \displaystyle\sum_i y_i \\ \displaystyle\sum_i y_i \leq k \cdot \displaystyle\sum_i x_i \\ k \cdot \displaystyle\sum_i x_i = k \cdot \mathbf{c}^\mathrm{T}\mathbf{x} \\ k \cdot \mathbf{c}^\mathrm{T}\mathbf{x} \leq k \cdot \mathbf{c}^\mathrm{T}\mathbf{x}^* \\ k \cdot \mathbf{c}^\mathrm{T}\mathbf{x}^* = k \cdot \mathrm{OPT} \end{cases} \Rightarrow \mathrm{SOL} \leq k \cdot \mathrm{OPT} \]
Thus we know the algorithm is a \(k\)-approximation algorithm, but sometimes a terrible one. Consider a problem instance with \(k = n\): the algorithm that simply returns \(\mathbf{x} = [1, 1, \cdots 1]^\mathrm{T}\) is also an \(n\)-approximation algorithm, since \(\mathrm{OPT}\) must set at least one \(x_u = 1\).
If an algorithm's execution varies according to a probability distribution, we call it a Randomized Algorithm.
By running a randomized algorithm many times, we can almost guarantee that some run produces the optimal solution.
Let us try combining the rounding process with randomization, trying different settings and evaluating their performance.
For the first randomized rounding, we generate \(\mathbf{y} = \left[ y_1, y_2, \cdots y_n \right]^\mathrm{T}\) by:
\[
\begin{cases}
P(y_i = 1) = x_i \\
P(y_i = 0) = 1 - x_i
\end{cases}
\]
For each set \(E_i \in E\), there is a constraint requiring \(s = \displaystyle\sum_{u \in E_i} x_u \geq 1\), implying that the probability of a single constraint failing is:
\[
\prod_{u \in E_i} (1 - x_u) \leq \prod_{u \in E_i} e^{-x_u} = e^{-s} \leq e^{-1} \leq 0.37
\]
So the probability of satisfying each single constraint is only about \(0.63\). This is not good enough once the number of constraints is even slightly large.
For the new randomized rounding, we generate \(\mathbf{y} = \left[ y_1, y_2, \cdots y_n \right]^\mathrm{T}\) by:
\[
\begin{cases}
P(y_i = 1) = 1 - (1 - x_i)^t \\
P(y_i = 0) = 1 - (1 - (1 - x_i)^t) = (1 - x_i)^t
\end{cases}
\]
For each set \(E_i \in E\), there is a constraint requiring \(s = \displaystyle\sum_{u \in E_i} x_u \geq 1\),
implying that the probability of a single constraint failing is:
\[
\prod_{u \in E_i} (1 - x_u)^t \leq \prod_{u \in E_i} e^{-x_ut} = e^{-ts} \leq e^{-t}
\]
Thus, by the union bound, the probability that some constraint fails is at most \(e^{-t} \cdot \vert E \vert\).
Next, we aim to reduce the probability that some constraint fails to at most \(1 \over 4\).
\[ \begin{align} e^{-t} \cdot \vert E \vert \leq \frac{1}{4} &\Rightarrow e^{-t} \leq \frac{1}{4 \cdot \vert E \vert} \\ &\Rightarrow \ln e^{-t} \leq \ln{\frac{1}{4\cdot \vert E \vert}} \\ &\Rightarrow -t \leq -\ln{4 \vert E \vert} \\ &\Rightarrow t \geq \ln{\vert E \vert} + \ln{4} \end{align} \]
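For instance, with \(\vert E \vert = 1000\) constraints, \(t = \ln 1000 + \ln 4 \approx 6.91 + 1.39 \approx 8.3\) suffices.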
Finally, we end up with the following algorithm:
\[
\begin{align}
&\texttt{1.} \qquad \mathbf{x} = \texttt{solveFractionalSetCover()} \\
&\texttt{2.} \qquad \texttt{while(} \;\; \mathbf{y} \;\; \texttt{doesn't satisfy all constraints)}\\
&\texttt{3.} \qquad \quad\texttt{forall}\quad x_i \quad\texttt{in}\quad \mathbf{x} \\
&\texttt{4.} \qquad\quad\quad y_i \leftarrow 1 \quad\texttt{with probability}\quad 1 - (1 - x_i)^{\ln{\vert E \vert} + \ln{4}}\\
&\texttt{5.} \qquad \texttt{return} \quad \mathbf{y}
\end{align}
\]
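The loop above could be sketched in Java as follows (my own sketch; `x` is the fractional solution, `sets[i]` lists the variable indices of constraint \(E_i\), and `satisfies` is a hypothetical helper):

```java
import java.util.Random;

// Randomized rounding: y_i = 1 with probability 1 - (1 - x_i)^t, retried
// until every set contains at least one selected element.
static boolean[] randomizedRounding(double[] x, int[][] sets, Random rng) {
    double t = Math.log(sets.length) + Math.log(4);   // t = ln|E| + ln 4
    boolean[] y = new boolean[x.length];
    do {
        for (int i = 0; i < x.length; i++) {
            y[i] = rng.nextDouble() < 1 - Math.pow(1 - x[i], t);
        }
    } while (!satisfies(y, sets));
    return y;
}

// Every set must contain at least one selected element.
static boolean satisfies(boolean[] y, int[][] sets) {
    for (int[] set : sets) {
        boolean covered = false;
        for (int u : set) {
            if (y[u]) { covered = true; break; }
        }
        if (!covered) return false;
    }
    return true;
}
```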
The Markov Inequality
For some nonnegative random variable \(X\) with expected value \(\mu(X)\) and any \(a > 0\),
\[
P(X > a) \leq \frac{\mu(X)}{a}
\]
The Bernoulli Inequality
For \(x, r \in \mathbb{R}\) with \(x \geq -1\) and \(r \geq 1\),
\[
(1 + x)^r \geq 1 + rx
\]
The expected value of \(\mathrm{SOL}\) is:
\[
\begin{align}
\mu(\mathrm{SOL}) = \mu(\sum_{i \in V}y_i) &= \sum_{i \in V}\left\{1 \cdot \left[ ( 1 - (1 - x_i)^{\ln{\vert E \vert} + \ln{4}} ) \right] + 0 \cdot (1 - x_i)^{\ln{\vert E \vert} + \ln{4}}\right\} \\
&= \sum_{i \in V}\left[ ( 1 - (1 - x_i)^{\ln{\vert E \vert} + \ln{4}} ) \right] \\
\color{#C0C0C0}{1 - (1 - x_i)^{\ln{\vert E \vert} + \ln{4}} \leq(\ln{\vert E \vert} + \ln{4})x_i\qquad} &\leq \sum_{i \in V}\left[ \left( \ln{\vert E \vert} + \ln{4} \right)x_i \right] \\
&= \left( \ln{\vert E \vert} + \ln{4} \right) \sum_{i \in V} x_i \\
&\leq \left( \ln{\vert E \vert} + \ln{4} \right)\cdot \mathrm{OPT}
\end{align}
\]
By the Markov inequality, the probability of the solution being far from optimal is low.
\[
P(\mathrm{SOL} \geq 4 \cdot \mu(\mathrm{SOL}) \color{#C8C8C8}{\;\leq\; 4 \cdot \left( \ln{\vert E \vert} + \ln{4} \right) \cdot \mathrm{OPT}}) \leq {1 \over 4}
\]
PCA is a technique to reduce the dimensionality of records. By lowering the dimensionality, it simplifies the computation needed to extract information with machine learning algorithms.
The above optimization model is solvable by semi-definite programming.
However, when we reduce dimensions, combining too many original dimensions into a single one can make it difficult to understand the characteristics of the resulting dimension. Thus, we tend to limit the number of chosen dimensions (the number of non-zero entries in \(\mathbf{v}\)).
This leads to the following optimization model, which is NP-Hard:
SPARSE PCA can be tackled with a randomized approximation algorithm: first solve the relaxed version, which is solvable by semi-definite programming:
From the result, we define the following probability to get \(\mathbf{v}\) from \(\mathbf{u}\), using some parameter \(s \geq 1\):
\[
\begin{cases}
\displaystyle P(v_i = \frac{u_i}{p_i}) &= p_i = \min\left\{\frac{s \cdot \vert u_i \vert}{\sum_{j = 1}^n \vert u_j \vert}, 1\right\} \\
\displaystyle P(v_i = 0) &= 1 - p_i
\end{cases}
\]
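A sketch of this sampling step in Java (my own illustration; `u` is the relaxed solution and `s` the parameter above):

```java
import java.util.Random;

// Randomly sparsify the relaxed solution u, keeping entry i with probability p_i.
static double[] sparsify(double[] u, double s, Random rng) {
    double l1 = 0;
    for (double ui : u) l1 += Math.abs(ui);                   // sum_j |u_j|

    double[] v = new double[u.length];
    for (int i = 0; i < u.length; i++) {
        double p = Math.min(s * Math.abs(u[i]) / l1, 1.0);    // p_i
        if (rng.nextDouble() < p) {
            v[i] = u[i] / p;                                  // v_i = u_i / p_i
        } else {
            v[i] = 0;                                         // v_i = 0
        }
    }
    return v;
}
```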
Then, the expected number of non-zero entries of \(\mathbf{v}\) is:
\[
\mu(\left\vert\left\{v_i : v_i \neq 0\right\}\right\vert) = \sum_{i = 1}^n p_i \leq \frac{s \cdot \displaystyle\sum_{j = 1}^n |u_j|}{\displaystyle\sum_{j = 1}^n |u_j|} = s
\]
The expected value of the sum of squares of the vector entries is:
When \(p_i = \frac{s \cdot \vert u_i \vert}{\sum_{j = 1}^n \vert u_j \vert}\),
\[
\begin{align}
\frac{u_i^2}{p_i} = \frac{u_i^2}{s \cdot \vert u_i \vert} \cdot \sum_{j = 1}^n \vert u_j \vert = \frac{\vert u_i \vert}{s} \cdot \sum_{j = 1}^n \vert u_j \vert \leq \frac{\vert u_i \vert}{s}\left[ \vert u_i \vert + \sum_{j = 1}^n \vert u_j \vert \right] &= \frac{u_i^2}{s} + \frac{\vert u_i \vert}{s}\sum_{j = 1}^n \vert u_j \vert \\
& \leq u_i^2 + \frac{\vert u_i \vert}{s}\sqrt{k^{\color{white}{|}}}
\end{align}
\]
When \(p_i = 1\),
\[
\frac{u_i^2}{p_i} = \frac{u_i^2}{1} = u_i^2
\]
Summing both up,
\[
\begin{align}
\mu(\displaystyle\sum_{i = 0}^{n} v_i^2) &= \displaystyle\sum_{i = 0}^{n} \mu(v_i^2) = \displaystyle\sum_{i = 0}^{n} \left( \frac{u_i}{p_i} \right)^2 \cdot p_i = \displaystyle\sum_{i = 0}^{n} \frac{u_i^2}{p_i} \\
&= \sum_{p_i = \frac{s \cdot \vert u_i \vert}{\sum_{j = 1}^n \vert u_j \vert}}\frac{u_i^2}{p_i} + \sum_{p_i = 1} \frac{u_i^2}{p_i} \\
&= \sum_{p_i = \frac{s \cdot \vert u_i \vert}{\sum_{j = 1}^n \vert u_j \vert}} \left[ u_i^2 + \frac{\vert u_i \vert}{s}\sqrt{k^{\color{white}{|}}} \right] + \sum_{p_i = 1} u_i^2 \\
&\leq \displaystyle\sum_{i = 0}^{n} u_i^2 + \displaystyle\sum_{i = 0}^{n} \frac{\vert u_i \vert}{s}\sqrt{k^{\color{white}{|}}} \\
&= 1 + \frac{\sqrt{k^{\color{white}{|}}}}{s}\displaystyle\sum_{i = 0}^{n} \vert u_i \vert \leq 1 + \frac{k}{s}
\end{align}
\]
Now, we have two contradicting requirements:
According to research, \(s = 200k\) is a good choice.
Research shows that the above randomized algorithm for SPARSE PCA satisfies
\[
\mathrm{SOL}\geq \mathrm{OPT} - 1
\]
The form \(\mathrm{SOL}\geq \mathrm{OPT} - \beta\) is called an Additive Approximation Ratio. It is rarely used, since it cannot provide a good guarantee for every input.
To prove that an \(\alpha\)-approximation algorithm for some optimization model is impossible, we can modify the optimization model and prove the modified version unsolvable, in such a way that, if the new optimization model were solvable, it would yield an \(\alpha\)-approximation algorithm for the original problem.
The optimization model of VERTEX COVER (LHS) can be rewritten as the RHS, where only the optimal solution is feasible:
We can then easily change the constraints so that any solution must be within \(1.9999\) times the optimal solution; if this modified model is unsolvable, no such approximation algorithm exists.
K-CENTER and K-MEDIAN are two common optimization problems for clustering data. The optimization model of K-CENTER is as follows:
We can rewrite the optimization model so that any solution must fall within \(1.9999 \cdot \mathrm{OPT}\):
Using the DOMINATING SET problem, which is known to be NP-Hard, we can complete the proof. The optimization model of DOMINATING SET is:
If 1.9999-APPROXIMATION K-CENTER is solvable, we can construct the following program to solve DOMINATING SET:
// Assume a solver for 1.9999-approximation K-CENTER exists:
Set approxKCenter19999(int n, double d[][], int k);

boolean dominatingSet(Set V, Set E, int k){
    double[][] d = new double[V.size()][V.size()];
    // Set up the distance function: 0 on the diagonal, 1 for adjacent
    // vertices, 2 for non-adjacent vertices
    for(int i = 0; i < V.size(); ++i){
        for(int j = 0; j < V.size(); ++j){
            if(i == j){
                d[i][j] = 0;
            }
            else if(E.contains({i, j})){
                d[i][j] = 1;
            }
            else{
                d[i][j] = 2;
            }
        }
    }
    Set S = approxKCenter19999(V.size(), d, k);
    // A dominating set of size k exists iff every vertex is within
    // distance 1 of some chosen center
    if(S.maxCenterDistance() <= 1){
        return true;
    }
    else{
        return false;
    }
}
But DOMINATING SET should not be solvable, thus 1.9999-APPROXIMATION K-CENTER is also unsolvable.
The problem aims to find a subset of a given size from a set of vertices, where each selected vertex has some probability of providing value to elements of another set, and we want to maximize the sum of values. The optimization model is as follows:
Not only is finding the optimal solution NP-Hard; even finding a single feasible solution is NP-Hard.
Refer to the lecture document for why any \(\alpha\)-approximation algorithm is impossible.
We try to decide the best strategy for purchasing ski passes. A one-day pass costs \(400\), and a yearly pass costs \(9000\). We cannot predict whether we will go skiing in the future. The optimization model can be described as:
The highlighted part implies that the optimization model requires an online algorithm to solve.
The Competitive Ratio describes how effective our online algorithm is. In online algorithms, we define \(\mathrm{SOL}_i\) and \(\mathrm{OPT}_i\) as the objective value and the optimal value at step \(i\).
\[
\textrm{competitive ratio} = \max_{i = 1}^{365}{\frac{\mathrm{SOL}_i}{\mathrm{OPT}_i}} \\
\forall i \in \mathbb{Z}^+, \quad\textrm{competitive ratio} \geq \frac{\mathrm{SOL}_i}{\mathrm{OPT}_i}
\]
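For intuition (a standard break-even strategy, not necessarily the one derived in the lecture): buy day passes until their total cost would exceed the yearly pass, then buy the yearly pass. With the prices above, that means day passes for the first \(22\) days (total \(8800\)) and the yearly pass on day \(23\). If we ski at most \(22\) days, \(\mathrm{SOL} = \mathrm{OPT}\); if we ski \(23\) days or more, \(\mathrm{SOL} = 8800 + 9000 = 17800\) while \(\mathrm{OPT} = 9000\), so the competitive ratio is at most \(17800 / 9000 \approx 1.98\).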
We want to decide a strategy for booting/shutting down servers in a datacenter according to current and past demands. Cost is incurred while a server is running, and booting a server costs extra. We want to minimize the total cost. The problem can be described as the following optimization model:
One viable online algorithm, which is 2-competitive, is as follows:
FOR i := 1 TO n
    IF d[i + 1] >= s[i]
        s[i + 1] := d[i + 1];
        FOR j := 1 TO n
            count[j] := 0;
    ELSE
        FOR j := 1 TO n
            IF d[i + 1] <= j < s[i]
                count[j] += c;    // accumulate the running cost of idle server slot j
            ELSE
                count[j] := 0;
        s[i + 1] := MIN(j) WHERE count[j] > b;    // shrink once accumulated idle cost exceeds the booting cost b