Hierarchical Clustering
======
# Greedy Cut
Suppose we can approximate the following binary CSP:
$$\text{Maximize over all partitions $(A,B)$: } w(A) (|B| + |A|/3) + w(B) (|A| + |B|/3)$$
Then if we apply this algorithm to find $A,B$ and then cluster each randomly then our clustering objective will in expectation be equal to this value.
Consider an algorithm which with probability $p$ does this and with probability $1-p$ just splits randomly. Suppose we can approxiamte the CSP with factor $\alpha$. Then this algorithm gets:
$$\alpha p(\max_{(X,Y)}w(X)(|Y| + |X|/3) + w(Y)(|X| + |Y|/3)) + (1-p)w,$$
where $w$ is the total weight. We denote the $\max$ above as $CSP$.
So the algorithm is getting at least $\alpha p CSP + (1 - p)\frac13 w$.
Now consider an optimum solution. Pick an arbitrary node in this solution and call its children $A$ and $B$ respectively and the rest call $C$.
Then we know that:
$$OPT \le w(A) + w(B) + w(C) + |C|(w(A,B) + w(A,C) + w(B,C)).$$
Pick the node in the optimum tree which minimizes the upper bound above. We also have that:
$$CSP \ge w(A)(|B \cup C| + \frac{|A|}{3}) + w(B \cup C)(|A| +
\frac{|B \cup C |}{3})$$
$$CSP \ge w(B)(|A \cup C| + \frac{|B|}{3}) + w(A \cup C)(|B| + \frac{|A \cup C |}{3})$$
$$CSP \ge w(C)(|A \cup B| + \frac{|C|}{3}) + w(A \cup B)(|C| + \frac{|A \cup B |}{3})$$
Let's multiply the above by $\alpha_A, \alpha_B$ and $\alpha_C$ respectively and add up (let $a = |A|, b = |B|, c = |C|$).
Then we have:
$$CSP \ge w(A)(\alpha_A(1 - 2a/3) + \alpha_B(1/3 + 2b/3) + \alpha_C(1/3 + 2c/3) +\\
w(B)(\alpha_B(1 -2b/3) + \alpha_A(1/3 + 2a/3) + \alpha_C(1/3 + 2c/3)) + \\
w(C)(\alpha_C(1 -2c/3) + \alpha_A(1/3 + 2a/3) + \alpha_B(1/3 + 2b/3)) + \\
w(A,B)\alpha_C (1/3 + 2c/3) + w(B,C) \alpha_A(1/3 + 2a/3) + w(A,C) \alpha_B(1/3 + 2b/3)$$
To balance things we can set:
$$\alpha_A = \frac{\frac{1}{1 - 2a}}{\frac{1}{1 - 2a} + \frac{1}{1 - 2b} + \frac{1}{1 - 2c}}, \alpha_B = \frac{\frac{1}{1 - 2b}}{\frac{1}{1 - 2a} + \frac{1}{1 - 2b} + \frac{1}{1 - 2c}}, \alpha_C = \frac{\frac{1}{1 - 2c}}{\frac{1}{1 - 2a} + \frac{1}{1 - 2b} + \frac{1}{1 - 2c}}$$
Plugging in we get:
$$c_1 = c_A = c_B = c_C = \alpha_A(1 - 2a/3) + \alpha_B (1/3 + 2b/3) + \alpha_C (1/3 + 2c/3) = \frac13 \frac{5 + 4(\frac{a}{1 - 2a} + \frac{b}{1 - 2 b} + \frac{c}{1 - 2c})}{\frac{1}{1 - 2a} + \frac{1}{1 - 2b} + \frac{1}{1 - 2c}}$$
$$c_2 = \frac13 \frac{3 + 4(\frac{a}{1 - 2a} + \frac{b}{1 - 2 b} + \frac{c}{1 - 2c})}{\frac{1}{1 - 2a} + \frac{1}{1 - 2b} + \frac{1}{1 - 2c}}$$
$$CSP \ge c_1 (w(A) + w(B) + w(C)) + c_2(w(A,B) + w(B,C) + w(A,C))$$
$$OPT \le w(A) + w(B) + w(C) + |C|(w(A,B) + w(A,C) + w(B,C)).$$
$$ALG \ge \alpha p CSP + (1 - p)\frac13 (w(A) + w(B) + w(C) + w(A,B) + w(B,C) + w(A,C))$$
# Case |A| = |B| = 1/4, |C| = 1/2
We divide everything by $n$ in this analysis.
We have:
$$OPT \le w(A) + w(B) + w(C) + \frac12(w(A,B) + w(B,C) + w(A,C))$$
$$ALG \ge 0.87 * 2/3 * OPT_{Bal-UNCUT}$$
When we split $A$ and $B$ they just go to the corresponding side so we have in both cases:
$$OPT_{Bal-UNCUT} \ge w(A) + w(B) + w(C) + w(A,B)$$
When we split $C$ we get:
$$OPT_{Bal-UNCUT} \ge w(A) + w(B) + \frac12 w(C) + \frac12 w(A,C) + \frac12 w(B,C)$$
Finally, we have the trivial inequality from splitting randomly:
$$OPT_{Bal-UNCUT} \ge \frac12 (w(A) + w(B) + w(C) + w(A,B) + w(A,C) + w(B,C))$$
Let's multiply the first inequality by $\alpha$, second by $\beta$ and last one by $1 - \alpha - \beta$ and add them up.
We get the following four terms:
$$(\alpha + \beta + \frac{1 - \alpha - \beta}{2})(w(A) + w(B))$$
$$(\alpha + \beta/2 + \frac{1 - \alpha - \beta}{2}) w(C) = \frac{1 + \alpha}{2} w(C)$$
$$(\alpha + \frac{1 - \alpha -\beta}{2})w(A,B) = \frac{1 + \alpha -\beta}{2} w(A,B)$$
$$(\beta/2 + \frac{1 - \alpha - \beta}{2})(w(A,C) + w(B,C)) = \frac{1 - \alpha}{2}(w(A,C) + w(B,C))$$
The first term's coefficient is always better than the second one so we drop it. Then (up to the (GW * 2/3)-factor) the approximation is:
$$\min(\frac{1 + \alpha}{2}, 1 + \alpha -\beta, 1-\alpha).$$
We want to maximize this subject to a constraint $\alpha + \beta \le 1$. Best choice is $\alpha = 1/3$ and $\beta \le 2/3$ which makes the above minimum $2/3$. Hence the overall approximation is $GW \times 4/9 \approx 0.386$.
# General analysis
$$OPT \le n (w(A) + w(B) + w(C)) + |C|(w(A,B) + w(B,C) + w(A,C))$$
$$ALG \ge 0.87 \times 2n/3 \times OPT_{Bal-UNCUT}$$
$$OPT_{Bal-UNCUT} \ge ... ?$$
Suppose we split vertices of $C$ randomly between $A$ and $B$ so that the sizes of both resulting sets are the same. What can we say about the resulting UNCUT?
Let $x = n/2 - |A|$ and $y = n/2 - |B|$ then for every vertex in $C$ the probability that it goes to $A$ is $\frac{x}{x + y}$ and the probability that it goes to $B$ is $\frac{y}{x + y}$.
For an edge between $A$ and $C$ the probability that it is uncut is thus $\frac{x}{x + y}$ and similarly for an edge between $B$ and $C$ it is $\frac{y}{x + y}$.
For an edge in $C$ the probability that it is uncut is $\frac{x^2 + y^2}{(x + y)^2}$.
Thus the expected UNCUT value is:
$$w(A) + w(B) + \frac{x^2 + y^2}{(x + y)^2}w(C) + \frac{x}{x + y}w(A,C) + \frac{y}{x + y}w(B,C)$$
Let $d_A = n/2 - |A|$, $d_B = n/2 - |B|$ and $d_C = n/2 - |C|$. Then applying the same argument as above to all three different pairs we get:
$$OPT_{Bal-UNCUT} \ge w(B) + w(C) + \frac{d_B^2 + d_C^2}{(d_B + d_C)^2}w(A) + \frac{d_B}{d_B + d_C}w(A,B) + \frac{d_C}{d_B + d_C}w(A,C)$$
$$OPT_{Bal-UNCUT} \ge w(A) + w(C) + \frac{d_A^2 + d_C^2}{(d_A + d_C)^2}w(B) + \frac{d_C}{d_A + d_C}w(B,C) + \frac{d_A}{d_A + d_C}w(A,B)$$
$$OPT_{Bal-UNCUT} \ge w(A) + w(B) + \frac{d_A^2 + d_B^2}{(d_A + d_B)^2}w(C) + \frac{d_A}{d_A + d_B}w(A,C) + \frac{d_B}{d_A + d_B}w(B,C)$$
Let's pick $\alpha_A, \alpha_B, \alpha_C$ so that $\alpha_A + \alpha_B + \alpha_C = 1$ and:
$$\gamma = \alpha_A \frac{d_B^2 + d_C^2}{(d_B + d_C)^2} + \alpha_B + \alpha_C = \alpha_B \frac{d_A^2 + d_C^2}{(d_A + d_C)^2} + \alpha_A + \alpha_C = \alpha_C \frac{d_A^2 + d_B^2}{(d_A + d_B)^2} + \alpha_A + \alpha_B$$
Setting:
$$\beta_A = 1 - \frac{d_B^2 + d_C^2}{(d_B + d_C)^2} = \frac{2d_B d_C}{(d_B + d_C)^2}, \beta_B = \frac{2 d_A d_C}{(d_A + d_C)^2}, \beta_C = \frac{2 d_A d_B}{(d_A + d_B)^2}$$ we have:
$$\alpha_A = \frac{\beta_B \beta_C}{\beta_A \beta_B + \beta_B \beta_C + \beta_A \beta_C} = \frac{d_A (d_B + d_C)^2}{d_A(d_B + d_C)^2 + d_B(d_A + d_C)^2 + d_C(d_A + d_B)^2},$$
and $\alpha_B, \alpha_C$ are expressed similarly.
$$\gamma = 1 - \frac{\beta_A \beta_B \beta_C}{\beta_A \beta_B + \beta_B \beta_C + \beta_A \beta_C}$$
Let
$$\gamma_{AB} = \alpha_A \frac{d_B}{d_B + d_C} + \alpha_B \frac{d_A}{d_A + d_C} = \frac{d_A d_B(d_A + d_B + 2 d_C)}{d_A(d_B + d_C)^2 + d_B(d_A + d_C)^2 + d_C(d_A + d_B)^2}$$
Then we will have:
$$OPT_{Bal-UNCUT} \ge \gamma (w(A) + w(B) + w(C)) + \gamma_{AB} w(A,B) + \gamma_{AC} w(A,C) + \gamma_{BC}w(B,C)$$
Then the approximation factor is given as:
$$0.87 \times 2/3 \times \min_{|A|, |B|, |C| \colon |A| + |B| \ge n/2, |A|, |B| < n/2, |A| + |B| + |C| = n} \min(\gamma, \frac{n}{|C|}\gamma_{AB},\frac{n}{|C|}\gamma_{BC}, \frac{n}{|C|}\gamma_{AC})$$
# Case study $d_A = d_B$
Suppose $|A| = |B|$ and let $x = 1/2 - |A|$ (here we assume sizes are fractions). Then $d_C = 1/2 - 2x$.
From splitting $C$ we have:
$$OPT_{Bal-UNCUT} \ge w(A) + w(B) + \frac12 w(C) + \frac12 w(A,C) + \frac12 w(B,C)$$
From splitting $A$ we have:
$$OPT_{Bal-UNCUT} \ge w(B) + w(C) + \frac{x^2 + (1/2 - 2x)^2}{(1/2 - x)^2}w(A) + \frac{x}{1/2 - x} w(A,B) + \frac{1/2 - 2x}{1/2 - x} w(A,C)$$
Similarly, for $B$:
$$OPT_{Bal-UNCUT} \ge w(A) + w(C) + \frac{x^2 + (1/2 - 2x)^2}{(1/2 - x)^2}w(B) + \frac{x}{1/2 - x} w(A,B) + \frac{1/2 - 2x}{1/2 - x} w(B,C)$$
Finally, we have:
$$OPT_{Bal-UNCUT} \ge \frac12 (w(A) + w(B) + w(C) + w(A,B) + w(A,C) + w(B,C))$$
The first inequality will get weight $\alpha$, second two weight $\beta$ and the last one $1 - \alpha - 2\beta$.
Adding things up we get the following terms:
$$\left(\beta \frac{x^2 + (1/2 - 2x)^2}{(1/2 - x)^2} + \frac{1 + \alpha}{2}\right) (w(A) + w(B))$$
$$\left(\frac12 + \beta\right)w(C)$$
$$\left(\frac{2\beta x}{1/2 - x} + \frac{1 - \alpha - 2\beta}{2}\right)w(A,B)$$
$$\left(\frac12 - \frac{\beta x}{1/2 - x}\right)(w(A,C) + w(B,C))$$
Then (up to the (GW * 2/3)-factor) the approximation is given as the minimum of the following four quantities:
$$\beta \frac{x^2 + (1/2 - 2x)^2}{(1/2 - x)^2} + \frac{1 + \alpha}{2}$$
$$1/2 + \beta$$
$$\frac{\beta}{1/2 - x} + \frac{1 - \alpha - 2\beta}{4x}$$
$$\frac{1}{4x} - \frac{\beta}{1 - 2x}$$
So we aim to optimize this minimum by choosing $\alpha, \beta$.
Suppose $x = 1/4$ then we have:
$$\min(1/2 + \beta + \alpha/2, 1/2 + \beta,1 - \alpha + 2 \beta, 1 - 2\beta)$$
So we can set $\alpha = 2/3$ and $\beta = 1/6$ and the approximation is $GW \times 4/9$.
Let's select $\alpha, \beta$ so that:
$$\beta \frac{x^2 + (1/2 - 2x)^2}{(1/2 - x)^2} + \frac{1 + \alpha}{2} = \frac12 + \beta$$
$$\frac{2\beta x}{1/2 - x} + \frac{1 - \alpha - 2\beta}{2} = \frac12 - \frac{\beta x}{1/2 - x}$$
Simplifying we have:
$$\frac{\alpha}{2} + \beta\frac{x^2 + (1/2 - 2x)^2 - (1/2 - x)^2}{(1/2 - x)^2} = \frac{\alpha}{2} + \beta\frac{4x^2 - x}{(1/2 - x)^2}= 0$$
$$\frac{3\beta x}{1/2 - x} - \frac{\alpha}{2} - \beta = \beta\frac{4x - 1/2}{1/2 - x} - \frac{\alpha}{2} = 0$$
Adding together:
$$\beta $$