Hierarchical Clustering ====== # Greedy Cut Suppose we can approximate the following binary CSP: $$\text{Maximize over all partitions $(A,B)$: } w(A) (|B| + |A|/3) + w(B) (|A| + |B|/3)$$ Then if we apply this algorithm to find $A,B$ and then cluster each randomly then our clustering objective will in expectation be equal to this value. Consider an algorithm which with probability $p$ does this and with probability $1-p$ just splits randomly. Suppose we can approxiamte the CSP with factor $\alpha$. Then this algorithm gets: $$\alpha p(\max_{(X,Y)}w(X)(|Y| + |X|/3) + w(Y)(|X| + |Y|/3)) + (1-p)w,$$ where $w$ is the total weight. We denote the $\max$ above as $CSP$. So the algorithm is getting at least $\alpha p CSP + (1 - p)\frac13 w$. Now consider an optimum solution. Pick an arbitrary node in this solution and call its children $A$ and $B$ respectively and the rest call $C$. Then we know that: $$OPT \le w(A) + w(B) + w(C) + |C|(w(A,B) + w(A,C) + w(B,C)).$$ Pick the node in the optimum tree which minimizes the upper bound above. We also have that: $$CSP \ge w(A)(|B \cup C| + \frac{|A|}{3}) + w(B \cup C)(|A| + \frac{|B \cup C |}{3})$$ $$CSP \ge w(B)(|A \cup C| + \frac{|B|}{3}) + w(A \cup C)(|B| + \frac{|A \cup C |}{3})$$ $$CSP \ge w(C)(|A \cup B| + \frac{|C|}{3}) + w(A \cup B)(|C| + \frac{|A \cup B |}{3})$$ Let's multiply the above by $\alpha_A, \alpha_B$ and $\alpha_C$ respectively and add up (let $a = |A|, b = |B|, c = |C|$). Then we have: $$CSP \ge w(A)(\alpha_A(1 - 2a/3) + \alpha_B(1/3 + 2b/3) + \alpha_C(1/3 + 2c/3) +\\ w(B)(\alpha_B(1 -2b/3) + \alpha_A(1/3 + 2a/3) + \alpha_C(1/3 + 2c/3)) + \\ w(C)(\alpha_C(1 -2c/3) + \alpha_A(1/3 + 2a/3) + \alpha_B(1/3 + 2b/3)) + \\ w(A,B)\alpha_C (1/3 + 2c/3) + w(B,C) \alpha_A(1/3 + 2a/3) + w(A,C) \alpha_B(1/3 + 2b/3)$$ To balance things we can set: $$\alpha_A = \frac{\frac{1}{1 - 2a}}{\frac{1}{1 - 2a} + \frac{1}{1 - 2b} + \frac{1}{1 - 2c}}, \alpha_B = \frac{\frac{1}{1 - 2b}}{\frac{1}{1 - 2a} + \frac{1}{1 - 2b} + \frac{1}{1 - 2c}}, \alpha_C = \frac{\frac{1}{1 - 2c}}{\frac{1}{1 - 2a} + \frac{1}{1 - 2b} + \frac{1}{1 - 2c}}$$ Plugging in we get: $$c_1 = c_A = c_B = c_C = \alpha_A(1 - 2a/3) + \alpha_B (1/3 + 2b/3) + \alpha_C (1/3 + 2c/3) = \frac13 \frac{5 + 4(\frac{a}{1 - 2a} + \frac{b}{1 - 2 b} + \frac{c}{1 - 2c})}{\frac{1}{1 - 2a} + \frac{1}{1 - 2b} + \frac{1}{1 - 2c}}$$ $$c_2 = \frac13 \frac{3 + 4(\frac{a}{1 - 2a} + \frac{b}{1 - 2 b} + \frac{c}{1 - 2c})}{\frac{1}{1 - 2a} + \frac{1}{1 - 2b} + \frac{1}{1 - 2c}}$$ $$CSP \ge c_1 (w(A) + w(B) + w(C)) + c_2(w(A,B) + w(B,C) + w(A,C))$$ $$OPT \le w(A) + w(B) + w(C) + |C|(w(A,B) + w(A,C) + w(B,C)).$$ $$ALG \ge \alpha p CSP + (1 - p)\frac13 (w(A) + w(B) + w(C) + w(A,B) + w(B,C) + w(A,C))$$ # Case |A| = |B| = 1/4, |C| = 1/2 We divide everything by $n$ in this analysis. We have: $$OPT \le w(A) + w(B) + w(C) + \frac12(w(A,B) + w(B,C) + w(A,C))$$ $$ALG \ge 0.87 * 2/3 * OPT_{Bal-UNCUT}$$ When we split $A$ and $B$ they just go to the corresponding side so we have in both cases: $$OPT_{Bal-UNCUT} \ge w(A) + w(B) + w(C) + w(A,B)$$ When we split $C$ we get: $$OPT_{Bal-UNCUT} \ge w(A) + w(B) + \frac12 w(C) + \frac12 w(A,C) + \frac12 w(B,C)$$ Finally, we have the trivial inequality from splitting randomly: $$OPT_{Bal-UNCUT} \ge \frac12 (w(A) + w(B) + w(C) + w(A,B) + w(A,C) + w(B,C))$$ Let's multiply the first inequality by $\alpha$, second by $\beta$ and last one by $1 - \alpha - \beta$ and add them up. We get the following four terms: $$(\alpha + \beta + \frac{1 - \alpha - \beta}{2})(w(A) + w(B))$$ $$(\alpha + \beta/2 + \frac{1 - \alpha - \beta}{2}) w(C) = \frac{1 + \alpha}{2} w(C)$$ $$(\alpha + \frac{1 - \alpha -\beta}{2})w(A,B) = \frac{1 + \alpha -\beta}{2} w(A,B)$$ $$(\beta/2 + \frac{1 - \alpha - \beta}{2})(w(A,C) + w(B,C)) = \frac{1 - \alpha}{2}(w(A,C) + w(B,C))$$ The first term's coefficient is always better than the second one so we drop it. Then (up to the (GW * 2/3)-factor) the approximation is: $$\min(\frac{1 + \alpha}{2}, 1 + \alpha -\beta, 1-\alpha).$$ We want to maximize this subject to a constraint $\alpha + \beta \le 1$. Best choice is $\alpha = 1/3$ and $\beta \le 2/3$ which makes the above minimum $2/3$. Hence the overall approximation is $GW \times 4/9 \approx 0.386$. # General analysis $$OPT \le n (w(A) + w(B) + w(C)) + |C|(w(A,B) + w(B,C) + w(A,C))$$ $$ALG \ge 0.87 \times 2n/3 \times OPT_{Bal-UNCUT}$$ $$OPT_{Bal-UNCUT} \ge ... ?$$ Suppose we split vertices of $C$ randomly between $A$ and $B$ so that the sizes of both resulting sets are the same. What can we say about the resulting UNCUT? Let $x = n/2 - |A|$ and $y = n/2 - |B|$ then for every vertex in $C$ the probability that it goes to $A$ is $\frac{x}{x + y}$ and the probability that it goes to $B$ is $\frac{y}{x + y}$. For an edge between $A$ and $C$ the probability that it is uncut is thus $\frac{x}{x + y}$ and similarly for an edge between $B$ and $C$ it is $\frac{y}{x + y}$. For an edge in $C$ the probability that it is uncut is $\frac{x^2 + y^2}{(x + y)^2}$. Thus the expected UNCUT value is: $$w(A) + w(B) + \frac{x^2 + y^2}{(x + y)^2}w(C) + \frac{x}{x + y}w(A,C) + \frac{y}{x + y}w(B,C)$$ Let $d_A = n/2 - |A|$, $d_B = n/2 - |B|$ and $d_C = n/2 - |C|$. Then applying the same argument as above to all three different pairs we get: $$OPT_{Bal-UNCUT} \ge w(B) + w(C) + \frac{d_B^2 + d_C^2}{(d_B + d_C)^2}w(A) + \frac{d_B}{d_B + d_C}w(A,B) + \frac{d_C}{d_B + d_C}w(A,C)$$ $$OPT_{Bal-UNCUT} \ge w(A) + w(C) + \frac{d_A^2 + d_C^2}{(d_A + d_C)^2}w(B) + \frac{d_C}{d_A + d_C}w(B,C) + \frac{d_A}{d_A + d_C}w(A,B)$$ $$OPT_{Bal-UNCUT} \ge w(A) + w(B) + \frac{d_A^2 + d_B^2}{(d_A + d_B)^2}w(C) + \frac{d_A}{d_A + d_B}w(A,C) + \frac{d_B}{d_A + d_B}w(B,C)$$ Let's pick $\alpha_A, \alpha_B, \alpha_C$ so that $\alpha_A + \alpha_B + \alpha_C = 1$ and: $$\gamma = \alpha_A \frac{d_B^2 + d_C^2}{(d_B + d_C)^2} + \alpha_B + \alpha_C = \alpha_B \frac{d_A^2 + d_C^2}{(d_A + d_C)^2} + \alpha_A + \alpha_C = \alpha_C \frac{d_A^2 + d_B^2}{(d_A + d_B)^2} + \alpha_A + \alpha_B$$ Setting: $$\beta_A = 1 - \frac{d_B^2 + d_C^2}{(d_B + d_C)^2} = \frac{2d_B d_C}{(d_B + d_C)^2}, \beta_B = \frac{2 d_A d_C}{(d_A + d_C)^2}, \beta_C = \frac{2 d_A d_B}{(d_A + d_B)^2}$$ we have: $$\alpha_A = \frac{\beta_B \beta_C}{\beta_A \beta_B + \beta_B \beta_C + \beta_A \beta_C} = \frac{d_A (d_B + d_C)^2}{d_A(d_B + d_C)^2 + d_B(d_A + d_C)^2 + d_C(d_A + d_B)^2},$$ and $\alpha_B, \alpha_C$ are expressed similarly. $$\gamma = 1 - \frac{\beta_A \beta_B \beta_C}{\beta_A \beta_B + \beta_B \beta_C + \beta_A \beta_C}$$ Let $$\gamma_{AB} = \alpha_A \frac{d_B}{d_B + d_C} + \alpha_B \frac{d_A}{d_A + d_C} = \frac{d_A d_B(d_A + d_B + 2 d_C)}{d_A(d_B + d_C)^2 + d_B(d_A + d_C)^2 + d_C(d_A + d_B)^2}$$ Then we will have: $$OPT_{Bal-UNCUT} \ge \gamma (w(A) + w(B) + w(C)) + \gamma_{AB} w(A,B) + \gamma_{AC} w(A,C) + \gamma_{BC}w(B,C)$$ Then the approximation factor is given as: $$0.87 \times 2/3 \times \min_{|A|, |B|, |C| \colon |A| + |B| \ge n/2, |A|, |B| < n/2, |A| + |B| + |C| = n} \min(\gamma, \frac{n}{|C|}\gamma_{AB},\frac{n}{|C|}\gamma_{BC}, \frac{n}{|C|}\gamma_{AC})$$ # Case study $d_A = d_B$ Suppose $|A| = |B|$ and let $x = 1/2 - |A|$ (here we assume sizes are fractions). Then $d_C = 1/2 - 2x$. From splitting $C$ we have: $$OPT_{Bal-UNCUT} \ge w(A) + w(B) + \frac12 w(C) + \frac12 w(A,C) + \frac12 w(B,C)$$ From splitting $A$ we have: $$OPT_{Bal-UNCUT} \ge w(B) + w(C) + \frac{x^2 + (1/2 - 2x)^2}{(1/2 - x)^2}w(A) + \frac{x}{1/2 - x} w(A,B) + \frac{1/2 - 2x}{1/2 - x} w(A,C)$$ Similarly, for $B$: $$OPT_{Bal-UNCUT} \ge w(A) + w(C) + \frac{x^2 + (1/2 - 2x)^2}{(1/2 - x)^2}w(B) + \frac{x}{1/2 - x} w(A,B) + \frac{1/2 - 2x}{1/2 - x} w(B,C)$$ Finally, we have: $$OPT_{Bal-UNCUT} \ge \frac12 (w(A) + w(B) + w(C) + w(A,B) + w(A,C) + w(B,C))$$ The first inequality will get weight $\alpha$, second two weight $\beta$ and the last one $1 - \alpha - 2\beta$. Adding things up we get the following terms: $$\left(\beta \frac{x^2 + (1/2 - 2x)^2}{(1/2 - x)^2} + \frac{1 + \alpha}{2}\right) (w(A) + w(B))$$ $$\left(\frac12 + \beta\right)w(C)$$ $$\left(\frac{2\beta x}{1/2 - x} + \frac{1 - \alpha - 2\beta}{2}\right)w(A,B)$$ $$\left(\frac12 - \frac{\beta x}{1/2 - x}\right)(w(A,C) + w(B,C))$$ Then (up to the (GW * 2/3)-factor) the approximation is given as the minimum of the following four quantities: $$\beta \frac{x^2 + (1/2 - 2x)^2}{(1/2 - x)^2} + \frac{1 + \alpha}{2}$$ $$1/2 + \beta$$ $$\frac{\beta}{1/2 - x} + \frac{1 - \alpha - 2\beta}{4x}$$ $$\frac{1}{4x} - \frac{\beta}{1 - 2x}$$ So we aim to optimize this minimum by choosing $\alpha, \beta$. Suppose $x = 1/4$ then we have: $$\min(1/2 + \beta + \alpha/2, 1/2 + \beta,1 - \alpha + 2 \beta, 1 - 2\beta)$$ So we can set $\alpha = 2/3$ and $\beta = 1/6$ and the approximation is $GW \times 4/9$. Let's select $\alpha, \beta$ so that: $$\beta \frac{x^2 + (1/2 - 2x)^2}{(1/2 - x)^2} + \frac{1 + \alpha}{2} = \frac12 + \beta$$ $$\frac{2\beta x}{1/2 - x} + \frac{1 - \alpha - 2\beta}{2} = \frac12 - \frac{\beta x}{1/2 - x}$$ Simplifying we have: $$\frac{\alpha}{2} + \beta\frac{x^2 + (1/2 - 2x)^2 - (1/2 - x)^2}{(1/2 - x)^2} = \frac{\alpha}{2} + \beta\frac{4x^2 - x}{(1/2 - x)^2}= 0$$ $$\frac{3\beta x}{1/2 - x} - \frac{\alpha}{2} - \beta = \beta\frac{4x - 1/2}{1/2 - x} - \frac{\alpha}{2} = 0$$ Adding together: $$\beta $$