Farid-internship
=======
# Eighth meeting notes
Take $w(x,y) = 1$ iff $x$ and $y$ are in the same class.
Then
$$ \textrm{DEND-PUR} = \mathbb{E}_{x,y \,|\, w(x,y) = 1}\left(\frac{1}{|LCA(x,y)|} \sum_{z \in LCA(x,y) \,:\, w(x,z) = 1} 1\right) $$
$$ \textrm{DEND-PUR} = \mathbb{E}_{x,y \,|\, w(x,y) = 1}\left(\frac{w(x,y)}{|LCA(x,y)|} \sum_{z \in LCA(x,y)} w(x,z)\right) $$
$$ \textrm{DEND-PUR} = \frac{1}{ \sum_{u<v} w(u,v) } \sum_{x<y} \frac{w(x,y)}{|LCA(x,y)|} \sum_{z \in LCA(x,y)} w(x,z) $$
Here $z$ ranges over the leaves of the subtree rooted at $LCA(x,y)$, and $|LCA(x,y)|$ denotes the number of those leaves.
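A minimal sketch of computing this quantity on a toy dendrogram (the nested-pair tree encoding and the two-class labeling are illustrative assumptions, not notation from these notes):

```python
from itertools import combinations

def leaves(tree):
    """Set of leaf indices under a nested-pair tree."""
    if isinstance(tree, int):
        return {tree}
    left, right = tree
    return leaves(left) | leaves(right)

def lca_leaves(tree, x, y):
    """Leaves of the subtree rooted at LCA(x, y)."""
    left, right = tree
    for child in (left, right):
        cl = leaves(child)
        if x in cl and y in cl:
            return lca_leaves(child, x, y)
    return leaves(tree)  # x and y split here, so `tree` is their LCA

def dend_pur(tree, labels):
    """Average over same-class pairs (x, y) of the fraction of
    same-class leaves under LCA(x, y)."""
    pairs = [(x, y) for x, y in combinations(sorted(leaves(tree)), 2)
             if labels[x] == labels[y]]
    total = sum(
        sum(labels[z] == labels[x] for z in lca_leaves(tree, x, y))
        / len(lca_leaves(tree, x, y))
        for x, y in pairs)
    return total / len(pairs)

labels = {0: 0, 1: 0, 2: 1, 3: 1}            # two classes: {0,1} and {2,3}
print(dend_pur(((0, 2), (1, 3)), labels))    # 0.5 -- classes mixed at the root
print(dend_pur(((0, 1), (2, 3)), labels))    # 1.0 -- each class a pure subtree
```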
# Seventh meeting notes
Suppose there are four clusters, in order $C,A,B,D$, with sizes $c,a,b,d$ respectively; write $\bar{w}(X,Y)$ for the average pairwise similarity, so that $w(X,Y) = |X||Y| \, \bar{w}(X,Y)$. Suppose we merge $(A,B)$ first and then $(C,AB)$.
We use the following bound on the variation in score:
$$ \Delta S \geq \Delta(\textrm{MAX-UPPER}) - \Delta(\textrm{MIN-MERGE}) $$
$$ \Delta(\textrm{MIN-MERGE}) \leq \min( w(AB,D)c, ((a+b)c-w(AB,C))d) \leq (a+b)cd \min(1-\bar{w}(AB,C), \bar{w}(AB,D))$$
$$ \Delta(\textrm{MAX-UPPER}) \geq w(A,B)(c+d)+w(AB,C)d$$
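As a sanity check on the second inequality in the MIN-MERGE bound, a short random test (expanding $w(X,Y) = |X||Y|\,\bar{w}(X,Y)$ actually shows the two sides are equal, so the bound is tight):

```python
import random

# Check: min(w(AB,D)*c, ((a+b)*c - w(AB,C))*d)
#     <= (a+b)*c*d * min(1 - wbar(AB,C), wbar(AB,D)),
# with w(AB,C) = (a+b)*c*wbar(AB,C) and w(AB,D) = (a+b)*d*wbar(AB,D).
random.seed(0)
for _ in range(10_000):
    a, b, c, d = (random.randint(1, 20) for _ in range(4))
    wbar_ABC, wbar_ABD = random.random(), random.random()
    w_ABC = (a + b) * c * wbar_ABC
    w_ABD = (a + b) * d * wbar_ABD
    lhs = min(w_ABD * c, ((a + b) * c - w_ABC) * d)
    rhs = (a + b) * c * d * min(1 - wbar_ABC, wbar_ABD)
    assert lhs <= rhs + 1e-9
print("MIN-MERGE bound holds on all sampled instances")
```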
----
We want to show that $\Delta(\textrm{MAX-MERGE} + \textrm{MIN-MERGE}) \geq 2 \sum_{\textrm{MIN-MERGE}} (\textrm{MAX}\,w - \textrm{MIN}\,w)$.
We have:
$$\Delta(\textrm{MAX-MERGE}+\textrm{MIN-MERGE}) \geq w(A,B)(c+d) + w(AB,C)d$$
and,
$$\sum_{\textrm{MIN-MERGE}} (\textrm{MAX}\,w - \textrm{MIN}\,w) \leq w(D,AB)c - w(C,AB)d$$
We finally want to show the following:
$$ ab(c+d) \bar{w}(A,B) + (a+b)cd\bar{w}(AB,C) \geq 2(a+b)cd\left(\bar{w}(AB,D) - \bar{w}(AB,C) \right)$$
$$ ab(c+d) \bar{w}(A,B) \geq (a+b)cd\left(2\bar{w}(AB,D) - 3\bar{w}(AB,C) \right)$$
---
Going back to threshold weights, we have,
$$\Delta(M_{1,1}+M_{1,0}) \geq w(AB,C)d$$
$$\Delta S_{-1} \geq w(A,B)(c+d)$$
$$\Delta \Psi \geq (w(C)+w(D))(a+b) $$
$$\Delta M_{0,1} \leq \min(w(AB,D)c, (c(a+b)-w(AB,C))d)$$
And so,
$$\Delta(M_{1,1}+M_{1,0} + 2\Psi + S_{-1}-2M_{0,1}) \geq \\
\Delta(2\Psi + S_{-1} - M_{0,1}) \geq \\
ab(c+d)\bar{w}(A,B) + 2(w(C)+w(D))(a+b) - (a+b)cd \min( \bar{w}(AB,D), 1 - \bar{w}(AB,C))$$
We also know that,
$$ w(C) = \bar{w}(C) {c \choose 2} \geq \bar{w}(AB,D) {c \choose 2}$$
And thus,
$$ 2(w(C) + w(D)) \geq cd \min(\bar{w}(AB,D), 1-\bar{w}(AB,C)) $$
# Sixth meeting notes
In the case of four clusters $(C,A,B,D)$, we have the following:
$$ avg(A,B) = \frac{\Delta M_{1,1} + \Delta M_{1,0}}{|A| |B| ( |C| + |D|)}$$
$$ \frac{1}{|D|} avg(A,C) + \frac{1}{|C|}avg(B,D) = \frac{\Delta M_{1,1} + \Delta M_{0,1}}{|A||B||C||D|} $$
Thus, using the following inequality (which holds since average-linkage merged $(A,B)$, so $avg(A,B) \geq avg(A,C)$ and $avg(A,B) \geq avg(B,D)$):
$$ \left(\frac{1}{|D|} + \frac{1}{|C|}\right) avg(A,B) \geq \frac{1}{|D|} avg(A,C) + \frac{1}{|C|}avg(B,D) $$
$$ \left(\frac{1}{|D|} + \frac{1}{|C|}\right) \frac{\Delta M_{1,1} + \Delta M_{1,0}}{|A||B|(|C|+|D|)} \geq \frac{\Delta M_{1,1} + \Delta M_{0,1}}{|A||B||C||D|} $$
$$ \Delta M_{1,1} + \Delta M_{1,0} \geq \Delta M_{1,1} + \Delta M_{0,1}, $$
i.e., $\Delta M_{1,0} \geq \Delta M_{0,1}$.
Imagine we lose a $(0,1)$-triple $(x,y,z)$ by merging $(x,y)$. This means that if there are points $t < y$ such that $w(t,y) = 1$, then we also lose (or have already lost) the triple $(t,y,z)$.
We are trying to map $(x,y,z)$ onto $(pred(y),y,z')$.
Fix $y$ and let $L_q(y) = \{r < y \mid w(r,y) = q\}$ and $R_q(y) = \{r > y \mid w(r,y) = q\}$ be the sets of points to the left/right of $y$ with similarity $q$. Assume w.l.o.g. that $|L_1(y)| \ge |R_1(y)|$.
For every $z \in R_1(y)$ and $x \in L_0(y)$ we would like to find $x' \in L_1(y)$ with $x < x' < y$, and $z' \in R_1(y) \cup L_1(y)$.
For $x'$ pick any element of $L_1(y)$ and for $z'$ pick anything: $|L_1| \times (|L_1| + |R_1|) \geq |L_0| \times |R_1|$ choices.
# Fifth meeting notes
Let's consider $(i < j < k)$ triples. There are three different types of interesting triples: $(1,0)$, $(0,1)$, $(1,1)$. So the MAX-upper bound is just given as:
$$\Psi_{1,0} + \Psi_{0,1} + \Psi_{1,1}.$$
Consider the execution of the average-linkage clustering algorithm. For every $(i,j,k)$ triple consider the first time a pair in this triple gets merged. Let's use notation:
* $M_{1,0}$ is the number of $(1,0)$ or $(0,1)$ triples where the $1$-pair gets merged first
* $M_{0,1}$ is the number of such triples where the $0$-pair gets merged first
* $M_{1,1}$ is the number of merged $(1,1)$ triples
Our score is $M_{1,0} + M_{1,1}$ and the potential drop is $M_{1,0} + M_{0,1} + M_{1,1}$. So we would like to show that at every iteration the cumulative number of merges satisfies:
$$M_{1,0} + M_{1,1} \ge \frac 34 \left(M_{1,0} + M_{0,1} + M_{1,1}\right)$$
$$\frac14\left(M_{1,0} + M_{1,1}\right) \ge \frac34 M_{0,1}$$
$$M_{1,0} + M_{1,1} \ge 3 M_{0,1}$$
$$M_{1,0} + M_{1,1} - 3 M_{0,1} \ge 0$$
In the 4-point example we have:
* First iteration: $M_{1,0} = 0$, $M_{1,1} = 2, M_{0,1} = 0$
* Second iteration: $M_{1,0} = 1$, $M_{1,1} = 2, M_{0,1} = 1$
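The following sketch reproduces these counts, assuming the 4-point example is the path $1-2-3-4$ with $w(i,i+1) = 1$ and all other pairs $0$ (a guess consistent with the numbers above), and breaking average-linkage ties so that $(2,3)$ merges first:

```python
from itertools import combinations

# Guessed instance: path 1-2-3-4 with w(i, i+1) = 1, all other pairs 0.
n = 4
w = {frozenset(p): 0 for p in combinations(range(1, n + 1), 2)}
for i in range(1, n):
    w[frozenset((i, i + 1))] = 1

def avg(A, B):
    return sum(w[frozenset((x, y))] for x in A for y in B) / (len(A) * len(B))

# cluster list ordered so that ties break toward merging (2,3) first
clusters = [{2}, {3}, {1}, {4}]
merged = set()                   # point pairs already inside one cluster
M = {"1,0": 0, "0,1": 0, "1,1": 0}

while len(clusters) > 1:
    A, B = max(combinations(clusters, 2), key=lambda ab: avg(*ab))
    new_pairs = {frozenset((x, y)) for x in A for y in B}
    for i, j, k in combinations(range(1, n + 1), 3):
        tpairs = {frozenset((i, j)), frozenset((j, k)), frozenset((i, k))}
        if merged & tpairs or not (new_pairs & tpairs):
            continue             # triple already resolved, or untouched
        t = (w[frozenset((i, j))], w[frozenset((j, k))])
        if t == (1, 1):
            M["1,1"] += 1
        elif t in ((1, 0), (0, 1)):
            one_pair = frozenset((i, j) if t == (1, 0) else (j, k))
            M["1,0" if one_pair in new_pairs else "0,1"] += 1
    merged |= new_pairs
    clusters = [c for c in clusters if c is not A and c is not B] + [A | B]
    print(sorted(map(sorted, clusters)), M)
# prints M_{1,1} = 2 after the first merge and
# M_{1,0} = 1, M_{0,1} = 1, M_{1,1} = 2 after the second, as above.
```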
Consider the final step $(C,(A,B))$ where we merge $(A,B)$ and the rest is $C$ on the left. Let's express the changes in score and potential in terms of $M_{0,1}, M_{1,0}, M_{1,1}$:
Under the hypothesis,
$$\Delta S_{\textrm{merging } AC} \leq \Delta S_{\textrm{merging } AB}$$
We have,
$$\Delta M_{0,1} + \Delta M_{1,1} \leq \Delta M_{1,0} + \Delta M_{1,1}$$
# Fourth meeting notes
We would like to show something like this:
$$\sum_{i = 1}^t \left(\Delta C_i + \frac34 \Delta \Psi^{MAX}_i\right) \ge 0$$
We consider merging $A$ and $B$ where the rest is just $C$ on the left so that we have $(C,(A,B))$.
$$\Delta C = w_{AB}|C|$$
$$\Delta \Psi^{SUM} = - (w_{AB}|C| + w_{AC}|B|)$$
$$\Delta \Psi^{MAX} = - (w_{AB}|C| + w_{AC}|B| - \Psi_{1,1}(C,A,B))$$
Suppose we want a proof by induction going from $t$ to $t + 1$.
We have:
$$\sum_{i = 1}^t \left(\Delta C_i + \frac34 \Delta \Psi^{MAX}_i\right) \ge 0$$
We want to show:
$$\sum_{i = 1}^{t + 1} \left(\Delta C_i + \frac34 \Delta \Psi^{MAX}_i\right) \ge 0$$
$$\Delta C_{t + 1} = w_{AB}|C|$$
$$\Delta \Psi_{t + 1}^{MAX} = - (w_{AB}|C| + w_{AC}|B| - \Psi_{1,1}(C,A,B))$$
Equivalently, using $|B| w_{AC} = \Psi_{1,0} + \Psi_{1,1}$ and $|C| w_{AB} = \Psi_{0,1} + \Psi_{1,1}$ (derived below):
$$\Delta \Psi_{t + 1}^{MAX} = - (\Psi_{1,0} + \Psi_{0,1} + \Psi_{1,1})$$
We want:
$$w_{AB}|C| \ge \frac34 (w_{AB}|C| + w_{AC}|B| - \Psi_{1,1}(C,A,B))$$
$$\frac14 w_{AB}|C| \ge \frac34 w_{AC}|B| - \frac34 \Psi_{1,1}(C,A,B)$$
$$\frac14 \bar w_{AB} \ge \frac34 \bar w_{AC} - \frac{3}{4 |A||B||C|} \Psi_{1,1}(C,A,B)$$
Since average-linkage merged $(A,B)$, we have $\bar w_{AB} \ge \bar w_{AC}$, hence $\frac34 \bar w_{AC} - \frac14 \bar w_{AB} \le \frac12 \bar w_{AC}$ and it suffices to show:
$$0 \ge \frac 12 \bar w_{AC} - \frac{3}{4} \bar \Psi_{1,1}(C,A,B)$$
$$0 \ge \frac 12 (\bar \Psi_{1,0} + \bar \Psi_{1,1}) - \frac{3}{4} \bar \Psi_{1,1}(C,A,B)$$
$$0 \ge \frac 12 \bar \Psi_{1,0}(C,A,B) - \frac{1}{4} \bar \Psi_{1,1}(C,A,B)$$
We have $\bar w_{AC} \ge \bar \Psi_{1,1}(C,A,B)$ so we just need to show:
$$0 \ge - \frac{1}{4} \bar \Psi_{1,1}(C,A,B)$$
Counting triples $(c,a,b)$ with $c \in C$, $a \in A$, $b \in B$ by their $1$-pairs:
$$ |B| w_{AC} = \Psi_{1,0} + \Psi_{1,1}$$
$$ |C| w_{AB} = \Psi_{0,1} + \Psi_{1,1}$$
So,
$$ \bar w_{AC} = \bar \Psi_{1,0} + \bar \Psi_{1,1}$$
$$ \bar w_{AB} = \bar \Psi_{0,1} + \bar \Psi_{1,1}$$
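These counting identities are easy to verify by brute force; a sketch over random $0/1$ weights (the cluster sizes and weights are arbitrary test data):

```python
import random
from itertools import product

# Check |B| w_AC = Psi_{1,0} + Psi_{1,1} and |C| w_AB = Psi_{0,1} + Psi_{1,1},
# where Psi_{p,q}(C,A,B) counts triples (c,a,b) with w(c,a) = p, w(a,b) = q.
random.seed(1)
for _ in range(100):
    nC, nA, nB = (random.randint(1, 5) for _ in range(3))
    wCA = {(c, a): random.randint(0, 1) for c in range(nC) for a in range(nA)}
    wAB = {(a, b): random.randint(0, 1) for a in range(nA) for b in range(nB)}
    psi = {(p, q): 0 for p in (0, 1) for q in (0, 1)}
    for c, a, b in product(range(nC), range(nA), range(nB)):
        psi[wCA[c, a], wAB[a, b]] += 1
    assert nB * sum(wCA.values()) == psi[1, 0] + psi[1, 1]
    assert nC * sum(wAB.values()) == psi[0, 1] + psi[1, 1]
print("counting identities hold")
```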
Is it true that $\bar w_{AC} \ge \bar \Psi_{1,1}(C,A,B)$? Yes: by the first identity above, $\bar w_{AC} = \bar \Psi_{1,0} + \bar \Psi_{1,1} \ge \bar \Psi_{1,1}$.
In the 4-point example in the second step we have:
$|A| = 2, |B| = 1, |C| = 1$
$\bar w_{A,C} = \frac12$
# Third meeting notes
Let's study MAX-potential:
$$MAX = \sum_{i < j < k} \max(w_{ij}, w_{jk})$$
Let $w$ be the threshold weight at distance $\theta$, i.e. $w_{ij} = 1$ iff $d(i,j) \le \theta$.
Let's try to characterize the MAX-potential in this case. Let's split the points $x_i$ into groups which are at distance $> \theta$ from each other. Suppose there are two such groups $G_1, G_2$.
Let's evaluate the MAX-potential.
$$MAX(G_1 \cup G_2) = MAX(G_1) + MAX(G_2) + C_1 |G_2| + C_2 |G_1|,$$
where $C_i$ is the number of pairs in group $i$ at distance at most $\theta$.
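A quick check of this decomposition, assuming the points lie on a line with $G_1$ entirely to the left of $G_2$ and the two groups more than $\theta$ apart:

```python
import random
from itertools import combinations

def max_potential(pts, theta):
    """MAX = sum over ordered triples i<j<k of max(w_ij, w_jk),
    with threshold weights w_ij = 1 iff |x_i - x_j| <= theta."""
    w = lambda p, q: 1 if abs(p - q) <= theta else 0
    return sum(max(w(pts[i], pts[j]), w(pts[j], pts[k]))
               for i, j, k in combinations(range(len(pts)), 3))

theta = 1.0
random.seed(0)
for _ in range(100):
    # two groups on the line, separated by much more than theta
    G1 = sorted(random.uniform(0, 3) for _ in range(random.randint(2, 6)))
    G2 = sorted(random.uniform(10, 13) for _ in range(random.randint(2, 6)))
    C1 = sum(abs(p - q) <= theta for p, q in combinations(G1, 2))
    C2 = sum(abs(p - q) <= theta for p, q in combinations(G2, 2))
    lhs = max_potential(G1 + G2, theta)
    rhs = max_potential(G1, theta) + max_potential(G2, theta) \
          + C1 * len(G2) + C2 * len(G1)
    assert lhs == rhs
print("decomposition verified on random two-group instances")
```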
## 1/2-approximation analysis
Suppose we are merging $(A,B)$ and all of the rest is on the left, called $C$. So we have $(C, (A, B))$.
We want to show that $\Delta C + 1/2 \Delta \Psi \ge 0$.
We have:
$$\Delta C = w_{AB}|C|$$
$$\Delta \Psi_{SUM} = - (w_{AB}|C| + w_{AC}|B|)$$
$$\Delta \Psi_{MAX} = - (w_{AB}|C| + w_{AC}|B| - \Psi_{1,1}(C,A,B))$$
Overall in the 4-point example we have:
$$\Psi: 4 \rightarrow 2 \rightarrow 0 \rightarrow 0$$
$$C: 0 \rightarrow 2 \rightarrow 3 \rightarrow 3$$
$$\Psi_{1,1}: 2 \rightarrow 0 \rightarrow 0 \rightarrow 0$$
$$\Psi_{1,0}: 2 \rightarrow 2 \rightarrow 0 \rightarrow 0$$
# Second meeting notes
CA objective is to maximize:
$$\sum_{i < j} d(i,j) |LCA(i,j)|$$
Let's introduce an indicator $I_{k}(i,j)$ for the event that $k$ is in the subtree of $LCA(i,j)$.
Then we have:
$$CA = \sum_{i < j} d(i,j)\left(2 + \sum_{k \neq i,j} I_k(i,j)\right)$$
Let's introduce $CA' = CA - 2 \sum_{i < j} d(i,j)$. Then for a triangle $(i,j,k)$ we have
$$OPT_{CA'} = \sum_{i < j} d(i,j) - \min_{i < j} d(i,j)$$
$$OPT_{MW} = 1 - \min_{i < j} d(i,j)$$
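For three points there are only three trees, one per choice of first-merged pair, so these two formulas can be confirmed by enumeration (the distances below are arbitrary test values):

```python
# For 3 points the pair merged first has |LCA| = 2 and the other two
# pairs have |LCA| = 3. Check OPT_CA' = sum(d) - min(d), OPT_MW = 1 - min(d).
d = {frozenset((0, 1)): 0.2, frozenset((0, 2)): 0.5, frozenset((1, 2)): 0.9}

def ca_prime(first):
    # CA' = sum_{i<j} d(i,j) * (|LCA(i,j)| - 2)
    return sum(dist for pair, dist in d.items() if pair != first)

def mw(first):
    # MW = sum_{i<j} (1 - d(i,j)) * (3 - |LCA(i,j)|): only `first` contributes
    return 1 - d[first]

opt_ca = max(ca_prime(f) for f in d)
opt_mw = max(mw(f) for f in d)
assert abs(opt_ca - (sum(d.values()) - min(d.values()))) < 1e-9
assert abs(opt_mw - (1 - min(d.values()))) < 1e-9
```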
$$ALG_{MW} = 1-d(i^*,j^*) \ge (1/3 + \delta) OPT_{MW} = (1/3 + \delta) (1 - \min_{i < j} d(i,j))$$
This gives us:
$$-d(i^*,j^*) \ge (1/3 + \delta) (1 - \min_{i < j} d(i,j)) - 1$$
$$ALG_{CA'} = \sum_{i < j} d(i,j) - d(i^*,j^*)\ge (2/3 + \epsilon) OPT_{CA'} = (2/3 + \epsilon)(\sum_{i < j} d(i,j) - \min_{i < j} d(i,j))?$$
Plugging in the lower bound on $-d(i^*,j^*)$ from above we have:
$$ALG_{CA'} \ge \sum_{i < j} d(i,j) + (1/3 + \delta) (1 - \min_{i < j} d(i,j)) - 1$$
Using different notation:
$$x < y < z \le 1$$
$$ALG_{CA'} = x + z, OPT_{CA'} = y + z, ALG_{MW} = 1- y, OPT_{MW} = 1 - x$$
$$ALG_{MW} \ge (1/3 + \delta) OPT_{MW}$$
$$ALG_{MW} = 1 - y \ge (1/3 + \delta)(1 - x)$$
I am worried that $y = z$ and $x = 0$.
With $x = 0$ this becomes:
$$1 - y \ge 1/3 + \delta$$
# First meeting notes
MW objective is to maximize:
$$\sum_{i,j} \rho(i,j) (n - |LCA(i,j)|)$$
Cohen-Addad et al. objective is to maximize:
$$\sum_{i,j} d(i,j) |LCA(i,j)|$$
The straightforward conversion is $\rho(i,j) = 1 - d(i,j)$ after normalizing so that $d(i,j) \in [0,1]$.
**Counterexample:** Take triangle with sides at distances $(x, y, 1)$ where $x \le y \le 1$ and hence similarities $(1 - x, 1 - y, 0)$.
The optimal MW objective is $1 - x$ (merge the closest pair first).
The optimal Cohen-Addad objective is $2 x + 3 (1 + y)$.
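A small enumeration over the three possible trees confirms both values (sketch; the concrete $x, y$ with $x \le y \le 1$ are test values):

```python
x, y = 0.2, 0.5                      # triangle sides (x, y, 1) with x <= y <= 1
sides = {"x": x, "y": y, "z": 1.0}   # z is the unit side; rho = 1 - d

def mw_obj(first):
    # MW = sum rho * (3 - |LCA|); only the first-merged pair has |LCA| = 2
    return 1 - sides[first]

def ca_obj(first):
    # CA = sum d * |LCA| = 2 * d_first + 3 * (the other two sides)
    return 2 * sides[first] + 3 * sum(d for s, d in sides.items() if s != first)

assert max(mw_obj(f) for f in sides) == 1 - x                    # merge x first
assert abs(max(ca_obj(f) for f in sides) - (2*x + 3*(1 + y))) < 1e-9
# ratio if the y-pair is merged first instead:
print(ca_obj("y") / max(ca_obj(f) for f in sides))               # ~0.939
```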
Suppose the MW algorithm merges the $y$-pair first and gets $1 - y = \alpha (1 - x)$, so that $y = 1 - \alpha(1-x)$. Then what can we say about the CA-objective?
It gets $2 y + 3(1 + x)$ while the optimum is $2x + 3(1 + y)$, so the ratio is:
$$\frac{2 y + 3x + 3}{2x + 3y + 3} = 1 - \frac{y - x}{2x + 3y + 3}$$
Let's plug in the fact that $(1/3 + \alpha)(1 - x) = 1 - y$ or $y = 1 - (1/3 + \alpha)(1 - x)$. So we have:
$$1 - \frac{1 - (1/3 + \alpha)(1 - x) - x}{2x + 3 + 3 - (1 + 3\alpha)(1 - x)}$$
Redoing this with the parametrization $y = 1 - \alpha(1-x)$ (now writing $\alpha$ for the whole approximation factor):
$$\frac{2 y + 3x + 3}{2x + 3y + 3} = \frac{2(1-\alpha(1-x)) + 3x + 3}{2x + 3(1-\alpha(1-x))+3}$$
$$\frac{5-2\alpha+(3+2\alpha)x}{6-3\alpha+(2+3\alpha)x} = 1 - \frac{(1-\alpha)(1-x)}{6-3\alpha+(2+3\alpha)x}$$
For $\alpha = 0$: $\frac{5}{6} \le 1-\frac{1-x}{6+2x} \le 1$.
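A symbolic check of this simplification (a sketch using sympy):

```python
from sympy import symbols, simplify

x, a = symbols("x alpha")
y = 1 - a * (1 - x)                  # parametrization alpha*(1 - x) = 1 - y
ratio = (2 * y + 3 * x + 3) / (2 * x + 3 * y + 3)
closed = 1 - (1 - a) * (1 - x) / (6 - 3 * a + (2 + 3 * a) * x)
print(simplify(ratio - closed))      # prints 0
```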

**Claim:** $\frac{1}{3}+\alpha$-approx for MW $\Rightarrow$ $\frac{2}{3} + \frac{1}{2}\alpha$-approx for C-A.
We have $(1/3 + \alpha)(1 - x) = 1 - y$ so $y = 1 - (1/3 + \alpha)(1 - x)$.
Taking $\alpha = 1/3$ (i.e., a $\frac23$-approximation for MW):
$$ y = 1 - 2\frac{1-x}{3} = \frac{2x+1}{3}$$
Let's evaluate CA-objective:
$$\frac{2 y + 3x + 3}{2x + 3y + 3} = \frac{3x+3+\frac{2}{3}(2x+1)}{2x + (2x+1) + 3}$$
$$= \frac{(3+\frac{4}{3})x+\frac{11}{3}}{4x+4} = \frac{13x+11}{12(x+1)}$$

Triangle inequality: $x + y \ge 1$!
$$ MW(T) \ge \alpha (1-x) $$
$$ CA(T) \ge (1-\alpha) (3(x+y) -x + 3 ) \ge (1-\alpha)(6-x) $$
$$ MW(T) = n\sum \rho(i,j) - \sum |LCA(i,j)| \rho(i,j)$$