# Notes from Nov 30
We are looking at minimizing the objective:
$$\sum_{ij} d_{ij}(1 - \ell_{ij}),$$
where $d_{ij}$ are distances and $\ell_{ij}$ is the fraction of nodes under the LCA of $i$ and $j$.
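To make the objective concrete, here is a minimal Python sketch (my own illustration, not part of the notes) that evaluates $\sum_{i<j} d_{ij}(1 - \ell_{ij})$ for a tree given as nested tuples of leaf indices; each unordered pair is charged once, at its LCA.

```python
def leaves(tree):
    """Set of leaf indices in a tree given as nested tuples, e.g. ((0, 1), 2)."""
    if isinstance(tree, tuple):
        return leaves(tree[0]) | leaves(tree[1])
    return {tree}

def objective(d, tree):
    """Evaluate sum_{i<j} d[i][j] * (1 - ell_ij), where ell_ij is the fraction
    of all leaves lying under the LCA of i and j in the tree."""
    n = len(d)
    total = 0.0
    stack = [tree]
    while stack:
        node = stack.pop()
        if not isinstance(node, tuple):
            continue
        left, right = node
        frac = (len(leaves(left)) + len(leaves(right))) / n  # ell_ij at this node
        # Pairs separated by this node have their LCA exactly here.
        for i in leaves(left):
            for j in leaves(right):
                total += d[i][j] * (1 - frac)
        stack.extend(node)
    return total
```

For example, with three points at mutual distance $1$ and the tree $((0,1),2)$ this returns $1/3$: the pair $(0,1)$ is the only one whose LCA is not the root.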
For starters, let's assume that $d_{ij}$ is a $0$-$1$ metric. Because the distance-$0$ relation is transitive (by the triangle inequality), such a metric is just a collection of disjoint equivalence classes: clusters with all distances $0$ inside and $1$ across. This means we are given a complete $k$-partite (in particular $k$-colorable) graph for some unknown value of $k$ and we are trying to cluster it.
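A small sketch (mine, assuming the metric is given as a $0/1$ matrix `d`) of recovering these equivalence classes; transitivity of the distance-$0$ relation means comparing against one representative per class is enough.

```python
def zero_distance_classes(d):
    """Partition point indices into classes with pairwise distance 0.
    By transitivity it suffices to compare each point against one
    representative of every existing class."""
    classes = []
    for i in range(len(d)):
        for cls in classes:
            if d[i][cls[0]] == 0:
                cls.append(i)
                break
        else:
            classes.append([i])
    return classes
```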
Claim: If there are no two clusters whose sizes sum to $\ge (1 - \epsilon) n$ (for a fixed $\epsilon$), then the random tree gives a constant-factor approximation.
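The notes do not spell out which random tree is meant; one natural reading is the uniformly random recursive split sketched below (my assumption), which produces nested tuples compatible with `objective` above.

```python
import random

def random_tree(points):
    """Recursively split the current set into two non-empty parts by sending
    each point to a uniformly random side; one possible reading of 'random tree'."""
    if len(points) == 1:
        return points[0]
    while True:
        sides = [random.random() < 0.5 for _ in points]
        left = [p for p, s in zip(points, sides) if s]
        right = [p for p, s in zip(points, sides) if not s]
        if left and right:
            return (random_tree(left), random_tree(right))
```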
Question: Suppose the cluster sizes are $n_1 \le n_2 \le \dots \le n_k$; can we lower bound the optimum?
Answer: $\sum_{1 \le i < j < t \le k} n_i n_j n_t$
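For later sanity checks, here is the claimed lower-bound quantity written out in code (standard library only):

```python
from itertools import combinations
from math import prod

def triple_sum(sizes):
    """Sum over 1 <= i < j < t <= k of n_i * n_j * n_t."""
    return sum(prod(c) for c in combinations(sizes, 3))
```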
Do a case analysis on how many clusters have size at least $\gamma n$.
**Case 0:**
Use the random tree.
**Case 1:**
Let the largest cluster's size be $s$.
If we peel off the largest cluster first, then our objective is $\le n \cdot (n-s)^2$ (there are at most $(n-s)^2$ edges between the remaining clusters, and each gets multiplier at most $n$).
If we peel off the largest cluster last, then our objective is $\le (n - s)^2 \cdot n$ (there are at most $n (n - s)$ edges in total, each getting multiplier at most $n-s$).
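A sketch (my illustration) of the peel-off tree used in this case analysis: a caterpillar that splits off the given clusters one at a time, in list order, keeping each cluster intact. The output plugs into `objective` above.

```python
def peel_off_tree(clusters):
    """Caterpillar tree: split off clusters[0] at the root, then clusters[1],
    and so on. Each cluster (a list of point indices) stays together under a
    balanced subtree."""
    def balanced(points):
        if len(points) == 1:
            return points[0]
        mid = len(points) // 2
        return (balanced(points[:mid]), balanced(points[mid:]))

    tree = balanced(clusters[-1])
    for cluster in reversed(clusters[:-1]):
        tree = (balanced(cluster), tree)
    return tree
```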
**Case $\ge 2$:**
(First peel off the small clusters?) Split apart the two large clusters and then go random?
**Probabilistic approach**
Suppose we run a probabilistic process that splits the points randomly into two parts. The clique (zero-distance cluster) sizes are $n_1 \ge n_2 \ge \dots \ge n_k$, and we assign every clique to the first part independently with probability $p_i$, which depends only on the clique's index in this order; the process is thus described by $p_1, p_2, \dots, p_k$.
Let's set $p_1 = 1$ and $p_2 = 0$, and then pick the rest of the values to be $p_i \sim n_i/n_2$.
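A sketch of one round of this process (my reading; in particular the proportionality constant in $p_i \sim n_i/n_2$ is not fixed in the notes, so the code just uses $n_i/n_2$ itself). Clusters are assumed sorted by size, largest first.

```python
import random

def random_split(clusters):
    """One probabilistic split: clusters (lists of point indices) are sorted
    largest-first; cluster 1 always goes to the first part, cluster 2 never
    does, and cluster i >= 3 goes there with probability n_i / n_2."""
    sizes = [len(c) for c in clusters]
    first, second = [], []
    for i, cluster in enumerate(clusters):
        if i == 0:
            p = 1.0
        elif i == 1:
            p = 0.0
        else:
            p = sizes[i] / sizes[1]   # <= 1 since sizes are non-increasing
        (first if random.random() < p else second).append(cluster)
    return first, second
```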
Assume the cluster sizes are $a, b, c$.
* If we peel off cluster $a$ first, the pairs involving $a$ contribute nothing (their LCA is the root), while each of the $bc$ edges between the other two clusters gets multiplier $a$, so we get $a(bc)$.
* With four clusters of sizes $a, b, c, d$, the same computation gives $abc + abd + bcd + acd$, independent of the peel order.
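A quick check (mine) that this peel-off counting reproduces the symmetric triple sum from the Question above, for any peel order; `triple_sum` is the helper sketched earlier.

```python
def peel_value(sizes):
    """Unnormalized objective of a caterpillar that peels clusters in list
    order: a cross pair between two clusters gets, as its multiplier, the
    total size of the clusters peeled strictly before the earlier of the two."""
    total = 0
    for a in range(len(sizes)):
        peeled_before = sum(sizes[:a])
        for b in range(a + 1, len(sizes)):
            total += sizes[a] * sizes[b] * peeled_before
    return total

sizes = [2, 3, 5, 7]                                   # four clusters a, b, c, d
assert peel_value(sizes) == triple_sum(sizes)          # abc + abd + acd + bcd
assert peel_value(list(reversed(sizes))) == triple_sum(sizes)  # order-independent
```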
Let the cluster sizes be $n_1, \ldots, n_k$. Then for any tree which doesn't split the clusters, the objective is
$$\frac 14\sum_{i<j<t} n_i n_j n_t$$
Proof by induction: if we add a new split at the top that peels off a cluster of size $n_{k+1}$, the objective increases by $\frac 14 n_{k+1} \sum_{1 \le i < j \le k} n_i n_j$.
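A numeric spot-check (mine, reusing `objective` from the first sketch) that two differently shaped cluster-respecting trees give the same value, which is the invariance behind the claim:

```python
# Four clusters {0,1}, {2,3}, {4}, {5,6,7}: distance 0 inside, 1 across.
clusters = [[0, 1], [2, 3], [4], [5, 6, 7]]
n = 8
d = [[0 if any(i in c and j in c for c in clusters) else 1 for j in range(n)]
     for i in range(n)]

t_caterpillar = ((0, 1), ((2, 3), (4, (5, (6, 7)))))   # peel clusters one by one
t_paired      = (((0, 1), (2, 3)), (4, ((5, 6), 7)))   # pair clusters up instead
assert abs(objective(d, t_caterpillar) - objective(d, t_paired)) < 1e-9  # same value
```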
**Conclusion**
**Thm** If $d$ is a $0$-$1$ metric, then we can optimize this objective exactly by first recovering the zero-distance clusters (the independent sets of the distance-$1$ graph) and then building an arbitrary tree on top of them.
**Proof strategy**
Decompose the objective over triangles and check that this is optimal for every triangle.
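Putting the helpers above together, a sketch of the procedure the theorem describes: recover the zero-distance clusters and build any tree that keeps each of them intact (here a caterpillar); `zero_distance_classes` and `peel_off_tree` are the earlier sketches.

```python
def exact_tree(d):
    """Exactly optimal tree for a 0-1 metric, per the theorem: recover the
    zero-distance clusters, then build an arbitrary tree on top of them."""
    return peel_off_tree(zero_distance_classes(d))
```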