# Lecture 4 - Complexity of Union-Find Structure
###### tags: `Algorithm`
## Union-Find Problem
+ Given a set $\{1, 2, \dots, n\}$ of $n$ elements
+ Initially, each element is in a different set:
+ $\{1\}, \{2\}, \dots, \{n\}$
+ An intermixed sequence of union and find operations is performed
+ A $Union$ operation combines two sets into one
+ Each of the n elements is in exactly one set at any time
+ A $FindSet$ operation identifies the set that contains a particular element
## Disjoint Sets
### Disjoint-Set Data Structure
+ Maintains a collection of disjoint sets
+ Each set has a **representative** element
#### Operations
+ $MakeSet(x)$
+ Given object $x$, create a new set whose only element is the one pointed to by $x$
+ $Find(x)$
+ Given object $x$, return the **representative** of the set containing $x$
+ Note: $Find(i)=Find(j)$ iff $i$ and $j$ are in the same set
+ $Union(x, y)$
+ Given two elements $x$, $y$, merge the disjoint sets containing them
+ The original sets are destroyed
+ The new set has a new representative
#### Application - **Kruskal's algorithm**
```cpp
for (each vertex v in V)
    Makeset(v);
for (each edge (u, v) in E, in increasing order of weight)
    if (Find(u) != Find(v))
        Union(u, v);
```
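
For concreteness, here is a minimal self-contained sketch of this loop on a tiny hypothetical graph (not part of the lecture): edges are stored as `(w, u, v)` triples, and `Find`/`Union` use the up-tree representation with path compression that is developed later in these notes.

```cpp
#include <algorithm>
#include <cstdio>
#include <vector>
using namespace std;

const int MAXN = 100;
int par[MAXN];

int Find(int x){ return x == par[x] ? x : par[x] = Find(par[x]); }
void Union(int x, int y){ par[Find(x)] = Find(y); }

struct Edge { int w, u, v; };

int main(){
    int n = 4;                                    // a tiny hypothetical graph
    vector<Edge> E = {{1,0,1}, {2,1,2}, {3,0,2}, {4,2,3}};
    for(int v = 0; v < n; ++v) par[v] = v;        // Makeset for every vertex
    sort(E.begin(), E.end(),
         [](const Edge& a, const Edge& b){ return a.w < b.w; });
    int total = 0;
    for(const Edge& e : E)
        if(Find(e.u) != Find(e.v)){               // u and v are still in different components
            Union(e.u, e.v);
            total += e.w;                         // the edge joins the spanning tree
        }
    printf("MST weight = %d\n", total);           // expected: 1 + 2 + 4 = 7
    return 0;
}
```

Sorting the edges dominates at $O(|E|\log|E|)$; the cost of the $Find$/$Union$ calls is exactly what the rest of this lecture analyzes.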
### Implementation - Linked Lists
#### Basic Version

```cpp
struct Node{
    Node *rep, *next;   // rep: pointer to the set's representative, next: next node in the list
    int x;              // the element stored in this node
};
```
+ $MakeSet(x)$
+ Create a list with only one node, for $x$
+ Time $O(1)$
+ $Find(x)$
+ Return the pointer to the representative
+ Time $O(1)$
+ $Union(x, y)$
1. Find the representatives $x'$, $y'$ of the sets containing $x$ and $y$
2. Append $y'$ list to $x'$ list
3. Pick $x'$ as the new representative
4. Update all representative pointers in the $y'$ list
+ Time $O(n)$ in the worst case (a code sketch follows below)
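
A minimal sketch of this basic version (not from the lecture), expanding the lecture's `Node` with `tail` and `size` fields that are meaningful only at the representative; they are conveniences assumed here, and `size` is only needed for the weighted rule in the next subsection.

```cpp
struct Node{
    Node *rep, *next;  // rep: the set's representative, next: next node in the list
    Node *tail;        // kept up to date only at the representative: last node of its list
    int size;          // kept up to date only at the representative: length of its list
    int x;             // the element itself
};

Node* Makeset(int x){
    Node* p = new Node{nullptr, nullptr, nullptr, 1, x};
    p->rep = p;        // a singleton list: the node is its own representative
    p->tail = p;
    return p;
}

Node* Find(Node* p){ return p->rep; }   // O(1)

void Union(Node* x, Node* y){
    Node *rx = Find(x), *ry = Find(y);
    if(rx == ry) return;
    rx->tail->next = ry;                // append the y' list to the x' list
    rx->tail = ry->tail;
    rx->size += ry->size;
    for(Node* p = ry; p != nullptr; p = p->next)
        p->rep = rx;                    // O(|y' list|): rewrite every rep pointer
}
```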
#### Improvement
+ Let each representative keep track of the length of its list and always **append the shorter list to the longer one**
##### Theorem: Any sequence of m operations takes $O(m+n\log n)$ time
+ $Makeset$, $Find$: $O(1)$ time each
+ $Union$: $O(\log n)$ representative-pointer updates per element in total
+ If we look at a single element, its representative pointer changes only when it is in the **shorter list** of a merge
+ Thus, each time its pointer changes, the size of the set containing it at least doubles (see the weighted-$Union$ sketch below)
+ Hence an element's pointer can change only $O(\log n)$ times
+ $1, 2, 4, 8, \dots, n$
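
Building on the sketch above, the weighted rule only changes $Union$: compare the two `size` fields and always append the shorter list to the longer one, so only the shorter list's `rep` pointers get rewritten (the name `UnionWeighted` is illustrative).

```cpp
void UnionWeighted(Node* x, Node* y){
    Node *rx = Find(x), *ry = Find(y);
    if(rx == ry) return;
    if(rx->size < ry->size){            // keep rx as the longer (surviving) list
        Node* t = rx; rx = ry; ry = t;
    }
    rx->tail->next = ry;                // append the shorter list to the longer one
    rx->tail = ry->tail;
    rx->size += ry->size;
    for(Node* p = ry; p != nullptr; p = p->next)
        p->rep = rx;                    // each rewritten node now lives in a set at least twice as large
}
```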
### Implementation - Up Trees
#### Basic Version

```cpp
int par[MAXN];
void Makeset(int x){
    par[x] = x;
}
int Find(int x){
    if(x != par[x])
        return Find(par[x]);
    return x;
}
void Union(int x, int y){
    par[Find(x)] = Find(y);
}
```
+ $MakeSet(x)$: $O(1)$
+ $Find(x)$: Worst case is the height of the tree $O(n)$
+ $Union(x, y)$: contains two $Find$ operations, so $O(n)$ in the worst case
#### Union by Weight or Height
+ Height rule: Make tree with **smaller height** a subtree of the other tree (break ties arbitrarily)

+ Weight rule: Make tree with **fewer number of elements** a subtree of the other tree (break ties arbitrarily)

```cpp
int weight[MAXN];  // weight[r]: number of elements in the tree rooted at r
void Makeset(int x){
    par[x] = x;
    weight[x] = 1;
}
void Union(int x, int y){
    int rep_x = Find(x), rep_y = Find(y);
    if(rep_x == rep_y) return;          // already in the same set
    if(weight[rep_x] < weight[rep_y]){  // weight rule: the smaller tree becomes a subtree of the larger
        par[rep_x] = rep_y;
        weight[rep_y] += weight[rep_x];
    }else{
        par[rep_y] = rep_x;
        weight[rep_x] += weight[rep_y];
    }
}
```
##### Theorem: If $Union$ is done by weight, the depth of any element never exceeds $\log n+1$
+ Inductive Proof:
+ Initially, every element is at depth zero
+ When its depth increases as a result of a union operation, it is placed in a tree that becomes at least twice as large as before
+ How many times can an element's depth increase?
+ At most $\log n$ times, because after $\log n$ doublings the tree would contain all $n$ elements
+ Therefore, $Find$ becomes $O(\log n)$ when **$Union$ by weight** is used
#### Path Compression
Each time we do a $Find$ on an element $x$, we make every element on the path from $x$ to the root an immediate child of the root, by setting each such element's parent pointer to the representative.

```cpp
int Find(int x){
    if(x != par[x])
        return par[x] = Find(par[x]);   // point x (and everything above it) directly at the root
    return x;
}
```
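
One practical caveat (not from the lecture): the recursive $Find$ above may overflow the call stack on a very deep, not-yet-compressed tree. A possible iterative two-pass variant, with the illustrative name `FindIter`:

```cpp
int FindIter(int x){
    int root = x;
    while(root != par[root])   // first pass: walk up to the representative
        root = par[root];
    while(x != root){          // second pass: point every node on the path directly at the root
        int next = par[x];
        par[x] = root;
        x = next;
    }
    return root;
}
```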
#### Union by Rank + Path Compression
+ Rank
+ **Upper bound** on height of tree
+ $rank[a]$ for every node $a$ in the tree
+ Initially it is zero
+ It can only increase while the node is still a root, namely when two roots of equal rank are linked
```cpp
int par[MAXN], rank[MAXN];
void Makeset(int x){
    par[x] = x;
    rank[x] = 0;
}
void Link(int x, int y){      // x and y must be roots
    if(x == y) return;        // already in the same set
    if(rank[x] > rank[y]){
        par[y] = par[x];      // par[x] == x since x is a root
    }else{
        par[x] = par[y];
        if(rank[x] == rank[y])
            rank[y]++;        // the only place where a rank increases
    }
}
int Find(int x){
    if(x != par[x])
        return par[x] = Find(par[x]);   // path compression
    return x;
}
void Union(int x, int y){     // 2 Find + 1 Link
    Link(Find(x), Find(y));
}
```
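
A small usage sketch (assuming `MAXN` is defined before the arrays, e.g. `const int MAXN = 100;`, and that this `main` is appended to the code block above; the element values are arbitrary):

```cpp
#include <cstdio>

int main(){
    for(int i = 0; i < 6; ++i) Makeset(i);
    Union(0, 1);
    Union(2, 3);
    Union(1, 3);                         // now {0,1,2,3} form one set; {4} and {5} stay singletons
    printf("%d\n", Find(0) == Find(3));  // prints 1: same representative
    printf("%d\n", Find(0) == Find(4));  // prints 0: different sets
    return 0;
}
```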
## Analysis (Union by Rank + Path Compression)
### Basic observation
+ For every node $x$, $rank[x]$ can only increase
+ When $x$ becomes non-root, $rank[x]$ will not change any more
+ If $y$ is the parent of $x$, $rank[y]>rank[x]$
+ When $x$ becomes the child of $y$, $rank[x]\leq rank[y]$, and if $rank[x]=rank[y]$, $rank[y]$ will increase
+ After that, only $rank[y]$ can increase
+ A rank-$r$ node can only be formed by linking two rank-$(r-1)$ roots
+ Rank is bounded by $O(\log n)$
+ Number of rank-$r$ nodes is $O(n/2^r)$ (a short justification follows below)
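
A short justification of the last two bounds (a sketch): when a root first reaches rank $r$ it has just been formed by linking two rank-$(r-1)$ roots, so by induction its subtree contains at least $2^r$ nodes, and those nodes stay below it afterwards; since ranks strictly increase along any path, no node can lie below two different rank-$r$ nodes, hence
$$\#\{x\mid rank[x]=r\}\leq \frac{n}{2^r}\qquad\text{and}\qquad rank[x]\leq\log_2 n \text{ for every } x.$$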
### Analysis(1) - Straightforward Observation
#### **Theorem**: For a non-root node $x$, consider the gap $rank[par[x]]-rank[x]$; after the path compression done by a $Find$, at most $\log_2n$ nodes on the $Find$ path do **not** have this gap multiplied by at least $2$
+ Originally $C=par[D]$, where $A$ is the root of the $Find$ path

+ If $rank[A]-rank[C]\geq rank[C]-rank[D]$

+ Then after path compression, $par[D]\leftarrow A$, so
+ $rank[A]-rank[D]\geq 2(rank[C]-rank[D])$
+ If $rank[A]-rank[C]<rank[C]-rank[D]$

+ Then $rank[A]-rank[D]>2(rank[A]-rank[C])$
+ Since $rank[A]-rank[x]\leq\log n$ for every node $x$ on the path
+ The second case can happen only $O(\log n)$ times on a single path
#### Corollary: $Find$ takes $O(\log n)$ amortized time
+ For nodes that do **not** have their $rank[par[x]]-rank[x]$ multiplied by at least $2$, charge the cost to this $Find$ operation: $O(\log n)$ per operation
+ For nodes that do have their $rank[par[x]]-rank[x]$ multiplied by at least $2$, charge the cost to the node itself: since the gap is at most $\log n$, this happens $O(\log n)$ times per node over the whole sequence
### Analysis(2) - Potentials
#### Define: $$\phi(x):= \begin{cases} \log_2n-\lfloor \log_2(rank[par[x]]-rank[x])\rfloor,&\text{if $x$ is a non-root node}\\\log_2n,&\text{if $x$ is a root node}\end{cases}$$
#### Theorem: $\phi(x)\geq 0$, and $\phi(x)$ never increases
+ During path compression on a path with $s$ nodes, at least $s-\log_2n$ nodes $x$ have their $rank[p(x)]-rank[x]$ multiplied by at least $2$
+ $\to$ their $\phi(x)$ decreases by $\geq 1$
#### Amortized Analysis
+ $Makeset(x)$: $\phi(x)=\log_2n$, amortized time is $O(\log n)$
+ $Link(x, y)$: no $\phi$ increases, so the amortized time is $O(1)$
+ $Find(x)$ with $s$ nodes visited
+ Actual time: $O(s)$
+ Decrease in potential: $\geq s-O(\log n)$
+ Amortized time: $O(\log n)$
+ Amortized time $=$ Actual time $+$ change of potential
+ Since the sum of potentials is $0$ at the beginning and potentials remain nonnegative all the time
+ Thus, real total time $=$ total amortized time $-$ final potentials
+ So total time is at most the total amortized time
### Analysis(3) - Recurrence
Let $T(m,r)$ be the worst-case time needed for a sequence of $m$ operations in which all ranks are bounded by $r$

+ Let $s$ be a rank threshold: call nodes with rank $<s$ **low rank** and nodes with rank $\geq s$ **high rank**
+ For the part of each $Find$ path that stays within low-rank nodes

+ Total time for low rank: $T(m,s)$
+ For the part of each $Find$ path that lies within high-rank nodes

+ Number of high rank nodes $\leq m/2^s$
+ Total time for high rank: $m/2^s\times r+O(m)$
+ For a node $x$ with $rank[x]<s$, its parent can switch from low rank to high rank only once

+ Total time $O(m)$
+ On every $Find$ path, there is at most one node $x$ such that
+ $x$ is low rank but its parent is high rank
+ Total time $O(m)$
+ Thus $T(m,r)\leq T(m,s)+O(m/2^s\times r+m)$
+ Choose $s=\log r$ so $T(m,r)\leq T(m,\log r)+O(m)$
+ Thus, $T(m,r)=O(m\log^*r)$ (the unrolling sketched after this list shows where the $\log^*$ comes from)
+ Using this bound, we can rewrite the recurrence as $T(m,r)\leq T(m,s)+O(m/2^s\times \log^*r+m)$, since it holds for $s=0$
+ Choose $s=\log^*r$, so $T(m,r)\leq T(m,\log^*r)+O(m)$
+ Thus, $T(m,r)=O(m\log^{**}r)$
+ $\log^{**}r$: the number of times $\log^*$ must be applied to bring $r$ below $2$
+ Continuing this process, we get $T(m,r)=O(m\log^{*^c}r)$ for any constant $c$
+ $\log^{*^c}n=\log^{\overbrace{**\cdots*}^{c\text{ times}}}n$
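
To see where the $\log^*$ in the first bound comes from, unroll the recurrence (a sketch), using $T(m,O(1))=O(m)$:
$$T(m,r)\leq T(m,\log r)+cm\leq T(m,\log\log r)+2cm\leq\cdots\leq T(m,O(1))+cm\log^*r=O(m\log^*r),$$
since $\log^*r$ is by definition the number of times $\log$ must be applied to bring $r$ down to a constant.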
### Problem on Analysis
The algorithm doesn’t change
+ Different analyses give different time bounds
+ What’s its real time bound (for $n$ elements and $m$ operations)?
## Ackermann function
### Definition
$A(m,n)$, where $m, n \in \Bbb N$, is defined such that $$\begin{cases}
A(0,n)=n+1, & \text{$n\geq1$} \\
A(m,0)=A(m-1,1), & \text{$m>0$} \\
A(m,n)=A(m-1,A(m,n-1)), & \text{$m,n>0$}
\end{cases}$$
+ Rewriting the function, for $n\geq 1$: $$A_m(n):=\begin{cases} n+1, & \text{if $m = 0$}\\ A_{m-1}^{(n+1)}(1), & \text{if $m \geq 1$} \end{cases}$$
+ m is called the **level** of the function
+ n is called the **iteration**
+ The function $f(n)=A(n,n)$ grows much faster than polynomials or exponentials or any function that you can imagine
+ $A(0,n)=n+1$
+ $A(1,n)=2+(n+3)-3$
+ $A(2,n)=2\times (n+3)-3$
+ $A(3,n)=2^{n+3}-3$
+ $A(4,n)=\underbrace{2^{2^{2^{...2}}}}_{n+3\ terms}-3$
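
A tiny sketch that evaluates this definition directly (not part of the lecture; safe only for very small arguments, roughly $m\leq 3$ and $n\leq 3$, since anything larger explodes):

```cpp
#include <cstdio>

// Naive recursion following the three-case definition above.
long long A(long long m, long long n){
    if(m == 0) return n + 1;             // A(0, n) = n + 1
    if(n == 0) return A(m - 1, 1);       // A(m, 0) = A(m - 1, 1)
    return A(m - 1, A(m, n - 1));        // A(m, n) = A(m - 1, A(m, n - 1))
}

int main(){
    for(int m = 0; m <= 3; ++m)
        for(int n = 1; n <= 3; ++n)
            printf("A(%d,%d) = %lld\n", m, n, A(m, n));
    // The last line, A(3,3) = 61, matches the closed form 2^(3+3) - 3.
    return 0;
}
```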

### Inverse of the Ackermann function
+ **Define**: $\alpha(n)=\min\{k \mid A_k(1)\geq n\}$
+ The inverse function grows very slowly
+ $\alpha(n)\leq 5$ for every $n\leq A_5(1)$, where $A_5(1)\gg 10^{80}$ (a quick computation follows below)
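
With the rewritten definition above, the first few values of $A_k(1)$ can be checked against the closed forms listed earlier (a quick computation):
$$A_0(1)=2,\quad A_1(1)=3,\quad A_2(1)=5,\quad A_3(1)=2^{4}-3=13,\quad A_4(1)=A_3(13)=2^{16}-3=65533,$$
so $\alpha(n)\leq 4$ already for every $n\leq 65533$, while $A_5(1)=A_4(65533)$ is an exponential tower of height about $2^{16}$.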
## Tight Time Complexity Analysis
### **Theorem**(*Tarjan and Van Leeuwen*): Let $T(n,m)$ be the maximum time required to process any intermixed sequence of $m$ $Find$s and $n$ $Makeset$s and $Union$s. $T(n,m) = O(m \alpha(n))$
#### Property of Rank
+ For every non-root node $x$, $rank[x]<rank[par[x]]$
+ As we follow the path from any node to the root, the node ranks strictly increase
+ Every node has rank at most $\log n$
#### $O(m\alpha(n))$ Bound Proof
+ Using amortized analysis
+ Using $Link$ instead of $Union$
+ Convert a sequence $S'$ of $m'$ $Makeset$, $Union$, and $Find$ operations into a sequence $S$ of $m$ $Makeset$, $Link$, and $Find$ operations by turning each $Union$ into two $Find$s and one $Link$; if $S$ runs in $O(m\alpha(n))$ time, then $S'$ runs in $O(m'\alpha(n))$ time
+ $m'\leq m\leq 3m'\to m=O(m')$
#### Potential Function
+ For each node $x$, assign a potential function $\phi_q(x)$ after $q$ operations
+ For the entire forest, $\Phi_q=\sum_x\phi_q(x)$
+ $\Phi_0=0$ in the beginning
+ $\Phi_q$ will never be negative
+ $$\phi_q(x)=\begin{cases} \alpha(n)rank[x], & \text{if $x$ is a root or $rank[x]=0$}\\ (\alpha(n)-level(x))rank[x]-iter(x), & \text{otherwise} \end{cases}$$
+ $level(x)=max\{k|rank[par[x]]\geq A_k(rank[x])\}$
+ $iter(x)=max\{i|rank[par[x]]\geq A_{level(x)}^{(i)}(rank[x])\}$
+ When $rank[x]>0$
+ $0\leq level(x)< \alpha(n)$ since
+ $rank[par[x]]\geq rank[x]+1=A_0(rank[x])$
+ $A_{\alpha(n)}(rank[x])\geq A_{\alpha(n)}(1)\geq n>rank[par[x]]$
+ $1\leq iter(x)\leq rank[x]$ since
+ $rank[par[x]]\geq A_{level(x)}(rank[x])=A_{level(x)}^{(1)}(rank[x])$
+ $A_{level(x)}^{(rank[x]+1)}(rank[x])\geq A_{level(x)}^{(rank[x]+1)}(1)=A_{level(x)+1}(rank[x])>rank[par[x]]$
+ Relation among $rank[par[x]]$, $level(x)$ and $iter(x)$
+ Since $rank[par[x]]$ monotonically increases over time, $iter(x)$ can increase while $level(x)$ stays unchanged
+ When $iter(x)$ would reach $rank[x]+1$, since $A_{level(x)}^{(rank[x]+1)}(rank[x])\geq A_{level(x)}^{(rank[x]+1)}(1)=A_{level(x)+1}(rank[x])$, $level(x)$ increases and $iter(x)$ goes back to $1$
+ Thus, $\phi_q(x)$ can only decrease (never increase) as $rank[par[x]]$ increases

#### Lemma: For every node $x$, and for all $q$, $0\leq \phi_q(x)\leq \alpha(n)rank[x]$
+ If $x$ is root or $rank[x]=0$, then correct by definition
+ Suppose $x$ is not root and $rank[x]>0$
+ $\phi_q(x)=(\alpha(n)-level(x))rank[x]-iter(x)\geq (\alpha(n)-(\alpha(n)-1))rank[x]-rank[x]=0$
+ $\phi_q(x)=(\alpha(n)-level(x))rank[x]-iter(x)\leq (\alpha(n)-0)rank[x]-1<\alpha(n)rank[x]$
#### Theorem: The amortized cost of each $Makeset$ operation is $O(1)$
+ Create a single node $x$ with $rank[x]=0$, so $\phi_q(x)=0$
+ No other change to the forest, so $\Phi_{q-1}=\Phi_q$ $\to$ Amortized cost $=O(1)$
#### Lemma: Let $x$ be a node that is not a root, and suppose $q$th operation is either $Link$ or $Find$. Then after the $q$th operation, $\phi_q(x)\leq\phi_{q-1}(x)$. Moreover, if $rank[x]\geq 1$ and either $level(x)$ or $iter(x)$ changes due to the $q$th operation, then $\phi_q(x)\leq\phi_{q-1}(x)-1$
+ $x$ not root $\to$ $rank[x]$ not change
+ If $rank[x]=0$, then $\phi_q(x)=\phi_{q-1}(x)=0$
+ Suppose $rank[x]>0$
+ If $level(x)$ does not change
+ If $iter(x)$ does not change, $\phi_q(x)=\phi_{q-1}(x)$, since every term stays the same
+ If $iter(x)$ increases, $\phi_q(x)$ decreases by at least $1$
+ If $level(x)$ increases, then $(\alpha(n)-level(x))rank[x]$ drops by at least $rank[x]$
+ Even if $iter(x)$ drops (back to $1$), the $-iter(x)$ term increases by at most $rank[x]-1$, so $\phi_q(x)$ still drops by at least $rank[x]-(rank[x]-1)=1$
+ Thus, $\phi_q(x)\leq\phi_{q-1}(x)-1$
#### Theorem: The amortized cost of each $Link$ is $O(\alpha(n))$
+ The actual cost of a $Link$ operation is $O(1)$
+ Consider potential change
+ Three kinds of nodes: $x$, $y$, and the old children of $y$
+ The potential of $y$'s old children does not increase
+ For $x$, $\phi_q(x)=(\alpha(n)-level(x))rank[x]-iter(x)\leq\alpha(n)rank[x]=\phi_{q-1}(x)$
+ For $y$, $rank[y]$ may stay the same or increase by $1$, so $\phi_q(y)=\alpha(n)rank[y]=\phi_{q-1}(y)$ or $\phi_{q-1}(y)+\alpha(n)$
+ Thus the potential increase due to the $Link$ operation is at most $O(\alpha(n))$
+ Thus, the amortized cost is $O(1)+O(\alpha(n))=O(\alpha(n))$
#### Lemma: On a $Find$ path, at most $\alpha(n)+2$ nodes have no proper ancestor on the path with the same $level$
+ Pigeonhole principle: $0\leq level(x)<\alpha(n)$, so for each of the $\alpha(n)$ possible $level$ values only the topmost node with that value can lack a same-$level$ ancestor; the $+2$ counts the root and the leaf

#### Lemma: For every node $x$ on the $Find$ path, $\phi(x)$ decreases after path compression if there exists a proper ancestor $y$ of $x$ on the path such that $level(x)=level(y)$
+ Let $k=level(x)=level(y)$
+ $rank[par[y]]\geq A_k(rank[y])\geq A_k(rank[par[x]])\geq A_k(A_k^{(iter(x))}(rank[x]))=A_k^{(iter(x)+1)}(rank[x])$
+ After path compression, $x$'s new parent is the root, whose rank is at least $rank[par[y]]$, so $rank[par[x]]\geq A_k^{(iter(x)+1)}(rank[x])$
+ When $iter(x)$ has already reached $rank[x]$, $level(x)$ will increase instead
+ This means that either $iter(x)$ increases or $level(x)$ increases
+ $\to\phi_q(x)\leq\phi_{q-1}(x)-1$
#### Theorem: The amortized cost of each $Find$ operation is $O(\alpha(n))$
+ Suppose there are $s$ nodes in the $Find$ path
+ The actual cost of $Find$ is $O(s)$
+ Root’s potential doesn't change and no other node’s potential increases
+ At least $max(0,s-(\alpha(n)+2))$ nodes on the find path have their potential decrease by at least $1$
+ Thus, the amortized cost is at most $O(s)-(s-(\alpha(n)+2))=O(\alpha(n))$
### Is it optimal?
+ *Robert Tarjan (1975)* gave the proof of the $O(m\alpha(n))$ running time
+ He also showed that $\Omega(m\alpha(n))$ is a lower bound for this **Union-Find** algorithm
+ *Fredman and Saks (1989)* showed that $\Omega(\alpha(n))$ amortized word operations is a lower bound for the problem itself, that is, for any disjoint-set data structure