# Lecture 4 - Complexity of Union-Find Structure
###### tags: `Algorithm`
## Union-Find Problem
+ Given a set $\{1, 2, \dots, n\}$ of $n$ elements
+ Initially, each element is in a different set:
+ $\{1\}, \{2\}, \dots, \{n\}$
+ An intermixed sequence of union and find operations is performed
+ A $Union$ operation combines two sets into one
+ Each of the n elements is in exactly one set at any time
+ A $FindSet$ operation identifies the set that contains a particular element
## Disjoint Sets
### Disjoint-Set Data Structure
+ Maintains a collection of disjoint sets
+ Each set has a **representative** element
#### Operations
+ $MakeSet(x)$
+ Given object $x$, create a new set whose only element is the one pointed to by $x$
+ $Find(x)$
+ Given object $x$, return the **representative** of the set containing $x$
+ Note: $Find(i)=Find(j)$ iff $i$ and $j$ are in the same set
+ $Union(x, y)$
+ Given two elements $x$, $y$, merge the disjoint sets containing them
+ The original sets are destroyed
+ The new set has a new representative
#### Application - **Kruskal's algorithm**
```cpp
for (each vertex v in V)
    Makeset(v);
for (each edge (u, v) in E, in increasing order of weight)
    if (Find(u) != Find(v))
        Union(u, v);
```
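
For concreteness, here is a minimal self-contained sketch of this loop on a tiny hypothetical graph (not part of the lecture): edges are stored as `(w, u, v)` triples, and `Find`/`Union` use the up-tree representation with path compression that is developed later in these notes.

```cpp
#include <algorithm>
#include <cstdio>
#include <vector>
using namespace std;

const int MAXN = 100;
int par[MAXN];

int Find(int x){ return x == par[x] ? x : par[x] = Find(par[x]); }
void Union(int x, int y){ par[Find(x)] = Find(y); }

struct Edge { int w, u, v; };

int main(){
    int n = 4;                                    // a tiny hypothetical graph
    vector<Edge> E = {{1,0,1}, {2,1,2}, {3,0,2}, {4,2,3}};
    for(int v = 0; v < n; ++v) par[v] = v;        // Makeset for every vertex
    sort(E.begin(), E.end(),
         [](const Edge& a, const Edge& b){ return a.w < b.w; });
    int total = 0;
    for(const Edge& e : E)
        if(Find(e.u) != Find(e.v)){               // u and v are still in different components
            Union(e.u, e.v);
            total += e.w;                         // the edge joins the spanning tree
        }
    printf("MST weight = %d\n", total);           // expected: 1 + 2 + 4 = 7
    return 0;
}
```

Sorting the edges dominates at $O(|E|\log|E|)$; the cost of the $Find$/$Union$ calls is exactly what the rest of this lecture analyzes.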
### Implementation - Linked Lists
#### Basic Version

```cpp
struct Node{
    Node *rep, *next;   // rep: pointer to the set's representative, next: next node in the list
    int x;              // the element stored in this node
};
```
+ $MakeSet(x)$
+ Create a list with only one node, for $x$
+ Time $O(1)$
+ $Find(x)$
+ Return the pointer to the representative
+ Time $O(1)$
+ $Union(x, y)$
1. Find the representatives $x'$, $y'$ of the sets containing $x$ and $y$
2. Append $y'$ list to $x'$ list
3. Pick $x'$ as the new representative
4. Update all representative pointers in the $y'$ list
+ Time $O(n)$ in the worst case (a code sketch follows below)
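
A minimal sketch of this basic version (not from the lecture), expanding the lecture's `Node` with `tail` and `size` fields that are meaningful only at the representative; they are conveniences assumed here, and `size` is only needed for the weighted rule in the next subsection.

```cpp
struct Node{
    Node *rep, *next;  // rep: the set's representative, next: next node in the list
    Node *tail;        // kept up to date only at the representative: last node of its list
    int size;          // kept up to date only at the representative: length of its list
    int x;             // the element itself
};

Node* Makeset(int x){
    Node* p = new Node{nullptr, nullptr, nullptr, 1, x};
    p->rep = p;        // a singleton list: the node is its own representative
    p->tail = p;
    return p;
}

Node* Find(Node* p){ return p->rep; }   // O(1)

void Union(Node* x, Node* y){
    Node *rx = Find(x), *ry = Find(y);
    if(rx == ry) return;
    rx->tail->next = ry;                // append the y' list to the x' list
    rx->tail = ry->tail;
    rx->size += ry->size;
    for(Node* p = ry; p != nullptr; p = p->next)
        p->rep = rx;                    // O(|y' list|): rewrite every rep pointer
}
```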
#### Improvement
+ Let each representative keep track of the length of its list and always **append the shorter list to the longer one**
##### Theorem: Any sequence of m operations takes $O(m+n\log n)$ time
+ $Makeset$, $Find$: $O(1)$ time each
+ $Union$: $O(\log n)$ representative-pointer updates per element in total
+ If we look at a single element, its representative pointer changes only when it is in the **shorter list** of a merge
+ Thus, each time its pointer changes, the size of the set containing it at least doubles (see the weighted-$Union$ sketch below)
+ Hence an element's pointer can change only $O(\log n)$ times
+ $1, 2, 4, 8, \dots, n$
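
Building on the sketch above, the weighted rule only changes $Union$: compare the two `size` fields and always append the shorter list to the longer one, so only the shorter list's `rep` pointers get rewritten (the name `UnionWeighted` is illustrative).

```cpp
void UnionWeighted(Node* x, Node* y){
    Node *rx = Find(x), *ry = Find(y);
    if(rx == ry) return;
    if(rx->size < ry->size){            // keep rx as the longer (surviving) list
        Node* t = rx; rx = ry; ry = t;
    }
    rx->tail->next = ry;                // append the shorter list to the longer one
    rx->tail = ry->tail;
    rx->size += ry->size;
    for(Node* p = ry; p != nullptr; p = p->next)
        p->rep = rx;                    // each rewritten node now lives in a set at least twice as large
}
```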
### Implementation - Up Trees
#### Basic Version

```cpp
int par[MAXN];
void Makeset(int x){
    par[x] = x;
}
int Find(int x){
    if(x != par[x])
        return Find(par[x]);
    return x;
}
void Union(int x, int y){
    par[Find(x)] = Find(y);
}
```
+ $MakeSet(x)$: $O(1)$
+ $Find(x)$: Worst case is the height of the tree $O(n)$
+ $Union(x, y)$: contains two $Find$ operations, so $O(n)$ in the worst case
#### Union by Weight or Height
+ Height rule: Make tree with **smaller height** a subtree of the other tree (break ties arbitrarily)

+ Weight rule: Make tree with **fewer number of elements** a subtree of the other tree (break ties arbitrarily)

```cpp
int weight[MAXN];  // weight[r]: number of elements in the tree rooted at r
void Makeset(int x){
    par[x] = x;
    weight[x] = 1;
}
void Union(int x, int y){
    int rep_x = Find(x), rep_y = Find(y);
    if(rep_x == rep_y) return;          // already in the same set
    if(weight[rep_x] < weight[rep_y]){  // weight rule: the smaller tree becomes a subtree of the larger
        par[rep_x] = rep_y;
        weight[rep_y] += weight[rep_x];
    }else{
        par[rep_y] = rep_x;
        weight[rep_x] += weight[rep_y];
    }
}
```
##### Theorem: If $Union$ is done by weight, the depth of any element never exceeds $\log n+1$
+ Inductive Proof:
+ Initially, every element is at depth zero
+ When its depth increases as a result of a union operation, it is placed in a tree that becomes at least twice as large as before
+ How many times can an element's depth increase?
+ At most $\log n$ times, because after $\log n$ doublings the tree would contain all $n$ elements
+ Therefore, $Find$ becomes $O(\log n)$ when **$Union$ by weight** is used
#### Path Compression
Each time we do a $Find$ on an element $x$, we make every element on the path from $x$ to the root an immediate child of the root, by setting each such element's parent pointer to the representative.

```cpp
int Find(int x){
    if(x != par[x])
        return par[x] = Find(par[x]);   // point x (and everything above it) directly at the root
    return x;
}
```
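
One practical caveat (not from the lecture): the recursive $Find$ above may overflow the call stack on a very deep, not-yet-compressed tree. A possible iterative two-pass variant, with the illustrative name `FindIter`:

```cpp
int FindIter(int x){
    int root = x;
    while(root != par[root])   // first pass: walk up to the representative
        root = par[root];
    while(x != root){          // second pass: point every node on the path directly at the root
        int next = par[x];
        par[x] = root;
        x = next;
    }
    return root;
}
```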
#### Union by Rank + Path Compression
+ Rank
+ **Upper bound** on height of tree
+ $rank[a]$ for every node $a$ in the tree
+ Initially it is zero
+ It can only increase while the node is still a root, namely when two roots of equal rank are linked
```cpp
int par[MAXN], rank[MAXN];
void Makeset(int x){
    par[x] = x;
    rank[x] = 0;
}
void Link(int x, int y){      // x and y must be roots
    if(x == y) return;        // already in the same set
    if(rank[x] > rank[y]){
        par[y] = par[x];      // par[x] == x since x is a root
    }else{
        par[x] = par[y];
        if(rank[x] == rank[y])
            rank[y]++;        // the only place where a rank increases
    }
}
int Find(int x){
    if(x != par[x])
        return par[x] = Find(par[x]);   // path compression
    return x;
}
void Union(int x, int y){     // 2 Find + 1 Link
    Link(Find(x), Find(y));
}
```
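
A small usage sketch (assuming `MAXN` is defined before the arrays, e.g. `const int MAXN = 100;`, and that this `main` is appended to the code block above; the element values are arbitrary):

```cpp
#include <cstdio>

int main(){
    for(int i = 0; i < 6; ++i) Makeset(i);
    Union(0, 1);
    Union(2, 3);
    Union(1, 3);                         // now {0,1,2,3} form one set; {4} and {5} stay singletons
    printf("%d\n", Find(0) == Find(3));  // prints 1: same representative
    printf("%d\n", Find(0) == Find(4));  // prints 0: different sets
    return 0;
}
```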
## Analysis (Union by Rank + Path Compression)
### Basic observation
+ For every node $x$, $rank[x]$ can only increase
+ When $x$ becomes non-root, $rank[x]$ will not change any more
+ If $y$ is the parent of $x$, $rank[y]>rank[x]$
+ When $x$ becomes the child of $y$, $rank[x]\leq rank[y]$, and if $rank[x]=rank[y]$, $rank[y]$ will increase
+ After that, only $rank[y]$ can increase
+ A rank-$r$ node can only be formed by linking two rank-$(r-1)$ roots
+ Rank is bounded by $O(\log n)$
+ Number of rank-$r$ nodes is $O(n/2^r)$ (a short justification follows below)
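
A short justification of the last two bounds (a sketch): when a root first reaches rank $r$ it has just been formed by linking two rank-$(r-1)$ roots, so by induction its subtree contains at least $2^r$ nodes, and those nodes stay below it afterwards; since ranks strictly increase along any path, no node can lie below two different rank-$r$ nodes, hence
$$\#\{x\mid rank[x]=r\}\leq \frac{n}{2^r}\qquad\text{and}\qquad rank[x]\leq\log_2 n \text{ for every } x.$$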
### Analysis(1) - Straightforward Observation
#### **Theorem**: For a non-root node $x$, consider the gap $rank[par[x]]-rank[x]$; after the path compression done by a $Find$, at most $\log_2n$ nodes on the $Find$ path do **not** have this gap multiplied by at least $2$
+ Originally $C=par[D]$, where $A$ is the root of the $Find$ path

+ If $rank[A]-rank[C]\geq rank[C]-rank[D]$

+ Then after path compression, $par[D]\leftarrow A$, so
+ $rank[A]-rank[D]\geq 2(rank[C]-rank[D])$
+ If $rank[A]-rank[C]<rank[C]-rank[D]$

+ Then $rank[A]-rank[D]>2(rank[A]-rank[C])$
+ Since $rank[A]-rank[x]\leq\log n$ for every node $x$ on the path
+ The second case can happen only $O(\log n)$ times on a single path
#### Corollary: $Find$ takes $O(\log n)$ amortized time
+ For nodes that do **not** have their $rank[par[x]]-rank[x]$ multiplied by at least $2$, charge the cost to this $Find$ operation: $O(\log n)$ per operation
+ For nodes that do have their $rank[par[x]]-rank[x]$ multiplied by at least $2$, charge the cost to the node itself: since the gap is at most $\log n$, this happens $O(\log n)$ times per node over the whole sequence
### Analysis(2) - Potentials
#### Define: $$\phi(x):= \begin{cases} \log_2n-\lfloor \log_2(rank[par[x]]-rank[x])\rfloor,&\text{if $x$ is a non-root node}\\\log_2n,&\text{if $x$ is a root node}\end{cases}$$
#### Theorem: $\phi(x)\geq 0$, and $\phi(x)$ never increases
+ During path compression on a path with $s$ nodes, at least $s-\log_2n$ nodes $x$ have their $rank[p(x)]-rank[x]$ multiplied by at least $2$
+ $\to$ their $\phi(x)$ decreases by $\geq 1$
#### Amortized Analysis
+ $Makeset(x)$: $\phi(x)=\log_2n$, amortized time is $O(\log n)$
+ $Link(x, y)$: no $\phi$ increases, so the amortized time is $O(1)$
+ $Find(x)$ with $s$ nodes visited
+ Actual time: $O(s)$
+ Decrease in potential: $\geq s-O(\log n)$
+ Amortized time: $O(\log n)$
+ Amortized time $=$ Actual time $+$ change of potential
+ Since the sum of potentials is $0$ at the beginning and potentials remain nonnegative all the time
+ Thus, real total time $=$ total amortized time $-$ final potentials
+ So total time is at most the total amortized time
### Analysis(3) - Recurrence
Let $T(m,r)$ be the worst-case time needed for a sequence of $m$ operations in which all ranks are bounded by $r$

+ Let $s$ be a rank threshold: call nodes with rank $<s$ **low rank** and nodes with rank $\geq s$ **high rank**
+ For the part of each $Find$ path that stays within low-rank nodes

+ Total time for low rank: $T(m,s)$
+ For the part of each $Find$ path that lies within high-rank nodes

+ Number of high rank nodes $\leq m/2^s$
+ Total time for high rank: $m/2^s\times r+O(m)$
+ For a node $x$ with $rank[x]<s$, its parent can switch from low rank to high rank only once

+ Total time $O(m)$
+ On every $Find$ path, there is at most one node $x$ such that
+ $x$ is low rank but its parent is high rank
+ Total time $O(m)$
+ Thus $T(m,r)\leq T(m,s)+O(m/2^s\times r+m)$
+ Choose $s=\log r$ so $T(m,r)\leq T(m,\log r)+O(m)$
+ Thus, $T(m,r)=O(m\log^*r)$ (the unrolling sketched after this list shows where the $\log^*$ comes from)
+ Using this bound, we can rewrite the recurrence as $T(m,r)\leq T(m,s)+O(m/2^s\times \log^*r+m)$, since it holds for $s=0$
+ Choose $s=\log^*r$, so $T(m,r)\leq T(m,\log^*r)+O(m)$
+ Thus, $T(m,r)=O(m\log^{**}r)$
+ $\log^{**}r$: the number of times $\log^*$ must be applied to bring $r$ below $2$
+ Continuing this process, we get $T(m,r)=O(m\log^{*^c}r)$ for any constant $c$
+ $\log^{*^c}n=\log^{\overbrace{**\cdots*}^{c\text{ times}}}n$
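
To see where the $\log^*$ in the first bound comes from, unroll the recurrence (a sketch), using $T(m,O(1))=O(m)$:
$$T(m,r)\leq T(m,\log r)+cm\leq T(m,\log\log r)+2cm\leq\cdots\leq T(m,O(1))+cm\log^*r=O(m\log^*r),$$
since $\log^*r$ is by definition the number of times $\log$ must be applied to bring $r$ down to a constant.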
### Problem on Analysis
The algorithm doesn’t change
+ Different analyses give different time bounds
+ What’s its real time bound (for $n$ elements and $m$ operations)?
## Ackermann function
### Definition
$A(m,n)$, where $m, n \in \Bbb N$, is defined such that $$\begin{cases}
A(0,n)=n+1, & \text{$n\geq1$} \\
A(m,0)=A(m-1,1), & \text{$m>0$} \\
A(m,n)=A(m-1,A(m,n-1)), & \text{$m,n>0$}
\end{cases}$$
+ Rewriting the function, for $n\geq 1$: $$A_m(n):=\begin{cases} n+1, & \text{if $m = 0$}\\ A_{m-1}^{(n+1)}(1), & \text{if $m \geq 1$} \end{cases}$$
+ m is called the **level** of the function
+ n is called the **iteration**
+ The function $f(n)=A(n,n)$ grows much faster than polynomials or exponentials or any function that you can imagine
+ $A(0,n)=n+1$
+ $A(1,n)=2+(n+3)-3$
+ $A(2,n)=2\times (n+3)-3$
+ $A(3,n)=2^{n+3}-3$
+ $A(4,n)=\underbrace{2^{2^{2^{...2}}}}_{n+3\ terms}-3$
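
A tiny sketch that evaluates this definition directly (not part of the lecture; safe only for very small arguments, roughly $m\leq 3$ and $n\leq 3$, since anything larger explodes):

```cpp
#include <cstdio>

// Naive recursion following the three-case definition above.
long long A(long long m, long long n){
    if(m == 0) return n + 1;             // A(0, n) = n + 1
    if(n == 0) return A(m - 1, 1);       // A(m, 0) = A(m - 1, 1)
    return A(m - 1, A(m, n - 1));        // A(m, n) = A(m - 1, A(m, n - 1))
}

int main(){
    for(int m = 0; m <= 3; ++m)
        for(int n = 1; n <= 3; ++n)
            printf("A(%d,%d) = %lld\n", m, n, A(m, n));
    // The last line, A(3,3) = 61, matches the closed form 2^(3+3) - 3.
    return 0;
}
```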

### Inverse of the Ackermann function
+ **Define**: $\alpha(n)=\min\{k \mid A_k(1)\geq n\}$
+ The inverse function grows very slowly
+ $\alpha(n)\leq 5$ for every $n\leq A_5(1)$, where $A_5(1)\gg 10^{80}$ (a quick computation follows below)
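
With the rewritten definition above, the first few values of $A_k(1)$ can be checked against the closed forms listed earlier (a quick computation):
$$A_0(1)=2,\quad A_1(1)=3,\quad A_2(1)=5,\quad A_3(1)=2^{4}-3=13,\quad A_4(1)=A_3(13)=2^{16}-3=65533,$$
so $\alpha(n)\leq 4$ already for every $n\leq 65533$, while $A_5(1)=A_4(65533)$ is an exponential tower of height about $2^{16}$.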
## Tight Time Complexity Analysis
### **Theorem**(*Tarjan and Van Leeuwen*): Let $T(n,m)$ be the maximum time required to process any intermixed sequence of $m$ $Find$s and $n$ $Makeset$s and $Union$s. $T(n,m) = O(m \alpha(n))$
#### Property of Rank
+ For every non-root node $x$, $rank[x]<rank[par[x]]$
+ As we follow the path from any node to the root, the node ranks strictly increase
+ Every node has rank at most $\log n$
#### $O(m\alpha(n))$ Bound Proof
+ Using amortized analysis
+ Using $Link$ instead of $Union$
+ Convert a sequence $S'$ of $m'$ $Makeset$, $Union$, and $Find$ operations into a sequence $S$ of $m$ $Makeset$, $Link$, and $Find$ operations by turning each $Union$ into two $Find$s and one $Link$; if $S$ runs in $O(m\alpha(n))$ time, then $S'$ runs in $O(m'\alpha(n))$ time
+ $m'\leq m\leq 3m'\to m=O(m')$
#### Potential Function
+ For each node $x$, assign a potential function $\phi_q(x)$ after $q$ operations
+ For the entire forest, $\Phi_q=\sum_x\phi_q(x)$
+ $\Phi_0=0$ in the beginning
+ $\Phi_q$ will never be negative
+ $$\phi_q(x)=\begin{cases} \alpha(n)rank[x], & \text{if $x$ is a root or $rank[x]=0$}\\ (\alpha(n)-level(x))rank[x]-iter(x), & \text{otherwise} \end{cases}$$
+ $level(x)=max\{k|rank[par[x]]\geq A_k(rank[x])\}$
+ $iter(x)=max\{i|rank[par[x]]\geq A_{level(x)}^{(i)}(rank[x])\}$
+ When $rank[x]>0$
+ $0\leq level(x)< \alpha(n)$ since
+ $rank[par[x]]\geq rank[x]+1=A_0(rank[x])$
+ $A_{\alpha(n)}(rank[x])\geq A_{\alpha(n)}(1)\geq n>rank[par[x]]$
+ $1\leq iter(x)\leq rank[x]$ since
+ $rank[par[x]]\geq A_{level(x)}(rank[x])=A_{level(x)}^{(1)}(rank[x])$
+ $A_{level(x)}^{(rank[x]+1)}(rank[x])\geq A_{level(x)}^{(rank[x]+1)}(1)=A_{level(x)+1}(rank[x])>rank[par[x]]$
+ Relation among $rank[par[x]]$, $level(x)$ and $iter(x)$
+ Since $rank[par[x]]$ monotonically increases over time, $iter(x)$ can increase while $level(x)$ stays unchanged
+ When $iter(x)$ would reach $rank[x]+1$, since $A_{level(x)}^{(rank[x]+1)}(rank[x])\geq A_{level(x)}^{(rank[x]+1)}(1)=A_{level(x)+1}(rank[x])$, $level(x)$ increases and $iter(x)$ goes back to $1$
+ Thus, $\phi_q(x)$ can only decrease (never increase) as $rank[par[x]]$ increases

#### Lemma: For every node $x$, and for all $q$, $0\leq \phi_q(x)\leq \alpha(n)rank[x]$
+ If $x$ is root or $rank[x]=0$, then correct by definition
+ Suppose $x$ is not root and $rank[x]>0$
+ $\phi_q(x)=(\alpha(n)-level(x))rank[x]-iter(x)\geq (\alpha(n)-(\alpha(n)-1))rank[x]-rank[x]=0$
+ $\phi_q(x)=(\alpha(n)-level(x))rank[x]-iter(x)\leq (\alpha(n)-0)rank[x]-1<\alpha(n)rank[x]$
#### Theorem: The amortized cost of each $Makeset$ operation is $O(1)$
+ Create a single node $x$ with $rank[x]=0$, so $\phi_q(x)=0$
+ No other change to the forest, so $\Phi_{q-1}=\Phi_q$ $\to$ Amortized cost $=O(1)$
#### Lemma: Let $x$ be a node that is not a root, and suppose $q$th operation is either $Link$ or $Find$. Then after the $q$th operation, $\phi_q(x)\leq\phi_{q-1}(x)$. Moreover, if $rank[x]\geq 1$ and either $level(x)$ or $iter(x)$ changes due to the $q$th operation, then $\phi_q(x)\leq\phi_{q-1}(x)-1$
+ $x$ not root $\to$ $rank[x]$ not change
+ If $rank[x]=0$, then $\phi_q(x)=\phi_{q-1}(x)=0$
+ Suppose $rank[x]>0$
+ If $level(x)$ does not change
+ If $iter(x)$ does not change, $\phi_q(x)=\phi_{q-1}(x)$, since every term stays the same
+ If $iter(x)$ increases, $\phi_q(x)$ decreases by at least $1$
+ If $level(x)$ increases, then $(\alpha(n)-level(x))rank[x]$ drops by at least $rank[x]$
+ Even if $iter(x)$ drops (back to $1$), the $-iter(x)$ term increases by at most $rank[x]-1$, so $\phi_q(x)$ still drops by at least $rank[x]-(rank[x]-1)=1$
+ Thus, $\phi_q(x)\leq\phi_{q-1}(x)-1$
#### Theorem: The amortized cost of each $Link$ is $O(\alpha(n))$
+ The actual cost of a $Link$ operation is $O(1)$
+ Consider potential change
+ Three kinds of nodes: $x$, $y$, and the old children of $y$
+ The potential of $y$'s old children does not increase
+ For $x$, $\phi_q(x)=(\alpha(n)-level(x))rank[x]-iter(x)\leq\alpha(n)rank[x]=\phi_{q-1}(x)$
+ For $y$, $rank[y]$ may stay the same or increase by $1$, so $\phi_q(y)=\alpha(n)rank[y]=\phi_{q-1}(y)$ or $\phi_{q-1}(y)+\alpha(n)$
+ Thus the potential increase due to the $Link$ operation is at most $O(\alpha(n))$
+ Thus, the amortized cost is $O(1)+O(\alpha(n))=O(\alpha(n))$
#### Lemma: On a $Find$ path, at most $\alpha(n)+2$ nodes have no proper ancestor on the path with the same $level$
+ Pigeonhole principle: $0\leq level(x)<\alpha(n)$, so for each of the $\alpha(n)$ possible $level$ values only the topmost node with that value can lack a same-$level$ ancestor; the $+2$ counts the root and the leaf

#### Lemma: For every node $x$ on the $Find$ path, $\phi(x)$ decreases after path compression if there exists a proper ancestor $y$ of $x$ on the path such that $level(x)=level(y)$
+ Let $k=level(x)=level(y)$
+ $rank[par[y]]\geq A_k(rank[y])\geq A_k(rank[par[x]])\geq A_k(A_k^{(iter(x))}(rank[x]))=A_k^{(iter(x)+1)}(rank[x])$
+ After path compression, $x$'s new parent is the root, whose rank is at least $rank[par[y]]$, so $rank[par[x]]\geq A_k^{(iter(x)+1)}(rank[x])$
+ When $iter(x)$ has already reached $rank[x]$, $level(x)$ will increase instead
+ This means that either $iter(x)$ increases or $level(x)$ increases
+ $\to\phi_q(x)\leq\phi_{q-1}(x)-1$
#### Theorem: The amortized cost of each $Find$ operation is $O(\alpha(n))$
+ Suppose there are $s$ nodes in the $Find$ path
+ The actual cost of $Find$ is $O(s)$
+ Root’s potential doesn't change and no other node’s potential increases
+ At least $max(0,s-(\alpha(n)+2))$ nodes on the find path have their potential decrease by at least $1$
+ Thus, the amortized cost is at most $O(s)-(s-(\alpha(n)+2))=O(\alpha(n))$
### Is it optimal?
+ *Robert Tarjan (1975)* gave the proof of the $O(m\alpha(n))$ running time
+ He also showed that $\Omega(m\alpha(n))$ is a lower bound for this **Union-Find** algorithm
+ *Fredman and Saks (1989)* showed that $\Omega(\alpha(n))$ amortized word operations is a lower bound for the problem itself, that is, for any disjoint-set data structure