# Note of Bottom-up Heapsort (English Version)
> <u>[BOTTOM-UP-HEAPSORT, a new variant of HEAPSORT beating, on an average, QUICKSORT (if n is not very small)](https://www.sciencedirect.com/science/article/pii/030439759390364Y)</u>
## Number of Comparisons of Sorting Algorithms
### General Comparison Sort
#### Worst Case Analysis
For an input of $n$ *different* entries, there are $n!$ permutations, **only 1 of which is the sorted list**. If an algorithm always finishes after $f(n)$ steps (comparisons), it can distinguish at most $2^{f(n)}$ different permutations, due to the fact that **each comparison has only 2 outcomes**, each eliminating at most half of the possibilities.
Therefore, the algorithm must satisfy $2^{f(n)}\geq n!$, or $f(n)\geq\log_2{n!}$.
Since the number of comparisons is definitely a whole number, we can take the worst-case lower bound as $\big\lceil\log{n!\ }\big\rceil$, then use [Stirling's Formula](https://www.wikiwand.com/en/Stirling%27s_approximation) to simplify: $$\begin{align}
&&\big\lceil\ln{n!\ }\big\rceil&=n\ln{n}-n+\Theta(\ln{n}) \\
&\Rightarrow&\Big\lceil\frac{\log_2{n!}}{\log_2{e}}\Big\rceil&= n\frac{\log_2{n}}{\log_2{e}}-\frac{\log_2{e^n}}{\log_2{e}}+\Theta\Big(\frac{\log_2{n}}{\log_2{e}}\Big)\\
&\Rightarrow&\lceil\log_2{n!}\rceil&=n\log_2{n}-n\log_2{e}+\Theta(\log_2{n}) \\
&&&\approx n\log_2{n}-1.4427n \\
\end{align}$$
When $n$ is big enough, we can ignore the error $\Theta(\log_2{n})$.
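As a quick numeric sanity check (a small Python sketch of mine, not from the paper), we can compare the exact lower bound $\lceil\log_2{n!}\rceil$ against the approximation $n\log_2{n}-1.4427n$:

```python
import math

def exact_lower_bound(n):
    # ceil(log2(n!)): the information-theoretic lower bound on comparisons
    return math.ceil(math.log2(math.factorial(n)))

def stirling_estimate(n):
    # n*log2(n) - n*log2(e), dropping the Theta(log n) error term
    return n * math.log2(n) - n * math.log2(math.e)

for n in (10, 50, 100):
    print(n, exact_lower_bound(n), round(stirling_estimate(n), 1))
```

For $n=100$ the exact bound is $525$ while the estimate gives about $520.1$; the gap stays within the stated $\Theta(\log_2{n})$ error term.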
### Merge Sort
Requires at most $n\log{n}-n+1$ comparisons, close to the lower bound, but needs an additional array of length $n$; hence merge sort is mainly useful for external sorting.
### Insertion Sort
Requires only ==at most== $\log{n!}+n-1$ comparisons. However, the average number of nonessential operations (e.g., swapping) is bounded by $\Theta\big(n^2\big)$, which is much higher than the $\text{O}\big(n\log{n}\big)$ of the other sorting algorithms we consider useful.
### Quick Sort
(Not that important.)
----
----
<br/>
## Algorithm of Heap Sort / Bottom-Up Heap Sort
:::success
+ **BOLD-CAPITALIZED-DASHED-FONTS** $\rightarrow$ method names / keywords (e.g., **FOR**)
+ `monospace fonts` $\rightarrow$ variables
+ *ITALIC-CAPITALIZED-UNDERSCORED-FONTS* $\rightarrow$ other methods called within methods
:::
### Basic Heap Sort Analysis
First, we define the following to perform a heap sort.
+ **LEFT(i)**
The left child of element at index ***i***.
+ **RIGHT(i)**
The right child of element at index ***i***.
+ **HEAPIFY(A, i)**
Assuming that in tree ***A***, both subtrees rooted at ***LEFT(i)*** and ***RIGHT(i)*** already satisfy the heap property, reorder so that the subtree rooted at ***i*** also satisfies it.
The shift of ***i*** is performed at most ***h*** times, where ***h***$=\log(n)$ is the height of the subtree rooted at ***i***; before each shift, at most ***2*** comparisons are required.
+ **BUILD-HEAP(A)**
Reorder tree ***A*** so that its elements satisfy the heap property. That is, call *HEAPIFY* on the $\lfloor\frac{n}{2}\rfloor$-th node down to the 1st node.
Then, to perform heap sort, we just need to perform the following procedure:
:::info
**HEAPSORT(A):**
*BUILD-HEAP(A)*
**for** `i` = `A.length` **downto** 2
swap `A[i]` with `A[1]`
`A.heapsize` = `A.heapsize` - 1
*HEAPIFY(A, 1)*
:::
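As a concrete reference, the pseudocode above can be sketched in runnable Python (an illustrative translation of mine: 1-indexed via a dummy slot at `A[0]`, building a max-heap so the output is ascending; the min-heap variants used later in this note simply flip the comparisons):

```python
def heapify(A, i, heapsize):
    # Sink A[i] until the subtree rooted at i is a max-heap,
    # assuming both of its subtrees already are.
    while 2 * i <= heapsize:
        left, right = 2 * i, 2 * i + 1
        largest = left
        if right <= heapsize and A[right] > A[left]:
            largest = right
        if A[i] >= A[largest]:
            break
        A[i], A[largest] = A[largest], A[i]
        i = largest

def heapsort(data):
    A = [None] + list(data)          # dummy slot so children of i are 2i, 2i+1
    n = len(data)
    for i in range(n // 2, 0, -1):   # BUILD-HEAP: heapify floor(n/2) down to 1
        heapify(A, i, n)
    for heapsize in range(n, 1, -1): # repeatedly move the max behind the heap
        A[1], A[heapsize] = A[heapsize], A[1]
        heapify(A, 1, heapsize - 1)
    return A[1:]
```

Note that `heapify` spends up to 2 comparisons per level while sinking, which is exactly the cost the bottom-up variant attacks.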
### Bottom-Up Variation
Recall that when reheaping, every level of shifting requires 2 comparisons to determine which way to go. Furthermore, since **most nodes of a tree are located near the leaves**, we can assume that almost every reheap goes all the way down to a leaf.
Thus, there would be a significant difference if we could eliminate 1 of the 2 comparisons required.
### Special Leaf & Special Path
To achieve the above goal, we first find the final location of ***i*** (as in ***REHEAP(i)***), then modify the heap.
From ***i***, we find the ==**Special Path**== by comparing the two children and choosing the one that ***i*** should be exchanged with, until we reach a leaf, which we call the ==**Special Leaf**==.
:::info
**LEAF-SEARCH(A, i):** (if building min-heap)
`j` = `i`
**while** 2 * `j` < `A.heapsize` **do**
**if** `A[2 * j]` < `A[2 * j + 1]`
`j` = 2 * `j`
**else**
`j` = 2 * `j` + 1
**if** 2 * `j` = `A.heapsize`
`j` = 2 * `j`
**return** `j`
:::
Following it, we search from the special leaf up to where ***i*** is supposed to land. Writing the special path as `b[1]...b[k]...b[h]`, the required number of comparisons is
<span style="position: absolute; left: 50%; transform: translateX(-50%);">$1\times(k-1)+2\times(h-k)$</span>
comparisons.
:::info
**BOTTOM-UP-SEARCH(A, i, j):** (if building min-heap)
**while** `A[i]` < `A[j]` **do**
`j` = floor(`j / 2`)
**return** `j`
:::
After finding the position, we must perform the exchange. Writing the special path excluding the root as `b[1]...b[k]...b[h]`, and the root as `x`, the procedure **INTERCHANGE(A, i, j)** performs actions so that:
- `x` takes the position of `b[k]`
- `b[1]` ~ `b[k]` take the positions of their parents
Last, we modify ***HEAPIFY(A, i)*** into the following:
:::info
**BOTTOM-UP-HEAPIFY(A, i)**:
`j` = *LEAF-SEARCH(A, i)*
`j` = *BOTTOM-UP-SEARCH(A, i, j)*
*INTERCHANGE(A, i, j)*
:::
Substituting into the original heapsort algorithm, we get the bottom-up heap sort algorithm.
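Putting the pieces together, here is a runnable Python sketch of the full algorithm (my own max-heap rendition, so *LEAF-SEARCH* follows the larger child and *BOTTOM-UP-SEARCH* climbs while the sinking key is larger; the function names mirror the pseudocode, but the details are assumptions of this sketch):

```python
def leaf_search(A, i, heapsize):
    # Walk the special path from i, always stepping to the larger child.
    j = i
    while 2 * j < heapsize:              # both children exist
        j = 2 * j if A[2 * j] > A[2 * j + 1] else 2 * j + 1
    if 2 * j == heapsize:                # node with a single (left) child
        j = 2 * j
    return j                             # the special leaf

def bottom_up_search(A, i, j):
    # Climb from the special leaf until A[j] is no smaller than the sinking key.
    while A[j] < A[i]:
        j //= 2
    return j

def interchange(A, i, j):
    # A[i] goes to position j; every node on the path from i to j moves up one.
    x = A[i]
    path = []
    k = j
    while k > i:
        path.append(k)
        k //= 2
    for k in reversed(path):             # shift the path up, top-down
        A[k // 2] = A[k]
    A[j] = x

def bottom_up_heapify(A, i, heapsize):
    if 2 * i <= heapsize:                # nothing to do at a leaf
        j = leaf_search(A, i, heapsize)
        j = bottom_up_search(A, i, j)
        interchange(A, i, j)

def bottom_up_heapsort(data):
    A = [None] + list(data)              # dummy slot: children of i are 2i, 2i+1
    n = len(data)
    for i in range(n // 2, 0, -1):       # BUILD-HEAP phase
        bottom_up_heapify(A, i, n)
    for heapsize in range(n, 1, -1):     # sorting phase
        A[1], A[heapsize] = A[heapsize], A[1]
        bottom_up_heapify(A, 1, heapsize - 1)
    return A[1:]
```

Note that `bottom_up_search` spends only 1 comparison per level on the climb, which is where the savings over the classic 2-comparison descent come from.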
----
----
<br/>
## Calculating Number Of Comparisons Of The Bottom-Up Heapsort Algorithm
To calculate the total number of comparisons, we discuss the separate procedures and calculate the number of comparisons each would use.
### 1. Number of Cmp. used by the *LEAF-SEARCH()* calls during the heap creation phase.
The following is an example tree we can calculate with. Notice that
- $k$ represents the number of completely filled levels, which can be calculated as $\lfloor\log_2{(n+1)}\rfloor$.
- $i$ represents the number of nodes at the lowest level, if it is not completely filled.
- $n$ represents the number of nodes we are considering. For the whole tree, $n=2^k-1+i$.
![](https://i.imgur.com/bJ2IK9n.png)
+ When building a heap, the procedure **LEAF-SEARCH()** is called for nodes $1$ to $\lfloor n/2\rfloor$.
#### Worst Case Situation
+ We first calculate the number of comparisons for the nodes in the **fully filled levels** (that is, the <span style="color: orange">**orange nodes**</span>, considering $n = 2^k-1$).
For each $l$ with $1\leq l\leq k-1$, there exist $2^{k-(l+1)}$ nodes, located at level $k - l$, causing $l$ comparisons each. For example, substituting $l=1$ into the formula, we get that on level $3$ (for $k=4$), there are $2^{4-(1+1)}=4$ nodes that each require $1$ comparison to reach the leaf. To sum it up, we calculate $$\displaystyle\sum_{l=1} ^{k-1} \Big(l\cdot 2^{k-l-1}\Big)$$
Considering $S_m$ as the result when $k = m$, we can get the recursive formula$$S_m=2S_{m-1}+m-1$$
By substituting $T_m = S_m+m$, we can derive:$$\begin{align}\\&\Rightarrow & S_m+(m-1) &= 2\cdot(S_{m-1} + (m-1)) \\ &\Rightarrow & T_m-1 &= 2\cdot T_{m-1} \\ \end{align}$$
Then, we try to get the form $T_m+c=2\cdot (T_{m-1}+c)$. Plugging in $T_1=1$ and $T_2=3$ gives $c=1$, resulting in the following equation that we can substitute into itself: $$\begin{align*} \\ T_m+1 &= 2\cdot (T_{m-1}+1) \\ &= 2^2\cdot (T_{m-2}+1) \\ &= 2^3\cdot (T_{m-3}+1) \\ &\ \ \ \ \ \ \ \ \ \ \ \ \vdots \\ &= 2^{m-1}\cdot(T_1+1) = 2^{m-1}\cdot 2 \\ &= 2^m \\ \Rightarrow (S_m+m)+1 &= 2^m \\ \end{align*}$$
Thus, we finally get the following equation: $$S_k=2^k-k-1$$
To express this in terms of $n$, we again substitute $n=2^k-1$, getting the number of comparisons: $$n-\big\lceil\log{(n+1)}\big\rceil$$
+ Next, we calculate for the last $i$ nodes (that is, the <span style="color: skyblue">**blue nodes**</span>).
Every pair of blue nodes (that is, whenever $i\geq 2$) costs each node above them $1$ more comparison; that is, for level $k - r$, $1\leq r\leq k-1$, the blue nodes cause $$\Big\lceil\frac{i-1}{2^r}\Big\rceil$$
additional comparisons, so the procedure requires an additional $$\displaystyle\sum_{r=1}^k\ \Big\lceil\frac{i-1}{2^r}\Big\rceil$$
comparisons in total. Since the formula contains a ceiling function, we try to obtain an upper bound. First, we separate the exact value from the rounding error introduced by the ceiling function, getting: $$\displaystyle\sum_{r=1}^k\frac{i-1}{2^r}+\displaystyle\sum_{r=1}^k\Big(\Big\lceil\frac{i-1}{2^r}\Big\rceil-\frac{i-1}{2^r}\Big)$$
The first term is easy to calculate: factoring out the constant, we get $$(i-1)\displaystyle\sum_{r=1}^k\frac{1}{2^r}$$ which can be easily evaluated into $$(i-1)\cdot\Big(1-\frac{1}{2^k}\Big)$$
For the second term, we use the property that $$\begin{align} \\ \displaystyle\sum_{r=1}^k\Big(\Big\lceil\frac{i-1}{2^r}\Big\rceil-\frac{i-1}{2^r}\Big) &\leq \overbrace{\Big[\Big(1-\frac{1}{2}\Big)+\Big(1-\frac{1}{2^2}\Big)+\dots+\Big(1-\frac{1}{2^k}\Big)\Big]}^{\text{k terms}} \\ &= k-\Big[\frac{1}{2}+\frac{1}{2^2}+\dots+\frac{1}{2^k}\Big] \\ &=k-\Big[1-\frac{1}{2^k}\Big]\end{align}$$
Adding the two terms up results in $$\begin{align} \\ \displaystyle\sum_{r=1}^k\frac{i-1}{2^r}+\displaystyle\sum_{r=1}^k\Big(\Big\lceil\frac{i-1}{2^r}\Big\rceil-\frac{i-1}{2^r}\Big)&\leq (i-1)\cdot\Big(1-\frac{1}{2^k}\Big)+k-\Big[1-\frac{1}{2^k}\Big] \\ &= i-1-\frac{i-1}{2^k}+k-1+\frac{1}{2^k} \\ &= i+k-2+\frac{2-i}{2^k}\\ \end{align}$$
Since $i\geq 2$, we can eliminate the last term, thus obtaining the inequality stated in the paper: $$\displaystyle\sum_{r=1}^k\Big\lceil\frac{i-1}{2^r}\Big\rceil \leq i+k-2$$
<span style="background-color: #ffd7c1; display: block; padding: 2% 2% 0.5% 2%; border-radius: 15px;">
Finally, adding the two values up (recall that the $k$ fully filled levels contain $2^k-1$ nodes and $n=2^k-1+i$), we get: $$2^k-1-k+i-2+k=n-2$$ ==**As the upper bound for the number of comparisons.**==
</span>
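Both counts can be verified numerically; the following Python sketch (mine, not from the paper) brute-forces the orange-node closed form and the blue-node bound:

```python
import math

def orange_comparisons(k):
    # Direct sum: each of the 2^(k-l-1) nodes at level k-l costs l comparisons.
    return sum(l * 2 ** (k - l - 1) for l in range(1, k))

def blue_extra(i, k):
    # Extra comparisons caused by i nodes on the partially filled lowest level.
    return sum(math.ceil((i - 1) / 2 ** r) for r in range(1, k + 1))

for k in range(1, 13):
    # closed form derived above: S_k = 2^k - k - 1
    assert orange_comparisons(k) == 2 ** k - k - 1

for k in range(2, 11):
    for i in range(2, 2 ** k + 1):
        # the paper's bound on the ceiling sum, valid for i >= 2
        assert blue_extra(i, k) <= i + k - 2
        # hence the total is at most n - 2, where n = 2^k - 1 + i
        assert orange_comparisons(k) + blue_extra(i, k) <= (2 ** k - 1 + i) - 2

print("all bounds verified")
```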
#### Best Case Situation
Whatever shape the tree takes, every **LEAF-SEARCH()** runs down to a leaf, so the number of comparisons remains $n-\big\lceil\log{(n+1)}\big\rceil$, identical to the worst case.
However, consider the situation where $n=2^k$; that is, the deepest level contains only 1 single node (only 1 <span style="color:skyblue">**blue node**</span> exists). Although it adds a level to the root-to-leaf path, a single child lets the search fall through without a comparison (there is no sibling to compare against), ==saving $1$ comparison for each affected call==. This happens whenever **the root sits on the path from node $1$ to node $n$**, saving $\big\lceil\log_2{n}\big\rceil-1$ comparisons in total.
<span style="background-color: #ffd7c1; display: block; padding: 2% 2% 0.5% 2%; border-radius: 15px;">
Thus for the best case, the number of comparisions becomes $$\begin{align}&n-\big\lceil\log{(n+1)}\big\rceil - \Big(\big\lceil\log_2{n}\big\rceil-1\Big) \\ =\ &n-\big\lceil\log{(n+1)}\big\rceil - \big\lceil\log_2{n}\big\rceil+1\end{align}$$
==**As the lower-bound for the number of comparisions.**==
</span>
----
### 2. Number of Cmp. used by the *BOTTOM-UP-SEARCH()* calls during the heap creation phase.
In this part, we focus on individual *special paths*; that is, the nodes from the processed root down to the *special leaf*. Since the heapify procedure is done bottom-up, the nodes on the path are already well ordered according to the heap property.
+ The array $b()$ represents the nodes on the special path, excluding the root, starting from $b(1)$.
- Representing the borders of the array (as sentinels), $b(0)=-\infty,\ b(d+1)=\infty$.
+ $x$ represents the root node.
+ $d$ represents the length of the special path, excluding the root.
![](https://i.imgur.com/trd0Kup.png)
The bottom-up search is simple: we start at the position of $b(d)$, then move up until we reach the node $b(j)$ satisfying $b(j)\le x \leq b(j+1)$.
#### Symbols for calculation
Compared with HEAPSORT, the above action requires the following numbers of comparisons:
| Method | When $j = 0$ | When $0 \lt j \lt d$ | When $j=d$ |
| -------: | :-: | :-: | :-: |
| **HEAPSORT** <br> requires $2$ comparisons per level | $2(j+1)$ | $2(j+1)$ | $2d$ |
| **BOTTOM-UP-SEARCH()** <br> requires $1$ comparison per level | $d$ | $d-j+1$ | $d-j+1$ |
Then, we define the following (for a single call):
+ $l_{\text{BUS}}$ represents the number of comparisons used by **BOTTOM-UP-SEARCH()**
+ $l_{\text{HS}}$ represents *half of* the number of comparisons HEAPSORT uses.
The result of $l_{\text{BUS}}+l_{\text{HS}}$ changes according to the conditions:
:::spoiler Calculations
**When $j = 0$**$$\begin{align}\frac{2(j+1)}{2}+d&=(j+1)+d \\ &= (0+1)+d \\ &=d+1\end{align}$$
**When $0 \lt j \lt d$**$$\begin{align}\frac{2(j+1)}{2}+(d-j+1)&=(j+1)+d-j+1 \\ &=d+2\end{align}$$
**When $j=d$**$$\begin{align}\frac{2d}{2}+(d-j+1)&=2d-j+1 \\ &=2d-d+1 \\ &=d+1\end{align}$$
:::
| When $j = 0$ | When $0 \lt j \lt d$ | When $j=d$ |
| :-: | :-: | :-: |
| $d+1$ | $d+2$ | $d+1$ |
#### The comparison number of normal HEAPSORT
> TODO: Read the paper properly; the math is way over my capacity
According to an even older paper, [An Average Case Analysis of Floyd's Algorithm to Construct Heaps](https://core.ac.uk/download/pdf/82135122.pdf), we know that the heap creation used by HEAPSORT (Floyd's algorithm) uses, on average,
+ $(\alpha_{1}+2\alpha_2-2)n+\Theta(\log{n})$ comparisons
+ $(\alpha_1+\alpha_2-2)n+\Theta(\log{n})$ interchanges
Where $\alpha_1 = 1.6066951\dots,\ \ \alpha_2 = 1.1373387\dots$
#### Calculating with mean values
Since the input array is randomly chosen, we use $L_{\text{BUS}}$, $L_{\text{HS}}$ and $D$ to denote the random variables that are the sums of all $l_{\text{BUS}}$, $l_{\text{HS}}$ and $d$ over the $n'=\big\lfloor\frac{n}{2}\big\rfloor$ reheap calls, respectively.
According to the first part of this section, since the number of comparisons of **LEAF-SEARCH()** is exactly the sum of the special path lengths over all roots, we know that:$$E(D)=n+\Theta(\log{n})$$
Additionally, according to the result quoted above, we also know that:$$E(L_{\text{HS}})=(\frac{\alpha_1}{2}+\alpha_2-1)n+\Theta(\log{n})$$
By defining the number of calls with $\ l_{\text{BUS}}+l_{\text{HS}}=d+1$ as $T$, the number of calls with $\ l_{\text{BUS}}+l_{\text{HS}}=d+2$ is $n'-T$. Summing $l_{\text{BUS}}=(d+1\text{ or }d+2)-l_{\text{HS}}$ over all calls, we can calculate:$$\begin{align} L_{\text{BUS}}&=D+2(n'-T)+T\cdot 1-L_{\text{HS}} \\[3mm] \Rightarrow E(L_{\text{BUS}})&=E(D)+2n'-E(T)-E(L_{\text{HS}}) \\[1.5mm]
&=E(D)+2\Big\lfloor\frac{n}{2}\Big\rfloor-E(T)-E(L_{\text{HS}}) \\ &=n+n+(1-\frac{\alpha_1}{2}-\alpha_2)n-E(T)+\Theta(\log{n}) \\ &=(3-\frac{\alpha_1}{2}-\alpha_2)n-E(T)+\Theta(\log{n}) \\\end{align}$$
To obtain the value of $E(T)$, we consider the 2 situations: when $j=0$, and when $j=d$.
+ **Situation $j=0$ —** Means that <u>*the root element is the extremal one within the subtree (the smallest, for a min-heap)*.</u> If the subtree contains $r$ nodes, the probability of a node being in that case is $\frac{1}{r}$.
Consider a fully filled tree: at the 2nd-lowest level, there exist $\frac{n}{4}$ subtrees, each containing 3 nodes; at the 3rd-lowest level, $\frac{n}{8}$ subtrees, each containing 7 nodes. We may consider only a full tree, since a $\Theta(\log{n})$ error is allowed.
In general, at the $h$-th-lowest level there exist $\frac{n}{2^h}$ subtrees, each containing $2^h-1$ nodes; thus the expected count of this situation on that level is $\frac{n}{2^h}\times\frac{1}{2^h-1}$, and we obtain the expected number of cases where $j=0$ as $$\beta n+\Theta(\log{n})\text{ , where }\beta=\displaystyle\sum_{h=2}^{\infty}\frac{1}{2^h(2^h-1)}$$
+ **Situation $j=d$ —** Means that <u>*after the reheap process, the moved root element has $0$ children*</u>. We first define the following symbols, for the **old REHEAP() procedure**:
+ $c$ represents the number of comparisons used.
+ $i$ represents the number of interchanges used (moving how many levels down).
+ $s$ represents the number of children the root element has after the procedure, $s\in \{0,1,2\}$.
Using the old **REHEAP()**, advancing down each level requires 2 comparisons, plus some additional comparisons at the final level, comparing with each of its children. Thus the relationship between the three symbols is: $$c=2i+s$$
To obtain the desired count, we rule out the cases where $s=1$ or $2$.
Due to the heap property, the lowest-level nodes are packed at the smallest indexes, leaving only 1 position where a node can have $s=1$: the $\lfloor\frac{n}{2}\rfloor$-th node. (See the image in the *LEAF-SEARCH()* section for an illustration.) Only the at most $\lfloor\ \log{n}\ \rfloor$ nodes on the path to it can end up there, which is low enough to land within the error range $\Theta(\log{n})$.
For the case $s=2$, we use the random variables $C$, $I$ and $S$ to represent the sums of all $c$, $i$ and $s$, respectively. According to the above results on HEAPSORT's heap creation, $$\begin{align} && E(C) &= E(S) + 2E(I) \\ &\Rightarrow & E(S) &= E(C) - 2E(I) \\ &&&= [(\alpha_1+2\alpha_2-2)n + \Theta(\log{n})]-2\big[(\alpha_1+\alpha_2-2)n+\Theta(\log{n})\big] \\ &&&=(-\alpha_1+2)n+\Theta(\log{n}) \\ \end{align}$$
We can also write the expected value as $\displaystyle\sum_{m\in M}mp_m$, where $m$ ranges over the outcome set $M$ and $p_m$ is the probability of $m$; that is to say, $E(s)=0\cdot p_{s\texttt{=}0}+2\cdot p_{s\texttt{=}2}=2p_{s\texttt{=}2}$, or $E(S)=2p_{s\texttt{=}2}\cdot n'$. Thus, we can calculate as follows: $$\begin{align} p_{s=2}&=\frac{E(S)}{n'}\cdot\frac{1}{2} \\ &= \frac{\big[(-\alpha_1+2)n+\Theta(\log{n})\big]}{\big\lfloor\frac{n}{2}\big\rfloor}\cdot\frac{1}{2} \\ &= \frac{(-\alpha_1+2)n+\Theta(\log{n})}{n} \\ &= (-\alpha_1+2)+\Theta(\frac{\log{n}}{n}) \approx 0.3933049\dots+\Theta(\frac{\log{n}}{n}) \end{align}$$
Therefore, we can obtain $p_{s\texttt{=}0}=1-p_{s\texttt{=}2}=1-[(-\alpha_1+2)+\Theta(\frac{\log{n}}{n})]=\alpha_1-1+\Theta(\frac{\log{n}}{n})\approx0.6066951\dots+\Theta(\frac{\log{n}}{n})$
And the expected number of cases that $j=d$ becomes: $$\begin{align} p_{s\texttt{=}0}\cdot n'=(\frac{\alpha_1}{2}-\frac{1}{2})n+\Theta(\log{n})\\\end{align}$$
<span style="background-color: #ffd7c1; display: block; padding: 2% 2% 0.5% 2%; border-radius: 15px;">
Thus, combining the equations for $E(L_{\text{BUS}})$ and $E(T)$, we can finally get the result $$\begin{align} E(L_{\text{BUS}})&=(3-\frac{\alpha_1}{2}-\alpha_2)n-E(T)+\Theta(\log{n}) \\ &=(3-\frac{\alpha_1}{2}-\alpha_2)n-\Big[\big(\beta n+\Theta(\log{n})\big)+\big[(\frac{\alpha_1}{2}-\frac{1}{2})n+\Theta(\log{n})\big]\Big]+\Theta(\log{n}) \\ &= \big(\frac{7}{2}-\alpha_1-\alpha_2-\beta\big)n+\Theta(\log{n}) \end{align}$$ ==as the expected number of comparisons used in **BOTTOM-UP-SEARCH()** during the heap-creation phase.==
</span>
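The constant $\beta$ converges very quickly, so it can be evaluated with a few lines of Python (a numeric sketch of mine):

```python
def beta(terms=60):
    # beta = sum over h >= 2 of 1 / (2^h * (2^h - 1)); terms shrink like 4^(-h)
    return sum(1 / (2 ** h * (2 ** h - 1)) for h in range(2, terms))

print(beta())
```

which gives $\beta\approx 0.1067$.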
---
### 3. The Number of Comparisons Used in the Heap Creation Phase
Simply adding up the results of the above 2 sections, we get
$$\big(\frac{9}{2}-\alpha_1-\alpha_2-\beta\big)n+\Theta(\log{n})$$
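Plugging in the numeric constants quoted above (a quick sanity check of mine; the digits of $\alpha_1$ and $\alpha_2$ are taken from the cited analysis):

```python
alpha1 = 1.6066951   # constant from the Floyd's-algorithm average-case analysis
alpha2 = 1.1373387
beta = sum(1 / (2 ** h * (2 ** h - 1)) for h in range(2, 60))

coefficient = 4.5 - alpha1 - alpha2 - beta
print(round(coefficient, 4))
```

so the heap-creation phase uses about $1.649n$ comparisons on average.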
---
> TODO: requires an explanation for the content after Lemma 4.4
## References
+ <u>**Paper:**</u> [BOTTOM-UP-HEAPSORT, a new variant of HEAPSORT beating, on an average, QUICKSORT (if n is not very small)](https://www.sciencedirect.com/science/article/pii/030439759390364Y)
+ [Summing with ceiling function](https://math.stackexchange.com/questions/4437074/how-is-the-inequality-displaystyle-sum-r-1k-big-lceil-fraci-12r-big)
+ <u>**Paper:**</u> [An Average Case Analysis of Floyd's Algorithm to Construct Heaps](https://core.ac.uk/download/pdf/82135122.pdf)