# Note of Bottom-up Heapsort (English Version)
> <u>[BOTTOM-UP-HEAPSORT, a new variant of HEAPSORT beating, on an average, QUICKSORT (if n is not very small)](https://www.sciencedirect.com/science/article/pii/030439759390364Y)</u>
## Number of Comparisons of Sorting Algorithms
### General Comparison Sort
#### Worst Case Analysis
For an input of $n$ *different* entries, there are $n!$ permutations, **only 1 of which is the sorted list**. If an algorithm always finishes after $f(n)$ steps (comparisons), it can distinguish at most $2^{f(n)}$ different permutations, due to the fact that **each comparison has only 2 outcomes**, each eliminating at most half of the possibilities.
Therefore, the algorithm must satisfy $2^{f(n)}\geq n!$, or $f(n)\geq\log_2{n!}$.
Since the number of comparisons is definitely a whole number, we can take the worst-case lower bound as $\big\lceil\log{n!\ }\big\rceil$, then use [Stirling's Formula](https://www.wikiwand.com/en/Stirling%27s_approximation) to simplify: $$\begin{align}
&&\big\lceil\ln{n!\ }\big\rceil&=n\ln{n}-n+\Theta(\ln{n}) \\
&\Rightarrow&\Big\lceil\frac{\log_2{n!}}{\log_2{e}}\Big\rceil&= n\frac{\log_2{n}}{\log_2{e}}-\frac{\log_2{e^n}}{\log_2{e}}+\Theta\Big(\frac{\log_2{n}}{\log_2{e}}\Big)\\
&\Rightarrow&\lceil\log_2{n!}\rceil&=n\log_2{n}-n\log_2{e}+\Theta(\log_2{n}) \\
&&&\approx n\log_2{n}-1.4427n \\
\end{align}$$
When $n$ is big enough, we can ignore the error $\Theta(\log_2{n})$.
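As a quick numeric sanity check (a small Python sketch of mine, not from the paper), we can compare the exact lower bound $\lceil\log_2{n!}\rceil$ against the approximation $n\log_2{n}-1.4427n$:

```python
import math

def exact_lower_bound(n):
    # ceil(log2(n!)): the information-theoretic lower bound on comparisons
    return math.ceil(math.log2(math.factorial(n)))

def stirling_estimate(n):
    # n*log2(n) - n*log2(e), dropping the Theta(log n) error term
    return n * math.log2(n) - n * math.log2(math.e)

for n in (10, 50, 100):
    print(n, exact_lower_bound(n), round(stirling_estimate(n), 1))
```

For $n=100$ the exact bound is $525$ while the estimate gives about $520.1$; the gap stays within the stated $\Theta(\log_2{n})$ error term.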
### Merge Sort
Requires at most $n\log{n}-n+1$ comparisons, close to the lower bound, but needs an additional array of length $n$; hence merge sort is mainly useful for external sorting.
### Insertion Sort
Requires only ==at most== $\log{n!}+n-1$ comparisons. However, the average number of nonessential operations (e.g., swapping) is bounded by $\Theta\big(n^2\big)$, which is much higher than the $\text{O}\big(n\log{n}\big)$ of the other sorting algorithms we consider useful.
### Quick Sort
(Not that important.)
----
----
<br/>
## Algorithm of Heap Sort / Bottom-Up Heap Sort
:::success
+ **BOLD-CAPITALIZED-DASHED-FONTS** $\rightarrow$ method names / keywords (e.g., **FOR**)
+ `monospace fonts` $\rightarrow$ variables
+ *ITALIC-CAPITALIZED-UNDERSCORED-FONTS* $\rightarrow$ other methods called within methods
:::
### Basic Heap Sort Analysis
First, we define the following to perform a heap sort.
+ **LEFT(i)**
The left child of element at index ***i***.
+ **RIGHT(i)**
The right child of element at index ***i***.
+ **HEAPIFY(A, i)**
Assuming that in tree ***A***, both subtrees rooted at ***LEFT(i)*** and ***RIGHT(i)*** already satisfy the heap property, reorder so that the subtree rooted at ***i*** also satisfies it.
The shift of ***i*** is performed at most ***h*** times, where ***h***$=\log(n)$ is the height of the subtree rooted at ***i***; before each shift, at most ***2*** comparisons are required.
+ **BUILD-HEAP(A)**
Reorder tree ***A*** so that its elements satisfy the heap property. That is, call *HEAPIFY* on the $\lfloor\frac{n}{2}\rfloor$-th node down to the 1st node.
Then, to perform heap sort, we just need to perform the following procedure:
:::info
**HEAPSORT(A):**
*BUILD-HEAP(A)*
**for** `i` = `A.length` **downto** 2
swap `A[i]` with `A[1]`
`A.heapsize` = `A.heapsize` - 1
*HEAPIFY(A, 1)*
:::
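As a concrete reference, the pseudocode above can be sketched in runnable Python (an illustrative translation of mine: 1-indexed via a dummy slot at `A[0]`, building a max-heap so the output is ascending; the min-heap variants used later in this note simply flip the comparisons):

```python
def heapify(A, i, heapsize):
    # Sink A[i] until the subtree rooted at i is a max-heap,
    # assuming both of its subtrees already are.
    while 2 * i <= heapsize:
        left, right = 2 * i, 2 * i + 1
        largest = left
        if right <= heapsize and A[right] > A[left]:
            largest = right
        if A[i] >= A[largest]:
            break
        A[i], A[largest] = A[largest], A[i]
        i = largest

def heapsort(data):
    A = [None] + list(data)          # dummy slot so children of i are 2i, 2i+1
    n = len(data)
    for i in range(n // 2, 0, -1):   # BUILD-HEAP: heapify floor(n/2) down to 1
        heapify(A, i, n)
    for heapsize in range(n, 1, -1): # repeatedly move the max behind the heap
        A[1], A[heapsize] = A[heapsize], A[1]
        heapify(A, 1, heapsize - 1)
    return A[1:]
```

Note that `heapify` spends up to 2 comparisons per level while sinking, which is exactly the cost the bottom-up variant attacks.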
### Bottom-Up Variation
Recall that when reheaping, every level of shifting requires 2 comparisons to determine which way to go. Furthermore, since **most nodes of a tree are located near the leaves**, we can assume that almost every reheap goes all the way down to a leaf.
Thus, there would be a significant difference if we could eliminate 1 of the 2 comparisons required.
### Special Leaf & Special Path
To achieve the above goal, we first find the final location of ***i*** (as in ***REHEAP(i)***), then modify the heap.
From ***i***, we find the ==**Special Path**== by comparing the two children and choosing the one that ***i*** should be exchanged with, until we reach a leaf, which we call the ==**Special Leaf**==.
:::info
**LEAF-SEARCH(A, i):** (if building min-heap)
`j` = `i`
**while** 2 * `j` < `A.heapsize` **do**
**if** `A[2 * j]` < `A[2 * j + 1]`
`j` = 2 * `j`
**else**
`j` = 2 * `j` + 1
**if** 2 * `j` = `A.heapsize`
`j` = 2 * `j`
**return** `j`
:::
Following it, we search from the special leaf up to where ***i*** is supposed to land. Writing the special path as `b[1]...b[k]...b[h]`, the required number of comparisons is
<span style="position: absolute; left: 50%; transform: translateX(-50%);">$1\times(k-1)+2\times(h-k)$</span>
comparisons.
:::info
**BOTTOM-UP-SEARCH(A, i, j):** (if building min-heap)
**while** `A[i]` < `A[j]` **do**
`j` = floor(`j / 2`)
**return** `j`
:::
After finding the position, we must perform the exchange. Writing the special path excluding the root as `b[1]...b[k]...b[h]`, and the root as `x`, the procedure **INTERCHANGE(A, i, j)** performs actions so that:
- `x` takes the position of `b[k]`
- `b[1]` ~ `b[k]` take the positions of their parents
Last, we modify ***HEAPIFY(A, i)*** into the following:
:::info
**BOTTOM-UP-HEAPIFY(A, i)**:
`j` = *LEAF-SEARCH(A, i)*
`j` = *BOTTOM-UP-SEARCH(A, i, j)*
*INTERCHANGE(A, i, j)*
:::
Substituting into the original heapsort algorithm, we get the bottom-up heap sort algorithm.
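Putting the pieces together, here is a runnable Python sketch of the full algorithm (my own max-heap rendition, so *LEAF-SEARCH* follows the larger child and *BOTTOM-UP-SEARCH* climbs while the sinking key is larger; the function names mirror the pseudocode, but the details are assumptions of this sketch):

```python
def leaf_search(A, i, heapsize):
    # Walk the special path from i, always stepping to the larger child.
    j = i
    while 2 * j < heapsize:              # both children exist
        j = 2 * j if A[2 * j] > A[2 * j + 1] else 2 * j + 1
    if 2 * j == heapsize:                # node with a single (left) child
        j = 2 * j
    return j                             # the special leaf

def bottom_up_search(A, i, j):
    # Climb from the special leaf until A[j] is no smaller than the sinking key.
    while A[j] < A[i]:
        j //= 2
    return j

def interchange(A, i, j):
    # A[i] goes to position j; every node on the path from i to j moves up one.
    x = A[i]
    path = []
    k = j
    while k > i:
        path.append(k)
        k //= 2
    for k in reversed(path):             # shift the path up, top-down
        A[k // 2] = A[k]
    A[j] = x

def bottom_up_heapify(A, i, heapsize):
    if 2 * i <= heapsize:                # nothing to do at a leaf
        j = leaf_search(A, i, heapsize)
        j = bottom_up_search(A, i, j)
        interchange(A, i, j)

def bottom_up_heapsort(data):
    A = [None] + list(data)              # dummy slot: children of i are 2i, 2i+1
    n = len(data)
    for i in range(n // 2, 0, -1):       # BUILD-HEAP phase
        bottom_up_heapify(A, i, n)
    for heapsize in range(n, 1, -1):     # sorting phase
        A[1], A[heapsize] = A[heapsize], A[1]
        bottom_up_heapify(A, 1, heapsize - 1)
    return A[1:]
```

Note that `bottom_up_search` spends only 1 comparison per level on the climb, which is where the savings over the classic 2-comparison descent come from.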
----
----
<br/>
## Calculating Number Of Comparisons Of The Bottom-Up Heapsort Algorithm
To calculate the total number of comparisons, we discuss the separate procedures and calculate the number of comparisons each would use.
### 1. Number of Cmp. used by the *LEAF-SEARCH()* calls during the heap creation phase.
The following is an example tree we can calculate with. Notice that
- $k$ represents the number of completely filled levels, which can be calculated as $\lfloor\log_2{(n+1)}\rfloor$.
- $i$ represents the number of nodes at the lowest level, if it is not completely filled.
- $n$ represents the number of nodes we are considering. For the whole tree, $n=2^k-1+i$.
![](https://i.imgur.com/bJ2IK9n.png)
+ When building a heap, the procedure **LEAF-SEARCH()** is called for nodes $1$ to $\lfloor n/2\rfloor$.
#### Worst Case Situation
+ We first calculate the number of comparisons for the nodes in the **fully filled levels** (that is, the <span style="color: orange">**orange nodes**</span>, considering $n = 2^k-1$).
For each $l$ with $1\leq l\leq k-1$, there exist $2^{k-(l+1)}$ nodes, located at level $k - l$, causing $l$ comparisons each. For example, substituting $l=1$ into the formula, we get that on level $3$ (for $k=4$), there are $2^{4-(1+1)}=4$ nodes that each require $1$ comparison to reach the leaf. To sum it up, we calculate $$\displaystyle\sum_{l=1} ^{k-1} \Big(l\cdot 2^{k-l-1}\Big)$$
Considering $S_m$ as the result when $k = m$, we can get the recursive formula$$S_m=2S_{m-1}+m-1$$
By substituting $T_m = S_m+m$, we can derive:$$\begin{align}\\&\Rightarrow & S_m+(m-1) &= 2\cdot(S_{m-1} + (m-1)) \\ &\Rightarrow & T_m-1 &= 2\cdot T_{m-1} \\ \end{align}$$
Then, we try to get the form $T_m+c=2\cdot (T_{m-1}+c)$. Plugging in $T_1=1$ and $T_2=3$ gives $c=1$, resulting in the following equation that we can substitute into itself: $$\begin{align*} \\ T_m+1 &= 2\cdot (T_{m-1}+1) \\ &= 2^2\cdot (T_{m-2}+1) \\ &= 2^3\cdot (T_{m-3}+1) \\ &\ \ \ \ \ \ \ \ \ \ \ \ \vdots \\ &= 2^{m-1}\cdot(T_1+1) = 2^{m-1}\cdot 2 \\ &= 2^m \\ \Rightarrow (S_m+m)+1 &= 2^m \\ \end{align*}$$
Thus, we finally get the following equation: $$S_k=2^k-k-1$$
To express this in terms of $n$, we again substitute $n=2^k-1$, getting the number of comparisons: $$n-\big\lceil\log{(n+1)}\big\rceil$$
+ Next, we calculate for the last $i$ nodes (that is, the <span style="color: skyblue">**blue nodes**</span>).
Every pair of blue nodes (that is, whenever $i\geq 2$) costs each node above them $1$ more comparison; that is, for level $k - r$, $1\leq r\leq k-1$, the blue nodes cause $$\Big\lceil\frac{i-1}{2^r}\Big\rceil$$
additional comparisons, so the procedure requires an additional $$\displaystyle\sum_{r=1}^k\ \Big\lceil\frac{i-1}{2^r}\Big\rceil$$
comparisons in total. Since the formula contains a ceiling function, we try to obtain an upper bound. First, we separate the exact value from the rounding error introduced by the ceiling function, getting: $$\displaystyle\sum_{r=1}^k\frac{i-1}{2^r}+\displaystyle\sum_{r=1}^k\Big(\Big\lceil\frac{i-1}{2^r}\Big\rceil-\frac{i-1}{2^r}\Big)$$
The first term is easy to calculate: factoring out the constant, we get $$(i-1)\displaystyle\sum_{r=1}^k\frac{1}{2^r}$$ which can be easily evaluated into $$(i-1)\cdot\Big(1-\frac{1}{2^k}\Big)$$
For the second term, we use the property that $$\begin{align} \\ \displaystyle\sum_{r=1}^k\Big(\Big\lceil\frac{i-1}{2^r}\Big\rceil-\frac{i-1}{2^r}\Big) &\leq \overbrace{\Big[\Big(1-\frac{1}{2}\Big)+\Big(1-\frac{1}{2^2}\Big)+\dots+\Big(1-\frac{1}{2^k}\Big)\Big]}^{\text{k terms}} \\ &= k-\Big[\frac{1}{2}+\frac{1}{2^2}+\dots+\frac{1}{2^k}\Big] \\ &=k-\Big[1-\frac{1}{2^k}\Big]\end{align}$$
Adding the two terms up results in $$\begin{align} \\ \displaystyle\sum_{r=1}^k\frac{i-1}{2^r}+\displaystyle\sum_{r=1}^k\Big(\Big\lceil\frac{i-1}{2^r}\Big\rceil-\frac{i-1}{2^r}\Big)&\leq (i-1)\cdot\Big(1-\frac{1}{2^k}\Big)+k-\Big[1-\frac{1}{2^k}\Big] \\ &= i-1-\frac{i-1}{2^k}+k-1+\frac{1}{2^k} \\ &= i+k-2+\frac{2-i}{2^k}\\ \end{align}$$
Since $i\geq 2$, we can eliminate the last term, thus obtaining the inequality stated in the paper: $$\displaystyle\sum_{r=1}^k\Big\lceil\frac{i-1}{2^r}\Big\rceil \leq i+k-2$$
<span style="background-color: #ffd7c1; display: block; padding: 2% 2% 0.5% 2%; border-radius: 15px;">
Finally, adding the two values up (recall that the $k$ fully filled levels contain $2^k-1$ nodes and $n=2^k-1+i$), we get: $$2^k-1-k+i-2+k=n-2$$ ==**As the upper bound for the number of comparisons.**==
</span>
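Both counts can be verified numerically; the following Python sketch (mine, not from the paper) brute-forces the orange-node closed form and the blue-node bound:

```python
import math

def orange_comparisons(k):
    # Direct sum: each of the 2^(k-l-1) nodes at level k-l costs l comparisons.
    return sum(l * 2 ** (k - l - 1) for l in range(1, k))

def blue_extra(i, k):
    # Extra comparisons caused by i nodes on the partially filled lowest level.
    return sum(math.ceil((i - 1) / 2 ** r) for r in range(1, k + 1))

for k in range(1, 13):
    # closed form derived above: S_k = 2^k - k - 1
    assert orange_comparisons(k) == 2 ** k - k - 1

for k in range(2, 11):
    for i in range(2, 2 ** k + 1):
        # the paper's bound on the ceiling sum, valid for i >= 2
        assert blue_extra(i, k) <= i + k - 2
        # hence the total is at most n - 2, where n = 2^k - 1 + i
        assert orange_comparisons(k) + blue_extra(i, k) <= (2 ** k - 1 + i) - 2

print("all bounds verified")
```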
#### Best Case Situation
Whatever shape the tree takes, every **LEAF-SEARCH()** runs down to a leaf, so the number of comparisons remains $n-\big\lceil\log{(n+1)}\big\rceil$, identical to the worst case.
However, consider the situation where $n=2^k$; that is, the deepest level contains only 1 single node (only 1 <span style="color:skyblue">**blue node**</span> exists). Although it adds a level to the root-to-leaf path, a single child lets the search fall through without a comparison (there is no sibling to compare against), ==saving $1$ comparison for each affected call==. This happens whenever **the root sits on the path from node $1$ to node $n$**, saving $\big\lceil\log_2{n}\big\rceil-1$ comparisons in total.
<span style="background-color: #ffd7c1; display: block; padding: 2% 2% 0.5% 2%; border-radius: 15px;">
Thus for the best case, the number of comparisions becomes $$\begin{align}&n-\big\lceil\log{(n+1)}\big\rceil - \Big(\big\lceil\log_2{n}\big\rceil-1\Big) \\ =\ &n-\big\lceil\log{(n+1)}\big\rceil - \big\lceil\log_2{n}\big\rceil+1\end{align}$$
==**As the lower-bound for the number of comparisions.**==
</span>
----
### 2. Number of Cmp. used by the *BOTTOM-UP-SEARCH()* calls during the heap creation phase.
In this part, we focus on individual *special paths*; that is, the nodes from the processed root down to the *special leaf*. Since the heapify procedure is done bottom-up, the nodes on the path are already well ordered according to the heap property.
+ The array $b()$ represents the nodes on the special path, excluding the root, starting from $b(1)$.
- Representing the borders of the array (as sentinels), $b(0)=-\infty,\ b(d+1)=\infty$.
+ $x$ represents the root node.
+ $d$ represents the length of the special path, excluding the root.
![](https://i.imgur.com/trd0Kup.png)
The bottom-up search is simple: we start at the position of $b(d)$, then move up until we reach the node $b(j)$ satisfying $b(j)\le x \leq b(j+1)$.
#### Symbols for calculation
Compared with HEAPSORT, the above action requires the following numbers of comparisons:
| Method | When $j = 0$ | When $0 \lt j \lt d$ | When $j=d$ |
| -------: | :-: | :-: | :-: |
| **HEAPSORT** <br> requires $2$ comparisons per level | $2(j+1)$ | $2(j+1)$ | $2d$ |
| **BOTTOM-UP-SEARCH()** <br> requires $1$ comparison per level | $d$ | $d-j+1$ | $d-j+1$ |
Then, we define the following (for a single call):
+ $l_{\text{BUS}}$ represents the number of comparisons used by **BOTTOM-UP-SEARCH()**
+ $l_{\text{HS}}$ represents *half of* the number of comparisons HEAPSORT uses.
The result of $l_{\text{BUS}}+l_{\text{HS}}$ changes according to the conditions:
:::spoiler Calculations
**When $j = 0$**$$\begin{align}\frac{2(j+1)}{2}+d&=(j+1)+d \\ &= (0+1)+d \\ &=d+1\end{align}$$
**When $0 \lt j \lt d$**$$\begin{align}\frac{2(j+1)}{2}+(d-j+1)&=(j+1)+d-j+1 \\ &=d+2\end{align}$$
**When $j=d$**$$\begin{align}\frac{2d}{2}+(d-j+1)&=2d-j+1 \\ &=2d-d+1 \\ &=d+1\end{align}$$
:::
| When $j = 0$ | When $0 \lt j \lt d$ | When $j=d$ |
| :-: | :-: | :-: |
| $d+1$ | $d+2$ | $d+1$ |
#### The comparison number of normal HEAPSORT
> TODO: Read the paper properly; the math is way over my capacity
According to an even older paper, [An Average Case Analysis of Floyd's Algorithm to Construct Heaps](https://core.ac.uk/download/pdf/82135122.pdf), we know that the heap creation used by HEAPSORT (Floyd's algorithm) uses, on average,
+ $(\alpha_{1}+2\alpha_2-2)n+\Theta(\log{n})$ comparisons
+ $(\alpha_1+\alpha_2-2)n+\Theta(\log{n})$ interchanges
Where $\alpha_1 = 1.6066951\dots,\ \ \alpha_2 = 1.1373387\dots$
#### Calculating with mean values
Since the input array is randomly chosen, we use $L_{\text{BUS}}$, $L_{\text{HS}}$ and $D$ to denote the random variables that are the sums of all $l_{\text{BUS}}$, $l_{\text{HS}}$ and $d$ over the $n'=\big\lfloor\frac{n}{2}\big\rfloor$ reheap calls, respectively.
According to the first part of this section, since the number of comparisons of **LEAF-SEARCH()** is exactly the sum of the special path lengths over all roots, we know that:$$E(D)=n+\Theta(\log{n})$$
Additionally, according to the result quoted above, we also know that:$$E(L_{\text{HS}})=(\frac{\alpha_1}{2}+\alpha_2-1)n+\Theta(\log{n})$$
By defining the number of calls with $\ l_{\text{BUS}}+l_{\text{HS}}=d+1$ as $T$, the number of calls with $\ l_{\text{BUS}}+l_{\text{HS}}=d+2$ is $n'-T$. Summing $l_{\text{BUS}}=(d+1\text{ or }d+2)-l_{\text{HS}}$ over all calls, we can calculate:$$\begin{align} L_{\text{BUS}}&=D+2(n'-T)+T\cdot 1-L_{\text{HS}} \\[3mm] \Rightarrow E(L_{\text{BUS}})&=E(D)+2n'-E(T)-E(L_{\text{HS}}) \\[1.5mm]
&=E(D)+2\Big\lfloor\frac{n}{2}\Big\rfloor-E(T)-E(L_{\text{HS}}) \\ &=n+n+(1-\frac{\alpha_1}{2}-\alpha_2)n-E(T)+\Theta(\log{n}) \\ &=(3-\frac{\alpha_1}{2}-\alpha_2)n-E(T)+\Theta(\log{n}) \\\end{align}$$
To obtain the value of $E(T)$, we consider the 2 situations: when $j=0$, and when $j=d$.
+ **Situation $j=0$ —** Means that <u>*the root element is the extremal one within the subtree (the smallest, for a min-heap)*.</u> If the subtree contains $r$ nodes, the probability of a node being in that case is $\frac{1}{r}$.
Consider a fully filled tree: at the 2nd-lowest level, there exist $\frac{n}{4}$ subtrees, each containing 3 nodes; at the 3rd-lowest level, $\frac{n}{8}$ subtrees, each containing 7 nodes. We may consider only a full tree, since a $\Theta(\log{n})$ error is allowed.
In general, at the $h$-th-lowest level there exist $\frac{n}{2^h}$ subtrees, each containing $2^h-1$ nodes; thus the expected count of this situation on that level is $\frac{n}{2^h}\times\frac{1}{2^h-1}$, and we obtain the expected number of cases where $j=0$ as $$\beta n+\Theta(\log{n})\text{ , where }\beta=\displaystyle\sum_{h=2}^{\infty}\frac{1}{2^h(2^h-1)}$$
+ **Situation $j=d$ —** Means that <u>*after the reheap process, the moved root element has $0$ children*</u>. We first define the following symbols, for the **old REHEAP() procedure**:
+ $c$ represents the number of comparisons used.
+ $i$ represents the number of interchanges used (moving how many levels down).
+ $s$ represents the number of children the root element has after the procedure, $s\in \{0,1,2\}$.
Using the old **REHEAP()**, advancing down each level requires 2 comparisons, plus some additional comparisons at the final level, comparing with each of its children. Thus the relationship between the three symbols is: $$c=2i+s$$
To obtain the desired count, we rule out the cases where $s=1$ or $2$.
Due to the heap property, the lowest-level nodes are packed at the smallest indexes, leaving only 1 position where a node can have $s=1$: the $\lfloor\frac{n}{2}\rfloor$-th node. (See the image in the *LEAF-SEARCH()* section for an illustration.) Only the at most $\lfloor\ \log{n}\ \rfloor$ nodes on the path to it can end up there, which is low enough to land within the error range $\Theta(\log{n})$.
For the case $s=2$, we use the random variables $C$, $I$ and $S$ to represent the sums of all $c$, $i$ and $s$, respectively. According to the above results on HEAPSORT's heap creation, $$\begin{align} && E(C) &= E(S) + 2E(I) \\ &\Rightarrow & E(S) &= E(C) - 2E(I) \\ &&&= [(\alpha_1+2\alpha_2-2)n + \Theta(\log{n})]-2\big[(\alpha_1+\alpha_2-2)n+\Theta(\log{n})\big] \\ &&&=(-\alpha_1+2)n+\Theta(\log{n}) \\ \end{align}$$
We can also write the expected value as $\displaystyle\sum_{m\in M}mp_m$, where $m$ ranges over the outcome set $M$ and $p_m$ is the probability of $m$; that is to say, $E(s)=0\cdot p_{s\texttt{=}0}+2\cdot p_{s\texttt{=}2}=2p_{s\texttt{=}2}$, or $E(S)=2p_{s\texttt{=}2}\cdot n'$. Thus, we can calculate as follows: $$\begin{align} p_{s=2}&=\frac{E(S)}{n'}\cdot\frac{1}{2} \\ &= \frac{\big[(-\alpha_1+2)n+\Theta(\log{n})\big]}{\big\lfloor\frac{n}{2}\big\rfloor}\cdot\frac{1}{2} \\ &= \frac{(-\alpha_1+2)n+\Theta(\log{n})}{n} \\ &= (-\alpha_1+2)+\Theta(\frac{\log{n}}{n}) \approx 0.3933049\dots+\Theta(\frac{\log{n}}{n}) \end{align}$$
Therefore, we can obtain $p_{s\texttt{=}0}=1-p_{s\texttt{=}2}=1-[(-\alpha_1+2)+\Theta(\frac{\log{n}}{n})]=\alpha_1-1+\Theta(\frac{\log{n}}{n})\approx0.6066951\dots+\Theta(\frac{\log{n}}{n})$
And the expected number of cases that $j=d$ becomes: $$\begin{align} p_{s\texttt{=}0}\cdot n'=(\frac{\alpha_1}{2}-\frac{1}{2})n+\Theta(\log{n})\\\end{align}$$
<span style="background-color: #ffd7c1; display: block; padding: 2% 2% 0.5% 2%; border-radius: 15px;">
Thus, combining the equations for $E(L_{\text{BUS}})$ and $E(T)$, we can finally get the result $$\begin{align} E(L_{\text{BUS}})&=(3-\frac{\alpha_1}{2}-\alpha_2)n-E(T)+\Theta(\log{n}) \\ &=(3-\frac{\alpha_1}{2}-\alpha_2)n-\Big[\big(\beta n+\Theta(\log{n})\big)+\big[(\frac{\alpha_1}{2}-\frac{1}{2})n+\Theta(\log{n})\big]\Big]+\Theta(\log{n}) \\ &= \big(\frac{7}{2}-\alpha_1-\alpha_2-\beta\big)n+\Theta(\log{n}) \end{align}$$ ==as the expected number of comparisons used in **BOTTOM-UP-SEARCH()** during the heap-creation phase.==
</span>
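The constant $\beta$ converges very quickly, so it can be evaluated with a few lines of Python (a numeric sketch of mine):

```python
def beta(terms=60):
    # beta = sum over h >= 2 of 1 / (2^h * (2^h - 1)); terms shrink like 4^(-h)
    return sum(1 / (2 ** h * (2 ** h - 1)) for h in range(2, terms))

print(beta())
```

which gives $\beta\approx 0.1067$.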
---
### 3. The Number of Comparisons Used in the Heap Creation Phase
Simply adding up the results of the above 2 sections, we get
$$\big(\frac{9}{2}-\alpha_1-\alpha_2-\beta\big)n+\Theta(\log{n})$$
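Plugging in the numeric constants quoted above (a quick sanity check of mine; the digits of $\alpha_1$ and $\alpha_2$ are taken from the cited analysis):

```python
alpha1 = 1.6066951   # constant from the Floyd's-algorithm average-case analysis
alpha2 = 1.1373387
beta = sum(1 / (2 ** h * (2 ** h - 1)) for h in range(2, 60))

coefficient = 4.5 - alpha1 - alpha2 - beta
print(round(coefficient, 4))
```

so the heap-creation phase uses about $1.649n$ comparisons on average.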
---
> TODO: requires an explanation for the content after Lemma 4.4
## References
+ <u>**Paper:**</u> [BOTTOM-UP-HEAPSORT, a new variant of HEAPSORT beating, on an average, QUICKSORT (if n is not very small)](https://www.sciencedirect.com/science/article/pii/030439759390364Y)
+ [Summing with ceiling function](https://math.stackexchange.com/questions/4437074/how-is-the-inequality-displaystyle-sum-r-1k-big-lceil-fraci-12r-big)
+ <u>**Paper:**</u> [An Average Case Analysis of Floyd's Algorithm to Construct Heaps](https://core.ac.uk/download/pdf/82135122.pdf)