Note of Bottom-up Heapsort (English Version)

BOTTOM-UP-HEAPSORT, a new variant of HEAPSORT beating, on an average, QUICKSORT (if n is not very small)

Number of Comparisons of Sorting Algorithms

General Comparison Sort

Worst Case Analysis

For an input of $n$ distinct entries, there are $n!$ permutations, only one of which is the sorted list. If an algorithm always finishes after $f(n)$ steps (comparisons), it can distinguish at most $2^{f(n)}$ different permutations, because each comparison has only 2 possible results and therefore eliminates at most half of the remaining possibilities.

Therefore, the algorithm must satisfy $2^{f(n)} \geq n!$, or equivalently $f(n) \geq \log_{2} n!$.

Since the number of comparisons is necessarily a whole number, the worst-case lower bound is $\lceil \log_{2} n! \rceil$. We then use Stirling's formula to simplify:

$$
\begin{aligned}
\ln n! & = n \ln n - n + \Theta(\ln n) \\
\Rightarrow \frac{\log_{2} n!}{\log_{2} e} & = \frac{n \log_{2} n}{\log_{2} e} - \frac{\log_{2} e^{n}}{\log_{2} e} + \Theta\left(\frac{\log_{2} n}{\log_{2} e}\right) \\
\Rightarrow \lceil \log_{2} n! \rceil & = n \log_{2} n - n \log_{2} e + \Theta(\log_{2} n) \\
& \approx n \log_{2} n - 1.4427 n
\end{aligned}
$$

When $n$ is large enough, we can ignore the error term $\Theta(\log_{2} n)$.
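To get a feeling for how close the approximation is, the exact bound can be compared with the Stirling estimate. This is only a small sketch; the function names `comparison_lower_bound` and `stirling_estimate` are made up for this note and do not come from the paper.

```python
import math

def comparison_lower_bound(n: int) -> int:
    """Exact value of ceil(log2(n!)), the worst-case lower bound."""
    # For any integer m >= 1, ceil(log2(m)) equals (m - 1).bit_length(),
    # so no floating point rounding is involved.
    return (math.factorial(n) - 1).bit_length()

def stirling_estimate(n: int) -> float:
    """n*log2(n) - n*log2(e), i.e. the estimate without the Theta(log n) term."""
    return n * math.log2(n) - n * math.log2(math.e)

for n in (10, 100, 1000):
    exact = comparison_lower_bound(n)
    approx = stirling_estimate(n)
    # The gap between the two columns is the ignored Theta(log n) term.
    print(n, exact, round(approx, 1), round(exact - approx, 1))
```

The last column grows very slowly compared with the first two, which is why the error term can be dropped for large $n$.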

Merge Sort

Requires at most $n \log n - n + 1$ comparisons (for $n$ a power of 2), but it also needs an additional array of length $n$, so mergesort is only useful for external sorting.

Insertion Sort

Requires at most $\log n! + n - 1$ comparisons. However, the average number of nonessential operations (e.g., swapping) is $\Theta(n^{2})$, which is much higher than the $O(n \log n)$ of the other sorting algorithms we consider useful.

Quick Sort

(Not that important.)

Algorithm of Heap Sort / Bottom-Up Heap Sort

BOLD-CAPITALIZED-DASHED-FONTS
$\to$ method names / keywords (e.g., FOR)
monospace fonts
$\to$ variables
ITALIC-CAPITALIZED-UNDERSCORED-FONTS
$\to$ other methods called within methods

Basic Heap Sort Analysis

First, we define the following procedures, which are needed to perform a heap sort.

LEFT(i)
The left child of the element at index i.
RIGHT(i)
The right child of the element at index i.
HEAPIFY(A, i)
Assumes that, in tree A, both subtrees rooted at LEFT(i) and RIGHT(i) already satisfy the heap property. Reorders the elements so that the subtree rooted at i also satisfies the heap property.
The element at i is shifted down at most h times, where $h = \log (n)$ is the height of the subtree rooted at i; before each shift, at most 2 comparisons are required.
BUILD-HEAP(A)
Reorders tree A so that its elements satisfy the heap property. That is, it calls HEAPIFY on the
$\lfloor \frac{n}{2} \rfloor$ -th node down to the 1st node.

Then, to perform heap sort, we just need to perform the following procedure:

HEAPSORT(A):
      BUILD-HEAP(A)
      for i = A.length downto 2
            swap A[i] with A[1]
            A.heapsize = A.heapsize - 1
            HEAPIFY(A, 1)
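The procedure above can be sketched in Python roughly as follows. This is a minimal illustration, not the paper's code: it assumes a max-heap (so the output is ascending), keeps the 1-based indexing of the pseudocode by leaving `a[0]` unused, and adds a `counter` list (an assumption of this sketch) to count comparisons between elements.

```python
def heapify(a, i, heapsize, counter):
    """Sift a[i] down inside the max-heap a[1..heapsize]; a[0] is unused."""
    while True:
        left, right, largest = 2 * i, 2 * i + 1, i
        if left <= heapsize:
            counter[0] += 1                 # comparison with the left child
            if a[left] > a[largest]:
                largest = left
        if right <= heapsize:
            counter[0] += 1                 # comparison with the right child
            if a[right] > a[largest]:
                largest = right
        if largest == i:                    # heap property holds, stop
            return
        a[i], a[largest] = a[largest], a[i]
        i = largest

def heapsort(values):
    """Return (sorted list, number of element comparisons)."""
    a = [None] + list(values)               # shift to 1-based indexing
    n = len(values)
    counter = [0]
    for i in range(n // 2, 0, -1):          # BUILD-HEAP
        heapify(a, i, n, counter)
    for i in range(n, 1, -1):               # move the maximum to the end
        a[1], a[i] = a[i], a[1]
        heapify(a, 1, i - 1, counter)
    return a[1:], counter[0]

print(heapsort([5, 2, 9, 1, 5, 6]))
```

Up to 2 comparisons are spent on every level the element passes, which is the cost the bottom-up variant tries to reduce.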

Bottom-Up Variation

Recall that when reheaping, 2 comparisons are required on every level of shifting to determine where the element goes next. Furthermore, since most nodes of a tree are located near the leaves, we can assume that almost all reheaping continues down to a leaf.

Thus, it makes a significant difference if we can eliminate 1 of the 2 comparisons required per level.

Special Leaf & Special Path

To achieve the above goal, we first find the final location of i (as in REHEAP(i)) and only then modify the heap.

Starting from i, we find the Special Path by comparing the two children at each level and choosing the one that i would be exchanged with, until we reach a leaf, which we call the Special Leaf.

LEAF-SEARCH(A, i): (if building min-heap)
      j = i
      while 2 * j < A.heapsize do
            if A[2 * j] < A[2 * j + 1]
                  j = 2 * j
            else
                  j = 2 * j + 1
      if 2 * j == A.heapsize
            j = 2 * j
      return j

Next, we search upward from the special leaf for the position where i belongs. Writing the special path as b[1]...b[k]...b[h], with b[k] the final position, the number of comparisons required is $1 \times (k - 1) + 2 \times (h - k)$.

BOTTOM-UP-SEARCH(A, i, j): (if building min-heap)
      while A[i] < A[j] do
            j = floor(j / 2)
      return j

After the position is found, we must perform the exchange. Let the special path, excluding the root, be b[1]...b[k]...b[h], and let the root be x. The procedure INTERCHANGE(A, i, j) performs the following:

x takes the position of b[k]
b[1] ~ b[k] each take the position of their parent

Last, we modify HEAPIFY(A, i) into the following:

BOTTOM-UP-HEAPIFY(A, i):
      j = LEAF-SEARCH(A, i)
      j = BOTTOM-UP-SEARCH(A, i, j)
      INTERCHANGE(A, i, j)

Substituting into the original heapsort algorithm, we get the bottom-up heap sort algorithm.
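Putting the three procedures together, a runnable sketch could look like the following. It is an illustration under assumptions, not the paper's code: it uses a max-heap (the pseudocode above is written for a min-heap, so the comparisons are flipped) to get ascending output, keeps 1-based indexing with `a[0]` unused, passes the heap size explicitly, and uses a `counter` list invented here to count element comparisons.

```python
def leaf_search(a, i, heapsize, counter):
    """Follow the larger child from i down to the special leaf."""
    j = i
    while 2 * j < heapsize:                 # both children exist
        counter[0] += 1
        j = 2 * j if a[2 * j] > a[2 * j + 1] else 2 * j + 1
    if 2 * j == heapsize:                   # a single child costs no comparison
        j = heapsize
    return j

def bottom_up_search(a, i, j, counter):
    """Climb from the special leaf to the final position of a[i]."""
    x = a[i]
    while j != i:                           # reaching the root needs no comparison
        counter[0] += 1
        if a[j] >= x:
            break
        j //= 2
    return j

def interchange(a, i, j):
    """Put a[i] at position j and move the path elements above j one level up."""
    carry, k = a[i], j
    while True:
        a[k], carry = carry, a[k]
        if k == i:
            break
        k //= 2

def bottom_up_heapify(a, i, heapsize, counter):
    j = leaf_search(a, i, heapsize, counter)
    j = bottom_up_search(a, i, j, counter)
    interchange(a, i, j)

def bottom_up_heapsort(values):
    """Return (sorted list, number of element comparisons)."""
    a = [None] + list(values)
    n = len(values)
    counter = [0]
    for i in range(n // 2, 0, -1):          # heap creation phase
        bottom_up_heapify(a, i, n, counter)
    for i in range(n, 1, -1):               # selection phase
        a[1], a[i] = a[i], a[1]
        bottom_up_heapify(a, 1, i - 1, counter)
    return a[1:], counter[0]

print(bottom_up_heapsort([5, 2, 9, 1, 5, 6]))
```

Each level on the way down costs 1 comparison instead of 2; the price is the extra climb in `bottom_up_search`, which is usually short because most elements end up near the leaves.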

Calculating Number Of Comparisons Of The Bottom-Up Heapsort Algorithm

To calculate the total number of comparisons, we discuss the procedures separately and calculate the number of comparisons each one uses.

1. Number of Cmp. used by the LEAF-SEARCH() calls during the heap creation phase.

The following is an example tree used for the calculation. Notice that

- $k$ represents the number of completely filled levels. It can be calculated as $\lfloor \log_{2} (n + 1) \rfloor$.
- $i$ represents the number of nodes on the lowest level, if it is not completely filled.
- $n$ represents the number of nodes under consideration. For the whole tree, $n = 2^{k} - 1 + i$.
(Figure: example tree; the completely filled levels are drawn as orange nodes and the $i$ nodes on the lowest level as blue nodes.)

When building a heap, the procedure LEAF-SEARCH() is called for nodes $1$ to $\lfloor n / 2 \rfloor$.

Worst Case Situation

We first calculate the number of comparisons for the nodes in the completely filled levels (that is, the orange nodes, considering $n = 2^{k} - 1$).
For each $l$ with $1 \leq l \leq k - 1$, there exist $2^{k - (l + 1)}$ nodes, located at level $k - 1 - l$ (counting the root as level $0$), each requiring $l$ comparisons. For example, with $k = 4$ and $l = 1$, level $2$ holds $2^{4 - (1 + 1)} = 4$ nodes that each require $1$ comparison to reach a leaf. Summing over all levels, we want to evaluate
$\sum_{l = 1}^{k - 1} (l \cdot 2^{k - l - 1})$
Writing $S_{m}$ for the value of this sum when $k = m$, we get the recurrence
$S_{m} = 2 S_{m - 1} + m - 1$
Substituting $T_{m} = S_{m} + m$, we obtain:
$\begin{aligned} S_{m} + (m - 1) & = 2 \cdot (S_{m - 1} + (m - 1)) \\ \Rightarrow T_{m} - 1 & = 2 \cdot T_{m - 1} \end{aligned}$
Next, we look for a constant $c$ such that $T_{m} + c = 2 \cdot (T_{m - 1} + c)$. Plugging in $T_{1} = 1$ and $T_{2} = 3$ gives $c = 1$, so the equation can be substituted into itself repeatedly:
$\begin{aligned} T_{m} + 1 & = 2 \cdot (T_{m - 1} + 1) \\ & = 2^{2} \cdot (T_{m - 2} + 1) \\ & = 2^{3} \cdot (T_{m - 3} + 1) \\ & \;\; \vdots \\ & = 2^{m - 1} \cdot (T_{1} + 1) \\ & = 2^{m - 1} \cdot 2 \\ & = 2^{m} \\ \Rightarrow (S_{m} + m) + 1 & = 2^{m} \end{aligned}$
Thus, we finally get the following equation:
$S_{k} = 2^{k} - k - 1$
To express this in terms of $n$, we again substitute $n = 2^{k} - 1$, which gives the number of comparisons:
$n - \lceil \log (n + 1) \rceil$
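The closed form can be checked numerically against both the original sum and the recurrence. The helper names below (`leaf_search_sum`, `by_recurrence`) are made up for this check.

```python
def leaf_search_sum(k):
    """Direct evaluation of sum_{l=1}^{k-1} l * 2^(k-l-1)."""
    return sum(l * 2 ** (k - l - 1) for l in range(1, k))

def by_recurrence(k):
    """Evaluate S_k with S_1 = 0 and S_m = 2*S_{m-1} + (m - 1)."""
    s = 0
    for m in range(2, k + 1):
        s = 2 * s + (m - 1)
    return s

for k in range(1, 11):
    n = 2 ** k - 1                      # size of a completely filled tree
    closed_form = 2 ** k - k - 1        # equals n - log2(n + 1)
    assert leaf_search_sum(k) == by_recurrence(k) == closed_form == n - k
    print(k, n, closed_form)
```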
Next, we account for the last $i$ nodes (that is, the blue nodes).
Every pair of blue nodes (that is, when $i \geq 2$) costs each node above it $1$ more comparison. That is, on level $k - r$ with $1 \leq r \leq k$ (counting the root as level $0$), the blue nodes cause
$\lceil \frac{i - 1}{2^{r}} \rceil$
additional comparisons, so the procedure requires an additional
$\sum_{r = 1}^{k} \lceil \frac{i - 1}{2^{r}} \rceil$
comparisons in total. Since the formula contains a ceiling function, we look for an upper bound. First, we separate each term into the exact quotient and the excess added by the ceiling function, getting:
$\sum_{r = 1}^{k} \frac{i - 1}{2^{r}} + \sum_{r = 1}^{k} (\lceil \frac{i - 1}{2^{r}} \rceil - \frac{i - 1}{2^{r}})$
The first term is easy to calculate. Factoring out the constant, we get
$(i - 1) \sum_{r = 1}^{k} \frac{1}{2^{r}}$ which evaluates to
$(i - 1) \cdot (1 - \frac{1}{2^{k}})$
For the second term, we use the fact that the excess of the $r$-th term is at most $1 - \frac{1}{2^{r}}$:
$\begin{aligned} \sum_{r = 1}^{k} (\lceil \frac{i - 1}{2^{r}} \rceil - \frac{i - 1}{2^{r}}) & \leq \overbrace{(1 - \frac{1}{2}) + (1 - \frac{1}{2^{2}}) + \dots + (1 - \frac{1}{2^{k}})}^{k \text{ terms}} \\ & = k - [\frac{1}{2} + \frac{1}{2^{2}} + \dots + \frac{1}{2^{k}}] \\ & = k - [1 - \frac{1}{2^{k}}] \end{aligned}$
Adding the two terms up results in
$\begin{aligned} \sum_{r = 1}^{k} \frac{i - 1}{2^{r}} + \sum_{r = 1}^{k} (\lceil \frac{i - 1}{2^{r}} \rceil - \frac{i - 1}{2^{r}}) & \leq (i - 1) \cdot (1 - \frac{1}{2^{k}}) + k - [1 - \frac{1}{2^{k}}] \\ & = i - 1 - \frac{i - 1}{2^{k}} + k - 1 + \frac{1}{2^{k}} \\ & = i + k - 2 + \frac{2 - i}{2^{k}} \end{aligned}$
Since $i \geq 2$, the last term is not positive and can be dropped, which gives the inequality stated in the paper:
$\sum_{r = 1}^{k} \lceil \frac{i - 1}{2^{r}} \rceil \leq i - 2 + k$

Finally, adding the two values up (recall that for the whole tree $n = 2^{k} - 1 + i$), we get

$$(2^{k} - 1 - k) + (i - 2 + k) = 2^{k} - 1 + i - 2 = n - 2$$

as the upper bound for the number of comparisons.
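The bound can be cross-checked by computing the worst case directly from the tree shape. The sketch below assumes that LEAF-SEARCH spends exactly one comparison at every node that has two children and none at a node with a single child; the function names are invented for this check.

```python
def worst_leaf_search_comparisons(n):
    """Worst-case total of LEAF-SEARCH comparisons over nodes 1..n//2."""
    worst = [0] * (n + 2)               # worst[j]: longest comparison chain from j
    for j in range(n // 2, 0, -1):
        if 2 * j < n:                   # node j has two children
            worst[j] = 1 + max(worst[2 * j], worst[2 * j + 1])
    return sum(worst[1 : n // 2 + 1])

def formula(n):
    """S_k plus the extra comparisons caused by the i lowest-level nodes."""
    k = (n + 1).bit_length() - 1        # number of completely filled levels
    i = n - (2 ** k - 1)                # nodes on the partially filled level
    filled = 2 ** k - k - 1
    # -(-x // y) is integer ceiling division
    extra = sum(-(-(i - 1) // 2 ** r) for r in range(1, k + 1))
    return filled + extra

for n in range(2, 200):
    assert worst_leaf_search_comparisons(n) == formula(n) <= n - 2
print(worst_leaf_search_comparisons(13), formula(13))
```

For example, $n = 5$ and $n = 9$ reach the bound $n - 2$ exactly, while completely filled trees stay below it.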

Best Case Situation

Regardless of the shape of the tree, every LEAF-SEARCH() call runs until it reaches a leaf, so for the completely filled part of the tree the number of comparisons stays at $n - \lceil \log (n + 1) \rceil$, identical to the worst case.

However, consider the situation $n = 2^{k}$; that is, the deepest level contains only a single node (only 1 blue node exists). Although this node adds one more level to the path from the root to the leaf, a single child lets the search fall through without a comparison, saving $1$ comparison for each root concerned. This happens whenever the root lies on the path from node $1$ to node $n$, saving $\lceil \log_{2} n \rceil - 1$ comparisons in total.

Thus, for the best case, the number of comparisons becomes

$$
\begin{aligned}
& n - \lceil \log (n + 1) \rceil - (\lceil \log_{2} n \rceil - 1) \\
= {} & n - \lceil \log (n + 1) \rceil - \lceil \log_{2} n \rceil + 1
\end{aligned}
$$

as the lower bound for the number of comparisons.

2. Number of Cmp. used by the BOTTOM-UP-SEARCH() calls during the heap creation phase.

In this part, we focus on individual special paths; that is, the nodes from the root being processed down to the special leaf. Since the heapify procedure is done bottom-up, the nodes on the path are already ordered according to the rules of a heap.

- The array $b ()$ represents the nodes on the special path, excluding the root, starting from $b (1)$.
  - To represent the borders of the array, we set $b (0) = - \infty$ and $b (d + 1) = \infty$.
- $x$ represents the root node.
- $d$ represents the length of the special path, excluding the root.

The bottom-up search is simple: we place $x$ at the position after $b (d)$, then move it up until it sits right after the node $b (j)$ that satisfies

$$b (j) \leq x \leq b (j + 1)$$

Symbols for calculation

Compared with HEAPSORT, the above action requires the following numbers of comparisons:

| Method | When $j = 0$ | When $0 < j < d$ | When $j = d$ |
| --- | --- | --- | --- |
| HEAPSORT requires $2$ comparisons per level | $2 (j + 1)$ | $2 (j + 1)$ | $2 d$ |
| BOTTOM-UP-SEARCH() requires $1$ comparison per level | $d$ | $d - j + 1$ | $d - j + 1$ |

Then, we define the following (for one root):

- $l_{BUS}$ represents the number of comparisons used by BOTTOM-UP-SEARCH().
- $l_{HS}$ represents half of the number of comparisons HEAPSORT uses.

The value of $l_{BUS} + l_{HS}$ changes according to the conditions:

Calculations

When $j = 0$:

$$
\begin{aligned}
\frac{2 (j + 1)}{2} + d & = (j + 1) + d \\
& = (0 + 1) + d \\
& = d + 1
\end{aligned}
$$

When $0 < j < d$:

$$
\begin{aligned}
\frac{2 (j + 1)}{2} + (d - j + 1) & = (j + 1) + d - j + 1 \\
& = d + 2
\end{aligned}
$$

When $j = d$:

$$
\begin{aligned}
\frac{2 d}{2} + d - j + 1 & = 2 d - j + 1 \\
& = 2 d - d + 1 \\
& = d + 1
\end{aligned}
$$

| When $j = 0$ | When $0 < j < d$ | When $j = d$ |
| --- | --- | --- |
| $d + 1$ | $d + 2$ | $d + 1$ |
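The three cases can be replayed mechanically from the comparison table. The functions below simply encode the table entries; their names are made up for this check.

```python
def half_heapsort_comparisons(j, d):
    """l_HS: half of 2(j + 1) when j < d, half of 2d when j = d."""
    return j + 1 if j < d else d

def bottom_up_search_comparisons(j, d):
    """l_BUS: d when j = 0, otherwise d - j + 1."""
    return d if j == 0 else d - j + 1

def combined(j, d):
    """l_BUS + l_HS for one root."""
    return bottom_up_search_comparisons(j, d) + half_heapsort_comparisons(j, d)

d = 5
for j in range(d + 1):
    # prints d + 1 = 6 at both ends and d + 2 = 7 in between
    print(j, combined(j, d))
```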

The comparison number of normal HEAPSORT

TODO: Read the damn paper which the math is way too over my capacity

According to an even older paper, An Average Case Analysis of Floyd's Algorithm to Construct Heaps, we know that the heap creation phase of HEAPSORT uses, on average,

- $(\alpha_{1} + 2 \alpha_{2} - 2) n + \Theta (\log n)$ comparisons
- $(\alpha_{1} + \alpha_{2} - 2) n + \Theta (\log n)$ interchanges

where $\alpha_{1} = 1.6066951 \dots$ and $\alpha_{2} = 1.1373387 \dots$

Calculating with mean values

Since the input array is randomly chosen, we use $L_{BUS}$, $L_{HS}$ and $D$ to denote the random variables that are the sums of all $l_{BUS}$, $l_{HS}$ and $d$, respectively, over the $n^{'} = \lfloor \frac{n}{2} \rfloor$ calls. (That is to say, for instance, the expected value of $D$ is $E (D) = E (d) \times \lfloor \frac{n}{2} \rfloor = E (d) \, n^{'}$.)

According to the first part of this section, since the number of comparisons used by LEAF-SEARCH() is exactly the sum of the special path lengths over all root nodes, we know that:

$$E (D) = n + \Theta (\log n)$$

Additionally, according to the result quoted above, we also know that:

$$E (L_{HS}) = (\frac{\alpha_{1}}{2} + \alpha_{2} - 1) n + \Theta (\log n)$$

Let $T$ be the number of calls for which $l_{BUS} + l_{HS} = d + 1$; the number of calls for which $l_{BUS} + l_{HS} = d + 2$ is then $n^{'} - T$. Thus we can calculate that:

$$
\begin{aligned}
E (L_{BUS}) & = (n^{'} - E (T)) \cdot (E (d) + 2 - E (l_{HS})) + E (T) \cdot (E (d) + 1 - E (l_{HS})) \\
& = [(n^{'} - E (T) + E (T)) \cdot E (d)] + (n^{'} - E (T)) \cdot 2 + E (T) \cdot 1 - [(n^{'} - E (T) + E (T)) \cdot E (l_{HS})] \\
& = n^{'} E (d) + 2 (n^{'} - E (T)) + E (T) - n^{'} E (l_{HS}) \\
& = E (D) + 2 n^{'} - 2 E (T) + E (T) - E (L_{HS}) \\
& = E (D) + 2 n^{'} - E (T) - E (L_{HS}) \\
& = E (D) + 2 \lfloor \frac{n}{2} \rfloor - E (T) - E (L_{HS}) \\
& = n + n + (1 - \frac{\alpha_{1}}{2} - \alpha_{2}) n - E (T) + \Theta (\log n) \\
& = (3 - \frac{\alpha_{1}}{2} - \alpha_{2}) n - E (T) + \Theta (\log n)
\end{aligned}
$$

To obtain the value of $E (T)$, we consider the 2 situations: when $j = 0$, and when $j = d$.

Situation $j = 0$: this means that the root node is already the smallest (for a min-heap) within its subtree. If the subtree contains $r$ nodes, the probability that a node is in this case is $\frac{1}{r}$.

Consider a completely filled tree: on the 2nd lowest level, there are $\frac{n}{4}$ subtrees, each containing 3 nodes; on the 3rd lowest level, there are $\frac{n}{8}$ subtrees, each containing 7 nodes. We may consider only a full tree since a $\Theta (\log n)$ error is allowed.

In general, on the $h$ -th lowest level there are $\frac{n}{2^{h}}$ subtrees, each containing $2^{h} - 1$ nodes; thus this situation is expected to happen $\frac{n}{2^{h}} \times \frac{1}{2^{h} - 1}$ times on each level, and the expected number of cases where $j = 0$ is
$\beta n + \Theta (\log n)$, where $\beta = \sum_{h = 2}^{\infty} \frac{1}{2^{h} (2^{h} - 1)}$
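The series for $\beta$ converges very quickly, which a few partial sums make visible. This is only a numeric illustration; `beta_partial_sum` is a name invented here.

```python
def beta_partial_sum(max_h):
    """Sum of 1 / (2^h * (2^h - 1)) for h = 2 .. max_h."""
    return sum(1.0 / (2 ** h * (2 ** h - 1)) for h in range(2, max_h + 1))

for max_h in (2, 4, 8, 16, 32):
    # Each term is roughly a quarter of the previous one,
    # so the sum settles at about 0.1067 after a handful of terms.
    print(max_h, beta_partial_sum(max_h))
```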
Situation $j = d$: this means that after the reheap process, the root node ends up with $0$ children. We first define the following symbols for the old REHEAP() procedure:
- $c$ represents the number of comparisons used.
- $i$ represents the number of interchanges used (that is, how many levels the root moves down).
- $s$ represents the number of children the root node has after the procedure, $s \in \{0, 1, 2\}$.
With the old REHEAP(), each level moved down requires 2 comparisons, plus some additional comparisons at the final level, where the node is compared with all of its children. Thus, the relationship between the three symbols is:
$c = 2 i + s$

To obtain the correct value, we rule out the cases $s = 1$ and $s = 2$.
Because of the shape of a heap, the nodes on the lowest level occupy the leftmost positions, leaving only 1 node that can have exactly one child ($s = 1$): the $\lfloor \frac{n}{2} \rfloor$ -th node. (See the image used for calculating the comparisons of LEAF-SEARCH() for an illustration.) Only the nodes on the path to it can end up at that position, so this case happens at most $\lfloor \log n \rfloor$ times, which is small enough to fall within the error term $\Theta (\log n)$.

For the case $s = 2$, we use the random variables $C$, $I$ and $S$ to represent the sums of all $c$, $i$ and $s$, respectively. From the above property of HEAPSORT, we then know that
$\begin{aligned} E (C) & = E (S) + 2 E (I) \\ \Rightarrow E (S) & = E (C) - 2 E (I) \\ & = [(\alpha_{1} + 2 \alpha_{2} - 2) n + \Theta (\log n)] - 2 [(\alpha_{1} + \alpha_{2} - 2) n + \Theta (\log n)] \\ & = (- \alpha_{1} + 2) n + \Theta (\log n) \end{aligned}$

We can also write an expected value as $\sum_{m \in M} m \, p_{m}$, where $m$ ranges over the set $M$ of possible values of the random variable and $p_{m}$ is the probability of the value $m$; that is to say, $E (s)$ can be written as $0 \cdot p_{s = 0} + 2 \cdot p_{s = 2} = 2 p_{s = 2}$, or $E (S) = 2 p_{s = 2} \cdot n^{'}$. Thus, we can calculate as follows:
$\begin{aligned} p_{s = 2} & = \frac{E (S)}{n^{'}} \cdot \frac{1}{2} \\ & = \frac{(- \alpha_{1} + 2) n + \Theta (\log n)}{\lfloor \frac{n}{2} \rfloor} \cdot \frac{1}{2} \\ & = \frac{(- \alpha_{1} + 2) n + \Theta (\log n)}{n} \\ & = (- \alpha_{1} + 2) + \Theta (\frac{\log n}{n}) \\ & \approx 0.3933049 \dots + \Theta (\frac{\log n}{n}) \end{aligned}$

Therefore, we obtain
$\begin{aligned} p_{s = 0} & = 1 - p_{s = 2} \\ & = 1 - [(- \alpha_{1} + 2) + \Theta (\frac{\log n}{n})] \\ & = \alpha_{1} - 1 + \Theta (\frac{\log n}{n}) \\ & \approx 0.6066951 \dots + \Theta (\frac{\log n}{n}) \end{aligned}$

and the expected number of cases where $j = d$ becomes:
$p_{s = 0} \cdot n^{'} = (\frac{\alpha_{1}}{2} - \frac{1}{2}) n + \Theta (\log n)$

Thus, combining the equations for $E (L_{BUS})$ and $E (T)$, we finally get

$$
\begin{aligned}
E (L_{BUS}) & = (3 - \frac{\alpha_{1}}{2} - \alpha_{2}) n - E (T) + \Theta (\log n) \\
& = (3 - \frac{\alpha_{1}}{2} - \alpha_{2}) n - [(\beta n + \Theta (\log n)) + ((\frac{\alpha_{1}}{2} - \frac{1}{2}) n + \Theta (\log n))] + \Theta (\log n) \\
& = (\frac{7}{2} - \alpha_{1} - \alpha_{2} - \beta) n + \Theta (\log n)
\end{aligned}
$$

as the expected number of comparisons used in BOTTOM-UP-SEARCH() during the heap-creation phase.

3. The Number of Comparisons Used in the Heap Creation Phase

Simply adding up the results of the above 2 sections, we get

$$(\frac{9}{2} - \alpha_{1} - \alpha_{2} - \beta) n + \Theta (\log n)$$
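As a rough sanity check, the coefficient can be evaluated numerically and compared with a small simulation of the heap creation phase. Everything below is a sketch under assumptions: it builds a max-heap on random permutations, counts one comparison per loop step of LEAF-SEARCH and BOTTOM-UP-SEARCH (none when the search reaches the root itself), and uses the values of $\alpha_{1}$ and $\alpha_{2}$ quoted in this note. The function names and the choice of $n = 4095$ with 20 trials are arbitrary.

```python
import random

ALPHA_1, ALPHA_2 = 1.6066951, 1.1373387     # constants quoted in this note

def beta(max_h=60):
    """Partial sum of beta = sum_{h >= 2} 1 / (2^h * (2^h - 1))."""
    return sum(1.0 / (2 ** h * (2 ** h - 1)) for h in range(2, max_h + 1))

def predicted_coefficient():
    """(9/2 - alpha_1 - alpha_2 - beta), the factor in front of n."""
    return 4.5 - ALPHA_1 - ALPHA_2 - beta()

def build_heap_bottom_up(values):
    """Build a max-heap bottom-up; return (heap, element comparisons)."""
    a = [None] + list(values)               # 1-based indexing, a[0] unused
    n = len(values)
    comparisons = 0
    for i in range(n // 2, 0, -1):
        j = i                               # LEAF-SEARCH
        while 2 * j < n:
            comparisons += 1
            j = 2 * j if a[2 * j] > a[2 * j + 1] else 2 * j + 1
        if 2 * j == n:
            j = n
        x = a[i]                            # BOTTOM-UP-SEARCH
        while j != i:
            comparisons += 1
            if a[j] > x:
                break
            j //= 2
        carry, k = x, j                     # INTERCHANGE
        while True:
            a[k], carry = carry, a[k]
            if k == i:
                break
            k //= 2
    return a[1:], comparisons

def average_ratio(n=4095, trials=20, seed=0):
    """Average number of comparisons per element over random permutations."""
    rng = random.Random(seed)
    total = 0
    for _ in range(trials):
        values = list(range(n))
        rng.shuffle(values)
        total += build_heap_bottom_up(values)[1]
    return total / (trials * n)

print(round(predicted_coefficient(), 4), round(average_ratio(), 4))
```

The measured ratio should land a little below the predicted coefficient of about 1.649, since the $\Theta (\log n)$ term is still noticeable for trees of this size.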

TODO: requires explanation for the content after Lemma 4.4

