Try   HackMD

Note of Bottom-up Heapsort (English Version)

BOTTOM-UP-HEAPSORT, a new variant of HEAPSORT beating, on an average, QUICKSORT (if n is not very small)

Number of Comparison of Sorting Algorithms

General Comparision Sort

Worst Case Analysis

For a input of

n different entries, there are at most
n!
permutations, which only 1 of them is the sorted list. If a algorithm always fisinhs after
f(n)
steps (comparisons), it can at most distinguish
2f(n)
different permutaions, due to the fact there are only 2 result of each comparison, eliminating half of the possibilities.

Therefore, the algorithm must at least yield

2f(n)n!, or
f(n)log2n!

Since the number of comparisons is definetely a whole number, we can assume the lower bound of it at worst case as

logn! , then use the Strling's Formula to simplify:
lnn! =nlnnn+Θ(lnn)log2n!log2e=nlog2nlog2elog2enlog2e+Θ(log2nlog2e)log2n!=nlog2nnlog2e+Θ(log2n)nlog2n1.4427n

When

n is big enough, we can ignore the error
Θ(log2n)
.

Merge Sort

Requires at least

nlognn+1 comparisions, but also needs an additional array of length
n
, mergesort is only useful for external sorting.

Insertion Sort

Requires only at most

logn!+n1 comparisions. However, the average number of nonessencial operaions (e.g., swapping) is bounded at
Θ(n2)
, which is much higher than other sorting algorithms we consider useful's number (
O(nlogn)
).

Quick Sort

(Not that important.)




Algorithm of Heap Sort / Bottom-Up Heap Sort

  • BOLD-CAPTALIZED-DASHED-FONTS
    method names / keywords (e.g., FOR)
  • monospace fonts
    variables
  • ITALIC-CAPTALIZED-UNDERSCORED-FONTS
    other methods called wwithin methods

Basic Heap Sort Analyzation

First, we define these followings to perform a heap sort.

  • LEFT(i)
    The left child of element at index i.
  • RIGHT(i)
    The right child of element at index i.
  • HEAPIFY(A, i)
    Assuming in tree A, both subtree which roots are LEFT(i) / RIGHT(i) already satisfies the properties of a heap. Reorder so that the subtree which root is i also satisfies the properties of a heap.
    The shift of i is performed at most h times, assuming h
    =log(n)
    is the height of subtree which root is i; before each shift, at most 2 comparision is required.
  • BUILD-HEAP(A)
    Reorder tree A so that its element satisfies the properties of a heap. That is, to call HEAPIFY on the
    n2
    -th node down to the 1st node.

Then, to perform heap sort, we just need to perform the following procedure:

HEAPSORT(A):
      BUILD_HEAP(A)
      for i = A.length downto 2
            swap A[i] with A[1]
            A.heapsize = A.heapsize - 1
            HEAPIFY(A, 1)

Bottom-Up Variation

Recall that when reheaping, on every level of shifting, 2 comparisions are required to determine which element to go on. Furthermore, since most node of a tree are located near the leaf, we can assume almost all reheaping goes until the leaf.

Thus, there will be a significant difference if we could eliminate 1 of the 2 comparision required.

Special Leaf & Special Path

To achieve the above goal, we try to find the final location of i (as in REHEAP(i)) then modify the heap.

From i, we find the Special Path by comparing the two child, choose the one that i should be exchanging with, until we reach the leaf, which we call the Special Leaf.

LEAF-SEARCH(A, i): (if building min-heap)
      j = i
      while j < A.heapsize do
            if A[2 * j] < A[2 * j + 1]
                  j = A[2 * j]
            else
                  j = A[2 * j + 1]
      return j

Following it, we search from the special leaf, to where i is supposed to be. Consider the special path as b[1]...b[k]...b[h], the required comparision number is

1×(k1)+2×(hk)
comparisions.

BOTTOM-UP-SEARCH(A, i, j): (if building min-heap)
      while A[i] < A[j] do
            j = floor(j / 2)

After we found the position, we must perform the exchange, assuming on the special path excluding the root is b[1]...b[k]...b[h], and the root as x. The procedure INTERCHANGE(A, i, j) preform actions so that:

  • x take position of b[k]
  • b[1] ~ b[k] take position of their parent

Last, we modify HEAPIFY(A) into the following:

BOTTOM-UP-HEAPIFY(A, i):
      j = LEAF-SEARCH(A, i)
      j = BOTTOM-UP-SEARCH(A, i, j)
      INTERCHANGE(i, j)

Substituting into the original heapsort algorithm, we get the bottom-up heap sort algorithm.




Calculating Number Of Comparisions Of The Bottom-Up Heapsort Algorithm

To calculate the total comparision times, we discuss seperate procedures and try to calculate the number of comparison each would use.

1. Number of Cmp. used by the LEAF-SEARCH() calls during the heap creation phase.

The following is a example tree that we could calculate for. Notice that

  • k
    represents the number of completely filled levels. Can be calculated by
    log2(n+1)
    .
  • i
    represents the number of node at the lowest level, if its not completely filled.
  • n
    represents the number of nodes we're considering for. For the whole tree
    n=2k1+i
    .
    Image Not Showing Possible Reasons
    • The image file may be corrupted
    • The server hosting the image is unavailable
    • The image path is incorrect
    • The image format is not supported
    Learn More →
  • When creating building a heap, the procedure LEAF-SEARCH() is called for node
    1
    to
    n/2
    .

Worst Case Situation

  • We first calculate the number of comparisions for nodes for the fully filled layers (That is, the orange nodes, considering

    n=2k1).
    For each
    l
    that
    1lk1
    , There exists
    2k(l+1)
    nodes, located at level
    k1l
    , casuing
    l
    comparisions each. For example, if we substitute
    l=1
    to the formula, we get that on level
    3
    , there are
    24(1+1)=4
    nodes that require
    1
    comparisions to the reach the leaf. To sum it up, we try to calculate the result of
    l=1k1(l2kl1)

    Considering
    Sm
    as the result when
    k=m
    , we can get the recursive formula
    Sm=2Sm1+m1

    By substuting
    Tm=Sm+m
    , we can generate:
    Sm+(m1)=2(Sm1+(m1))Tm1=2Tm1

    Then, We try to get the form
    Tm+c=2(Tm1+c)
    . To do so, we plug in
    T1=1
    and
    T2=3
    , to get
    c=1
    , thus resulting in the following equation allowing us to substitute to itself:
    Tm+1=2(Tm1+1)=2(2(Tm2+1))= 22Tm2=2(2(Tm3+1))= 23Tm3                        =2m1(T1+1)= 2m12=2m(Sm+m)+1=2m

    Thus, we finally get the following equation:
    Sk=2kk1

    To represent with
    n
    , we again substitute with
    n=2k1
    , getting the number of comparisions:
    nlog(n+1)

  • Next, we try to calculate the last

    i nodes. ( That is, the blue nodes ).
    Every pair of the blue nodes ( that is, when
    i2
    ) causes every node above them costing
    1
    more comparisons, that is, for level
    kr,1rk1
    , the blue nodes causes
    i12r

    Causing the procedure requireing an additional
    r=1k i12r

    Comparisons. Since the formula contains a ceiling function, we try to obtain a upper bound. First, we seperate the integer part and the fractal part (which would be modified by the ceiling function) from the formula, getting:
    r=1ki12r+r=1k(i12ri12r)

    The first term is easy to calculate. Eliminating the constant term, we get
    (i1)r=1k12r
    which can be easily evaulated into
    (i1)(112k)

    For the second term, we use the property that
    r=1k(i12ri12r)[(112)+(1122)++(112k)]k terms=k[12+122++12k]=k[112k]

    Adding the two terms up results in
    r=1ki12r+r=1k(i12ri12r)(i1)(112k)+k[112k]=i1i12k+k1+12k=i+k2+2i2k

    Since
    i2
    , we are able to elimate the last term, thus obtaining the inequality stated in the paper:
    i12ri2+k

Finally, adding the two value up, we result in: (Recall that for a full tree,

n=2k1
)
2k1k+i2+k=n2
As the upper-bound for the number of comparisions.

Best case Situation

Due to the fact that whatever the form the tree is, all LEAF_SEARCH() runs until the leaf, thus the number of comparisions remains as

nlog(n+1), identical with the worst case.

However, consider the situation that

n=2k. That is, the deepest layer only contains 1 simgle node (Only 1 blue node exists). Althought creating an additional level through the root to leaf, a single simply lets the calculation fall through, saving
1
comparisions for each root
. This happens if the root sits on the path
1
to
n
, saving
log2n1
comparisons in total.

Thus for the best case, the number of comparisions becomes

nlog(n+1)(log2n1)= nlog(n+1)log2n+1

As the lower-bound for the number of comparisions.


2. Number of Cmp. used by the BOTTOM-UP-SEARCH() calls during the heap creationn phase.

In this part, we focus on individual special paths; that is, the nodes from the processing root till the special leaf. Since the heapify procedure is done bottom-up, the nodes on the path should be well ordered according to the rules of a heap.

  • The array
    b()
    represents the nodes on the special path, excluding the root, starting from
    b(1)
    .
    • Representing the border of the array,
      b(0)=,b(d+1)=
      .
  • x
    represents the root node.
  • d
    represents the length of the special path, excluding the root.

The bottom-up search is simple: we place

x at the position after
b(d)
, then move up until before node
b(j)
that satisfies
b(j)xb(j+1)
.

Symbols for calculation

Compared with HEAPSORT, the above action requires the following numbers of comparison:

Method When
j=0
When
0<j<d
When
j=d
HEAPSORT
requires
2
comparision per level
2(j+1)
2(j+1)
2d
BOTTOM-UP SEARCH()
requires
1
comparision per level
d
dj+1
dj+1

Then, we define the following (for one root,):

  • lBUS
    represents the number comparison used by BOTTOM-UP-SEARCH()
  • lHS
    represents half of the number of comparison HEAPSORT uses.

The result of

lBUS+lHS changes according to the condtions:

Calculations

When

j=0
2(j+1)2+d=(j+1)+d=(0+1)+d=d+1

When

0jd
2(j+1)2+(dj+1)=(j+1)+dj+1=d+1

When

j=d
2d2+dj+1=2dj+1=2dd+1=d+1

When
j=0
When
0jd
When
j=d
d+1
d+2
d+1

The comparison number of normal HEAPSORT

TODO: Read the damn paper which the math is way too over my capacity

According to an even older paper An Average Case Analysis of Floyd's Algorithm to Construct Heaps, we know that HEAPSORT uses, on average,

  • (α1+2α22)n+Θ(logn)
    comparisions
  • (α1+α22)n+Θ(logn)
    interchanges

Where

α1=1.6066951,  α2=1.1373387

Calculating with mean values

Since the input array is randomly choosed, we use

LBUS,
LHS
and
D
to denote the random variable that is the sum of all
lBUS
,
lHS
and
d
, respectively. ( That is to say, for instance, the excpected value of
D,E(D)=d×n2=dn
)

According to the first part this section, since the comparision time of LEAF-SEARCH() is exactly the sum of special length for each root node, we know that:

E(D)=n+Θ(logn)

Additionally, According to the above theory, we also know that:

E(LHS)=(α12+α21)n+Θ(logn)

By defining the number of calls that

 lBUS+lHS=d+1 as
T
, the number of calls that
 lBUS+lHS=d+2
would be
nT
. Thus we can calculate that:
E(LBUS)=(nE(T))(E(d)+2E(lHS))+E(T)(E(d)+1E(lHS))=[(nE(T)+E(T))E(d)]+(nE(T))(2)+E(T)(1)+[(nE(T)+E(T))E(lHS)]=nE(d)+2(nE(T))+E(T)+nE(lHS)=E(D)+2n2E(T)+E(T)E(LHS)=E(D)+2nE(T)E(LHS)=E(D)+2n2E(T)E(LHS)=n+n+(1α12α2)nE(T)+Θ(logn)=(3α12α2)nE(T)+Θ(logn)

To optain the value of

E(T), we consider the 2 situations: when
j=0
, and when
j=d
.

  • Situation

    j=0 Means that the root node is the smallest (if its a max heap) within the subtree. If the subtree contains
    r
    nodes, the probability of a node be in that case is
    1r
    .

    Consider a fully filled tree: From the 2nd lowest level, there exists

    n4 subtrees, containing 3 nodes; the 3rd lowest level exists
    n8
    subtrees, each containing 7 nodes. We are able to only consider a full tree since a
    Θ(logn)
    error is allowed.

    In general, the

    h-th lowest level exists
    n2h
    subtrees, contains
    2h1
    nodes; thus we can conclude there is a
    n2h×12h1
    chance this situation happens on each level; thus we obtain the expected nunber of case where
    j=0
    being
    βn+Θ(logn) , where β=h=212h(2h1)

  • Situation

    j=d Means that after the reheap process, the root node should have
    0
    childs
    ; we first define the following symbols. When using the old REHEAP() procedure:

    • c
      represents the number of comparisons used.
    • i
      represents the number of interchanges used (moving how many levels down).
    • s
      represents the number of childs the rood node have after the procedure,
      s{0,1,2}
      .

    Using the old REHEAP(), to advance down, each level requires 2 comparisons, and some additional comparisons at the desired level, comparing with all of its childs. Thus, we can know the relationship between the three symbols being:

    c=2i+s

    To obtain the correct value, we try rule out the cases when

    s=1 or
    2
    .
    Due to the property of a heap, the lowest level nodes shall be placed at smallest indexes, leaving only 1 possible way for a node to have
    s=1
    , that is, the
    2n
    -th node. (See image for calculating cmp. of LEAF_SEARCH() to illustrate.) All nodes on that path might result in the position, thus a
     logn 
    probability for this case, which is low enough landing within the error range
    Θ(logn)
    .

    Again for the case

    s=2, we use the random variable
    C
    ,
    I
    and
    S
    to represent the sum of all
    c
    ,
    i
    and
    s
    respectivly. We then may know that according to the above property of HEAPSORT,
    E(C)=E(S)+2E(I)E(S)=E(C)2E(I)=[(α1+2α22)n+Θ(logn)]2[(α1+α22)n+Θ(logn)]=(α1+2)n+Θ(logn)

    Also, we can also use the equation

    mMmpm to represent expected value, where
    m
    is the random variable under set
    M
    , and
    pm
    as the probability of
    m
    happening; that is to say, represent
    E(s)
    as
    0ps=0+2ps=2=2ps=2
    , or
    E(S)=2ps=2n
    . Thus, we could calculate as following:
    ps=2=E(S)n12=[(α1+2)n+Θ(logn)]n212=(α1+2)n+Θ(logn)n=(α1+2)+Θ(lognn)0.3933049+Θ(lognn)

    Therefore, we can obtain

    ps=0=1ps=2=1[(α1+2)+Θ(lognn)]=α11+Θ(lognn)0.6066951+Θ(lognn)

    And the expected number of cases that

    j=d becomes:
    ps=0n=(α1212)n+Θ(logn)

Thus, combinding the equation of

E(LBUS)
,
E(T)
, we can finally get the result that
E(LBUS)=(3α12α2)nE(T)+Θ(logn)=(3α12α2)n[(βn+Θlogn)+[(α1212)n+Θ(logn)]]+Θ(logn)=(72α1α2β)n+Θ(logn)
as the expected number of comparisons used in BOTTOM-UP-SEARCH() during the heap-creation phase.


3. The Number of Comparison Used in the Heap Creation Phase

Simply add the above 2 sections' result up, we get

(92α1α2β)n+Θ(logn)


TODO: require explaination for content after Lemma 4.4

References

Workplace for Copy&Paste


b(1)