# Sorting Intro: Selection and Insertion

These notes span 2 lectures: Monday (Nov 8) and Wednesday (Nov 10).

## Sorting

Let's start with an exercise. Suppose you were given a list of numbers and asked to put them in sorted order. In class, we'll use notecards for this, but in these notes, I'll just write the list:

> 7 4 2 8 5 3

For now, let's not worry about how the list is represented: it's not a Python list or a linked list or anything else like that. It's just a bunch of numbers on paper. How would you approach sorting this list? Think in basic, tactile terms like swapping cards or moving numbers around.

### Two possible approaches

There are lots of ways to proceed. Here's one sequence, where we keep moving the lowest element in the list up, building a _sorted prefix_ as we go along:

> 7 4 <span style="color:blue">_2_</span> 8 5 3
>
> <span style="color:red">**2**</span> 7 4 8 5 <span style="color:blue">_3_</span>
>
> <span style="color:red">**2 3**</span> 7 <span style="color:blue">_4_</span> 8 5
>
> <span style="color:red">**2 3 4**</span> 7 8 <span style="color:blue">_5_</span>
>
> <span style="color:red">**2 3 4 5**</span> <span style="color:blue">_7_</span> 8
>
> <span style="color:red">**2 3 4 5 7**</span> <span style="color:blue">_8_</span>
>
> <span style="color:red">**2 3 4 5 7 8**</span>

I've used <span style="color:red">**bold red**</span> to mark the sorted prefix of numbers that have already been shuffled into their proper place, and <span style="color:blue">_italic blue_</span> to mark the element being moved next. So after the <span style="color:red">**2**</span> has been moved, the <span style="color:blue">_3_</span> is put into position after it and then gets bolded and colored red.
Here's another, where we always take the first unsorted element and move it into its proper place in the sorted prefix:

> <span style="color:blue">_7_</span> 4 2 8 5 3
>
> <span style="color:red">**7**</span> <span style="color:blue">_4_</span> 2 8 5 3
>
> <span style="color:red">**4 7**</span> <span style="color:blue">_2_</span> 8 5 3
>
> <span style="color:red">**2 4 7**</span> <span style="color:blue">_8_</span> 5 3
>
> <span style="color:red">**2 4 7 8**</span> <span style="color:blue">_5_</span> 3
>
> <span style="color:red">**2 4 5 7 8**</span> <span style="color:blue">_3_</span>
>
> <span style="color:red">**2 3 4 5 7 8**</span>

__What do you notice in common between these two approaches?__ They both move one number at a time, and implicitly divide the list into "sorted" and "unsorted" sections.

__What do you notice that's different?__ The first approach always finds and moves the _smallest_ element remaining in the unsorted list. The second always moves the _first_ unsorted element.

The first method is called _selection sort_ because it selects the smallest element to move every time. The second method is called _insertion sort_, because it inserts the next element into its proper place. These are the two sorting algorithms that people most often use naturally when sorting things like cards or paper records. They both have more or less the same worst-case efficiency, but more on that later.

### Insertion Sort: Runtime

Let's focus on insertion sort. Can we write out the program, or at least specify the steps? Here's a description we might use:

```
loop over everything in the list
    move it into the right place in the sorted prefix
```

We haven't written Python yet, but we can still reason informally about the worst-case runtime.

#### On a list of length N, what will the big-O runtime of insertion sort be?

<details>
<summary><B>Think, then click!</B></summary>

$O(N^2)$.
This is because the first line in the above description loops $N$ times, and the second line loops from $1$ to $N-1$ times, depending on how far into the list we are and how unsorted it is. This is the same situation as we saw when writing `distinct` a few weeks back.

In the worst case, the list is sorted in reverse order, and we need to move each element to the beginning of the list. But we won't discover that until we've actually examined the entire sorted prefix. We'll check $1$, then $2$, then $3$, and so on. $\sum_{i=1}^{n-1} i = \dfrac{n(n-1)}{2}$, which is in $O(N^2)$.

</details>

#### How about the _best_ case?

Remember that we're talking about the best case for an arbitrary list length. Of course a list of only 1 element is easy to sort, but that's not what we mean here.

<details>
<summary><B>Think, then click!</B></summary>

In the best case, the list is already in order and so no elements need to be moved at all; they're already in the proper insertion position. Thus, insertion sort's best-case performance is $O(N)$.

Still, worst-case $O(N^2)$ means there are _better_ algorithms for sorting than insertion sort. However, if your list is _mostly_ sorted, then insertion sort can perform very well.

</details>

<br/>

We'll come back to selection sort later.

## Implementing Insertion Sort

Let's write a version of insertion sort that works over Python lists. We'll make the design decision to modify the list itself, rather than returning a new list; this is sometimes called an _in-place_ sort.

```python
def insertion_sort(lst: list):
    """
    loop over everything in the list
        move it into the right place in the sorted prefix
    """
    pass
```

### Some tests

Let's write some tests as examples to make sure we understand the problem and some of the most likely ways we could get it wrong.

```python
def test_insertion_sort():
    list1 = []
    list2 = [1]
    list3 = [1, 2, 3]
    list4 = [-1, 17, 12, 4, 2, 14, 1, 1, 17]
    # ...
    insertion_sort(list1)
    insertion_sort(list2)
    insertion_sort(list3)
    insertion_sort(list4)

    assert list1 == []
    assert list2 == [1]
    assert list3 == [1, 2, 3]
    assert list4 == [-1, 1, 1, 2, 4, 12, 14, 17, 17]
```

### Writing the code

At first, we might want to start like this. After all, that's what the 2-line summary said: "loop over everything in the list".

```python
for item in lst:
    # ???
```

But there's a problem. All we have in the loop body is the element itself, not the context we need to _move the element around_ in the list. We don't know where the previous element is, or if there even is one. We're going to need a different kind of loop.

A _while loop_ in Python is a loop that keeps running so long as its condition is true. We'll set one of those up here, starting at index 1 (since the first element is already "sorted" on its own):

```python
index = 1
while index < len(lst):
    # ???
```

Note that a while loop is more dangerous than a for loop. We need to _stop it_ somehow, or else it will continue running forever. We'll want to make sure that the `index` variable keeps increasing or we'll be in trouble. (If you ever find yourself in an infinite loop in Python, you can often terminate the program by hitting `CTRL-C`.)

Moving on, we might proceed like this, using explicit swapping to move elements upward:

```python
index = 1
while index < len(lst):
    element = lst[index]
    if element < lst[index-1]:  # safe because index starts at 1
        lst[index] = lst[index-1]
        lst[index-1] = element
    index = index + 1
```

This will work great on lists of 1 or 2 elements, but on larger lists we might need to move an element further than just 1 position. To see why, return to the example from before:

> 7 4 2 8 5 3

If `index == 1` (pointing to the `4`), the above works. But after that step:

> 4 7 2 8 5 3

when `index == 2` (pointing to the `2`), we would produce:

> 4 2 7 8 5 3

which doesn't move the `2` far enough. Now what? We need to _keep moving the element_ as long as needed.
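To see the shortfall concretely, we can run the single-swap attempt on the whole example list. This is just a quick check, with the attempt wrapped in a function (the name `one_swap_sort` is made up for this sketch):

```python
def one_swap_sort(lst: list):
    """The single-swap attempt: moves each element at most one position left."""
    index = 1
    while index < len(lst):
        element = lst[index]
        if element < lst[index-1]:  # safe because index starts at 1
            lst[index] = lst[index-1]
            lst[index-1] = element
        index = index + 1

numbers = [7, 4, 2, 8, 5, 3]
one_swap_sort(numbers)
print(numbers)  # prints [4, 2, 7, 5, 3, 8]: the 2 and the 3 didn't travel far enough
```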
That sounds like another loop. But we need to be careful to make sure it stops:

```python
index = 1
while index < len(lst):
    element = lst[index]
    while index > 0 and element < lst[index-1]:  # safe because the loop checks index > 0
        lst[index] = lst[index-1]
        lst[index-1] = element
        index = index - 1
    index = index + 1
```

That's better. Maybe it even works! We can try some examples, and watch it running in the debugger. In fact, let's watch it work on our big test in the debugger. We'll see that we're inefficiently conflating the index of _the element we're inserting_ with the index we're _considering for insertion at_. Let's give these two ideas different names:

```python
index = 1
while index < len(lst):
    element = lst[index]
    insertion_index = index
    while insertion_index > 0 and element < lst[insertion_index-1]:  # safe: checks insertion_index > 0
        lst[insertion_index] = lst[insertion_index-1]
        lst[insertion_index-1] = element
        insertion_index = insertion_index - 1
    index = index + 1
```

### Insertion sort runtime (more formally)

Now let's convince ourselves that the worst-case runtime is indeed $O(N^2)$. Here's a worst-case input:

> [5, 4, 3, 2, 1]

The outer loop will perform $N-1 = 4$ operations (i.e., we'll move 4 elements). The inner loop runs once, then twice, and so on. So our intuition above seems to fit. We can confirm this by examining the program and labeling how many times each operation runs:

```python
index = 1                                                            # 1 operation
while index < len(lst):                                              # N-1 operations (comparisons)
    element = lst[index]                                             # N-1 operations
    insertion_index = index                                          # N-1 operations
    while insertion_index > 0 and element < lst[insertion_index-1]:  # N(N-1)/2 times
        lst[insertion_index] = lst[insertion_index-1]                # N(N-1)/2 times
        lst[insertion_index-1] = element                             # N(N-1)/2 times
        insertion_index = insertion_index - 1                        # N(N-1)/2 times
    index = index + 1                                                # N-1 operations
```

In total, this is $1 + 4(N-1) + 4(N(N-1)/2)$, which is in $O(N^2)$.

### Selection sort runtime

What about selection sort?
Let's look at a run; recall that selection sort works by finding the smallest element that isn't in the sorted prefix, and putting it at the end of the sorted prefix.

> 3 1 4 7 2
>
> 1 3 4 7 2
>
> 1 2 3 4 7   # can it tell we're done yet? (this is important)
>
> 1 2 3 4 7
>
> 1 2 3 4 7

We might write this as:

```
for a list of length N, do N times:
    find the smallest element of the unsorted list
    move it to the end of the sorted section
```

Selection sort doesn't know where that smallest element is. It has to _look_ for it, no matter what. As a result, there's no distinction between the best and worst cases: it always proceeds in the same way. It needs $O(n)$ steps to find the smallest unprocessed element _every_ time, so selection sort takes $O(n^2)$ steps in the best case as well as the worst.

### Looking ahead

It turns out that we can be very clever and get a better worst-case runtime than $O(n^2)$. More on this next time! But for now, I wonder if there's a better way to find the right spot to insert elements than a linear search?

It's also worth thinking about when one sort might be better than another. Worst-case and best-case runtimes are important, but not the only criteria. What else do you think we should care about when picking an approach to sorting?
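For reference, here's one way the selection sort pseudocode might look as in-place Python, in the same style as our insertion sort. This is just a sketch (we didn't write selection sort in lecture), and it swaps the smallest element into place rather than shifting, so its intermediate states differ slightly from the hand trace:

```python
def selection_sort(lst: list):
    """
    for a list of length N, do N times:
        find the smallest element of the unsorted list
        move it to the end of the sorted section
    """
    boundary = 0  # everything before this index is the sorted prefix
    while boundary < len(lst):
        # Scan the unsorted section for the index of its smallest element.
        smallest_index = boundary
        index = boundary + 1
        while index < len(lst):
            if lst[index] < lst[smallest_index]:
                smallest_index = index
            index = index + 1
        # Swap the smallest element to the end of the sorted prefix.
        lst[boundary], lst[smallest_index] = lst[smallest_index], lst[boundary]
        boundary = boundary + 1
```

Notice that the inner scan always covers the whole unsorted section, no matter how sorted the list already is; that's exactly why selection sort's best and worst cases are the same.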