# Program Performance 2

## Table of Contents

1. [Big-O (or asymptotic) Notation](#starting)
2. [Why does this matter?](#sets)
3. [Formally Speaking](#formally)
4. [Exercise: Comparing Rainfall Performance](#rainfall)

## Big-O (or asymptotic) Notation <a name="starting"></a>

Last time, we talked about analyzing the performance of programs. When deciding whether a function would run in constant, linear, or quadratic time, we ignored the constants and just looked at the fastest-growing (i.e., worst-scaling) term. We can define this somewhat more formally using something that's usually called _big-O notation_ (because a capital "O" is involved).

Let's look again at the last function we considered last time:

```python
def distinct(l: list) -> list:
    seen = []
    for x in l:
        if x not in seen:   # 0 + 1 + 2 + ... + (n-1)
            seen.append(x)  # once for every new element
    return seen
```

We decided that this function's running time is quadratic in its input (in the worst case). For a list of length $n$, we can calculate the worst-case number of operations as $\frac{n(n-1)}{2} + n + 2$:

* $\frac{n(n-1)}{2}$: when every element in `l` is unique, the inner membership check has to loop through the entire `seen` list, which grows by 1 element every iteration of the outer loop (and algebra gets us from $0 + 1 + ... + (n-1)$ to this form);
* $n$: we know the `append` has to run once for every element of the input list in the worst case (we glossed over this last time); and
* $2$: constant setup time for creating the `seen` list and returning from the function (last time we may have used $3$ here, counting the initialization and assignment as separate operations).

A computer scientist would say that on a list of $n$ elements, this function runs in $O(n^2)$ time. Said out loud, this might be pronounced: "Big Oh of n squared". The intuition here is that we're trying to collapse out factors like how fast my computer is, and focus on _growth_: is there a way to formalize the idea that one algorithm scales better than another?

#### I wonder...

Why do we get to treat `append` as something that runs in constant time? Does it necessarily do so? Might there be different ways that `append` works, which have different worst-case runtimes?

## Why does this matter? <a name="sets"></a>

You've already noticed that which data structures you use can influence the runtime of your program. If `append` were implemented with a Pyret list, it would be worst-case _linear_, rather than _constant_. Where else do these differences appear?

A week or two ago, I promised to tell you why you might want to store data in a set, rather than a list. Let's run a quick experiment:

```python
def distinct_list(l: list) -> list:
    seen = []
    for x in l:
        if x not in seen:
            seen.append(x)
    return seen

def distinct_set(l: list) -> list:
    seen = set()
    for x in l:
        if x not in seen:
            seen.add(x)
    return list(seen)  # convert back to a list
```

Try running each of these on Frankenstein. On my laptop, the set version took `0.054 total` and the list version took `2.153 total` (both numbers are in seconds). It seems like there's a huge impact on performance. In fact, the version that uses sets performs around the same as `count_the` did last time, and `count_the` had worst-case runtime proportional to the input size. That is, its worst-case performance was in $O(n)$. It seems like using a set has somehow eliminated the linear cost from the `x not in seen` check.

Next time we'll talk about how sets and dictionaries make this kind of fast lookup possible, and whether there are any limitations. (There are limitations.)
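If you'd like to reproduce the comparison yourself, here's a minimal sketch of one way to time the two versions. (The filename `frankenstein.txt` is just a placeholder for wherever you've saved the book's text, and splitting on whitespace is a crude notion of "word"; your setup may differ.)

```python
import time

# Placeholder path to the book's text; adjust to wherever your copy lives.
with open("frankenstein.txt") as f:
    words = f.read().split()  # crude whitespace splitting is fine for timing

for version in (distinct_list, distinct_set):
    start = time.perf_counter()
    version(words)
    elapsed = time.perf_counter() - start
    print(f"{version.__name__}: {elapsed:.3f} seconds")
```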
For now, think of big-O notation as a way to quickly capture how different tools (like data structures or algorithms) scale for different purposes. If you have the option to use something in $O(n)$ instead of something in $O(n^2)$, and there are no other factors involved, you should pick the better-scaling option.

#### I wonder...

Is there a potential bug lurking in the switch from `distinct_list` to `distinct_set`? Hint: it depends on what you might need to use the output for.

## Formally Speaking <a name="formally"></a>

The formal definition looks like this: if we have (mathematical, not Python!) functions $f(x)$ and $g(x)$, then $f(x)$ is in $O(g(x))$ if and only if there are constants $x_0$ and $C$ such that for all $x > x_0$, $f(x) < C \cdot g(x)$.

That is, $f(x)$ is in $O(g(x))$ whenever there is a point ($x_0$) after which $f(x)$ is always bounded by a constant multiple ($C$) of $g(x)$. There may be many such $x_0$ and $C$ values that work, but if even one such pair exists, $f$ is in $O(g)$.

If we want to show that $\frac{n(n-1)}{2}$ is in $O(n^2)$, our constants could be, say, $x_0 = 4$ and $C = 4$: once the list is at least $4$ elements long, $\frac{n(n-1)}{2}$ will always be less than $4n^2$. We could prove that using algebra, but instead, here's a picture motivating the argument:

![](https://i.imgur.com/reHr3Pz.png)

<B>Important Note:</B> We're experimenting with using Wolfram Alpha for generating plots like this. Wolfram Alpha is an excellent tool which is free for personal use. Its terms [allow](https://www.wolframalpha.com/termsofuse/) academic use in settings such as this, provided that we credit them and ideally link to the original query, which you can explore [here](https://www.wolframalpha.com/input/?i=plot+%28n*%28n-1%29%29%2F2+and+n%5E2+from+0+to+5). Note that they don't allow, for instance, web scraping of their tool, so if you use it, please use it responsibly!

In this class, we won't expect you to rigorously prove that a function's running time is in a particular big-O class. We will, though, use the notation; e.g., we'll use $O(n)$ as a shorthand for "linear". And, indeed, the quick trick of looking for the largest term ($n$, $n^2$, etc.) will _usually_ work for everything we give you this semester, and is a good place to start regardless.
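If you'd like to double-check those particular constants without doing the algebra, a few lines of Python will do. (This is only a sanity check over finitely many values, not a proof.)

```python
# Sanity-check the claim that n*(n-1)/2 < 4 * n**2 once n is past x_0 = 4.
x0 = 4
C = 4

for n in range(x0 + 1, 100_000):
    assert n * (n - 1) / 2 < C * n**2, f"bound fails at n = {n}"

print("The inequality held for every n checked.")
```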
## Exercise: Comparing Rainfall Performance <a name="rainfall"></a>

As a bit of practice, let's figure out what the big-O runtime is for those two alternative rainfall solutions:

#### Rainfall Version 1

```python
def average_rainfall(sensor_input: list) -> float:
    number_of_readings = 0
    total_rainfall = 0
    for reading in sensor_input:
        if reading == -999:
            return total_rainfall / number_of_readings
        elif reading >= 0:
            number_of_readings += 1
            total_rainfall += reading
    return total_rainfall / number_of_readings
```

#### Rainfall Version 2

```python
def list_before(l: list, item) -> list:
    result = []
    for element in l:
        if element == item:
            return result
        result.append(element)
    return result

def average_rainfall(sensor_input: list) -> float:
    readings_in_period = list_before(sensor_input, -999)
    good_readings = [reading for reading in readings_in_period if reading >= 0]
    return sum(good_readings) / len(good_readings)
```

What are the worst-case running times of each solution?

<details>
<summary><B>Think, then click!</B></summary>

For inputs of size $n$, both solutions run in $O(n)$ time. This is true even though version 2 contains multiple `for` loops. The reason is that these `for` loops are sequential, not "nested": the program loops through the input, and then loops through the input again, separately.

For example, this program has worst-case performance _linear_ in the length of `l`:

```python
sum = 0
for i in l:
    sum = sum + i
for i in l:
    sum = sum + i
```

But this program has performance _quadratic_ in the length of `l`:

```python
sum = 0
for i in l:
    for j in l:
        sum = sum + 1
```

</details>

#### A note on the drill

One of our multiple-choice answers was "logarithmic". We haven't introduced this concept yet; we'll talk about what it means in a couple of weeks. So far, everything has been constant ($O(1)$), linear ($O(n)$), or quadratic ($O(n^2)$).

#### I wonder...

Is there such a thing as runtime in $O(n^3)$? What might that kind of program look like?