# Data Structure Performance
## Table of Contents
1. [Two Inefficient Set Structures](#twoset)
2. [Making a Reasonable Tradeoff: Hash Functions](#hashing)
3. [The Challenge: Collisions](#challenge)
4. [Hashing More Complicated Keys](#keys)
5. [(If time permits) Exercise: Comparing Rainfall Performance](#rainfall)
#### Setting the stage
Last time we saw that switching from a `list` to a `set` in our `distinct` function produced a massive performance improvement. Today we'll learn why. Or, rather, we'll invent for ourselves the trick that Python `set`s use!
## Two Inefficient Set Structures <a name="twoset"></a>
#### Inefficient Structure 1
If we really wanted to, we could use Python lists to store sets. Adding an element would just mean appending a new value to the list, and checking for membership would involve a linear search through the list.
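In code, a sketch of this list-backed "set" might look something like this (the function names here are just for illustration):
```python
def add_element(s: list, value):
    # Adding is cheap: just append (ignoring duplicates for simplicity).
    s.append(value)

def is_member(s: list, value) -> bool:
    # Membership is a linear search: O(n) in the worst case.
    for element in s:
        if element == value:
            return True
    return False
```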
Python lists are built so that accessing any particular index takes constant time. For example, suppose we have a list called `staff` that stores the first names of the staff of a course:
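For example (the names other than `'Ben'` are invented for illustration):
```python
staff = ['Ashley', 'Tim', 'Ben', 'Priya']
```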

If we asked Python for `staff[2]`, it would be able to find `'Ben'` in constant time. This is because the list is a contiguous block in memory; Python doesn't have to count down the list elements one by one, it can just do some arithmetic and skip ahead. (How this works exactly is a topic for a later course.)
The trouble is that this kind of fast _lookup by index_ doesn't help us check for membership in the set. There's no connection between `2` and `'Ben'` except the arbitrary order in which elements were added.
#### Inefficient Structure 2
What if, instead, the element we were searching for was itself the index? What if, say, we wanted to check for membership in the set based on student ID number? Then we'd have a picture like this, where the list elements are just `True` and `False`:

Now we can have lookup in worst-case constant time! In fact, we could even store more than just a boolean: we could store names, or more complex records, etc. and access them, again, in _constant time_.
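Here's a rough sketch of the idea, assuming (as in the example below) that IDs are non-negative integers less than one million:
```python
NUM_POSSIBLE_IDS = 1_000_000

# One list cell for every *possible* ID, almost all of them False.
members = [False] * NUM_POSSIBLE_IDS
members[123456] = True              # record that ID 123456 is in the set

def is_member(student_id: int) -> bool:
    return members[student_id]      # a single index lookup: worst-case constant time
```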
Even better, there are tricks we can use to encode most data-types as numbers. So this idea isn't limited to only looking up non-negative integers. We could use it to solve the lookup-by-name problem above by deciding to encode, say, 'Ben' as $2*26^2+5*26+14$. And, in practice, there are far better encodings.
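A sketch of such an encoding, treating `'A'` as 1, `'B'` as 2, and so on (and ignoring case, spaces, and punctuation entirely):
```python
def encode(name: str) -> int:
    # Read the name as a base-26 number: 'Ben' becomes 2*26**2 + 5*26 + 14.
    number = 0
    for letter in name.upper():
        number = number * 26 + (ord(letter) - ord('A') + 1)
    return number
```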
This all sounds great: it really does get us worst-case constant time lookup. _But there's a cost._ This technique needs a colossally big list—probably more than we could fit in memory! Worse, most of it is going to be used on `False` values, to say "nope, nothing here!". We're making an extreme trade-off of space for time: allocating a list cell for _every potential search key_, but reaping the constant-time lookup rewards.
## Making a Reasonable Tradeoff: Hash Functions <a name="hashing"></a>
Can we find a happy medium between these solutions? Maybe by spending a _little_ bit of extra space, we can somehow end up with constant-time lookup? And even if we don't always manage $O(1)$, maybe we can still improve on both the extremes above.
Let's start with another design question. Suppose we want to map the IDs of CSCI 0112 staff to their corresponding names. Further, let's say that IDs are non-negative numbers ranging between `0` and `999999`, but that there are far fewer than a million staff members: fewer than 10, in fact, so you'll never have more than 10 datapoints to store in a given semester.
So I'm going to give you a Python list with just 10 elements to work with. Strictly speaking, that ought to be enough space, but nowhere near enough to make a list cell for every _possible_ ID. Here's what it looks like:

How can we convert a 6-digit student ID to a _unique index between 0 and 9_?
We can't.
But what if we were willing to give up the "unique" part of that question? Do we know a way to convert an ID to a number between 0 and 9 (while still making a reasonable _effort_ at uniqueness)?
Every ID has a remainder when divided by 10. For instance, `1234` has remainder `4` when divided by 10. Let's try using this as our index function. We'll insert two datapoints: `{1234: 'Tim'}` and `{5678: 'Ashley'}`.

Now if we want to look up the key `5678` we can do so in constant time! Just take the remainder when dividing by 10 to get `8`, and do a lookup at that index in this Python list.
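In code, this tiny table might look something like the sketch below:
```python
table = [None] * 10                # ten slots, all empty to start

def insert(student_id: int, name: str):
    table[student_id % 10] = name

def lookup(student_id: int) -> str:
    return table[student_id % 10]

insert(1234, 'Tim')                # 1234 % 10 == 4, so 'Tim' lands in slot 4
insert(5678, 'Ashley')             # 5678 % 10 == 8, so 'Ashley' lands in slot 8
lookup(5678)                       # 'Ashley', found in constant time
```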
Do you see a problem?
<details>
<summary><B>Think, then click!</B></summary>
We gave up uniqueness! For every index $i$ in the list, there are one hundred thousand potential keys with remainder $i$ when divided by 10.
Just look up the key `0004` in the above list to see it. This isn't Tim's ID, but it still gets mapped to the string `'Tim'`. This is often called a _collision_: two actually-used keys get mapped to the same index, with potentially destructive results.
So: this is a nice start, but we'd better do a little bit more work.
</details>
#### A Note on Terminology
This kind of key-to-index transformation is often called a _hash function_, where "hash" here means a digest or summary of the original data. Hash functions are usually fast, lossy, and (in applications beyond this lecture!) built to have a relatively uniform distribution of hashes over the key population to reduce the chance of collision.
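Python exposes its own hash function as the built-in `hash`, and taking a remainder squeezes the result down to a table index. (The exact value for a string will differ from run to run, since Python deliberately randomizes string hashing.)
```python
hash(203947)        # small non-negative ints hash to themselves
hash('Ben')         # some large, hard-to-predict integer
hash('Ben') % 10    # ...squeezed down to an index between 0 and 9
```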
## The Challenge: Collisions <a name="challenge"></a>
You'll learn more about the details if you take CSCI 0200. In short, there are a few ways of handling collisions: my favorite is storing a separate list for every index, but there are more sophisticated approaches.
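Here's a rough sketch of that separate-list-per-index idea (often called _chaining_); real implementations are more careful, but the concept is the same:
```python
table = [[] for _ in range(10)]     # one (initially empty) list per index

def insert(key: int, value):
    # Every key with this remainder shares the same bucket.
    table[key % 10].append((key, value))

def lookup(key: int):
    # A linear search, but only within one (hopefully short) bucket.
    for stored_key, value in table[key % 10]:
        if stored_key == key:
            return value
    return None
```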
The key takeaway is that, in the <B>worst case</B> (all keys hash to the same index, nothing but collisions) searching a hash table is as slow as a list: $O(n)$. With a well-chosen hash function, and enough space in the table, collisions turn out to be rare. With some probability theory, we could prove that the <B>average case</B> is $O(1)$. It's _usually_ incredibly fast, and anyway never scales worse than a list.
That's how Python's sets and dictionaries manage to be so fast, and it's why your choice of data structure matters. You'll be choosing between lists, dictionaries, sets, etc. on the final project, so keep these scaling differences in mind.
#### Another Note on Terminology
This kind of data structure is called a _hash table_. For myself, I like to make a distinction between a hash table and a dictionary: a hash table is a concrete, low-level data structure, while a dictionary is a high-level abstraction. A dictionary (also called a _map_ in some other languages) is an abstract description of the shape of the data (a key-value store) and certain common operations (like lookup). You could implement a dictionary using any of the data structures above! Hash tables are by far the most common (and usually the wisest) way to implement a dictionary, and Python's dictionaries are backed by hash tables, so the two terms are often used interchangeably.
## Hashing More Complicated Keys <a name="keys"></a>
In principle, you can use any datatype you'd like as either the key or value in a hash table (and thus in a Python set or dictionary). There's one snag: the key needs to be something that can be reliably hashed. And if the key is a mutable datatype, like a list, it might hash to one value today and a different value tomorrow. Python avoids this problem by forcing us to use _immutable_ datatypes as keys in sets and dictionaries. Numbers, strings, etc. are all OK.
If you want to use something like a list, check out the `tuple` datatype instead.
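For example, a `list` can't be used as a dictionary key, but the corresponding `tuple` can (the coordinates below are just illustrative values):
```python
capitals = {}
# capitals[[41.8, -71.4]] = 'Providence'   # TypeError: unhashable type: 'list'
capitals[(41.8, -71.4)] = 'Providence'     # fine: tuples are immutable, so hashable
```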
If you want to use a dataclass as a key, you need to tell Python that the data cannot be changed by passing `frozen=True` to the `dataclass` decorator, like this:
```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Location:
    lat: int
    long: int
```
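Once the dataclass is frozen, its instances can be used as keys or set elements. A quick, made-up example:
```python
favorites = {Location(lat=41, long=-71): 'Providence'}
Location(lat=41, long=-71) in favorites    # True: frozen dataclasses hash by field values
```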
There is a lot more to hash tables than we have time to discuss in 0112. If you want to learn more (without official 0112 support) about how hashing works in Python, check out the documentation on the `__hash__` method. If you want to learn more in the context of a course, check out CSCI 0200!
## (If time permits) Exercise: Comparing Rainfall Performance <a name="rainfall"></a>
We didn't get to this last time, so, as a bit of practice, let's figure out the big-O runtimes of these two alternative rainfall solutions:
#### Rainfall Version 1
```python
def average_rainfall(sensor_input: list) -> float:
    number_of_readings = 0
    total_rainfall = 0
    for reading in sensor_input:
        if reading == -999:
            return total_rainfall / number_of_readings
        elif reading >= 0:
            number_of_readings += 1
            total_rainfall += reading
    return total_rainfall / number_of_readings
```
#### Rainfall Version 2
```python
def list_before(l: list, item) -> list:
    result = []
    for element in l:
        if element == item:
            return result
        result.append(element)
    return result

def average_rainfall(sensor_input: list) -> float:
    readings_in_period = list_before(sensor_input, -999)
    good_readings = [reading for reading in readings_in_period if reading >= 0]
    return sum(good_readings) / len(good_readings)
```
What are the worst-case running times of each solution?
<details>
<summary><B>Think, then click!</B></summary>
For inputs of size $n$, both solutions run in $O(n)$ time. This is true even though version 2 contains multiple `for` loops. The reason is that these `for` loops are sequential, and not "nested": the program loops through the input, and then loops through the input again, separately.
For example, this program has worst-case performance _linear_ in the length of `l`:
```python
sum = 0
for i in l:
    sum = sum + i
for i in l:
    sum = sum + i
```
But this program has quadratic performance in the length of `l`:
```python
sum = 0
for i in l:
    for j in l:
        sum = sum + 1
```
</details>