Hashing I - HackMD

## Ice Breaker - Mosaic+ Show and Tell Find an item around you that brings you joy, has a funny memory, or inspires you?💡 We will put you into breakout rooms and have you share what item someone else presented interested you the most? ## Objective: By the end of the lesson, you will learn: * A data structure called Hashtables, which are valuable alternative to lists * Hashing, a technique that is useful in many contexts like data structures and security * You'll get an appreciation for some of the challenges and trade-offs that come up when you're designing a new data structure ### No Spoilers! **Reminder:** if you know the topic we're covering, don't shout it out -- we'll get to it. ## Student Database Let’s say you are a teacher and you want to create a student database to store, manage and retrieve student-related information using their student ID.This table contains the names of the students in your class and their ID. A valid ID is considered to be an integer between 0 and 99. | ID | Student | | --- | ------- | | 04 | Melyssa | | 15 | Dior | | 22 | Nicole | | 20 | Tim | **Discussion:** What data structure would you use to store this data? Think about the data structures you have learned so far. <details> <summary> Think, then click! </summary> We could use a list, especially a nested list! </details> Okay, let's try to store this data using a nested list. **Discussion:** What would that look like? <details> <summary> Think, then click! </summary> ```python student_database_list = [ ["04", "Melyssa"], ["15", "Dior"], ["22", "Nicole"], ["20", "Tim"] ] ``` </details> Now let's say you want to access the student with the student ID 20 from the nested list. How long do you expect that to take? <details> <summary> Think, then click! </summary> To access the student with the student ID 20 in the nested list, you need to iterate through each sublist and check if the first element (the student ID) matches 20. If that is true, you can access the second element from the sublist to get the student's name. As you can see, this process takes some time. It might not be that bad in this case because we only have four sublists to loop through, but let's say you had millions and millions of sublists inside your big list. In that scenario, the time it takes to find the element you're looking for would be much longer. </details> **Discussion**: How can we modify the way we are using our list so that it becomes faster to search for a student based on their *ID NUMBER*? <details> <summary> Think, then Click </summary> Instead of using a nested list, we could use a single list and have the student's ID as the index and store the name at that index </details> **Exercise:** In your groups, use this [jamboard](https://jamboard.google.com/d/1zhBi_1dgk3MuTCIVFy4YwDwlrdDQUpkygJ-NmA8Vxh8/edit?usp=sharing) to draw how the list would look like. **Think about the following questions:** - what size do we want our list to be? - How would you use the student ID as the list index? <details> <summary> Think, then Click </summary> <img src="https://i.imgur.com/XT4kNPS.jpg" style="max-width:50%; max-height:50%;"> </details> ### What If Lets say we want to add 3 new students, Laura, Muhiim, Kamryn, to our student database but we dont know their student IDs. However, we do know that there are a 100 student IDs they could possibly have. To accommodate for this range of values we need to make the list size 100, so that if the student ID is 99 we have a spot for it in the list. ```python = student_database_list = ['']*100 ``` ### Exercise Lets say we discovered their student IDs. | ID | Student | | --- | ------- | | 38 | Laura | | 67 | Muhiim | | 91 | Kamryn | Now, let's add them to our student_database_list. In your previous groups, draw how the list would look now with the three new students. [Click here for the Jamboard](https://jamboard.google.com/d/1zhBi_1dgk3MuTCIVFy4YwDwlrdDQUpkygJ-NmA8Vxh8/edit?usp=sharing) <details> <summary> Think, then Click </summary> Here is how the lists would look for our student database: <img src="https://i.imgur.com/fZH5y05.jpg" width="50%" height="50%"> As you can see, we only have 7 elements, but we have created 100 slots in our list, 93 of which are empty. </details> **Discussion:** What are some problems you discovered while making your lists? <details> <Summary> Think, then Click </Summary> There are alot of gaps. </details> ### Summary We tried to use lists in 2 ways, and each had a different problem: #### Problem 1: Worst-case Linear Runtime for Searching When we used a nested list to store our data, we discovered that we were space efficient, but the process of searching for elements from the nested list was time-consuming. How time consuming? Well, in the worst case, the element we're looking for isn't in the list at all. Then, when we loop through the list, we would have to loop through everything else that's there. The work that we have to do for the search is proportional to the size of the list! You might hear this referred to as "linear" runtime in the worst case. #### Problem 2: Gaps When we used a single list, we were able to quickly access the elements, but we weren't space efficient. Due to the way lists are designed, they would waste storage by allocating space for the missing elements. While this may not be a significant issue on a smaller scale, it becomes crucial for larger platforms like YouTube, which have more potential video IDs. Let's take a look at a YouTube video code. How long is it, and what are the potential characters? It turns out the YouTube video code is 11 characters long, and it can consist of 62 potential characters, including upper-case letters (A-Z), lower-case letters (a-z), and digits (0-9). This results in a staggering 62^11 ≈ 7.94 x 10^18 potential YouTube video codes! That's more space than I have on my laptop, even if I didn't store the videos. This problem can lead to significant costs and reduced speed. There must be a better way to do this. **Discussion:** What other data structure could we use to immediately look up the values we are looking for while keeping the space compact? ## Hashtable A hashtable is a data structure, where values are stored using keys instead of indices. In our student database, the keys represent the student IDs, while the corresponding values are the student names. Let's use a hashtable to store this data. We'll start by initializing an empty hashtable: ```python= studentHashtable = {} ``` Next, we can populate our studentHashtable by assigning the student IDs as keys and their respective names as values. ```python= studentHashtable[4] = 'Melyssa' studentHashtable[15] = 'Dior' studentHashtable[22] = 'Nicole' studentHashtable[20] = 'Tim' studentHashtable[38] = 'Laura' studentHashtable[67] = 'Muhiim' studentHashtable[91] = 'Kamryn' ``` Now that we have filled our hashtable, let's explore how to access specific keys. **Question:** What does studentHashtable[20] give us? <details> <summary> Think, then click! </summary> Tim </details> Now let's try to understand the structure of hashtables and how our data is stored. You can think of hashtables as special lists. Let's suppose that we know we will only ever have 7 students. So we'll simplify things by creating a list with only 7 slots. **Discussion:** How do we stuff these 7 elements in this 7-slot list in a way that I still know how to directly go from student ID to its list position? <img src="https://i.imgur.com/inhb95f.jpg" width="100%" height="100%"> <details> <summary> Think, then click! </summary> <img src="https://i.imgur.com/FrPRhAl.jpg" width="100%" height="100%"> We want a function that changes the IDs into numbers between 0 and 6. It's fine not to know the answer immediately. Just understanding that we need a function that maps one thing to another helps us figure out the next steps to solve the problem. </details> **Discussion:** So how might this function look like? Think about what arithmetic function that, when applied to numbers, give you numbers ranging from 0 to 6. <details> <summary> Think, then click! </summary> Since we have only 7 slots in the list, we need to make sure we never try to insert a record at an index outside of {0, 1, 2, 3, 4, 5, 6}. There are a few ways to do this, but one is to take the record's key, divide by 7, and use the remainder. </details> ### Computing the Remainder The modulus operator, denoted by the symbol % (in Python), calculates the remainder of the division of one number (the dividend) by another number (the divisor). In other words, it returns the integer remainder of the division operation For Example, - 10 % 7 = 3: Dividing 10 by 7 gives a quotient of 1 and a remainder of 3. - 15 % 7 = 1: Dividing 15 by 7 gives a quotient of 2 and a remainder of 1. - 50 % 7 = 1: Dividing 50 by 7 gives a quotient of 7 and a remainder of 1. This is one way to write what is called **hash functions**, to convert keys to lists indexes. Here is the hash funciton: ```python= int index = key % size ``` The size refers to the list size. The use of % help can make our lists smarter. If we had a list that had a large amount of empty space, looking for data would be quick but at the cost of space. This hash function is good for our data since it does the job of converting keys that come from a large space of potential values to other keys from a smaller set of potential values. For example, from {0, 1, 2, ..., 99} to {0, 1, 2, ..., 6}, which helps with space efficiency. Even with this improvement, the data isn't safe from inefficiency. The usage of space is better but at what cost, there has to be a tradeoff to this? Keep this in mind. We will come back to it! ## Hashing We can use hashing to convert a key to a list index using a hash function. When you want to put something into the hashtable, you provide a key (e.g., a student ID) to the hash function. The hash function performs some calculations on the key and returns an index value. This index value determines the position or slot where the corresponding value will be stored in the hashtable. <img src="https://i.imgur.com/5CzveCB.png" width="100%" height="100%"> Now let's try to apply the hash function to our keys to find their position in the hashtable. ### Example We can try to find where Key 04 will go in the hashtable. So, 4 modulo 7 is 4. This means Melyssa will be stored at index 4. | Keys | Names | Index | | ---- | ------- | ----- | | 04 | Melyssa | 4 | | 15 | Dior | | | 22 | Nicole | | | 20 | Tim | | | 38 | Laura | | | 67 | Muhiim | | | 91 | Kamryn | | ### Let's Practice! Try to find where the rest of the keys will go in the hashtable using the hash function. <details> <summary> Think, then click! </summary> | Keys | Names | Index | | ---- | ------- | ----- | | 04 | Melyssa | 4 | | 15 | Dior | 1 | | 22 | Nicole | 1 | | 20 | Tim | 6 | | 38 | Laura | 3 | | 67 | Muhiim | 4 | | 91 | Kamryn | 0 | </details> **Disussion:** What kind of problems did you notice? Note, we shrunk the list size to save space. But there's this consequence. <details> <summary> Think, then click! </summary> Different keys gave us the same index. This is called Collision. What happens if there is a collision? How bad is it? </details> The broad lesson is: tradeoffs. What we did wasn't free. But that doesn't mean it's not worthwhile. CS is often about being creative around these tradeoffs. Maybe we can deal with collisions somehow and fix the new problem...