Hashing II - HackMD

## Ice Breaker **Rose, Bud , Thorn**: - Roses are the positive things and win that you experienced. - Buds are the new ideas that you're excited to explore. - Thorns are the obstacles and problems that you faced. #### Questions from Yesterday Last time, we discovered collision. **Question:** When does collision happen? <details> <summary> Think, then click! </summary> In our last class, we discovered that multiple keys are being hashed to the same index when we apply this hash function: `key % size`. | Keys | Names | Index | | ---- | ------- | ----- | | 04 | Melyssa | 4 | | 15 | Dior | 1 | | 22 | Nicole | 1 | | 20 | Tim | 6 | | 38 | Laura | 3 | | 67 | Muhiim | 4 | | 91 | Kamryn | 0 | </details> ## Objectives By the end of the lesson, you will learn: - how to implement a simple hashtable - how to prevent collision - some properties of a good hash function ## Country Code Data Today, we will be mostly livecoding, and we'll be working with the [ISO 3166-1 alpha-2 dataset](https://drive.google.com/file/d/1raXtufAQNPbc6b7XcQOuLIs2lvTH4BuB/view?usp=drive_link). ISO 3166-1 alpha-2 is a standardized coding system that represents country names with two-letter codes. It helps to uniquely identify countries. To get started, download it from this [link](https://drive.google.com/file/d/1raXtufAQNPbc6b7XcQOuLIs2lvTH4BuB/view?usp=drive_link). Using the Google Colab [link](https://colab.research.google.com/drive/10ciIX1gTkmAlE5z82uzV2ZaaFmxYP3Cn?usp=sharing), make a copy and follow along. You will see that the data is already preprocessed for you! #### How is the data preprocessed? - You don't have to understand the code for this part, but just know that we have converted our large text into a nested list containing the country code and the country name. # Inventing Hashtables! We'll implement a simple hashtable to understand its data structure. Let's get started! ### Step1: Initializing the Hashtable **Discussion:** What should we do to initialize our hashtable? What data structure should we use? <details> <summary> Think, then click! </summary> Recall that we learned that hashtables can be thought of as lists. So let's first create a list with a certain number of empty slots. </details> **Discussion:** What size do we want our hashtable to be? <details> <summary> Think, then click! </summary> There are 3 cases to consider when creating a data structure: 1. when you create the data structure without knowing the actual data or queries it will handle. 2. when you know the data that will be inserted into the data structure but don't know the queries. 3. When you know both the query and the data. This case is not impossible but it is rare. These 3 cases can overlap. For example, if you have data constantly coming in while also receiving queries. But for our purposes, let's pretend they are distinct. In general, you won't know how many keys you might need to store until you know the data. In our case, we know what data we want to store in our hashtable (country codes, and country names.) As if right now, there are 249 country codes in the ISO 3166-1 standard. So we want to first initialize our hashtable with 249 empty slots. </details> Here is the code for doing that: ```python= hash_table = ['None'] * 249 print (hash_table) ``` ### Step 2: Populating the Hashtable Let's now consider how we should fill our hashtable with the country data. **Discussion:** What should we do to populate our hashtable? Think about the hash function we learned. Think about what we want our keys and values to be. <details> <summary> Think, then click! </summary> 1. Loop through the country_list and extract the country codes and country names. We want the country codes to be the keys, and the country names to be the corresponding values. 2. Apply the simple hash function we learned last time to convert the keys into array positions and store the value at that index.Since the keys are in a string format, we need to convert them into integers. </details> Here is the hash function we learned last class: ```python= def hash_function(key, size): # Calculate the hash value using the modulo operation index = key % size # Return the hash value return index ``` Because our keys are strings this time, we can't just do `key % size`. We first need to convert the keys, which are two letters, into integers. How should we do this? In python, there is a built-in function called ord() that converts a single character (like a letter or symbol) into its corresponding number representation according to the ASCII standard. Below is the ASCII table. You can see the numbers associated with each letter. Notice that capital "A" and lowercase "a" have different numeric values in the table. | Uppercase Letter | Decimal Value | | Lowercase Letter | Decimal Value | |------------------|---------------|----|------------------|---------------| | A | 65 | | a | 97 | | B | 66 | | b | 98 | | C | 67 | | c | 99 | | D | 68 | | d | 100 | | E | 69 | | e | 101 | | F | 70 | | f | 102 | | G | 71 | | g | 103 | | H | 72 | | h | 104 | | I | 73 | | i | 105 | | J | 74 | | j | 106 | | K | 75 | | k | 107 | | L | 76 | | l | 108 | | M | 77 | | m | 109 | | N | 78 | | n | 110 | | O | 79 | | o | 111 | | P | 80 | | p | 112 | | Q | 81 | | q | 113 | | R | 82 | | r | 114 | | S | 83 | | s | 115 | | T | 84 | | t | 116 | | U | 85 | | u | 117 | | V | 86 | | v | 118 | | W | 87 | | w | 119 | | X | 88 | | x | 120 | | Y | 89 | | y | 121 | | Z | 90 | | z | 122 | **Discussion:** With this information, how can we convert keys (strings) into integers? <details> <summary> Think, then click! </summary> - Initialize a variable `int_key` with the value 0, which will be used to store our key as an integer - Iterate through each character in the key - for each character in the key, use ord () to find its ASCII value and add that to the int_key. </details> Here is the hash function: ```python= def hash_function1(key, size): # Initialize an integer representing the key int_key = 0 # For each character in the key, add its ASCII value to the integer key for char in key: int_key += ord(char) # Calculate the hash value using the modulo operation hash_value = int_key % size # Return the hash value return hash_value ``` With this hash function, we can populate our data structure now ```python= # Try 1: Using the simple hash function we learned last class for country_code, country_name in country_list: index= hash_function1(country_code, len(hash_table)) hash_table[index] = country_name print(hash_table) ``` The below is the output for running and printing out our hashtable: ```python= ['None', 'None', 'None', 'None', 'None', 'None', 'None', 'None', 'None', 'None', 'None', 'None', 'None', 'None', 'None', 'None', 'None', 'None', 'None', 'None', 'None', 'None', 'None', 'None', 'None', 'None', 'None', 'None', 'None', 'None', 'None', 'None', 'None', 'None', 'None', 'None', 'None', 'None', 'None', 'None', 'None', 'None', 'None', 'None', 'None', 'None', 'None', 'None', 'None', 'None', 'None', 'None', 'None', 'None', 'None', 'None', 'None', 'None', 'None', 'None', 'None', 'None', 'None', 'None', 'None', 'None', 'None', 'None', 'None', 'None', 'None', 'None', 'None', 'None', 'None', 'None', 'None', 'None', 'None', 'None', 'None', 'None', 'None', 'None', 'None', 'None', 'None', 'None', 'None', 'None', 'None', 'None', 'None', 'None', 'None', 'None', 'None', 'None', 'None', 'None', 'None', 'None', 'None', 'None', 'None', 'None', 'None', 'None', 'None', 'None', 'None', 'None', 'None', 'None', 'None', 'None', 'None', 'None', 'None', 'None', 'None', 'None', 'None', 'None', 'None', 'None', 'None', 'None', 'None', 'None', 'None', 'BosniaandHerzegovina', 'Canada', 'Andorra', 'Cocos(Keeling)Islands', 'Congo(DemocraticRepublicofthe)', 'Gabon', 'UnitedKingdomofGreatBritainandNorthernIreland', 'Estonia', 'Grenada', 'Georgia', "LaoPeople'sDemocraticRepublic", 'Morocco', 'Namibia', 'Monaco', 'Panama', 'Qatar', 'Niger', 'SaudiArabia', 'SolomonIslands', 'Ukraine', 'HolySee', 'Chad', 'SaintVincentandtheGrenadines', 'FrenchSouthernTerritories', 'SouthAfrica', 'Uganda', 'WallisandFutuna', 'Yemen', 'VirginIslands(U.S.)', 'Timor-Leste', 'Turkmenistan', 'UnitedStatesMinorOutlyingIslands', 'Tonga', 'VietNam', 'Suriname', 'Turkey', 'Zambia', 'UnitedStatesofAmerica', 'ElSalvador', 'Samoa', 'Vanuatu', 'SyrianArabRepublic', 'Mayotte', 'Uruguay', 'Uzbekistan', 'None', 'Zimbabwe', 'None', 'None', 'None', 'None', 'None', 'None', 'None', 'None', 'None', 'None', 'None', 'None', 'None', 'None', 'None', 'None', 'None', 'None', 'None', 'None', 'None', 'None', 'None', 'None', 'None', 'None', 'None', 'None', 'None', 'None', 'None', 'None', 'None', 'None', 'None', 'None', 'None', 'None', 'None', 'None', 'None', 'None', 'None', 'None', 'None', 'None', 'None', 'None', 'None', 'None', 'None', 'None', 'None', 'None', 'None', 'None', 'None', 'None', 'None', 'None', 'None', 'None', 'None', 'None', 'None', 'None', 'None', 'None', 'None', 'None', 'None'] ``` As you can see we have multiple empty slots, indicating that not every country was assigned to a unique slot. This happened because some countries were hashed to the same hash value, leading to collisions. ```python= empty_slots = [] for country in hash_table: if country == 'None': empty_slots.append(country) print(len(empty_slots)) ``` <details> <summary> Code Explanation </summary> 1. We first initialize the empty_slot list to hold the data from the hashtable 2. Then we loop through the data in the hashtable and if there is a slot that is empty we add that to the empty slots list </details> We can use the above code to count the number of empty slots in our hashtable. There are `203!` The problem here is NOT "wasted space" but that data has been lost. ### Step 3: Preventing collision **Discussion:** Any ideas on how we can prevent collision? **Guiding Question:** - Can we manipulate the data or add more slots without expanding the number of indexes? <details> <summary> Think, then click! </summary> There are multiple ways to prevent collisions. Today, we are going to focus on one called chaining, which is useful for understanding what is going on. </details> #### Chaining Chaining allows us to store multiple items in the same slot of the list. To implement chaining, we create an empty list in each slot of our hashtable. So now, the structure of our hashtable becomes a list of lists. Instead of overwriting the previous item when a collision occurs, we simply append the new item to the list at that slot, forming a chain of elements. Using chaining to solve our collision problem: ```python= # Try 2: can we still use lists, but in a more subtle fashion, to avoid # the collision problem? hash_table_buckets = [] num_buckets= 200 for idx in range(num_buckets): # 0 to 199 (inclusive) hash_table_buckets.append([]) print(hash_table_buckets) ``` <details> <summary> Code Explanation </summary> 1. We first initialize the hash_table_buckets list to hold the data from the hashtable. These buckets allow for us to put multiple pieces of data into one index but have differnt values at that index. We will make 200 of them. 2. Then we loop through the data in the hashtable all the way to 199 and add an empty list to hold the data at these indexes. </details> Now, the structure of our hashtable is a list of lists of strings. **Let's try it:** To populate our data structure, how should we modify the code below? ```python= # Try 2: can we still use lists, but in a more subtle fashion, to avoid # the collision problem for country_code, country_name in country_list: index= hash_function1(country_code, len(hash_table)) hash_table[index] = country_name ``` Recall that our `hash_table_buckets` is a list of lists, and `hash_table_buckets[index]` gives us a list. <details> <summary> Try it, then Click </summary> ```python= for country_code, country_name in country_list: index = hash_function1(country_code, num_buckets) hash_table_buckets[index].append(country_name) ``` </details> This is the output of the hashbuckets distrbution of the data after using hash function 1: ```python= [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 2, 1, 3, 3, 4, 4, 4, 3, 4, 5, 8, 8, 7, 7, 5, 9, 7, 12, 7, 11, 12, 9, 9, 12, 11, 10, 9, 5, 6, 5, 7, 4, 4, 5, 3, 5, 3, 3, 2, 3, 1, 2, 2, 1, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0] ``` **Discussion:** In what ways when we look at the table does it look bad? <details> <summary> Think, then click! </summary> When we look at the table, we see that there are a lot of zeroes, which means that there are many lists with zero elements. We also have lists with eleven and twelve elements. This indicates an uneven distribution. </details> **Discussion:** Why is the country code dataset producing such an unevenly distribution when mapped to this hash function? **Here are a few things to think about:** - Think about how hash function 1 works. - When you look at our [data](https://drive.google.com/file/d/1raXtufAQNPbc6b7XcQOuLIs2lvTH4BuB/view?usp=sharing), what do you notice about the keys? <details> <summary> Think, then click! </summary> To understand why this distribution is bad, let's recall how `hash_function1` works. It converts the key into an integer by finding the numeric representation of each letter in the key using the ASCII table. Then, it takes the modulo of their sum with the given size to calculate the hash value. Now, if you look at the [data](https://drive.google.com/file/d/1raXtufAQNPbc6b7XcQOuLIs2lvTH4BuB/view?usp=sharing), you will notice that: - Some countries, such as Armenia (AM) and Morocco (MA), hash to the same value because addition is commutative. - There are a lot of A's and E's as the first letters of the keys. Some letters are more common than others, and these more common letters tend to contribute the same amount to the hash function. - It is possible that different keys result in the same numeric representation. For instance, there are different ways you can achieve the number 7 by adding two different numbers. I expect 3 + 4, 1 + 6, and 0 + 7 to all give me 7. In this data, the keys are not evenly distributed. So, we have this biased data, and we're feeding it into a hash function that doesn't eliminate that bias. </details> **Discussion:** What makes a good hash function? <details> <summary> Think, then click! </summary> For our purposes here today, the primary goal of a hash function is to take a larger domain of input values and map them into a smaller domain known as buckets or slots. What makes a good hash function? A good hash function should yield an even distribution of values across all buckets. </details> **Discussion:** We have discovered,`hash_function1` does not work well for this data. Now, the question becomes, which other hash function can we use? Instead of using the symbol hash function we can use python's hash function: ```python= #Try 3: using python's hash () def hash_function2(key, size): return hash(key) % size ``` **Discussion:** why use `hash(key) % len(size)` rather than just `key % len(size)`? <details> <summary> Think, then click! </summary> 1. We want to avoid collisions whenever we can, and hash() will usually give us a more even distribution than plain `%` (for reasons you'll learn if you take more CS); 2. Not all keys are numbers. so we can't immediately apply `%`. Python's hash() converts other types, like strings, etc. to numbers that we can then apply `%` to. </details> Let's try to implement our hashtable using python's hash function ```python= #Try 3: Using python's hash function def hash_function2(key, size): # using python's hash () return hash(key) % size hash_table_buckets = [] num_buckets= 200 for idx in range(num_buckets): # 0 through 199, inclusive hash_table_buckets.append([]) # print(hash_table_buckets) for country_code, country_name in country_list: hash_key = hash_function2(country_code, num_buckets) hash_table_buckets[hash_key].append(country_name) print([len(hash_table_buckets[hash_key]) for hash_key in range(num_buckets)]) ``` The output of the hashbuckets distrbution of the data. ```python= [1, 2, 3, 0, 1, 0, 1, 2, 1, 2, 0, 0, 1, 1, 1, 1, 0, 1, 2, 2, 1, 3, 1, 0, 2, 0, 1, 1, 3, 3, 1, 2, 0, 1, 1, 1, 2, 2, 2, 2, 3, 3, 2, 1, 0, 0, 1, 2, 4, 1, 1, 2, 1, 0, 2, 2, 0, 1, 1, 1, 0, 2, 2, 0, 0, 2, 0, 0, 1, 4, 1, 1, 2, 1, 0, 3, 2, 0, 1, 0, 0, 0, 0, 2, 0, 1, 0, 1, 1, 3, 4, 4, 1, 1, 0, 2, 1, 3, 0, 1, 0, 0, 1, 1, 1, 1, 2, 2, 2, 0, 0, 0, 4, 1, 1, 1, 1, 0, 3, 2, 1, 0, 4, 3, 2, 4, 0, 2, 3, 1, 1, 0, 1, 0, 1, 1, 1, 1, 2, 0, 2, 1, 0, 0, 0, 1, 1, 2, 0, 2, 1, 1, 0, 3, 1, 0, 0, 1, 0, 0, 1, 0, 1, 4, 3, 0, 3, 3, 0, 0, 1, 2, 2, 2, 2, 2, 1, 1, 4, 0, 1, 1, 0, 0, 0, 1, 1, 2, 0, 5, 1, 3, 2, 0, 2, 1, 0, 0, 3, 1] ``` As you can see, this is a much better distribution. Each list holds numbers ranging from 0 to 5 (with only one list holding 5 elements), as opposed to 1-12 with hash function 1. **Question:** Why do we care about having an even distribution across all the buckets? Let's say you need to retrieve a specific value from the hash table. You use its key to find its corresponding bucket. However, with a bad distribution, you might end up with a large number of elements stored in a single bucket. As a result, you would need to iterate through this entire bucket to find the value you are looking for. Note that, AT SCALE, the problem of bad distribution is magnified -- 20 elements in the bucket isn't so bad, but suppose it's 10000 every time you ask for a video. **What did we learn?** It wasn't just the hash table technique. It was the power of working around obstacles creatively in CS. Design in CS, whether algorithmic (like this example) or otherwise is a CREATIVE task as much as a technical one. That is it! Now you implemented a hashtable!