Maps & Hashing

# Maps & Hashing Algorithms Course by Grow with Google ###### tags: Data Structures & Algorithms Practice `{%hackmd theme-dark %}` ## Sets * Comparable to a List - it's vaguely a collection of things * A list has some kind of ordering for it's elements * A *Set* doesn't have that ordering but instead doesn't allow for repeated elements * You can think of a set like a bag. You can reach in a nd pull something out, but you'll never know what order you're getting the elements out in ## Maps A *Map* is a set-base data structure * Map = <key, value> * A group of *keys* is a *set*. The keys of a map, like the words in a dictionary, need to be unique: word = Definition 1 | Definition 2 | Definition 3 You can have several definitions for the same word stored in the value, but if you had the same word in the dictionary many times, you would randomly pick out a definition when you were first looking. Thus, each key only exists once in a map ## Python Dictionaries In Python, the map concept appears as a built-in data type called a dictionary. A dictionary contains key-value pairs. Dictionaries—are extremely easy to use and useful. Here's a sample of setting up a dictionary ```python udacity = {} udacity['u'] = 1 udacity['d'] = 2 udacity['a'] = 3 udacity['c'] = 4 udacity['i'] = 5 udacity['t'] = 6 udacity['y'] = 7 print udacity # {'u': 1, 'd': 2, 'a': 3, 'c': 4, 'i': 5, 't': 6, 'y': 7 ``` In this case, the letters in "udacity" were each keys in our dictionary, and the position of that letter in the string was the value. Thus, I can do the following: ```python print udacity['t'] # 6 ``` This statement is saying "go to the key labeled 't' and find it's value, 6". Dictionaries are wonderfully flexible—you can store a wide variety of structures as values. You store another dictionary or a list: ```python dictionary = {} dictionary['d'] = [1] dictionary['i'] = [2] dictionary['c'] = [3] dictionary['t'] = [4] dictionary['i'].append(5) dictionary['o'] = [6] dictionary['n'] = [7] dictionary['a'] = [8] dictionary['r'] = [9] dictionary['y'] = [10] print udacity # {'d': [1], 'i': [2, 5], 'c': [3], 't': [4], 'o': [6], 'n': [7], 'a': [8], 'r': [9], 'y':[10]} ``` You can learn even more about dictionaries in the [Python documentation](https://docs.python.org/2/tutorial/datastructures.html#dictionaries). ### Dictionary Practice Time to play with Python dictionaries! You're going to work on a dictionary that stores cities by country and continent. One is done for you - the city of Mountain View is in the USA, which is in North America. You need to add the cities listed below by modifying the structure. Then, you should print out the values specified by looking them up in the structure. Cities to add: * Bangalore (India, Asia) * Atlanta (USA, North America) * Cairo (Egypt, Africa) * Shanghai (China, Asia) Print the following (using "print"): 1. A list of all cities in the USA in alphabetic order. 2. All cities in Asia, in alphabetic order, next to the name of the country. In your output, label each answer with a number so it looks like this: 1 \ American City \ American City \ 2 \ Asian City - Country \ Asian City - Country ```python locations = {'North America': {'USA': ['Mountain View']}} locations['North America']['USA'].append('Atlanta') locations['Asia'] = {'India': ['Bangalore']} locations['Asia']['China'] = ['Shanghai'] locations['Africa'] = {'Egypt': ['Cairo']} print 1 usa_sorted = sorted(locations['North America']['USA']) for city in usa_sorted: print city print 2 asia_cities = [] for countries, cities in locations['Asia'].iteritems(): city_country = cities[0] + " - " + countries asia_cities.append(city_country) asia_sorted = sorted(asia_cities) for city in asia_sorted: print city ``` ## Hashing The purpose of a hash function is to transform some value into one that can be stored and retrieved easily You give it some value, it converts the value based on some formula, and spits out a coded version of the value which is often an index to an array *Hashes* are good for: * Modeling relationships from one thing to another thing * Filtering out duplicates * Catching/memorizing data instead of making your server do the work For example, Say you are hosing a concert, a ticket is required to enter. Attendees purchase tickets that each contain a unique barcode. To keep track of which tickets have been scanned you can use a hash. When numbers are big and random we need some way to convert those numbers into array indices quickly *Common pattern:* Is to take the last few digits of a big number, divide it by some consistent number, and using the remainder from that division to find a place to store that number in an array ex.) 0123456 -> 56/10 = 5 (remainder=6) In this case, the remainder turned directly into the index for the array In most cases, the last few digits are the most random. ### Collisions A flaw in the elegant system, There are times when a hash functions will spit out the same hash value for two different inputs Two ways to fix collisons: 1. Change the value in your hash function, or to change the hash function completely so you have enough slots to store all your potential values 2. Keep the original hash function but change the structure of your array. Instead of storing one hash value in each slot, you can store some type of lists(*Buckets*) that contains all values hashed at that spot. Rather than storing one value in each slot, you can store multiple values or a collection in each bucket For the first approach, you can maintain constant time look up but by using a bigger number in your hash function, going to require a lot more space to store values ## Load Factor When we're talking about hash tables, we can define a *"load factor"*: ```python Load Factor = Number of Entries / Number of Buckets ``` The purpose of a load factor is to give us a sense of how "full" a hash table is. For example, if we're trying to store 10 values in a hash table with 1000 buckets, the load factor would be 0.01, and the majority of buckets in the table will be empty. We end up wasting memory by having so many empty buckets, so we may want to rehash, or come up with a new hash function with less buckets. We can use our load factor as an indicator for when to rehash—as the load factor approaches 0, the more empty, or sparse, our hash table is. On the flip side, the closer our load factor is to 1 (meaning the number of values equals the number of buckets), the better it would be for us to rehash and add more buckets. Any table with a load value greater than 1 is guaranteed to have collisions. ### Load factor Practice One of your coworkers comes to you with a hash function that divides a group of values by 100, and uses the remainder as a key. The values are 100 numbers that are all multiples of 5. What is the load factor? you should divide the number of values by the number of buckets. There are 100 values, as stated in the question and 100 buckets(0 to 99) thus 100/100 = 1 He thinks its a little slow - what number would you recommend his function to divide by rather than 100 to speed it up? (87, 125, 107, or 1001) The answer to the second part is 107. The other values all had something wrong with them: - 125 is also a multiple of 5. Dividing a bunch of multiples of 5 by another multiple of 5 will cause a lot of collisions. Here's an example, where 10 is used as the divisor: - 5 % 10 = 5 - 10 % 10 = 0 - 15 % 10 = 5 - 20 % 10 = 0 - 87 is better than 125, but because it's less than 100 it'll still have collisions. - 1001 is good, but it'll create a ton of leftover buckets and waste a lot of memory. ## Hash Maps * A python dictionary is a hash map You can use the keys as inputs to a hash function, then store the key value pair in the bucket of the hash value produced by the function. Since the key in a map are unique since they belong to a set you could use a hash function to give them each their own unique buckets. You can also design a hash function to allow for collisions. ## String keys A function that converts letters into numbers. Individual letters can be converted into ASCII values and many languagues have functions built in to do that. We can combine the ASCII values with a formula to get a unique hash for each letter ### String Keys Practice In this quiz, you'll write your own hash table and hash function that uses string keys. Your table will store strings in buckets by their first two letters, according to the formula below: ```python Hash Value = (ASCII Value of First Letter * 100) + ASCII Value of Second Letter ``` You can assume that the string will have at least two letters, and the first two characters are uppercase letters (ASCII values from 65 to 90). You can use the Python function `ord()` to get the ASCII value of a letter, and `chr()` to get the letter associated with an ASCII value. You'll create a HashTable class, methods to store and lookup values, and a helper function to calculate a hash value given a string. You cannot use a Python dictionary—only lists! And remember to store lists at each bucket, and not just the string itself. For example, you can store "UDACITY" at index 8568 as `["UDACITY"]`.