# Hashing
## Introduction
In this section, we will go through one of the most important and efficient techniques of data storage and retrieval, which is also commonly asked about in interviews: "Hashing". By the end of this article, you should be able to understand the topics below:
- Need for hashing
- What is Hashing?
- How does hashing work?
- Key terms used in hashing
- Hash Functions
- Hash Table
- Collisions
- Hashing Techniques
- Separate Chaining
- Open Addressing
- Applications of Hashing
- FAQs
## What is the need for Hashing?
- Consider a scenario where you want to design a system for storing records of employees of your organization keyed using their employee IDs. We need this system primarily to do the following 3 tasks:
- Insert an employee ID along with the corresponding employee data.
- Search based on an employee ID and fetch that employee's details.
- Delete an employee's data based on the employee ID entered.

- What are the data structures that we can think of for performing the above functionalities efficiently? Well, let's analyze implementing this system using known data structures:
- **Arrays / Linked Lists** of employee ID and data:
- If we implement the search functionality in a linear fashion, it can be costly when we have a huge number of records to search through.
- If we make sure that the array is maintained in sorted order based on the employee ID, then using Binary Search we can find an employee ID in `O(log n)` time. But this has an impact on the insert and delete functionalities, because we need to maintain the sorted order of the array upon every insertion and deletion of records.

- Hence this data structure is not feasible to use.
- **Balanced binary search tree (BBST)** taking employee ID as keys:
- A BBST is a binary tree in which the left and right subtrees of every node **always** differ in height by **no more than 1**.
- Here, due to the nature of the BBST data structure, we can guarantee that the search, insert and delete functionalities take place in `O(log n)` time.

- But, is there any more efficient way? Let's look for other options.
- **Direct Access Table (DAT)**:
- This data structure uses the `O(1)` retrieval time complexity of arrays to its advantage. Let's see how below.
- In DAT, we make a very large array and use employee ID as index for the array.
- We treat a value in array as `null` if employee ID is not present, else the value stores pointer to data mapping to the employee ID.

- Here, we can do all the operations - Search, Insert and Delete - in `O(1)` time, which is the best we can achieve.
- But, this has many limitations.
- The obvious major problem is we need to have a large space dedicated for this data structure. The space complexity would be huge.
- Another problem is that if the employee ID is purely an integer of n digits, then beyond a certain number of digits (as the number of records grows past a particular limit), a programming language can no longer store the value in a native integer type.
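To make the DAT idea concrete, here is a minimal sketch in Java. The `MAX_ID` bound and the `String` record type are illustrative assumptions, not part of any real system:

```
// Direct Access Table sketch: the employee ID indexes directly into an
// array, so insert, search and delete are all O(1). The trade-off is that
// the array must be as large as the largest possible ID.
class DirectAccessTable {
    private static final int MAX_ID = 10_000; // assumed upper bound on IDs
    private final String[] records = new String[MAX_ID];

    void insert(int employeeId, String data) {
        records[employeeId] = data;
    }

    String search(int employeeId) {
        return records[employeeId]; // null means "no such employee"
    }

    void delete(int employeeId) {
        records[employeeId] = null;
    }
}
```

Note that the array of 10,000 slots is allocated even if only a handful of employees exist, which is exactly the space problem described above.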
### What is Hashing? How does it work?
Hashing is the process of converting a given key into another, typically smaller, value that enables O(1) retrieval time.
- This is done with the help of a function or algorithm called a **hash function**, which maps data to a simplified representative value termed a "hash code" or simply a "hash". This hash is then used as an index to narrow down the search criteria and fetch data quickly.

- Hashing can be considered a significant improvement over DAT that reduces the space complexity.
- In the employee-system example from the Introduction, we can simply pass the employee ID to our hash function, get the hash code and use it as the key to query over the records.
- Let us see one more example to understand how hashing works.
- #### Note to Editor:
- Here you can use the existing IB article explanation for "How Hashing Works?" from https://www.interviewbit.com/tutorial/introduction-to-hashing/ directly.
- Copy from "*Purely as an example to help us grasp the concept, let us suppose that we want to map a list of string keys to string values*" ........... **to** ..... "*But on the other hand, we’d be left with thousands of, say, 6-letter words all competing for the same slot in the array.*" (including the images there)
## Key terms in Hashing
Before going into Hashing techniques, let us understand certain key terms that are used in hashing.
### **Hash Table:**
A hash table is an array that stores pointers to data mapping to a given hashed key.
- A hash table uses a hash function to compute the index of the array at which a record will be inserted or searched. Basically, there are 2 main components of a hash table: a hash function and an array.
- A value in hash table is `null` if no existing key has hash value equal to the index for the entry.
- The average time complexity needed to search for a record in hash table is `O(1)` under reasonable assumptions.
- The maximum size of the array is chosen according to the amount of data expected to be hashed.

- **Note to Editor:**
- Here you can link this to this existing article or insert the details from existing article here https://www.interviewbit.com/tutorial/hash-tables/
- **Copy from** "*On the previous slide, we introduced the notion of hashing, mapping a piece of data such as a string to some kind of a representative integer value. We can then create a map by using this hash as an index into an array of key/value pairs.*" ......**to**...... "*The structure we have just illustrated is essentially the one used by Java’s hash maps and hash sets. However, we generally wouldn’t want to use the string length as the hash code.*" (including the images there)
- Let us see how we can convert an object into a hash code more effectively by using hash functions in the next section rather than just using string length as the hashing algorithm.
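As a quick illustration of the string-length idea mentioned above, here is a deliberately naive hash function, shown only to demonstrate why it breaks down:

```
// Naive hash function: use the string's length, modulo the table size.
// All keys of the same length collide into the same slot.
class LengthHashDemo {
    static int lengthHash(String key, int tableSize) {
        return key.length() % tableSize;
    }
}
```

For example, `lengthHash("orange", 16)` and `lengthHash("banana", 16)` both return 6, so the two keys would compete for the same slot.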
### **Hash Functions:**
A hash function is a function or algorithm that generates a shortened, representative value for any given key. The result of a hash function is termed a hash value, or simply a hash.
**Properties of good hash function:**
- Should be a **one-way** algorithm. In simple words, it should not be possible to convert the generated hash value back into the original key.

- Should be **efficiently computable**. In real applications, if the hash function itself takes a long time to compute the hash value, then we lose the purpose of hashing.
- Should **uniformly distribute** the keys among the available slots.

**Types of Hash Functions:**
- **Index Mapping Method:** The most trivial form of hashing is called "Index Mapping" (or "Trivial Hashing"). Here:
- We consider an array in which each position corresponds to a key in the given list of keys.
- This technique is very effective when the number of keys is reasonably small, so that allocating one position of the array for every possible key is affordable.
- Here, the hash function just takes the input and returns the same value as output.

- This is effective because any retrieval takes only `O(1)` time.
- Otherwise, this approach is too trivial and inefficient for real-life scenarios: it assumes that the keys are integers, whereas in real life we might have data of any type. Even in the case of integers, if the values are large, this is not suitable.
- **Division method:**
- Here, for the hash function, we map each key into a slot of the hash table by taking the remainder of that key divided by the total table size, i.e. `h(key) = key % table_length` (key MODULO table_length)

- This method is quite fast since it takes help of a single division operation and is most commonly used.
- Things to keep in mind while using division method:
- Certain values of table size should be avoided for effective performance. For example, the table size should not be a power, say `p`, of some number, say `r`: if **table_length = r<sup>p</sup>**, then `h(key)` depends only on the `p` lowest-order digits (base `r`) of the key. We need to ensure that the hash function we design depends on all the bits of the key, unless we are sure that all low-order p-bit patterns are equally likely.
- Research suggests that the hash function obtained from the division method gives the best results when the size of the table is **prime**. Also, if `table_length` is prime and `r` is the number of possible character codes on a system such that `r % table_length = 1`, then `h(key)` reduces to the sum of the character values of the key, modulo `table_length`.
- **Mid square method:**
- Suppose we want to place a record of key 3101 and the size of hash table is 2000.
- Here, the key is first squared and then the **middle part** of the resultant number is taken as the index for the hash table. So in our case: key = 3101 => (3101)<sup>2</sup> = 3101 * 3101 = 9616201, i.e.
`h(key) = h(3101) = 162` (taking the middle 3 digits of 9616201). Place the record at index 162 of the hash table.

- **Digit folding method:**
- Here, the key is divided into separate parts and by using simple operations these separated parts are combined to produce a hash.
- Consider a record of key 12345678. Divide this into parts say 123, 456, 78. After dividing the parts combine these parts by performing add operation on them. `h(key) = h(12345678) = 123 + 456 + 78 = 657`

- **Multiplication method:**
- In this method, the hash function implementation is as follows:
- we multiply the key `key` by a real constant `c` in the range `0 < c < 1` and extract the fractional part of the product.
- Then, this fractional part is multiplied by the size of the hash table `table_size`. The floor of the result is taken as the final hash. It is represented as
```
h(key) = floor(table_size * ((key * c) mod 1))
or
h(key) = floor(table_size * fractional(key * c))
```
Here, the function floor(x) returns the integer part of the real number x, and fractional(x) yields the fractional part: `fractional(x) = x - floor(x)`

- The main advantage of this method is that the value of the table size (table_size) is not critical. Typically, this value is chosen as a power of 2, since this can be easily implemented on most computing systems.
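The four methods above can be sketched as small Java functions. The concrete table sizes and the constant `c` below are illustrative choices (for `c`, Knuth's suggested value `(sqrt(5) - 1) / 2` is used):

```
// Sketches of the four hash functions described above.
class HashFunctions {
    // Division method: remainder of the key divided by the table size (ideally prime).
    static int division(int key, int tableSize) {
        return key % tableSize;
    }

    // Mid-square method: square the key, then take the middle digits.
    static int midSquare(int key, int numDigits) {
        String squared = String.valueOf((long) key * key);
        int start = (squared.length() - numDigits) / 2;
        return Integer.parseInt(squared.substring(start, start + numDigits));
    }

    // Digit folding: split the key's digits into fixed-size parts and add them.
    static int digitFolding(long key, int partSize) {
        String digits = String.valueOf(key);
        int sum = 0;
        for (int i = 0; i < digits.length(); i += partSize) {
            int end = Math.min(i + partSize, digits.length());
            sum += Integer.parseInt(digits.substring(i, end));
        }
        return sum;
    }

    // Multiplication method: floor(tableSize * fractional(key * c)).
    static int multiplication(int key, int tableSize) {
        double c = (Math.sqrt(5) - 1) / 2; // Knuth's suggested constant
        double product = key * c;
        double fractional = product - Math.floor(product);
        return (int) Math.floor(tableSize * fractional);
    }
}
```

Using the worked examples from the text: `midSquare(3101, 3)` gives 162, and `digitFolding(12345678L, 3)` gives 657.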
### **Collisions:**
- The phenomenon where **two keys generate the same hash value** from a given hash function is called a collision. A good hash function should ensure that collisions are kept to a minimum.

- We will look in detail at how to minimise collisions in the next section.
### **Load Factor:**
- The load factor is simply a measure of how full (occupied) the hash table is, and is simply defined as:
`α = number of occupied slots/total slots`
- In simple words, consider we have a hash table of size 1000, and we have 500 slots filled, then the load factor would become `α = 500/1000 = 0.5`
- The larger the load factor, the more space is saved, but at the cost of more elements being chained in a single slot (we will learn about chaining in the sections below). This worsens the access time of elements.
- Generally, whenever there is a tight situation in deciding what is more important - time or space - we incline towards optimising time (increasing speed) rather than memory space. This is called the time-space tradeoff.
- To optimise the time complexity, we can reduce the load factor, but doing so increases the memory or space required for the hash table.
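The load factor formula itself is a one-liner; a small sketch:

```
// Load factor α = occupied slots / total slots. With chaining, the average
// number of elements examined per search grows with α.
class LoadFactor {
    static double loadFactor(int occupiedSlots, int totalSlots) {
        return (double) occupiedSlots / totalSlots;
    }
}
```

For the example above, `loadFactor(500, 1000)` returns 0.5.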
## Hashing Techniques
Collisions are bound to occur no matter how good a hash function is. Hence, to maintain the performance of a hash table, it is important to minimise collisions through various collision resolution techniques.

There are majorly 2 methods for handling collisions:
- Separate Chaining
- Open Addressing
### **Separate Chaining**
- Here the main idea is to make each slot of hash table point to a linked list of records that have the same hash value.

- **Performance Analysis:**
- Hashing performance can be evaluated under the assumption that each key is equally likely and uniformly hashed to any slot of hash table.
- Consider `table_size` as number of slots in hash table and `n` as number of keys to be saved in the table. Then:
```
We have Load factor α = n/table_size
Time complexity to search/delete = O(1 + α)
Time complexity for insert = O(1)
Hence, overall time complexity of search, insert and delete operation will be O(1) if α is O(1)
```
- **Advantages of Separate Chaining:**
- This technique is very simple in terms of implementation.
- We can guarantee that the insert operation always occurs in `O(1)` time complexity as linked lists allows insertion in constant time.
- We need not worry about the hash table getting filled up. We can always add any number of elements to a chain whenever needed.
- This method is less sensitive to the hash function and the load factor.
- Generally this method is used when we do not know exactly how many and how frequently the keys would be inserted or deleted.
- **Disadvantages of Separate Chaining:**
- Chaining uses extra memory space for creating and maintaining links.
- It might happen that some parts of hash table will never be used. This technically contributes to wastage of space.
- In the worst-case scenario, it might happen that all the keys to be inserted belong to a single bucket. This would result in a single long linked list, and the search time would be `O(n)`.
- The cache performance of chaining is not that great, since keys are stored in the form of a linked list. Open addressing techniques have better cache performance, as all the keys are stored in the same table. We will explore open addressing techniques in the next section.
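The chaining idea above can be sketched as follows. This is a minimal illustration with `int` keys, `String` values and a fixed table size, not a production-ready map:

```
import java.util.LinkedList;

// Separate chaining: each slot of the table holds a linked list of the
// (key, value) entries that hash to that slot.
class ChainedHashTable {
    private static class Entry {
        final int key;
        String value;
        Entry(int key, String value) { this.key = key; this.value = value; }
    }

    private final LinkedList<Entry>[] slots;

    @SuppressWarnings("unchecked")
    ChainedHashTable(int tableSize) {
        slots = new LinkedList[tableSize];
        for (int i = 0; i < tableSize; i++) slots[i] = new LinkedList<>();
    }

    private int hash(int key) { return Math.floorMod(key, slots.length); }

    void insert(int key, String value) {
        for (Entry e : slots[hash(key)]) {
            if (e.key == key) { e.value = value; return; } // update existing key
        }
        slots[hash(key)].add(new Entry(key, value)); // O(1) append to the chain
    }

    String search(int key) {
        for (Entry e : slots[hash(key)]) { // walk the chain for this slot
            if (e.key == key) return e.value;
        }
        return null; // key not present
    }

    void delete(int key) {
        slots[hash(key)].removeIf(e -> e.key == key);
    }
}
```

With a table of size 7, the keys 10 and 17 both hash to slot 3 and end up in the same chain; both remain searchable.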
### **Open Addressing**
- In this technique, we ensure that all records are stored in the hash table itself. The size of the table must be greater than or equal to the total number of keys available. In case the table gets filled up, we increase its size and copy over the old data whenever needed. How do we handle the basic operations in this technique? Let's see below:
- **Insert(key):** When we try to insert a key to the bucket which is already occupied, we keep probing the hash table until an empty slot is found. Once we find the empty slot, we insert `key` into that slot.

- **Search(key):** While searching for `key` in the hash table, we keep probing until we find a slot whose value equals `key` or until an empty slot is found.
- **Delete(key):** If we simply empty the slot of a deleted `key`, then later searches for other keys may fail, because the probe sequence stops at the first empty slot. Hence, slots of deleted keys are marked as "**deleted**", so that searches can continue past them.
- The term "open addressing" tells us that the address or location of a key is not determined solely by its hash value.
- Following are the probing techniques used in open addressing:
- **Linear Probing:**
- In this, we linearly probe for the next free slot in the hash table. Generally, the gap between two probes is taken as 1.
- Consider `hash(key)` be the slot index computed using hash function and `table_size` represent the hash table size. Suppose `hash(key)` index has a value occupied already, then:
```
We check if (hash(key) + 1) % table_size is free
If ((hash(key) + 1) % table_size) is also not free, then we check for ((hash(key) + 2) % table_size)
If ((hash(key) + 2) % table_size) is again not free, then we try ((hash(key) + 3) % table_size),
:
:
:
and so on until we find the next available free slot
```
- When performing the search operation, the array is scanned linearly in the same sequence until we find the target element or reach an empty slot. An empty slot indicates that no such key is present in the table.

- **Disadvantage:** Many consecutive occupied slots can form groups, in which case the time taken to find a free slot or to search for an element increases greatly, reducing efficiency. This phenomenon is called **clustering**.
- **Quadratic Probing:**
- In this approach, in the i-th iteration we probe the slot at an offset of **i<sup>2</sup>** from the original hash.
- Consider `hash(key)` be the slot index required to place key computed using hash function and `table_size` is the size of hash table available, then:
```
If slot (hash(key) % table_size) is not free, then we check for availability of ((hash(key) + 1*1) % table_size)
If ((hash(key) + 1*1) % table_size) is again full, then we check for ((hash(key) + 2*2) % table_size)
If ((hash(key) + 2*2) % table_size) is not empty, then we check for the status of ((hash(key) + 3*3) % table_size)
:
:
:
and so on until we find the next available empty slot.
```

- **Double Hashing:**
- In this method, we apply a second hash function to the key to determine the probe step whenever a collision occurs.

- It can be done as follows:
```
(hash1(key) + c * hash2(key)) % Table_Size
```
where `c` starts at 0 and is incremented by 1 upon every collision, to find the next free slot.
- Here `hash1()` is the first hash function and `hash2()` is the second hash function applied in case of collision and `Table_Size` is the size of hash table.
- The first hash function can be defined as `hash1(key) = key % Table_Size`. A popular choice for the second hash function is `hash2(key) = PRIME - (key % PRIME)`, where `PRIME` is a prime number smaller than `Table_Size`.
- **Analysis of Open Addressing:**
- The performance of hashing technique can be evaluated under the assumption that each key is equally likely and uniformly hashed to any slot of the hash table.
- Consider `table_size` as total number of slots in the hash table, `n` as number of keys to be inserted into the table, then:
```
* Load factor, α = n/table_size (α < 1)
* Expected time for a search/insert/delete operation ≤ 1/(1 - α)
* Hence, on average, search/insert/delete operations take at most 1/(1 - α) probes
```
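The insert/search/delete behaviour described above (including the "deleted" markers) can be sketched with linear probing. The tombstone sentinel and fixed table size are illustrative assumptions, and the sketch assumes the table never completely fills up:

```
// Open addressing with linear probing. Deleted slots are marked with a
// tombstone so that later searches keep probing past them.
class LinearProbingTable {
    // Tombstone marker (assumes this value is never used as a real key).
    private static final Integer DELETED = Integer.MIN_VALUE;
    private final Integer[] keys;

    LinearProbingTable(int tableSize) {
        keys = new Integer[tableSize];
    }

    private int hash(int key) { return Math.floorMod(key, keys.length); }

    void insert(int key) {
        int i = hash(key);
        // Probe until a free slot (empty or tombstone) or the key itself is found.
        while (keys[i] != null && !keys[i].equals(DELETED) && keys[i] != key) {
            i = (i + 1) % keys.length;
        }
        keys[i] = key;
    }

    boolean search(int key) {
        int i = hash(key);
        int probes = 0;
        // An empty slot means the key is absent; tombstones are skipped over.
        while (keys[i] != null && probes < keys.length) {
            if (!keys[i].equals(DELETED) && keys[i] == key) return true;
            i = (i + 1) % keys.length;
            probes++;
        }
        return false;
    }

    void delete(int key) {
        int i = hash(key);
        int probes = 0;
        while (keys[i] != null && probes < keys.length) {
            if (!keys[i].equals(DELETED) && keys[i] == key) {
                keys[i] = DELETED; // mark as deleted instead of emptying the slot
                return;
            }
            i = (i + 1) % keys.length;
            probes++;
        }
    }
}
```

With table size 7, inserting 10 and then 17 places 17 one slot past 10 (both hash to slot 3); after deleting 10, a search for 17 still succeeds because the probe skips the tombstone.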
## Implementation
**Note to Editor:** We can use the already existing implementations provided in this link https://www.interviewbit.com/tutorial/hashing-implementation-details/
## Applications
Hashing has found applications in many fields due to its property of irreversibility and constant-time access. Following are some applications of hashing:
- **Password Security and Verification**
- Due to the "one-way" or "irreversible" property of hash functions, the world has seen a rise in cryptographic hash functions, which map a password to a hash from which the original cannot be recovered.
- **Security**: Whenever we use a website for the first time, it requires authentication via "Sign Up", wherein we enter the credentials for accessing the account that belongs to us; we need to be assured that our account does not fall into the wrong hands. Hence, the password entered is stored in the database in the form of a hash, so that our crucial data does not fall into the wrong hands.

- **Verification**: When we try to log in to the website from the next time onwards, the hash of the entered password is computed and sent to the server, which verifies it against the stored hash to check that the password belongs to us. The hash value is sent so that the plaintext password is not exposed to sniffing when data travels from client to server.

- **Tokenization**: The concept explained above in "Password Security and Verification" can be extended to tokenization, i.e. storing the authorization tokens of users already logged into the system.
- **Data Structures**: Hashing is commonly used in defining and developing data structures of programming languages. For example: HashMap in Java, unordered_map in C++ etc.
- **Compiler Development**
The compilers of programming languages store all of the language's keywords (such as if, else, for, while, return etc.) in a hash table, so that during compilation the compiler can quickly identify which tokens are keywords and which are identifiers.
- **Message Digests:** This is again an application of cryptographic hashes.
- Consider a scenario where you use cloud services to store your files and data. You need to be assured that the third-party cloud service provider doesn't tamper with your files. We do so by computing a hash of each file using a cryptographic hashing algorithm and storing the computed hashes on our local machines for reference.
- The next time we download the files from cloud storage, we can directly compute the hash of each file and match it against the locally stored hash that was previously computed.
- If a file has been tampered with by the service provider, its hash value will definitely change. If it hasn't been tampered with, the hash values of the downloaded file and the locally saved value will be exactly the same.

- One of the most common cryptographic hash algorithms is SHA-256. These are also called digest algorithms; the file in the above example is treated as a message.
- **Others:**
- Hashing is also used in various common algorithms such as the Rabin-Karp algorithm, in counting the frequency of characters in a word, in generating commit hashes in the git tool, in caching, etc.
- Apart from the above-mentioned applications, hashing is also commonly used in blockchain, in machine learning for feature hashing, for storing objects or files in AWS S3 buckets, in database indexing for faster retrieval, and many others!
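The file-integrity scheme described under "Message Digests" can be sketched with the JDK's built-in `MessageDigest` class (the hex-encoding helper is our own):

```
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;

// Compute a SHA-256 digest of a message and render it as hex, as in the
// file-integrity example: hash the bytes, store the digest, re-compare later.
class DigestDemo {
    static String sha256Hex(String message) {
        try {
            MessageDigest md = MessageDigest.getInstance("SHA-256");
            byte[] digest = md.digest(message.getBytes(StandardCharsets.UTF_8));
            StringBuilder hex = new StringBuilder();
            for (byte b : digest) hex.append(String.format("%02x", b & 0xff));
            return hex.toString();
        } catch (java.security.NoSuchAlgorithmException e) {
            throw new IllegalStateException(e); // SHA-256 ships with every JDK
        }
    }
}
```

Comparing `sha256Hex` of the downloaded file's contents with the stored digest detects any tampering.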
## FAQs
- **Are Hashing algorithms secure?**
- We have a separate branch of hashing algorithms that provide security, called Secure Hash Algorithms (SHAs). These are also called cryptographic hashing algorithms and are widely used in computer and cloud systems for securing resources.
- One of the most widely used secure hashing algorithms today is SHA-256.
- **Who invented Hashing?**
- Hashing was first invented by **Hans Peter Luhn** in the 1940s.
- He demonstrated his work on hashing at a six-day international conference devoted to scientific information held in November 1958.
- **How does Hashing work?**
- Hashing works by mapping data of arbitrary size to a fixed-size value with the help of a function called a "hash function", and storing the data in a structure called a "hash table" at the index returned by the hash function. The values returned by this function are called hash codes, digests, hash values or simply hashes.
- Please refer to the "**What is Hashing? How does it work**?" section for more details.
- **How is hashing implemented in Java?**
- Java provides built-in classes for implementing hashing, as follows:
- **Hashtable<Object, Object>** : This is a synchronised (very useful in multi-threaded environments) implementation of hashing.
- **Implementation:**
- Code:
```
import java.util.*;

class InterviewBit {
    public static void main(String args[])
    {
        // Defining a Hashtable to store
        // String values mapped to Integer keys
        Hashtable<Integer, String> hashTable = new Hashtable<Integer, String>();

        // Fill the values in the Hashtable
        hashTable.put(1, "InterviewBit");
        hashTable.put(2, "Example");
        hashTable.put(5, "for");
        hashTable.put(3, "HashTable");

        // Display the Hashtable
        System.out.println(hashTable);
    }
}
```
- **Output:**
`{5=for, 3=HashTable, 2=Example, 1=InterviewBit}` (note: `Hashtable` makes no guarantee about iteration order)
- **HashMap<Object, Object>** : This is a non-synchronised (not safe for multi-threaded environments without external synchronisation) and faster implementation of hashing. Here, the insertion order of keys is not maintained.
- **Implementation:**
- Code:
```
import java.util.*;

class InterviewBit {
    public static void main(String args[])
    {
        // Defining a HashMap to store
        // String values mapped to Integer keys
        HashMap<Integer, String> hashMap = new HashMap<Integer, String>();

        // Fill the values in the HashMap
        hashMap.put(1, "InterviewBit");
        hashMap.put(2, "Example");
        hashMap.put(5, "for");
        hashMap.put(3, "HashMap");

        // Display the HashMap
        System.out.println(hashMap);
    }
}
```
- **Output:**
`{1=InterviewBit, 2=Example, 3=HashMap, 5=for}` (note: `HashMap` makes no guarantee about iteration order)
- **LinkedHashMap<Object, Object>:** Works similarly to HashMap, but here the insertion order of keys is maintained.
- **ConcurrentHashMap<Object, Object>:** This class works similarly to the Hashtable class: it is synchronised, meaning it works well in multi-threaded environments, and it is faster than Hashtable because it uses finer-grained locking on the resource.
- **HashSet<Object>:** This works similarly to the HashMap class, but only keys are maintained. It doesn't support duplicate objects, and the insertion order of keys is not maintained.
- **LinkedHashSet<Object>:** This is similar to LinkedHashMap. The class just supports keys, and the insertion order of keys is maintained.
- **Why the results of hashing cannot be reverted back to original input?**
- Hashing was developed keeping the security of systems in mind. Hence, for a good hash function, it is always important for the function to be "one-way", so that attackers do not get easy access to the critical data.
- **When is it not advisable to use Hashing/ Hash Tables?**
- In general, hashing has excellent time complexity for Insert/ Search / Delete operations.
- For smaller applications, it is advisable to use arrays, because a hash table consumes more memory space.
- Hash tables do not support some operations, such as iterating over the entries whose keys fall within a defined range, or finding the entry with the largest/smallest key. In such cases, it is better to go with arrays.
- **Where is Hashing used?**
- Hashing is most widely used across many applications such as password security, password verification, tokenization, data structures and compilers of programming languages, blockchain, machine learning feature hashing and many more!
- Please refer to the "**Applications**" section for more details.