Data Structures and Algorithms Lab 7 - Hashing and Maps

<h1 style=" display: flex; align-items: center; justify-content: center;" > DSaA Lab 7 - Hashing and Maps </h1> <h4 style="text-align: center;"> The Islamic University of Gaza<br> Engineering Faculty<br> Department of Computer Engineering </h4> <font color="#ec53c4"> Prepared by: Eng. Ahmad Albarasy </font> <br> <font color="#ec53c4"> Under the supervision of: Prof. Ahmad Mahdi </font> --- In this lab, we are going to explore hash tables, one of the most efficient and widely used data structures for problems that require fast insertion, search, and deletion of data. We will start by learning about hashing and its key terminology, then move on to maps and their Abstract Data Type (ADT). Finally, we will focus on hash maps, hash tables, and hash sets, examining their structure, operations, and practical examples to see how they work in real-world scenarios. ## Hashing ### Definition **Hashing** is a process of converting data (text, numbers, etc..) into a fixed-length string of characters (in the context of cryptography) or to an index within a fixed range (in the context of the Hash Table data structure, as we will see later). The hashing process is done by applying a special algorithm we call **a hash function** <div style="text-align: center;"> <img src="https://hackmd.io/_uploads/Byhiz9hfbx.png" alt="image"> </div> In both contexts, the core idea of hashing is the same: transforming input data into another value using a hash function. However, the way hashing is designed and the purpose it serves differ depending on the context. ### Hashing Terminology * **Hash Function:** A function that takes a key and maps it to an integer value called a hash code. This allows non-integer keys to be processed numerically * **Hash Code:** An integer value generated by a hash function based on the key. It is used as input to the compression function to determine the index in the hash table * **Compression Function:** A function that maps the hash code to a valid index within the range of the hash table ### Hashing in Java In Java, all classes -whether user-defined or provided by the standard library- implicitly extend the [Object](https://docs.oracle.com/javase/8/docs/api/java/lang/Object.html) class provided in the `java.lang` package. The `Object` class provides the `hashCode()` method, which returns an integer hash code for the object based on the object’s memory address. **This method is used by hash-based** collections such as `HashMap` and `HashSet` to efficiently locate objects. However, when two objects are considered logically equal (using the `equals()` method), it is important to override `hashCode()` so that they produce the same hash code. This ensures that hash-based collections like HashMap or HashSet handle such objects correctly, preventing unexpected duplicates and maintaining fast lookup performance. ### Hashing Collisions Collisions happen when two different keys map to the same index. **Unfortunately, collisions are inevitable due to the limited size of the table,** but a good hash function aims to distribute keys uniformly and minimize the number of collisions as much as possible. ### Collision Resolution Schemes #### Separate Chaining A simple and efficient way for dealing with collisions is to have each bucket `A[j]` store its own secondary container, holding all entries `(k,v)`. The container is typically a linked list, but other data structures such as dynamic arrays or balanced trees may also be used. <div style="text-align: center;"> <img src="https://hackmd.io/_uploads/BJz2iWCzZx.png" alt="image"> </div> The efficiency of separate chaining depends on the number of elements stored in each container, which is influenced by the load factor of the hash table. #### Open Addressing In open addressing, all entries are stored directly in the hash table itself, **and each bucket can hold at most one entry**. When a collision occurs, the hash table searches for another empty bucket according to a specific probing scheme. The two most common open addressing techniques are **linear probing** and **quadratic probing**, which differ in how they compute the sequence of alternative indices to try. ## The Map ### Definition A map is a data structure designed to efficiently store and retrieve values based upon a **uniquely identifying search key** for each. Specifically, a map stores key-value pairs `(k,v)`, which we call **entries**, where `k` is the key and `v` is its corresponding value. One of the main advantages of using maps is that the key can be an object of any type, not necessarily an integer. This makes maps flexible and easy to use, as elements can be indexed based on meaningful identifiers such as names, IDs, or other object properties rather than numeric positions. ### The Map ADT ```java= public interface Map<K, V> { /** * Returns the number of entries in the map. * * @return number of entries */ int size(); /** * Returns whether the map is empty. * * @return true if the map contains no entries, false otherwise */ boolean isEmpty(); /** * Returns the value associated with the given key. * * @param key the search key * @return the associated value, or null if no such entry exists */ V get(K key); /** * Inserts or updates a key–value entry in the map. * * @param key the key of the entry * @param value the value to associate with the key * @return the previous value associated with the key, * or null if none existed */ V put(K key, V value); /** * Removes the entry with the given key from the map. * * @param key the key of the entry to remove * @return the value that was removed, * or null if no such entry existed */ V remove(K key); /** * Returns an iterable collection of all keys stored in the map. * * @return iterable collection of keys */ Iterable<K> keySet(); /** * Returns an iterable collection of all values stored in the map. * Values may be repeated if multiple keys map to the same value. * * @return iterable collection of values */ Iterable<V> values(); /** * Returns an iterable collection of all key–value entries in the map. * * @return iterable collection of entries */ Iterable<Entry<K, V>> entrySet(); /** * A key–value pair stored in the map. */ interface Entry<K, V> { /** * Returns the key stored in this entry. * * @return the key */ K getKey(); /** * Returns the value stored in this entry. * * @return the value */ V getValue(); } } ``` ### Implementation We have different options to choose from when implementing a map functionality. Each approach represents a trade-off between simplicity, performance, ordering guarantees, and memory usage. Common of the implementations are: 1. **Unsorted Map (List-Based Implementation):** The unsorted map implementation stores entries in an arbitrary order using a simple linear structure such as an array or a linked list (Check out **Code Fragment 10.4** from the textbook for the detailed implementation). Performance-wise, this implementation, despite its simplicity, is inefficient because searching, updating, and removing entries all require a linear scan through the structure, resulting in a time complexity of `O(n)` for these operations. 2. **Sorted Map (Tree-Based Implementation):** The sorted map implementation **stores entries in a data structure that maintains the keys in sorted order**, such as a balanced binary search tree (for example, a red–black tree or an AVL tree). In terms of performance, this implementation allows searching, insertion, and removal to be performed in `O(log n)` time, making it more efficient than unsorted maps while also supporting ordered operations such as finding the minimum, maximum, or the next larger key. 3. **Hash Map (Hash Table-Based Implementation):** The hash map implementation stores entries in an array **indexed by the result of a hash function applied to the key**. From a performance perspective, this approach provides constant-time `O(1)` **average-case** performance for searching, insertion, and removal, making it the most efficient general-purpose map implementation, **although its worst-case performance can degrade to** `O(n)` **if many collisions occur**. ## Hash Maps and Hash Tables ### Definition In general terms, hash maps and hash tables are basically the same underlying idea of using a hash function and an array to store and retrieve key–value pairs efficiently. The distinction between both is that **a hash table is the data structure itself**, **while a hash map is a map abstraction implemented using a hash table.** However, in Java, there is also a practical distinction: `HashMap` is not synchronized and is therefore not thread-safe by default, whereas `Hashtable` is synchronized and thread-safe, but this comes at the cost of additional overhead and reduced performance in single-threaded or lightly concurrent applications. ### Real-World Use Cases Hash maps are widely used in software systems whenever fast lookup, insertion, and deletion of data based on a key are required. Some common real-world use cases include: 1. **Caching:** Hash maps are used to store recently accessed data so it can be retrieved quickly, for example in web servers, database query caches, or application-level memory caches. 2. **Indexing and Lookup Tables:** Hash maps allow efficient mapping from identifiers (such as user IDs, product codes, or usernames) to corresponding objects or records. 3. **Counting and Frequency Analysis:** Hash maps are commonly used to count occurrences of elements, such as counting word frequencies in a text, tracking page visits, or analyzing log data. 4. **Symbol Tables in Compilers and Interpreters:** Programming language implementations use hash maps to associate variable names with their values, types, or memory locations. 5. **Session Management and Authentication:** Web applications often use hash maps to map session tokens or user IDs to session data, enabling fast validation and retrieval. These use cases take advantage of the constant-time average performance of hash maps, making them a natural choice for building efficient and scalable systems. ### The Load factor and Rehashing The **load factor** of a hash table is defined as the ratio between the number of stored entries `n` and the number of available buckets `m`: $$ \alpha = \frac{n}{m} $$ It provides a measure of how full the hash table is and directly affects its performance. **A low load factor means that many buckets are empty**, which reduces collisions but wastes memory. **A high load factor means the table is crowded**, increasing the number of collisions and slowing down operations such as insertion, lookup, and removal. To maintain efficient performance, hash tables use a process called **rehashing**. When the load factor exceeds a predefined threshold (for example, `0.75` in Java’s `HashMap`), the table is resized, typically by allocating a new, larger array and reinserting all existing entries into it using the hash function again. This redistribution reduces collisions and restores the expected constant-time performance of the hash table operations. Although rehashing is an expensive operation with a time cost of `O(n)`, it occurs infrequently. As a result, the average cost of insertions remains constant over time, a property known as *amortized constant time*. ## Hash Sets A hash set is a data structure that only store unique elements and doesn't allow duplicate values. It's based internally on a hash table that only stores each unique key. The [HashSet](https://docs.oracle.com/javase/8/docs/api/java/util/HashSet.html) class in Java implements the [Set](https://docs.oracle.com/javase/8/docs/api/java/util/Set.html) interface, which has the following primary methods: ```java= public interface Set<E> { /** * Returns the number of elements in the set. * * @return number of elements */ int size(); /** * Returns whether the set is empty. * * @return true if the set contains no elements, false otherwise */ boolean isEmpty(); /** * Returns true if the set contains the specified element. * * @param e the element to search for * @return true if the element is present, false otherwise */ boolean contains(E e); /** * Adds the specified element to the set if it is not already present. * * @param e the element to add * @return true if the set was modified, * false if the element was already present */ boolean add(E e); /** * Removes the specified element from the set if it is present. * * @param e the element to remove * @return true if the element was removed, false if it was not present */ boolean remove(E e); /** * Removes all elements from the set. */ void clear(); /** * Returns an iterator over the elements in the set. * * @return an iterator over the elements */ java.util.Iterator<E> iterator(); } ``` Hash sets are commonly used when the main requirement is to maintain a collection of unique elements and to perform fast membership checks. Typical use cases include removing duplicates from a dataset, tracking which items have already been processed in an algorithm, implementing whitelists or blacklists, and representing sets of permissions, tags, or visited states. Their constant-time average performance for insertion, removal, and lookup makes hash sets especially suitable for large collections where uniqueness and fast existence checks are important. ## Tasks ### Task 1 (4 points) Given two strings `s` and `t`, return true if `t` is **an anagram** of `s`, and false otherwise. Question Link on [LeetCode](https://leetcode.com/problems/valid-anagram/) Example 1: >Input: `s` = "anagram", `t` = "nagaram" >Output: `true` Example 2: >Input: `s` = "rat", `t` = "car" >Output: `false` **Constraints:** $1$ <= `s.length`, `t.length` <= $5 * 10^4$ `s` and `t` consist of lowercase English letters. --- ### Task 2 (6 points) Given a string `s`, find the length of the longest substring without duplicate characters. Question Link on [LeetCode](https://leetcode.com/problems/longest-substring-without-repeating-characters/) Example 1: >Input: `s` = "abcabcbb" >Output: `3` **Explanation:** The answer is "abc", with the length of 3. Note that "bca" and "cab" are also correct answers. Example 2: >Input: `s` = "bbbbb" >Output: `1` **Explanation:** The answer is "b", with the length of 1. Example 3: >Input: `s` = "pwwkew" >Output: `3` **Explanation:** The answer is "wke", with the length of 3. Notice that the answer must be a substring, "pwke" is a subsequence and not a substring. **Constraints:** $0$ <= `s.length` <= $5 * 10^4$ `s` consists of English letters, digits, symbols and spaces. --- ### Tasks Submission Guide You will solve both problems on LeetCode and you have to make sure that your solution passes all test cases and that LeetCode accepts it. Once you are done, its time to submit your solution on GradeScope. Firstly, make sure to submit only two files: **1. TaskOne.java 2. TaskTwo.java** Secondly, for each question, you have to follow these steps to make sure GradeScope's autograder runs without issues: #### For Task 1: 1. Copy your whole solution into a file named `TaskOne.java` 2. Change the class name from `Solution` to `TaskOne`, and make sure that the class is public 3. add the following package identifier to the beginning of the file. Make sure to copy it and **don't write it manually to avoid typos**: ```java= package com.gradescope.labSeven; ``` Here's an example solution for reference: ```java= package com.gradescope.labSeven; public class TaskOne { public boolean isAnagram(String s, String t) { } /* if any helper methods are needed, make sure to include them */ } ``` #### For Task 2: 1. Copy your whole solution into a file named `TaskTwo.java` 2. Change the class name from `Solution` to `TaskTwo`, and make sure that the class is public 3. add the following package identifier to the beginning of the file. Make sure to copy it and **don't write it manually to avoid typos**: ```java= package com.gradescope.labSeven; ``` Here's an example solution for reference: ```java= package com.gradescope.labSeven; public class TaskTwo { public int lengthOfLongestSubstring(String s){ } /* if any helper methods are needed, make sure to include them */ } ``` --- <div style="display: flex; align-items: center; justify-content: center;" > Happy Coding 🙂 </div>