Data Structure / Hash Table

# Data Structure / Hash Table ## 1. 基礎概念 Basic Concepts ### 1.1 什麼是雜湊表？雜湊表（Hash Table）是一種基於陣列實作的資料結構，它能夠將鍵（key）對應到表中的值（value），進而實現快速的資料存取。想像一間書是隨機擺放的圖書館，要找一本特定的書就需要走遍整個圖書館，但如果我們依照書名的某個規則（例如：第一個字的 ASCII 碼）來決定書的位置，就能快速找到目標書籍，這就是雜湊表的基本概念。 ### 1.2 為什麼需要雜湊表？在許多應用場景中，我們需要一個能夠支援快速插入、刪除和查詢操作的資料結構： - 資料庫索引 - 快取系統（Cache） - 符號表（Symbol Table） - 拼字檢查器 - 網頁爬蟲 - 密碼學應用 ### 1.3 雜湊表的基本操作雜湊表支援以下基本操作，理想情況下皆為 O(1) 時間複雜度： - 插入（Insert）：加入新的鍵值對 - 刪除（Delete）：移除指定鍵值對 - 查詢（Search / Lookup）：根據鍵值查找對應的值 - 更新（Update）：修改已存在鍵值對的值 ## 2. 直接定址表 Direct-Address Table ### 2.1 基本概念直接定址表是最簡單的雜湊表形式，它直接使用鍵值作為陣列索引。這種方法適用於鍵值範圍較小且密集的情況。 ### 2.2 實作範例 ```java= public class DirectAddressTable<V> { private Object[] table; public DirectAddressTable(int size) { table = new Object[size]; } @SuppressWarnings("unchecked") public V search(int key) { if (key < 0 || key >= table.length) { throw new IndexOutOfBoundsException("Invalid key"); } return (V) table[key]; } public void insert(int key, V value) { if (key < 0 || key >= table.length) { throw new IndexOutOfBoundsException("Invalid key"); } table[key] = value; } public void delete(int key) { if (key < 0 || key >= table.length) { throw new IndexOutOfBoundsException("Invalid key"); } table[key] = null; } } ``` 使用範例： ```java= public class DirectAddressTableExample { public static void main(String[] args) { DirectAddressTable<String> table = new DirectAddressTable<>(100); // 插入一些資料 table.insert(5, "Apple"); table.insert(10, "Banana"); table.insert(15, "Orange"); // 查詢資料 System.out.println(table.search(10)); // 輸出: Banana // 刪除資料 table.delete(10); System.out.println(table.search(10)); // 輸出: null } } ``` ### 2.3 直接定址表的限制 1. 空間使用效率低：必須配置等同於鍵值範圍大小的記憶體空間 2. 只適用於整數鍵值 3. 鍵值必須是非負數且範圍較小 ## 3. 雜湊函數 Hash Functions ### 3.1 理想的雜湊函數特性一個好的雜湊函數應該具備以下特性： 1. **均勻分配（Uniform Distribution）**：每個雜湊值被選中的機率應該相近 2. **確定性（Deterministic）**：相同的輸入必須產生相同的輸出 3. **高效能（Efficient）**：計算速度要快 4. **雪崩效應（Avalanche Effect）**：輸入的微小改變應該導致輸出的顯著改變以下是一些常見雜湊函數的介紹與實作： ### 3.2 除法雜湊法 #### 3.2.1 概念除法雜湊法是最基本的雜湊技術之一，通過將鍵值 $x$ 除以一個整數 $M$（通常是雜湊表的大小），然後取餘數作為雜湊表中的索引。其公式為： $$ f_D(x) = x \mod M $$ #### 3.2.2 特點 - **簡單易實現**：計算過程中只需進行一次除法運算。 - **碰撞風險**：如果鍵值的分佈不均勻，可能會導致碰撞（即不同鍵值映射到同一索引）。 - **溢位處理**：當發生碰撞時，可以使用線性探測或再雜湊等方法來解決問題。 #### 3.2.3 程式碼實作 ```java= public class DivisionHash { private final int tableSize; public DivisionHash(int size) { // 選擇大於 size 的最小質數作為表大小 this.tableSize = nextPrime(size); } public int hash(int key) { return Math.abs(key) % tableSize; } // 判斷是否為質數 private boolean isPrime(int n) { if (n <= 1) return false; if (n <= 3) return true; if (n % 2 == 0 || n % 3 == 0) return false; for (int i = 5; i * i <= n; i += 6) { if (n % i == 0 || n % (i + 2) == 0) return false; } return true; } // 找下一個質數 private int nextPrime(int n) { while (!isPrime(n)) { n++; } return n; } } ``` ### 3.3 乘法雜湊法 #### 3.3.1 概念乘法雜湊法通過將鍵值 $k$ 與一個常數 $A$ 相乘，然後取小數部分來計算雜湊地址。具體公式為： $$ h(k) = \lfloor m \cdot (k \cdot A \mod 1) \rfloor $$ 其中 $m$ 是雜湊表的大小，$A$ 是一個介於 0 和 1 之間的常數（通常選擇無理數如 $( \sqrt{5} - 1 ) / 2$） #### 3.3.2 特點 - **均勻性優異**：相較於除法方法，乘法方法能更均勻地分佈鍵值，降低碰撞的概率。 - **實現簡單**：計算過程中涉及乘法和取小數部分，操作簡單。 - **適用性廣**：適合靜態雜湊表，即在初始化後不頻繁變動的情況下使用。 #### 3.3.3 程式碼實作 ```java= public class MultiplicationHash { private final int tableSize; private final double A = (Math.sqrt(5) - 1) / 2; // 0.618033988749895 public MultiplicationHash(int size) { this.tableSize = 1 << (int) Math.ceil(Math.log(size) / Math.log(2)); } public int hash(int key) { double temp = key * A; double frac = temp - Math.floor(temp); return (int) (tableSize * frac); } } ``` ### 3.4 通用字串雜湊 #### 3.4.1 概念哈希值如同數據的數字指紋，是透過將任意形式的數據輸入哈希函數後，轉換為固定大小的數值產生；這一過程被稱為「哈希化」，哈希值具有以下幾個重要特性： 1. **雪崩效應**：即使是微小的數據變動，也會導致哈希值完全不同，這保證了數據的唯一性。 2. **碰撞抵抗性**：哈希函數的設計即是為防止「碰撞」，不同的輸入不應產生相同的哈希值。 3. **效率**：生成哈希值的過程非常快速，適合處理大量數據和進行快速驗證。 4. **不可逆性**：從哈希值無法推導出原始數據。 5. **固定長度**：無論輸入數據的大小或類型，哈希值的長度都是固定的。在區塊鏈和加密貨幣中，哈希值用於維護數據完整性和安全性。例如，比特幣使用 SHA-256 哈希算法來確保交易記錄的不可篡改性，每個區塊包含前一區塊的哈希值以形成安全鏈接。這些特性使得哈希值在身份驗證、數據完整性檢查等多個領域中扮演著關鍵角色。哈希值與通用字串雜湊之間的關係主要體現在哈希值是通用字串雜湊的結果之一。通用字串雜湊是一種特定類型的哈希函數，專門設計用來處理字符串數據。它將字符串映射到一個固定大小的哈希值上，並且通常設計為在性能和碰撞抵抗性之間取得平衡。常見的通用字串雜湊函數（如 BKDRHash、DJBHash 等）廣泛應用於資料結構中，以便快速查找和存儲字符串。Overall，哈希值是通用字串雜湊過程中的產物，而通用字串雜湊則是一種專門針對字符串設計的哈希方法。兩者共同作用於數據管理和安全領域，確保數據的完整性和快速檢索。另外列舉一些常見的哈希函數： - **MD5**：雖然曾經廣泛使用，但由於其安全性問題，現在不再推薦用於安全應用。 - **SHA-1**：同樣存在安全漏洞，逐漸被淘汰。 - **SHA-256**：當前廣泛使用且被認為非常安全，特別是在加密貨幣（如比特幣）中應用廣泛。 #### 3.4.2 特點 - **安全性高**：具備碰撞抗性，極難找到兩個不同的輸入產生相同的哈希值。 - **應用廣泛**：在數位簽名、資料完整性驗證等方面有良好的應用。 - **性能考量**：需要平衡計算效率與安全性，設計良好的哈希函數能有效防止各種攻擊。 #### 3.4.3 程式碼實作 ```java= import java.security.MessageDigest; import java.security.NoSuchAlgorithmException; import java.nio.charset.StandardCharsets; public class SHA256Example { public static void main(String[] args) { String input = "Hello, World!"; // 要哈希的輸入字符串 String hash = generateSHA256Hash(input); System.out.println("SHA-256 Hash: " + hash); } public static String generateSHA256Hash(String input) { try { // 創建SHA-256 MessageDigest實例 MessageDigest digest = MessageDigest.getInstance("SHA-256"); // 計算哈希值，並獲取字節數組 byte[] hashBytes = digest.digest(input.getBytes(StandardCharsets.UTF_8)); // 將字節數組轉換為十六進制字符串 return bytesToHex(hashBytes); } catch (NoSuchAlgorithmException e) { throw new RuntimeException(e); // 處理異常 } } private static String bytesToHex(byte[] bytes) { StringBuilder hexString = new StringBuilder(); for (byte b : bytes) { // 將每個字節轉換為兩位數的十六進制表示 String hex = Integer.toHexString(0xff & b); if (hex.length() == 1) hexString.append('0'); // 確保每個字節是兩位數 hexString.append(hex); } return hexString.toString(); } } ``` ## 4. 碰撞處理 Collision Resolution 雜湊表中紀錄了許多呈現一對一關係的鍵值對，然而在這裡一對多的情況是不合法的，這種情況我們將其稱為碰撞（Collision）。為了讓個鍵都可以對應到專屬的值，我們會透過以下方法來處理碰撞： ### 4.1 鏈結法（Separate Chaining）鏈結法是最常用的碰撞處理方法，它使用鏈結串列來儲存雜湊值相同的元素。 ```java= public class HashTableChaining<K, V> { private class Node { K key; V value; Node next; Node(K key, V value) { this.key = key; this.value = value; } } private Node[] table; private int size; private int capacity; private final double LOAD_FACTOR_THRESHOLD = 0.75; @SuppressWarnings("unchecked") public HashTableChaining(int initialCapacity) { this.capacity = initialCapacity; this.table = (Node[]) new HashTableChaining.Node[capacity]; this.size = 0; } public void put(K key, V value) { if (key == null) { throw new IllegalArgumentException("Key cannot be null"); } if ((double) size / capacity >= LOAD_FACTOR_THRESHOLD) { resize(); } int index = hash(key); Node current = table[index]; // 檢查是否已存在相同的key while (current != null) { if (current.key.equals(key)) { current.value = value; return; } current = current.next; } // 新增節點到開頭 Node newNode = new Node(key, value); newNode.next = table[index]; table[index] = newNode; size++; } public V get(K key) { if (key == null) { throw new IllegalArgumentException("Key cannot be null"); } int index = hash(key); Node current = table[index]; while (current != null) { if (current.key.equals(key)) { return current.value; } current = current.next; } return null; } private int hash(K key) { return Math.abs(key.hashCode() % capacity); } @SuppressWarnings("unchecked") private void resize() { int newCapacity = capacity * 2; Node[] oldTable = table; table = (Node[]) new HashTableChaining.Node[newCapacity]; capacity = newCapacity; // 重新雜湊所有元素 for (Node head : oldTable) { Node current = head; while (current != null) { Node next = current.next; int newIndex = hash(current.key); current.next = table[newIndex]; table[newIndex] = current; current = next; } } } } ``` ### 4.2 開放定址法（Open Addressing）開放定址法是一種將所有數據存儲在同一數組中的方法，當發生碰撞時，它會尋找下一個空位來存儲數據。這種方法不使用額外的數據結構來處理衝突，而是在原數組中尋找可用的位置。常見的探測策略有以下三種： - **線性探測**：如果發生衝突，則依次檢查下一個位置。公式為： $$ H_i(key) = (H(key) + i) \mod m $$ - **二次探測**：使用平方增量來尋找下一個位置，以減少聚集現象。公式為： $$ H_i(key) = (H(key) + i^2) \mod m $$ - **雙重散列**：使用第二個哈希函數計算增量。公式為： $$ H_i(key) = (H1(key) + i \cdot H2(key)) \mod m $$ --- #### 以下是以線性探測為例的程式碼 ```java= public class LinearProbingHashTable<K, V> { private class Entry { K key; V value; boolean isDeleted; Entry(K key, V value) { this.key = key; this.value = value; this.isDeleted = false; } } private Entry[] table; private int size; private int capacity; private final double LOAD_FACTOR_THRESHOLD = 0.5; // 開放定址法需要較低的負載因子 @SuppressWarnings("unchecked") public LinearProbingHashTable(int initialCapacity) { this.capacity = initialCapacity; this.table = new Entry[capacity]; this.size = 0; } public void put(K key, V value) { if (key == null) { throw new IllegalArgumentException("Key cannot be null"); } if ((double) size / capacity >= LOAD_FACTOR_THRESHOLD) { resize(); } int index = hash(key); int i = 0; while (i < capacity) { int currentIndex = (index + i) % capacity; if (table[currentIndex] == null || table[currentIndex].isDeleted) { table[currentIndex] = new Entry(key, value); size++; return; } if (table[currentIndex].key.equals(key)) { table[currentIndex].value = value; return; } i++; } throw new RuntimeException("Hash table is full"); } public V get(K key) { if (key == null) { throw new IllegalArgumentException("Key cannot be null"); } int index = hash(key); int i = 0; while (i < capacity) { int currentIndex = (index + i) % capacity; if (table[currentIndex] == null) { return null; } if (!table[currentIndex].isDeleted && table[currentIndex].key.equals(key)) { return table[currentIndex].value; } i++; } return null; } private int hash(K key) { return Math.abs(key.hashCode() % capacity); } @SuppressWarnings("unchecked") private void resize() { Entry[] oldTable = table; capacity *= 2; table = new Entry[capacity]; size = 0; for (Entry entry : oldTable) { if (entry != null && !entry.isDeleted) { put(entry.key, entry.value); } } } } ``` ## 5. 進階主題 Advanced Topics ### 5.1 一致性雜湊 Consistent Hashing 一致性雜湊是一種特殊的雜湊方法，主要用於分散式系統中。它能夠在新增或移除節點時，最小化需要重新映射的鍵值數量，因此這種方法特別適合需要動態增減節點的情況。 **工作原理：** - 哈希環：一致性雜湊將所有節點（如伺服器）和數據鍵映射到一個虛擬的哈希環上，每個節點和數據鍵都通過哈希函數計算出一個位置，然後在環上進行排列。 - 數據分配：當需要將一個數據鍵分配給某個節點時，從該鍵的哈希值開始，順時針查找第一個出現的節點，該節點即為該數據鍵的負責節點。 ```java= public class ConsistentHash<T> { private final HashFunction hashFunction; private final int numberOfReplicas; private final SortedMap<Integer, T> circle = new TreeMap<>(); public ConsistentHash(HashFunction hashFunction, int numberOfReplicas, Collection<T> nodes) { this.hashFunction = hashFunction; this.numberOfReplicas = numberOfReplicas; for (T node : nodes) { add(node); } } public void add(T node) { for (int i = 0; i < numberOfReplicas; i++) { circle.put(hashFunction.hash(node.toString() + i), node); } } public void remove(T node) { for (int i = 0; i < numberOfReplicas; i++) { circle.remove(hashFunction.hash(node.toString() + i)); } } public T get(Object key) { if (circle.isEmpty()) { return null; } int hash = hashFunction.hash(key); if (!circle.containsKey(hash)) { SortedMap<Integer, T> tailMap = circle.tailMap(hash); hash = tailMap.isEmpty() ? circle.firstKey() : tailMap.firstKey(); } return circle.get(hash); } public interface HashFunction { int hash(Object key); } } ``` ### 5.2 布隆過濾器 Bloom Filter 布隆過濾器是一種基於哈希的高效空間查找結構，用於檢查某個元素是否在一個集合中。它能夠快速地回答「某個元素是否在集合內」這個問題，但可能會產生誤報（False Positive）。假設回傳值為 positive，那僅代表資料 **有可能存在於布隆過濾器中**，並不表示絕對存在。布隆過濾器被廣泛應用於網絡安全、資料庫查詢、垃圾郵件過濾等場景，例如Google的Bigtable和HBase都使用了布隆過濾器來提高查詢效率。 **工作原理：** - 結構：布隆過濾器由一個位向量和多個哈希函數組成。位向量初始化為全0。 - 插入元素：當插入一個元素時，通過多個哈希函數計算出多個索引，並將這些索引位置的位設置為1。 - 查詢元素：查詢時，同樣使用多個哈希函數計算索引。如果所有計算出的位都為1，則認為該元素可能存在；如果有任何位為0，則確定該元素不存在。 > 這篇講的嘎嘎好，我推：[資料結構大便當 - Bloom Filter](https://medium.com/@Kadai/%E8%B3%87%E6%96%99%E7%B5%90%E6%A7%8B%E5%A4%A7%E4%BE%BF%E7%95%B6-bloom-filter-58b0320a346d) ```java= public class BloomFilter<T> { private BitSet bitSet; private int size; private int numHashFunctions; public BloomFilter(int size, int numHashFunctions) { this.size = size; this.numHashFunctions = numHashFunctions; this.bitSet = new BitSet(size); } public void add(T item) { for (int i = 0; i < numHashFunctions; i++) { bitSet.set(getHash(item, i)); } } public boolean mightContain(T item) { for (int i = 0; i < numHashFunctions; i++) { if (!bitSet.get(getHash(item, i))) { return false; } } return true; } private int getHash(T item, int seed) { int h1 = item.hashCode(); int h2 = h1 >>> 16; int hash = (h1 + seed * h2) % size; return Math.abs(hash); } public void clear() { bitSet.clear(); } } // 使用範例 public class BloomFilterExample { public static void main(String[] args) { BloomFilter<String> filter = new BloomFilter<>(1000, 3); // 添加元素 filter.add("apple"); filter.add("banana"); filter.add("orange"); // 檢查元素 System.out.println(filter.mightContain("apple")); // true System.out.println(filter.mightContain("banana")); // true System.out.println(filter.mightContain("grape")); // 很可能是 false } } ``` ### 5.3 完美雜湊 Perfect Hashing 完美雜湊是一種特殊的雜湊方法，用於在靜態集合（即不會變動的資料集）中實現零碰撞（即每個鍵都映射到唯一的位置）。這意味著查找操作可以在常數時間內完成，進而保證在最壞情況下仍然有 O(1) 的查詢時間。程式語言中的保留字集合、字典等需要快速查找且不經常變動的資料結構就特別適合使用完美雜湊。 **工作原理：** - 二級哈希表：完美雜湊通常會使用到兩個哈希表。第一級表用於將所有鍵映射到若干槽位，每個槽位可能會有多個鍵（這時需要處理碰撞）。第二級表則用於解決第一級表中的碰撞，每個槽位使用獨立的哈希函數進行二次映射，以確保沒有碰撞發生。 - 大小選擇：第二級表的大小通常設置為第一級槽位中鍵數量的平方，以確保能夠容納所有鍵而不發生碰撞。 ```java= public class PerfectHash<K, V> { private static class SecondaryHash<K, V> { private final Entry<K, V>[] table; @SuppressWarnings("unchecked") public SecondaryHash(List<Entry<K, V>> items, int size) { table = new Entry[size]; for (Entry<K, V> item : items) { int index = hash2(item.getKey(), size); table[index] = item; } } public V get(K key) { int index = hash2(key, table.length); Entry<K, V> entry = table[index]; return (entry != null && entry.getKey().equals(key)) ? entry.getValue() : null; } private int hash2(K key, int size) { return ((key.hashCode() * 31) & 0x7fffffff) % size; } } private final SecondaryHash<K, V>[] tables; private final int primarySize; @SuppressWarnings("unchecked") public PerfectHash(Map<K, V> initialData) { primarySize = initialData.size() * 2; tables = new SecondaryHash[primarySize]; // 將元素分組 List<List<Entry<K, V>>> buckets = new ArrayList<>(primarySize); for (int i = 0; i < primarySize; i++) { buckets.add(new ArrayList<>()); } for (Map.Entry<K, V> entry : initialData.entrySet()) { int index = hash1(entry.getKey()); buckets.get(index).add(new AbstractMap.SimpleEntry<>( entry.getKey(), entry.getValue())); } // 為每個非空bucket建立二級雜湊表 for (int i = 0; i < primarySize; i++) { List<Entry<K, V>> bucket = buckets.get(i); if (!bucket.isEmpty()) { int secondarySize = bucket.size() * bucket.size(); tables[i] = new SecondaryHash<>(bucket, secondarySize); } } } public V get(K key) { SecondaryHash<K, V> secondaryTable = tables[hash1(key)]; return secondaryTable == null ? null : secondaryTable.get(key); } private int hash1(K key) { return (key.hashCode() & 0x7fffffff) % primarySize; } } ``` ## 6. 實際應用與最佳實踐 Practical Applications and Best Practices ### 6.1 Java HashMap 的實作考量 Java 的 HashMap 實作採用了多項優化技術： 1. **初始容量與負載因子** ```java= public class OptimizedHashMap<K, V> { private static final int DEFAULT_INITIAL_CAPACITY = 16; private static final float DEFAULT_LOAD_FACTOR = 0.75f; // 初始化 public static <K, V> HashMap<K, V> createWithExpectedSize(int expectedSize) { return new HashMap<K, V>((int) (expectedSize / DEFAULT_LOAD_FACTOR + 1)); } } ``` 2. **樹化（Treeification）**：當鏈表長度超過閾值（Java 8 中為 8）時，將鏈表轉換為紅黑樹。 ### 6.2 效能優化建議 1. **合適的初始容量**：預估資料量大小，避免頻繁擴容 2. **選擇合適的雜湊函數**：根據鍵值特性選擇 3. **監控負載因子**：維持在 0.75 左右 4. **避免雜湊碰撞**：確保 hashCode() 實作品質 ### 6.3 常見陷阱與解決方案 1. **可變物件作為鍵值** ```java= public class MutableKeyExample { static class MutableKey { private int value; public MutableKey(int value) { this.value = value; } public void setValue(int value) { this.value = value; } @Override public int hashCode() { return value; } @Override public boolean equals(Object obj) { if (this == obj) return true; if (!(obj instanceof MutableKey)) return false; return value == ((MutableKey) obj).value; } } public static void demonstrateMutableKeyProblem() { Map<MutableKey, String> map = new HashMap<>(); MutableKey key = new MutableKey(10); map.put(key, "Original Value"); System.out.println(map.get(key)); // "Original Value" key.setValue(20); // 改變鍵值 System.out.println(map.get(key)); // null，無法找到原值 } } ``` 2. **hashCode() 和 equals() 的一致性** ```java= public class Person { private final String name; private final int age; public Person(String name, int age) { this.name = name; this.age = age; } @Override public boolean equals(Object o) { if (this == o) return true; if (!(o instanceof Person)) return false; Person person = (Person) o; return age == person.age && Objects.equals(name, person.name); } @Override public int hashCode() { return Objects.hash(name, age); } } ``` ## 7. 效能評估與比較 Performance Analysis ### 7.1 時間複雜度比較 | 操作 | 平均情況 | 最壞情況 | |------|----------|----------| | 插入 | O(1) | O(n) | | 查詢 | O(1) | O(n) | | 刪除 | O(1) | O(n) | ### 7.2 空間使用分析 - 鏈結法：O(n + m)，其中 n 為元素數量，m 為表大小 - 開放定址法：O(m)，其中 m 為表大小