雜湊表 - HackMD

--- title: '雜湊表' disqus: kyleAlien --- 雜湊表 === ## OverView of Content 雜湊 (hashing) 是有效率的方法，避免了從頭到尾的迭代搜尋，效能勝於一般的搜尋演算法 ! [TOC] ## 雜湊表 vs. 迭代 * 迭代方式：必須從頭到尾來搜尋，**最差情況下耗費時間為 O(N)** ```java= private static int[] month_length = new int[] { 31, 28, 31, 30, 31, 30, 31, 31, 30, 31, 30, 31 }; private static String[] month_key_array = new String[] { "January", "February", "March", "April", "May", "June", "July", "August", "September", "October", "November", "December" }; public int iteration_month(String month) { int result = -1; for (int i = 0; i < month_key_array.length; i++) { String mon = month_key_array[i]; if (mon.equals(month)) { result = i; break; } } return result != -1 ? month_length[result] : -1; } ``` * 雜湊表方式：使用 JDK 自提供的 `HashMap` 類，**時間平均可攤平在 O(1) 的效能** ```java= private static final Map<String, Integer> monthMap = new HashMap<>() { { put("January", 31); put("February", 28); put("March", 31); put("April", 30); put("May", 31); put("June", 30); put("July", 31); put("August", 31); put("September", 30); put("October", 31); put("November", 30); put("December", 31); } }; public int hash_month(String month) { return monthMap.getOrDefault(month, -1); } ``` ### 運算 Key - 基數 & Modulo 模數 * 基數：以某個數為基底，每進一位就多一次方 * **Base10**：一般我們人類在運算時使用 Base10，也就是 10 為底的基數來做運算；e.g：9527 = 9\*10^3^ + 5\*10^2^ + 2\*10^1^ + 7\*10^0^；寫成程式就如下 ```java= public long base10(String value) { int base = 10; // 基數 10 long result = 0; char[] chars = value.toCharArray(); int index = chars.length - 1; while (index >= 0) { // ASCII Code 運算出當前數字 int curNum = (int) chars[index] - '0'; result = result + (int) (curNum * Math.pow(base, chars.length-1 - index)); index -= 1; } return result; } ``` > ![](https://i.imgur.com/baYP7RW.png) * **Base26**：則是將英文字母 26 數字再加上 26 基於底去運算出一個數 (推算方法如上) ```java= public long base26(String value) { int base = 26; long result = 0; char[] chars = value.toLowerCase().toCharArray(); int index = chars.length - 1; while (index >= 0) { int curNum = (int) chars[index] - 'a'; result = result + (int) (curNum * Math.pow(base, chars.length-1 - index)); index -= 1; } return result; } ``` > ![](https://i.imgur.com/Drt1IjO.png) * Module 模數：基於上面的基數算出的數字過大，我們可以使用 `%` 符號運算除模數 (也就是餘數)；我們已照月份 `August` 來取 Base64 * Base64 運算結果：9258983 * 將結果取模 % 34：1 > ![](https://i.imgur.com/TyPtm66.png) > > 這個 34 就是一個陣列，用來儲存取模後的結果，將結果放置陣列中 * 將 12 月份 String 作為 Key 進行取模運算後，在將每個月的日期作為 Value 數值填入 ```java= public static void main(String[] args) { String[] month_key_array = new String[] { "January", "February", "March", "April", "May", "June", "July", "August", "September", "October", "November", "December" }; int[] month_length = new int[] { 31, 28, 31, 30, 31, 30, 31, 31, 30, 31, 30, 31 }; int[] result = new int[34]; Arrays.fill(result, -1); // 初始化為 -1 BaseValue baseValue = new BaseValue(); for (int i = 0; i < month_key_array.length; i++) { String mon = month_key_array[i]; // KEY：取模運算，算出 index int index = (int) (baseValue.base26(mon) % 34); // Value：將月份天數作為 Value 填入 result[index] = month_length[i]; } System.out.println("Finally: " + Arrays.toString(result)); } ``` > ![](https://i.imgur.com/tHao9qy.png) :::info * 雜湊值可能是負數 * 將 Key 值進行簡單運算（取模），算出關聯的相關引索；簡單來說看到 1 就代表了 `August 天數 31` > **原本的 Key `August` 與計算出的 hashKey `1` 沒有關係** ::: * 這裡要注意可能會發生另一個問題：雜湊衝突 :::danger * **雜湊衝突** **雜湊值不必是唯一值**，在 Key 取模運算後，發現兩者個值相同，但是該位置已經被使用，這就是雜湊衝突 ::: * 上面的 base10、base26 就是雜湊函數，雜湊函數的特點：兩個相同的物件，必須算出相同的雜湊值 (這代表 **雜湊值並非亂數**) * 取模時 `hash(key) % M` 計算出 Key 的雜湊值，其中取模的 M 必須至少跟 Key 一樣大，也就是 `M >= len(keyArray)` ## 雜湊 - 低效至高效以下實作一個低效的雜奏，並且該雜湊也還沒解決雜湊衝突 ```java= public class InefficientHash<K, V> { static final class Entry<K, V> { final K key; V val; Entry(K key, V val) { this.key = key; this.val = val; } } // 雜湊 Array，儲存計算出的 hashKey !! private final Entry<K, V>[] array; // 模數 private final int modulo; public InefficientHash(int modulo) { array = new Entry[modulo]; // 創建一個與模大小相同的 Array this.modulo = modulo; } public V get(K key) { int hashKey = hashCode(key) % modulo; // 取模 if(array[hashKey] != null) { // 查看該 hashkey 是否已經有 Entry return array[hashKey].val; // 返回對應 Entry 的 val } return null; } public void put(K key, V val) { int hashKey = hashCode(key) % modulo; // 取模 Entry<K, V> item = array[hashKey]; // 判斷 array 中是否已經有 Entry if(item != null) { // 已經有 Entry // 判斷 Entry 中的 key if(item.key == key) { item.val = val; } else { // 衝突拋異常 throw new RuntimeException(key + " conflict with: " + item.key + ", key: " + hashKey); } } else { // 沒有 Entry array[hashKey] = new Entry<>(key, val); } } // 由於使用 Java，避免取模算出負數，所以在這裡處理 public int hashCode(K key) { // 使用 Object 預設的雜驟函數 `hashCode` return key.hashCode() & 0x7fff_ffff; } } ``` 測試後，發生衝突 ```java= public static void main(String[] args) { InefficientHash<String, Integer> table = new InefficientHash<>(10); table.put("April", 30); table.put("May", 31); table.put("September", 30); System.out.println(table.get("April")); System.out.println(table.get("May")); System.out.println(table.get("September")); } ``` > ![](https://i.imgur.com/r1Xwg6q.png) ### 開放定址 - 線性探測 * 當遇到雜湊衝突時，使用探測的手法來搜尋新的位置，這種方式稱為 **開放定址 (open addressing)** * **線性搜尋**：運作時機就是雜湊衝突時，透過衝突的開始點，往更高引索搜尋，如果都搜尋不到則從 0 開始搜尋，如果都滿了才會拋出異常 > ![](https://i.imgur.com/YnuklJD.png) * **環繞搜尋**：在線性搜尋的過程中，發現該 bucket 已經占據 (collision)，這時就必須往下搜尋，這種行為就稱為環繞搜尋 > ![](https://i.imgur.com/83a3Az1.png) * **搜尋方式**：**將算出的 hashKey + 1 後反覆取模**，`(hashKey + 1) % M`；這樣是否會進入死循環呢？**會，所以需要一個 ++紀錄桶的參數++** ```java= // 假設模值 10，要 put(93, "Value") 93 % 10 = 3 // 3 + 1 再 while 判斷 4 % 10 = 4 // 4 + 1 再 while 判斷 5 % 10 = 5 // 5 + 1 再 while 判斷 6 % 10 = 6 // 6 + 1 再 while 判斷 7 % 10 = 7 // 7 + 1 再 while 判斷 8 % 10 = 8 // 8 + 1 再 while 判斷 9 % 10 = 9 // 9 + 1 再 while 判斷 10 % 10 = 0 // 0 + 1 再 while 判斷 1 % 10 = 1 // 1 + 1 再 while 判斷 2 % 10 = 2 // 2 + 1 再 while 判斷 3 % 10 = 3 // 重複 ! ``` * **開放定址實作**：使用上面的低效雜湊，**針對衝突點 (put)、取值 (get) 改造** ```java= // 開放定址實作 private final Entry<K, V>[] array; private final int modulo; private int usedBucket; // 紀錄已使用的 Bucket，保證一定會留下一個空桶 public OpenAddressing(int modulo) { array = new Entry[modulo]; this.modulo = modulo; usedBucket = 0; } public V get(K key) { int hashKey = hashCode(key) % modulo; while (array[hashKey] != null) { // 環繞搜尋 if(array[hashKey].key == key) { return array[hashKey].val; } hashKey = (hashKey + 1) % modulo; // +1 再取模 } // 都找不到 return null; } public void put(K key, V val) { int hashKey = hashCode(key) % modulo; // 找空位 while (array[hashKey] != null) { // 環繞搜尋 if(array[hashKey].key == key) { array[hashKey].val = val; // 更新 return; } hashKey = (hashKey + 1) % modulo; // +1 再取模 } // `-1` 可以保證永遠會有一個空桶 if(usedBucket >= array.length - 1) { throw new RuntimeException(key + " conflict with: " + array[hashKey] + ", key: " + hashKey); } array[hashKey] = new Entry<>(key, val); usedBucket += 1; } ``` :::info * **開放定址效能 ?** **最差情況 BIG O**：假設每次都插入都衝突，那第一個插入桶數為 1(colliosn 0 次)，第二個插入桶數為 2(colliosn 1 次)，第三個插入桶數為 3(colliosn 2 次)... > 最後一桶為 `N - 1`，因為永遠會保留一個空桶 * 桶比較次數：N 桶，每桶比較的次數相加，為 `1 + 2 + 3 + ... (N - 1)`，計算可以簡化為 `N * (N - 1) / 2` * 平均值：由於並不是每一桶都需要比較如此多次數，所以總桶數要再除以 `N`，來計算出每個桶所耗費的比較次數效率為 `N * (N - 1) / (2 * N)` 結果為 `(N - 1) / 2`，效能可以說是 O(N) ::: :::success * 另外經過他人實驗後，可以得出一個有趣的結論，當桶數 M 為 N 的兩倍時，效率最好，也就是 **M >= 2N 效率最好** ::: ### 鏈結 HashTable - get/put * 我們知道串鍊結的特性 * `頭尾添加`、`頭尾刪除`(prepending, appending)：不管是頭還是尾都只需要常數時間就可以完成 **O(1)** * `中間插入`、`中間移除`(deleting)：稍微麻煩一點，它需要先 **搜尋到目標節點 O(N)**，再替換/刪除數值 `O(1)`，也就是 O(N + 1)，可以說耗費 **O(N)** * 當 Array 遇到雜湊衝突時，使用 LinkedList 的方式解決，將衝突點放在串列頭 > ![](https://i.imgur.com/WbmLUxl.png) * 在這裡我們忽略 multi thread 不同步所造成的問題，關注於鏈結特性的 HashTable ```java= // LinkedHashTable.java public class LinkedHashTable<K, V> { private static final class Node<K, V> { private final K key; private V value; private Node<K, V> next; private Node(K key, V value) { this.key = key; this.value = value; } } private final int modulo; private final Node<K, V>[] hashArray; public LinkedHashTable(int modulo) { this.modulo = modulo; this.hashArray = new Node[modulo]; } private int getHashNum(Object obj) { return obj.hashCode() & 0x7FFF_FFFF; } public V get(K key) { int hashValue = getHashNum(key) % modulo; Node<K, V> node = hashArray[hashValue]; while (node != null) { if(node.key == key) { return node.value; } node = node.next; } return null; } public void put(K key, V value) { int hashValue = getHashNum(key) % modulo; Node<K, V> node = hashArray[hashValue]; // 搜尋衝突點是否有相同 key while (node != null) { // 有相同 key 就替換值 if(node.key == key) { node.value = value; return; } node = node.next; } Node<K, V> newNode = new Node<>(key, value); // 新節點 Node<K, V> oldNode = hashArray[hashValue]; // 判斷舊節點是否有值 if(oldNode != null) { // 有值則將新節點串在，舊節點的 Header newNode.next = oldNode; } hashArray[hashValue] = newNode; } public static void main(String[] args) { // 目標儲存學生 name & age LinkedHashTable<String, Integer> classStudent = new LinkedHashTable<>(); classStudent.put("Alien", 10); classStudent.put("Pan", 13); classStudent.put("Wind", 6); classStudent.put("Jasper", 7); System.out.println("Alien age: " + classStudent.get("Alien")); System.out.println("Pan age: " + classStudent.get("Pan")); System.out.println("Wind age: " + classStudent.get("Wind")); System.out.println("Jasper age: " + classStudent.get("Jasper")); } } ``` > ![](https://i.imgur.com/7hTaEyg.png) :::success * **串列鏈結 HashTable 的 `put`、`get` 耗費效能與開放定址相同! 都需耗費 O(N) 的時間** ::: ### 鏈結 HashTable - remove * 替上面的 `LinkedHashTable` 類，新增 remove 方法，remove 需要考慮兩個可能性 1. 需移除的點在 **鏈結開頭**：直接將 EntryNode 的下一個 Node 設定為 Header > Header 改變 > > ![](https://i.imgur.com/84pLih4.png) 2. 需移除的點在 **鏈結中間**：需要保存 PreNode (上一個節點)，並將找到的 EntryNode 2的下一個 Node 串上 PreNode > Header 不會變 > > ![](https://i.imgur.com/Nt4ZRON.png) ```java= // LinkedHashTable.java public Node<K, V> remove(K key) { int hashValue = getHashNum(key) % modulo; Node<K, V> node = hashArray[hashValue]; Node<K, V> pre = null; // 可用來判斷是否是第一個元素，如果是第一個元素 pre 就是 null while (node != null) { if(node.key == key) { if(pre != null) { // 狀況 2 pre.next = node.next; // 替換中間 } else { // 狀況 1 hashArray[hashValue] = node.next; // 替換 Header } // 移除 Entry 的下一個節點 node.next = null; return node; } pre = node; node = node.next; } return null; } ``` ### 開放定址 vs. 鏈結 Hash * **空間複雜度**：假設雜湊表 (HashKey) 大小為 M，而目標陣列長為 N * 開放定址：在 `N < M - 1` 的情況下成立 (超出則會拋出異常)，並且耗費空間固定為 M (雜湊表大小) > 空間複雜度為 `O(M)` * 鏈結 Hash：`N > M` 也成立，儲存雜湊值耗費空間為 M，並且每個雜湊值的原素都需要多保存一個 M (指向下一個) 所以基礎耗費空間為 2M，並且它可以無限延長下去，所以額外儲存空間也與 N 成正比 > 空間複雜度為 `O(M + N)` --- * **時間複雜度**：以最差情況下 BIG O 來看；假設雜湊表 (HashKey) 大小為 M，而目標陣列長為 N * 開放定址：假設 N = M - 1，每次加入新元素都衝突，所以比較的次數為 `1 + ... + N`，再除上陣列長度 `N`，效率為 `N * (N - 1) / (2 * N)` 結果為 `(N - 1) / 2` > 時間複雜度為 `O(N)` * 鏈結 Hash：每次加入新元素都衝突，基本上與開放定址相同，put/get 效能都是 O(N) > 時間複雜度為 `O(N)` ### 負載因子 - load factor * 假設雜湊表 (HashKey) 大小為 M，而目標陣列長為 N； **N / M 的比例就稱為負載因子** * 開放定址：負載因子一定小於 1 (N 最多只能接受 M - 1 的大小，也就是 `M-1 / M`) * 鏈結 Hash：負載因子可能超出 1 :::success **多年研究發現，負載因子超出 ==0.75== 後，效率就會降低** ::: ### 擴充雜湊表 - 幾何調整 :::info 幾何調整：將原數值 * 2；幾何調整完後數值 + 1 是為了讓雜湊表效率更好 ::: * 由於鏈表示可以動態擴充所以不用擔心，所以這邊主要是擴充不易調整的 **雜湊值 (Hash Array)** 的大小 1. 需記錄當前已經有多少個 HashKey 在數組中 2. 使用 Threshold 為 0.75 伐值 ```java= // 新增 Threshold // 不再為 final 必需可改 private int modulo; private Node<K, V>[] hashArray; // 記錄當前雜湊表上的原素大小 private int curHashKeyCount; private static final float threshold = 0.75f; // 固定伐值 0.75 public Threshold(int modulo) { this.curHashKeyCount = 0; this.modulo = modulo; this.hashArray = new Node[modulo]; } ``` * 調整後效能分析：由於調整後，需要將舊 Array 資料複製到新 Array，所以基本需要耗費 O(M) 的時間，並且 **==原本雜湊值可能會失效! 必須將舊值重新排列==** * 錯誤寫法：直接複製陣列，會導致舊值失效 ```java= // 新增 Threshold public void put(K key, V value) { int hashValue = getHashNum(key) % modulo; Node<K, V> node = hashArray[hashValue]; while (node != null) { if(node.key == key) { node.value = value; return; } node = node.next; } Node<K, V> newNode = new Node<>(key, value); Node<K, V> oldNode = hashArray[hashValue]; if(oldNode != null) { newNode.next = oldNode; } hashArray[hashValue] = newNode; curHashKeyCount += 1; // 計算當前雜湊大小 double curThreshold = ((double) (curHashKeyCount)) / ((double) (modulo)); // 超越伐值則進行調整 if(curThreshold > threshold) { // 幾何調整 int newSize = modulo * 2 + 1; Node<K, V>[] newArray = new Node[newSize]; // ERROR !! System.arraycopy(hashArray, 0, newArray, 0, hashArray.length); hashArray = newArray; } } ``` * 正確寫法：透過創建一個新對象作為暫存資料，再將舊有數據重新放入後，才將新 Array 賦予原先的 Array ```java= public void put(K key, V value) { int hashValue = getHashNum(key) % modulo; Node<K, V> node = hashArray[hashValue]; while (node != null) { if(node.key == key) { node.value = value; return; } node = node.next; } Node<K, V> newNode = new Node<>(key, value); Node<K, V> oldNode = hashArray[hashValue]; if(oldNode != null) { newNode.next = oldNode; } hashArray[hashValue] = newNode; curHashKeyCount += 1; double curThreshold = ((double) (curHashKeyCount)) / ((double) (modulo)); if(curThreshold > threshold) { int newSize = modulo * 2 + 1; // 暫存資料 Threshold<K, V> tmp = new Threshold<>(newSize); // 依照舊有雜湊值表，依序放入 for (Node<K, V> item : hashArray) { // Linked 可能會有深度，所以必須 while 取出 while (item != null) { // 放入 tmp.put(item.key, item.value); item = item.next; } } this.hashArray = tmp.hashArray; this.modulo = tmp.modulo; } } ``` ### 動態雜湊效能 - 均攤 O(1) * 從上面我們知道 `get`、`put` 存取 Hashtable 最差情況下表現皆為 O(N)，然而基於普遍的簡單均於雜湊 (simple uniform hashing)，**每個鑑 (key) 被至於任一桶中的機率相等** :::success * 每個 HashKey 下 Linked 的長度 ? 這個數學家已經證明，**每個 Linked 下的平均長度為 `N/M`** (M 是雜湊 Array、N 原數據是長度) 基於我們對於 threshold 的設計為 0.75，**讓 N 永遠小於 M，來達到 Linked 下的長度平均為 O(1)** ::: ## 完美雜湊所謂的完美雜湊就是在 **每個 HashKey 下，只會有一個 HashValue** (透過神奇的數學運算...) 為固定數量的鍵 (key) 設計獨特的桶引索位置，避免衝突，相對來說空間成本就比較高 ### 完美雜湊 - 範例 ```java= public class Perfect_hash { public final int[] G = {0, 8, 1, 4, 7,10, 2, 0, 9, 11, 1, 5}; // 兩個協助數組 (設計獨特的桶引索位置) public final int[] s1 = {9, 4, 8, 6, 6}; public final int[] s2 = {2, 10, 6, 3, 5}; public int hashCode(String str, int[] t) { int result = 0; char[] chars = str.toCharArray(); for(int i = 0; i < chars.length; i++) { char item = chars[i]; result += (t[i % t.length] * item); } return result % G.length; } // 完美雜湊：透過兩個子數組來計算 public int perfectHashCode(String key) { return (G[hashCode(key, s1)] + G[hashCode(key, s2)]) % G.length; } } ``` :::warning * 在這不清楚 s1、s2、G 數組是如何產生的 ? ::: * 測試幾個單辭 `a`、`by` ```java= public static void main(String[] args) { Perfect_hash ph = new Perfect_hash(); System.out.println("`a` use `s1` hashCode: " + ph.hashCode("a", ph.s1)); System.out.println("`a` use `s2` hashCode: " + ph.hashCode("a", ph.s2)); System.out.println("`a` perfect hashCode: " + ph.perfectHashCode("a") + "\n"); System.out.println("`by` use `s1` hashCode: " + ph.hashCode("by", ph.s1)); System.out.println("`by` use `s2` hashCode: " + ph.hashCode("by", ph.s2)); System.out.println("`by` perfect hashCode: " + ph.perfectHashCode("by")); } ``` > ![](https://i.imgur.com/cpaqZXz.png) * 測試字串 `a rose by any other name would smell as sweet`，在每個獨自的字串都可以求出一個獨一無二的雜湊值 ( 神奇數學 ~? ```java= public static void main(String[] args) { Perfect_hash ph = new Perfect_hash(); String testStr = "a rose by ant other name would smell as sweet"; for (String item : testStr.split(" ")) { System.out.println("`" + item + "` perfect hashCode: " + ph.perfectHashCode(item)); } } ``` > ![](https://i.imgur.com/eVpuJiS.png) ## Appendix & FAQ :::info ::: ###### tags: `資料結構`