KMP & Z-algo - HackMD

# KMP && Z-algorithm ## KMP ### 1. 基本概念： KMP 算法的基本思想是利用已知的匹配結果，避免在文本串中的不必要的匹配。這是通過 LPS (Longest Prefix Suffix) 陣列或表來實現的。 ### 2. LPS 陣列的建立： LPS 陣列對每個字符位置提供了一個數值，該數值表示當前字符之前的子串中最長的前綴和後綴的長度。此前綴和後綴不能包括整個子串。例如，對於字符串 "AAACAAAAC", LPS 陣列是 [0, 1, 2, 0, 1, 2, 3, 3, 4]。 ### 3. 使用 LPS 進行匹配：當在文本中查找模式時，如果發生失敗（即某個位置的字符不匹配），則不需要從文本的下一個字符重新開始匹配。通過查看模式的 LPS 陣列，我們可以知道其中的部分已經被匹配了。這允許我們跳過一些檢查。 ### 4. KMP 的優勢：與簡單的暴力搜尋相比，KMP 算法可以在 O(n) 的時間內完成字符串匹配，其中 n 是文本的長度。這是因為它跳過了已知匹配的部分，不需要重新檢查它們。 ### KMP 模板 ``` // 創建前綴後綴表 (LPS Array) 的函數 vector<int> buildLps(string s) { // 初始化 LPS 陣列，所有元素初值為 0 vector<int> lps(s.size(), 0); int i = 1; // 遍歷字符串的索引 int len = 0; // 當前最長的前墜和後墜的長度 while (i < s.size()) { // 如果當前字符和 len 對應的字符匹配，則增加 len if (s[i] == s[len]) { len++; lps[i] = len; i++; } // 如果不匹配且 len 不為 0，則回退 len 到前一個最長的匹配位置 else if (len != 0) { len = lps[len - 1]; } // 如果不匹配且 len 為 0，則繼續遍歷下一個字符 else { i++; } } return lps; } ``` ### 舉例 ``` #include <iostream> #include <vector> #include <string> using namespace std; // 創建前綴後綴表 (LPS Array) 的函數 vector<int> buildLps(string pattern) { vector<int> lps(pattern.size(), 0); int len = 0; int i = 1; while (i < pattern.size()) { if (pattern[i] == pattern[len]) { len++; lps[i] = len; i++; } else if (len != 0) { len = lps[len - 1]; } else { i++; } } return lps; } // KMP 字符串匹配 void KMPsearch(string text, string pattern) { int n = text.size(); int m = pattern.size(); vector<int> lps = buildLps(pattern); int i = 0; // index for text int j = 0; // index for pattern while (i < n) { if (pattern[j] == text[i]) { i++; j++; } if (j == m) { cout << "Pattern found at index " << i - j << endl; j = lps[j - 1]; } else if (i < n && pattern[j] != text[i]) { if (j != 0) { j = lps[j - 1]; } else { i++; } } } } int main() { string text = "ABABDABACDABABCABAB"; string pattern = "ABABCABAB"; KMPsearch(text, pattern); return 0; } ``` 影片說明: https://www.youtube.com/watch?v=af1oqpnH1vA&ab_channel=%E5%A5%87%E4%B9%90%E7%BC%96%E7%A8%8B%E5%AD%A6%E9%99%A2 <hr/> ## Z-algorithm ### 1. 基本概念：對於字符串 s 的每個位置，其 Z 值表示從該位置開始的子字符串與整個字符串的最長公共前綴的長度。例如，s = "aabcaabxaaaz" 的 Z 值為 [0, 1, 0, 0, 3, 1, 0, 0, 2, 2, 1, 0]。 ### 2. 如何計算 Z 值：當遍歷字符串以計算 Z 值時，我們維護一個窗口 [l, r]。這個窗口代表到目前為止知道的右邊界最遠的子字符串，該子字符串具有與開頭的子字符串一樣的最長公共前綴。當當前位置 i 落在 [l, r] 範圍內時，我們可以使用已知的 Z 值 z[i - l] 作為起始值，因為它和前面的子字符串有相同的最長公共前綴。然後，我們嘗試擴展從當前位置 i 開始的子字符串，並比較它與開頭的子字符串直到失敗或達到字符串的末尾。 ### 3. 使用 Z 值進行匹配： Z-algorithm 可以用於快速查找模式 p 在文本 t 中的所有出現。我們只需構造新的字符串 s = p + '$' + t，然後計算其 Z 值。對於每個與模式長度相等的 Z 值，都表示模式在文本中的一個匹配。 ### 4. Z-algorithm 的優勢： Z-algorithm 可以在 O(n) 的時間內計算整個字符串的 Z 值，其中 n 是字符串的長度。這是因為它利用已知的 Z 值進行部分匹配，避免了不必要的比較。 ## Z-algorithm ``` // 創建 z-algorithm table vector<int> buildZ(string s){ int n = s.size(); vector<int> z(n, 0); int l = 0, r = 0; for(int i=1; i<n; i++){ z[i] = max(0, min(r-i, z[i-l])); while(i+z[i]<n && s[z[i]] == s[i+z[i]]){ z[i]++; } if(i+z[i] > r){ r = i+z[i]; l = i; } } return z; } ``` ### 舉例 ``` #include <iostream> #include <vector> #include <string> using namespace std; // 創建 z-algorithm table vector<int> buildZ(string s){ int n = s.size(); vector<int> z(n, 0); int l = 0, r = 0; for(int i=1; i<n; i++){ z[i] = max(0, min(r-i, z[i-l])); while(i+z[i]<n && s[z[i]] == s[i+z[i]]){ z[i]++; } if(i+z[i] > r){ r = i+z[i]; l = i; } } return z; } // 使用 Z-algorithm 找到字符串中的所有模式匹配 void searchPattern(string text, string pattern) { string concat = pattern + "$" + text; vector<int> z = buildZ(concat); for(int i = 0; i < z.size(); i++) { if(z[i] == pattern.size()) { cout << "Pattern found at index: " << i - pattern.size() - 1 << endl; } } } int main() { string text = "ABABDABACDABABCABAB"; string pattern = "ABABCABAB"; searchPattern(text, pattern); return 0; } ``` ## 題目 [28. Find the Index of the First Occurrence in a String](https://leetcode.com/problems/find-the-index-of-the-first-occurrence-in-a-string/) => [解答](https://hackmd.io/@jacklee20499/BJJRrXgka) [796. Rotate String](https://leetcode.com/problems/rotate-string/) => [解答](https://hackmd.io/@jacklee20499/S1A7Umgkp) [459. Repeated Substring Pattern](https://leetcode.com/problems/repeated-substring-pattern/) => [解答](https://hackmd.io/@jacklee20499/HkASUQxJ6) [214. Shortest Palindrome](https://leetcode.com/problems/shortest-palindrome/) => [解答](https://hackmd.io/@jacklee20499/BkiJvXgJp) [1392. Longest Happy Prefix](https://leetcode.com/problems/longest-happy-prefix/) => [解答](https://hackmd.io/@jacklee20499/SJibv7ekT) ## Mock Interview Questions ### 1. Leetcode like ``` Description: Given two strings a and b we define a*b to be their concatenation. For example, if a = "abc" and b = "def" then a*b = "abcdef". If we think of concatenation as multiplication, exponentiation by a non-negative integer is defined in the normal way: a^0 = "" (the empty string) and a^(n+1) = a*(a^n). Input: String s. The characters in the strings are all lower-case letters., 1 <= s.length() <= 10^6 Output: Largest integer n such that s = a^n for some string a Example: s = “abd”, n = 1 s = “aaa”, n = 3 s = “abababab”, n = 4 s = “abacab”, n = 1 ``` ### 2. Additional ``` Description: In last question, if we get n = 1 for string s, find the minimum number k that if we add k characters behind s, it can make n of s become 2 Input: s represent a string with ascii characters, 1 <= s.length() <= 10^6 Output: Minimum integer k to make such that s = a^n for some string a and n >= 2 Example: s = “abc”, k = 3, add “abc” => “abcabc” s = “aabaa”, k = 1, add “b” => “aabaab” s = “aabbaabb”, k = 0 ``` ### 3. Little hard ``` Description: It is well known that AekdyCoin is good at string problems as well as number theory problems. When given a string s, we can write down all the non-empty prefixes of this string. For example: s: "abab" The prefixes are: "a", "ab", "aba", "abab" For each prefix, we can count the times it matches in s. So we can see that prefix "a" matches twice, "ab" matches twice too, "aba" matches once, and "abab" matches once. Now you are asked to calculate the sum of the match times for all the prefixes. For "abab", it is 2 + 2 + 1 + 1 = 6. Input: String s. The characters in the strings are all lower-case letters. Output: The sum of the match times ```