# Unit 3: String Matching Algorithms — Comprehensive Notes

---

## 3.1 Knuth-Morris-Pratt (KMP) Algorithm

### 1. Definition

The **Knuth-Morris-Pratt (KMP)** algorithm is a linear-time string matching algorithm that finds all occurrences of a pattern `P` of length `m` within a text `T` of length `n`. It achieves **O(n + m)** time complexity by preprocessing the pattern to build a **failure function (LPS array)** that allows the algorithm to skip redundant comparisons after a mismatch, without ever moving the text pointer backward.

### 2. Intuition

The naive string matching algorithm has a fundamental inefficiency: after a mismatch, it restarts the comparison from the next character in the text, **discarding all information** gathered from previous matches.

KMP's insight is simple but powerful:

> When several characters of the pattern have already matched the text, we already know exactly what those text characters are — they are the same as the beginning of the pattern. We can use this knowledge to skip ahead intelligently.

**Analogy:** Imagine you're looking for the word `"ABABC"` in a text. Suppose you've matched `"ABAB"` so far, and then the next character mismatches. Instead of starting from the beginning, you notice that the last `"AB"` in your matched portion is the same as the first `"AB"` of the pattern. So you can continue matching from position 2 of the pattern, saving comparisons.

The **LPS array** formalizes this idea by precomputing, for every prefix of the pattern, the length of the longest proper prefix that is also a suffix.

### 3. Key Concepts / Properties

**Proper Prefix / Proper Suffix:**

- A *proper prefix* of a string is any prefix except the string itself.
- A *proper suffix* is any suffix except the string itself.
- Example: For `"ABCAB"`, proper prefixes are `{"A", "AB", "ABC", "ABCA"}` and proper suffixes are `{"B", "AB", "CAB", "BCAB"}`.

**LPS (Longest Proper Prefix which is also Suffix) Array:**

- `lps[i]` = length of the longest proper prefix of `P[0..i]` that is also a suffix of `P[0..i]`.
- `lps[0] = 0` always (a single character has no proper prefix).

**Example:** For pattern `P = "ABABCABAB"`:

| Index | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 |
|-------|---|---|---|---|---|---|---|---|---|
| Char  | A | B | A | B | C | A | B | A | B |
| LPS   | 0 | 0 | 1 | 2 | 0 | 1 | 2 | 3 | 4 |

**Key Properties:**

- The LPS array never "wastes" work — each character in the pattern is processed a constant number of times.
- `lps[i] <= i` always.
- If `lps[i] = k`, then `P[0..k-1] == P[i-k+1..i]`.
- The text pointer `i` **never decreases** during matching — this is the core reason for linear time.

**Edge Cases:**

- Empty pattern → match at every position (handle separately).
- Pattern longer than text → no matches.
- Pattern equal to text → exactly one match.
- Overlapping matches (e.g., pattern `"AA"` in text `"AAAA"` has matches at indices 0, 1, 2).
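As a quick sanity check on the LPS table above, the values can be recomputed directly from the definition: for each prefix, try every proper prefix length and test whether it is also a suffix. The sketch below is a deliberately brute-force (quadratic) verification helper, not the linear-time construction that follows in Section 4; the name `lpsBruteForce` is only illustrative.

```cpp
#include <bits/stdc++.h>
using namespace std;

// Brute-force LPS, straight from the definition: for each prefix P[0..i], try every
// proper prefix length k and keep the largest k whose prefix equals the suffix
// P[i-k+1..i]. Quadratic; intended only for checking small examples.
vector<int> lpsBruteForce(const string &p) {
    int m = p.size();
    vector<int> lps(m, 0);
    for (int i = 0; i < m; i++)
        for (int k = 1; k <= i; k++)
            if (p.compare(0, k, p, i - k + 1, k) == 0)
                lps[i] = k;                        // keep the longest such k
    return lps;
}

int main() {
    for (int v : lpsBruteForce("ABABCABAB")) cout << v << ' ';
    cout << '\n';                                  // prints: 0 0 1 2 0 1 2 3 4
    return 0;
}
```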
### 4. Algorithm / Technique

The algorithm has **two phases**:

#### Phase 1 — Build the LPS Array (Preprocessing)

```
1. Initialize lps[0] = 0, len = 0, i = 1
2. While i < m:
   a. If P[i] == P[len]:
        len++
        lps[i] = len
        i++
   b. Else if len != 0:
        len = lps[len - 1]       // fall back, do NOT increment i
   c. Else:
        lps[i] = 0
        i++
```

The crucial step is **2b**: when there's a mismatch but we've made some progress, we don't restart — we fall back using the already-computed LPS values.

#### Phase 2 — Matching

```
1. i = 0 (text index), j = 0 (pattern index)
2. While i < n:
   a. If T[i] == P[j]:
        i++, j++
        If j == m:                        // full match found
            record match at position i - j
            j = lps[j - 1]
   b. Else:
        If j != 0:
            j = lps[j - 1]                // use LPS to skip
        Else:
            i++
```

### 5. Time & Space Complexity

| Aspect | Complexity |
|---|---|
| LPS preprocessing | **O(m)** |
| Matching | **O(n)** |
| **Total time** | **O(n + m)** |
| Space | **O(m)** for the LPS array |

**Why O(n + m)?** In the matching phase, `i` only increases. `j` can decrease, but it can only decrease as many times as it increased. So total operations are bounded by `2n`. The same amortized argument applies to LPS construction.

**Best / Average / Worst cases are all O(n + m)** — KMP is consistent, which is a major strength over naive matching (which is O(nm) worst-case).

### 6. Code Implementation (C++)

```cpp
#include <bits/stdc++.h>
using namespace std;

// Build LPS array for pattern
vector<int> buildLPS(const string &pat) {
    int m = pat.size();
    vector<int> lps(m, 0);
    int len = 0, i = 1;
    while (i < m) {
        if (pat[i] == pat[len]) {
            lps[i++] = ++len;
        } else if (len != 0) {
            len = lps[len - 1];          // fall back
        } else {
            lps[i++] = 0;
        }
    }
    return lps;
}

// Returns all starting indices where pat occurs in txt
vector<int> KMP(const string &txt, const string &pat) {
    vector<int> matches;
    int n = txt.size(), m = pat.size();
    if (m == 0 || m > n) return matches;
    vector<int> lps = buildLPS(pat);
    int i = 0, j = 0;
    while (i < n) {
        if (txt[i] == pat[j]) {
            i++; j++;
            if (j == m) {
                matches.push_back(i - j);
                j = lps[j - 1];          // continue searching for overlaps
            }
        } else if (j != 0) {
            j = lps[j - 1];
        } else {
            i++;
        }
    }
    return matches;
}

int main() {
    string txt = "ABABDABACDABABCABAB";
    string pat = "ABABCABAB";
    vector<int> res = KMP(txt, pat);
    for (int idx : res) cout << "Match at index " << idx << "\n";
    return 0;
}
```

### 7. Example Walkthrough

**Pattern:** `P = "ABABC"`, **Text:** `T = "ABABABC"`

**Step 1 — Build LPS for `"ABABC"`:**

| i | P[i] | len | Comparison | lps[i] |
|---|------|-----|------------|--------|
| 0 | A | 0 | — | 0 |
| 1 | B | 0 | B ≠ A, len=0 | 0 |
| 2 | A | 0 | A == A, len=1 | 1 |
| 3 | B | 1 | B == B, len=2 | 2 |
| 4 | C | 2 | C ≠ A, fallback len=lps[1]=0, then C≠A, len=0 | 0 |

**LPS = [0, 0, 1, 2, 0]**

**Step 2 — Matching:**

| i | j | T[i] | P[j] | Action |
|---|---|------|------|--------|
| 0 | 0 | A | A | match → i=1, j=1 |
| 1 | 1 | B | B | match → i=2, j=2 |
| 2 | 2 | A | A | match → i=3, j=3 |
| 3 | 3 | B | B | match → i=4, j=4 |
| 4 | 4 | A | C | mismatch, j=lps[3]=2 |
| 4 | 2 | A | A | match → i=5, j=3 |
| 5 | 3 | B | B | match → i=6, j=4 |
| 6 | 4 | C | C | match → i=7, j=5 ⇒ **match at index 7−5=2** |

Notice: at the mismatch in row 5 (i=4, j=4), instead of backtracking `i` to 2, we simply kept `i=4` and jumped `j` back to 2 using `lps[3]=2`. This is the key optimization.

### 8. Common Mistakes

1. **Incrementing `i` inside the fallback branch of LPS construction.** When `P[i] != P[len]` and `len > 0`, we must only update `len = lps[len-1]` — **not** increment `i`. Otherwise the LPS values become wrong.
2. **Using `lps[j]` instead of `lps[j-1]`.** After matching `j` characters (positions `0..j-1`), the relevant LPS value is `lps[j-1]`, not `lps[j]`.
3. **Not handling overlapping matches.** After finding a match, set `j = lps[j-1]` rather than `j = 0` to continue finding overlaps like `"AA"` in `"AAAA"` (see the sketch after this list).
4. **Off-by-one in match position.** The starting index of a match is `i - m`, not `i - m + 1`.
5. **Confusing LPS with Z-function.** Both are prefix-based preprocessing tools but they are not interchangeable.
6. **Forgetting edge cases** such as empty pattern or pattern longer than text.
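A quick way to check mistake 3 in practice is the overlapping-match edge case from Section 3: `"AA"` in `"AAAA"`. A minimal sketch, assuming `buildLPS` and `KMP` from the Section 6 listing are in the same file (use this in place of that listing's `main`):

```cpp
// Assumes buildLPS() and KMP() from Section 6 are defined above this point.
int main() {
    // "AA" occurs in "AAAA" at indices 0, 1 and 2.
    // Resetting j to lps[j-1] after each match (rather than to 0) is what finds all three.
    for (int idx : KMP("AAAA", "AA"))
        cout << idx << ' ';                 // prints: 0 1 2
    cout << '\n';
    return 0;
}
```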
### 9. Practice Problems

| Problem | Platform | Difficulty |
|---|---|---|
| Implement strStr() | LeetCode 28 | Easy |
| Prefix Function — Knuth-Morris-Pratt | CSES (1732) | Easy–Medium |
| Password (pattern as both prefix and suffix) | Codeforces 126B | Medium |
| Period (compression using LPS) | UVa 455 / POJ 1961 | Medium |
| Shortest Palindrome | LeetCode 214 | Hard |

---

## 3.2 Rabin-Karp Algorithm

### 1. Definition

The **Rabin-Karp algorithm** is a string matching algorithm that uses **hashing** to find a pattern `P` of length `m` within a text `T` of length `n`. Instead of comparing substrings character-by-character, it compares **hash values**, and only performs character comparison when hashes match. Using a **rolling hash**, each window's hash is computed from the previous in O(1), giving an expected total time of **O(n + m)**.

### 2. Intuition

Comparing two strings of length `m` takes O(m) time. But comparing two integers takes O(1). If we can represent every substring of length `m` by a single integer (its hash), then substring comparison becomes constant time.

The key challenge: computing the hash of every window `T[i..i+m-1]` naively costs O(m) per window, giving O(nm) total — no better than naive matching. The breakthrough is the **rolling hash**: if we know `hash(T[i..i+m-1])`, we can compute `hash(T[i+1..i+m])` in **O(1)** by removing the contribution of the leaving character and adding the contribution of the entering character.

> Think of the hash as a number in base `b`. Sliding the window one position to the right is like dropping the most significant digit and appending a new least significant digit.

Since hash collisions are possible, whenever hashes match we **verify** character-by-character. With a well-chosen hash, collisions are rare, keeping expected time linear.

### 3. Key Concepts / Properties

**Polynomial Rolling Hash:** For a string `S = s₀ s₁ ... s_{m-1}` and base `b`, modulus `M`:

```
hash(S) = (s₀·b^(m−1) + s₁·b^(m−2) + ... + s_{m−1}·b^0) mod M
```

**Rolling Update Formula:** Given `hash(T[i..i+m-1])`, compute `hash(T[i+1..i+m])`:

```
new_hash = ((old_hash − T[i]·b^(m−1)) · b + T[i+m]) mod M
```

**Design Choices:**

- **Base `b`**: typically a prime larger than the alphabet size (e.g., 31, 53, 131, 257).
- **Modulus `M`**: a large prime to minimize collisions (e.g., 10⁹+7, 10⁹+9, or 2⁶¹−1).
- Always take `mod M` and handle negative values: `((x % M) + M) % M`.

**Collision Handling:**

- A single hash has a collision probability roughly `1/M`. For large `M` this is small but not zero.
- **Double hashing** (two independent hashes) reduces collision probability to ~`1/M²`.
- Always **verify** on hash match unless you are doing probabilistic matching.

**Edge Cases:**

- Integer overflow if base and modulus are not carefully combined — use `long long` in C++.
- Using weak moduli (like `2^32`) is vulnerable to adversarial anti-hash tests on Codeforces.
- Pattern longer than text → no matches.

### 4. Algorithm / Technique

**Single Pattern Matching:**

```
1. Compute hash(P) = hP
2. Compute hash(T[0..m-1]) = hT
3. Precompute b^(m-1) mod M
4. For i = 0 to n - m:
     If hT == hP:
         Verify T[i..i+m-1] == P character-by-character
         If equal → record match
     If i < n - m:
         Roll the hash:
         hT = ((hT − T[i]·b^(m-1)) · b + T[i+m]) mod M
         (normalize to be non-negative)
```

**Multiple Pattern Matching (Rabin-Karp's real strength):**

Given `k` patterns all of length `m`:

1. Compute hashes of all patterns and insert them into a `hash_set` (or `unordered_map<hash, pattern>`).
2. Slide a single window over the text, computing the rolling hash.
3. For each window hash, check if it exists in the set.
4. On hit, verify character-by-character (to handle collisions).

This runs in **O(n + k·m)** expected time, far better than running KMP `k` times (which would be `O(k·n)`).

*Note:* If patterns have different lengths, run Rabin-Karp separately per distinct length, or use more advanced methods like Aho-Corasick.
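A minimal sketch of this multi-pattern variant, assuming all patterns share one length `m`, a single polynomial hash, and verification on every hit; the helper name `multiPatternRK` is illustrative, not a standard API:

```cpp
#include <bits/stdc++.h>
using namespace std;

// Multi-pattern Rabin-Karp sketch: every pattern must have the same length m.
// Returns (text index, pattern index) pairs for each verified occurrence.
vector<pair<int,int>> multiPatternRK(const string &txt, const vector<string> &pats) {
    vector<pair<int,int>> hits;
    if (pats.empty()) return hits;
    int n = txt.size(), m = pats[0].size();
    if (m == 0 || m > n) return hits;

    const long long BASE = 131, MOD = 1000000007LL;

    // Polynomial hash of s[from..from+m-1].
    auto hashOf = [&](const string &s, int from) {
        long long h = 0;
        for (int k = 0; k < m; k++) h = (h * BASE + s[from + k]) % MOD;
        return h;
    };

    // Step 1: hash every pattern and group pattern indices by hash value.
    unordered_map<long long, vector<int>> table;
    for (int p = 0; p < (int)pats.size(); p++)
        table[hashOf(pats[p], 0)].push_back(p);

    long long pw = 1;                                    // BASE^(m-1) % MOD
    for (int k = 0; k < m - 1; k++) pw = (pw * BASE) % MOD;

    // Steps 2-4: slide one window, look its hash up, verify on every hit.
    long long h = hashOf(txt, 0);
    for (int i = 0; i + m <= n; i++) {
        auto it = table.find(h);
        if (it != table.end())
            for (int p : it->second)
                if (txt.compare(i, m, pats[p]) == 0)     // rule out collisions
                    hits.push_back({i, p});
        if (i + m < n)                                   // roll to the next window
            h = ((h - txt[i] * pw % MOD + MOD) % MOD * BASE + txt[i + m]) % MOD;
    }
    return hits;
}

int main() {
    vector<string> pats = {"ABAB", "BCAB"};
    for (auto &hit : multiPatternRK("ABABCABAB", pats))
        cout << "pattern " << hit.second << " at index " << hit.first << "\n";
    // Expected: pattern 0 at index 0, pattern 1 at index 3, pattern 0 at index 5.
    return 0;
}
```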
### 5. Time & Space Complexity

| Case | Time |
|---|---|
| Best / Average | **O(n + m)** |
| Worst (many collisions) | **O(n · m)** |
| Multiple patterns (k patterns, length m) | **O(n + k·m)** expected |
| Space | **O(m)** (or **O(k)** for the hash set) |

The worst case occurs with adversarial input or a poorly chosen hash function. Using double hashing with large primes makes the worst case extremely unlikely in practice.

### 6. Code Implementation (C++)

```cpp
#include <bits/stdc++.h>
using namespace std;

vector<int> rabinKarp(const string &txt, const string &pat) {
    vector<int> matches;
    int n = txt.size(), m = pat.size();
    if (m == 0 || m > n) return matches;

    const long long BASE = 131;
    const long long MOD = 1000000007LL;

    // Precompute BASE^(m-1) % MOD
    long long h = 1;
    for (int i = 0; i < m - 1; i++) h = (h * BASE) % MOD;

    // Initial hashes
    long long hashPat = 0, hashTxt = 0;
    for (int i = 0; i < m; i++) {
        hashPat = (hashPat * BASE + pat[i]) % MOD;
        hashTxt = (hashTxt * BASE + txt[i]) % MOD;
    }

    for (int i = 0; i <= n - m; i++) {
        if (hashPat == hashTxt) {
            // Verify to avoid false positives
            if (txt.compare(i, m, pat) == 0) {
                matches.push_back(i);
            }
        }
        if (i < n - m) {
            hashTxt = ( (hashTxt - txt[i] * h) * BASE + txt[i + m] ) % MOD;
            if (hashTxt < 0) hashTxt += MOD;   // normalize
        }
    }
    return matches;
}

int main() {
    string txt = "GEEKS FOR GEEKS";
    string pat = "GEEK";
    vector<int> res = rabinKarp(txt, pat);
    for (int idx : res) cout << "Match at index " << idx << "\n";
    return 0;
}
```

**For production-grade competitive programming, use double hashing:**

```cpp
pair<long long,long long> computeHash(const string &s,
                                      long long B1, long long M1,
                                      long long B2, long long M2) {
    long long h1 = 0, h2 = 0;
    for (char c : s) {
        h1 = (h1 * B1 + c) % M1;
        h2 = (h2 * B2 + c) % M2;
    }
    return {h1, h2};
}
```

### 7. Example Walkthrough

**Pattern:** `P = "CAB"`, **Text:** `T = "ABCCABC"`, **Base:** `b = 10`, **Modulus:** `M = 13`. (Using simplified values where `'A'=1, 'B'=2, 'C'=3` for illustration.)

**Step 1 — Pattern hash:**

```
hash("CAB") = (3·10² + 1·10¹ + 2·10⁰) mod 13
            = (300 + 10 + 2) mod 13
            = 312 mod 13 = 0
```

**Step 2 — First window `T[0..2] = "ABC"`:**

```
hash("ABC") = (1·100 + 2·10 + 3) mod 13 = 123 mod 13 = 6
```

6 ≠ 0 → no match.

**Step 3 — Roll to window `T[1..3] = "BCC"`:**

```
h = b^(m-1) = 100
new_hash = ((6 − 1·100)·10 + 3) mod 13
         = (−94·10 + 3) mod 13
         = −937 mod 13
```

Normalize: `−937 mod 13 = 12` (since `13·73 = 949` and `−937 + 949 = 12`). So hash = 12 ≠ 0.

**Step 4 — Roll to window `T[2..4] = "CCA"`:**

```
new_hash = ((12 − 2·100)·10 + 1) mod 13 = −1879 mod 13 = 6 ≠ 0
```

**Step 5 — Roll to window `T[3..5] = "CAB"`:**

```
hash("CAB") = 0 = hashPat → candidate match
```

**Verify** character-by-character: `T[3..5] = "CAB" == P` → **match at index 3**.

This demonstrates how hashing lets us skip full string comparison on windows where hashes differ.
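The numbers above are easy to reproduce in code. A small sketch using the same toy parameters (`b = 10`, `M = 13`, `'A'=1, 'B'=2, 'C'=3`):

```cpp
#include <bits/stdc++.h>
using namespace std;

int main() {
    string T = "ABCCABC", P = "CAB";
    const long long B = 10, M = 13;                       // toy base and modulus
    int n = T.size(), m = P.size();
    auto val = [](char c) { return (long long)(c - 'A' + 1); };  // 'A'=1,'B'=2,'C'=3

    long long pw = 1;                                     // B^(m-1) mod M
    for (int i = 0; i < m - 1; i++) pw = (pw * B) % M;

    long long hP = 0, hT = 0;                             // pattern hash, window hash
    for (int i = 0; i < m; i++) {
        hP = (hP * B + val(P[i])) % M;
        hT = (hT * B + val(T[i])) % M;
    }
    cout << "hash(P) = " << hP << "\n";                   // 0

    for (int i = 0; i + m <= n; i++) {
        cout << T.substr(i, m) << " -> " << hT;
        if (hT == hP && T.compare(i, m, P) == 0) cout << "   <-- match at index " << i;
        cout << "\n";
        if (i + m < n)                                    // roll to the next window
            hT = ((hT - val(T[i]) * pw % M + M) % M * B + val(T[i + m])) % M;
    }
    // Output: ABC -> 6, BCC -> 12, CCA -> 6, CAB -> 0 (match at index 3), ABC -> 6
    return 0;
}
```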
### 8. Common Mistakes

1. **Negative mod values in C++.** In C++, `(-5) % 13 == -5`, not `8`. Always normalize with `((x % M) + M) % M` after subtraction.
2. **Forgetting to verify on hash match.** Hash collisions exist. Without verification, you get false positives.
3. **Weak hash functions.** Using `M = 2^32` (natural overflow) is unsafe on Codeforces — adversarial tests can force collisions. Use large primes (`10⁹+7`, `10⁹+9`, `2⁶¹−1`).
4. **Not precomputing `b^(m-1)`.** Recomputing it inside the loop makes the algorithm O(n·m).
5. **Integer overflow.** `BASE * hash` can overflow 32-bit `int`. Use `long long` consistently.
6. **Using the same base and mod for two hashes in "double hashing".** They must be independent primes to give real protection.
7. **Off-by-one in the rolling formula.** Make sure `h = BASE^(m-1)`, not `BASE^m`.
8. **Choosing base ≤ alphabet size.** A base like 26 can cause collisions between shifted strings. Prefer a prime > alphabet size.

### 9. Practice Problems

| Problem | Platform | Difficulty |
|---|---|---|
| Repeated DNA Sequences | LeetCode 187 | Medium |
| String Matching | CSES (1753) | Medium |
| Longest Duplicate Substring | LeetCode 1044 | Hard |
| Password (hashing + periodicity) | Codeforces 126B | Medium |
| Good Substrings / Distinct Substrings | CSES (2105) | Medium–Hard |

---

## Quick Comparison: KMP vs Rabin-Karp

| Criterion | KMP | Rabin-Karp |
|---|---|---|
| Preprocessing | O(m) LPS array | O(m) hash |
| Matching time | O(n) **deterministic** | O(n + m) **expected**, O(nm) worst |
| Multiple patterns | Run per pattern: O(k·n) | O(n + k·m) with a hash set — **much better** |
| Implementation | Slightly tricky (LPS logic) | Easier, but hashing pitfalls |
| Determinism | Always linear | Probabilistic unless verified carefully |
| Extensibility | Hard to adapt | Easy (substring hashing, comparing substrings in O(1)) |

**When to use which:**

- **Single pattern, guaranteed linear time required** → KMP.
- **Multiple patterns of same length** → Rabin-Karp with a hash set.
- **Need to compare arbitrary substrings of a string** → Rabin-Karp (polynomial hashing) is far more flexible (see the sketch below).
- **Competitive programming** → Rabin-Karp (with double hashing) is often preferred due to flexibility; KMP remains essential for problems built around the failure function (period, border, compression).
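To illustrate the extensibility point: once prefix hashes of a string are precomputed in O(n), the hash of any substring is available in O(1), so two substrings can be compared in O(1) expected time (verify, or use a second hash, when a guarantee is needed). A minimal single-hash sketch with illustrative names:

```cpp
#include <bits/stdc++.h>
using namespace std;

// Prefix hashes of s: pre[i] = hash of s[0..i-1], pw[i] = BASE^i (both mod MOD).
// After O(n) precomputation, get(l, r) returns the hash of s[l..r] in O(1).
struct PrefixHash {
    static constexpr long long BASE = 131, MOD = 1000000007LL;
    vector<long long> pre, pw;
    PrefixHash(const string &s) : pre(s.size() + 1, 0), pw(s.size() + 1, 1) {
        for (size_t i = 0; i < s.size(); i++) {
            pre[i + 1] = (pre[i] * BASE + s[i]) % MOD;
            pw[i + 1]  = (pw[i] * BASE) % MOD;
        }
    }
    long long get(int l, int r) const {                  // hash of s[l..r], inclusive
        return ((pre[r + 1] - pre[l] * pw[r - l + 1]) % MOD + MOD) % MOD;
    }
};

int main() {
    string s = "ABABCABAB";
    PrefixHash H(s);
    // s[0..3] and s[5..8] are both "ABAB": their hashes agree, and the comparison
    // costs O(1); on a hash match, verify or double-hash if certainty matters.
    cout << (H.get(0, 3) == H.get(5, 8) ? "equal" : "different") << "\n";   // equal
    return 0;
}
```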