# Unit 3: String Matching Algorithms — Comprehensive Notes
---
## 3.1 Knuth-Morris-Pratt (KMP) Algorithm
### 1. Definition
The **Knuth-Morris-Pratt (KMP)** algorithm is a linear-time string matching algorithm that finds all occurrences of a pattern `P` of length `m` within a text `T` of length `n`. It achieves **O(n + m)** time complexity by preprocessing the pattern to build a **failure function (LPS array)** that allows the algorithm to skip redundant comparisons after a mismatch, without ever moving the text pointer backward.
### 2. Intuition
The naive string matching algorithm has a fundamental inefficiency: after a mismatch, it restarts the comparison from the next character in the text, **discarding all information** gathered from previous matches.
KMP's insight is simple but powerful:
> When several characters of the pattern have already matched the text, we already know exactly what those text characters are — they are the same as the beginning of the pattern. We can use this knowledge to skip ahead intelligently.
**Analogy:** Imagine you're looking for the word `"ABABC"` in a text. Suppose you've matched `"ABAB"` so far, and then the next character mismatches. Instead of starting from the beginning, you notice that the last `"AB"` in your matched portion is the same as the first `"AB"` of the pattern. So you can continue matching from position 2 of the pattern, saving comparisons.
The **LPS array** formalizes this idea by precomputing, for every prefix of the pattern, the length of the longest proper prefix that is also a suffix.
### 3. Key Concepts / Properties
**Proper Prefix / Proper Suffix:**
- A *proper prefix* of a string is any prefix except the string itself.
- A *proper suffix* is any suffix except the string itself.
- Example: For `"ABCAB"`, proper prefixes are `{"A", "AB", "ABC", "ABCA"}` and proper suffixes are `{"B", "AB", "CAB", "BCAB"}`.
**LPS (Longest Proper Prefix which is also Suffix) Array:**
- `lps[i]` = length of the longest proper prefix of `P[0..i]` that is also a suffix of `P[0..i]`.
- `lps[0] = 0` always (the only proper prefix of a one-character string is the empty string).
**Example:** For pattern `P = "ABABCABAB"`:
| Index | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 |
|-------|---|---|---|---|---|---|---|---|---|
| Char | A | B | A | B | C | A | B | A | B |
| LPS | 0 | 0 | 1 | 2 | 0 | 1 | 2 | 3 | 4 |
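To make the definition concrete, here is a minimal brute-force sketch (our own helper, `bruteLPS`, O(m²) and meant only for checking tables like the one above, not for use inside KMP):

```cpp
#include <bits/stdc++.h>
using namespace std;

// Brute-force LPS, straight from the definition: lps[i] is the largest k <= i
// such that P[0..k-1] == P[i-k+1..i] (longest proper prefix that is also a suffix).
vector<int> bruteLPS(const string &P) {
    int m = P.size();
    vector<int> lps(m, 0);
    for (int i = 0; i < m; i++)
        for (int k = 1; k <= i; k++)                       // k <= i keeps the prefix proper
            if (P.compare(0, k, P, i - k + 1, k) == 0)     // compare P[0..k-1] with P[i-k+1..i]
                lps[i] = k;                                // keep the largest matching k
    return lps;
}

int main() {
    for (int x : bruteLPS("ABABCABAB")) cout << x << ' ';  // prints: 0 0 1 2 0 1 2 3 4
    cout << '\n';
}
```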
**Key Properties:**
- The LPS construction never "wastes" work — every fallback (`len = lps[len-1]`) is paid for by an earlier increment of `len`, so the total work is linear (amortized O(1) per character).
- `lps[i] <= i` always.
- If `lps[i] = k`, then `P[0..k-1] == P[i-k+1..i]`.
- The text pointer `i` **never decreases** during matching — this is the core reason for linear time.
**Edge Cases:**
- Empty pattern → match at every position (handle separately).
- Pattern longer than text → no matches.
- Pattern equal to text → exactly one match.
- Overlapping matches (e.g., pattern `"AA"` in text `"AAAA"` has matches at indices 0, 1, 2).
### 4. Algorithm / Technique
The algorithm has **two phases**:
#### Phase 1 — Build the LPS Array (Preprocessing)
```
1. Initialize lps[0] = 0, len = 0, i = 1
2. While i < m:
a. If P[i] == P[len]:
len++
lps[i] = len
i++
b. Else if len != 0:
len = lps[len - 1] // fall back, do NOT increment i
c. Else:
lps[i] = 0
i++
```
The crucial step is **2b**: when there's a mismatch but we've made some progress, we don't restart — we fall back using the already-computed LPS values.
#### Phase 2 — Matching
```
1. i = 0 (text index), j = 0 (pattern index)
2. While i < n:
a. If T[i] == P[j]:
i++, j++
If j == m: // full match found
record match at position i - j
j = lps[j - 1]
b. Else:
If j != 0:
j = lps[j - 1] // use LPS to skip
Else:
i++
```
### 5. Time & Space Complexity
| Aspect | Complexity |
|---|---|
| LPS preprocessing | **O(m)** |
| Matching | **O(n)** |
| **Total time** | **O(n + m)** |
| Space | **O(m)** for the LPS array |
**Why O(n + m)?** In the matching phase, `i` only increases. `j` can decrease, but it can only decrease as many times as it increased. So total operations are bounded by `2n`. The same amortized argument applies to LPS construction.
**Best / Average / Worst cases are all O(n + m)** — KMP is consistent, which is a major strength over naive matching (which is O(nm) worst-case).
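To see the gap this argument closes, here is a small sketch (our own, with arbitrarily chosen `n` and `m`) that counts character comparisons made by naive matching on its classic worst case; on the same input, KMP performs at most about `2n` comparisons:

```cpp
#include <bits/stdc++.h>
using namespace std;

// Worst case for naive matching: text = "AAA...A", pattern = "AAA...AB".
// Every alignment matches m-1 characters before failing on the final 'B'.
int main() {
    int n = 2000, m = 20;
    string T(n, 'A');
    string P(m - 1, 'A'); P += 'B';

    long long comparisons = 0;
    for (int s = 0; s + m <= n; s++) {
        int j = 0;
        while (j < m && T[s + j] == P[j]) { comparisons++; j++; }
        if (j < m) comparisons++;          // count the mismatching comparison too
    }
    cout << comparisons << " comparisons, roughly n*m = " << 1LL * n * m << "\n";
}
```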
### 6. Code Implementation (C++)
```cpp
#include <bits/stdc++.h>
using namespace std;
// Build LPS array for pattern
vector<int> buildLPS(const string &pat) {
int m = pat.size();
vector<int> lps(m, 0);
int len = 0, i = 1;
while (i < m) {
if (pat[i] == pat[len]) {
lps[i++] = ++len;
} else if (len != 0) {
len = lps[len - 1]; // fall back
} else {
lps[i++] = 0;
}
}
return lps;
}
// Returns all starting indices where pat occurs in txt
vector<int> KMP(const string &txt, const string &pat) {
vector<int> matches;
int n = txt.size(), m = pat.size();
if (m == 0 || m > n) return matches;
vector<int> lps = buildLPS(pat);
int i = 0, j = 0;
while (i < n) {
if (txt[i] == pat[j]) {
i++; j++;
if (j == m) {
matches.push_back(i - j);
j = lps[j - 1]; // continue searching for overlaps
}
} else if (j != 0) {
j = lps[j - 1];
} else {
i++;
}
}
return matches;
}
int main() {
string txt = "ABABDABACDABABCABAB";
string pat = "ABABCABAB";
vector<int> res = KMP(txt, pat);
for (int idx : res) cout << "Match at index " << idx << "\n";
return 0;
}
```
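On the sample input in `main` (`txt = "ABABDABACDABABCABAB"`, `pat = "ABABCABAB"`), the program prints `Match at index 10`, the single occurrence of the pattern in the text.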
### 7. Example Walkthrough
**Pattern:** `P = "ABABC"`, **Text:** `T = "ABABABC"`
**Step 1 — Build LPS for `"ABABC"`:**
| i | P[i] | len | Comparison | lps[i] |
|---|------|-----|------------|--------|
| 0 | A | 0 | — | 0 |
| 1 | B | 0 | B ≠ A, len=0 | 0 |
| 2 | A | 0 | A == A, len=1 | 1 |
| 3 | B | 1 | B == B, len=2 | 2 |
| 4 | C | 2 | C ≠ A, fallback len=lps[1]=0, then C≠A, len=0 | 0 |
**LPS = [0, 0, 1, 2, 0]**
**Step 2 — Matching:**
| i | j | T[i] | P[j] | Action |
|---|---|------|------|--------|
| 0 | 0 | A | A | match → i=1, j=1 |
| 1 | 1 | B | B | match → i=2, j=2 |
| 2 | 2 | A | A | match → i=3, j=3 |
| 3 | 3 | B | B | match → i=4, j=4 |
| 4 | 4 | A | C | mismatch, j=lps[3]=2 |
| 4 | 2 | A | A | match → i=5, j=3 |
| 5 | 3 | B | B | match → i=6, j=4 |
| 6 | 4 | C | C | match → i=7, j=5 ⇒ **match at index 7−5=2** |
Notice: at the mismatch (row `i=4, j=4`), instead of moving the text pointer backward and re-comparing from scratch, we kept `i=4` and jumped `j` back to 2 using `lps[3]=2`, implicitly re-aligning the pattern to start at text index `i − j = 2`. This is the key optimization.
### 8. Common Mistakes
1. **Incrementing `i` inside the fallback branch of LPS construction.** When `P[i] != P[len]` and `len > 0`, we must only update `len = lps[len-1]` — **not** increment `i`. Otherwise the LPS values become wrong.
2. **Using `lps[j]` instead of `lps[j-1]`.** After matching `j` characters (positions `0..j-1`), the relevant LPS value is `lps[j-1]`, not `lps[j]`.
3. **Not handling overlapping matches.** After finding a match, set `j = lps[j-1]` rather than `j = 0` to continue finding overlaps like `"AA"` in `"AAAA"`.
4. **Off-by-one in match position.** The starting index of a match is `i - m`, not `i - m + 1`.
5. **Confusing LPS with Z-function.** Both are prefix-based preprocessing tools but they are not interchangeable.
6. **Forgetting edge cases** such as empty pattern or pattern longer than text.
### 9. Practice Problems
| Problem | Platform | Difficulty |
|---|---|---|
| Implement strStr() | LeetCode 28 | Easy |
| Prefix Function — Knuth-Morris-Pratt | CSES (1732) | Easy–Medium |
| Password (pattern as both prefix and suffix) | Codeforces 126B | Medium |
| Period (compression using LPS) | UVa 455 / POJ 1961 | Medium |
| Shortest Palindrome | LeetCode 214 | Hard |
---
## 3.2 Rabin-Karp Algorithm
### 1. Definition
The **Rabin-Karp algorithm** is a string matching algorithm that uses **hashing** to find a pattern `P` of length `m` within a text `T` of length `n`. Instead of comparing substrings character-by-character, it compares **hash values**, and only performs character comparison when hashes match. Using a **rolling hash**, each window's hash is computed from the previous in O(1), giving an expected total time of **O(n + m)**.
### 2. Intuition
Comparing two strings of length `m` takes O(m) time. But comparing two integers takes O(1). If we can represent every substring of length `m` by a single integer (its hash), then substring comparison becomes constant time.
The key challenge: computing the hash of every window `T[i..i+m-1]` naively costs O(m) per window, giving O(nm) total — no better than naive matching.
The breakthrough is the **rolling hash**: if we know `hash(T[i..i+m-1])`, we can compute `hash(T[i+1..i+m])` in **O(1)** by removing the contribution of the leaving character and adding the contribution of the entering character.
> Think of the hash as a number in base `b`. Sliding the window one position to the right is like dropping the most significant digit and appending a new least significant digit.
Since hash collisions are possible, whenever hashes match we **verify** character-by-character. With a well-chosen hash, collisions are rare, keeping expected time linear.
### 3. Key Concepts / Properties
**Polynomial Rolling Hash:**
For a string `S = s₀ s₁ ... s_{m-1}` and base `b`, modulus `M`:
```
hash(S) = (s₀·b^(m−1) + s₁·b^(m−2) + ... + s_{m−1}·b^0) mod M
```
**Rolling Update Formula:** Given `hash(T[i..i+m-1])`, compute `hash(T[i+1..i+m])`:
```
new_hash = ((old_hash − T[i]·b^(m−1)) · b + T[i+m]) mod M
```
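As a quick consistency check of the update formula, the following sketch (our own, with illustrative base, modulus, and sample string) rolls a window across a text and asserts that each rolled hash equals the hash recomputed from scratch:

```cpp
#include <bits/stdc++.h>
using namespace std;

int main() {
    const long long B = 131, M = 1000000007LL;   // illustrative base and modulus
    string T = "ABCCABCABAB";
    int m = 4, n = T.size();

    auto direct = [&](int i) {                   // hash of T[i..i+m-1] from scratch
        long long h = 0;
        for (int k = 0; k < m; k++) h = (h * B + T[i + k]) % M;
        return h;
    };

    long long pw = 1;                            // B^(m-1) mod M
    for (int k = 0; k < m - 1; k++) pw = pw * B % M;

    long long h = direct(0);
    for (int i = 0; i + m <= n; i++) {
        assert(h == direct(i));                  // rolled hash == recomputed hash
        if (i + m < n) {
            h = ((h - T[i] * pw) % M * B + T[i + m]) % M;
            if (h < 0) h += M;                   // normalize after subtraction
        }
    }
    cout << "rolling update consistent with direct hashing\n";
}
```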
**Design Choices:**
- **Base `b`**: typically a prime larger than the alphabet size (e.g., 31, 53, 131, 257).
- **Modulus `M`**: a large prime to minimize collisions (e.g., 10⁹+7, 10⁹+9, or 2⁶¹−1).
- Always take `mod M` and handle negative values: `((x % M) + M) % M`.
**Collision Handling:**
- A single hash has a collision probability roughly `1/M`. For large `M` this is small but not zero.
- **Double hashing** (two independent hashes) reduces collision probability to ~`1/M²`.
- Always **verify** on hash match unless you are doing probabilistic matching.
**Edge Cases:**
- Integer overflow if base and modulus are not carefully combined — use `long long` in C++.
- Using weak moduli (like `2^32`) is vulnerable to adversarial anti-hash tests on Codeforces.
- Pattern longer than text → no matches.
### 4. Algorithm / Technique
**Single Pattern Matching:**
```
1. Compute hash(P) = hP
2. Compute hash(T[0..m-1]) = hT
3. Precompute b^(m-1) mod M
4. For i = 0 to n - m:
If hT == hP:
Verify T[i..i+m-1] == P character-by-character
If equal → record match
If i < n - m:
Roll the hash: hT = ((hT − T[i]·b^(m-1)) · b + T[i+m]) mod M
(normalize to be non-negative)
```
**Multiple Pattern Matching (Rabin-Karp's real strength):**
Given `k` patterns all of length `m`:
1. Compute hashes of all patterns and insert them into a `hash_set` (or `unordered_map<hash, pattern>`).
2. Slide a single window over the text, computing the rolling hash.
3. For each window hash, check if it exists in the set.
4. On hit, verify character-by-character (to handle collisions).
This runs in **O(n + k·m)** expected time, far better than running KMP `k` times (which would be `O(k·n)`).
*Note:* If patterns have different lengths, run Rabin-Karp separately per distinct length, or use more advanced methods like Aho-Corasick.
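As a sketch of the multi-pattern idea (our own illustrative text, patterns, base, and modulus; single hash only, so every hit is verified), assuming all patterns share one length:

```cpp
#include <bits/stdc++.h>
using namespace std;

// Find occurrences of several equal-length patterns in one pass over the text.
// On a hash hit we verify against every pattern with that hash (collision safety).
int main() {
    const long long B = 131, M = 1000000007LL;
    string txt = "ABABCABABXCABAB";
    vector<string> pats = {"ABABC", "CABAB", "XCABA"};
    int n = txt.size(), m = pats[0].size();        // all patterns share length m

    auto hashOf = [&](const string &s, int from) {
        long long h = 0;
        for (int k = 0; k < m; k++) h = (h * B + s[from + k]) % M;
        return h;
    };

    unordered_map<long long, vector<int>> byHash;  // hash -> indices into pats
    for (int p = 0; p < (int)pats.size(); p++) byHash[hashOf(pats[p], 0)].push_back(p);

    long long pw = 1;                              // B^(m-1) mod M
    for (int k = 0; k < m - 1; k++) pw = pw * B % M;

    long long h = hashOf(txt, 0);
    for (int i = 0; i + m <= n; i++) {
        auto it = byHash.find(h);
        if (it != byHash.end())
            for (int p : it->second)
                if (txt.compare(i, m, pats[p]) == 0)
                    cout << pats[p] << " at index " << i << "\n";
        if (i + m < n) {
            h = ((h - txt[i] * pw) % M * B + txt[i + m]) % M;
            if (h < 0) h += M;
        }
    }
}
```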
### 5. Time & Space Complexity
| Case | Time |
|---|---|
| Best / Average | **O(n + m)** |
| Worst (many collisions) | **O(n · m)** |
| Multiple patterns (k patterns, length m) | **O(n + k·m)** expected |
| Space | **O(m)** (or **O(k)** for the hash set) |
The worst case occurs with adversarial input or a poorly chosen hash function. Using double hashing with large primes makes worst case extremely unlikely in practice.
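Note also that a degenerate input can force the worst case without any collisions: with `txt = "AAAA...A"` and `pat = "AAA...A"`, every window hash equals the pattern hash, so every one of the ~n positions triggers an O(m) verification and the total work is Θ(n·m) even with a perfect hash.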
### 6. Code Implementation (C++)
```cpp
#include <bits/stdc++.h>
using namespace std;
vector<int> rabinKarp(const string &txt, const string &pat) {
vector<int> matches;
int n = txt.size(), m = pat.size();
if (m == 0 || m > n) return matches;
const long long BASE = 131;
const long long MOD = 1000000007LL;
// Precompute BASE^(m-1) % MOD
long long h = 1;
for (int i = 0; i < m - 1; i++) h = (h * BASE) % MOD;
// Initial hashes
long long hashPat = 0, hashTxt = 0;
for (int i = 0; i < m; i++) {
hashPat = (hashPat * BASE + pat[i]) % MOD;
hashTxt = (hashTxt * BASE + txt[i]) % MOD;
}
for (int i = 0; i <= n - m; i++) {
if (hashPat == hashTxt) {
// Verify to avoid false positives
if (txt.compare(i, m, pat) == 0) {
matches.push_back(i);
}
}
if (i < n - m) {
hashTxt = ( (hashTxt - txt[i] * h) * BASE + txt[i + m] ) % MOD;
if (hashTxt < 0) hashTxt += MOD; // normalize
}
}
return matches;
}
int main() {
string txt = "GEEKS FOR GEEKS";
string pat = "GEEK";
vector<int> res = rabinKarp(txt, pat);
for (int idx : res) cout << "Match at index " << idx << "\n";
return 0;
}
```
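With the sample input in `main` (`txt = "GEEKS FOR GEEKS"`, `pat = "GEEK"`), the program prints `Match at index 0` and `Match at index 10`.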
**For production-grade competitive programming, use double hashing:**
```cpp
pair<long long,long long> computeHash(const string &s,
long long B1, long long M1,
long long B2, long long M2) {
long long h1 = 0, h2 = 0;
for (char c : s) {
h1 = (h1 * B1 + c) % M1;
h2 = (h2 * B2 + c) % M2;
}
return {h1, h2};
}
```
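A usage sketch continuing the snippet above (the bases and moduli below are illustrative choices, not fixed constants): two strings are treated as equal only when both hash components agree, which is what drives the ~`1/M²` collision probability.

```cpp
// Assumes computeHash from the snippet above is in scope.
bool sameByHash(const string &a, const string &b) {
    const long long B1 = 131, M1 = 1000000007LL;   // illustrative pair 1
    const long long B2 = 137, M2 = 998244353LL;    // illustrative pair 2
    return a.size() == b.size() &&
           computeHash(a, B1, M1, B2, M2) == computeHash(b, B1, M1, B2, M2);
}
```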
### 7. Example Walkthrough
**Pattern:** `P = "CAB"`, **Text:** `T = "ABCCABC"`, **Base:** `b = 10`, **Modulus:** `M = 13`.
(Using simplified values where `'A'=1, 'B'=2, 'C'=3` for illustration.)
**Step 1 — Pattern hash:**
```
hash("CAB") = (3·10² + 1·10¹ + 2·10⁰) mod 13
= (300 + 10 + 2) mod 13
= 312 mod 13 = 0
```
**Step 2 — First window `T[0..2] = "ABC"`:**
```
hash("ABC") = (1·100 + 2·10 + 3) mod 13 = 123 mod 13 = 6
```
6 ≠ 0 → no match.
**Step 3 — Roll to window `T[1..3] = "BCC"`:**
```
h = b^(m-1) = 100
new_hash = ((6 − 1·100)·10 + 3) mod 13
= (−94·10 + 3) mod 13
= −937 mod 13
```
Normalize: `−937 mod 13 = 12` (since `13·73 = 949` and `−937 + 949 = 12`). So hash = 12 ≠ 0 → no match.
**Step 4 — Roll to window `T[2..4] = "CCA"`:**
```
new_hash = ((12 − 2·100)·10 + 1) mod 13 = −1879 mod 13 = 6 ≠ 0
```
**Step 5 — Roll to window `T[3..5] = "CAB"`:**
```
new_hash = ((6 − 3·100)·10 + 2) mod 13 = −2938 mod 13 = 0 = hashPat → candidate match
```
```
**Verify** character-by-character: `T[3..5] = "CAB" == P` → **match at index 3**.
This demonstrates how hashing lets us skip full string comparison on windows where hashes differ.
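The walkthrough numbers are small enough to verify mechanically; here is a tiny sketch (our own) that prints every window hash of the text under the same toy encoding (`A=1, B=2, C=3`, base 10, modulus 13):

```cpp
#include <bits/stdc++.h>
using namespace std;

int main() {
    string T = "ABCCABC";
    int m = 3, b = 10, M = 13;
    auto val = [](char c) { return c - 'A' + 1; };       // A->1, B->2, C->3
    for (int i = 0; i + m <= (int)T.size(); i++) {
        int h = 0;
        for (int k = 0; k < m; k++) h = (h * b + val(T[i + k])) % M;
        cout << T.substr(i, m) << " -> " << h << "\n";   // ABC->6, BCC->12, CCA->6, CAB->0, ABC->6
    }
}
```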
### 8. Common Mistakes
1. **Negative mod values in C++.** In C++, `(-5) % 13 == -5`, not `8`. Always normalize with `((x % M) + M) % M` after subtraction.
2. **Forgetting to verify on hash match.** Hash collisions exist. Without verification, you get false positives.
3. **Weak hash functions.** Using `M = 2^32` (natural overflow) is unsafe on Codeforces — adversarial tests can force collisions. Use large primes (`10⁹+7`, `10⁹+9`, `2⁶¹−1`).
4. **Not precomputing `b^(m-1)`.** Recomputing it inside the loop makes the algorithm O(n·m).
5. **Integer overflow.** `BASE * hash` can overflow 32-bit `int`. Use `long long` consistently.
6. **Using the same base and mod for two hashes in "double hashing".** They must be independent primes to give real protection.
7. **Off-by-one in the rolling formula.** Make sure `h = BASE^(m-1)`, not `BASE^m`.
8. **Choosing base ≤ alphabet size.** A base like 26 can cause collisions between shifted strings. Prefer a prime > alphabet size.
### 9. Practice Problems
| Problem | Platform | Difficulty |
|---|---|---|
| Repeated DNA Sequences | LeetCode 187 | Medium |
| String Matching | CSES (1753) | Medium |
| Longest Duplicate Substring | LeetCode 1044 | Hard |
| Password (hashing + periodicity) | Codeforces 126B | Medium |
| Good Substrings / Distinct Substrings | CSES (2105) | Medium–Hard |
---
## Quick Comparison: KMP vs Rabin-Karp
| Criterion | KMP | Rabin-Karp |
|---|---|---|
| Preprocessing | O(m) LPS array | O(m) hash |
| Matching time | O(n) **deterministic** | O(n + m) **expected**, O(nm) worst |
| Multiple patterns | Run per pattern: O(k·n) | O(n + k·m) with a hash set — **much better** |
| Implementation | Slightly tricky (LPS logic) | Easier, but hashing pitfalls |
| Determinism | Always linear | Probabilistic unless verified carefully |
| Extensibility | Hard to adapt | Easy (substring hashing, comparing substrings in O(1)) |
**When to use which:**
- **Single pattern, guaranteed linear time required** → KMP.
- **Multiple patterns of same length** → Rabin-Karp with a hash set.
- **Need to compare arbitrary substrings of a string** → Rabin-Karp (polynomial hashing) is far more flexible.
- **Competitive programming** → Rabin-Karp (with double hashing) is often preferred due to flexibility; KMP remains essential for problems built around the failure function (period, border, compression).