---
title: String Matching Algorithms — Comprehensive Notes

---

# Unit 3: String Matching Algorithms — Comprehensive Notes

---

## 3.1 Knuth-Morris-Pratt (KMP) Algorithm

### 1. Definition

The **Knuth-Morris-Pratt (KMP)** algorithm is a linear-time string matching algorithm that finds all occurrences of a pattern `P` of length `m` within a text `T` of length `n`. It achieves **O(n + m)** time complexity by preprocessing the pattern to build a **failure function (LPS array)** that allows the algorithm to skip redundant comparisons after a mismatch, without ever moving the text pointer backward.

### 2. Intuition

The naive string matching algorithm has a fundamental inefficiency: after a mismatch, it restarts the comparison from the next character in the text, **discarding all information** gathered from previous matches.

KMP's insight is simple but powerful:

> When several characters of the pattern have already matched the text, we already know exactly what those text characters are — they are the same as the beginning of the pattern. We can use this knowledge to skip ahead intelligently.

**Analogy:** Imagine you're looking for the word `"ABABC"` in a text. Suppose you've matched `"ABAB"` so far, and then the next character mismatches. Instead of starting from the beginning, you notice that the last `"AB"` in your matched portion is the same as the first `"AB"` of the pattern. So you can continue matching from position 2 of the pattern, saving comparisons.

The **LPS array** formalizes this idea by precomputing, for every prefix of the pattern, the length of the longest proper prefix that is also a suffix.

### 3. Key Concepts / Properties

**Proper Prefix / Proper Suffix:**
- A *proper prefix* of a string is any prefix except the string itself.
- A *proper suffix* is any suffix except the string itself.
- Example: For `"ABCAB"`, proper prefixes are `{"A", "AB", "ABC", "ABCA"}` and proper suffixes are `{"B", "AB", "CAB", "BCAB"}`.

**LPS (Longest Proper Prefix which is also Suffix) Array:**
- `lps[i]` = length of the longest proper prefix of `P[0..i]` that is also a suffix of `P[0..i]`.
- `lps[0] = 0` always (a single character has no nonempty proper prefix).

**Example:** For pattern `P = "ABABCABAB"`:

| Index | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 |
|-------|---|---|---|---|---|---|---|---|---|
| Char  | A | B | A | B | C | A | B | A | B |
| LPS   | 0 | 0 | 1 | 2 | 0 | 1 | 2 | 3 | 4 |

**Key Properties:**
- The LPS array never "wastes" work — each character in the pattern is processed a constant number of times.
- `lps[i] <= i` always.
- If `lps[i] = k`, then `P[0..k-1] == P[i-k+1..i]`.
- The text pointer `i` **never decreases** during matching — this is the core reason for linear time.

**Edge Cases:**
- Empty pattern → match at every position (handle separately).
- Pattern longer than text → no matches.
- Pattern equal to text → exactly one match.
- Overlapping matches (e.g., pattern `"AA"` in text `"AAAA"` has matches at indices 0, 1, 2).

### 4. Algorithm / Technique

The algorithm has **two phases**:

#### Phase 1 — Build the LPS Array (Preprocessing)

```
1. Initialize lps[0] = 0, len = 0, i = 1
2. While i < m:
     a. If P[i] == P[len]:
          len++
          lps[i] = len
          i++
     b. Else if len != 0:
          len = lps[len - 1]       // fall back, do NOT increment i
     c. Else:
          lps[i] = 0
          i++
```

The crucial step is **2b**: when there's a mismatch but we've made some progress, we don't restart — we fall back using the already-computed LPS values.

#### Phase 2 — Matching

```
1. i = 0 (text index), j = 0 (pattern index)
2. While i < n:
     a. If T[i] == P[j]:
          i++, j++
          If j == m:  // full match found
              record match at position i - j
              j = lps[j - 1]
     b. Else:
          If j != 0:
              j = lps[j - 1]       // use LPS to skip
          Else:
              i++
```

### 5. Time & Space Complexity

| Aspect | Complexity |
|---|---|
| LPS preprocessing | **O(m)** |
| Matching | **O(n)** |
| **Total time** | **O(n + m)** |
| Space | **O(m)** for the LPS array |

**Why O(n + m)?** In the matching phase, `i` only increases. `j` can decrease, but it can only decrease as many times as it increased. So total operations are bounded by `2n`. The same amortized argument applies to LPS construction.

**Best / Average / Worst cases are all O(n + m)** — KMP is consistent, which is a major strength over naive matching (which is O(nm) worst-case).

### 6. Code Implementation (C++)

```cpp
#include <bits/stdc++.h>
using namespace std;

// Build LPS array for pattern
vector<int> buildLPS(const string &pat) {
    int m = pat.size();
    vector<int> lps(m, 0);
    int len = 0, i = 1;
    while (i < m) {
        if (pat[i] == pat[len]) {
            lps[i++] = ++len;
        } else if (len != 0) {
            len = lps[len - 1];   // fall back
        } else {
            lps[i++] = 0;
        }
    }
    return lps;
}

// Returns all starting indices where pat occurs in txt
vector<int> KMP(const string &txt, const string &pat) {
    vector<int> matches;
    int n = txt.size(), m = pat.size();
    if (m == 0 || m > n) return matches;

    vector<int> lps = buildLPS(pat);
    int i = 0, j = 0;
    while (i < n) {
        if (txt[i] == pat[j]) {
            i++; j++;
            if (j == m) {
                matches.push_back(i - j);
                j = lps[j - 1];   // continue searching for overlaps
            }
        } else if (j != 0) {
            j = lps[j - 1];
        } else {
            i++;
        }
    }
    return matches;
}

int main() {
    string txt = "ABABDABACDABABCABAB";
    string pat = "ABABCABAB";
    vector<int> res = KMP(txt, pat);
    for (int idx : res) cout << "Match at index " << idx << "\n";
    return 0;
}
```

### 7. Example Walkthrough

**Pattern:** `P = "ABABC"`, **Text:** `T = "ABABABC"`

**Step 1 — Build LPS for `"ABABC"`:**

| i | P[i] | len | Comparison | lps[i] |
|---|------|-----|------------|--------|
| 0 | A | 0 | — | 0 |
| 1 | B | 0 | B ≠ A, len=0 | 0 |
| 2 | A | 0 | A == A, len=1 | 1 |
| 3 | B | 1 | B == B, len=2 | 2 |
| 4 | C | 2 | C ≠ A, fallback len=lps[1]=0, then C≠A, len=0 | 0 |

**LPS = [0, 0, 1, 2, 0]**

**Step 2 — Matching:**

| i | j | T[i] | P[j] | Action |
|---|---|------|------|--------|
| 0 | 0 | A | A | match → i=1, j=1 |
| 1 | 1 | B | B | match → i=2, j=2 |
| 2 | 2 | A | A | match → i=3, j=3 |
| 3 | 3 | B | B | match → i=4, j=4 |
| 4 | 4 | A | C | mismatch, j=lps[3]=2 |
| 4 | 2 | A | A | match → i=5, j=3 |
| 5 | 3 | B | B | match → i=6, j=4 |
| 6 | 4 | C | C | match → i=7, j=5 ⇒ **match at index 7−5=2** |

Notice: at the mismatch (row 5 of the table), instead of backtracking `i` and restarting the comparison as the naive algorithm would, we simply kept `i=4` and jumped `j` back to 2 using `lps[3]=2`. This is the key optimization.

### 8. Common Mistakes

1. **Incrementing `i` inside the fallback branch of LPS construction.** When `P[i] != P[len]` and `len > 0`, we must only update `len = lps[len-1]` — **not** increment `i`. Otherwise the LPS values become wrong.
2. **Using `lps[j]` instead of `lps[j-1]`.** After matching `j` characters (positions `0..j-1`), the relevant LPS value is `lps[j-1]`, not `lps[j]`.
3. **Not handling overlapping matches.** After finding a match, set `j = lps[j-1]` rather than `j = 0` to continue finding overlaps like `"AA"` in `"AAAA"`.
4. **Off-by-one in match position.** The starting index of a match is `i - m`, not `i - m + 1`.
5. **Confusing LPS with Z-function.** Both are prefix-based preprocessing tools but they are not interchangeable.
6. **Forgetting edge cases** such as empty pattern or pattern longer than text.

### 9. Practice Problems

| Problem | Platform | Difficulty |
|---|---|---|
| Implement strStr() | LeetCode 28 | Easy |
| Prefix Function — Knuth-Morris-Pratt | CSES (1732) | Easy–Medium |
| Password (pattern as both prefix and suffix) | Codeforces 126B | Medium |
| Period (compression using LPS) | UVa 455 / POJ 1961 | Medium |
| Shortest Palindrome | LeetCode 214 | Hard |

---

## 3.2 Rabin-Karp Algorithm

### 1. Definition

The **Rabin-Karp algorithm** is a string matching algorithm that uses **hashing** to find a pattern `P` of length `m` within a text `T` of length `n`. Instead of comparing substrings character-by-character, it compares **hash values**, and only performs character comparison when hashes match. Using a **rolling hash**, each window's hash is computed from the previous in O(1), giving an expected total time of **O(n + m)**.

### 2. Intuition

Comparing two strings of length `m` takes O(m) time. But comparing two integers takes O(1). If we can represent every substring of length `m` by a single integer (its hash), then substring comparison becomes constant time.

The key challenge: computing the hash of every window `T[i..i+m-1]` naively costs O(m) per window, giving O(nm) total — no better than naive matching.

The breakthrough is the **rolling hash**: if we know `hash(T[i..i+m-1])`, we can compute `hash(T[i+1..i+m])` in **O(1)** by removing the contribution of the leaving character and adding the contribution of the entering character.

> Think of the hash as a number in base `b`. Sliding the window one position to the right is like dropping the most significant digit and appending a new least significant digit.

Since hash collisions are possible, whenever hashes match we **verify** character-by-character. With a well-chosen hash, collisions are rare, keeping expected time linear.

### 3. Key Concepts / Properties

**Polynomial Rolling Hash:**
For a string `S = s₀ s₁ ... s_{m-1}` and base `b`, modulus `M`:

```
hash(S) = (s₀·b^(m−1) + s₁·b^(m−2) + ... + s_{m−1}·b^0) mod M
```

**Rolling Update Formula:** Given `hash(T[i..i+m-1])`, compute `hash(T[i+1..i+m])`:

```
new_hash = ((old_hash − T[i]·b^(m−1)) · b + T[i+m]) mod M
```

**Design Choices:**
- **Base `b`**: typically a prime larger than the alphabet size (e.g., 31, 53, 131, 257).
- **Modulus `M`**: a large prime to minimize collisions (e.g., 10⁹+7, 10⁹+9, or 2⁶¹−1).
- Always take `mod M` and handle negative values: `((x % M) + M) % M`.

**Collision Handling:**
- A single hash has a collision probability roughly `1/M`. For large `M` this is small but not zero.
- **Double hashing** (two independent hashes) reduces collision probability to ~`1/M²`.
- Always **verify** on hash match unless you are doing probabilistic matching.

**Edge Cases:**
- Integer overflow if base and modulus are not carefully combined — use `long long` in C++.
- Using weak moduli (like `2^32`) is vulnerable to adversarial anti-hash tests on Codeforces.
- Pattern longer than text → no matches.

### 4. Algorithm / Technique

**Single Pattern Matching:**

```
1. Compute hash(P) = hP
2. Compute hash(T[0..m-1]) = hT
3. Precompute b^(m-1) mod M
4. For i = 0 to n - m:
     If hT == hP:
         Verify T[i..i+m-1] == P character-by-character
         If equal → record match
     If i < n - m:
         Roll the hash: hT = ((hT − T[i]·b^(m-1)) · b + T[i+m]) mod M
         (normalize to be non-negative)
```

**Multiple Pattern Matching (Rabin-Karp's real strength):**

Given `k` patterns all of length `m`:
1. Compute hashes of all patterns and insert them into a `hash_set` (or `unordered_map<hash, pattern>`).
2. Slide a single window over the text, computing the rolling hash.
3. For each window hash, check if it exists in the set.
4. On hit, verify character-by-character (to handle collisions).

This runs in **O(n + k·m)** expected time, far better than running KMP `k` times (which would cost `O(k·(n + m))`).

*Note:* If patterns have different lengths, run Rabin-Karp separately per distinct length, or use more advanced methods like Aho-Corasick.

### 5. Time & Space Complexity

| Case | Time |
|---|---|
| Best / Average | **O(n + m)** |
| Worst (many collisions) | **O(n · m)** |
| Multiple patterns (k patterns, length m) | **O(n + k·m)** expected |
| Space | **O(m)** (or **O(k)** for the hash set) |

The worst case occurs with adversarial input or a poorly chosen hash function. Using double hashing with large primes makes worst case extremely unlikely in practice.

### 6. Code Implementation (C++)

```cpp
#include <bits/stdc++.h>
using namespace std;

vector<int> rabinKarp(const string &txt, const string &pat) {
    vector<int> matches;
    int n = txt.size(), m = pat.size();
    if (m == 0 || m > n) return matches;

    const long long BASE = 131;
    const long long MOD  = 1000000007LL;

    // Precompute BASE^(m-1) % MOD
    long long h = 1;
    for (int i = 0; i < m - 1; i++) h = (h * BASE) % MOD;

    // Initial hashes
    long long hashPat = 0, hashTxt = 0;
    for (int i = 0; i < m; i++) {
        hashPat = (hashPat * BASE + pat[i]) % MOD;
        hashTxt = (hashTxt * BASE + txt[i]) % MOD;
    }

    for (int i = 0; i <= n - m; i++) {
        if (hashPat == hashTxt) {
            // Verify to avoid false positives
            if (txt.compare(i, m, pat) == 0) {
                matches.push_back(i);
            }
        }
        if (i < n - m) {
            hashTxt = ( (hashTxt - txt[i] * h) * BASE + txt[i + m] ) % MOD;
            if (hashTxt < 0) hashTxt += MOD;   // normalize
        }
    }
    return matches;
}

int main() {
    string txt = "GEEKS FOR GEEKS";
    string pat = "GEEK";
    vector<int> res = rabinKarp(txt, pat);
    for (int idx : res) cout << "Match at index " << idx << "\n";
    return 0;
}
```

**For production-grade competitive programming, use double hashing:**

```cpp
pair<long long,long long> computeHash(const string &s, 
                                      long long B1, long long M1, 
                                      long long B2, long long M2) {
    long long h1 = 0, h2 = 0;
    for (char c : s) {
        h1 = (h1 * B1 + c) % M1;
        h2 = (h2 * B2 + c) % M2;
    }
    return {h1, h2};
}
```

### 7. Example Walkthrough

**Pattern:** `P = "CAB"`, **Text:** `T = "ABCCABC"`, **Base:** `b = 10`, **Modulus:** `M = 13`.
(Using simplified values where `'A'=1, 'B'=2, 'C'=3` for illustration.)

**Step 1 — Pattern hash:**
```
hash("CAB") = (3·10² + 1·10¹ + 2·10⁰) mod 13
            = (300 + 10 + 2) mod 13
            = 312 mod 13 = 0
```

**Step 2 — First window `T[0..2] = "ABC"`:**
```
hash("ABC") = (1·100 + 2·10 + 3) mod 13 = 123 mod 13 = 6
```
6 ≠ 0 → no match.

**Step 3 — Roll to window `T[1..3] = "BCC"`:**
```
h = b^(m-1) = 100
new_hash = ((6 − 1·100)·10 + 3) mod 13
         = (−94·10 + 3) mod 13
         = −937 mod 13
```
Normalize: `−937 mod 13 = 12` (since `13·73 = 949` and `−937 + 949 = 12`). So hash = 12 ≠ 0 → no match.

**Step 4 — Roll to window `T[2..4] = "CCA"`:**
```
new_hash = ((12 − 2·100)·10 + 1) mod 13 = −1879 mod 13 = 6 ≠ 0 → no match
```

**Step 5 — Roll to window `T[3..5] = "CAB"`:**
```
new_hash = ((6 − 3·100)·10 + 2) mod 13 = −2938 mod 13 = 0 = hashPat → candidate match
```
**Verify** character-by-character: `T[3..5] = "CAB" == P` → **match at index 3**.

This demonstrates how hashing lets us skip full string comparison on windows where hashes differ.

### 8. Common Mistakes

1. **Negative mod values in C++.** In C++, `(-5) % 13 == -5`, not `8`. Always normalize with `((x % M) + M) % M` after subtraction.
2. **Forgetting to verify on hash match.** Hash collisions exist. Without verification, you get false positives.
3. **Weak hash functions.** Using `M = 2^32` (natural overflow) is unsafe on Codeforces — adversarial tests can force collisions. Use large primes (`10⁹+7`, `10⁹+9`, `2⁶¹−1`).
4. **Not precomputing `b^(m-1)`.** Recomputing it inside the loop makes the algorithm O(n·m).
5. **Integer overflow.** `BASE * hash` can overflow 32-bit `int`. Use `long long` consistently.
6. **Using the same base and mod for two hashes in "double hashing".** They must be independent primes to give real protection.
7. **Off-by-one in the rolling formula.** Make sure `h = BASE^(m-1)`, not `BASE^m`.
8. **Choosing base ≤ alphabet size.** A base like 26 can cause collisions between shifted strings. Prefer a prime > alphabet size.

### 9. Practice Problems

| Problem | Platform | Difficulty |
|---|---|---|
| Repeated DNA Sequences | LeetCode 187 | Medium |
| String Matching | CSES (1753) | Medium |
| Longest Duplicate Substring | LeetCode 1044 | Hard |
| Password (hashing + periodicity) | Codeforces 126B | Medium |
| Good Substrings / Distinct Substrings | CSES (2105) | Medium–Hard |

---

## Quick Comparison: KMP vs Rabin-Karp

| Criterion | KMP | Rabin-Karp |
|---|---|---|
| Preprocessing | O(m) LPS array | O(m) hash |
| Matching time | O(n) **deterministic** | O(n + m) **expected**, O(nm) worst |
| Multiple patterns | Run per pattern: O(k·n) | O(n + k·m) with a hash set — **much better** |
| Implementation | Slightly tricky (LPS logic) | Easier, but hashing pitfalls |
| Determinism | Always linear | Probabilistic unless verified carefully |
| Extensibility | Hard to adapt | Easy (substring hashing, comparing substrings in O(1)) |

**When to use which:**
- **Single pattern, guaranteed linear time required** → KMP.
- **Multiple patterns of same length** → Rabin-Karp with a hash set.
- **Need to compare arbitrary substrings of a string** → Rabin-Karp (polynomial hashing) is far more flexible.
- **Competitive programming** → Rabin-Karp (with double hashing) is often preferred due to flexibility; KMP remains essential for problems built around the failure function (period, border, compression).