KMP algorithm - HackMD

# KMP algorithm KMP (Knuth, Morrison and Pratt algorithm) is an algorithm for string matching. :::success ## TL;DR The already compared charactors are important information. By creating a **Longest Prefix Suffix (LPS)** array, both the needle and the haystack will be scaned once, thus the complexity can be reduced from $O(mn)$ to $O(m+n)$. ::: ## Relative leetcode problem ([Problem 28](https://leetcode.com/problems/find-the-index-of-the-first-occurrence-in-a-string/description/)) Given two strings ***needle*** and ***haystack***, return the index of the first occurrence of needle in haystack, or -1 if needle is not part of haystack. ## Think through :::info I often struggle to understand algorithms like this, so I’d like to share my thought process in hopes it will help others. You can skip directly to the implementation by jumping [here](#Step-4-create-LPS-) . I would like to first explain the need of LPS. Then I'll explain the actual process of generating the LPS. ::: Let's use the following case as example ``` haystack = "abaababaababab" needle = "abaabab" ``` ### Step 1: Brute Force :muscle: We'll start with a straightforward brute force approach. This method loops through all possible positions of ***haystack*** in order, comparing it with the ***needle*** and returns the position as soon as the first match is found: ![full case](https://hackmd.io/_uploads/Hkn2_E5RR.gif) ```python= class Solution: def strStr(self, haystack: str, needle: str) -> int: m = len(haystack) n = len(needle) ## i indicates the position where ## haystack starts to compare with needle for i in range(m-n+1): for j in range(n): if haystack[i+j] != needle[j]: break if j == (n-1): return i return -1 ``` Time complexity: $O((m-n)*n)$ --- ### Step 2: What can be improved :brain: ? Let’s take a closer look at the brute force method. What happens when a mismatch occurs? ![image](https://hackmd.io/_uploads/HyPu3WHCA.png) The ***needle*** moves to the next start position $i+1$, and the inner loop begins comparing the ***needle*** from the beginning again. ![image](https://hackmd.io/_uploads/SyNXsFSRA.png) So what is the problem? With a small example, we can enlarge the issue: <img src=https://hackmd.io/_uploads/r1vMx7SRR.gif style="width: 65%"> As shown, the brute force method repeatedly compars the triple a's of the ***needle***. Is there a way we can avoid these unnecessary comparisons? --- :::warning :exclamation: **Here comes the key idea of KMP algorithm** :exclamation: When a mismatch happens, the algorithm doesn’t start over from the next character in the text. Instead, it uses the information about how much of the pattern has already been matched to skip ahead. ::: ### Step 3:　What and Why LPS ? In the brute force method, we see the pointer on the ***haystack*** ($i+j$) keeps getting reset to the next start position of the ***needle*** ($i+1$) when a mismatch happens, which is the root cause of the redundant comparisons shown above. What if the pointer on the ***haystack*** ($i+j$) goes only forward ? This means that each charactor of ***haystack*** is compared only once. But how? Let’s explore how we can avoid resetting the pointer $i$. Consider the first mismatch at $i$==8 (from now on, we’ll refer to the pointer on the haystack as $i$ instead of $i+j$ for convenience): ![image](https://hackmd.io/_uploads/H1zS52ICA.png) Why are we able skip those positions that are crossed off? Remember, we wish to keep comparing the substring from the position $i$==8. Yet, the substring of the ***needle*** before pointer $j$ at those positions does not match the substring of ***haystack*** until postion $i$==8. This guarantees that the answer cannot exist in those positions, allowing us to skip them without missing any valid matches. With the skipping technique, the comparison now looks like this: ![full case](https://hackmd.io/_uploads/HyYIDV90C.gif) Before we dive deeper, to make sure we are still on the right track, let's analyze the time complexity of this improved matching algorithm using the skipping technique. Suppose we magically know the correct position to move the ***needle*** for each mismatch. As shown, the two remaining operations are: (1)comparing the next character ($i=i+1$) and (2)moving the ***needle*** to the correct position. Since both of the operations can only happen at most $m$ times (where $m$ is the length of the ***haystack***), the matching algorithm using this skipping technique runs in $O(m)$ time, great! This means that if we can compute the correct next position of the ***needle*** in linear time, the entire string matching process will also run in linear time. So, how do we find the next position of the ***needle*** for each mismatch? While we’ve been talking about moving the ***needle***, it’s actually the pointer $j$ on the ***needle*** that determines the next position. Skipping the ***needle*** to the next position actually equals to pushing the pointer $j$ one step forward ($j-1$): ![image](https://hackmd.io/_uploads/B19aoio00.png) On the other hand, the destination of pointer $j$ can also be seen as how many charactors of the ***needle*** are skipped in the next position, in the above case, three ($aba$). Thus, the question "Where is the next position of the ***needle***?" turns into "How many charactors in the ***needle*** can be skipped?" If we look only at the substring before the position $i$: <img src=https://hackmd.io/_uploads/HyVJRhi0A.png style="width: 70%"> We should observe a few properties. (1) the prefix of the ***needle*** before $j$ must match the suffix of the already compared ***haystack*** , $aba$ and $abaababa$ in this case, respectively. Moreover, not any match will do, the longest prefix and suffix match specifically, in order to not miss any possible position. (2) the already compared ***haystack*** is actually the same as the already compared ***needle*** (or what else are we doing?). With this two property, by looking only at the ***needle***, we should know where the pointer $j$ should jump to when a mismatch happens for every value of $j$. This is where the LPS (Longest Prefix Suffix) array comes in. The LPS[j] records the length of longest prefix that is also a suffix for ***needle***[0: j]. This length also tells us how much charactors can be skipped, which is the destination of the pointer $j$ if mismatch happens! Now we just need to create LPS in linear time and we are done. :::success ## Summary LPS records the length of the longest prefix suffix match of the ***needle***. This tells us where the next comparison should start, which helps push back only the pointer $j$ and not the pointer $i$. ::: --- ### Step 4: create LPS :memo: Longest Prefix Suffix (LPS) <img src=https://hackmd.io/_uploads/HJdBHy3AA.gif style="width: 80%"> For this section, I would like to go directly to the algoritm, since the idea is quite straight forward. Let's create the LPS array one by one. The pointer $j$ indicates the position of the current added charator and the pointer $i$ indicates the end position of the longest matched prefix from the previous step. LPS[0] is always 0 and we will start with LPS[1] ($i, j$=0, 1): <img src=https://hackmd.io/_uploads/BJzYQRsR0.png style="width: 80%"> Since ***needle***[1] != ***needle***[0] and $i$ is already at the start ($i$==0), there is no matching prefix and suffix for substring $ab$ (***needle***[0:j+1]). Next position, LPS[2] ($i, j$=0, 2): <img src=https://hackmd.io/_uploads/S12MgJhAR.png style="width: 80%"> The potential prefix suffix pair is $ab$ / $ba$ and $a$ / $a$. Do we need to test all of them? Actually NO! Because LPS[1] == 0 already tells us ***needle***[1] != ***needle***[0], which is what we have done in the previous step. So we only need to compare ***needle***[2] and ***needle***[0]. Next position, LPS[3]: <img src=https://hackmd.io/_uploads/SJAq-12AR.png style="width: 80%"> The potential prefix suffix pair is $aba$ / $baa$, $ab$ / $aa$ and $a$ / $a$. These pairs are actually extended from the potential prefix suffix pair of the previous position. This means that, the matched prefix suffix pair of current position must be extended from a matched prefix suffix pair of the previous position. The already complete LPS tells us where those matched prefix suffix pairs are. In this case, LPS[2] == 1 tells us that the length of the longest prefix suffix is as shown in the figure above. Thus we only need to compare ***needle***[j] and ***needle***[i] to see if the previous longest prefix suffix is extendable or not, which is not in this case. So we move the pointer $i$ to LPS[i-1]. Why? Because LPS[i-1] tells us where the next possible prefix suffix is. TBD... ```python= ## Create LPS in O(n) i, j, n = 0, 1, len(needle) LPS = [0] * n while j < n: if needle[j]==needle[i]: ## Case 1 LPS[j] = i+1 i += 1 j += 1 else: if i==0: ## Case 2 LPS[j] = 0 j += 1 else: ## Case 3 i = LPS[i-1] ``` --- ### Step 5: full algorithm Now if you combine the LPS generation and the search algorithm, the full code for KMP algorithm should be: ```python= class Solution: def strStr(self, haystack: str, needle: str) -> int: ## Create LPS in O(n) i, j = 0, 1 LPS = [0] * len(needle) while j < len(needle): # print(f'{j=}, {needle[j]=}, {i=}, {needle[i]=}, {LPS=}') if needle[j]==needle[i]: LPS[j] = i+1 i += 1 j += 1 else: if i==0: LPS[j] = 0 j += 1 else: i = LPS[i-1] ## String matching with LPS in O(m) i, j = 0, 0 while j < len(haystack): # print(f'{j=}, {haystack[j]=}, {i=}, {needle[i]=}, {LPS=}') if haystack[j]==needle[i]: i += 1 j += 1 else: if i==0: j += 1 else: i = LPS[i-1] # print(f'{j=}, {i=}') if i == len(needle): return j - len(needle) return -1 ``` Hope this artical had help you understand the core ideas of KMP algorithm !