KMP Algorithm(Knuth–Morris–Pratt Algorithm)

# KMP Algorithm(Knuth–Morris–Pratt Algorithm) ###### tags: `KMP algorithm` >In computer science territory, string matching is an extremely important topic. Such as searching engine, string matching play an important role in those implementations. Therefore, it is essential to reduce the complexity of string matching. ## Title [ToC] ## Problem Statement Given two strings s[0...N] and pat[0...M], finding the all occurrences of pat[] in s[]. ### Examples ``` Input: s[] = "ABCDABBABDCABBA" pat[] = "ABBA" Output: pattern found at index 4 pattern found at index 11 Input: s[] = "Hello I am Bob" pat[] = "Bob" Output: pattern found at index 11 Input: s[] = "ABABDABACDABABCABAB" pat[] = "ABABCABAB" Output: pattern found at index 10 ``` ## Naive Solution A naive solution use two loops to check whether a pattern is found at s[i]. Let size of string s[] be N and size of pat[] be M. Therefore, as the following pseudo code: ``` for i = 0 to N-1: for j = 0 to M-1 if s[i+j] == pat[j] && j == M-1: return index i; else if s[i+j] != pat[j]: break; ``` It's easy for us to see that we go through every index i in string s[]. When s[i+j] = pat[j] (for j = 0...M-1) , we will find a pattern pat that occurs in string s. If we go through the inner loop, we will break when we meet the unequal character and move to i+1. In the worst case, we will need to go through every loop operation; therefore, we will have complexity O(MN) in the worst case. ## KMP solution The naive approach is not a linear solution for string matching. We expect a linear solution to solve the string matching problem. Therefore, we now introduce the KMP algorithm. First, we consider a condition: ``` s[] = "ABABABDBCCA" pat = "ABABD" index: 0123456789 s[]: ABABABDBCCA pat: ABABD ``` When comparing two strings, we can see that we will fail to get a pattern found in index 4. In the previous naive solution, we will go back to index 2 to continue the pattern comparison. ``` index: 0123456789 s[]: ABABABDBCCA pat: ABABD index: 0123456789 s[]: ABABABDBCCA **&& pat: ABABD index: 0123456789 s[]: ABABABDBCCA **&& pat: ABABD ``` Obviously, we can directly move to compare index 2 of string s rather than index 1 in the naive approach. If we consider the substring of s s[0,3], we can assure that it is the same as pat[0,3]. Also, we can see s[0,1] = s[2,3]. Besides, s[0,1] is the prefix of the pat[0,3]; s[2,3] is the suffix of pat[0,3]. If we encounter an unmatched character at index j of string pat, we can see the longest prefix and suffix in the substring pat[0, j-1]. As we can see we need to be able to calculate the lps ( longest prefix which is also proper suffix) in string s. We will now introduce the algorithm to find out lps in every index of string pat. ### LPS (Longest Prefix Which Is Also Proper Suffix) ``` function lps(pat): M = pat.length; len = 0; lps[0] = 0; i = 1; while i < M: if pat[i] == pat[len]: len++; lps[i] = len; i++; else: if len == 0: lps[i] = 0; i++; else: len = lps[i-1]; return lps[]; ``` For example, if we have a string pat with size M and with the knowledge of lps[0,i-1] of string pat. And variable len indicates the length of previous lps. We will need to calculate lps[i]. Obviously, pat[0, len-1] will be lps with length len. The next character of the prefix we want to compare with pat[i] is pat[j]. Therefore, if pat[i] = pat[len], we will get a new lps with length len+1. Else, we will change current len to lps[i-1] which indicates that we need to compare pat[i] with the lps[0, i-1]. If len = 0, let the current index lps[i] = 0, and move to the next index. ### KMP Search In KMP search, we will need two indexes i, j to represent the current index of string s and pat. ``` KMPsearch(s[], pat[]): i = 0; j = 0; lps = lps(pat); while i < N: if s[i] == pat[j]: i++; j++; if j == M: return pattern found; else: if j == 0: i++; else: j = lps[j-1]; ``` The only place that confuses you may be the condition that s[i] != pat[j]. If the condition happens, according to the previous statement, when we encounter an unmatched character, we can use the lps[] to check which index we should visit. Since we can calculate lps[] in complexity O(M), and skip several index in KMP search. Therefore, the overall cmplexity will be O(M+N). ## C++ Code ```cpp=0 #include <iostream> #include <string> using namespace std; void naive(char pat[], char s[]){ int M = strlen(pat); int N = strlen(s); for(int i = 0; i < N; i++){ for(int j = 0; j < M; j++){ if(s[i+j] == pat[j] && j == M-1){ cout << "index at: " << i << endl; } else if(s[i+j] != pat[j]) break; } } } void LPSfun(char pat[], int lps[], int M){ int i = 1; int len = 0; lps[0] = 0; while(i < M){ if(pat[i] == pat[len]){ len++; lps[i] = len; i++; } else{ if(len == 0){ lps[i] = 0; i++; } else{ len = lps[len-1]; } } } } void KMP(char pat[], char s[]){ int M = strlen(pat); int N = strlen(s); int i = 0; int j = 0; int lps[M]; LPSfun(pat, lps, M); while(i < N){ if(s[i] == pat[j]){ i++; j++; if(j == M){ cout << "index at: " << i-j << endl; j = lps[j-1]; } } else { if(j == 0) i = i+1; else j = lps[j-1]; } } } int main(int argc, const char * argv[]) { char s[] = "ABABDABACDABABCABAB"; char pat[] = "ABABCABAB"; KMP(pat, s); naive(pat, s); return 0; } ``` ## Reference