There are several algorithm to solve pattern search from a provided string problem.
To solve such problem using Naive method(for all the char of the provided string check whether the pattern contains or not ). It takes O(m*n),here m = string length of pattern and n= string length of the text.
There are several algorithm to solve such problem ,like KMP (Knuth-Morris-Pratt) agorithm,Boyer-Moore String Search algorithm,Rabin-Karp String Search algorithm,Aho-corasick String Match algorithm .
KMP (Knuth-Morris-Pratt) agorithm:
The two string,
txt[] = “AAAAAAAAAAAAAAAAAB”
pat[] = “AAAAB”
If we want to check ,is Contain pat[] in txt[] using KMP algorithm?
To avoid single sliding procedure like naive algorithm we first determine a prefix array of pattern,
What is Prefix array?
An array which indicates the longest proper prefix length which is also the longest proper suffix length.
Proper prefix/suffix is the prefix/suffix that is not the string itself.
So our first step is to find the longest common (proper)prefix and suffix of the sub patterns starting from 0 to m-1.
So if the pattern[]=”ABABAB”; length = 6;
Our target Prefix array will be of size 6,PreArr[6]; Each of the entry of the array indicates the longest common proper prefix and suffix of the sub pattern length [0 to i].
Now, PreArr[0]=0 ,as it is the empty string,
PreArr[1] = 0, cause: in “AB” prefix is “A” but suffix is “B”, so no common prefix/suffix;
PreArr[2]=1;cause: in “ABA” prefix : A,AB ; suffix : BA,A ; so common prefix/suffix = A;
PreArr[3]=2;cause: in “ABAB” prefix : A,AB,ABA ; suffix : BAB,A,AB ; so common prefix/suffix = AB;
PreArr[4]=3;cause: in “ABABA” prefix : A,AB,ABA,ABAB ; suffix : ABA,BABA,BA,A ; so common prefix/suffix = ABA;
PreArr[5]=0,cause: in “ABABAB” prefix : A,AB,ABA,ABAB,ABABA ; suffix : BABAB,ABAB,BAB,BA,B,; so no common prefix/suffix ;
To determine the common longest prefix and suffix array length require O(m*m) time complexity.
There is an O(m) time complexity way to determine prefix array.
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 |
void computeLPSArray(char *pat, int M, int *lps) { int len = 0; // lenght of the previous longest prefix suffix int i; lps[0] = 0; // lps[0] is always 0 i = 1; // the loop calculates lps[i] for i = 1 to M-1 while (i < M) { if (pat[i] == pat[len]) { len++; lps[i] = len; i++; } else // (pat[i] != pat[len]) { if (len != 0) { // This is tricky. Consider the example AAACAAAA and i = 7. len = lps[len-1]; // Also, note that we do not increment i here } else // if (len == 0) { lps[i] = 0; i++; } } } } |
After computing the prefix array we use that array to find the pattern.
How we use the prefix array?
when we find a match we increment both txt[] string index and pattern string index.
If we find a mismatch we lookup the prefix array to find the longest common prefix and suffix for the matched index that is j-1(as previously matched) and we start to match checking from that lps[j-1] index of the pattern.
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 |
void KMPSearch(char *pat, char *txt) { int M = strlen(pat); int N = strlen(txt); // create lps[] that will hold the longest prefix suffix values for pattern int *lps = (int *)malloc(sizeof(int)*M); int j = 0; // index for pat[] // Preprocess the pattern (calculate lps[] array) computeLPSArray(pat, M, lps); int i = 0; // index for txt[] while (i < N) { if (pat[j] == txt[i]) { j++; i++; } if (j == M) { printf("Found pattern at index %d \n", i-j); j = lps[j-1]; } // mismatch after j matches else if (i < N && pat[j] != txt[i]) { // Do not match lps[0..lps[j-1]] characters, // they will match anyway if (j != 0) j = lps[j-1]; else i = i+1; } } free(lps); // to avoid memory leak } |
At worst case: when all index values of prefix array is 0 time complexity is as naive method that is
O(m*m).
Related Problems are: