There are several algorithm to solve pattern search from a provided string problem.
To solve such problem using Naive method(for all the char of the provided string check whether the pattern contains or not ). It takes O(m*n),here m = string length of pattern and n= string length of the text.
There are several algorithm to solve such problem ,like KMP (Knuth-Morris-Pratt) agorithm,Boyer-Moore String Search algorithm,Rabin-Karp String Search algorithm,Aho-corasick String Match algorithm .
KMP (Knuth-Morris-Pratt) agorithm:
The two string,
txt[] = “AAAAAAAAAAAAAAAAAB”
pat[] = “AAAAB”
If we want to check ,is Contain pat[] in txt[] using KMP algorithm?
To avoid single sliding procedure like naive algorithm we first determine a prefix array of pattern,
What is Prefix array?
An array which indicates the longest proper prefix length which is also the longest proper suffix length.
Proper prefix/suffix is the prefix/suffix that is not the string itself.
So our first step is to find the longest common (proper)prefix and suffix of the sub patterns starting from 0 to m-1.
So if the pattern[]=”ABABAB”; length = 6;
Our target Prefix array will be of size 6,PreArr[6]; Each of the entry of the array indicates the longest common proper prefix and suffix of the sub pattern length [0 to i].
Now, PreArr[0]=0 ,as it is the empty string,
PreArr[1] = 0, cause: in “AB” prefix is “A” but suffix is “B”, so no common prefix/suffix;
PreArr[2]=1;cause: in “ABA” prefix : A,AB ; suffix : BA,A ; so common prefix/suffix = A;
PreArr[3]=2;cause: in “ABAB” prefix : A,AB,ABA ; suffix : BAB,A,AB ; so common prefix/suffix = AB;
PreArr[4]=3;cause: in “ABABA” prefix : A,AB,ABA,ABAB ; suffix : ABA,BABA,BA,A ; so common prefix/suffix = ABA;
PreArr[5]=0,cause: in “ABABAB” prefix : A,AB,ABA,ABAB,ABABA ; suffix : BABAB,ABAB,BAB,BA,B,; so no common prefix/suffix ;
To determine the common longest prefix and suffix array length require O(m*m) time complexity.
There is an O(m) time complexity way to determine prefix array.
void computeLPSArray(char *pat, int M, int *lps)
{
int len = 0; // lenght of the previous longest prefix suffix
int i;
lps[0] = 0; // lps[0] is always 0
i = 1;
// the loop calculates lps[i] for i = 1 to M-1
while (i < M)
{
if (pat[i] == pat[len])
{
len++;
lps[i] = len;
i++;
}
else // (pat[i] != pat[len])
{
if (len != 0)
{
// This is tricky. Consider the example AAACAAAA and i = 7.
len = lps[len-1];
// Also, note that we do not increment i here
}
else // if (len == 0)
{
lps[i] = 0;
i++;
}
}
}
}
After computing the prefix array we use that array to find the pattern.
How we use the prefix array?
when we find a match we increment both txt[] string index and pattern string index.
If we find a mismatch we lookup the prefix array to find the longest common prefix and suffix for the matched index that is j-1(as previously matched) and we start to match checking from that lps[j-1] index of the pattern.
void KMPSearch(char *pat, char *txt)
{
int M = strlen(pat);
int N = strlen(txt);
// create lps[] that will hold the longest prefix suffix values for pattern
int *lps = (int *)malloc(sizeof(int)*M);
int j = 0; // index for pat[]
// Preprocess the pattern (calculate lps[] array)
computeLPSArray(pat, M, lps);
int i = 0; // index for txt[]
while (i < N)
{
if (pat[j] == txt[i])
{
j++;
i++;
}
if (j == M)
{
printf("Found pattern at index %d \n", i-j);
j = lps[j-1];
}
// mismatch after j matches
else if (i < N && pat[j] != txt[i])
{
// Do not match lps[0..lps[j-1]] characters,
// they will match anyway
if (j != 0)
j = lps[j-1];
else
i = i+1;
}
}
free(lps); // to avoid memory leak
}
At worst case: when all index values of prefix array is 0 time complexity is as naive method that is
O(m*m).
Related Problems are: