Pattern Search Problem

There are several algorithm to solve pattern search from a provided string problem.

To solve such problem using Naive method(for all the char of the provided string check whether the pattern  contains or not ). It takes O(m*n),here m = string length of pattern and n= string length of the text.

There are several algorithm to solve such problem ,like KMP (Knuth-Morris-Pratt) agorithm,Boyer-Moore String Search algorithm,Rabin-Karp String Search algorithm,Aho-corasick String Match algorithm .

KMP (Knuth-Morris-Pratt) agorithm:

The two string,

txt[] = “AAAAAAAAAAAAAAAAAB”

pat[] = “AAAAB”

If we want to check ,is Contain pat[] in txt[] using KMP algorithm?

To avoid single sliding procedure like naive algorithm we first determine a prefix array of pattern,

What is Prefix array?

An array which indicates the longest proper prefix length which is also the longest proper suffix length.

Proper prefix/suffix is the prefix/suffix that is not the string itself.

So our first step is to find the longest common (proper)prefix and suffix of the sub patterns starting from 0 to m-1.

So if the pattern[]=”ABABAB”; length = 6;

Our target Prefix array will be of size 6,PreArr[6]; Each of the entry of the array indicates the longest common proper prefix and suffix of the sub pattern length [0 to i].

Now, PreArr[0]=0 ,as it is the empty string,

PreArr[1] = 0, cause: in “AB” prefix is “A” but suffix is “B”, so no common prefix/suffix;

PreArr[2]=1;cause: in “ABA” prefix : A,AB ; suffix : BA,A ; so common prefix/suffix = A;

PreArr[3]=2;cause: in “ABAB” prefix : A,AB,ABA ; suffix : BAB,A,AB ; so common prefix/suffix = AB;

PreArr[4]=3;cause: in “ABABA” prefix : A,AB,ABA,ABAB ; suffix : ABA,BABA,BA,A ; so common prefix/suffix = ABA;

PreArr[5]=0,cause: in “ABABAB” prefix : A,AB,ABA,ABAB,ABABA ; suffix : BABAB,ABAB,BAB,BA,B,; so no common prefix/suffix ;

To determine the common longest prefix and suffix array length require O(m*m) time complexity.

There is an O(m) time complexity way to determine prefix array.

 

void computeLPSArray(char *pat, int M, int *lps)
{
    int len = 0;  // lenght of the previous longest prefix suffix
    int i;

    lps[0] = 0; // lps[0] is always 0
    i = 1;

    // the loop calculates lps[i] for i = 1 to M-1
    while (i < M)
    {
       if (pat[i] == pat[len])
       {
         len++;
         lps[i] = len;
         i++;
       }
       else // (pat[i] != pat[len])
       {
         if (len != 0)
         {
           // This is tricky. Consider the example AAACAAAA and i = 7.
           len = lps[len-1];

           // Also, note that we do not increment i here
         }
         else // if (len == 0)
         {
           lps[i] = 0;
           i++;
         }
       }
    }
}

After computing the prefix array we use that array to find the pattern.

How we use the prefix array?

when we find a match we increment both txt[] string index and pattern string index.

If we find a mismatch we lookup the prefix array to find the longest common prefix and suffix for the matched index that is j-1(as previously matched) and we start to match checking from that lps[j-1] index of the pattern.

void KMPSearch(char *pat, char *txt)
{
    int M = strlen(pat);
    int N = strlen(txt);

    // create lps[] that will hold the longest prefix suffix values for pattern
    int *lps = (int *)malloc(sizeof(int)*M);
    int j  = 0;  // index for pat[]

    // Preprocess the pattern (calculate lps[] array)
    computeLPSArray(pat, M, lps);

    int i = 0;  // index for txt[]
    while (i < N)
    {
      if (pat[j] == txt[i])
      {
        j++;
        i++;
      }

      if (j == M)
      {
        printf("Found pattern at index %d \n", i-j);
        j = lps[j-1];
      }

      // mismatch after j matches
      else if (i < N && pat[j] != txt[i])
      {
        // Do not match lps[0..lps[j-1]] characters,
        // they will match anyway
        if (j != 0)
         j = lps[j-1];
        else
         i = i+1;
      }
    }
    free(lps); // to avoid memory leak
}

At worst case: when all index values of prefix array is 0 time complexity is as naive method that is

O(m*m).

Related Problems are:

uva 10679 , uva 11512 , uva 719 ,uva 11107 , uva 11507

Leave a Reply

Your email address will not be published. Required fields are marked *