A Novel Mathematical Model for Similarity Search in Pattern Matching Algorithms

Show more

1. Introduction

Similarity searches commonly known as approximate string matching allow for some mismatches between the text and the pattern. Similarity searches are widely used in computational biology, search engines, data mining, signal processing, digital dictionaries, and many other applications where exact and similar patterns are to be searched over a given text. Most of the algorithms developed prior to the 1990’s such as Morris and Pratt [1], Aho and Corasick [2], Knuth et al. [3], Boyer and Moore [4], Horspool [5], and Karp and Rabin [6], were designed to search for exact pattern matches in the text. However, with some modifications, they can also be used to find approximate matches.

The term “distance” is often used when comparing two strings for similarity. A lesser distance is expected for a greater similarity. The Hamming distance [7] approximates the similarity between two strings of equal length by measuring the number of character mismatches at corresponding locations. Let *R* and *S* be two non-empty equal-length strings of size *M* such that
$R={r}_{0}{r}_{1}\cdots {r}_{M-1}$ and
$S={s}_{0}{s}_{1}\cdots {s}_{M-1}$ *.* Then, the Hamming distance between *R* and *S *is given by *ham*(*R*,* S*) = number of locations where
${r}_{i}\ne {s}_{i}$,
$0\le i\le M-1$. Clearly, if *R *and *S *are identical, then* ham*(*R*,* S*)* *=* *0*. *For a given pattern *P* modern applications may require database to be searched for exact or approximate matches of *P*. In the literature, this problem is sometimes referred as “*string matching with k mis**matches*” which can be stated as follows: Given the text
$T={t}_{0}{t}_{1}\cdots {t}_{N-1}$ of size N, and the pattern
$P={p}_{0}{p}_{1}\cdots {p}_{M-1}$ of size M defined over a finite set of text character called alphabet λ. Let hd_{i} be the Hamming distance such that
$h{d}_{i}=ham\left(P,{t}_{i}{t}_{i+1}\cdots {t}_{i+M-1}\right)$, where,
$0\le i\le \left(N-M\right)$. Then, for a given integer k such that
$0\le k\le M$, report all locations iin T where
$h{d}_{i}\le k$.

To solve the “k-mismatch” problem, Landau and Vishkin [8] proposed a suffix-tree-based algorithm. They used a suffix tree to preprocess the text and the pattern in *O*(*N *+* M*) time, and then report *k *mismatches in *O*(*kN*) time. However, the algorithm requires *O*(*k*(*M *+* N*)) space, which is a concern when N becomes large. Galil and Giancarlo [9] presented an algorithm that uses *O*(*kN*) time and *O*(*M*) space to solve the same problem. Amir et al. [10] developed an algorithm to identify all such locations in
$O\left(N\sqrt{k}{\mathrm{log}}_{2}k\right)$ * *time. As “k” increases the performance of these algorithms deteriorates, and approaches *O*(*MN*) as “*k*” approaches *M*. For *k*=*M*, the problem becomes independent of *k *and reduces to “string matching with mismatches”, which can be stated as follows: Given the text
$T={t}_{0}{t}_{1}\cdots {t}_{N-1}$, and the pattern
$P={p}_{0}{p}_{1}\cdots {p}_{M-1}$,* *for every i such that
$0\le i\le \left(N-M\right)$, output the Hamming distance hd_{i} such that
$h{d}_{i}=ham\left(P,{t}_{i}{t}_{i+1}\cdots {t}_{i+M-1}\right)$. Abrahamson [11] applied a technique known as the Boolean convolution of the pattern and the text to solve the problem in
$O\left(N\sqrt{M\mathrm{log}M}\right)$ time and *O*(*N*) space. Using a linked list, Yates and Perleberg [12] presented *O*(*N* + *Nf*_{max}) time and *O*(2*M *+* σ*) space algorithm, where *f*_{max} is the frequency of the most commonly occurring character in the pattern. Many of these algorithms are covered in Crochemore et al. [13]. For a detailed survey on approximate string matching refer to Navarro [14], Boytsov [15]. Most of the algorithms developed in the past use data-structures and methods outlined above to create indexes over the text or the pattern to accelerate the search process. However, their costly maintenance has always been a cause of concern.

2. Assumptions and Notations

We use the following assumptions and notations. “*λ*” represents a finite non-empty ordered set of characters called an alphabet, such that
$\left|\lambda \right|=\sigma $ * *is the size of the alphabet. “*λ _{i}*”

3. The Model

Traditional, pattern matching algorithms attempts to align P from the first character of T which may result in losing some valuable information regarding the character matches. Let’s consider an example: suppose pattern *P *=* ABCDEF*, and text *T *=* CDEFABCD*. Assuming 0 being the initial index of the strings, a four character match is found when *P* is aligned at the index −2 in *T*. All traditional algorithms would lose this information as attempts are made to align *P* from index 0 in *T*. In other words, traditional algorithms consider the indexes in the range
$0\le i\le N-M$, where *i* represent the text index. However, as we have seen, considering i’s in the range
$\left(1-M\right)\le i\le \left(N-1\right)$ * *may provide additional information, particularly when the pattern is considerably large, and the character matches exist at opposite ends of the pattern or text. The lemma and proof given below are already discussed in our previous work [16]. However, a brief discussion is provided below for a prompt reference.

The Lemma

Let *T* and P be non-empty text and pattern strings defined over an alphabet *λ* such that:
$T={t}_{0}{t}_{1}\cdots {t}_{N-1}$ and
$P={p}_{0}{p}_{1}\cdots {p}_{M-1}$. Corresponding to every index j in P, we define a set *R _{j} *such that
${R}_{j}=\left\{i\u2013j|{t}_{i}={p}_{j},\forall 0\le i\le N-1\right\}$. Further, let

1) Every integer
$s\in S$ represents an index in *T* where an exact match of *P* is found when *P* is aligned at *s* in *T*.

2) The cardinality |*S*| represents the number of occurrences of *P* in *T*. If |*S*| =0 then *P* is not present in *T*.

Proof: Suppose *P* is found in *T* when *P* is aligned at index *s* in *T*. Then, we have to show that
$s\in S$ *. *If *P* appears in *T* at shift (index)*s* that means all *M* characters of pattern
$P={p}_{0}{p}_{1}\cdots {p}_{M-1}$ * *can be successfully matched with
$T={t}_{s}{t}_{s+1}{t}_{s+2}\cdots {t}_{s+M-1}$. Hence,
$\forall j$ in P such that
$0\le j\le M-1$, we have
${T}_{s+j}={p}_{j}$ *. *Now, from the definition of *R _{j}*, we get
${R}_{j}=\left\{\left(s+j\right)-j=s\right\}$ ⇒
$\forall 0\le j\le M-1$,

Example 1: Let *T *=* GCABABABCBA* be a text array and *P *=* ABAB *be a pattern array of size 4. For the given text we have i such that
$0\le i\le 10$ *. *For each shift j (
$0\le j\le 3$ )* *in P we create a set *R _{j} *such that
${R}_{j}=\left\{i-j|T\left[i\right]={p}_{j},0\le i\le 10\right\}$

Corollary 1: Given the M sets *R _{j} *defined as in the lemma given above

Proof: We have already proved that any
$s\in S={R}_{0}\cap {R}_{1}\cap {R}_{2}\cap \cdots \cap {R}_{M-1}$ * *represents exactly *M* character matches of
$P={p}_{0},{p}_{1},{p}_{2},\cdots ,{p}_{M-1}$ * *at shift *s* in T ⇒ each
$s\in {R}_{j}$ represents a single character match *p _{j} *of

Example 2: Consider example 1, the integer 6 appears in three sets: *R*_{0},* R*_{1} and *R*_{3}. Hence, the frequency of the integer 6 is equal to *f*_{6}* *=* *3. Which shows three character matches of P when *P* is aligned with shift 6 in *T*. Similarly,
${f}_{0}={f}_{8}=2$ reveal two character matches of *P* when *P* is aligned with locations 0 and 80 in *T*.

4. Model Implementation

We present another example to see how the model described above can be implemented successfully. Consider the pattern array P = FCTHZCTZCF, and the text array T = SKRFCTHZCTZCFTYCTZGHTTCTHZTHZFCTHZCTZCFT. The first column of Table 1 below summarizes all unique characters of the pattern. The second and third columns of the table present the shifts of the corresponding pattern characters in the text and in the pattern respectively. Each row of the last column represents set *R _{j}* as defined in the lemma.

As noted above, the frequency of occurrence *f _{s} *of an integer “s” in all set represents the number of character matches of P at shift

Table 1. Hit index.

Figure 1. The hit array.

Figure 2. The re-indexed hit array.

has been hit 5 times, location 14 has been hit 4 times, representing 5 and 4 character matches of P at alignment locations 21 and 14 respectively.

The method described above has two shortcomings. First, we need an array of size N, which is undesirable for large values of N. To resolve this, the hit[] can be assumed as cyclic that allow us to reuse the previously used array cells. Second, as we have seen in previous examples that we might get negative values of *s* for which the array cell does not exist. For example, *s *=* *−1 in the two sets *R*_{5} and *R*_{6} in Table 1, suggesting that we can obtain two character matches if *P* is aligned at location −1 in T. Such hits are ignored in Figure 1 because as we cannot record hits at negative array locations. This issue can be resolved by assuming the initial index of the text file to be ≥M − 1 rather than 0. This will ensure that s ≥ 0 for all
$s\in {R}_{j}$. Figure 2 shows the re-indexed version of the array, where, each text character location is hyped by M. Therefore, index M = 10 in the re-indexed array corresponds to location 0 of the actual text array, index 9 corresponds to location −1, and so on. The benefit is straightforward; with the re-indexed array, we can say that 2 hits are found if the pattern is aligned at location 9, which corresponds to location -1 in the actual text array.

5. Conclusion

Highly practical solutions can be drawn based on the model we have presented in this paper. The novel approach we have followed may become an alternative to existing solutions.

Acknowledgements

We thank anonymous reviewers for their comments and suggestions.

References

[1] Morris, J. and Pratt, V. (1970) A Linear Pattern Matching Algorithm. Technical Report 40, Computing Center, University of California, Berkeley.

[2] Aho, A. and Corasick, M. (1975) Efficient String Searching: An Aid to Bibliographic Search. Communications of the ACM, 18, 333-340.

https://doi.org/10.1145/360825.360855

[3] Knuth, D., Morris, J. and Pratt, V. (1977) Fast Pattern Matching in Strings. SIAM Journal on Computing, 6, 323-350.

https://doi.org/10.1137/0206024

[4] Boyer, R. and Moore, S. (1977) A Fast String Searching Algorithm. Communications of the ACM, 20, 762-772.

https://doi.org/10.1145/359842.359859

[5] Horspool, N. (1980) Practical Fast Searching in Strings. Software Practice and Experience, 10, 501-506.

https://doi.org/10.1002/spe.4380100608

[6] Karp, R. and Rabin, M. (1987) Efficient Randomized Pattern-Matching Algorithms. IBM Journal Research and Development, 31, 249-260.

https://doi.org/10.1147/rd.312.0249

[7] Hamming, R. (1950) Error Detecting and Error Correcting Codes. Bell System Technical Journal, 29, 147-160.

https://doi.org/10.1002/j.1538-7305.1950.tb00463.x

[8] Landau, G. and Vishkin, U. (1986) Efficient String with k Mismatches. Theoretical Computer Science, 43, 239-249.

https://doi.org/10.1016/0304-3975(86)90178-7

[9] Galil, Z. and Giancarlo, R. (1986) Improved String Matching with k Mismatches. SIGACT News, 17, 52-54.

https://doi.org/10.1145/8307.8309

[10] Amir, A., Lewenstein, M. and Porat, E. (2004) A Faster Algorithms for String Matching with k Mismatches. Journal of Algorithms, 50, 257-275.

https://doi.org/10.1016/S0196-6774(03)00097-X

[11] Abrahamson, K. (1987) Generalized String Matching. SIAM Journal of Computing, 16, 1039-1051.

https://doi.org/10.1137/0216067

[12] Baeza-Yates, R. and Perleberg, C.H. (1996) Fast and Practical String Matching. Information Processing Letters, 59, 21-27.

https://doi.org/10.1016/0020-0190(96)00083-X

[13] Crochemore, M., Hancart, C. and Lecroq, T. (2007) Algorithms on Strings. Cambridge University Press, Cambridge.

https://doi.org/10.1017/CBO9780511546853

[14] Navarro, G. (2001) A Guided Tour to Approximate String Matching. ACM Computing Surveys, 33, 31-88.

https://doi.org/10.1145/375360.375365

[15] Boytsov, L. (2011) Indexing Methods for Approximate Dictionary Searching: Comparative Analysis. ACM Journal of Experimental Algorithmics, 16, Article No. 1.1.

https://doi.org/10.1145/1963190.1963191

[16] Vinod-Prasad, P. (2016) A Novel Algorithm for String Matching with Mismatches. 5th International Conference of Pattern Recognitions Applications and Methods, Rome, February 2016, 638-644.

https://doi.org/10.5220/0005752006380644