New and Old Problems in Pattern Matching

Wojcieh Szpankowski

Computer Science Department, Purdue University (USA)

Algorithms Seminar

June 25, 2001

[summary by Mireille Régnier]

A properly typeset version of this document is available in postscript and in pdf.

If some fonts do not look right on your screen, this might be fixed by configuring your browser (see the documentation here).
Abstract
This talk presents three problems in pattern matching and their analysis. Different methods are used, that rely on complex analysis and probability theory.

## 1  Statement of the Problems

Some pattern H (or a set H of patterns) is searched in a text T. The text T is generated by a random probabilistic source that is either a Bernoulli source or a Markov source or a mixing source. In the string matching and the subsequence matching problems, H is given: the model is deterministic. In the repetitive patterns problem, in Section 4, H is a string of T repeated elsewhere.

## 2  String Matching

One counts the number of occurrences of a given word H or a given finite set of words, H, in a text of size n. This number is denoted On(H) or On(H). This counting relies on the decomposition of the text T onto languages, the so-called initial, minimal, and tail languages.
Definition 1   Given two strings H and F, the overlap set is the set of suffixes of H that are also prefixes of F. The suffixes of F in the associated factorizations of F form the correlation set AH,F. In the Bernoulli model, one defines the correlation polynomial of H and F as
AH,F(z) =
 å w Î AH,F
P(w) z|w|.
When H is equal to F, AH,H is named the autocorrelation set and denoted AH,H; the autocorrelation polynomial is defined as
AH(z) =
 å w Î AH,H
P(w) z|w|.

For example, let H = 11011 and F=1110. Then the overlap set of H and F is {11,1} and the correlation set is AH,F={10,110}. Similarly, AF,H={11}. It is worth noticing that AF,H ¹ AH,F. Intuitively, the concatenation of a word in AH,F to H creates an (overlapping) occurrence of F.
Definition 2   Let H be a given word.
1. The initial language R is the set of words containing only one occurrence of H, located at the right end.
2. The tail language U is defined as the set of words u such that Hu has exactly one occurrence of H, which occurs at the left end.
3. The minimal language M is the set of words w such that Hw has exactly two occurrences of H, located at its left and right ends.

With these notations, any text that contains exactly k occurrences of H, k ³ 1, rewrites unambiguously as
rm1... mk-1u
where rÎR, miÎM, and uÎU. In other words, this set Tk of words satisfies Tk=RMk-1U. The power of this approach comes from the equations that can be written on these languages, that translate into equations on their generating functions in the Bernoulli model and the Markov model. Moreover, it turns out that these generating functions---hence the whole counting problem---only depend on the probability of H, denoted P(H), and the so-called correlation set.
Theorem 1   Let H be a given pattern of size m, and T be a random text generated by a Bernoulli model. The generating function of the set Tk satisfies
Tk(z)
=zm P(H)
(DH(z)+1-z)k-1
DH(z)k+1
,     k³1,

T0(z)
=
AH(z)
DH(z)

where
DH(z) = (1-z) AH(z) + zm P(H).
Moreover, the bivariate generating function satisfies
T(z,u) =
 å k
Tk(z) uk =
u
1-u
DH(z)+1-z
DH(z)
zm P(H)
DH(z) 2

These results extend to the Markovian model and to the case of multiple pattern matching .

## 3  Subsequence Matching

A pattern W = w1 ... wm is hidden in a text T if there exist indices 1 £ i1 < ... <im £ n such that ti1=w1, ..., tim=wm. For example, date is hidden 4 times in the text hidden pattern but it is not a substring. We focus on cases where the sequence of indices satisfies additional constraints ij+1 -ij £ dj, where dj is either an integer or ¥. Such a sequence is called an occurrence. One denotes (d1 ,... ,dm-1) by D. For example, when D = (3,2,¥ , 1,¥ ,¥ ,4, ¥) the set I =(5,7,9,18,19,22,30,33,50), satisfies the constraints.

The number of occurrences, Wn, is asymptotically Gaussian. This is proved in  by the moments method: all moments of the properly normalized random variable converge to the corresponding moments of the Gaussian law. For any sequence I that satisfies the constraints, one denotes XI the random variable that is 1 if ti1=w1, ..., tim=wm. Then,
Wn =
 å I
XI.

The computation of the moments relies on a generalization of correlation sets. Let
U ={u1, ...,ub-1}
be the subset of indices j for which dj= ¥. Any occurrence I satisfying the constraints can be divided into b blocks:
[i1,iu1],[iu1 +1,iu2], ..., [iub-1+1,im].
The collection of these blocks is called the aggregate of I and denoted a (I). In the example above, the aggregate a (I) is
a (I) =[5,9],[18,19],,[30,33],.

##### Deriving the mean
The collection of occurrences of W can be described as
A * × {w1} × A £ d1 × {w2 } × ... × A £ dm-1 × {wm} × A *,
where A is the alphabet and A £ dj is the collection of words of size less than or equal to dj. It follows that the generating function of expectations is
 å n
E (Wn) zn =
1
(1-z)b-1
×
 m Õ i=1
pwi z ×
 Õ i ÏU
1-zdi
1-z
,
where pw-i is the probablity of character wi. Hence, the expectation satisfies
E (Wn) =
nb
b!
 Õ i ÏU
di
 m Õ i=1
pwi æ
ç
ç
è
1+O æ
ç
ç
è
1
n
ö
÷
÷
ø
ö
÷
÷
ø
(1)

##### Deriving the variance and higher moments
The variance rewrites
Var (Wn) =
 å I,J
E (XI XJ) -E (XI)E (XJ).
In the Bernoulli model, the two random variables XI and XJ are independent whenever the blocks of I and J do not overlap. Hence, the contribution to the variance is zero. If a (I) and a (J) overlap, one defines the agreggate a (I,J) as the set of blocks obtained by merging the blocks of a (I) and a (J) that overlap. The number of blocks in a (I,J), denoted b (I,J), is upper bounded by 2b-1. For such a pair (I,J), the text can be rewritten as an element of the language
A*×B1×A*×...×Bb(I,J)×A*
and the generating function of the covariance rewrites
 å n
Var (Wn) zn =
 å p ³ 1
 å b(I,J) =2b-p
1
(1-z)2b-p
Pp(z),
where Pp are polynomials of the variable z that generalize the correlation polynomials defined in  (see Definition 1). The asymptotic order of each term is n2b-p. Hence, the dominating contribution is due to the intersecting pairs such that b (I,J)=2b-1, and
Var (Wn) ~ n2b-1 s 2
where the variance coefficient s can be easily evaluated for any given pattern by dynamic programming.

The proof is similar for higher moments.

## 4  Repetitive Pattern Matching

Given a pattern H found in a text T, one searches for a second approximate occurrence of H. A word F is a D-approximate occurrence of a word H if the Hamming distance between F and H is smaller than D. Recall that the Hamming distance between two words of size m, say H = H1 ... Hm and F =F1 ... Fm is
dH(H,F)=
 m å i=1
1Hi ¹ Fi.
The usual parameters on trees, such as the depth of insertion, height, fill-up, ..., are extended in the approximate case. Notably:
Definition 3   The depth Ln is the largest integer K such that
 min { d ( Tii-K+1,Tnn+K ) | 1£ i£ n-K+1 } £ D.

Rényi's entropy is generalized. Given a word H, the D-ball with center H, denoted BD(H), is the set of words that are within distance D.
Definition 4   Given a text T, Rényi's entropy of order 0 is
r0(D) =
 lim k ® ¥
 -E [ logP ( BD(T1k) ) ]
k
,
when this limit exists.

Asymptotic properties are proved for the depth, the heigth and the fill-up, that depend on Rényi's entropy. Notably, the convergence in probability of the depth of insertion in a trie extends for this approximate scheme:
Ln
logn
®
1
r0(D)
,     n®¥.

The proof relies on the subadditive ergodic theorem and asymptotic equipartition property.

## References


Flajolet (P.), Guivarc'h (Y.), Vallée (V.), and Szpankowski (W.). -- Hidden patterns statistics. In ICALP'01. -- 2001. Proceedings of the 28th ICALP Conference, Crete, Greece, July 2001.


Guibas (L. J.) and Odlyzko (A. M.). -- String overlaps, pattern matching, and nontransitive games. Journal of Combinatorial Theory. Series A, vol. 30, n°2, 1981, pp. 183--208.


Régnier (M.) and Szpankowski (W.). -- On pattern frequency occurrences in a Markovian sequence. Algorithmica, vol. 22, n°4, 1998, pp. 631--649. -- Average-case analysis of algorithms.

This document was translated from LATEX by HEVEA.