New and Old Problems in Pattern Matching
Wojciech Szpankowski
Computer Science Department, Purdue University (USA)
Algorithms Seminar
June 25, 2001
[summary by Mireille Régnier]
Abstract
This talk presents three problems in pattern matching and their
analysis. The methods used rely on complex analysis and probability
theory.
1 Statement of the Problems
A pattern H (or a set H of patterns) is searched for in a text
T. The text T is generated by a random probabilistic source,
either a Bernoulli source, a Markov source, or a mixing
source. In the string matching and subsequence matching problems,
H is given: the model is deterministic. In the repetitive
patterns problem of Section 4, H is a substring of T
repeated elsewhere.
2 String Matching
One counts the number of occurrences of a given word H, or of a given
finite set of words H, in a text of size n. This number is
denoted O_{n}(H). The counting relies on a
decomposition of the text T into languages, the so-called initial, minimal, and tail languages.
Definition 1
Given two strings H and F, the overlap set is the set of
suffixes of H that are also prefixes of F.
For each overlap w, write F = wv; the remainders v of these
factorizations of F form the correlation set A_{H,F}.
In the Bernoulli model, one defines the correlation polynomial
of H and F as

A_{H,F}(z) = ∑_{w ∈ A_{H,F}} P(w) z^{|w|}.

When H is equal to F, the set A_{H,H} is called the
autocorrelation set and denoted A_{H};
the autocorrelation polynomial is defined as

A_{H}(z) = ∑_{w ∈ A_{H}} P(w) z^{|w|}.
For example, let H = 11011 and F=1110. Then the overlap set of H
and F is {11,1} and the correlation set is
A_{H,F} = {10, 110}. Similarly, A_{F,H} = {11}. It is worth
noticing that A_{F,H} ≠ A_{H,F}. Intuitively, the
concatenation of a word in A_{H,F} to H creates an (overlapping)
occurrence of F.
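As a concrete check, the overlap and correlation sets of the example can be computed mechanically. This is a minimal sketch; the function names are ours, not from the talk:

```python
def overlap_set(H, F):
    """Suffixes of H that are also prefixes of F (nonempty overlaps)."""
    return {H[i:] for i in range(len(H)) if F.startswith(H[i:])}

def correlation_set(H, F):
    """For each overlap w (so that F = w v), collect the remainder v of F."""
    return {F[len(w):] for w in overlap_set(H, F)}

H, F = "11011", "1110"
print(overlap_set(H, F))       # {'11', '1'}
print(correlation_set(H, F))   # {'10', '110'}
print(correlation_set(F, H))   # {'11'}
```

For the autocorrelation set, `correlation_set(H, H)` includes the empty word coming from the trivial full overlap of H with itself.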
Definition 2
Let H be a given word.
- The initial language R is the set of words containing
only one occurrence of H, located at the right end.
- The tail language U is the set of words u
such that Hu has exactly one occurrence of H, located at the
left end.
- The minimal language M is the set of words w such that
Hw has exactly two occurrences of H, located at its left and right
ends.
With these notations, any text that contains exactly k occurrences
of H, k ≥ 1, rewrites unambiguously as
r m_{1} ... m_{k-1} u,
where r ∈ R, m_{i} ∈ M, and u ∈ U. In other words, the
set T_{k} of such texts satisfies T_{k} = R M^{k-1} U. The power of
this approach comes from the equations that can be written on these
languages, which translate into equations on their generating functions
in the Bernoulli model and the Markov model. Moreover, it
turns out that these generating functions, hence the whole counting
problem, depend only on the probability of H, denoted P(H), and
the so-called correlation set.
Theorem 1
Let H be a given pattern of size m, and T be a random text
generated by a Bernoulli model. The generating function of the set
T_{k} satisfies

T_{k}(z) = z^{m} P(H) (D_{H}(z) + z - 1)^{k-1} / D_{H}(z)^{k+1},   k ≥ 1,

T_{0}(z) = A_{H}(z) / D_{H}(z),

where

D_{H}(z) = (1 - z) A_{H}(z) + z^{m} P(H).

Moreover, the bivariate generating function satisfies

T(z,u) = ∑_{k ≥ 1} T_{k}(z) u^{k}
       = u z^{m} P(H) / ( D_{H}(z) ( D_{H}(z) - u (D_{H}(z) + z - 1) ) ).
These results extend to the Markovian model and to the case of multiple
pattern matching [3].
3 Subsequence Matching
A pattern W = w_{1} ... w_{m} is hidden in a text T if there
exist indices 1 ≤ i_{1} < ... < i_{m} ≤ n such that
t_{i_1} = w_{1}, ..., t_{i_m} = w_{m}. For example, date is
hidden 4 times in the text hidden pattern but it is not a
substring. We focus on cases where the sequence of indices satisfies
additional constraints i_{j+1} - i_{j} ≤ d_{j}, where d_{j} is either
an integer or ∞. Such a sequence is called an occurrence. One denotes (d_{1}, ..., d_{m-1}) by D. For
example, when D = (3, 2, ∞, 1, ∞, ∞, 4, ∞) the
sequence I = (5, 7, 9, 18, 19, 22, 30, 33, 50) satisfies the constraints.
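The unconstrained count from the example (date is hidden 4 times in hidden pattern) can be reproduced by a short dynamic program. The helper below and its interface are our own sketch, not code from [1]:

```python
def count_occurrences(w, t, d=None):
    """Number of index sequences i_1 < ... < i_m with t[i_j] == w[j]
    and i_{j+1} - i_j <= d[j]; d[j] = None means no constraint."""
    m, n = len(w), len(t)
    if d is None:
        d = [None] * (m - 1)
    # f[j][i] = number of occurrences of w[:j+1] whose last matched
    # position is i
    f = [[0] * n for _ in range(m)]
    for i in range(n):
        f[0][i] = int(t[i] == w[0])
    for j in range(1, m):
        for i in range(n):
            if t[i] == w[j]:
                lo = 0 if d[j - 1] is None else max(0, i - d[j - 1])
                f[j][i] = sum(f[j - 1][lo:i])
    return sum(f[m - 1])

print(count_occurrences("date", "hidden pattern"))  # 4
```

Passing a list of bounds d enforces the gap constraints; None encodes d_j = ∞.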
The number of occurrences, W_{n}, is asymptotically
Gaussian. This is proved in [1] by the method of moments:
all moments of the properly normalized random variable converge to the
corresponding moments of the Gaussian law. For any sequence I that
satisfies the constraints, one denotes by X_{I} the random variable that
is 1 if t_{i_1} = w_{1}, ..., t_{i_m} = w_{m}, and 0 otherwise. Then

W_{n} = ∑_{I} X_{I},

where the sum runs over all sequences I satisfying the constraints.
The computation of the moments relies on a generalization of correlation
sets. Let
U = {u_{1}, ..., u_{b-1}}
be the subset of indices j for which d_{j} = ∞. Any occurrence I
satisfying the constraints can be divided into b blocks:

[i_{1}, i_{u_1}], [i_{u_1 + 1}, i_{u_2}], ..., [i_{u_{b-1} + 1}, i_{m}].

The collection of these blocks is called the aggregate of I
and denoted a(I). In the example above, the aggregate is

a(I) = [5,9], [18,19], [22], [30,33], [50].
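The aggregate of the example can be computed mechanically. This is a minimal sketch (the function name is ours); a one-letter block such as [22] is reported as the pair (22, 22):

```python
import math

def aggregate(I, D):
    """Split I into blocks: a new block starts after each unbounded gap."""
    blocks, current = [], [I[0]]
    for j, i in enumerate(I[1:]):
        if math.isinf(D[j]):        # d_j = infinity: close the current block
            blocks.append((current[0], current[-1]))
            current = [i]
        else:
            current.append(i)
    blocks.append((current[0], current[-1]))
    return blocks

oo = math.inf
D = (3, 2, oo, 1, oo, oo, 4, oo)
I = (5, 7, 9, 18, 19, 22, 30, 33, 50)
print(aggregate(I, D))  # [(5, 9), (18, 19), (22, 22), (30, 33), (50, 50)]
```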
Deriving the mean
The collection of occurrences of W can be described by the language

A* × {w_{1}} × A^{< d_1} × {w_{2}} × ... × A^{< d_{m-1}} × {w_{m}} × A*,

where A is the alphabet and A^{< d_j} is the collection of words of size
strictly less than d_{j} (words of any size when d_{j} = ∞). It follows
that the generating function of the expectations is

∑_{n ≥ 0} E(W_{n}) z^{n}
  = (1/(1-z))^{2} × ∏_{i=1}^{m} p_{w_i} z × ∏_{j=1}^{m-1} (1 - z^{d_j})/(1 - z),

where p_{w_i} is the probability of character w_{i}, and the factor for
d_{j} = ∞ is 1/(1-z). Hence, the expectation satisfies

E(W_{n}) = ( ∏_{j : d_j < ∞} d_{j} ) ( ∏_{i=1}^{m} p_{w_i} ) (n^{b}/b!) ( 1 + O(1/n) ).   (1)
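The language decomposition above can be sanity-checked on the combinatorial part: the number of constrained index sequences in a text of length n should equal the coefficient of z^n in z^m (1/(1-z))^2 ∏_j (1-z^{d_j})/(1-z). The sketch below is our own; the pattern length m = 3 and gap vector D = (2, ∞) are arbitrary illustrative choices:

```python
from itertools import combinations

N = 18

def mul(a, b):
    """Product of two integer power series, truncated at order N."""
    c = [0] * (N + 1)
    for i, ai in enumerate(a):
        for j, bj in enumerate(b):
            if ai and bj and i + j <= N:
                c[i + j] += ai * bj
    return c

geom = [1] * (N + 1)                     # series of 1/(1 - z)

def gap_factor(d):
    """Words of length < d between matched letters: (1 - z^d)/(1 - z);
    d = None stands for an unbounded gap, with series 1/(1 - z)."""
    if d is None:
        return geom[:]
    return [1 if i < d else 0 for i in range(N + 1)]

def count_tuples(n, m, D):
    """Brute-force count of 1 <= i_1 < ... < i_m <= n with
    i_{j+1} - i_j <= d_j for every finite d_j."""
    return sum(
        all(D[j] is None or I[j + 1] - I[j] <= D[j] for j in range(m - 1))
        for I in combinations(range(1, n + 1), m))

m, D = 3, (2, None)
series = mul(geom, geom)                 # free text before w_1 and after w_m
for d in D:
    series = mul(series, gap_factor(d))
series = [0] * m + series[:N + 1 - m]    # multiply by z^m
for n in range(15):
    assert series[n] == count_tuples(n, m, D)
print("generating function matches brute-force counts for n < 15")
```

Multiplying the coefficients by ∏ p_{w_i} recovers E(W_n) for a memoryless source, since the letter probabilities factor out of every term.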
Deriving the variance and higher moments
The variance rewrites

Var(W_{n}) = ∑_{I,J} ( E(X_{I} X_{J}) - E(X_{I}) E(X_{J}) ).

In the Bernoulli model, the two random variables X_{I} and X_{J} are
independent whenever the blocks of I and J do not overlap; such pairs
contribute zero to the variance.
If a(I) and a(J) overlap, one defines the aggregate a(I,J)
as the set of blocks obtained by merging the overlapping blocks of a(I)
and a(J). The number of blocks in a(I,J),
denoted b(I,J), is at most 2b - 1.
For such a pair (I,J), the text can be rewritten as an element of
the language
A^{*}×B_{1}×A^{*}×...×B_{b(I,J)}×A^{*}
and the generating function of the variances rewrites

∑_{n ≥ 0} Var(W_{n}) z^{n} = ∑_{p=b}^{2b-1} P_{p}(z) / (1-z)^{p+1},

where the P_{p} are polynomials in the variable z that generalize the
correlation polynomials defined in [2] (see Definition
1). The term of index p has asymptotic order n^{p}.
Hence, the dominating contribution is due to the intersecting pairs
such that b(I,J) = 2b - 1, and

Var(W_{n}) ~ n^{2b-1} σ²,

where the variance coefficient σ² can be easily evaluated for any given
pattern by dynamic programming.
The proof is similar for higher moments.
4 Repetitive Pattern Matching
Given a pattern H found in a text T, one searches for a second approximate
occurrence of H. A word F is a D-approximate
occurrence of a word H if the Hamming distance between F and
H is smaller than D.
Recall that the Hamming distance between two words of size m, say
H = H_{1} ... H_{m} and F = F_{1} ... F_{m}, is the number of positions
at which they differ:

d(H, F) = card { i : H_{i} ≠ F_{i} }.
The usual parameters on trees, such as the depth of insertion,
height, and fillup, are extended to the approximate
case. Notably:
Definition 3
The depth L_{n} is the largest integer K such that

min { d( T_{i}^{i+K-1}, T_{n+1}^{n+K} ) : 1 ≤ i ≤ n-K+1 } ≤ D.
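Definition 3 can be implemented directly by brute force. The sketch below is our own (0-indexed, whereas the definition is 1-indexed); the example text and parameters are illustrative, not from the talk:

```python
def hamming(u, v):
    """Number of positions at which u and v differ."""
    return sum(a != b for a, b in zip(u, v))

def approx_depth(t, n, D):
    """Largest K such that some length-K substring of t[:n] is within
    Hamming distance D of the next K symbols t[n:n + K]."""
    best = 0
    K = 1
    while n + K <= len(t) and K <= n:
        if min(hamming(t[i:i + K], t[n:n + K])
               for i in range(n - K + 1)) <= D:
            best = K
        K += 1
    return best

t = "abcabd"
print(approx_depth(t, 3, 0))  # 2: "ab" reappears exactly
print(approx_depth(t, 3, 1))  # 3: "abc" is within distance 1 of "abd"
```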

Rényi's entropy is generalized. Given a word H, the D-ball with
center H, denoted B_{D}(H),
is the set of words that are within distance D of H.
Definition 4
Given a text T, Rényi's entropy of order 0 is

r_{0}(D) = lim_{k → ∞}  E[ -log P( B_{D}(T_{1}^{k}) ) ] / k,

when this limit exists.
Asymptotic properties that depend on Rényi's entropy are proved for the
depth, the height, and the fillup. Notably, the convergence in
probability of the depth of insertion in a trie extends to this
approximate scheme:

L_{n} / log n → 1 / r_{0}(D)   (in probability).

The proof relies on the subadditive ergodic theorem and the asymptotic
equipartition property.
References
[1]
Flajolet (P.), Guivarc'h (Y.), Vallée (B.), and Szpankowski (W.).
Hidden pattern statistics. In Proceedings of the 28th ICALP Conference,
Crete, Greece, July 2001.
[2]
Guibas (L. J.) and Odlyzko (A. M.).
String overlaps, pattern matching, and nontransitive games.
Journal of Combinatorial Theory, Series A, vol. 30, no. 2,
1981, pp. 183-208.
[3]
Régnier (M.) and Szpankowski (W.).
On pattern frequency occurrences in a Markovian sequence.
Algorithmica, vol. 22, no. 4, 1998, pp. 631-649.
Average-case analysis of algorithms.