New and Old Problems in Pattern Matching
Wojciech Szpankowski
Computer Science Department, Purdue University (USA)
Algorithms Seminar
June 25, 2001
[summary by Mireille Régnier]
Abstract
This talk presents three problems in pattern matching and their analysis. The methods used rely on complex analysis and probability theory.
1 Statement of the Problems
A pattern H (or a set $\mathcal{H}$ of patterns) is searched for in a text T. The text T is generated by a random probabilistic source that is either a Bernoulli source, a Markov source, or a mixing source. In the string matching and subsequence matching problems, H is given: the model is deterministic. In the repetitive pattern matching problem of Section 4, H is a substring of T that is repeated elsewhere in T.
2 String Matching
One counts the number of occurrences of a given word H, or of a given finite set of words $\mathcal{H}$, in a text of size n. This number is denoted $O_n(H)$ or $O_n(\mathcal{H})$. The counting relies on a decomposition of the text T into languages, the so-called initial, minimal, and tail languages.
Definition 1
Given two strings H and F, the overlap set is the set of suffixes of H that are also prefixes of F. The suffixes of F in the associated factorizations of F form the correlation set $\mathcal{A}_{H,F}$.
In the Bernoulli model, one defines the correlation polynomial of H and F as
$$A_{H,F}(z) = \sum_{w \in \mathcal{A}_{H,F}} P(w)\, z^{|w|}.$$
When H is equal to F, the correlation set is called the autocorrelation set and denoted $\mathcal{A}_H$; the autocorrelation polynomial is defined as
$$A_H(z) = \sum_{w \in \mathcal{A}_H} P(w)\, z^{|w|}.$$
For example, let H = 11011 and F = 1110. Then the overlap set of H and F is {11, 1} and the correlation set is $\mathcal{A}_{H,F} = \{10, 110\}$. Similarly, $\mathcal{A}_{F,H} = \{11\}$. It is worth noticing that $\mathcal{A}_{F,H} \neq \mathcal{A}_{H,F}$. Intuitively, the concatenation of a word in $\mathcal{A}_{H,F}$ to H creates an (overlapping) occurrence of F.
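These sets can be computed directly from the definition. The following Python sketch (the function names are ours, not from the talk) recovers the sets of the example above.

```python
def overlap_set(H, F):
    """Nonempty suffixes of H that are also prefixes of F."""
    return {H[i:] for i in range(len(H)) if F.startswith(H[i:])}

def correlation_set(H, F):
    """For each overlap w, the word completing w to F; this gives A_{H,F}."""
    return {F[len(w):] for w in overlap_set(H, F)}

H, F = "11011", "1110"
print(overlap_set(H, F))       # {'11', '1'}
print(correlation_set(H, F))   # {'10', '110'}
print(correlation_set(F, H))   # {'11'}
```

Note that for H = F the full word H overlaps itself, so the autocorrelation set always contains the empty word, which is why the autocorrelation polynomial has constant term 1.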
Definition 2
Let H be a given word.
- The initial language R is the set of words containing only one occurrence of H, located at the right end.
- The tail language U is defined as the set of words u such that Hu has exactly one occurrence of H, which occurs at the left end.
- The minimal language M is the set of words w such that Hw has exactly two occurrences of H, located at its left and right ends.
With these notations, any text that contains exactly $k \ge 1$ occurrences of H can be written unambiguously as $r\, m_1 \cdots m_{k-1}\, u$, where $r \in R$, $m_i \in M$, and $u \in U$. In other words, the set $T_k$ of such texts satisfies $T_k = R\, M^{k-1}\, U$. The power of this approach comes from the equations that can be written on these languages, which translate into equations on their generating functions in the Bernoulli and Markov models. Moreover, it turns out that these generating functions, and hence the whole counting problem, depend only on the probability of H, denoted P(H), and on the correlation set.
Theorem 1
Let H be a given pattern of size m, and let T be a random text generated by a Bernoulli model. The generating function of the set $T_k$ satisfies
$$T_k(z) = R(z)\, M(z)^{k-1}\, U(z) = \frac{z^m P(H)}{D_H(z)^2}\, M(z)^{k-1}, \qquad M(z) = 1 + \frac{z-1}{D_H(z)},$$
where
$$D_H(z) = (1-z)\, A_H(z) + z^m P(H).$$
Moreover, the bivariate generating function satisfies
$$T(z,u) = \sum_{k \ge 1} T_k(z)\, u^k = \frac{z^m P(H)\, u}{D_H(z)^2\, \bigl(1 - u\, M(z)\bigr)}.$$
These results extend to the Markovian model and to the case of multiple pattern matching [3].
3 Subsequence Matching
A pattern $W = w_1 \cdots w_m$ is hidden in a text T if there exist indices $1 \le i_1 < \cdots < i_m \le n$ such that $t_{i_1} = w_1, \ldots, t_{i_m} = w_m$. For example, date is hidden 4 times in the text hidden pattern, but it is not a substring. We focus on cases where the sequence of indices satisfies additional constraints $i_{j+1} - i_j \le d_j$, where each $d_j$ is either an integer or $\infty$. Such a sequence of indices is called an occurrence. One denotes $(d_1, \ldots, d_{m-1})$ by D. For example, when $D = (3, 2, \infty, 1, \infty, \infty, 4, \infty)$, the sequence $I = (5, 7, 9, 18, 19, 22, 30, 33, 50)$ satisfies the constraints.
The number of occurrences, $W_n$, is asymptotically Gaussian. This is proved in [1] by the method of moments: all moments of the properly normalized random variable converge to the corresponding moments of the Gaussian law. For any sequence I that satisfies the constraints, one denotes by $X_I$ the random variable that is 1 if $t_{i_1} = w_1, \ldots, t_{i_m} = w_m$, and 0 otherwise. Then
$$W_n = \sum_{I} X_I,$$
where the sum ranges over all sequences satisfying the constraints.
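For intuition, $W_n$ is easy to compute for a concrete text by dynamic programming over the positions. A minimal Python sketch (names ours):

```python
def hidden_occurrences(text, W, D):
    """Number of index sequences i_1 < ... < i_m with text[i_j] == W[j]
    and i_{j+1} - i_j <= D[j]; a plain O(n^2 m) dynamic program."""
    n, m = len(text), len(W)
    ways = [[0] * n for _ in range(m)]    # ways[j][i]: matches of W[:j+1] ending at i
    for i in range(n):
        ways[0][i] = int(text[i] == W[0])
    for j in range(1, m):
        for i in range(n):
            if text[i] == W[j]:
                ways[j][i] = sum(ways[j - 1][ip]
                                 for ip in range(i) if i - ip <= D[j - 1])
    return sum(ways[m - 1])

inf = float('inf')
# "date" is hidden 4 times in "hidden pattern" when all gaps are unbounded
print(hidden_occurrences("hidden pattern", "date", [inf, inf, inf]))  # 4
```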
The computation of the moments relies on a generalization of correlation sets. Let $U = \{u_1, \ldots, u_{b-1}\}$ be the subset of indices j for which $d_j = \infty$. Any occurrence I satisfying the constraints can be divided into b blocks:
$$[i_1, i_{u_1}],\ [i_{u_1+1}, i_{u_2}],\ \ldots,\ [i_{u_{b-1}+1}, i_m].$$
The collection of these blocks is called the aggregate of I and denoted $\alpha(I)$. In the example above, the aggregate is
$$\alpha(I) = [5,9],\ [18,19],\ [22],\ [30,33],\ [50].$$
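The aggregate is obtained by cutting I at the unconstrained gaps. A small Python sketch (our naming) reproduces the example:

```python
def aggregate(I, D):
    """Blocks [start, end] of the occurrence I, cut at the gaps j
    where d_j is unbounded (d_j = inf)."""
    blocks, start = [], I[0]
    for j, d in enumerate(D):
        if d == float('inf'):
            blocks.append((start, I[j]))  # close the current block at i_j
            start = I[j + 1]              # next block opens at i_{j+1}
    blocks.append((start, I[-1]))
    return blocks

inf = float('inf')
D = (3, 2, inf, 1, inf, inf, 4, inf)
I = (5, 7, 9, 18, 19, 22, 30, 33, 50)
print(aggregate(I, D))  # [(5, 9), (18, 19), (22, 22), (30, 33), (50, 50)]
```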
Deriving the mean
The collection of occurrences of W can be described by the language
$$\mathcal{A}^* \{w_1\}\, \mathcal{A}^{<d_1} \{w_2\} \cdots \mathcal{A}^{<d_{m-1}} \{w_m\}\, \mathcal{A}^*,$$
where $\mathcal{A}$ is the alphabet and $\mathcal{A}^{<d_j}$ is the collection of words of length smaller than $d_j$ (the whole of $\mathcal{A}^*$ when $d_j = \infty$). It follows that the generating function of the expectations is
$$\sum_{n \ge 0} E(W_n)\, z^n = \frac{z^m}{(1-z)^2} \prod_{i=1}^m p_{w_i} \prod_{j=1}^{m-1} \frac{1 - z^{d_j}}{1-z},$$
where $p_{w_i}$ is the probability of character $w_i$ and the factor of an unbounded gap ($d_j = \infty$) reads $1/(1-z)$. Hence, the expectation satisfies
$$E(W_n) = \frac{n^b}{b!} \prod_{j:\, d_j < \infty} d_j \prod_{i=1}^m p_{w_i} \left( 1 + O\!\left( \frac{1}{n} \right) \right). \qquad (1)$$
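The leading term of (1) can be checked numerically: the exact number of admissible index sequences, computed by a simple dynamic program (our code, not from the talk), approaches $n^b/b! \prod_{d_j < \infty} d_j$ as n grows.

```python
from math import inf, factorial

def count_index_sets(n, D):
    """Exact number of sequences 1 <= i_1 < ... < i_m <= n with
    i_{j+1} - i_j <= d_j, by dynamic programming over the last index."""
    ways = [0] + [1] * n                      # one-element sequences
    for d in D:
        new = [0] * (n + 1)
        for i in range(1, n + 1):
            lo = 1 if d == inf else max(1, i - int(d))
            new[i] = sum(ways[ip] for ip in range(lo, i))
        ways = new
    return sum(ways)

D = (2, inf)                                  # m = 3 and b = 2 blocks
b = sum(d == inf for d in D) + 1
for n in (50, 100, 200):
    lead = n**b / factorial(b) * 2            # n^b/b! times d_1 = 2
    print(n, count_index_sets(n, D) / lead)   # ratio tends to 1
```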
Deriving the variance and higher moments
The variance rewrites
$$\mathrm{Var}(W_n) = \sum_{I,J} \Bigl( E(X_I X_J) - E(X_I)\, E(X_J) \Bigr).$$
In the Bernoulli model, the two random variables $X_I$ and $X_J$ are independent whenever the blocks of I and J do not overlap; the contribution of such a pair to the variance is zero.
If $\alpha(I)$ and $\alpha(J)$ overlap, one defines the aggregate $\alpha(I,J)$ as the set of blocks obtained by merging the blocks of $\alpha(I)$ and $\alpha(J)$ that overlap. The number of blocks in $\alpha(I,J)$, denoted $\beta(I,J)$, is upper bounded by $2b-1$. For such a pair (I,J), the text can be rewritten as an element of the language
$$\mathcal{A}^*\, B_1\, \mathcal{A}^* \cdots B_{\beta(I,J)}\, \mathcal{A}^*,$$
and the generating function of the covariance rewrites as a sum of terms of the form $P_p(z)/(1-z)^{2b-p+1}$, where the $P_p$ are polynomials in z that generalize the correlation polynomials defined in [2] (see Definition 1). The asymptotic order of each such term is $n^{2b-p}$. Hence, the dominating contribution is due to the intersecting pairs such that $\beta(I,J) = 2b-1$, and
$$\mathrm{Var}(W_n) \sim n^{2b-1}\, \sigma^2,$$
where the variance coefficient $\sigma$ can easily be evaluated for any given pattern by dynamic programming.
The proof is similar for higher moments.
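The $n^{2b-1}$ scaling can be observed experimentally. The Monte Carlo sketch below (our code; a crude estimate, not the dynamic-programming evaluation of $\sigma$ mentioned above) tracks $\mathrm{Var}(W_n)/n^{2b-1}$ for a pattern with b = 2 blocks over a symmetric binary source.

```python
import random

def hidden_occurrences(text, W, D):
    """O(n^2 m) dynamic program counting constrained hidden occurrences."""
    n, m = len(text), len(W)
    ways = [int(c == W[0]) for c in text]
    for j in range(1, m):
        ways = [sum(ways[ip] for ip in range(i) if i - ip <= D[j - 1])
                if text[i] == W[j] else 0 for i in range(n)]
    return sum(ways)

def mc_variance(n, W, D, trials=400):
    """Sample variance of W_n over uniform random binary texts."""
    xs = [hidden_occurrences("".join(random.choice("01") for _ in range(n)),
                             W, D) for _ in range(trials)]
    mu = sum(xs) / trials
    return sum((x - mu) ** 2 for x in xs) / (trials - 1)

inf = float('inf')
W, D = "101", (2, inf)                      # one unbounded gap, so b = 2
for n in (30, 60):
    print(n, mc_variance(n, W, D) / n**3)   # roughly constant (2b-1 = 3)
```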
4 Repetitive Pattern Matching
Given a pattern H found in a text T, one searches for a second, approximate occurrence of H. A word F is a D-approximate occurrence of a word H if the Hamming distance between F and H is at most D. Recall that the Hamming distance between two words of size m, say $H = H_1 \cdots H_m$ and $F = F_1 \cdots F_m$, is
$$d(H, F) = \#\{\, i : H_i \neq F_i \,\}.$$
The usual parameters on trees, such as the depth of insertion, the height, and the fill-up level, extend to this approximate setting. Notably:
Definition 3
The depth $L_n$ is the largest integer K such that
$$\min\bigl\{\, d\bigl(T_i^{i+K-1},\, T_{n+1}^{n+K}\bigr) \;:\; 1 \le i \le n-K+1 \,\bigr\} \le D.$$
Rényi's entropy is generalized as follows. Given a word H, the D-ball with center H, denoted $B_D(H)$, is the set of words within distance D of H.
Definition 4
Given a text T, Rényi's entropy of order 0 is
$$r_0(D) = \lim_{m \to \infty} \frac{-\log P\bigl(B_D(T_1^m)\bigr)}{m},$$
when this limit exists.
Asymptotic properties, depending on Rényi's entropy, are proved for the depth, the height, and the fill-up. Notably, the convergence in probability of the depth of insertion in a trie extends to this approximate scheme:
$$\frac{L_n}{\log n} \xrightarrow{\;\Pr\;} \frac{1}{r_0(D)} \qquad (n \to \infty).$$
The proof relies on the subadditive ergodic theorem and on the asymptotic equipartition property.
References
[1] Flajolet (P.), Guivarc'h (Y.), Vallée (B.), and Szpankowski (W.). -- Hidden pattern statistics. In Proceedings of the 28th ICALP Conference, Crete, Greece, July 2001.
[2] Guibas (L. J.) and Odlyzko (A. M.). -- String overlaps, pattern matching, and nontransitive games. Journal of Combinatorial Theory, Series A, vol. 30, n°2, 1981, pp. 183--208.
[3] Régnier (M.) and Szpankowski (W.). -- On pattern frequency occurrences in a Markovian sequence. Algorithmica, vol. 22, n°4, 1998, pp. 631--649. -- Average-case analysis of algorithms.