New and Old Problems in Pattern Matching
Wojciech Szpankowski
Computer Science Department, Purdue University (USA)
Algorithms Seminar
June 25, 2001
[summary by Mireille Régnier]
Abstract
This talk presents three problems in pattern matching and their analysis. The methods used rely on complex analysis and probability theory.
1 Statement of the Problems
A pattern H (or a set $\mathcal{H}$ of patterns) is searched for in a text T. The text T is generated by a random probabilistic source that is either a Bernoulli source, a Markov source, or a mixing source. In the string matching and subsequence matching problems, H is given: the model is deterministic. In the repetitive pattern matching problem of Section 4, H is a substring of T that is repeated elsewhere in T.
2 String Matching
One counts the number of occurrences of a given word H, or of a given finite set of words $\mathcal{H}$, in a text of size n. This number is denoted $O_n(H)$ or $O_n(\mathcal{H})$. The counting relies on a decomposition of the text T into languages, the so-called initial, minimal, and tail languages.
Definition 1
Given two strings H and F, the overlap set is the set of suffixes of H that are also prefixes of F. The suffixes of F in the associated factorizations of F form the correlation set $\mathcal{A}_{H,F}$.
In the Bernoulli model, one defines the correlation polynomial of H and F as
$$A_{H,F}(z) = \sum_{w \in \mathcal{A}_{H,F}} P(w)\, z^{|w|}.$$
When H is equal to F, the correlation set is called the autocorrelation set and denoted $\mathcal{A}_H$; the autocorrelation polynomial is defined as
$$A_H(z) = \sum_{w \in \mathcal{A}_H} P(w)\, z^{|w|}.$$
For example, let H = 11011 and F = 1110. Then the overlap set of H and F is {11, 1} and the correlation set is $\mathcal{A}_{H,F} = \{10, 110\}$. Similarly, $\mathcal{A}_{F,H} = \{11\}$. It is worth noticing that $\mathcal{A}_{F,H} \neq \mathcal{A}_{H,F}$. Intuitively, the concatenation of a word in $\mathcal{A}_{H,F}$ to H creates an (overlapping) occurrence of F.
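These sets can be computed directly from the definition. The following Python sketch (the function names are ours, not from the talk) recovers the sets of the example above.

```python
def overlap_set(H, F):
    """Nonempty suffixes of H that are also prefixes of F."""
    return {H[i:] for i in range(len(H)) if F.startswith(H[i:])}

def correlation_set(H, F):
    """For each overlap w, the word completing w to F; this gives A_{H,F}."""
    return {F[len(w):] for w in overlap_set(H, F)}

H, F = "11011", "1110"
print(overlap_set(H, F))       # {'11', '1'}
print(correlation_set(H, F))   # {'10', '110'}
print(correlation_set(F, H))   # {'11'}
```

Note that for H = F the full word H overlaps itself, so the autocorrelation set always contains the empty word, which is why the autocorrelation polynomial has constant term 1.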
Definition 2
Let H be a given word.
- The initial language R is the set of words containing only one occurrence of H, located at the right end.
- The tail language U is defined as the set of words u such that Hu has exactly one occurrence of H, which occurs at the left end.
- The minimal language M is the set of words w such that Hw has exactly two occurrences of H, located at its left and right ends.
With these notations, any text that contains exactly $k \ge 1$ occurrences of H can be written unambiguously as $r\, m_1 \cdots m_{k-1}\, u$, where $r \in R$, $m_i \in M$, and $u \in U$. In other words, the set $T_k$ of such texts satisfies $T_k = R\, M^{k-1}\, U$. The power of this approach comes from the equations that can be written on these languages, which translate into equations on their generating functions in the Bernoulli and Markov models. Moreover, it turns out that these generating functions, and hence the whole counting problem, depend only on the probability of H, denoted P(H), and on the correlation set.
Theorem 1
Let H be a given pattern of size m, and let T be a random text generated by a Bernoulli model. The generating function of the set $T_k$ satisfies
$$T_k(z) = R(z)\, M(z)^{k-1}\, U(z) = \frac{z^m P(H)}{D_H(z)^2}\, M(z)^{k-1}, \qquad M(z) = 1 + \frac{z-1}{D_H(z)},$$
where
$$D_H(z) = (1-z)\, A_H(z) + z^m P(H).$$
Moreover, the bivariate generating function satisfies
$$T(z,u) = \sum_{k \ge 1} T_k(z)\, u^k = \frac{z^m P(H)\, u}{D_H(z)^2\, \bigl(1 - u\, M(z)\bigr)}.$$
These results extend to the Markovian model and to the case of multiple pattern matching [3].
3 Subsequence Matching
A pattern $W = w_1 \cdots w_m$ is hidden in a text T if there exist indices $1 \le i_1 < \cdots < i_m \le n$ such that $t_{i_1} = w_1, \ldots, t_{i_m} = w_m$. For example, date is hidden 4 times in the text hidden pattern, but it is not a substring. We focus on cases where the sequence of indices satisfies additional constraints $i_{j+1} - i_j \le d_j$, where each $d_j$ is either an integer or $\infty$. Such a sequence of indices is called an occurrence. One denotes $(d_1, \ldots, d_{m-1})$ by D. For example, when $D = (3, 2, \infty, 1, \infty, \infty, 4, \infty)$, the sequence $I = (5, 7, 9, 18, 19, 22, 30, 33, 50)$ satisfies the constraints.
The number of occurrences, $W_n$, is asymptotically Gaussian. This is proved in [1] by the method of moments: all moments of the properly normalized random variable converge to the corresponding moments of the Gaussian law. For any sequence I that satisfies the constraints, one denotes by $X_I$ the random variable that is 1 if $t_{i_1} = w_1, \ldots, t_{i_m} = w_m$, and 0 otherwise. Then
$$W_n = \sum_{I} X_I,$$
where the sum ranges over all sequences satisfying the constraints.
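For intuition, $W_n$ is easy to compute for a concrete text by dynamic programming over the positions. A minimal Python sketch (names ours):

```python
def hidden_occurrences(text, W, D):
    """Number of index sequences i_1 < ... < i_m with text[i_j] == W[j]
    and i_{j+1} - i_j <= D[j]; a plain O(n^2 m) dynamic program."""
    n, m = len(text), len(W)
    ways = [[0] * n for _ in range(m)]    # ways[j][i]: matches of W[:j+1] ending at i
    for i in range(n):
        ways[0][i] = int(text[i] == W[0])
    for j in range(1, m):
        for i in range(n):
            if text[i] == W[j]:
                ways[j][i] = sum(ways[j - 1][ip]
                                 for ip in range(i) if i - ip <= D[j - 1])
    return sum(ways[m - 1])

inf = float('inf')
# "date" is hidden 4 times in "hidden pattern" when all gaps are unbounded
print(hidden_occurrences("hidden pattern", "date", [inf, inf, inf]))  # 4
```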
The computation of the moments relies on a generalization of correlation sets. Let $U = \{u_1, \ldots, u_{b-1}\}$ be the subset of indices j for which $d_j = \infty$. Any occurrence I satisfying the constraints can be divided into b blocks:
$$[i_1, i_{u_1}],\ [i_{u_1+1}, i_{u_2}],\ \ldots,\ [i_{u_{b-1}+1}, i_m].$$
The collection of these blocks is called the aggregate of I and denoted $\alpha(I)$. In the example above, the aggregate is
$$\alpha(I) = [5,9],\ [18,19],\ [22],\ [30,33],\ [50].$$
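The aggregate is obtained by cutting I at the unconstrained gaps. A small Python sketch (our naming) reproduces the example:

```python
def aggregate(I, D):
    """Blocks [start, end] of the occurrence I, cut at the gaps j
    where d_j is unbounded (d_j = inf)."""
    blocks, start = [], I[0]
    for j, d in enumerate(D):
        if d == float('inf'):
            blocks.append((start, I[j]))  # close the current block at i_j
            start = I[j + 1]              # next block opens at i_{j+1}
    blocks.append((start, I[-1]))
    return blocks

inf = float('inf')
D = (3, 2, inf, 1, inf, inf, 4, inf)
I = (5, 7, 9, 18, 19, 22, 30, 33, 50)
print(aggregate(I, D))  # [(5, 9), (18, 19), (22, 22), (30, 33), (50, 50)]
```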
Deriving the mean
The collection of occurrences of W can be described by the language
$$\mathcal{A}^* \{w_1\}\, \mathcal{A}^{<d_1} \{w_2\} \cdots \mathcal{A}^{<d_{m-1}} \{w_m\}\, \mathcal{A}^*,$$
where $\mathcal{A}$ is the alphabet and $\mathcal{A}^{<d_j}$ is the collection of words of length smaller than $d_j$ (the whole of $\mathcal{A}^*$ when $d_j = \infty$). It follows that the generating function of the expectations is
$$\sum_{n \ge 0} E(W_n)\, z^n = \frac{z^m}{(1-z)^2} \prod_{i=1}^m p_{w_i} \prod_{j=1}^{m-1} \frac{1 - z^{d_j}}{1-z},$$
where $p_{w_i}$ is the probability of character $w_i$ and the factor of an unbounded gap ($d_j = \infty$) reads $1/(1-z)$. Hence, the expectation satisfies
$$E(W_n) = \frac{n^b}{b!} \prod_{j:\, d_j < \infty} d_j \prod_{i=1}^m p_{w_i} \left( 1 + O\!\left( \frac{1}{n} \right) \right). \qquad (1)$$
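The leading term of (1) can be checked numerically: the exact number of admissible index sequences, computed by a simple dynamic program (our code, not from the talk), approaches $n^b/b! \prod_{d_j < \infty} d_j$ as n grows.

```python
from math import inf, factorial

def count_index_sets(n, D):
    """Exact number of sequences 1 <= i_1 < ... < i_m <= n with
    i_{j+1} - i_j <= d_j, by dynamic programming over the last index."""
    ways = [0] + [1] * n                      # one-element sequences
    for d in D:
        new = [0] * (n + 1)
        for i in range(1, n + 1):
            lo = 1 if d == inf else max(1, i - int(d))
            new[i] = sum(ways[ip] for ip in range(lo, i))
        ways = new
    return sum(ways)

D = (2, inf)                                  # m = 3 and b = 2 blocks
b = sum(d == inf for d in D) + 1
for n in (50, 100, 200):
    lead = n**b / factorial(b) * 2            # n^b/b! times d_1 = 2
    print(n, count_index_sets(n, D) / lead)   # ratio tends to 1
```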
Deriving the variance and higher moments
The variance rewrites
$$\mathrm{Var}(W_n) = \sum_{I,J} \Bigl( E(X_I X_J) - E(X_I)\, E(X_J) \Bigr).$$
In the Bernoulli model, the two random variables $X_I$ and $X_J$ are independent whenever the blocks of I and J do not overlap; the contribution of such a pair to the variance is zero.
If $\alpha(I)$ and $\alpha(J)$ overlap, one defines the aggregate $\alpha(I,J)$ as the set of blocks obtained by merging the blocks of $\alpha(I)$ and $\alpha(J)$ that overlap. The number of blocks in $\alpha(I,J)$, denoted $\beta(I,J)$, is upper bounded by $2b-1$. For such a pair (I,J), the text can be rewritten as an element of the language
$$\mathcal{A}^*\, B_1\, \mathcal{A}^* \cdots B_{\beta(I,J)}\, \mathcal{A}^*,$$
and the generating function of the covariance rewrites as a sum of terms of the form $P_p(z)/(1-z)^{2b-p+1}$, where the $P_p$ are polynomials in z that generalize the correlation polynomials defined in [2] (see Definition 1). The asymptotic order of each such term is $n^{2b-p}$. Hence, the dominating contribution is due to the intersecting pairs such that $\beta(I,J) = 2b-1$, and
$$\mathrm{Var}(W_n) \sim n^{2b-1}\, \sigma^2,$$
where the variance coefficient $\sigma$ can easily be evaluated for any given pattern by dynamic programming.
The proof is similar for higher moments.
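The $n^{2b-1}$ scaling can be observed experimentally. The Monte Carlo sketch below (our code; a crude estimate, not the dynamic-programming evaluation of $\sigma$ mentioned above) tracks $\mathrm{Var}(W_n)/n^{2b-1}$ for a pattern with b = 2 blocks over a symmetric binary source.

```python
import random

def hidden_occurrences(text, W, D):
    """O(n^2 m) dynamic program counting constrained hidden occurrences."""
    n, m = len(text), len(W)
    ways = [int(c == W[0]) for c in text]
    for j in range(1, m):
        ways = [sum(ways[ip] for ip in range(i) if i - ip <= D[j - 1])
                if text[i] == W[j] else 0 for i in range(n)]
    return sum(ways)

def mc_variance(n, W, D, trials=400):
    """Sample variance of W_n over uniform random binary texts."""
    xs = [hidden_occurrences("".join(random.choice("01") for _ in range(n)),
                             W, D) for _ in range(trials)]
    mu = sum(xs) / trials
    return sum((x - mu) ** 2 for x in xs) / (trials - 1)

inf = float('inf')
W, D = "101", (2, inf)                      # one unbounded gap, so b = 2
for n in (30, 60):
    print(n, mc_variance(n, W, D) / n**3)   # roughly constant (2b-1 = 3)
```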
4 Repetitive Pattern Matching
Given a pattern H found in a text T, one searches for a second, approximate occurrence of H. A word F is a D-approximate occurrence of a word H if the Hamming distance between F and H is at most D. Recall that the Hamming distance between two words of size m, say $H = H_1 \cdots H_m$ and $F = F_1 \cdots F_m$, is
$$d(H, F) = \#\{\, i : H_i \neq F_i \,\}.$$
The usual parameters on trees, such as the depth of insertion, the height, and the fill-up level, extend to this approximate setting. Notably:
Definition 3
The depth $L_n$ is the largest integer K such that
$$\min\bigl\{\, d\bigl(T_i^{i+K-1},\, T_{n+1}^{n+K}\bigr) \;:\; 1 \le i \le n-K+1 \,\bigr\} \le D.$$
Rényi's entropy is generalized as follows. Given a word H, the D-ball with center H, denoted $B_D(H)$, is the set of words within distance D of H.
Definition 4
Given a text T, Rényi's entropy of order 0 is
$$r_0(D) = \lim_{m \to \infty} \frac{-\log P\bigl(B_D(T_1^m)\bigr)}{m},$$
when this limit exists.
Asymptotic properties, depending on Rényi's entropy, are proved for the depth, the height, and the fill-up. Notably, the convergence in probability of the depth of insertion in a trie extends to this approximate scheme:
$$\frac{L_n}{\log n} \xrightarrow{\;\Pr\;} \frac{1}{r_0(D)} \qquad (n \to \infty).$$
The proof relies on the subadditive ergodic theorem and on the asymptotic equipartition property.
References
[1] Flajolet (P.), Guivarc'h (Y.), Vallée (B.), and Szpankowski (W.). -- Hidden pattern statistics. In Proceedings of the 28th ICALP Conference, Crete, Greece, July 2001.
[2] Guibas (L. J.) and Odlyzko (A. M.). -- String overlaps, pattern matching, and nontransitive games. Journal of Combinatorial Theory, Series A, vol. 30, n°2, 1981, pp. 183--208.
[3] Régnier (M.) and Szpankowski (W.). -- On pattern frequency occurrences in a Markovian sequence. Algorithmica, vol. 22, n°4, 1998, pp. 631--649. -- Average-case analysis of algorithms.