Wiener-Hopf Factorization and Maximal Scores in Biological Sequences

Pierre Nicodème

INRIA-Rocquencourt

Algorithms Seminar

February 10, 1997

[summary by Philippe Robert]

1 Introduction

In this talk we study a matching problem for two sequences $S=(s_1,\dots,s_n)$ and $T=(t_1,\dots,t_p)$, where $s_1,\dots,s_n,t_1,\dots,t_p$ are elements of some alphabet $A$. This mathematical model is used in the analysis of some biological sequences. To any pair $(x,y)\in A\times A$ one associates a score $H(x,y)\in\mathbb{R}$, which is negative if the two letters do not agree and positive if their affinity is significant. A local matching of length $l$ of these sequences is given by a sequence $((s_{i_1},t_{j_1}),(s_{i_2},t_{j_2}),\dots,(s_{i_l},t_{j_l}))$ with $1\le i_1<i_2<\dots<i_l\le n$ and $1\le j_1<\dots<j_l\le p$. The score of this matching is then defined as

\[ \sum_{k=1}^{l} H(s_{i_k},t_{j_k}). \]
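As a small illustration of this definition, the following Python sketch computes the score of a matching. The scoring function `toy_score` (+1 for equal letters, -1 otherwise), the two example sequences and the function names are assumptions made for the example; they are not taken from the talk.

```python
def matching_score(s, t, pairs, H):
    """Score of a matching: sum of H(s[i], t[j]) over the matched pairs.

    `pairs` lists the matched positions (i_k, j_k) with 1-based indices,
    both coordinates strictly increasing, as in the definition above.
    """
    for (i1, j1), (i2, j2) in zip(pairs, pairs[1:]):
        if not (i1 < i2 and j1 < j2):
            raise ValueError("matched indices must be strictly increasing")
    return sum(H(s[i - 1], t[j - 1]) for i, j in pairs)


def toy_score(x, y):
    # Hypothetical scoring function: reward equal letters, penalize the others.
    return 1.0 if x == y else -1.0


S = "ACGTAC"
T = "AGGTCC"
# Match the first four positions of S with the first four positions of T.
pairs = [(1, 1), (2, 2), (3, 3), (4, 4)]
score = matching_score(S, T, pairs, toy_score)
print(score)
```

Here three of the four matched pairs agree and one does not, so the toy score is 1 - 1 + 1 + 1.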

The main problem considered in this talk is to estimate the maximal score among all the possible matchings of these sequences.

A probabilistic setting is used to give estimates of this optimal score. The letters are assumed to be drawn independently from the alphabet. This hypothesis leads to a formulation of the problem in terms of a random walk. The optimal score M(n) for two sequences of size n can be represented as

\[ M(n)=\sup_{0\le j\le k\le n}(S_k-S_j), \tag{1} \]

where $S_n=\sum_{i=1}^{n}X_i$ (with the convention $\sum_1^0=0$). The variables $(X_i)$ are assumed to be independent and identically distributed. The sequence $(S_n)$ is the random walk starting from 0 associated with the distribution of $X_1$. Clearly the sequence $(M(n))$ is nondecreasing in $n$ and, as we will see, it tends to infinity as $n\to+\infty$. Our goal is to find an asymptotic estimate of $M(n)$ for large $n$. We prove that if $E(X_1)<0$ and some additional technical conditions hold, there exists a constant $a$ such that the renormalized sequence $M(n)-a\log(n)$ converges in distribution.

2 The relation with a reflected random walk

For $n\ge 1$ we set

\[ W_n=\sup_{0\le k\le n}(S_n-S_k); \]

then $M(n)$ can be expressed as

\[ M(n)=\sup_{0\le k\le n}W_k. \tag{2} \]

It is easy to see that the sequence (W_n) satisfies the following relation

\[ W_{n+1}=(W_n+X_{n+1})^+,\qquad n\ge 0, \tag{3} \]

where x⁺=max(x,0). Now define

\[ n_-=\inf\{n>0 : S_n\le 0\}, \]

which is the first time the random walk visits the negative axis. The law of large numbers gives that, almost surely, $\lim_{n\to+\infty}\frac{1}{n}\sum_{i=1}^{n}X_i=E(X_1)$, and because $E(X_1)<0$ we have $\lim_{n\to+\infty}\sum_{i=1}^{n}X_i=-\infty$; thus the quantity $n_-$ is almost surely finite.

By induction, using (3), one can check that $W_n=S_n$ for $n<n_-$. Furthermore, $W_{n_-}=(S_{n_-})^+=0$ by definition of $n_-$. It is easy to prove that, starting from time $n_-$, the sequence $(W_n)$ performs another similar excursion above 0, independently of the previous one, and so on. The sequence $(W_n)$ is the random walk reflected at 0.
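The relations (1), (2) and (3) can be checked numerically on a simulated path. The Python sketch below is only an illustration: the Gaussian step distribution with mean -0.5, the seed, the path length and the function names are assumptions made for the example.

```python
import random


def walk(xs):
    """Partial sums S_0 = 0, S_1, ..., S_n of the steps xs."""
    s = [0.0]
    for x in xs:
        s.append(s[-1] + x)
    return s


def reflected_walk(xs):
    """Reflected walk via recursion (3): W_0 = 0, W_{n+1} = max(W_n + X_{n+1}, 0)."""
    w = [0.0]
    for x in xs:
        w.append(max(w[-1] + x, 0.0))
    return w


def max_score_bruteforce(s):
    """M(n) computed directly from (1): sup of S_k - S_j over 0 <= j <= k <= n."""
    n = len(s) - 1
    return max(s[k] - s[j] for j in range(n + 1) for k in range(j, n + 1))


rng = random.Random(0)
xs = [rng.gauss(-0.5, 1.0) for _ in range(200)]  # negative drift: E(X_1) < 0
S = walk(xs)
W = reflected_walk(xs)

M_direct = max_score_bruteforce(S)   # definition (1)
M_reflected = max(W)                 # representation (2)

# First time n > 0 with S_n <= 0 (falls back to the path length if none occurs).
first_nonpositive = next((n for n in range(1, len(S)) if S[n] <= 0), len(S) - 1)
print(M_direct, M_reflected, first_nonpositive)
```

The quadratic brute-force computation is there only as a check; the recursion (3) gives the same maximum in linear time.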

The method of resolution

To estimate $M(n)$, the maximum of $k\mapsto W_k$ on the interval $\{0,\dots,n\}$, we proceed as follows:

1. Estimate the distribution of $M_{\mathrm{exc}}$, the maximum of the random walk during an excursion above 0.
2. Count the number of excursions of $k\mapsto W_k$ above 0 up to time $n$.
3. Write $M(n)=\max_i M_{\mathrm{exc}_i}$, where the maximum is taken over all excursions $\mathrm{exc}_i$ before time $n$. The excursions being independent, so are the $(M_{\mathrm{exc}_i})$, and the maximum of independent random variables is easy to estimate.
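The decomposition into excursions can be sketched in Python as follows. This is an illustration under assumptions of our own (Gaussian steps with mean -0.5, a fixed seed, hypothetical function names); trivial excursions, where the reflected walk stays at 0, are skipped because their maximum is 0 and they do not change the overall maximum.

```python
import random


def reflected_walk(xs):
    """W_0 = 0, W_{n+1} = max(W_n + X_{n+1}, 0)."""
    w = [0.0]
    for x in xs:
        w.append(max(w[-1] + x, 0.0))
    return w


def excursion_maxima(w):
    """Maximum of each excursion of w above 0, in order of appearance.

    An excursion is a maximal run of strictly positive values; the last,
    possibly unfinished, excursion is included as well.
    """
    maxima = []
    current = 0.0
    inside = False
    for value in w[1:]:
        if value > 0:
            current = max(current, value)
            inside = True
        else:
            if inside:
                maxima.append(current)
            current = 0.0
            inside = False
    if inside:
        maxima.append(current)
    return maxima


rng = random.Random(0)
xs = [rng.gauss(-0.5, 1.0) for _ in range(2000)]
W = reflected_walk(xs)
maxima = excursion_maxima(W)
print(len(maxima), max(maxima), max(W))
```

The last two printed values coincide: the maximum of the reflected walk is the largest of the excursion maxima.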

Technically, the main tool used to prove convergences is the renewal theorem. The explicit calculation of constants involved in the limiting distribution requires the Wiener-Hopf factorization associated to this random walk. We give a formulation of these results in the next section.

3 The probabilistic tools

3.1 The renewal theorem

We consider a sequence $(Y_i)$ of nonnegative, integrable i.i.d. random variables and denote by $T_n=\sum_{i=1}^{n}Y_i$ the nondecreasing random walk associated with the distribution of $Y_1$. For $t\in\mathbb{R}_+$, let $N_t$ be the number of $T_k$ between 0 and $t$. Thus $T_{N_t+1}-t$ is the length of the interval between $t$ and the first $T_k$ after $t$.

Proposition 1 Almost surely,

\[ \lim_{t\to+\infty}\frac{N_t}{t}=\frac{1}{E(Y_1)}, \]

where E(Y₁) denotes the expected value of Y₁.
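Proposition 1 is easy to observe on a simulation. The sketch below assumes, purely for illustration, exponentially distributed $Y_i$ with mean 2, a fixed seed and a horizon $t=10^5$; the function name `count_renewals` is hypothetical.

```python
import random


def count_renewals(t, draw):
    """Number of renewal epochs T_k = Y_1 + ... + Y_k (k >= 1) that are <= t."""
    total = 0.0
    count = 0
    while True:
        total += draw()
        if total > t:
            return count
        count += 1


rng = random.Random(1)
mean_y = 2.0          # assumed value of E(Y_1)
t = 100_000.0
n_t = count_renewals(t, lambda: rng.expovariate(1.0 / mean_y))
ratio = n_t / t       # should be close to 1 / E(Y_1) = 0.5
print(ratio)
```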

Note that Proposition 1 settles the second point of our program: in that case, the beginnings of the excursions define the renewal process.

The main result in renewal theory concerns the solution of the so-called renewal equation. Given a function $f$, the function $Z$ is a solution of the renewal equation associated with $f$ if, for all $x\ge 0$,

\[ Z(x)=f(x)+\int_0^x Z(x-y)\,P(X_1\in dy). \tag{4} \]

The main theorem is the following

Theorem 1 If $f$ is Riemann integrable, there is a unique solution $Z_f$ of (4), and $Z_f$ satisfies

\[ \lim_{x\to+\infty}Z_f(x)=\frac{1}{E(X_1)}\int_0^{+\infty}f(u)\,du. \]

This analytical formulation of the renewal theorem can be seen as a consequence of a probabilistic result: the variable $T_{N_t+1}-t$ converges in distribution as $t\to+\infty$.
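A simple consistency check of Theorem 1, given here as an illustration and not taken from the talk: assume that the variable $X_1$ in (4) is nonnegative with $P(X_1=0)=0$, and take $f(x)=P(X_1>x)$. The constant function $Z\equiv 1$ then solves (4), since

\[ f(x)+\int_0^x 1\cdot P(X_1\in dy)=P(X_1>x)+P(X_1\le x)=1, \]

and the limit predicted by the theorem is indeed

\[ \frac{1}{E(X_1)}\int_0^{+\infty}P(X_1>u)\,du=\frac{E(X_1)}{E(X_1)}=1. \]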

3.2 The Wiener-Hopf factorization

This technique concerns the calculation of the distribution of the hitting times of the positive and negative axes by a random walk, and of the position of the random walk at these times. We have already seen $n_-$; we define its positive counterpart $n_+$ by

\[ n_+=\inf\{n : S_n>0\}. \]

Theorem 2 For $u\in\mathbb{C}$ such that $|u|<1$, there exist $f_+(u,\cdot)$ and $f_-(u,\cdot)$ such that

\[ \frac{1}{1-uE(e^{-xX})}=f_+(u,x)\,f_-(u,x),\qquad \Re(x)=0. \tag{5} \]

The function $f_+(u,\cdot)$ [resp. $f_-(u,\cdot)$] is analytic on $\{\Re(x)>0\}$ [resp. $\{\Re(x)<0\}$], continuous, and bounded away from 0 and $\infty$ on $\{\Re(x)\ge 0\}$ [resp. $\{\Re(x)\le 0\}$]. Moreover,

\[ \lim_{\Re(x)\to+\infty}f_+(u,x)=1. \]

Such a decomposition is unique.

The following corollary is the probabilistic interpretation of the above theorem.

Corollary 1 The functions of the Wiener-Hopf factorization can be expressed as

\[ f_+(u,x)=\frac{1}{1-E\bigl(u^{n_+}e^{-xS_{n_+}}\bigr)},\qquad |u|<1,\ \Re(x)\ge 0, \]

\[ f_-(u,x)=\frac{1}{1-E\bigl(u^{n_-}e^{-xS_{n_-}}\bigr)},\qquad |u|<1,\ \Re(x)\le 0. \]

Thus, if we are able to decompose the function $1/(1-uE(e^{-xX}))$, the joint distributions of $(n_+,S_{n_+})$ and $(n_-,S_{n_-})$ are known through their Fourier-Laplace transforms $E(u^{n_+}e^{-xS_{n_+}})$ and $E(u^{n_-}e^{-xS_{n_-}})$.
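The factorization (5), combined with Corollary 1, reads $1-uE(e^{-xX})=\bigl(1-E(u^{n_+}e^{-xS_{n_+}})\bigr)\bigl(1-E(u^{n_-}e^{-xS_{n_-}})\bigr)$ for $\Re(x)=0$, and this identity can be tested by simulation. The Python sketch below is a rough Monte Carlo check under assumptions chosen for the example: Gaussian steps with mean -0.5 and variance 1, $u=0.5$, $x=0.7i$, and paths truncated after 60 steps (the neglected terms are of order $|u|^{60}$). The function name is hypothetical.

```python
import cmath
import random


def ladder_transforms(u, x, mean, std, samples, horizon, rng):
    """Monte Carlo estimates of E(u^{n_+} e^{-x S_{n_+}}) and E(u^{n_-} e^{-x S_{n_-}}).

    n_+ is the first n >= 1 with S_n > 0 and n_- the first n >= 1 with S_n <= 0,
    for a walk with Gaussian steps. A hitting time not reached within `horizon`
    steps contributes 0, which is accurate up to O(|u|**horizon).
    """
    plus = 0j
    minus = 0j
    for _ in range(samples):
        s = 0.0
        seen_plus = False
        seen_minus = False
        for n in range(1, horizon + 1):
            s += rng.gauss(mean, std)
            if s > 0 and not seen_plus:
                plus += u ** n * cmath.exp(-x * s)
                seen_plus = True
            if s <= 0 and not seen_minus:
                minus += u ** n * cmath.exp(-x * s)
                seen_minus = True
            if seen_plus and seen_minus:
                break
    return plus / samples, minus / samples


mean, std = -0.5, 1.0
u = 0.5
x = 0.7j                                             # point on the line Re(x) = 0
phi = cmath.exp(-x * mean + 0.5 * (x * std) ** 2)    # E(e^{-xX}) for Gaussian X

rng = random.Random(2)
a_plus, a_minus = ladder_transforms(u, x, mean, std, 20_000, 60, rng)

lhs = 1 - u * phi
rhs = (1 - a_plus) * (1 - a_minus)
error = abs(lhs - rhs)                               # Monte Carlo error only
print(lhs, rhs, error)
```

With 20,000 paths the two sides typically agree to about two decimal places; the discrepancy is statistical noise, not a property of the factorization.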

4 The main results

Using Theorem 1, one can prove the following proposition about the tail distribution of the maximum of the random walk during an excursion.

Proposition 2 If the following conditions are satisfied,

$E(X_1)<0$, $P(X_1>0)>0$ and $X_1$ is non-arithmetic;
there exists $q>0$ such that $E(e^{qX_1})<+\infty$;
$E(|X_1|e^{gX_1})<+\infty$, where $g$ is the positive solution of $E(e^{gX_1})=1$;

then

\[ \lim_{x\to+\infty}e^{gx}\,P(M_{\mathrm{exc}}\ge x)=C_{\mathrm{exc}}=\frac{P(n_+=+\infty)\,\bigl(1-E(e^{gS_{n_-}})\bigr)}{g\,E(X_1e^{gX_1})\,E\bigl(n_+e^{gS_{n_+}}\mathbf{1}_{\{n_+<+\infty\}}\bigr)}. \]
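The exponent $g$ is defined only implicitly, as the positive root of $E(e^{gX_1})=1$. A minimal Python sketch of how it could be computed by bisection is given below; the function name, the bracketing strategy and the Gaussian test case (for which $g=-2E(X_1)/\mathrm{Var}(X_1)$ in closed form) are assumptions made for the example.

```python
import math


def positive_root_of_mgf(mgf, upper, tol=1e-12):
    """Positive solution g of mgf(g) = 1, found by bisection on (0, upper].

    Relies on the shape of g -> E(exp(g X)) when E(X) < 0: it equals 1 at 0,
    dips below 1, then increases through 1 at the root. `upper` must lie
    beyond the root, i.e. mgf(upper) > 1.
    """
    if mgf(upper) <= 1.0:
        raise ValueError("upper must satisfy mgf(upper) > 1")
    lo = upper
    while mgf(lo) >= 1.0:          # shrink until we are left of the root
        lo /= 2.0
        if lo < 1e-12:
            raise ValueError("could not bracket a positive root")
    hi = upper
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if mgf(mid) < 1.0:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)


# Gaussian steps with mean m < 0 and standard deviation s:
# E(exp(g X)) = exp(g m + g^2 s^2 / 2), whose positive root is -2 m / s^2.
m, s = -0.5, 1.0
g = positive_root_of_mgf(lambda z: math.exp(z * m + 0.5 * (z * s) ** 2), upper=5.0)
print(g)
```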

Notice that to make the constant C_exc explicit, one has to know some functionals of n₊, n_-. This is the place where the Wiener-Hopf decomposition is useful.

At this point, points 1 and 2 of our program are solved. For point 3 it remains to combine these results. This gives our final theorem.

Theorem 3 Under the hypotheses of Proposition 2, there exist two constants $K$, $l$ such that, with $a$ the normalizing constant of the introduction,

\[ \lim_{n\to+\infty}P\bigl(M(n)-a\log(n)\le x\bigr)=e^{-Ke^{-lx}}. \]
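The way the last point of the program combines the previous ones can be sketched heuristically. The computation below is an informal plausibility check, not the proof given in the talk. It assumes that the number of excursions before time $n$ can be replaced by $n/E(n_-)$ (the equivalent given by Proposition 1 with $Y_1=n_-$), that the tail estimate of Proposition 2 can be used as an equality, and that the centering is $\log(n)/g$:

\[ P\Bigl(\max_{i\le n/E(n_-)}M_{\mathrm{exc}_i}\le x+\frac{\log n}{g}\Bigr)\approx\Bigl(1-\frac{C_{\mathrm{exc}}\,e^{-gx}}{n}\Bigr)^{n/E(n_-)}\longrightarrow\exp\Bigl(-\frac{C_{\mathrm{exc}}}{E(n_-)}\,e^{-gx}\Bigr). \]

Under these assumptions the limit has the double exponential form of the theorem, with a centering constant $1/g$ and constants of the form $K=C_{\mathrm{exc}}/E(n_-)$ and $l=g$. The summary does not state these identifications, so they should be read as a consequence of the heuristic only.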

