Wiener-Hopf Factorization and Maximal Scores in Biological Sequences

Pierre Nicodme

INRIA-Rocquencourt

Algorithms Seminar

February 10, 1997

[summary by Philippe Robert]

A properly typeset version of this document is available in postscript and in pdf.

If some fonts do not look right on your screen, this might be fixed by configuring your browser (see the documentation here).

1   Introduction

In this talk we study a matching problem for two sequences S=(s1,...,sn), T=(t1,...,tp) where s1,...,sn,t1,...,tp are elements of some alphabet A. This mathematical model is used in the analysis of some biological sequences. To any couple (x,y) A A, one associates a score H(x,y) R which is negative if the two letters do not agree or positive if their affinity is significant. A local matching of length l of these sequences is given by a sequence ((si1,tj1),(si1,tj1),...,(si1,tj1)) with 1 i1<i2< ...<il n and 1 j1<...<jl p. The score of this matching is then defined as
l
k=1
H(s
 
ik
,t
 
jk
).
The main problem considered in this talk is to estimate the maximal score among all the possible matchings of these sequences.

A probabilistic setting is used to give estimates of this optimal score. The letters are assumed to be drawn independently from the alphabet. This hypothesis leads to a formulation of the problem in terms of random walk. The optimal score M(n) for two sequences of size n can be represented as
M(n)=
 
sup
0 j k n
(Sk-Sj),     (1)
where Sn=i=1nXi (10=0); The variables (Xi) are assumed to be independent and identically distributed. The sequence (Sn) is the random walk starting from 0 associated to the distribution of X1. Clearly the sequence (M(n)) is non decreasing with n, and as we will see, it converges to infinity as n+. Our goal is to find an asymptotic estimate of M(n) for n large. We prove that if E(X1)<0, and some other technical conditions, there exists some constant a such that the renormalized sequence M(n)-alog(n) converges in distribution.

2   The relation with a reflected random walk

For n 1 we denote by
Wn=
 
sup
0 k n
(Sn-Sk),
then M(n) can be expressed as
M(n)=
 
sup
0 k n
Wk.     (2)
It is easy to see that the sequence (Wn) satisfies the following relation
Wn+1=(Wn+Xn+1)+,     n 0     (3)
where x+=max(x,0). Now define
n-=inf{n>0/Sn 0},
which is the first time the random walk visits the negative axis. The law of large numbers gives that almost surely limn +1/n1nXi=E(X1), and because E(X1)<0, we have limn +1nXi=-, thus the quantity n- is always finite.

By induction, using (3), one can check that W(n)=Sn for n<n-. Furthermore, we have Wn-=(Sn-)+=0, by definition of n-. It is easy to prove that starting from t=n-, the sequence Wn performs another similar excursion above 0 independently of the previous excursion, and so on. The sequence (Wn) is the reflected random walk at 0.



The method of resolution
To estimate M(n), the maximum of k Wk on the interval {0,...,n}, we proceed as follows
  1. Estimate the distribution of Mexc, the maximum of the random walk during an excursion above 0.
  2. Count the number of excursions of k Wk above 0 up to time n.
  3. M(n)=max{Mexci} where the maximum is taken on all excursions exci before time n. The excursions being independent, the same is true for the (Mexci). Estimating the maximum of independent random variables is easy.
Technically, the main tool used to prove convergences is the renewal theorem. The explicit calculation of constants involved in the limiting distribution requires the Wiener-Hopf factorization associated to this random walk. We give a formulation of these results in the next section.

3   The probabilistic tools

3.1   The renewal theorem

We consider a sequence of non negative i.i.d. integrable random variables (Yi) and denote by Tn=1nYi, the non decreasing random walk associated to the distribution of Y1. For t R+, let Nt be the number of Tk between 0 and t. Thus, TNt+1-t is the length of the interval between t and the first Tk after t.

Proposition 1   Almost surely
 
lim
t +
Nt
t
=
1
E(Y1)
,
where E(Y1) denotes the expected value of Y1.

As we can remark at that point the above proposition will solve the second point of our program above. In this case the beginnings of the excursions will define the renewal process.

The main result in renewal theory concerns the solution of the so-called renewal equation. If f is some function, the function Z is the solution of the renewal equation associated to f if, for all x 0,
Z(x) = f(x) +
x


0
Z(x-y)P(X1 dy).     (4)
The main theorem is the following
Theorem 1   If f is Riemann integrable, there is a unique solution Zf of (4) and Zf satisfies
 
lim
x+
Zf(x)=1/E(X1)
+


0
f(u)du.
This analytical formulation of the renewal theorem can be seen as a consequence of a probabilistic result: the variable TNt+1-t converges in distribution as t +.

3.2   The Wiener-Hopf factorization

This technique concerns the calculation of the distribution of the hitting times of the positive, negative axis by a random walk and the position of the random walk at these times. We have already seen n-, we define its positive counterpart n+,
n+=inf{n/Sn>0}.

Theorem 2   For uC, such that |u|<1, there exist f+(u,.), f-(u,.) such that
  1. 1
    1-uE(e
    -x X
     
    )
    = f+(u,x)f-(u,x),      (x)=0.     (5)

  2. The function f+(u,.) [resp. f-(u,.)] is analytic on { (x) > 0} [resp. { (x)< 0}], continuous, bounded away from 0 and on { (x) 0} [resp. { (x) 0}]. Moreover
     
    lim
    (x) +
    f+(u,x)= 1.
Such a decomposition is unique.

The following corollary is the probabilistic interpretation of the above theorem.
Corollary 1   The functions of the Wiener-Hopf factorization can be expressed as
f+(u,x)
=
1
1-E(u
n+
 
e
-x S
 
n+
 
)
,
     |u|<1,   (x) 0,
f-(u,x)
=
1
1-E(u
n-
 
e
-x S
 
n-
 
)
    |u|<1,   (x) 0.
Thus, if we are able to decompose the function 1/(1-uE(e-x X)), the joint distributions of (n+,Sn+) and (n-,Sn-) are known through their Fourier-Laplace transforms, E(un+e-x Sn+), E(un-e-x Sn-).

4   The main results

Using theorem 1, one can prove the proposition about the tail distribution of the maximum of the random walk during an excursion.
Proposition 2   If the following conditions are satisfied, then
 
lim
x +
e
g x
 
P ( Mexc x ) = Cexc =
P(n+=+)(1-E(e
g S
 
n-
 
))
g E(X1e
g X1
 
) E(n+e
g S
 
n+
 
1
 
{n+<+}
)
.

Notice that to make the constant Cexc explicit, one has to know some functionals of n+, n-. This is the place where the Wiener-Hopf decomposition is useful.

At that point cases 1 and 2 of our program are solved. For point 3 it remains to integrate these results. This gives our final theorem.

Theorem 3   Under the hypotheses of the proposition 2, there exist two constant K,l such that
 
lim
n +
P(M(n)-
log(n)
l
x)=e
-Ke
-l x
 
 
.

References

[1]
Asmussen (Sren). -- Applied Probability and Queues. -- John Wiley & Sons, Chichester, 1987, Wiley Series in Probability and Mathematical Statistics.

[2]
Feller (William). -- An introduction to probability theory and its applications. -- John Wiley & Sons, New York, 1971, 2nd edition, vol. II.

[3]
Iglehart (Donald L.). -- Extreme values in the GI/G/1 queue. Annals of Mathematical Statistics, vol. 43, n°2, 1972, pp. 627--635.

[4]
Karlin (Samuel) and Altschul (Stephen F.). -- Methods for assesing the statistical significance of molecular sequences features by using general scoring schemes. Proceedings of the National Academy of Sciences of the USA, vol. 87, 1990, pp. 2264--2268.

[5]
Karlin (Samuel) and Dembo (Amir). -- Strong limit theorems of empirical functionals for large exedances of partial sums of i.i.d. variables. Annals of Probability, vol. 19, n°4, 1991, pp. 1737--1755.

This document was translated from LATEX by HEVEA.