A properly typeset version of this document is available in postscript and in pdf.
If some fonts do not look right on your screen, this might be
fixed by configuring your browser (see the documentation here).
1 Introduction
In this talk we study a matching problem for two sequences
S=(s1,...,sn), T=(t1,...,tp) where
s1,...,sn,t1,...,tp are elements of some alphabet
A. This mathematical model is used in the analysis of some
biological sequences. To any couple (x,y)Î A× A, one
associates a score H(x,y)Î R which is
negative if the two letters do not agree or positive if their
affinity is significant. A local matching of length l of these
sequences is given by a sequence
((si1,tj1),(si1,tj1),...,(si1,tj1))
with 1£ i1<i2< ...<il£ n and 1£ j1<...<jl£
p. The score of this matching is then defined as
l
å
k=1
H(s
ik
,t
jk
).
The main problem considered in this talk is to estimate the maximal
score among all the possible matchings of these sequences.
A probabilistic setting is used to give estimates of this optimal
score. The letters are assumed to be drawn independently from the
alphabet. This hypothesis leads to a formulation of the problem in
terms of random walk. The optimal score M(n) for two sequences of
size n can be represented as
M(n)=
sup
0£ j£ k£ n
(Sk-Sj),
(1)
where Sn=åi=1nXi (å10=0); The variables (Xi)
are assumed to be independent and identically distributed. The
sequence (Sn) is the random walk starting from 0 associated to the
distribution of X1. Clearly the sequence (M(n)) is non
decreasing with n, and as we will see, it converges to infinity as
n®+¥. Our goal is to find an asymptotic estimate of M(n)
for n large. We prove that if E(X1)<0, and some other
technical conditions, there exists some constant a such that
the renormalized sequence M(n)-alog(n) converges in
distribution.
2 The relation with a reflected random walk
For n³ 1 we denote by
Wn=
sup
0£ k£ n
(Sn-Sk),
then M(n) can be expressed as
M(n)=
sup
0£ k£ n
Wk.
(2)
It is easy to see that the sequence (Wn) satisfies the following
relation
Wn+1=(Wn+Xn+1)+, n³ 0
(3)
where x+=max(x,0). Now define
n-=inf{n>0/Sn£ 0},
which is the first time the random walk visits the negative
axis. The law of large numbers gives
that almost surely limn® +¥1/nå1nXi=E(X1),
and because E(X1)<0, we have limn®
+¥å1nXi=-¥, thus the quantity n- is always
finite.
By induction, using (3), one can check that W(n)=Sn for
n<n-. Furthermore, we have Wn-=(Sn-)+=0, by
definition of n-. It is easy to prove that starting from
t=n-, the sequence Wn performs another similar excursion above
0 independently of the previous excursion, and so on. The sequence
(Wn) is the reflected random walk at 0.
The method of resolution
To estimate M(n), the maximum of k® Wk on the interval
{0,...,n}, we proceed as follows
Estimate the distribution of Mexc, the maximum of the
random walk during an excursion above 0.
Count the number of excursions of k® Wk above 0 up to
time n.
M(n)=max{Mexci} where the maximum is taken on all
excursions exci before time n. The excursions being independent,
the same is true for the (Mexci). Estimating the maximum of
independent random variables is easy.
Technically, the main tool used to prove convergences is the renewal
theorem. The explicit calculation of constants involved in the
limiting distribution requires the Wiener-Hopf factorization associated
to this random walk. We give a formulation of these results in the
next section.
3 The probabilistic tools
3.1 The renewal theorem
We consider a sequence of non negative i.i.d. integrable random
variables (Yi) and denote by Tn=å1nYi, the non
decreasing random walk associated to the distribution of Y1. For
tÎ R+, let Nt be the number of Tk between 0 and t. Thus,
TNt+1-t is the length of the interval between t and the first
Tk after t.
Proposition 1
Almost surely
lim
t® +¥
Nt
t
=
1
E(Y1)
,
where E(Y1) denotes the expected value of Y1.
As we can remark at that point the above proposition will solve the
second point of our program above. In this case the beginnings of the
excursions will define the renewal process.
The main result in renewal theory concerns the solution of the
so-called renewal equation. If f is some function, the function Z
is the solution of the renewal equation associated to f if, for all
x³ 0,
Z(x) = f(x) +
ó õ
x
0
Z(x-y)P(X1Î dy).
(4)
The main theorem is the following
Theorem 1
If f is Riemann integrable, there is a unique solution Zf of
(4) and Zf satisfies
lim
x®+¥
Zf(x)=1/E(X1)
ó õ
+¥
0
f(u)du.
This analytical formulation of the renewal theorem can be seen as a
consequence of a probabilistic result: the variable TNt+1-t
converges in distribution as t® +¥.
3.2 The Wiener-Hopf factorization
This technique concerns the calculation of the distribution of the
hitting times of the positive, negative axis by a random walk and the
position of the random walk at these times. We have already seen
n-, we define its positive counterpart n+,
n+=inf{n/Sn>0}.
Theorem 2
For uÎC, such that |u|<1, there exist f+(u,.),
f-(u,.) such that
1
1-uE(e
-x X
)
= f+(u,x)f-(u,x),
 (x)=0.
(5)
The function f+(u,.) [resp. f-(u,.)] is analytic
on {Â (x) > 0} [resp.
{Â (x)< 0}], continuous, bounded away from 0 and ¥
on {Â (x)³ 0} [resp. {Â (x)£ 0}]. Moreover
lim
 (x)® +¥
f+(u,x)= 1.
Such a decomposition is unique.
The following corollary is the probabilistic interpretation of the
above theorem.
Corollary 1
The functions of the Wiener-Hopf factorization can be expressed as
f+(u,x)
=
1
1-E(u
n+
e
-x S
n+
)
,
|u|<1,
 (x)³ 0,
f-(u,x)
=
1
1-E(u
n-
e
-x
S
n-
)
|u|<1, Â (x)£ 0.
Thus, if we are able to decompose the function 1/(1-uE(e-x
X)), the joint distributions of (n+,Sn+) and
(n-,Sn-) are known through their Fourier-Laplace
transforms, E(un+e-x Sn+), E(un-e-x
Sn-).
4 The main results
Using theorem 1, one can prove the proposition about
the tail distribution of the maximum of the random walk during an
excursion.
Proposition 2
If the following conditions are satisfied,
E(X1)<0, P(X1>0)>0 and X1 is non arithmetic;
There exists q>0 such that E(eq X1)<+¥;
E(|X1|eg X1)<+¥,
where g is the positive solution of E(eg X1)=1;
then
lim
x® +¥
e
g x
P
(
Mexc³ x
)
= Cexc =
P(n+=+¥)(1-E(e
g S
n-
))
g E(X1e
g X1
)
E(n+e
g S
n+
1
{n+<+¥}
)
.
Notice that to make the constant Cexc explicit, one has to know some
functionals of n+, n-. This is the place where the
Wiener-Hopf decomposition is useful.
At that point cases 1 and 2 of our program are solved. For
point 3 it remains to integrate these results. This gives our final
theorem.
Theorem 3
Under the hypotheses of the proposition 2, there exist
two constant K,l such that
Karlin (Samuel) and Altschul (Stephen F.). --
Methods for assesing the statistical significance of molecular
sequences features by using general scoring schemes. Proceedings of the
National Academy of Sciences of the USA, vol. 87, 1990,
pp. 2264--2268.
Karlin (Samuel) and Dembo (Amir). --
Strong limit theorems of empirical functionals for large exedances of
partial sums of i.i.d. variables. Annals of Probability, vol. 19,
n°4, 1991, pp. 1737--1755.