Generating Functions in Computational Biology
Mireille Régnier
Algorithms Project  Inria
Algorithms Seminar
March 3, 97
[summary by Mireille Régnier]
A properly typeset version of this document is available in
postscript and in
pdf.
If some fonts do not look right on your screen, this might be
fixed by configuring your browser (see the documentation here).
Abstract
We present a few enumeration problems that arose in computational biology.
We point out on several examples how symbolic enumeration methods allow
for simplifying and extending previous results. We also present some
asymptotics.
1 Secondary Structures
As a first example, we enumerate here combinatorial structures associated
to RNA sequences, e.g., secondary structures, hairpins, cloverleaves,...).
This study has been started in [11, 13, 8, 4, 12] with inductions on
the size of the RNA sequences.
We show here that symbolic enumeration
methods allow to find directly equations on generating functions and
extend previous results. More precisely, these inductions may deeply depend on the values of the
parameters involved such as the minimal size h of the helices, the minimal
size b of the loops,...Due to the number of initial conditions, the
recurrence relations may become very intricate. We avoid this unnecessary step.
A secondary structure is a set of helices, i.e., paired subsequences of the same size. Two paired subsequences are separated by an embedded
secondary structure or by a nonpaired subsequence, called a loop.
Secondary structures satisfy three additional conditions. First, the helices
must have a minimal size, h. Second, the loops must have a minimal
size, b. Third, two helices cannot overlap.
We now count the number of different secondary structures that may be
built on the RNA sequences of a given size:
Theorem 1
Let S_{n}^{[h,b]} be the set of RNA secondary structures of size n,
where the minimal helix size and loop size are h and b.
The generating function S^{[h,b]}(z)satisfies the second degree
equation:
(S^{[h,b]})^{2}(z)z^{2h} + S^{[h,b]}(z) [(z1)(1z^{2}+z^{2h})z^{2h} 

]+ 1z^{2}+z^{2h}=0.
(1) 
We consider the external helices, and get
S^{[h,b]} from the following decomposition:
W T W T W T W

h'1 h'1 h'2 h'2 h'k h'k
where W is the set of of nonpaired subsequences, and
T^{[h,b]} the set of secondary structures that do not start or
end with a pairing. Otherwise, such a pairing would be part of the external
helix.
In this scheme, we assume there exist k external helices of sizes
h'_{1} ,... ,h'_{k}. Two external
helices are separated by nonpaired positions, which is coded by language
W. We get the set decomposition:
S^{[h,b]}=
W 
× È 

[ A^{[h]} × T^{[h,b]} × W]^{k}
(2) 
where A^{[h]} is the set of couples of subsequences of the same
size h' ³ h.
One can prove a simple relation between T^{[h,b]} and
S^{[h,b]}. Namely:
S^{[h,b]} =
A^{[h]} ×
T^{[h,b]} +
T^{[h,b]} +
Y^{[b]}
(3)
where Y^{[b]} counts sequences of length smaller than b (no structure is possible). I.e. Y^{[b]}(z) =å_{i=0}^{b1} z^{i}.
We plug (3) into (2) and, applying translation rules, we get
(1) after simplification.
Remark 1
For h=1 and b=1, we get the equation in [10].
The next theorem directly follows from Darboux's theorem and equation
(1).
Theorem 2
When n ¾® ¥,
S_{n}^{[h,b]} ~ 

[rD'_{h,b}(r)] 

·r_{h,b}^{(n+2h)}
(4) 
where r is the smallest positive root of the discriminant
D_{h,b}(z) of (1).
Remark 2
The location of the roots is discussed in [9].
Notably, it is proven that
1/r_{h,b} decreases when h and b increase and that,
for any h and b, D_{h,b}(z) has one root
r_{h,b} in ]0,1[ and that D_{h,b}(z) has one root in ]0,1/2[.
It follows that S_{n}^{[h,b]} grows exponentially and that
S_{n}^{[1,b]} ³ 2^{n}.
Numerical values of r have been computed using Maple.
One also derives functional equations satisfied by the generating functions
of specific structures: the socalled hairpins and cloverleaves.
Asymptotics follow similarly.
2 Counting alignments
2.1 Language decomposition
Definition 1
An alignment of k sequences of size (n_{i})_{i=1,k} is
a sequence of ktuples
(a_{1}^{[j]}, ... ,a _{i}^{[j]} , ... , a _{k}^{[j]})
such that:

each subsequence (a _{i}^{[j]}) is increasing from 0 to n_{i};
 two successive ktuples are not equal.
An aligned position is a an integer j such that
a _{i}^{[j]} = a _{i}^{[j1]} +1, i=1,...,k.
A bblock is a subsequence of aligned positions of length at least b.
The number of kalignments is counted in [2].
A first generalization is proposed by [3], for k=2.
Blocks of size below some threshold b are considered as nonsignificant, and
one eliminates all alignments containing such small blocks.
Nevertheless, the authors still count separately nonaligned letters.
E.g. ¹ . We extend this
counting for any k ³ 2.
A second generalization is proposed in [11] for k=2 and b=1.
One identifies and . We extend it for
matching blocks of size b greater than 1. The motivation is the following.
In [3], one considers only matching blocks of size at least b
as significant. It follows that the difference between
and should be considered as non
significant. Hence, in this
section, we identify all alignments that differ by a set of positions
without a b matching block. When b=1 and k=2, this is the generalization
described in [11].
Theorem 3
Let f(n_{1},...,n_{k}) be the number of alignments between k
sequences of lengths n_{1},...,n_{k}.
The associated multivariate generating function is, in the ordered case:
f(z_{1},...,z_{k}) = 
1t +t^{b} 

1s (1t +t^{b} )t 

(5) 
where
t = Õ_{i=1}^{k}z_{i} and
s = å_{i=1}^{k} z_{i}.
The bivariate generating function is, in the ordered case:
f(z_{1},...,z_{k}) = (1t +t^{b}) 
1 

1(p 1)(1t +t^{b} )t 

(6) 
where p = Õ_{i=1}^{k} (1z_{i}).
When b=1 and k=2, this simplifies into
f (z_{1},z_{2}) =1/[1(z_{1}+z_{2})]. It follows that
g_{1}(n,m) = ()
which is the result proved in [3] by a combinatorial approach.
The proof relies on the derivation of a coding language for
sequences f(n_{1},...,n_{k}).
For example, one proves, for b=1, that
L_{1}=[Õ_{i=1}^{k}(1+Z_{i})1]^{*} [2].
Let us build an alignment from left to right. In each position, one may
choose, for each sequence, either to align one character or to introduce a gap.
This contributes either by Z_{i} or by 1, and we get, for independent
choices, Õ_{i=1}^{k} (1+Z_{i}). The only choice to be excluded is the
choice of a gap in all sequences, which is associated to 1^{k}=1. Hence, we
get Equation (5).
It is worth noticing this is a bivariate function of s and t.
2.2 Asymptotics
A possible, and usual, assumption is that aligned sequences have the same
size. Hence, one is interested in
f(n) = [z_{1}^{n}··· z_{k}^{n}] f(z_{1},...,z_{k}) .
The generating function å_{n} f(n)z^{n} is called the diagonal of the
generating function f(z_{1},...,z_{k}). One can find general results on diagonals in [1, 5]. We provide here a simplified scheme of the
approach in the particular case of the alignment of two sequences. We prove:
Theorem 4
Let us define:
D_{1} (t) 
=(1t)^{2} 4t(1t+t^{b}) 
D _{2} (t) 
= (1t^{2}+t^{b+1})4t(1t+t^{b}) . 
We have
f(n) ~ a n^{1/2} r ^{n} where r is the (positive) root
of smallest modulus of D_{1} (t) (respectively D _{2} (t)) in the
nonordered (respectively ordered) case.
Proof.
When k=2, s and p depend only on two variables, t and
z_{1}. Namely, one has:
s = (z_{1}+t/z_{1}) and
p = 1+t +z_{1} +t/z_{1}. In both cases, one has:
F(t) = 

f(n) t ^{n} =[z_{1}^{0}] f(z_{1},t/z_{1}) .

Applying the Cauchy formula, we get:
F(t) = 


ó õ 

dz_{1}
= 


ó õ 

dz_{1}

where P(t,x) is a polynomial in t and x of degree 2 with respect
to the second variable x. One proves that the discriminant of P(t,x) with
respect to x is D_{1} (t) (respectively D_{2} (t))
in the nonordered (respectively ordered) case. We first compute the integral.
Then, applying Darboux's theorem, we get that f(n) / r ^{n} has a
polynomial growth, where r is the smallest positive root of
d _{1} (t) (respectively D _{2}(t)). Darboux's theorem also
provides the dominating term of f(n)/ r ^{n}. We refer the reader to [11] where this term is explicitly given for the nonordered
case.
3 Miscellaneous problems
Many other parameters of interest to biologists can be studied through a
generating function approach. One can cite the longest runs [11] or filtration methods such as the statistical distance [6]. This talk presented new results for the statistical distance
in a nonuniform probability model. Finally, statistics for the number of
occurrences of a given set of words is also of great interest. We presented a
generating function approach [7]. We proved the limiting
distribution is
asymptotically normal and provided formulæ to compute the moments or
the probability of occurrences in the finite range, for Bernoulli and Markov
models.
References
 [1]

Furstenberg (Harry). 
Algebraic functions over finite fields. Journal of Algebra,
vol. 7, 1967, pp. 271277.
 [2]

Griggs (J. R.), Hanlon (P.), Odlyzko (A. M.), and Waterman (M. S.). 
On the number of alignments of k sequences. Graphs and
Combinatorics, vol. 6, n°2, 1990, pp. 133146.
 [3]

Griggs (Jerrold R.), Hanlon (Philip J.), and Waterman (Michael S.). 
Sequence alignments with matched sections. SIAM Journal on
Algebraic Discrete Methods, vol. 7, n°4, 1986,
pp. 604608.
 [4]

Howell (J. A.), Smith (T. F.), and Waterman (M. S.). 
Computation of generating functions for biological molecules. SIAM Journal on Applied Mathematics, vol. 39, n°1, 1980,
pp. 119133.
 [5]

Litow (B.) and Dumas (Ph.). 
Additive cellular automata and algebraic series. Theoretical
Computer Science, vol. 119, n°2, October 1993,
pp. 345354.
 [6]

Pevzner (P. A.). 
Statistical distance between texts and filtration methods in sequence
comparison. CABIOS, vol. 8, n°2, 1992, pp. 121127.
 [7]

Régnier (M.) and Szpankowski (W.). 
On the approximate pattern occurences in a text, 1997.
In Proceeding SEQUENCE'97, Positano.
 [8]

Stein (P. R.) and Waterman (M. S.). 
On some new sequences generalizing the Catalan and Motzkin
numbers. Discrete Mathematics, vol. 26, n°3, 1978,
pp. 261272.
 [9]

Tahi (F.). 
Méthodes formelles d'analyse des séquences de
nucléotides. 
Thèse de 3e cycle, Université de Paris XI, Orsay,
1997. 155 pages.
 [10]

Viennot (X. G.) and Vauchaussade de Chaumont (M.). 
Enumeration of RNA's secondary structures by complexity. In Capasso
(V.), Grosso (E.), and PavenFontana (S. L.) (editors), Mathematics in
Medicine and Biology. Lecture Notes in Biomathematics, vol. 57. 
SpringerVerlag, 1985.
 [11]

Waterman (M.). 
Introduction to Computational Biology. 
Chapman and Hall, London, 1995.
 [12]

Waterman (Michael S.). 
Secondary structure of singlestranded nucleic acids,
pp. 167212. 
Academic Press, 1978, Advances in Mathematics
Supplementary Studies.
 [13]

Waterman (Michael S.). 
Combinatorics of RNA hairpins and cloverleaves. Studies in
Applied Mathematics, vol. 60, n°2, 1979, pp. 9196.
This document was translated from L^{A}T_{E}X by
H^{E}V^{E}A.