Generating Functions in Computational Biology
Algorithms Project --- Inria
March 3, 97
[summary by Mireille Régnier]
A properly typeset version of this document is available in postscript
and in pdf
If some fonts do not look right on your screen, this might be
fixed by configuring your browser (see the documentation here).
We present a few enumeration problems that arose in computational biology.
We point out on several examples how symbolic enumeration methods allow
for simplifying and extending previous results. We also present some
1 Secondary Structures
As a first example, we enumerate here combinatorial structures associated
to RNA sequences, e.g., secondary structures, hairpins, cloverleaves,...).
This study has been started in [11, 13, 8, 4, 12] with inductions on
the size of the RNA sequences.
We show here that symbolic enumeration
methods allow to find directly equations on generating functions and
extend previous results. More precisely, these inductions may deeply depend on the values of the
parameters involved such as the minimal size h of the helices, the minimal
size b of the loops,...Due to the number of initial conditions, the
recurrence relations may become very intricate. We avoid this unnecessary step.
A secondary structure is a set of helices, i.e., paired subsequences of the same size. Two paired subsequences are separated by an embedded
secondary structure or by a non-paired subsequence, called a loop.
Secondary structures satisfy three additional conditions. First, the helices
must have a minimal size, h. Second, the loops must have a minimal
size, b. Third, two helices cannot overlap.
We now count the number of different secondary structures that may be
built on the RNA sequences of a given size:
Let Sn[h,b] be the set of RNA secondary structures of size n,
where the minimal helix size and loop size are h and b.
The generating function S[h,b](z)satisfies the
(S[h,b])2(z)z2h + S[h,b](z) [(z-1)(1-z2+z2h)-z2h
We consider the external helices, and get
S[h,b] from the following decomposition:
W T W T W T W
h'1 h'1 h'2 h'2 h'k h'k
where W is the set of of non-paired subsequences, and
T[h,b] the set of secondary structures that do not start or
end with a pairing. Otherwise, such a pairing would be part of the external
In this scheme, we assume there exist k external helices of sizes
h'1 ,... ,h'k. Two external
helices are separated by non-paired positions, which is coded by language
W. We get the set decomposition:
where A[h] is the set of couples of subsequences of the same
size h' ³ h.
One can prove a simple relation between T[h,b] and
|| × È
|| [ A[h] × T[h,b] × W]k
where Y[b] counts sequences of length smaller than b (no structure is possible). I.e. Y[b](z) =åi=0b-1 zi.
We plug (3) into (2) and, applying translation rules, we get
(1) after simplification.
For h=1 and b=1, we get the equation in .
The next theorem directly follows from Darboux's theorem and equation
When n ¾® ¥,
where r is the smallest positive root of the discriminant
Dh,b(z) of (1).
The location of the roots is discussed in .
Notably, it is proven that
1/rh,b decreases when h and b increase and that,
for any h and b, Dh,b(z) has one root
rh,b in ]0,1[ and that Dh,b(z) has one root in ]0,1/2[.
It follows that Sn[h,b] grows exponentially and that
Sn[1,b] ³ 2n.
Numerical values of r have been computed using Maple.
One also derives functional equations satisfied by the generating functions
of specific structures: the so-called hairpins and cloverleaves.
Asymptotics follow similarly.
2 Counting alignments
2.1 Language decomposition
An alignment of k sequences of size (ni)i=1,k is
a sequence of k-tuples
(a1[j], ... ,a i[j] , ... , a k[j])
An aligned position is a an integer j such that
each subsequence (a i[j]) is increasing from 0 to ni;
- two successive k-tuples are not equal.
a i[j] = a i[j-1] +1, i=1,...,k.
A b-block is a subsequence of aligned positions of length at least b.
The number of k-alignments is counted in .
A first generalization is proposed by , for k=2.
Blocks of size below some threshold b are considered as non-significant, and
one eliminates all alignments containing such small blocks.
Nevertheless, the authors still count separately non-aligned letters.
E.g. ¹ . We extend this
counting for any k ³ 2.
A second generalization is proposed in  for k=2 and b=1.
One identifies and . We extend it for
matching blocks of size b greater than 1. The motivation is the following.
In , one considers only matching blocks of size at least b
as significant. It follows that the difference between
and should be considered as non
significant. Hence, in this
section, we identify all alignments that differ by a set of positions
without a b matching block. When b=1 and k=2, this is the generalization
described in .
Let f(n1,...,nk) be the number of alignments between k
sequences of lengths n1,...,nk.
The associated multivariate generating function is, in the ordered case:
t = Õi=1kzi and
s = åi=1k zi.
The bivariate generating function is, in the ordered case:
where p = Õi=1k (1-zi).
f(z1,...,zk) = (1-t +tb)
When b=1 and k=2, this simplifies into
f (z1,z2) =1/[1-(z1+z2)]. It follows that
g1(n,m) = ()
which is the result proved in  by a combinatorial approach.
The proof relies on the derivation of a coding language for
For example, one proves, for b=1, that
Let us build an alignment from left to right. In each position, one may
choose, for each sequence, either to align one character or to introduce a gap.
This contributes either by Zi or by 1, and we get, for independent
choices, Õi=1k (1+Zi). The only choice to be excluded is the
choice of a gap in all sequences, which is associated to 1k=1. Hence, we
get Equation (5).
It is worth noticing this is a bivariate function of s and t.
A possible, and usual, assumption is that aligned sequences have the same
size. Hence, one is interested in
f(n) = [z1n··· zkn] f(z1,...,zk) .
The generating function ån f(n)zn is called the diagonal of the
generating function f(z1,...,zk). One can find general results on diagonals in [1, 5]. We provide here a simplified scheme of the
approach in the particular case of the alignment of two sequences. We prove:
Let us define:
f(n) ~ a n-1/2 r -n where r is the (positive) root
of smallest modulus of D1 (t) (respectively D 2 (t)) in the
non-ordered (respectively ordered) case.
|D 2 (t)
||= (1-t2+tb+1)-4t(1-t+tb) .
When k=2, s and p depend only on two variables, t and
z1. Namely, one has:
s = (z1+t/z1) and
p = 1+t +z1 +t/z1. In both cases, one has:
Applying the Cauchy formula, we get:
where P(t,x) is a polynomial in t and x of degree 2 with respect
to the second variable x. One proves that the discriminant of P(t,x) with
respect to x is D1 (t) (respectively D2 (t))
in the non-ordered (respectively ordered) case. We first compute the integral.
Then, applying Darboux's theorem, we get that f(n) / r n has a
polynomial growth, where r is the smallest positive root of
d 1 (t) (respectively D 2(t)). Darboux's theorem also
provides the dominating term of f(n)/ r n. We refer the reader to  where this term is explicitly given for the non-ordered
|| f(n) t n =[z10] f(z1,t/z1) .
3 Miscellaneous problems
Many other parameters of interest to biologists can be studied through a
generating function approach. One can cite the longest runs  or filtration methods such as the statistical distance . This talk presented new results for the statistical distance
in a non-uniform probability model. Finally, statistics for the number of
occurrences of a given set of words is also of great interest. We presented a
generating function approach . We proved the limiting
asymptotically normal and provided formulæ to compute the moments or
the probability of occurrences in the finite range, for Bernoulli and Markov
Furstenberg (Harry). --
Algebraic functions over finite fields. Journal of Algebra,
vol. 7, 1967, pp. 271--277.
Griggs (J. R.), Hanlon (P.), Odlyzko (A. M.), and Waterman (M. S.). --
On the number of alignments of k sequences. Graphs and
Combinatorics, vol. 6, n°2, 1990, pp. 133--146.
Griggs (Jerrold R.), Hanlon (Philip J.), and Waterman (Michael S.). --
Sequence alignments with matched sections. SIAM Journal on
Algebraic Discrete Methods, vol. 7, n°4, 1986,
Howell (J. A.), Smith (T. F.), and Waterman (M. S.). --
Computation of generating functions for biological molecules. SIAM Journal on Applied Mathematics, vol. 39, n°1, 1980,
Litow (B.) and Dumas (Ph.). --
Additive cellular automata and algebraic series. Theoretical
Computer Science, vol. 119, n°2, October 1993,
Pevzner (P. A.). --
Statistical distance between texts and filtration methods in sequence
comparison. CABIOS, vol. 8, n°2, 1992, pp. 121--127.
Régnier (M.) and Szpankowski (W.). --
On the approximate pattern occurences in a text, 1997.
In Proceeding SEQUENCE'97, Positano.
Stein (P. R.) and Waterman (M. S.). --
On some new sequences generalizing the Catalan and Motzkin
numbers. Discrete Mathematics, vol. 26, n°3, 1978,
Tahi (F.). --
Méthodes formelles d'analyse des séquences de
Thèse de 3e cycle, Université de Paris XI, Orsay,
1997. 155 pages.
Viennot (X. G.) and Vauchaussade de Chaumont (M.). --
Enumeration of RNA's secondary structures by complexity. In Capasso
(V.), Grosso (E.), and Paven-Fontana (S. L.) (editors), Mathematics in
Medicine and Biology. Lecture Notes in Biomathematics, vol. 57. --
Waterman (M.). --
Introduction to Computational Biology. --
Chapman and Hall, London, 1995.
Waterman (Michael S.). --
Secondary structure of single-stranded nucleic acids,
pp. 167--212. --
Academic Press, 1978, Advances in Mathematics
Waterman (Michael S.). --
Combinatorics of RNA hairpins and cloverleaves. Studies in
Applied Mathematics, vol. 60, n°2, 1979, pp. 91--96.
This document was translated from LATEX by