1 Biological Motivation
Local nucleotide composition, that is, the distribution of the nucleotides
A, C, G, T along a chromosome, is important for many biological
questions. Moreover, local nucleotide composition is accounted for in
many algorithms developed to search for different patterns in DNA
sequences. We present a method of segmentation of nucleotide
sequences into regions with different average composition. The
sequence is modelled as a series of segments; within each segment the
sequence is considered as a random Bernoulli process. The partition
algorithm proceeds in two stages. In the first stage, an optimal
partition is found, which maximizes the overall product of the marginal
likelihoods computed for each segment while preventing segmentation into
short segments. In the next stage, the optimal boundaries are filtered,
and segments with close compositions are merged. This allows us to
study segments at a chosen length-scale.
2 Optimal Segmentation
2.1 Probabilistic formulation
A symbolic sequence over an alphabet \Omega of V letters is
considered as a series of segments. Each segment is modelled as a
Bernoulli random sequence. Bernoulli probabilities are estimated from
the vector n = (n_{1},...,n_{V}), where n_{j} denotes the number of
occurrences of the jth symbol in the segment. In the Bayesian
approach [1], the estimated parameters are random variables.
The probability distribution of these random variables is estimated
from the data. First, one assumes an initial probability
distribution---the so-called prior distribution---that may be chosen
rather arbitrarily. This distribution is then re-estimated from the
data using the Bayes formula. The result of a Bayesian estimation is
always a probability distribution of the estimated quantity. Bayesian
and classical statistics, however, agree for large samples, because
Bayesian distributions converge to the maximum likelihood estimate
for any reasonable prior distribution. Denote the set of letter
probabilities (the segment composition) as
s = (q_{1},...,q_{V}), with \sum_{k=1}^{V} q_{k} = 1. The
likelihood of the individual sequence is
L(s) = \prod_{k=1}^{V} q_{k}^{n_{k}}. For a random composition
s = (q_{1},...,q_{V}), one writes the probability density
function p(s), with normalisation condition \int p(s) ds = 1.
One starts from some prior distribution p(s), say the uniform
distribution on the simplex \sum_{k} q_{k} = 1. The composition s of
the Bernoulli random process is drawn according to this prior
distribution p(s). The posterior probability density function
p(s|n) satisfies Bayes's theorem:

p(s|n) = \frac{L(n|s) p(s)}{P(n)},

where P(n) = \int L(n|s) p(s) ds. The normalisation constant P(n) is
called the marginal likelihood [3]. It reflects the overall
probability of the given sequence in the two-stage random process.
For a uniform prior distribution, one has:

P(n) = \frac{(V-1)!}{(N+V-1)!} \, n_{1}! \cdots n_{V}!.
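As a concrete illustration, the marginal likelihood of a count vector under the uniform prior can be evaluated directly from this formula. The following Python sketch (the function name is ours) uses exact factorials; for long segments one would work with log-gamma instead to avoid overflow:

```python
from math import factorial

def marginal_likelihood(counts):
    """Marginal likelihood P(n) of a count vector n = (n_1, ..., n_V)
    under the uniform prior: (V-1)! / (N+V-1)! * n_1! * ... * n_V!."""
    V = len(counts)          # alphabet size
    N = sum(counts)          # segment length
    p = factorial(V - 1) / factorial(N + V - 1)
    for n in counts:
        p *= factorial(n)
    return p
```

For the segment AACG one gets marginal_likelihood([2, 1, 0, 1]) = 1/420, and an empty segment gives 1, as expected.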
Surprisingly, this quantity is also obtained in a conceptually similar
but different probabilistic model (G. Shaeffer, 1999). For a
sequence of length N, the probability of this sequence under a
shuffling procedure is computed, the numbers (n_{1},..., n_{V}) being
drawn from a uniform distribution. With the assumption that
segments are independent, the complete likelihood of the segmentation
of the sequence into k segments with known boundary locations is:

P = \prod_{k} P_{k}(n_{k}).
This quantity is optimized over the set of all possible boundary
configurations yielding the optimal segmentation.
2.2 Dynamic programming
The maximization algorithm is as follows. Consider a sequence
S = s_{1}s_{2}s_{3}... s_{N} of length N, where s_{i} \in \Omega. For
every segment S(a,b) = s_{a}... s_{b}, one introduces a weight W(a,b):
for example, W(a,b) can be ln P(S(a,b)). A
segmentation R into m blocks is determined as a set of boundaries
R = {k_{0}=0, k_{1}, ..., k_{m-1}, k_{m}=N}, where the boundary k_{i}
separates s_{k_{i}} and s_{k_{i}+1}. Its weight is:

F(R) = \sum_{j=1}^{m} W(k_{j-1}+1, k_{j}).
For functions determined on the segmentations, one also uses another
set of variables, the indicators of the boundary positions q_{k},
1 \le k \le N. By definition, q_{k} = 1 if there is a segment
boundary after the kth letter, and q_{k} = 0 otherwise. Below, we use
the notations F(R) and F(q_{1},...,q_{N}) interchangeably. The
segmentation R^{*} with maximal weight is computed recursively.
Denote by R^{*}(k) the optimal segmentation of the fragment
S(1,k), 1 \le k \le N. R^{*}(1) is trivial. When the optimal
segmentations R^{*}(1), ..., R^{*}(k-1) are known, the optimal
segmentation R^{*}(k) is found using the following recurrence:
F(R^{*}(k)) = \max_{0 \le i \le k-1} [ F(R^{*}(i)) + W(i+1,k) ],     (1)

with F(R^{*}(0)) = 0. This equation yields the algorithm.
Since the segmentation R^{*}(k) is built in time O(k), the total
time can be estimated as O(N^{2}).
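Recurrence (1) translates directly into an O(N^{2}) dynamic program. The sketch below (all names are ours) takes W(a,b) = ln P(S(a,b)) with the uniform-prior marginal likelihood of Sec. 2.1; the optional per-segment `penalty` stands in for the boundary penalty of Sec. 2.3 and is our assumption, not the authors' calibrated value:

```python
from math import lgamma

def log_marginal(counts):
    """ln P(n) under the uniform prior:
    ln (V-1)! - ln (N+V-1)! + sum_j ln n_j!."""
    V, N = len(counts), sum(counts)
    return lgamma(V) - lgamma(N + V) + sum(lgamma(n + 1) for n in counts)

def optimal_segmentation(seq, alphabet="ACGT", penalty=0.0):
    """Recurrence (1): F(R*(k)) = max_i [F(R*(i)) + W(i+1,k)], in O(N^2).
    `penalty` is subtracted once per segment (a stand-in for Sec. 2.3).
    Returns the boundary positions k_1 < ... < k_m, with k_m = N."""
    idx = {c: j for j, c in enumerate(alphabet)}
    N, V = len(seq), len(alphabet)
    prefix = [[0] * V]                    # prefix letter counts
    for c in seq:
        row = prefix[-1][:]
        row[idx[c]] += 1
        prefix.append(row)

    def W(a, b):                          # weight of segment s_a..s_b, 1-based
        return log_marginal([prefix[b][j] - prefix[a - 1][j] for j in range(V)])

    F = [0.0] + [float("-inf")] * N       # F[k] = weight of R*(k), F[0] = 0
    back = [0] * (N + 1)                  # back[k] = previous optimal boundary
    for k in range(1, N + 1):
        for i in range(k):
            cand = F[i] + W(i + 1, k) - penalty
            if cand > F[k]:
                F[k], back[k] = cand, i
    bounds, k = [], N                     # trace the optimal boundaries back
    while k > 0:
        bounds.append(k)
        k = back[k]
    return sorted(bounds)
```

Without a penalty, a short sequence such as ACGT is split into single letters, illustrating the over-segmentation discussed in Sec. 2.3; a modest penalty keeps it whole.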
2.3 Fluctuations in local composition
It appears that segments in the optimal segmentation are usually very
short. Even a uniform random Bernoulli sequence is divided into many
segments. More generally, when the sequence consists of several
homogeneous random domains, the optimal segmentation may place many
borders within the domains. This phenomenon is due to
statistical fluctuations of the local nucleotide composition in random
sequences. Thus it is advantageous to distinguish boundaries that
separate long regions of different composition from those that merely
reflect statistical fluctuations. This can be done by penalizing
segmentations that contain more boundaries. The penalty value was
initially calibrated by computer simulations.
3 Filtration of Boundaries
3.1 Partition function
To study the relative significance of a boundary, one can calculate a
score that reflects how the addition of this particular boundary
influences the weights of segmentations. Given the probability P(q)
of each segmentation q = (q_{1},...,q_{N}), one defines the partition
function of the segmentations in the standard way [2], by
summing the probabilities of all possible partitions:

Z(N) = \sum_{q_{1},...,q_{N-1}} P(q_{1},...,q_{N-1}).     (2)
With the partition function at hand, one can compute the probability
that a boundary is located after a particular letter k. One
computes two partition functions for the regions to the left and to
the right of this boundary, Z_{L} and Z_{R} respectively:

P(k) = \frac{Z_{L}(k) Z_{R}(N-k)}{Z(N)}.     (3)
3.2 Dynamic programming
The partition function in (2) rewrites as
follows [2]:

Z(N) = \sum_{q_{1},...,q_{N-1}} e^{F(q_{1},...,q_{N-1})}.     (4)
To compute the probability of a boundary after the letter k, we also
need the partition functions of the segments to the left and to the
right of this boundary; the recursive formulae for Z_{L}(k)
and Z_{R}(k) are analogous to (1). They are obtained
through a formal substitution of operations: summation is used
instead of taking the maximum, and multiplication is used instead of
summation [2]. Equation (1) becomes:

Z_{L}(k) = \sum_{j=0}^{k-1} e^{W(j+1,k-1)} Z_{L}(j),

Z_{R}(k) = \sum_{j=k}^{N} e^{W(k,j)} Z_{R}(j+1),

with boundary conditions Z_{L}(0) = Z_{R}(N+1) = 1 and W(a,b) = 0 for
empty segments (a > b). An obvious modification of the dynamic
programming calculates the partition function in the case when only a
given set of boundaries is allowed.
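These recursions give all boundary probabilities in O(N^{2}) time. A minimal Python sketch follows; we index Z_{R} by the start position of the right region (so the probability of a boundary after letter k reads Z_{L}(k) Z_{R}(k+1) / Z(N)), which is one consistent reading of the formulas above, and all names are ours:

```python
from math import exp, lgamma

def boundary_probabilities(seq, alphabet="ACGT"):
    """Probability (3) of a boundary after letter k = 1..N-1, computed
    from the left/right partition functions Z_L and Z_R."""
    idx = {c: j for j, c in enumerate(alphabet)}
    N, V = len(seq), len(alphabet)
    prefix = [[0] * V]                    # prefix letter counts
    for c in seq:
        row = prefix[-1][:]
        row[idx[c]] += 1
        prefix.append(row)

    def w(a, b):
        # e^{W(a,b)}: marginal likelihood of segment a..b (1-based, inclusive)
        n = [prefix[b][j] - prefix[a - 1][j] for j in range(V)]
        return exp(lgamma(V) - lgamma(sum(n) + V)
                   + sum(lgamma(x + 1) for x in n))

    # Z_L(k): partition function of the prefix s_1..s_k, Z_L(0) = 1.
    ZL = [1.0] + [0.0] * N
    for k in range(1, N + 1):
        ZL[k] = sum(ZL[j] * w(j + 1, k) for j in range(k))
    # Z_R(k): partition function of the suffix s_k..s_N, Z_R(N+1) = 1.
    ZR = [0.0] * (N + 1) + [1.0]
    for k in range(N, 0, -1):
        ZR[k] = sum(w(k, j) * ZR[j + 1] for j in range(k, N + 1))
    Z = ZL[N]                             # equals ZR[1], the full Z(N)
    return [ZL[k] * ZR[k + 1] / Z for k in range(1, N)]
```

For AAAATTTT the probability is maximal for the boundary after the fourth letter, and the profile is symmetric, as expected.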
3.3 Filtration strategy
For the best result one should combine calculation of optimal
segmentation with filtration. At the first stage, an optimal
segmentation is found. Then a cut-off value is chosen and all the
boundaries with probabilities (3) lower than that
cut-off value are removed. The resulting set of boundaries usually is
not optimal in the sense that some boundaries can also be removed,
yielding a configuration with a higher probability P. So an
additional round of optimisation is performed, removing some
boundaries. Iterations converge rapidly to the stable set of
boundaries all of which have the partition function probabilities
greater than the cut-off value.
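Putting the two stages together, the filtration loop can be sketched as follows. This is an assumption-laden sketch: the names are ours, boundaries below the cut-off are removed simultaneously in each round, and the partition functions are restricted to the currently allowed boundary set, as described above:

```python
from math import exp, lgamma, inf

def filter_boundaries(seq, cutoff, alphabet="ACGT"):
    """Two-stage procedure of Sec. 3.3: optimal segmentation (Sec. 2.2),
    then iterated removal of boundaries whose probability (3), computed
    over the currently allowed boundary set, falls below `cutoff`."""
    idx = {c: i for i, c in enumerate(alphabet)}
    N, V = len(seq), len(alphabet)
    pre = [[0] * V]                       # prefix letter counts
    for c in seq:
        row = pre[-1][:]
        row[idx[c]] += 1
        pre.append(row)

    def lw(a, b):
        # log marginal likelihood W(a,b) of segment a..b (1-based, inclusive)
        n = [pre[b][j] - pre[a - 1][j] for j in range(V)]
        return lgamma(V) - lgamma(sum(n) + V) + sum(lgamma(x + 1) for x in n)

    def w(a, b):
        return exp(lw(a, b))

    # Stage 1: optimal segmentation via recurrence (1).
    F = [0.0] + [-inf] * N
    back = [0] * (N + 1)
    for k in range(1, N + 1):
        for i in range(k):
            v = F[i] + lw(i + 1, k)
            if v > F[k]:
                F[k], back[k] = v, i
    allowed, k = set(), N
    while k > 0:
        allowed.add(k)
        k = back[k]

    # Stage 2: filtration restricted to the allowed boundaries.
    while True:
        cuts = sorted(allowed)            # always contains N
        inner = [b for b in cuts if b < N]
        ZL = {0: 1.0}                     # partition fn of prefix ending at cut b
        for b in cuts:
            ZL[b] = sum(ZL[a] * w(a + 1, b)
                        for a in [0] + [x for x in cuts if x < b])
        ZR = {N: 1.0}                     # partition fn of suffix after cut b
        for b in reversed(inner):
            ZR[b] = sum(w(b + 1, c) * ZR[c] for c in cuts if c > b)
        Z = ZL[N]
        keep = {b for b in inner if ZL[b] * ZR[b] / Z >= cutoff} | {N}
        if keep == allowed:
            return sorted(keep - {N})     # stable interior boundaries
        allowed = keep
```

With a cut-off of 0.5, the sequence AAAAAAAATTTTTTTT keeps its single strong boundary after position 8; a cut-off very close to 1 removes even that boundary.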
Durbin (R.), Eddy (S.), Krogh (A.), and Mitchison (G.). --
Biological sequence analysis: probabilistic models of proteins
and nucleic acids. --
Cambridge University Press, 1998.
Finkelstein (A. V.) and Roytberg (M. A.). --
Computation of biopolymers: A general approach to different problems.
Biosystems, vol. 30, 1993, pp. 1--19.