Enumeration of Autocorrelations and Computation of Their Populations
Éric Rivals
LIRMM, Université Montpellier II
Algorithms Seminar
November 22, 1999
[summary by Pierre Nicodème]
A properly typeset version of this document is available in
postscript and in
pdf.
If some fonts do not look right on your screen, this might be
fixed by configuring your browser (see the documentation here).
Abstract
This talk presents in a first part Guibas and Oldlyzko's
characterization of autocorrelations and in a second part algorithms
developed by Éric Rivals with Sven Rahmann (TBI, DKFZ, Heidelberg)
to enumerate the autocorrelations and to simultaneously compute their
populations.
1 Introduction
An interesting statistics about a random text of size N is the
number of different words of a given size n it contains, or,
equivalently, how many words of size n are missing in the random
text. These statistics are closely linked with the autocorrelations of
the words, that are sets of periods of the words. We consider here
the enumeration of autocorrelations and the populations of the
autocorrelations, originally studied by Guibas and
Odlyzko [3]. The original motivation of Rivals and Rahmann
comes from searching genomic databases with
q-grams [1].
2 Definitions
We consider a finite alphabet S. Let w=w1 w2
··· wn where wiÎ S. A period of w is an
integer p such that for all i between 1 and n-p we have
ai=ai+p. As an example, the word abracadabra has for
periods 0, 7, and 10. Its factor abra has for periods 0 and 1.
The autocorrelation vector of a word w, denoted by V(w),
is the binary vector V=(v0, v1,..., vn-1) such that vi is
equal to one if i is a period of w and to zero
otherwise. Alternatively, the autocorrelation will be denoted by the
corresponding binary word v0 v1 ··· vn-1. We denote P(w)
the set of autocorrelations of the word w.
We are interested in statistics about the whole set of words of size
n and therefore denote G(n) the set of autocorrelations of
size n and k(n) its cardinality.
The periods have the following properties:
-
0 is always a period;
-
if i is a period, then for all i in the range (1,ë
n/pû ] the integer ip is a period;
-
if p and q are periods of w, with p<q, then q-p is a period
of the prefix of length n-p of w.
Theorem 1 [Fine and Wilf]
Let p and q be periods of a word w, with p<q. If p+q£ |w|
+gcd(p,q) then gcd(p,q) is a period
of w.
See [2, 3] for this theorem.
3 Periods in Strings
This section follows the lines of Guibas and Odlyzko [3].
In order to give equivalent characterizations of autocorrelations
vectors within binary vectors in Theorem 2 below,
we now give the definitions of the forward and backward propagation
rules and of the X predicate that are used in this theorem.
If p<q are periods of a word w, then q+(q-p) is also
period. This gives the following rule.
Definition 1 [Forward Propagation Rule]
A binary vector V=(v0,v1,···,vn) satisfies the forward
propagation rule if, whenever we have vp=vq=1 with p<q, we also
have vt=1 for all
t in [ p,n) such that t=p+i(q-p) with i=0,1,2,....
The backward propagation rule asserts that if p and q are periods
with p<q and if p-(q-p) is not a period, then none of the positive
integers p-i(q-p) may be a period.
Definition 2 [Backward Propagation Rule]
A binary vector V=(v0,v-1,···,vn-1) satisfies the backward
propagation rule if the following condition holds. Consider every p
and q such that p<q£ 2p with vp=vq=1, but v2p-q=0;
then for all t in the range [ 0,2p-q ] such that t=p-i(q-p)
and i belongs to the interval
[1,ën-p/q-pû] we have
vt=0.
We now introduce a recursive predicate on binary vectors that is
equivalent to the condition that the binary vector is an
autocorrelation vector. In the following, we note the shortest period
of a word v by p(v).
Definition 3 [Recursive Predicate X]
Let V=(v0,v1,...,vn-1) be a non-empty binary vector. Define
p=p(v0 v1 ··· vn-1).
The vector V
satisfies the predicate X if and only if V is such that v0=1
and V satisfies one of the following two conditions:
-
Case (A), p£ ën/2û.
Let r=nmod p and q=p+r and let w=w1··· wq be the suffix of
v0 v1 ··· vn-1 of length q. Then:
-
for all j in the
range [ 1,n-q ], vj=1 if j=ip for some i, and vj=0
otherwise;
- wp=1 or r=0;
- if p(w)<p then p(w)>(q-p)+gcd(p,p(w));
- the vector (w1,...,wq) satisfies predicate X.
- Case (B), p>ën/2û.
Let w=w1··· wn-p be the suffix of v0··· vn-1 of
length n-p. Then for all j in the range [ 1,n-p ] we have
vj=0 and the vector (w1,..., wn-p) satisfies
predicate X.
The algorithmic check of the predicate X requires O(n)
operations on a vector V of size n.
Theorem 2
Let V=(v0,v1,···,vn) be a non-empty binary vector. Then the
following four statements are equivalents:
-
V is a correlation vector of a binary string;
- V is a correlation vector of some string;
- v0=1 and V satisfies the forward and backward propagation
rules;
- V satisfies the predicate X.
Note that equivalence between statements 1 and 2 implies that the
characterization of an autocorrelation vector is independent of the
size of the alphabet.
Figure 1:
Recursive algorithm
Autocorrelations.
4 An Algorithm to Enumerate all Autocorrelations of Size n
We use the predicate X to build a recursive bottom-up procedure
that constructs autocorrelation vectors. To this end, note that the
condition (2) of Case (A) of the predicate X is equivalent to
p(w) does not divide p
and
p(w)>j0=min { j| j+p>q+gcd(j,p) }.
Algorithm Autocorrelation to enumerate all autocorrelations
until size n is given in Figure 1.
Implementation
The autocorrelations are stored as binary vectors. The implementation
has been done as an iterative procedure, although the algorithm
presented in Figure 1 is recursive. Note that in
Case (A) of the algorithm the tests of conditions (a) and (b) of the
X predicate can be done in O(1) operations. Moreover only the
valid subset of G(q) is computed.
Complexity and optimality
Each bit of an autocorrelation is computed only once. The complexity
is unknown, no close formula for the number of autocorrelations of
size n being known.
Asymptotic bounds
Guibas and Odlyzko [3] give the following bounds for the
logarithm of the number k(n) of autocorrelations of size n:
bl= |
æ ç ç è |
|
+o(1) |
ö ÷ ÷ ø |
log2 n £ log k(n)
£ |
æ ç ç è |
|
+o(1) |
ö ÷ ÷ ø |
log2 n. |
For numerical computations up to n=200, Rivals and Rahmann obtain
k(n)<bl. They conjecture that the asymptotic value
of k(n) is bl, the lower bound of Guibas and Odlyzko.
5 Computation of the Populations of Autocorrelations
In this section, the size n of the autocorrelations vectors is fixed.
Definition 4
The population N of an autocorrelation vector V is defined as
N(V)=Card { w| V is the autocorrelation vector of w }.
We define a partial order £ on the autocorrelation vectors
by V=v0 v1 ··· vn-1 £ V'=v'0 v'1 ··· v'n-1 if
for all i in [ 0, n-1 ], v'i=1 whenever vi=1. We also
define the total order £ by V£ V' if the word v0 v1···
vn-1 precedes lexicographically the word v'0 v'1···
v'n-1. Then V £ V' implies V £ V'. Autocorrelation
vectors of size n are sorted along the total order £ and
numbered along this order from 1 to k(n). The notation Vk
refers to the vector at rank k in this order.
Definition 5
The number rk of free characters of the autocorrelation Vk is
the number of characters that we can choose freely to build a word
with the correlation Vk. The other characters are determined by the
periods of the autocorrelation.
With an alphabet of size s, for k from k (=k(n)) to 1, we get
The implementation is
quadratic in k(n).
References
- [1]
-
Burkhardt (S.), Crauser (A.), Ferragina (P.), Lenhof (H.-P.), Rivals (E.), and
Vingron (M.). --
q-gram based database searching using a suffix array (QUASAR). In
Third International Conference on Computational Biology. pp. 77--83. --
ACM-Press, 1999. S. Istrail, P. Pevzner and M.
Waterman, editors.
- [2]
-
Fine (N. J.) and Wilf (H. S.). --
Uniqueness theorems for periodic functions. Proceedings of the
AMS, vol. 16, 1965, pp. 109--114.
- [3]
-
Guibas (Leo J.) and Odlyzko (Andrew M.). --
Periods in strings. Journal of Combinatorial Theory. Series A,
vol. 30, n°1, 1981, pp. 19--42.
This document was translated from LATEX by
HEVEA.