Enumeration of Autocorrelations and Computation of Their Populations
Éric Rivals
LIRMM, Université Montpellier II
Algorithms Seminar
November 22, 1999
[summary by Pierre Nicodème]
A properly typeset version of this document is available in
postscript and in
pdf.
If some fonts do not look right on your screen, this might be
fixed by configuring your browser (see the documentation here).
Abstract
This talk presents in a first part Guibas and Oldlyzko's
characterization of autocorrelations and in a second part algorithms
developed by Éric Rivals with Sven Rahmann (TBI, DKFZ, Heidelberg)
to enumerate the autocorrelations and to simultaneously compute their
populations.
1 Introduction
An interesting statistics about a random text of size N is the
number of different words of a given size n it contains, or,
equivalently, how many words of size n are missing in the random
text. These statistics are closely linked with the autocorrelations of
the words, that are sets of periods of the words. We consider here
the enumeration of autocorrelations and the populations of the
autocorrelations, originally studied by Guibas and
Odlyzko [3]. The original motivation of Rivals and Rahmann
comes from searching genomic databases with
qgrams [1].
2 Definitions
We consider a finite alphabet S. Let w=w_{1} w_{2}
··· w_{n} where w_{i}Î S. A period of w is an
integer p such that for all i between 1 and np we have
a_{i}=a_{i+p}. As an example, the word abracadabra has for
periods 0, 7, and 10. Its factor abra has for periods 0 and 1.
The autocorrelation vector of a word w, denoted by V(w),
is the binary vector V=(v_{0}, v_{1},..., v_{n1}) such that v_{i} is
equal to one if i is a period of w and to zero
otherwise. Alternatively, the autocorrelation will be denoted by the
corresponding binary word v_{0} v_{1} ··· v_{n1}. We denote P(w)
the set of autocorrelations of the word w.
We are interested in statistics about the whole set of words of size
n and therefore denote G(n) the set of autocorrelations of
size n and k(n) its cardinality.
The periods have the following properties:

0 is always a period;

if i is a period, then for all i in the range (1,ë
n/pû ] the integer ip is a period;

if p and q are periods of w, with p<q, then qp is a period
of the prefix of length np of w.
Theorem 1 [Fine and Wilf]
Let p and q be periods of a word w, with p<q. If p+q£ w
+gcd(p,q) then gcd(p,q) is a period
of w.
See [2, 3] for this theorem.
3 Periods in Strings
This section follows the lines of Guibas and Odlyzko [3].
In order to give equivalent characterizations of autocorrelations
vectors within binary vectors in Theorem 2 below,
we now give the definitions of the forward and backward propagation
rules and of the X predicate that are used in this theorem.
If p<q are periods of a word w, then q+(qp) is also
period. This gives the following rule.
Definition 1 [Forward Propagation Rule]
A binary vector V=(v_{0},v_{1},···,v_{n}) satisfies the forward
propagation rule if, whenever we have v_{p}=v_{q}=1 with p<q, we also
have v_{t}=1 for all
t in [ p,n) such that t=p+i(qp) with i=0,1,2,....
The backward propagation rule asserts that if p and q are periods
with p<q and if p(qp) is not a period, then none of the positive
integers pi(qp) may be a period.
Definition 2 [Backward Propagation Rule]
A binary vector V=(v_{0},v1,···,v_{n1}) satisfies the backward
propagation rule if the following condition holds. Consider every p
and q such that p<q£ 2p with v_{p}=v_{q}=1, but v_{2pq}=0;
then for all t in the range [ 0,2pq ] such that t=pi(qp)
and i belongs to the interval
[1,ënp/qpû] we have
v_{t}=0.
We now introduce a recursive predicate on binary vectors that is
equivalent to the condition that the binary vector is an
autocorrelation vector. In the following, we note the shortest period
of a word v by p(v).
Definition 3 [Recursive Predicate X]
Let V=(v_{0},v_{1},...,v_{n1}) be a nonempty binary vector. Define
p=p(v_{0} v_{1} ··· v_{n1}).
The vector V
satisfies the predicate X if and only if V is such that v_{0}=1
and V satisfies one of the following two conditions:

Case (A), p£ ën/2û.
Let r=nmod p and q=p+r and let w=w_{1}··· w_{q} be the suffix of
v_{0} v_{1} ··· v_{n1} of length q. Then:

for all j in the
range [ 1,nq ], v_{j}=1 if j=ip for some i, and v_{j}=0
otherwise;
 w_{p}=1 or r=0;
 if p(w)<p then p(w)>(qp)+gcd(p,p(w));
 the vector (w_{1},...,w_{q}) satisfies predicate X.
 Case (B), p>ën/2û.
Let w=w_{1}··· w_{np} be the suffix of v_{0}··· v_{n1} of
length np. Then for all j in the range [ 1,np ] we have
v_{j}=0 and the vector (w_{1},..., w_{np}) satisfies
predicate X.
The algorithmic check of the predicate X requires O(n)
operations on a vector V of size n.
Theorem 2
Let V=(v_{0},v_{1},···,v_{n}) be a nonempty binary vector. Then the
following four statements are equivalents:

V is a correlation vector of a binary string;
 V is a correlation vector of some string;
 v_{0}=1 and V satisfies the forward and backward propagation
rules;
 V satisfies the predicate X.
Note that equivalence between statements 1 and 2 implies that the
characterization of an autocorrelation vector is independent of the
size of the alphabet.
Figure 1:
Recursive algorithm
Autocorrelations.
4 An Algorithm to Enumerate all Autocorrelations of Size n
We use the predicate X to build a recursive bottomup procedure
that constructs autocorrelation vectors. To this end, note that the
condition (2) of Case (A) of the predicate X is equivalent to
p(w) does not divide p
and
p(w)>j_{0}=min { j j+p>q+gcd(j,p) }.
Algorithm Autocorrelation to enumerate all autocorrelations
until size n is given in Figure 1.
Implementation
The autocorrelations are stored as binary vectors. The implementation
has been done as an iterative procedure, although the algorithm
presented in Figure 1 is recursive. Note that in
Case (A) of the algorithm the tests of conditions (a) and (b) of the
X predicate can be done in O(1) operations. Moreover only the
valid subset of G(q) is computed.
Complexity and optimality
Each bit of an autocorrelation is computed only once. The complexity
is unknown, no close formula for the number of autocorrelations of
size n being known.
Asymptotic bounds
Guibas and Odlyzko [3] give the following bounds for the
logarithm of the number k(n) of autocorrelations of size n:
b_{l}= 
æ ç ç è 

+o(1) 
ö ÷ ÷ ø 
log^{2} n £ log k(n)
£ 
æ ç ç è 

+o(1) 
ö ÷ ÷ ø 
log^{2} n. 
For numerical computations up to n=200, Rivals and Rahmann obtain
k(n)<b_{l}. They conjecture that the asymptotic value
of k(n) is b_{l}, the lower bound of Guibas and Odlyzko.
5 Computation of the Populations of Autocorrelations
In this section, the size n of the autocorrelations vectors is fixed.
Definition 4
The population N of an autocorrelation vector V is defined as
N(V)=Card { w V is the autocorrelation vector of w }.
We define a partial order £ on the autocorrelation vectors
by V=v_{0} v_{1} ··· v_{n1} £ V'=v'_{0} v'_{1} ··· v'_{n1} if
for all i in [ 0, n1 ], v'_{i}=1 whenever v_{i}=1. We also
define the total order £ by V£ V' if the word v_{0} v_{1}···
v_{n1} precedes lexicographically the word v'_{0} v'_{1}···
v'_{n1}. Then V £ V' implies V £ V'. Autocorrelation
vectors of size n are sorted along the total order £ and
numbered along this order from 1 to k(n). The notation V_{k}
refers to the vector at rank k in this order.
Definition 5
The number r_{k} of free characters of the autocorrelation V_{k} is
the number of characters that we can choose freely to build a word
with the correlation V_{k}. The other characters are determined by the
periods of the autocorrelation.
With an alphabet of size s, for k from k (=k(n)) to 1, we get
N(V_{k})=s 

 

å 
k<j<k and V_{j} > V_{k} 

N(V_{j}). 
The implementation is
quadratic in k(n).
References
 [1]

Burkhardt (S.), Crauser (A.), Ferragina (P.), Lenhof (H.P.), Rivals (E.), and
Vingron (M.). 
qgram based database searching using a suffix array (QUASAR). In
Third International Conference on Computational Biology. pp. 7783. 
ACMPress, 1999. S. Istrail, P. Pevzner and M.
Waterman, editors.
 [2]

Fine (N. J.) and Wilf (H. S.). 
Uniqueness theorems for periodic functions. Proceedings of the
AMS, vol. 16, 1965, pp. 109114.
 [3]

Guibas (Leo J.) and Odlyzko (Andrew M.). 
Periods in strings. Journal of Combinatorial Theory. Series A,
vol. 30, n°1, 1981, pp. 1942.
This document was translated from L^{A}T_{E}X by
H^{E}V^{E}A.