Enumeration of Autocorrelations and Computation of Their Populations

ric Rivals

LIRMM, Universit Montpellier II

Algorithms Seminar

November 22, 1999

[summary by Pierre Nicodme]

A properly typeset version of this document is available in postscript and in pdf.

If some fonts do not look right on your screen, this might be fixed by configuring your browser (see the documentation here).

Abstract
This talk presents in a first part Guibas and Oldlyzko's characterization of autocorrelations and in a second part algorithms developed by ric Rivals with Sven Rahmann (TBI, DKFZ, Heidelberg) to enumerate the autocorrelations and to simultaneously compute their populations.



1   Introduction

An interesting statistics about a random text of size N is the number of different words of a given size n it contains, or, equivalently, how many words of size n are missing in the random text. These statistics are closely linked with the autocorrelations of the words, that are sets of periods of the words. We consider here the enumeration of autocorrelations and the populations of the autocorrelations, originally studied by Guibas and Odlyzko [3]. The original motivation of Rivals and Rahmann comes from searching genomic databases with q-grams [1].

2   Definitions

We consider a finite alphabet S. Let w=w1 w2 wn where wi S. A period of w is an integer p such that for all i between 1 and n-p we have ai=ai+p. As an example, the word abracadabra has for periods 0, 7, and 10. Its factor abra has for periods 0 and 1. The autocorrelation vector of a word w, denoted by V(w), is the binary vector V=(v0, v1,..., vn-1) such that vi is equal to one if i is a period of w and to zero otherwise. Alternatively, the autocorrelation will be denoted by the corresponding binary word v0 v1 vn-1. We denote P(w) the set of autocorrelations of the word w.

We are interested in statistics about the whole set of words of size n and therefore denote G(n) the set of autocorrelations of size n and k(n) its cardinality.

The periods have the following properties:
  1. 0 is always a period;
  2. if i is a period, then for all i in the range (1, n/p ] the integer ip is a period;
  3. if p and q are periods of w, with p<q, then q-p is a period of the prefix of length n-p of w.
Theorem 1  [Fine and Wilf]   Let p and q be periods of a word w, with p<q. If p+q |w| +gcd(p,q) then gcd(p,q) is a period of w.
See [2, 3] for this theorem.

3   Periods in Strings

This section follows the lines of Guibas and Odlyzko [3].

In order to give equivalent characterizations of autocorrelations vectors within binary vectors in Theorem 2 below, we now give the definitions of the forward and backward propagation rules and of the X predicate that are used in this theorem.

If p<q are periods of a word w, then q+(q-p) is also period. This gives the following rule.
Definition 1  [Forward Propagation Rule]   A binary vector V=(v0,v1,,vn) satisfies the forward propagation rule if, whenever we have vp=vq=1 with p<q, we also have vt=1 for all t in p,n) such that t=p+i(q-p) with i=0,1,2,....

The backward propagation rule asserts that if p and q are periods with p<q and if p-(q-p) is not a period, then none of the positive integers p-i(q-p) may be a period.
Definition 2  [Backward Propagation Rule]   A binary vector V=(v0,v-1,,vn-1) satisfies the backward propagation rule if the following condition holds. Consider every p and q such that p<q 2p with vp=vq=1, but v2p-q=0; then for all t in the range [ 0,2p-q ] such that t=p-i(q-p) and i belongs to the interval [1,n-p/q-p] we have vt=0.

We now introduce a recursive predicate on binary vectors that is equivalent to the condition that the binary vector is an autocorrelation vector. In the following, we note the shortest period of a word v by p(v).
Definition 3  [Recursive Predicate X]   Let V=(v0,v1,...,vn-1) be a non-empty binary vector. Define p=p(v0 v1 vn-1). The vector V satisfies the predicate X if and only if V is such that v0=1 and V satisfies one of the following two conditions:

The algorithmic check of the predicate X requires O(n) operations on a vector V of size n.

Theorem 2   Let V=(v0,v1,,vn) be a non-empty binary vector. Then the following four statements are equivalents:
  1. V is a correlation vector of a binary string;
  2. V is a correlation vector of some string;
  3. v0=1 and V satisfies the forward and backward propagation rules;
  4. V satisfies the predicate X.
Note that equivalence between statements 1 and 2 implies that the characterization of an autocorrelation vector is independent of the size of the alphabet.




Figure 1: Recursive algorithm Autocorrelations.


4   An Algorithm to Enumerate all Autocorrelations of Size n

We use the predicate X to build a recursive bottom-up procedure that constructs autocorrelation vectors. To this end, note that the condition (2) of Case (A) of the predicate X is equivalent to
p(w) does not divide p   and   p(w)>j0=min { j| j+p>q+gcd(j,p) }.
Algorithm Autocorrelation to enumerate all autocorrelations until size n is given in Figure 1.

Implementation
The autocorrelations are stored as binary vectors. The implementation has been done as an iterative procedure, although the algorithm presented in Figure 1 is recursive. Note that in Case (A) of the algorithm the tests of conditions (a) and (b) of the X predicate can be done in O(1) operations. Moreover only the valid subset of G(q) is computed.

Complexity and optimality
Each bit of an autocorrelation is computed only once. The complexity is unknown, no close formula for the number of autocorrelations of size n being known.

Asymptotic bounds
Guibas and Odlyzko [3] give the following bounds for the logarithm of the number k(n) of autocorrelations of size n:
bl=


1
2 log 2
+o(1)


log2 n log k(n)


1
2log(3/2)
+o(1)


log2 n.
For numerical computations up to n=200, Rivals and Rahmann obtain k(n)<bl. They conjecture that the asymptotic value of k(n) is bl, the lower bound of Guibas and Odlyzko.

5   Computation of the Populations of Autocorrelations

In this section, the size n of the autocorrelations vectors is fixed.
Definition 4   The population N of an autocorrelation vector V is defined as
N(V)=Card { w| V is the autocorrelation vector of w }.
We define a partial order on the autocorrelation vectors by V=v0 v1 vn-1 V'=v'0 v'1 v'n-1 if for all i in [ 0, n-1 ], v'i=1 whenever vi=1. We also define the total order by V V' if the word v0 v1 vn-1 precedes lexicographically the word v'0 v'1 v'n-1. Then V V' implies V V'. Autocorrelation vectors of size n are sorted along the total order and numbered along this order from 1 to k(n). The notation Vk refers to the vector at rank k in this order.
Definition 5   The number rk of free characters of the autocorrelation Vk is the number of characters that we can choose freely to build a word with the correlation Vk. The other characters are determined by the periods of the autocorrelation.
With an alphabet of size s, for k from k (=k(n)) to 1, we get
N(Vk)=s
rk
 
-
 
k<j<k and Vj > Vk
N(Vj).
The implementation is quadratic in k(n).

References

[1]
Burkhardt (S.), Crauser (A.), Ferragina (P.), Lenhof (H.-P.), Rivals (E.), and Vingron (M.). -- q-gram based database searching using a suffix array (QUASAR). In Third International Conference on Computational Biology. pp. 77--83. -- ACM-Press, 1999. S. Istrail, P. Pevzner and M. Waterman, editors.

[2]
Fine (N. J.) and Wilf (H. S.). -- Uniqueness theorems for periodic functions. Proceedings of the AMS, vol. 16, 1965, pp. 109--114.

[3]
Guibas (Leo J.) and Odlyzko (Andrew M.). -- Periods in strings. Journal of Combinatorial Theory. Series A, vol. 30, n°1, 1981, pp. 19--42.

This document was translated from LATEX by HEVEA.