Mireille Regnier, Alexandre Lifanov, and Vsevolod Makeev

Three variations on word counting

GCB'00

Bioinformatics /submitted/

This page contains examples of different approximations to the variance used in the Z-score calculations that search for frequent motifs in nucleotide sequences. We studied eukaryotic non-coding nucleotide sequences of different nature. We looked for frequent words, using both single-stranded and double-stranded counting. The parameters of the Bernoulli model are estimated from the letter frequencies in the sequence. For each 6-letter word found in the sequence, we counted all occurrences of that word in the same sequence, allowing for one letter substitution. Then we calculated the expectation and the variance of the count of the motif built from the initial word and all words derived from it by one substitution. We compared the exact formula for the variance with several approximations:

• (i) the Poisson approximation V ~ E;
• (ia) in the case of the double-stranded sequence, the first term in the weighted formula;
• (ii) the approximation with the first correcting term V ~ E + (1-2m)P(H)^2;
• (iii) the approximation that takes into consideration overlaps but keeps only the linear term and drops the constant term;
• (iv) the precise formula for the variance with both the linear and the constant terms.
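
The counting step described before this list can be sketched in Python as follows. This is only an illustration under stated assumptions, not the code used for the study: the function names are hypothetical, the alphabet is taken to be A, C, G, T, and the double-stranded convention shown here (a window is counted once if it matches the motif or the reverse complement of one of its words) is one possible choice that may differ from the weighted formula mentioned in (ia).

```python
from collections import Counter

ALPHABET = "ACGT"  # assumption: plain nucleotide alphabet, no ambiguity codes
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(word):
    """Return the reverse complement of a nucleotide word."""
    return word.translate(COMPLEMENT)[::-1]


def one_substitution_motif(word):
    """Return the word together with all words that differ from it in one letter."""
    motif = {word}
    for i, original in enumerate(word):
        for letter in ALPHABET:
            if letter != original:
                motif.add(word[:i] + letter + word[i + 1:])
    return motif


def count_motif(sequence, motif, double_stranded=False):
    """Count the windows of the sequence that belong to the motif.

    Assumption: in double-stranded mode a window is counted once if it
    matches a motif word or the reverse complement of a motif word.
    """
    m = len(next(iter(motif)))
    words = set(motif)
    if double_stranded:
        words |= {reverse_complement(w) for w in motif}
    return sum(1 for i in range(len(sequence) - m + 1) if sequence[i:i + m] in words)


def bernoulli_probability(sequence, motif):
    """P(H) under a Bernoulli model estimated from the letter frequencies."""
    counts = Counter(sequence)
    total = sum(counts[c] for c in ALPHABET)
    freq = {c: counts[c] / total for c in ALPHABET}
    p_h = 0.0
    for word in motif:
        p_word = 1.0
        for letter in word:
            p_word *= freq[letter]
        p_h += p_word
    return p_h
```

For a 6-letter word the motif built this way contains 1 + 6*3 = 19 words, and P(H) is the sum of their Bernoulli probabilities.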

Basic results

• The contribution of the constant term to the variance is negligible in all cases (no difference between cases iii and iv).
• The difference between (ii) and (iii) is noticeable only for highly repetitive patterns. Notably, in the case of highly overlapping poly-Ns, the contribution of the overlapping term can amount to two fifths of the overall variance.
• Formula (iii) then makes it possible to remove these words from (say) the ten best scores.
In all the tables below, the data are collected in columns containing Z-scores calculated with the formula
Z=(Observ-Expct)/sqrt(Var)
The columns are

(1) Var=Expct; Poissonian

(2) Var=(n-m)*(PH+P2H*(1-2*m)); Poissonian with the correction term

(3) VarH=(n-m)*(PH+P2H*(1-2*m)+2*A); Correction and the linear overlapping term
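
As an illustration, the three columns above could be computed along the following lines. This is a minimal sketch rather than the original program: the function name and argument names are hypothetical, n and m are assumed to be the sequence length and the word length, P2H is assumed to be P(H)^2 as in approximation (ii), and the overlap term A is not defined on this page, so it is taken as a precomputed input.

```python
import math


def z_score_columns(observed, expected, n, m, p_h, overlap_a):
    """Return the Z-scores for columns (1), (2) and (3).

    Assumptions: n is the sequence length, m the word length, p_h the
    motif probability P(H), and overlap_a the overlap term A, supplied
    by the caller because its definition is not given here.
    """
    p2_h = p_h ** 2  # assumption: P2H stands for P(H)^2

    var_poisson = expected                                              # column (1)
    var_corrected = (n - m) * (p_h + p2_h * (1 - 2 * m))                # column (2)
    var_overlap = (n - m) * (p_h + p2_h * (1 - 2 * m) + 2 * overlap_a)  # column (3)

    def z(variance):
        # Z=(Observ-Expct)/sqrt(Var)
        return (observed - expected) / math.sqrt(variance)

    return z(var_poisson), z(var_corrected), z(var_overlap)


# Usage with made-up numbers, only to show the call
z1, z2, z3 = z_score_columns(observed=20, expected=10.0, n=1006, m=6,
                             p_h=0.01, overlap_a=0.001)
```

A positive overlap term increases the variance in column (3) and therefore lowers the Z-score relative to column (2), which is how highly overlapping words drop in the ranking.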