Mireille Regnier, Alexandre Lifanov, and Vsevolod Makeev

Three variations on word counting

GCB'00

Bioinformatics /submitted/

This page contains examples of different approximations to the variance used in the Z-score calculations that search for frequent motifs in nucleotide sequences. We studied eukaryotic non-coding nucleotide sequences of different nature. We looked for frequent words, using both single-stranded and double-stranded counting. The parameters of the Bernoulli model were estimated from the letter frequencies in the sequence. For each 6-letter word found in the sequence, we counted all occurrences of that word in the same sequence, allowing for one letter substitution. Then we calculated the expectation and the variance for the motif built from the initial word and all words derived from it by one substitution. We compared the exact formula for the variance with several approximations, listed under Basic results below.
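The counting step described above can be sketched as follows. This is a minimal Python illustration, not the authors' code: all function names are hypothetical, the expectation is assumed to be the number of window positions times the summed Bernoulli probabilities of the motif words, and double-stranded counting is assumed to mean that the reverse complements of the motif words are added to the motif.

```python
from collections import Counter

ALPHABET = "ACGT"
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def neighbourhood(word):
    """Return the word plus every word that differs from it by one substitution."""
    words = {word}
    for i, original in enumerate(word):
        for letter in ALPHABET:
            if letter != original:
                words.add(word[:i] + letter + word[i + 1:])
    return words


def reverse_complement(word):
    """Return the reverse complement of an upper-case DNA word."""
    return word.translate(COMPLEMENT)[::-1]


def letter_frequencies(seq):
    """Estimate Bernoulli model parameters from the letter counts of the sequence."""
    counts = Counter(seq)
    return {a: counts[a] / len(seq) for a in ALPHABET}


def word_probability(word, freqs):
    """Probability of a word under the Bernoulli (independent letters) model."""
    p = 1.0
    for letter in word:
        p *= freqs[letter]
    return p


def count_and_expect(seq, word, double_stranded=False):
    """Return (observed, expected) counts for the one-substitution motif of `word`.

    Assumption: double-stranded counting is modelled by adding the reverse
    complements of all motif words to the motif and scanning one strand.
    """
    motif = neighbourhood(word)
    if double_stranded:
        motif |= {reverse_complement(w) for w in motif}
    m = len(word)
    positions = len(seq) - m + 1
    observed = sum(seq[i:i + m] in motif for i in range(positions))
    freqs = letter_frequencies(seq)
    expected = positions * sum(word_probability(w, freqs) for w in motif)
    return observed, expected
```

For a 6-letter word the motif built this way contains 19 words (the word itself plus 6 x 3 single substitutions). A call such as `count_and_expect(sequence, "TGCGCA", double_stranded=True)` returns the observed and expected counts that enter the Z-score.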


Basic results

In all the tables below the data are arranged in columns containing Z-scores, each calculated with a different formula for the variance. The columns are:

(1) Var=Expct; Poissonian approximation

(2) Var=(n-m)*(PH+P2H*(1-2*m)); Poissonian with the correction term

(3) VarH=(n-m)*(PH+P2H*(1-2*m)+2*A); Correction and the linear overlapping term

(4) VarH=(n-m)*(PH+P2H*(1-2*m)+2*A)+m*(m-1)*P2H-2*Ad; The precise formula
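The four formulas above can be transcribed directly into code to compare the resulting Z-scores. The sketch below is hedged: the page does not define the symbols, so `n`, `m`, `PH`, `P2H`, `A` and `Ad` are simply passed in as numbers with whatever meaning the formulas give them, the function names are hypothetical, and the Z-score is assumed to follow the usual definition (observed count minus expectation, divided by the square root of the variance).

```python
import math


def variances(n, m, expct, PH, P2H, A, Ad):
    """Return the four variance estimates, in the order of columns (1)-(4).

    The arguments mirror the symbols used in the formulas; their precise
    definitions are not given on the page and are not assumed here.
    """
    poissonian = expct                                            # (1)
    corrected = (n - m) * (PH + P2H * (1 - 2 * m))                # (2)
    linear_overlap = (n - m) * (PH + P2H * (1 - 2 * m) + 2 * A)   # (3)
    precise = linear_overlap + m * (m - 1) * P2H - 2 * Ad         # (4)
    return [poissonian, corrected, linear_overlap, precise]


def z_scores(observed, expct, variance_list):
    """Z-score for each variance estimate (assumed form: (N - E) / sqrt(Var)).

    A non-positive variance yields NaN instead of raising an error.
    """
    return [
        (observed - expct) / math.sqrt(v) if v > 0 else float("nan")
        for v in variance_list
    ]
```

Calling `z_scores(observed, expct, variances(n, m, expct, PH, P2H, A, Ad))` gives one Z-score per column, which makes it easy to see how much each approximation departs from the precise formula for a given motif.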

Examples:

Drosophila melanogaster ninaE gene promoter region (AC Y00601); this promoter is involved in the development of a specific expression pattern of the complex insect eye. This rather short sequence contains the repeated palindromic word TGCGCA, for which a 30% difference in Z-scores is observed in the case of double-stranded counting.

The upstream region of Rat cellular retinol binding protein II gene (UNDER CONSTRUCTION)

N. crassa gla-1 gene

Mus musculus SDS-stable vimentin-bound DNA fragment (UNDER CONSTRUCTION)

Human pericentromeric DNA region

L1 insertion element