Genome Analysis and Sequences with Random Letter Distribution

Michel Termier

Institut de Génétique et Microbiologie, Université de Paris XI (France)

Algorithms Seminar

April 2, 2001

[summary by Mathias Vandenbogaert]

The information content of the genomes of different organisms reflects their mode of physical organisation. For the last decades, the wet lab biologist's research interest has been to decipher this information content, with the purpose of extracting useful biological features. Reliable information extraction, based mainly on the textual nature of the underlying messages, has been hard to achieve. Therefore, an approach based on the comparison of naturally occurring sequences and randomly generated sequences is used for discerning the artefacts in sequences and for improving the power of our genome models.


The building plan for vegetative life is based on the assembly and catalytic function of proteins and active RNAs. The complete set of instructions needed to generate the building blocks of the reproductive system is called a "genome." Any production of living tissue from these building blocks gives rise to an accumulation of secondary metabolites, which adversely affect the survival of the species. Because of these secondary effects, the genome must be able to respond to the environmental changes that metabolite production induces. To this end, a cell expresses only those genes that are required at a specific moment in its life cycle, and the genome has regulatory systems that control the production of building blocks. These systems can be compared to logic gates located upstream of most of the information to be processed, which allows the use of that information to be modulated. The genomic information is stored linearly, which makes it easier to track.

While the sequencing of the human genome is being completed, several tasks remain to be addressed. Our bioinformatics team is mainly interested in two questions:
  1. How can messages be extracted from genomic sequences in order to perform the function assignment task?
  2. What is the nature of the message contained within any linear macromolecular structure?

1  First Task: Message Extraction and Function Assignment

The approach consists in observing the known words in the vocabulary of the genome. These known words have been indexed through many years of genetic experiments, using molecular biology wet-lab techniques. From this accumulated biological knowledge, the following facts form the basis for the study of genomic sequences:

1.1  Mechanisms for processing signals in messages

There exist mechanisms for processing complex signals, both in eucaryotes and in viral species. The eucaryotic mechanism is described as alternative splicing: a protein-encoding sequence can generate different proteins at the time the mRNA is spliced, according to different translational systems. Sample mechanisms for this group of organisms are read-through (the translation machinery reads through and beyond the STOP codon) and hopping (the translation machinery skips the STOP codon and the codons surrounding it). The retro-viral mechanism is called re-encoding: different proteins can be obtained at the time the mRNA is translated. Sample mechanisms for this group are frameshift (the reading frame for translation changes, which alters the encoded amino acids), read-through, and hopping. Sequences responsible for a frameshift can have several features:
  1. Slipping sequences (structure X XXY YYZ).
  2. A badly positioned classical STOP signal: the ribosome loses its grip on the sequence and gets positioned again in phase -1.
  3. A ribosome-blocking structure.
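
The slipping-sequence pattern X XXY YYZ in the list above can be searched for mechanically. The following is a minimal sketch, assuming that X, Y and Z each stand for any single nucleotide (X and Y may be the same letter) and that no further biological constraint applies. The function name `find_slippery_heptamers` is hypothetical and not taken from the talk.

```python
import re

# Heptamer X XXY YYZ: three identical letters, three identical letters, any letter.
# The lookahead makes overlapping candidates visible to finditer.
SLIPPERY = re.compile(r"(?=(([ACGTU])\2\2([ACGTU])\3\3[ACGTU]))")


def find_slippery_heptamers(seq: str) -> list:
    """Return (position, heptamer) pairs for every X XXY YYZ candidate in seq."""
    return [(m.start(), m.group(1)) for m in SLIPPERY.finditer(seq.upper())]


if __name__ == "__main__":
    print(find_slippery_heptamers("GGUUUUUUAGG"))
```

A match only flags a candidate site; whether a frameshift actually happens there depends on the other features in the list, which this sketch does not model.
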
Regulatory sequences that are responsible for the modulation of DNA transcription in a less error-prone fashion are:
  1. Inhibitor signals. Their role is to bind proteins so that the RNA polymerase can no longer bind to the sequence to initiate transcription.
  2. Activator signals. There exists a multitude of signals per protein-encoding sequence, according to the specific function of the protein to be generated.
Usually, these regulatory sequences are short, and their observed frequency is higher than expected (hence surprising) in comparison to a random word composed of the same letters.
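
The comparison between the observed frequency of a word and the frequency of a random word with the same letters can be sketched as follows. This is an illustration only: it assumes the simplest random model, in which letters are drawn independently with the letter frequencies of the sequence itself, and the function names are hypothetical.

```python
from collections import Counter
from math import prod


def count_overlapping(seq: str, word: str) -> int:
    """Count occurrences of word in seq, allowing overlaps."""
    k = len(word)
    return sum(1 for i in range(len(seq) - k + 1) if seq[i:i + k] == word)


def expected_count(seq: str, word: str) -> float:
    """Expected number of occurrences if letters were independent,
    using the letter frequencies observed in seq (assumed random model)."""
    freq = Counter(seq)
    n = len(seq)
    p_word = prod(freq[c] / n for c in word)
    return (n - len(word) + 1) * p_word


def over_representation(seq: str, word: str) -> float:
    """Observed count divided by expected count; values above 1 mean
    the word is more frequent than the random model predicts."""
    exp = expected_count(seq, word)
    if exp == 0:
        return float("nan")  # the word contains a letter absent from seq
    return count_overlapping(seq, word) / exp
```

A ratio well above 1 marks a word as a candidate signal. Judging whether the excess is statistically significant requires the variance of the count as well, which this sketch leaves out.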

1.2  Modelling a genomic sequence

A Markov model is frequently used for modelling a genomic sequence. The number of sequences that can be generated by this model increases with the order of the Markov model and then reaches a plateau.
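
As an illustration of what such a model involves, here is a minimal sketch of fitting a Markov model of a given order to a sequence and drawing a random sequence from it. The function names, the seeding with an explicit starting context, and the choice to stop when an unseen context is reached are assumptions of this sketch, not details from the talk.

```python
import random
from collections import Counter, defaultdict


def fit_markov(seq: str, order: int) -> dict:
    """For each context of `order` letters, count the letters that follow it."""
    counts = defaultdict(Counter)
    for i in range(len(seq) - order):
        counts[seq[i:i + order]][seq[i + order]] += 1
    return counts


def generate(counts: dict, order: int, length: int,
             seed_context: str, rng: random.Random) -> str:
    """Draw a sequence of at most `length` letters, starting from seed_context."""
    if len(seed_context) != order:
        raise ValueError("seed_context must contain exactly `order` letters")
    out = list(seed_context)
    while len(out) < length:
        context = "".join(out[len(out) - order:])
        followers = counts.get(context)
        if not followers:
            break  # context never seen in the training sequence
        letters, weights = zip(*followers.items())
        out.append(rng.choices(letters, weights=weights)[0])
    return "".join(out)
```

With order 0 the context is empty and the model reduces to independent letters. Higher orders reproduce longer words of the training sequence more faithfully.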

For a Bernoulli-type distribution of the nucleotides, the actual sequence follows a Gaussian distribution. Additionally, when the [A+T] content increases, the number of START and STOP signals increases. This implies that the certainty of finding a gene increases.
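
The effect of [A+T] content on START and STOP signals can be checked on a toy Bernoulli model. The sketch below assumes that the START signal is the triplet ATG, that the STOP signals are TAA, TAG and TGA, and that P(A) = P(T) and P(C) = P(G); these simplifications and the function name are assumptions made for illustration.

```python
def signal_probabilities(at_content: float) -> dict:
    """Probability that a random triplet is a START (ATG) or a STOP
    (TAA, TAG, TGA) when letters are independent, with P(A) = P(T) and
    P(C) = P(G) fixed by the A+T content (assumed toy model)."""
    if not 0.0 <= at_content <= 1.0:
        raise ValueError("at_content must lie between 0 and 1")
    p = {
        "A": at_content / 2,
        "T": at_content / 2,
        "C": (1 - at_content) / 2,
        "G": (1 - at_content) / 2,
    }

    def word_probability(word: str) -> float:
        result = 1.0
        for letter in word:
            result *= p[letter]
        return result

    start = word_probability("ATG")
    stop = sum(word_probability(w) for w in ("TAA", "TAG", "TGA"))
    return {"start": start, "stop": stop, "total": start + stop}
```

In this toy model the STOP probability and the combined probability both grow with the A+T content, while the ATG term alone peaks when the A+T content is about two thirds.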

Regulatory signals are words with biased composition, with respect to the global word distribution of the sequence. These signals have been selected for their properties in the course of evolution. They have been generated according to mechanisms which include random events [2, 3].

1.3  The importance of codon usage biases

In the context of genetic expression, the codon usage bias is correlated with the level of tRNAs available and with the abundance of the protein generated. About 20% of protein-encoding sequences are significantly biased. In this respect, several observations have been made. The codon usage bias, which determines the level of codons corresponding to the amino acids of proteins, has a direct effect on the genomic sequence composition of the organism. This bias, which results from an interaction of horizontal transfer and metabolic constraints, is at the basis of the selection of efficient proteins. The codon usage bias reveals information about the nucleotide triplet usage of the encoded protein and about a possible external origin of the sequence in the organism. The significance of the codon usage bias can be evaluated with weighted linguistic approaches, which heuristically weight the codons used to encode the amino acids instead of using an average weight for every amino acid that is encoded by several triplets. This keeps the resulting frequencies from diverging from the observed values.
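
One simple way to look at codon usage per amino acid, rather than averaging over synonymous triplets, is to compute the share of each codon among the codons that encode the same amino acid. The sketch below assumes the standard genetic code and an in-frame coding sequence. It is not the weighting scheme used by the author, which the summary does not specify, and the function name is hypothetical.

```python
from collections import Counter, defaultdict
from itertools import product

# Standard genetic code, with codons enumerated in TCAG order.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def relative_codon_usage(cds: str) -> dict:
    """Share of each observed codon among the synonymous codons of its
    amino acid; the shares for one amino acid sum to 1."""
    usable = len(cds) - len(cds) % 3  # ignore a trailing incomplete codon
    codons = [cds[i:i + 3] for i in range(0, usable, 3)]
    counts = Counter(c for c in codons if c in CODON_TABLE)
    per_amino_acid = defaultdict(int)
    for codon, n in counts.items():
        per_amino_acid[CODON_TABLE[codon]] += n
    return {
        codon: n / per_amino_acid[CODON_TABLE[codon]]
        for codon, n in counts.items()
    }
```

A strongly biased gene shows shares close to 1 for a few preferred codons, whereas an unbiased gene spreads its usage evenly over the synonymous codons.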

Nevertheless, the probability of finding reasonable codon compositions through linguistic methods is fairly low, because:

2  Second Task: Determining the Nature of the Message

Life on any planet other than Earth is only detectable to us if it is based on our carbon chemistry. Any sequential organic macromolecule contains constitutional information if textual organization can be detected within it.

Different approaches exist for the detection of organized information:
  1. Complexity analysis of sequences. The complexity of sequences is difficult to compute. In 1990, Ed Trifonov introduced the notion of linguistic complexity [7], which reflects the linguistic wealth of a sequence. This complexity is easily computable as $C = \prod_{i=1}^{n-1} u_i$, with $u_i$ the ratio of the number of words found in a sliding window at position $i$ in a sequence to the total number of different words that could possibly be found. Computations are made along windows, by multiplying the $u$ ratios of words of all possible lengths in the window. This implies that all redundancies are eliminated.
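
The linguistic complexity described above can be sketched for a single window as follows. This sketch assumes one common reading of the measure: for each word length k, the ratio is the number of distinct k-letter words seen in the window divided by the largest number of distinct k-letter words the window could hold, that is, the smaller of 4^k and the number of positions. The function names and this choice of denominator are assumptions, not details confirmed by the summary.

```python
def vocabulary_usage(window: str, k: int, alphabet_size: int = 4) -> float:
    """Distinct k-letter words seen in the window, divided by the largest
    number of distinct k-letter words the window could contain."""
    positions = len(window) - k + 1
    observed = len({window[i:i + k] for i in range(positions)})
    possible = min(alphabet_size ** k, positions)
    return observed / possible


def linguistic_complexity(window: str, alphabet_size: int = 4) -> float:
    """Product of the vocabulary usage ratios for word lengths 1 to n-1,
    where n is the window length."""
    complexity = 1.0
    for k in range(1, len(window)):
        complexity *= vocabulary_usage(window, k, alphabet_size)
    return complexity
```

For example, the window ACGT uses every possible word at each length and gets complexity 1, while the fully repetitive window AAAA gets 1/4 × 1/3 × 1/2 = 1/24.
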

References

[1] Chiapello (H.), Ollivier (E.), Landès-Devauchelle (C.), Nitschké (P.), and Risler (J.-L.). Codon usage as a tool to predict the cellular location of eukaryotic ribosomal proteins and aminoacyl-tRNA synthetases. Nucleic Acids Research, vol. 27, n°14, 1999, pp. 2848-2851.
[2] Grantham (R.). Workings of the genetic code. Trends in Biochemical Sciences, vol. 5, 1980, pp. 327-331.
[3] Grantham (R.), Gautier (C.), Gouy (M.), Jacobzone (M.), and Mercier (R.). Codon catalog usage is a genome strategy modulated for gene expressivity. Nucleic Acids Research, vol. 9, 1981, pp. r43-r74.
[4] Lafay (B.), Lloyd (A. T.), McLean (M. J.), Devine (K. M.), Sharp (P. M.), and Wolfe (K. H.). Proteome composition and codon usage in spirochaetes: species-specific and DNA strand-specific mutational biases. Nucleic Acids Research, vol. 27, 1999, pp. 1642-1649.
[5] Oliver (S. G.), van der Aart (Q. J.), Agostoni-Carbone (M. L.), Aigle (M.), Alberghina (L.), Alexandraki (D.), Antoine (G.), Anwar (R.), et al. The complete DNA sequence of yeast chromosome III. Nature, vol. 357, 1992, pp. 38-46.
[6] Ollivier (E.), Delorme (M. O.), and Hénaut (A.). DosDNA occurs along yeast chromosomes, regardless of functional significance of the sequence. Comptes rendus de l'Académie des sciences, vol. 318, 1995, pp. 599-608.
[7] Popov (O.), Segal (D. M.), and Trifonov (E. N.). Linguistic complexity of protein sequences as compared to texts of human languages. Biosystems, vol. 38, 1996, pp. 65-74.
[8] Rocha (E. P. C.), Viari (A.), and Danchin (A.). Oligonucleotide bias in Bacillus subtilis: general trends and taxonomic comparisons. [...], 1998.
[9] Seffens (W.) and Digby (D.). mRNAs have greater negative folding free energies than shuffled or codon choice randomized sequences. Nucleic Acids Research, vol. 27, 1999, pp. 1578-1584.