Genome Analysis and Sequences with Random Letter Distribution
Institut de Génétique et Microbiologie, Université de
Paris XI (France)
April 2, 2001
[summary by Mathias Vandenbogaert]
The information content of the genomes of different organisms reflects
their mode of physical organisation. For the last decades, the wet-lab
biologist's research interest has been to decipher this information
content, with the purpose of extracting useful biological
features. Reliable information extraction, which relies mainly on the
textual nature of the underlying messages, has been
hard to achieve. Therefore, an approach based on the comparison of
naturally occurring sequences with randomly generated sequences is used
to discern the artefacts in sequences and to improve the power
of our genome models.
The building plan for vegetative life is based on the assembly and
catalytic function of proteins and active RNAs. The complete set of
instructions needed to generate the building blocks of the
reproductive system is called a ``genome.'' Any production of living
tissue from these building blocks gives rise to an accumulation of
secondary metabolites, which adversely affect the survival
of the species. These secondary effects of metabolite production are
the reason why the genome must be able to respond to
the induced environmental changes. To counter this problem, a cell
only expresses those genes that are
required at a specific moment in its life cycle. For this
purpose, a genome has regulatory systems at its disposal that act on
the generation of building blocks. These systems can be compared to logical
gates situated in the upstream sequences of most of the information that
needs to be processed. This permits a modulation in the usage of
information. The genomic information is stored in a linear fashion,
which facilitates the tracking of information. Now that the
sequencing of the human genome is being completed,
several tasks remain to be addressed:
- the decomposition of the genomic sequence into streams of messages;
- the distinction of these ``messages'' in contrast to the ``non-coding bulk information'';
- assignment of biologically significant functions to the messages.
Our bioinformatics team is mainly interested in answering two questions:
- How can messages be extracted from genomic sequences in order to perform
the function assignment task?
- What is the nature of the message contained within any linear macromolecular structure?
1 First Task: Message Extraction and Function Assignment
The approach consists in observing the known words in the vocabulary
of the genome. These known words have been indexed through many years
of genetic experiments, with the use of techniques handled in
molecular biology wet labs. Through this biology-related knowledge
accumulation, the following facts are at the basis for the study of
1.1 Mechanisms for processing signals in messages
There exist mechanisms for processing complex signals, both in
eucaryotes and in viral species. The eucaryotic
mechanism is described as alternative splicing: a
protein-encoding sequence can generate different proteins
at the time the mRNA is being spliced, according to different
translational systems. Sample mechanisms for this group of
organisms are read-through (the translation machinery
reads through and beyond the STOP codon) and hopping
(the translation machinery skips the STOP codon and
the codons surrounding it).
The retro-viral mechanism is called re-encoding,
which implies that different proteins can be obtained at the time
the mRNA is being translated.
Sample mechanisms for this group are frameshift (the
reading frame for translation is changed, which
alters the encoded amino acids), read-through and hopping.
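The effect of a frameshift on the reading of a message can be sketched in a few lines of Python. This is only an illustration: the function names, the toy RNA string and the slip position are assumptions made for the example, not material from the talk.

```python
def read_codons(rna, start=0):
    """Split rna into consecutive codons, starting at index `start`."""
    return [rna[i:i + 3] for i in range(start, len(rna) - 2, 3)]


def read_with_minus_one_shift(rna, slip_index):
    """Read codons in phase 0 up to slip_index, then continue in phase -1.

    slip_index is assumed to be a positive multiple of 3, i.e. the slip
    happens on a codon boundary of the original reading frame.
    """
    if slip_index < 3 or slip_index % 3 != 0:
        raise ValueError("slip_index must be a positive multiple of 3")
    before = read_codons(rna[:slip_index])
    # The ribosome steps back by one nucleotide and resumes reading.
    after = read_codons(rna, slip_index - 1)
    return before + after


toy_rna = "AUGCCCAAAGGG"  # hypothetical message
in_frame = read_codons(toy_rna)
shifted = read_with_minus_one_shift(toy_rna, 6)
```

After the slip, every downstream codon differs from the in-frame reading, which is why the encoded amino acids change.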
Several features can be attributed to the sequences that are
responsible for a frameshift:
- Slipping sequences (structure X XXY YYZ).
- A badly positioned classical STOP signal: the ribosome loses
its grip on the sequence and gets positioned again in phase -1.
- A ribosome-blocking structure.
Regulatory sequences that are responsible
for the modulation of DNA transcription in a less error-prone fashion are:
- Inhibitor signals. Their role is to bind proteins
so that the RNA polymerase can no longer bind to the sequence
to initiate transcription.
- Activator signals. There exists a multitude of signals
per protein-encoding sequence, according to the specific function
of the protein to be generated.
Usually, these regulatory sequences are short sequences, whose
observed frequency is higher (hence unexpected) in comparison to a
random word composed of the same letters.
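The idea that a regulatory word is observed more often than a random word composed of the same letters can be illustrated with a minimal sketch. It assumes a Bernoulli model in which letters are drawn independently with the frequencies of the sequence itself, so that all words with the same letter composition share the same expected count. The function names and the toy sequence are hypothetical.

```python
from collections import Counter


def count_overlapping(seq, word):
    """Number of (possibly overlapping) occurrences of word in seq."""
    k = len(word)
    return sum(1 for i in range(len(seq) - k + 1) if seq[i:i + k] == word)


def expected_count_bernoulli(seq, word):
    """Expected count of word if letters were drawn independently
    with the letter frequencies observed in seq (assumed model)."""
    n = len(seq)
    freqs = Counter(seq)
    p = 1.0
    for letter in word:
        p *= freqs[letter] / n
    return (n - len(word) + 1) * p


def over_representation(seq, word):
    """Ratio observed / expected; values well above 1 flag a candidate signal."""
    expected = expected_count_bernoulli(seq, word)
    if expected == 0:
        return float("inf")
    return count_overlapping(seq, word) / expected


toy = "ACGTACGTACGT"  # hypothetical sequence
ratio = over_representation(toy, "ACG")
```

A ratio alone says nothing about statistical significance; a real analysis would also need the variance of the count under the chosen model.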
1.2 Modelling a genomic sequence
A Markov model is frequently used for modelling a genomic sequence. The number of
sequences that can be generated by this model increases with the order of the Markov model and reaches a plateau.
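As a rough sketch of what such a model looks like in practice, the Python code below estimates an order-k Markov chain from a sequence and samples a random sequence from it. The function names, the training string and the choice of Python's `random.Random` generator are assumptions for illustration; the talk does not specify an implementation.

```python
import random
from collections import Counter, defaultdict


def fit_markov(seq, order):
    """Count, for every context of `order` letters, which letter follows it."""
    counts = defaultdict(Counter)
    for i in range(len(seq) - order):
        counts[seq[i:i + order]][seq[i + order]] += 1
    return counts


def generate(counts, order, length, seed_context, rng):
    """Sample a sequence of up to `length` letters from the fitted chain."""
    out = list(seed_context)
    while len(out) < length:
        context = "".join(out[-order:]) if order > 0 else ""
        followers = counts.get(context)
        if not followers:
            break  # this context was never followed by a letter in training
        letters = list(followers)
        weights = [followers[letter] for letter in letters]
        out.append(rng.choices(letters, weights=weights)[0])
    return "".join(out)


model = fit_markov("ACGTACGTACGT", 2)  # hypothetical training sequence
sample = generate(model, 2, 20, "AC", random.Random(1))
```

With order 0 the model reduces to the Bernoulli case, where only the letter frequencies are kept.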
For a Bernoulli-type distribution of the nucleotides, the actual
sequence follows a Gaussian distribution. Additionally, when the [A+T]
content increases, the number of START and STOP signals increases. This
implies that the certainty of finding a gene increases, since long
stretches free of STOP signals then become less likely to occur by chance.
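The link between [A+T] content and STOP signals can be made concrete under a simple Bernoulli model. The sketch below assumes that A and T are equally frequent, that G and C are equally frequent, and that the STOP codons are TAA, TAG and TGA (the standard genetic code). It covers STOP signals only, and the function names are hypothetical.

```python
def stop_probability(at_content):
    """Probability that a random triplet is TAA, TAG or TGA when
    P(A) = P(T) = at_content / 2 and P(G) = P(C) = (1 - at_content) / 2."""
    p_a = p_t = at_content / 2.0
    p_g = (1.0 - at_content) / 2.0
    # P(TAA) + P(TAG) + P(TGA)
    return p_t * p_a * p_a + p_t * p_a * p_g + p_t * p_g * p_a


def expected_open_frame_length(at_content):
    """Mean number of codons read up to and including the first STOP
    (mean of a geometric distribution); undefined when at_content is 0."""
    return 1.0 / stop_probability(at_content)


balanced = stop_probability(0.5)  # 3 STOP codons out of 64 triplets
at_rich = stop_probability(0.7)
```

Under these assumptions an AT-rich random sequence is interrupted by STOP codons more often, so a long open reading frame observed in it is less likely to be a chance event.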
Regulatory signals are words with a biased composition with respect to
the global word distribution of the sequence. These signals have been
selected for their properties in the course of evolution. They have
been generated according to mechanisms which include random events.
1.3 The importance of codon usage biases
In the context of genetic expression, the codon usage bias is
correlated with the level of tRNAs available, and with the abundance
of protein generated. About 20% of all protein-encoding sequences are
significantly biased. In this respect, several observations have been made:
- the biased structure helps in regulating the transcription;
- there is a positional codon bias according to the strand on
which the gene is situated;
- there is a codon usage bias according to the life cycle of the
organism and the cellular location of the metabolic
- there is a bias in relation with mRNA stability;
- some horizontal transfers can have effects on the codon
The codon usage bias, which determines the level of the codons corresponding to the amino acids of proteins,
has a direct effect on the genomic sequence composition of the organism.
This bias, which results from an interaction of horizontal transfer and
metabolic constraints, is at the basis of the selection of efficient proteins.
The codon usage bias reveals information about the nucleotide triplet usage of the encoded protein
and about the possible external origin of the sequence in the organism.
The significance of the codon usage bias can be evaluated by using weighted linguistics approaches.
This consists in heuristically weighting the codons used to encode the amino acids, instead of using
an average weight for every amino acid that is encoded by several triplets.
This prevents the resulting frequencies from diverging from the observed values.
Nevertheless, the probability of finding reasonable codon compositions through linguistic methods is
fairly low, because:
- global linguistics are calculated on a larger set of oligonucleotides than the number of oligos that determine the proteins;
- the number of codons in a gene equals one third of the number of possible triplets;
- the different genes are built up from codons of different composition, which increases
the background noise accordingly.
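The difference between counting codons and counting all triplets can be sketched as follows. The code assumes that the gene is given in its coding frame, starting at the first nucleotide of its first codon; the function names and the toy gene are hypothetical.

```python
from collections import Counter


def codon_counts(gene):
    """Count the in-frame codons only (phase 0, non-overlapping)."""
    return Counter(gene[i:i + 3] for i in range(0, len(gene) - 2, 3))


def triplet_counts(gene):
    """Count every overlapping triplet, as a global linguistic analysis would."""
    return Counter(gene[i:i + 3] for i in range(len(gene) - 2))


def codon_frequencies(gene):
    """Relative frequency of each in-frame codon."""
    counts = codon_counts(gene)
    total = sum(counts.values())
    return {codon: k / total for codon, k in counts.items()}


toy_gene = "ATGAAAGCTAAATAA"  # hypothetical 5-codon gene
n_codons = sum(codon_counts(toy_gene).values())
n_triplets = sum(triplet_counts(toy_gene).values())
```

For a gene of n nucleotides there are n/3 codons but n-2 overlapping triplets, so roughly two thirds of the triplets seen by a global count do not encode amino acids in the gene's frame.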
2 Second Task: Determining the Nature of the Message
Life on any planet other than Earth is detectable by us only
if it is based on our carbon chemistry. Any sequential organic
macromolecule contains constitutional information if textual
organization can be detected within it.
Different approaches exist for the detection of organized information:
- Complexity analysis of sequences. The complexity of
sequences is difficult to compute. Ed Trifonov introduced in 1990 the
notion of linguistic complexity, which reflects the
linguistic wealth of a sequence. This complexity is easily computable
as $C = \prod_{i=1}^{n-1} u_i$, with $u_i$ the ratio of the number of different words of length $i$
found in a sliding window along the sequence to the
total number of different words of that length that could possibly be found.
Computations are made along windows, by multiplying the $u$ ratios of
words of all possible lengths in the window. This implies that all
redundancies are eliminated.
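The product of ratios described above can be sketched in Python as follows. One point is an assumption of this sketch rather than a statement from the talk: the number of different words of length i that "could possibly be found" in a window of n letters is taken as the smaller of 4^i and n - i + 1. The function names are hypothetical.

```python
def linguistic_complexity(window, alphabet_size=4):
    """Product over word lengths i = 1 .. n-1 of
    (distinct words of length i seen) / (distinct words of length i possible)."""
    n = len(window)
    complexity = 1.0
    for i in range(1, n):
        seen = len({window[j:j + i] for j in range(n - i + 1)})
        # Assumed normalisation: limited by the alphabet and by the window size.
        possible = min(alphabet_size ** i, n - i + 1)
        complexity *= seen / possible
    return complexity


def complexity_profile(seq, window_size):
    """Linguistic complexity of every window of window_size letters along seq."""
    return [linguistic_complexity(seq[k:k + window_size])
            for k in range(len(seq) - window_size + 1)]


repetitive = linguistic_complexity("AAAA")
diverse = linguistic_complexity("ACGT")
```

A window made of a single repeated letter scores close to 0, while a window in which every word is new scores 1.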