Genome Analysis and Sequences with Random Letter Distribution
Institut de Génétique et Microbiologie, Université de
Paris XI (France)
April 2, 2001
[summary by Mathias Vandenbogaert]
The information content of the genomes of different organisms reflects
their mode of physical organisation. For the last decades, the research
interest of wet-lab biologists has been to decipher this information
content, with the purpose of extracting useful biological
features. Reliable information extraction, which rests mainly on the
textual nature of the underlying messages, has been hard to achieve.
Therefore, an approach based on the comparison of naturally occurring
sequences with randomly generated sequences is used to discern the
artefacts in sequences and to improve the power of our genome models.
The building plan for vegetative life is based on the assembly and
catalytic function of proteins and active RNAs. The complete set of
instructions that is needed to generate the building blocks of the
reproductive system is called a ``genome.'' Any production of living
tissue from these building blocks gives rise to an accumulation of
secondary metabolites, which adversely affect the survival
of the species. These secondary effects of metabolite production are
the reason why the genome must be able to respond to
the induced environmental changes. To counter this problem, a cell of
an organism only expresses those genes that are
required at some specific moment in the cell's life cycle. For this
purpose, a genome has regulatory systems that act on the generation
of building blocks. These systems can be compared to logical
gates that are situated in sequences upstream of most of the information that
needs to be processed. This permits a modulation in the usage of
information. The genomic information is stored in a linear fashion,
which facilitates the tracking of information. Now that the
sequencing of the human genome is being completed,
several tasks remain to be addressed:
- the decomposition of the genomic sequence into streams of messages;
- the distinction of these ``messages'' from the ``non-coding bulk information'';
- the assignment of biologically significant functions to the messages.
Our bioinformatics team is mainly interested in answering two questions:
- How can messages be extracted from genomic sequences in order to perform
the function assignment task?
- What is the nature of the message contained within any linear macromolecular structure?
1 First Task: Message Extraction and Function Assignment
The approach consists in observing the known words in the vocabulary
of the genome. These known words have been indexed through many years
of genetic experiments, with the use of techniques handled in
molecular biology wet labs. Through this biology-related knowledge
accumulation, the following facts are at the basis for the study of
1.1 Mechanisms for processing signals in messages
There exist mechanisms for processing complex signals, both in
eukaryotes and in viral species. The eukaryotic
mechanism is described as alternative splicing: a
protein-encoding sequence can generate different proteins
at the time the mRNA is being spliced, according to different
translational systems. Sample mechanisms for this group of
organisms are read-through (the translation machinery
reads through and beyond the STOP codon) and hopping
(the translation machinery skips the STOP codon and
the codons surrounding it).
The retroviral mechanism is called re-encoding,
which implies that different proteins can be obtained at the time
the mRNA is being translated.
Sample mechanisms for this group are frameshift (the
reading frame for translation is changed, which
alters the encoded amino acids), read-through and hopping.
Several features can be attributed to the sequences that are
responsible for a frameshift:
- Slipping sequences (structure X XXY YYZ).
- A badly positioned classical STOP signal: the ribosome loses
its grip on the sequence and repositions itself in phase -1.
- A ribosome-blocking structure.
Usually, these regulatory sequences are short sequences whose
observed frequency is higher (hence unexpected) than that of a
random word composed of the same letters.
Regulatory sequences that are responsible
for the modulation of DNA transcription in a less error-prone fashion are:
- Inhibitor signals. Their role is to bind proteins
so that the RNA polymerase can no longer bind to the sequence
to initiate transcription.
- Activator signals. There exists a multitude of signals
per protein-encoding sequence, according to the specific function
of the protein to be generated.
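The slipping-sequence structure X XXY YYZ mentioned in this section can be searched for mechanically. The following is a minimal sketch, not the method used by the speaker: it assumes that X and Y each stand for three identical bases and that Z is any base, and the function names and the toy sequence are hypothetical.

```python
import re

# Heptamer of the form X XXY YYZ: three identical bases, three identical
# bases, then any base. The lookahead lets overlapping sites be reported.
# (Assumption: no further constraint on X, Y or Z is applied.)
SLIPPERY = re.compile(r"(?=(([ACGU])\2\2([ACGU])\3\3[ACGU]))")


def find_slippery_sites(rna):
    """Return (position, heptamer) pairs for every X XXY YYZ motif."""
    rna = rna.upper().replace("T", "U")
    return [(m.start(), m.group(1)) for m in SLIPPERY.finditer(rna)]


def codons(rna, start):
    """Split rna into complete codons, reading from index start."""
    return [rna[i:i + 3] for i in range(start, len(rna) - 2, 3)]


# Toy message: AUG GCU UUA AAC GGC UAA (frame 0 ends on a STOP codon).
seq = "AUGGCUUUAAACGGCUAA"
for site, heptamer in find_slippery_sites(seq):
    # In frame 0 the heptamer is read as X XXY YYZ. After a slip to
    # phase -1, the next codon read starts on the Z base (site + 6).
    print(site, heptamer, codons(seq, 0), codons(seq, site + 6))
```

On this toy sequence the phase -1 reading no longer meets the frame-0 STOP codon, which illustrates how a frameshift alters the encoded amino acids.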
1.2 Modelling a genomic sequence
A Markov model is frequently used for modelling a genomic sequence. The number of
sequences that can be generated by this model increases with the order of the Markov model and reaches a plateau.
For a Bernoulli-type distribution of the nucleotides, the actual
sequence follows a Gaussian distribution. Additionally, when [A+T]
increases, the amount of START and STOP signals increases. This
implies that the certainty of finding a gene increases.
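A short sketch may help make the two models of this paragraph concrete. It fits an order-k Markov model by counting which letter follows each context of length k, samples a random sequence from it, and computes the probability that a random triplet is a STOP codon under a Bernoulli model. The assumptions are mine, not the speaker's: A and T are taken to be equally frequent, as are G and C, and all function names are hypothetical.

```python
import random
from collections import Counter, defaultdict


def fit_markov(seq, order):
    """Count which letter follows each context of length `order`."""
    counts = defaultdict(Counter)
    for i in range(order, len(seq)):
        counts[seq[i - order:i]][seq[i]] += 1
    return counts


def sample_markov(counts, order, length, start, rng):
    """Generate `length` letters; `start` must contain `order` letters."""
    overall = sum(counts.values(), Counter())  # fallback for unseen contexts
    out = list(start)
    while len(out) < length:
        context = "".join(out[len(out) - order:])
        followers = counts.get(context) or overall
        letters = list(followers)
        weights = list(followers.values())
        out.append(rng.choices(letters, weights=weights)[0])
    return "".join(out)


def stop_codon_probability(at):
    """P(random triplet is TAA, TAG or TGA), with A = T = at/2, G = C = (1-at)/2."""
    a = t = at / 2
    g = (1 - at) / 2
    return t * a * a + t * a * g + t * g * a


model = fit_markov("ACGTACGTACGT", order=1)
print(sample_markov(model, 1, 20, "A", random.Random(0)))
for at in (0.3, 0.5, 0.7):
    print(at, stop_codon_probability(at))
```

Under these assumptions the STOP probability grows with [A+T], from 3/64 at 50% to roughly 0.08 at 70%. Order 0 (with an empty `start`) reduces the sampler to the Bernoulli case.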
Regulatory signals are words with biased composition, with respect to
the global word distribution of the sequence. These signals have been
selected for their properties in the course of evolution. They have
been generated according to mechanisms which include random events
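
The idea of a word with biased composition can be illustrated by comparing the observed count of a word with the count expected if letters were drawn independently. This is only a sketch under a Bernoulli (order-0) assumption; the summary does not state which background model was actually used, and the function names are hypothetical.

```python
from collections import Counter


def count_overlapping(seq, word):
    """Number of (possibly overlapping) occurrences of word in seq."""
    last = len(seq) - len(word) + 1
    return sum(1 for i in range(last) if seq.startswith(word, i))


def expected_count_bernoulli(seq, word):
    """Expected count of word if letters were independent draws
    with the letter frequencies observed in seq."""
    freq = Counter(seq)
    n = len(seq)
    p = 1.0
    for letter in word:
        p *= freq[letter] / n
    return p * (n - len(word) + 1)


def over_representation(seq, word):
    """Observed / expected ratio; values well above 1 flag a biased word."""
    expected = expected_count_bernoulli(seq, word)
    if expected == 0:  # the word uses a letter absent from seq
        return float("inf")
    return count_overlapping(seq, word) / expected


print(over_representation("ACGT" * 4, "ACGT"))
```

A ratio far from 1 only suggests a candidate signal. Judging its significance would need a variance estimate, which this sketch leaves out.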
1.3 The importance of codon usage biases
In the context of genetic expression, the codon usage bias is
correlated with the level of tRNAs available, and with the abundance
of protein generated. The level of protein-encoding sequences that are
significantly biased is of the order of 20% of the total amount of
sequences. In this respect, several observations have been made:
- the biased structure helps in regulating the transcription;
- there is a positional codon bias according to the strand on
which the gene is situated;
- there is a codon usage bias according to the life cycle of the
organism and the cellular location of the metabolic
- there is a bias in relation with mRNA stability;
- some horizontal transfers can have effects on the codon
The codon usage bias determining the level of codons corresponding to the amino acids of proteins
has a direct effect on the genomic sequence composition of the organism.
This bias, which is the result of an interaction of horizontal transfer and
metabolic constraints, is at the basis of the selection of efficient proteins.
The codon usage bias reveals information about the nucleotide triplet usage of the encoded protein
and about the possible external origin of the sequence in the organism.
The significance of the codon usage bias can be evaluated by using weighted linguistics approaches.
This consists in heuristically weighting the codons used to encode the amino acids, instead of using
an average weight for every amino acid that is encoded by several triplets.
This prevents the resulting frequencies from diverging from the observed values.
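The codon usage bias discussed in this section can be made concrete with a standard measure that is not mentioned in the talk: relative synonymous codon usage (RSCU), which divides the count of each codon by the count it would have if all synonymous codons were used equally. The sketch below is an illustration only, not the weighted linguistics approach of the speaker; it assumes the standard genetic code and a coding sequence given in frame.

```python
from collections import Counter, defaultdict
from itertools import product

# Standard genetic code, codons enumerated with bases in the order T, C, A, G.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}


def codon_counts(cds):
    """Count the in-frame codons of a coding sequence."""
    cds = cds.upper()
    return Counter(cds[i:i + 3] for i in range(0, len(cds) - 2, 3))


def rscu(cds):
    """RSCU per codon: observed count / mean count of its synonymous group.
    Amino acids absent from the sequence are skipped."""
    counts = codon_counts(cds)
    synonyms = defaultdict(list)
    for codon, aa in CODON_TABLE.items():
        synonyms[aa].append(codon)
    values = {}
    for aa, group in synonyms.items():
        total = sum(counts[c] for c in group)
        if total == 0:
            continue
        for c in group:
            values[c] = counts[c] * len(group) / total
    return values


usage = rscu("ATGGCTGCTGCCAAATAA")  # ATG GCT GCT GCC AAA TAA
print(usage["GCT"], usage["GCC"], usage["GCA"])
```

A value of 1 means no bias within the synonymous group; values above 1 mark preferred codons.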
Nevertheless, the probability of finding reasonable codon compositions through linguistic methods is
fairly low, because:
- global linguistics are calculated on a larger set of oligonucleotides than the number of oligos that determine the proteins;
- the number of codons in a gene equals only one third of the number of triplets that can be read from it;
- the different genes are built up from codons of different composition, which increases
the background noise accordingly.
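The one-third argument in the list above can be checked directly by counting, for the same gene, the in-frame codons and all overlapping triplets. This is a minimal sketch with hypothetical function names and an artificial gene.

```python
from collections import Counter


def in_frame_codons(gene):
    """Triplets read in the coding frame only (step of 3)."""
    return Counter(gene[i:i + 3] for i in range(0, len(gene) - 2, 3))


def all_triplets(gene):
    """Every overlapping triplet, as a global word count sees them."""
    return Counter(gene[i:i + 3] for i in range(len(gene) - 2))


gene = "ATG" * 100  # artificial 300-nucleotide gene
print(sum(in_frame_codons(gene).values()))  # 100 codons
print(sum(all_triplets(gene).values()))     # 298 overlapping triplets
print(all_triplets(gene))
```

In this artificial case the coding frame contains only ATG, while the global count also reports TGA and GAT from the two other phases. Those out-of-frame triplets are the background noise the list refers to.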
2 Second Task: Determining the Nature of the Message
Life on any planet other than Earth is detectable by us only
if it is based on our carbon chemistry. Any sequential organic
macromolecule contains constitutional information if textual
organization can be detected within it.
Different approaches exist for the detection of organized information:
- Complexity analysis of sequences. The complexity of
sequences is difficult to compute in general. Ed Trifonov introduced in 1990 the
notion of linguistic complexity, which reflects the
linguistic wealth of a sequence. This complexity is easily computable
as $C = \prod_{i=1}^{n-1} u_i$, with $u_i$ the ratio of the number of
different words of length $i$ found in a sliding window of length $n$ to the
total number of different words of that length that could possibly be found.
Computations are made along windows, by multiplying the $u_i$ ratios of
words of all possible lengths in the window. This implies that all
redundancies are eliminated. The value of $C$ varies from 0 to 1.
- Shannon's entropy measure $H(X) = -\sum_i P(x_i) \log P(x_i)$.
The entropy $H(X)$ is maximal in the case
of a random equiprobable sequence. A reduction in entropy corresponds
to a generation of information. This implies that the
amount of information can be measured by:
$I(X) = H(\mbox{without message}) - H(\mbox{with message})$.
This way, the amount of information can be quantified by comparing a randomly generated Markovian sequence
(sequence without message) with a naturally occurring sequence.
This measure is related to global information content, but gives no indication of the
distribution of the coding zones of the sequence.
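Both measures in the list above are short to compute. The sketch below makes assumptions that the summary does not state: the number of possible words of length i in a window of length n is taken as the smaller of 4^i and n - i + 1, logarithms are in base 2, and the function names are hypothetical.

```python
import math
from collections import Counter


def linguistic_complexity(window, alphabet_size=4):
    """Product over word lengths k = 1..n-1 of
    (distinct words of length k found) / (distinct words possible)."""
    n = len(window)
    c = 1.0
    for k in range(1, n):
        observed = len({window[i:i + k] for i in range(n - k + 1)})
        # Assumption: a window cannot hold more words than it has positions.
        possible = min(alphabet_size ** k, n - k + 1)
        c *= observed / possible
    return c


def shannon_entropy(seq, k=1):
    """H = -sum p log2 p over the words of length k found in seq."""
    words = Counter(seq[i:i + k] for i in range(len(seq) - k + 1))
    total = sum(words.values())
    return -sum((c / total) * math.log2(c / total) for c in words.values())


def information(random_seq, natural_seq, k=1):
    """I = H(without message) - H(with message)."""
    return shannon_entropy(random_seq, k) - shannon_entropy(natural_seq, k)


print(linguistic_complexity("ACGT"), linguistic_complexity("AAAAAAAA"))
print(shannon_entropy("ACGT"), information("ACGT", "AAAA"))
```

In practice `random_seq` would be a sequence generated from a Markov model fitted to the natural one, so that the difference reflects organization beyond the composition captured by the model.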
It is a common observation in information-bearing texts that coding zones are separated
from each other by areas that are more or less deprived of information. If the hypothesis of a
genome makes sense, then its linguistics must satisfy the following criteria:
- it must be based on a restricted alphabet;
- it bears coding subsequences that are separated from each other
in a way that is recognizable by certain molecules;
- the coding subsequences are likely to share some common
- these sequences are constructed using linguistics that can vary
from one ``genome'' to another;
- the reading direction of the sequences is oriented (this should
facilitate their regulation);
- the method used to copy the message determines the ordered
relation between the coding sequences.