---
abstract: |
Coding information is the main source of heterogeneity
(non-randomness) in the sequences of bacterial genomes. This
information can be naturally modeled by analysing cluster structures in the ``in-phase'' triplet distributions of relatively short genomic fragments (200-400bp). We found a universal 7-cluster structure in all 143 completely sequenced bacterial genomes available in GenBank in August 2004, and explained its properties.
The 7-cluster structure is responsible for the main part of sequence heterogeneity in bacterial genomes. In this sense, the 7-cluster structure is the basic model of a bacterial genome sequence. We demonstrate that four basic ``pure'' types of this model are observed in nature: ``parallel triangles'', ``perpendicular triangles'',
the degenerate case and the flower-like type. We show that the codon usage of bacterial genomes is, with high accuracy, a multi-linear function of their genomic G+C-content (more precisely, it is described by two similar functions, one for eubacterial genomes and the other for archaea).
Animated 3D scatter plots of the cluster structures of all 143 genomes are collected in a database that is available on our web-site:
http://www.ihes.fr/~zinovyev/7clusters
These findings can be readily introduced into any software for gene prediction, sequence alignment or bacterial genome classification.
altloc:
- http://mathcircle.org/gorban/
- http://www.ihes.fr/~zinovyev/
chapter: ~
commentary: ~
commref: ~
confdates: ~
conference: ~
confloc: ~
contact_email: ~
creators_id: []
creators_name:
- family: Gorban
given: A.N.
honourific: ''
lineage: ''
- family: Popova
given: T.G.
honourific: ''
lineage: ''
- family: Zinovyev
given: A.Yu.
honourific: ''
lineage: ''
date: 2004-10
date_type: published
datestamp: 2004-11-06
department: ~
dir: disk0/00/00/39/15
edit_lock_since: ~
edit_lock_until: ~
edit_lock_user: ~
editors_id: []
editors_name: []
eprint_status: archive
eprintid: 3915
fileinfo: /style/images/fileicons/application_pdf.png;/3915/1/7clustersCog.pdf
full_text_status: public
importid: ~
institution: ~
isbn: ~
ispublished: ~
issn: ~
item_issues_comment: []
item_issues_count: 0
item_issues_description: []
item_issues_id: []
item_issues_reported_by: []
item_issues_resolved_by: []
item_issues_status: []
item_issues_timestamp: []
item_issues_type: []
keywords: 'codon usage, cluster structure, mean field, frequency dictionary'
lastmod: 2011-03-11 08:55:43
latitude: ~
longitude: ~
metadata_visibility: show
note: ~
number: ~
pagerange: ~
pubdom: FALSE
publication: ~
publisher: ~
refereed: FALSE
referencetext: |
Audic S, Claverie JM. Self-identification of protein-coding
regions in microbial genomes. (1998) {\it Proc Natl Acad Sci USA}. {\bf 95(17)}:10026-31.
Baldi P. On the convergence of a clustering algorithm for protein-coding regions in microbial genomes. (2000) {\it Bioinformatics}. {\bf 16}(4):367-71.
Bernaola-Galvan, P., Grosse, I., Carpena, P., Oliver, J.L.,
Roman-Roldan, R., Stanley, H.E. (2000). Finding borders between coding and noncoding DNA regions by an entropic segmentation method. \textit{Physical Review Letters} \textbf{85}(6): 1342-1345.
BioJava open-source project. http://www.biojava.org
Borodovsky, M., McIninch, J. (1993). GENMARK: parallel gene recognition for both DNA strands. {\it Comp.Chem} {\bf 17},
123-133.
Carbone A., Zinovyev A., Kepes F. Codon Adaptation Index as a measure of dominating codon bias. (2003) {\it Bioinformatics}. {\bf 19}, 13, p.2005-2015.
Cluster structures in genomic word frequency distributions.
Web-site with supplementary materials. {\it http://www.ihes.fr/$\sim$zinovyev/7clusters/index.htm }
Gorban AN, Mirkes EM, Popova TG, Sadovsky MG. A new approach to the investigations of statistical properties of genetic texts. (1993) {\it Biofizika} {\bf 38} (5): 762-767.
Gorban AN, Bugaenko NN, Sadovskii MG. Maximum entropy method in analysis of genetic text and measurement of its information content. (1998) {\it Open systems and information dynamics}. {\bf 5}, pp.265-278.
Gorban AN, Popova TG, Sadovsky MG. Classification of symbol
sequences over their frequency dictionaries: towards the
connection between structure and natural taxonomy. (2000) {\it Open Systems and Information Dynamics}, {\bf 7}:1-17.
Gorban A.N., Zinovyev A.Y., Wunsch D.C. Application of The Method of Elastic Maps In Analysis of Genetic Texts. (2003) In {\it Proceedings of International Joint Conference on Neural Networks (IJCNN)}, Portland, Oregon, July 20-24.
Gorban A, Zinovyev A, Popova T. Seven clusters in genomic triplet distributions. (2003) {\it In Silico Biology}. {\bf V.3}, 0039.
(e-print: http://arxiv.org/abs/cond-mat/0305681 and
http://cogprints.ecs.soton.ac.uk/archive/00003077/ )
Gorban A.N., Zinovyev A.Yu., Popova T.G. Statistical approaches to the automated gene identification without teacher // Institut des Hautes Etudes Scientifiques. - IHES Preprint, France. 2001. - M/01/34.
Available at {\it http://www.ihes.fr} web-site. (See also e-print:
http://arxiv.org/abs/physics/0108016 )
Gorban A.N., Zinovyev A.Yu. Visualization of data by method of elastic maps and its applications in genomics, economics and sociology // Institut des Hautes Etudes Scientifiques. - IHES Preprint, France. 2001. - M/01/36. Available at {\it http://www.ihes.fr} web-site.
Karlin S. (1998) Global dinucleotide signatures and analysis of genomic heterogeneity. {\it Current opinion in microbiology} {\bf 1}(5): 598-610.
Lobry JR, Chessel D. (2003) Internal correspondence analysis of codon and amino-acid usage in thermophilic bacteria. {\it J.Appl.Genet.} 44(2):235-61.
Mathe C., Sagot M.F., Schiex T., Rouze P. Current methods of gene prediction, their strengths and weaknesses (2002) {\it Nucleic Acids Res}. {\bf 30}(19):4103-4117.
Nicolas P, Bize L, Muri F, Hoebeke M, Rodolphe F, Ehrlich SD, Prum B, Bessieres P. Mining Bacillus subtilis chromosome
heterogeneities using hidden Markov models. (2002) {\it Nucleic Acids Res.} {\bf 30}(6):1418-26.
Ou HY, Guo FB, Zhang CT. Analysis of nucleotide distribution in the genome of Streptomyces coelicolor A3(2) using the Z curve method. (2003) \textit{FEBS Lett.} Apr
10;\textbf{540}(1-3):188-94.
Salzberg S.L., Delcher A.L., Kasif S., White O. Microbial gene identification using interpolated Markov Models. (1998) {\it Nuc. Acids Res.} {\bf 26}(2): 544-548.
Trifonov, E.N. Translation framing code and frame-monitoring
mechanism as suggested by the analysis of mRNA and 16S rRNA
nucleotide sequences. (1987) {\it J.Mol.Biol.} {\bf 194}, 643-652.
Zhang, C.T. and Zhang, R. Analysis of distribution of bases in the coding sequences by a diagrammatic technique. (1991)
{\it Nucleic Acids Res.} {\bf 19}, 6313-6317.
Zhang, C.T. and Chou, K.C. A graphic approach to analyzing codon usage in 1562 Escherichia coli protein coding sequences. (1994) {\it J.Mol.Biol.} {\bf 238}, 1-8.
Zinovyev A. Visualizing the spatial structure of triplet
distributions in genetic texts. - IHES Preprint, France. 2002. - M/02/28. Available at {\it http://www.ihes.fr} web-site.
Zinovyev A., Gorban A., Popova T. Self-Organizing Approach for Automated Gene Identification. (2003). {\it Open Systems and Information Dynamics} {\bf 10}(4). p.321-333.
relation_type: []
relation_uri: []
reportno: ~
rev_number: 12
series: ~
source: ~
status_changed: 2007-09-12 16:54:16
subjects:
- bio-theory
succeeds: ~
suggestions: ~
sword_depositor: ~
sword_slug: ~
thesistype: ~
title: |-
Four basic symmetry types in the universal 7-cluster
structure of 143 complete bacterial genomic sequences
type: preprint
userid: 4198
volume: ~