---
abstract: |
  Coding information is the main source of heterogeneity
  (non-randomness) in the sequences of bacterial genomes. This
  information can be naturally modeled by analysing cluster structures in the ``in-phase'' triplet distributions of relatively short genomic fragments (200-400 bp). We found a universal 7-cluster structure in all 143 completely sequenced bacterial genomes available in GenBank in August 2004, and explained its properties.
  The 7-cluster structure accounts for the main part of sequence heterogeneity in bacterial genomes. In this sense, the 7-cluster structure is the basic model of a bacterial genome sequence. We demonstrate that four basic ``pure'' types of this model are observed in nature: ``parallel triangles'', ``perpendicular triangles'',
  the degenerate case, and the flower-like type. We show that the codon usage of bacterial genomes is described with high accuracy by a multi-linear function of their genomic G+C content (more precisely, by two similar functions, one for eubacterial genomes and the other for archaea).
  
  All 143 animated 3D cluster scatter plots are collected in a database, which is available on our web-site:
  
  http://www.ihes.fr/~zinovyev/7clusters 
  
  These findings can be readily incorporated into any software for gene prediction, sequence alignment, or bacterial genome classification.
altloc:
  - http://mathcircle.org/gorban/
  - http://www.ihes.fr/~zinovyev/
chapter: ~
commentary: ~
commref: ~
confdates: ~
conference: ~
confloc: ~
contact_email: ~
creators_id: []
creators_name:
  - family: Gorban
    given: A.N.
    honourific: ''
    lineage: ''
  - family: Popova
    given: T.G.
    honourific: ''
    lineage: ''
  - family: Zinovyev
    given: A.Yu.
    honourific: ''
    lineage: ''
date: 2004-10
date_type: published
datestamp: 2004-11-06
department: ~
dir: disk0/00/00/39/15
edit_lock_since: ~
edit_lock_until: ~
edit_lock_user: ~
editors_id: []
editors_name: []
eprint_status: archive
eprintid: 3915
fileinfo: /style/images/fileicons/application_pdf.png;/3915/1/7clustersCog.pdf
full_text_status: public
importid: ~
institution: ~
isbn: ~
ispublished: ~
issn: ~
item_issues_comment: []
item_issues_count: 0
item_issues_description: []
item_issues_id: []
item_issues_reported_by: []
item_issues_resolved_by: []
item_issues_status: []
item_issues_timestamp: []
item_issues_type: []
keywords: 'codon usage, cluster structure, mean field, frequency dictionary'
lastmod: 2011-03-11 08:55:43
latitude: ~
longitude: ~
metadata_visibility: show
note: ~
number: ~
pagerange: ~
pubdom: FALSE
publication: ~
publisher: ~
refereed: FALSE
referencetext: |
  Audic S, Claverie JM. Self-identification of protein-coding
  regions in microbial genomes. (1998) {\it Proc Natl Acad Sci USA}. {\bf 95(17)}:10026-31.
  
  Baldi P. On the convergence of a clustering algorithm for protein-coding regions in microbial genomes. (2000) {\it Bioinformatics}. {\bf 16}(4):367-71.
  
  Bernaola-Galvan, P., Grosse, I., Carpena, P., Oliver, J.L.,
  Roman-Roldan, R., Stanley, H.E. (2000). Finding borders between coding and noncoding DNA regions by an entropic segmentation method. \textit{Physical Review Letters} \textbf{85}(6): 1342-1345.
  
  BioJava open-source project. http://www.biojava.org

  Borodovsky, M., McIninch, J. (1993). GENMARK: parallel gene recognition for both DNA strands. {\it Comp.Chem} {\bf 17}, 123-133.
  
  Carbone A., Zinovyev A., Kepes F. Codon Adaptation Index as a measure of dominating codon bias. (2003) {\it Bioinformatics}. {\bf 19}, 13, p.2005-2015.
  
  
  Cluster structures in genomic word frequency distributions.
  Web-site with supplementary materials. {\it http://www.ihes.fr/$\sim$zinovyev/7clusters/index.htm }
  
  Gorban AN, Mirkes EM, Popova TG, Sadovsky MG. A new  approach to the investigations of statistical properties of genetic texts. (1993) {\it Biofizika} {\bf 38} (5): 762-767.
  
  Gorban AN, Bugaenko NN, Sadovskii MG. Maximum entropy method in analysis of genetic text and measurement of its information content. (1998) {\it Open systems and information dynamics}. {\bf 5}, pp.265-278.
  
  Gorban AN, Popova TG, Sadovsky MG. Classification of symbol
  sequences over their frequency dictionaries: towards the
  connection between structure and natural taxonomy. (2000) {\it Open Systems and Information Dynamics}, {\bf 7}:1-17.
  
  Gorban A.N., Zinovyev A.Y., Wunsch D.C. Application of The Method of Elastic Maps In Analysis of Genetic Texts. (2003) In {\it Proceedings of International Joint Conference on Neural Networks (IJCNN)}, Portland, Oregon, July 20-24.
  
  Gorban A, Zinovyev A, Popova T. Seven clusters in genomic triplet distributions. (2003) {\it In Silico Biology}. {\bf V.3}, 0039.
  (e-print: http://arxiv.org/abs/cond-mat/0305681 and
  http://cogprints.ecs.soton.ac.uk/archive/00003077/ )
  
  Gorban A.N., Zinovyev A.Yu., Popova T.G. Statistical approaches to the automated gene identification without teacher // Institut des Hautes Etudes Scientifiques. - IHES Preprint, France. 2001. - M/01/34.
  Available at {\it http://www.ihes.fr} web-site. (See also e-print:
  http://arxiv.org/abs/physics/0108016 )
  
  Gorban A.N., Zinovyev A.Yu. Visualization of data by method of elastic maps and its applications in genomics, economics and sociology // Institut des Hautes Etudes Scientifiques. - IHES Preprint, France. 2001. - M/01/36. Available at {\it http://www.ihes.fr} web-site.
  
  Karlin S. (1998) Global dinucleotide signatures and analysis of genomic heterogeneity. {\it Current opinion in microbiology} {\bf 1}(5): 598-610.
  
  Lobry JR, Chessel D. (2003) Internal correspondence analysis of codon and amino-acid usage in thermophilic bacteria. {\it J.Appl.Genet.} 44(2):235-61.
  
  Mathe C., Sagot M.F., Schiex T., Rouze P. Current methods of gene prediction, their strengths and weaknesses (2002) {\it Nucleic Acids Res}. {\bf 30}(19):4103-4117.
  
  Nicolas P, Bize L, Muri F, Hoebeke M, Rodolphe F, Ehrlich SD, Prum B, Bessieres P. Mining Bacillus subtilis chromosome
  heterogeneities using hidden Markov models. (2002) {\it Nucleic Acids Res.}  {\bf 30}(6):1418-26.
  
  Ou HY, Guo FB, Zhang CT. Analysis of nucleotide  distribution in the genome of Streptomyces coelicolor A3(2) using the Z curve method. (2003) \textit{FEBS Lett.} Apr
  10;\textbf{540}(1-3):188-94.
  
  Salzberg S.L., Delcher A.L., Kasif S., White O. Microbial gene identification using interpolated Markov Models. (1998)  {\it Nuc. Acids Res.} {\bf 26}(2): 544-548.
  
  Trifonov, E.N. Translation framing code and frame-monitoring
  mechanism as suggested by the analysis of mRNA and 16S rRNA
  nucleotide sequences. (1987) {\it J.Mol.Biol.} {\bf 194}, 643-652.
  
  Zhang, C.T. and Zhang, R. Analysis of distribution of bases in the coding sequences by a diagrammatic technique. (1991) {\it Nucleic Acids Res.} {\bf 19}, 6313-6317.
  
  Zhang, C.T. and Chou, K.C. A graphic approach to analyzing codon usage in 1562 Escherichia coli protein coding sequences. (1994) {\it J.Mol.Biol.} {\bf 238}, 1-8.
  
  Zinovyev A. Visualizing the spatial structure of triplet
  distributions in genetic texts. - IHES Preprint, France. 2002. - M/02/28. Available at {\it http://www.ihes.fr} web-site.
  
  Zinovyev A., Gorban A., Popova T. Self-Organizing Approach for Automated Gene Identification. (2003). {\it Open Systems and Information Dynamics} {\bf 10}(4). p.321-333.
relation_type: []
relation_uri: []
reportno: ~
rev_number: 12
series: ~
source: ~
status_changed: 2007-09-12 16:54:16
subjects:
  - bio-theory
succeeds: ~
suggestions: ~
sword_depositor: ~
sword_slug: ~
thesistype: ~
title: |-
  Four basic symmetry types in the universal 7-cluster
  structure of 143 complete bacterial genomic sequences
type: preprint
userid: 4198
volume: ~