creators_name: Gorban, A.N.
creators_name: Popova, T.G.
creators_name: Zinovyev, A.Yu.
type: preprint
datestamp: 2004-11-06
lastmod: 2011-03-11 08:55:43
metadata_visibility: show
title: Four basic symmetry types in the universal 7-cluster structure of 143 complete bacterial genomic sequences
subjects: bio-theory
full_text_status: public
keywords: codon usage, cluster structure, mean field, frequency dictionary
abstract: Coding information is the main source of heterogeneity (non-randomness) in the sequences of bacterial genomes. This information can be naturally modeled by analysing cluster structures in the ``in-phase'' triplet distributions of relatively short genomic fragments (200-400bp). We found a universal 7-cluster structure in all 143 completely sequenced bacterial genomes available in Genbank in August 2004, and explained its properties. The 7-cluster structure is responsible for the main part of sequence heterogeneity in bacterial genomes. In this sense, the 7-cluster structure is the basic model of a bacterial genome sequence. We demonstrated that there are four basic ``pure'' types of this model observed in nature: ``parallel triangles'', ``perpendicular triangles'', the degenerate case, and the flower-like type. We show that the codon usage of bacterial genomes is, with high accuracy, a multi-linear function of their genomic G+C-content (more precisely, it is described by two similar functions, one for eubacterial genomes and the other for archaea). Animated 3D scatter plots of the cluster structure for all 143 genomes are collected in a database available on our web-site (http://www.ihes.fr/~zinovyev/7clusters). The finding can be readily introduced into any software for gene prediction, sequence alignment or bacterial genome classification.
date: 2004-10
date_type: published
refereed: FALSE
referencetext: Audic S, Claverie JM. Self-identification of protein-coding regions in microbial genomes. (1998) {\it Proc Natl Acad Sci USA}. {\bf 95}(17):10026-31.
Baldi P.
On the convergence of a clustering algorithm for protein-coding regions in microbial genomes. (2000) {\it Bioinformatics}. {\bf 16}(4):367-71.
Bernaola-Galvan P., Grosse I., Carpena P., Oliver J.L., Roman-Roldan R., Stanley H.E. Finding borders between coding and noncoding DNA regions by an entropic segmentation method. (2000) {\it Physical Review Letters}. {\bf 85}(6):1342-1345.
BioJava open-source project. http://www.biojava.org
Borodovsky M., McIninch J. GENMARK: parallel gene recognition for both DNA strands. (1993) {\it Comp. Chem.} {\bf 17}:123-133.
Carbone A., Zinovyev A., Kepes F. Codon Adaptation Index as a measure of dominating codon bias. (2003) {\it Bioinformatics}. {\bf 19}(13):2005-2015.
Cluster structures in genomic word frequency distributions. Web-site with supplementary materials. {\it http://www.ihes.fr/$\sim$zinovyev/7clusters/index.htm}
Gorban AN, Mirkes EM, Popova TG, Sadovsky MG. A new approach to the investigations of statistical properties of genetic texts. (1993) {\it Biofizika}. {\bf 38}(5):762-767.
Gorban AN, Bugaenko NN, Sadovskii MG. Maximum entropy method in analysis of genetic text and measurement of its information content. (1998) {\it Open Systems and Information Dynamics}. {\bf 5}:265-278.
Gorban AN, Popova TG, Sadovsky MG. Classification of symbol sequences over their frequency dictionaries: towards the connection between structure and natural taxonomy. (2000) {\it Open Systems and Information Dynamics}. {\bf 7}:1-17.
Gorban A.N., Zinovyev A.Y., Wunsch D.C. Application of The Method of Elastic Maps In Analysis of Genetic Texts. (2003) In {\it Proceedings of International Joint Conference on Neural Networks (IJCNN)}, Portland, Oregon, July 20-24.
Gorban A, Zinovyev A, Popova T. Seven clusters in genomic triplet distributions. (2003) {\it In Silico Biology}. {\bf 3}, 0039.
(e-print: http://arxiv.org/abs/cond-mat/0305681 and http://cogprints.ecs.soton.ac.uk/archive/00003077/ )
Gorban A.N., Zinovyev A.Yu., Popova T.G. Statistical approaches to the automated gene identification without teacher. Institut des Hautes Etudes Scientifiques, IHES Preprint, France. 2001. M/01/34. Available at {\it http://www.ihes.fr} web-site. (See also e-print: http://arxiv.org/abs/physics/0108016 )
Gorban A.N., Zinovyev A.Yu. Visualization of data by method of elastic maps and its applications in genomics, economics and sociology. Institut des Hautes Etudes Scientifiques, IHES Preprint, France. 2001. M/01/36. Available at {\it http://www.ihes.fr} web-site.
Karlin S. Global dinucleotide signatures and analysis of genomic heterogeneity. (1998) {\it Current Opinion in Microbiology}. {\bf 1}(5):598-610.
Lobry JR, Chessel D. Internal correspondence analysis of codon and amino-acid usage in thermophilic bacteria. (2003) {\it J. Appl. Genet.} {\bf 44}(2):235-61.
Mathe C., Sagot M.F., Schiex T., Rouze P. Current methods of gene prediction, their strengths and weaknesses. (2002) {\it Nucleic Acids Res.} {\bf 30}(19):4103-4117.
Nicolas P, Bize L, Muri F, Hoebeke M, Rodolphe F, Ehrlich SD, Prum B, Bessieres P. Mining Bacillus subtilis chromosome heterogeneities using hidden Markov models. (2002) {\it Nucleic Acids Res.} {\bf 30}(6):1418-26.
Ou HY, Guo FB, Zhang CT. Analysis of nucleotide distribution in the genome of Streptomyces coelicolor A3(2) using the Z curve method. (2003) {\it FEBS Lett.} {\bf 540}(1-3):188-94.
Salzberg S.L., Delcher A.L., Kasif S., White O. Microbial gene identification using interpolated Markov Models. (1998) {\it Nucleic Acids Res.} {\bf 26}(2):544-548.
Trifonov E.N. Translation framing code and frame-monitoring mechanism as suggested by the analysis of mRNA and 16S rRNA nucleotide sequences. (1987) {\it J. Mol. Biol.} {\bf 194}:643-652.
Zhang C.T. and Zhang R.
Analysis of distribution of bases in the coding sequences by a diagrammatic technique. (1991) {\it Nucleic Acids Res.} {\bf 19}:6313-6317.
Zhang C.T. and Chou K.C. A graphic approach to analyzing codon usage in 1562 Escherichia coli protein coding sequences. (1994) {\it J. Mol. Biol.} {\bf 238}:1-8.
Zinovyev A. Visualizing the spatial structure of triplet distributions in genetic texts. IHES Preprint, France. 2002. M/02/28. Available at {\it http://www.ihes.fr} web-site.
Zinovyev A., Gorban A., Popova T. Self-Organizing Approach for Automated Gene Identification. (2003) {\it Open Systems and Information Dynamics}. {\bf 10}(4):321-333.
citation: Gorban, A.N. and Popova, T.G. and Zinovyev, A.Yu. (2004) Four basic symmetry types in the universal 7-cluster structure of 143 complete bacterial genomic sequences. [Preprint]
document_url: http://cogprints.org/3915/1/7clustersCog.pdf