creators_name: Gorban, Alexander N.
creators_name: Zinovyev, Andrei Yu
creators_name: Popova, Tatyana G.
type: preprint
datestamp: 2003-07-23
lastmod: 2011-03-11 08:55:19
metadata_visibility: show
title: Seven clusters in genomic triplet distributions
subjects: bio-theory
full_text_status: public
keywords: genomic, cluster, gene finding, data visualization, codon usage
note: The universal seven-cluster structure of genetic texts is presented. This structure seems to be important, simple and geometrically elegant, as one can see from the illustrations. It could be interesting for experts in genomics as well as for scientists from other fields.
abstract: Motivation: In several recent papers, new algorithms were proposed for detecting coding regions without requiring a learning dataset of already known genes. In this paper we studied the cluster structure of several genomes in the space of codon usage. This allowed us to interpret some of the results obtained in other studies and to propose a simpler method, which is nevertheless fully functional. Results: Several complete genomic sequences were analyzed by visualizing tables of triplet counts in a sliding window. The distribution of 64-dimensional vectors of triplet frequencies displays a well-detectable cluster structure. The structure was found to consist of seven clusters: six corresponding to protein-coding information in each of the three possible phases on each of the two complementary strands, and one corresponding to non-coding regions. Awareness of this structure allows the development of methods for segmenting sequences into regions with the same coding phase and non-coding regions. Such a method may be completely unsupervised or may use some external information. Since the method does not need extraction of ORFs, it can be applied even to unassembled genomes. Accuracy calculated at the base-pair level (both sensitivity and specificity) exceeds 90%.
This is no worse than HMM-based methods, and the approach has the advantage of being much simpler and clearer.
date: 2002
date_type: published
refereed: FALSE
referencetext: Audic, S., Claverie, J.-M. (1998) Self-identification of protein-coding regions in microbial genomes. Proc. Natl. Acad. Sci. USA, 95.
Baldi, P. (2000) On the convergence of a clustering algorithm for protein-coding regions in microbial genomes. Bioinformatics, 16: 367-371.
Bernaola-Galvan, P., Grosse, I., Carpena, P., Oliver, J.L., Roman-Roldan, R., Stanley, H.E. (2000) Finding borders between coding and noncoding DNA regions by an entropic segmentation method. Physical Review Letters 85(6): 1342-1345.
Besemer, J., Lomsadze, A., Borodovsky, M. (2001) GeneMarkS: a self-training method for prediction of gene starts in microbial genomes. Implications for finding sequence motifs in regulatory regions. Nucleic Acids Res. 29(12): 2607-2618.
Burge, C.B., Karlin, S. (1998) Finding the genes in genomic DNA. Curr. Opin. Struct. Biol. 8: 346-354.
Fickett, J.W. (1996) The gene identification problem: an overview for developers. Computers & Chemistry 20: 103-118.
Gorban, A.N., Zinovyev, A.Yu., Popova, T.G. (2001) Statistical approaches to the automated gene identification without teacher. Institut des Hautes Etudes Scientifiques preprint. IHES, France. Web-site link: http://www.ihes.fr/PREPRINTS/M01/M01-34.ps.gz.
Salzberg, S.L., Delcher, A.L., Kasif, S., White, O. (1998) Microbial gene identification using interpolated Markov models. Nucleic Acids Research 26(2): 544-548.
Delcher, A.L., Harmon, D., Kasif, S., White, O., Salzberg, S.L. (1999) Improved microbial gene identification with GLIMMER. Nucleic Acids Research 27(23): 4636-4641.
Mathe, C., Peresetsky, A., Dehais, P., Van Montagu, M., Rouze, P. (1999) Classification of Arabidopsis thaliana gene sequences: clustering of coding sequences into two groups according to codon usage improves gene prediction. J. Mol. Biol. 285(5): 1977-1991.
Mathe, C., Dehais, P., Pavy, N., Rombauts, S., Van Montagu, M., Rouze, P. (2000) Gene prediction and gene classes in Arabidopsis thaliana. J. Biotechnol. 78(3): 293-299.
Mathe, C., Sagot, M.F., Schiex, T., Rouze, P. (2002) Current methods of gene prediction, their strengths and weaknesses. Nucleic Acids Res. 30(19): 4103-4117.
Medigue, C., Rouxel, T., Vigier, P., Henault, A., Danchin, A. (1991) Evidence for horizontal gene transfer in Escherichia coli speciation. J. Mol. Biol. 222: 851-856.
citation: Gorban, Prof. Alexander N. and Zinovyev, Dr. Andrei Yu and Popova, Dr. Tatyana G. (2002) Seven clusters in genomic triplet distributions. [Preprint]
document_url: http://cogprints.org/3077/1/Seven.pdf
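
The first step named in the abstract, building tables of triplet counts in a sliding window and turning them into 64-dimensional frequency vectors, might be sketched roughly as below. This is only an illustrative sketch, not the authors' implementation: the function names, the window `width` and `step` defaults, and the choice to read triplets without overlap from the first base of each window are all assumptions that do not come from the record.

```python
from itertools import product

# All 64 triplets in a fixed order, so every vector shares the same coordinates.
TRIPLETS = ["".join(p) for p in product("ACGT", repeat=3)]
INDEX = {t: i for i, t in enumerate(TRIPLETS)}


def triplet_frequencies(window):
    """Return the 64-dimensional vector of triplet frequencies for one window.

    Assumption of this sketch: triplets are read without overlap, starting
    at the first base of the window. Triplets containing symbols other
    than A, C, G, T are skipped.
    """
    counts = [0] * 64
    for i in range(0, len(window) - 2, 3):
        idx = INDEX.get(window[i:i + 3])
        if idx is not None:
            counts[idx] += 1
    total = sum(counts)
    return [c / total for c in counts] if total else [0.0] * 64


def sliding_window_vectors(sequence, width=300, step=12):
    """Yield (start, vector) pairs for windows sliding along the sequence.

    width and step are hypothetical defaults, not values from the record.
    """
    sequence = sequence.upper()
    for start in range(0, len(sequence) - width + 1, step):
        yield start, triplet_frequencies(sequence[start:start + width])
```

With a step that is not a multiple of three, successive windows read the same region in different phases, which is one way to see why coding regions could give rise to separate clusters per phase. The resulting vectors could then be passed to any visualization or clustering routine.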