N-gram analysis of 970 microbial organisms reveals presence of biological language models

BACKGROUND: It has been suggested previously that genome and proteome sequences show characteristics typical of natural-language texts such as "signature-style" word usage indicative of authors or topics, and that the algorithms originally developed for natural language processing may ther...

Bibliographic Details
Main Authors: Osmanbeyoglu, Hatice Ulku, Ganapathiraju, Madhavi K
Format: Text
Language: English
Published: BioMed Central 2011
Subjects:
Online Access: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC3027111/
https://www.ncbi.nlm.nih.gov/pubmed/21219653
http://dx.doi.org/10.1186/1471-2105-12-12
author Osmanbeyoglu, Hatice Ulku
Ganapathiraju, Madhavi K
collection PubMed
description BACKGROUND: It has been suggested previously that genome and proteome sequences show characteristics typical of natural-language texts, such as "signature-style" word usage indicative of authors or topics, and that algorithms originally developed for natural language processing may therefore be applied to genome sequences to draw biologically relevant conclusions. Following this approach of 'biological language modeling', statistical n-gram analysis was previously applied to the comparative analysis of whole proteome sequences of 44 organisms. It was shown that a few particular amino acid n-grams are abundant in one organism but occur very rarely in other organisms, thereby serving as genome signatures. At that time the proteomes of only 44 organisms were available, which limited the generalization of this hypothesis. Today nearly 1,000 genome sequences and their corresponding translated sequences are available, making it feasible to test the existence of biological language models over the evolutionary tree. RESULTS: We studied whole proteome sequences of 970 microbial organisms using n-gram frequencies and cross-perplexity, employing the Biological Language Modeling Toolkit and the Patternix Revelio toolkit. Genus-specific signatures were observed even in a simple unigram distribution. By taking the statistical n-gram model of one organism as reference and computing the cross-perplexity of all other microbial proteomes against it, cross-perplexity was found to be predictive of branch distance in the phylogenetic tree. For example, a 4-gram model from the proteome of Shigella flexneri 2a, which belongs to the Gammaproteobacteria class, showed a self-perplexity of 15.34, while the cross-perplexity of other organisms ranged from 15.59 to 29.5 and was proportional to their branching distance from S. flexneri in the evolutionary tree. The organisms of this genus, which happen to be pathotypes of E. coli, also have the perplexity values closest to those of E. coli.
CONCLUSION: Whole proteome sequences of microbial organisms have been shown to contain particular n-gram sequences that are abundant in one organism but occur very rarely in other organisms, thereby serving as proteome signatures. Furthermore, it has been shown that perplexity, a statistical measure of similarity of n-gram composition, can be used to predict evolutionary distance within a genus in the phylogenetic tree.
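The cross-perplexity measure named in the description can be illustrated with a minimal sketch. It is not the Biological Language Modeling Toolkit or Patternix Revelio, whose internals the record does not describe. The function names `ngram_counts` and `perplexity`, the 20-letter residue alphabet, the add-one smoothing, and the toy sequences are all assumptions made for illustration only. The sketch builds an n-gram model from a reference proteome and scores another proteome against it; a lower value means the scored sequences look more like the reference.

```python
import math
from collections import Counter

# Assumption: the 20 standard amino acids form the vocabulary.
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def ngram_counts(proteins, n):
    """Count n-grams and their (n-1)-residue contexts over protein sequences."""
    grams, contexts = Counter(), Counter()
    for seq in proteins:
        for i in range(len(seq) - n + 1):
            gram = seq[i:i + n]
            grams[gram] += 1
            contexts[gram[:-1]] += 1
    return grams, contexts


def perplexity(model, proteins, n, vocab_size=len(AMINO_ACIDS)):
    """Perplexity of `proteins` under `model`, using add-one smoothing.

    Assumption: add-one (Laplace) smoothing stands in for whatever
    smoothing the original toolkits apply.
    """
    grams, contexts = model
    log_sum, total = 0.0, 0
    for seq in proteins:
        for i in range(len(seq) - n + 1):
            gram = seq[i:i + n]
            # Smoothed conditional probability of the last residue given its context.
            p = (grams[gram] + 1) / (contexts[gram[:-1]] + vocab_size)
            log_sum += math.log2(p)
            total += 1
    if total == 0:
        raise ValueError("no n-grams to score; sequences shorter than n")
    return 2 ** (-log_sum / total)


if __name__ == "__main__":
    # Toy sequences, not real proteomes.
    reference = ["MKTAYIAKQRMKTAYIAKQR"]
    other = ["GGGGGGGGGG"]
    model = ngram_counts(reference, 2)
    print("self-perplexity: ", perplexity(model, reference, 2))
    print("cross-perplexity:", perplexity(model, other, 2))
```

Scoring the reference against its own model gives the self-perplexity; scoring any other proteome against that model gives the cross-perplexity that the description compares with branch distance.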
format Text
id pubmed-3027111
institution National Center for Biotechnology Information
language English
publishDate 2011
publisher BioMed Central
record_format MEDLINE/PubMed
spelling pubmed-3027111 2011-01-28 N-gram analysis of 970 microbial organisms reveals presence of biological language models Osmanbeyoglu, Hatice Ulku Ganapathiraju, Madhavi K BMC Bioinformatics Research Article BioMed Central 2011-01-10 /pmc/articles/PMC3027111/ /pubmed/21219653 http://dx.doi.org/10.1186/1471-2105-12-12 Text en Copyright ©2011 Osmanbeyoglu and Ganapathiraju; licensee BioMed Central Ltd. http://creativecommons.org/licenses/by/2.0 This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.
title N-gram analysis of 970 microbial organisms reveals presence of biological language models
topic Research Article
url https://www.ncbi.nlm.nih.gov/pmc/articles/PMC3027111/
https://www.ncbi.nlm.nih.gov/pubmed/21219653
http://dx.doi.org/10.1186/1471-2105-12-12