
Gene prediction in metagenomic fragments: A large scale machine learning approach

BACKGROUND: Metagenomics is an approach to the characterization of microbial genomes via the direct isolation of genomic sequences from the environment without prior cultivation. The amount of metagenomic sequence data is growing fast while computational methods for metagenome analysis are still in...


Bibliographic Details
Main authors:	Hoff, Katharina J; Tech, Maike; Lingner, Thomas; Daniel, Rolf; Morgenstern, Burkhard; Meinicke, Peter
Format:	Text
Language:	English
Published:	BioMed Central 2008
Subjects:	Methodology Article
Online access:	https://www.ncbi.nlm.nih.gov/pmc/articles/PMC2409338/ https://www.ncbi.nlm.nih.gov/pubmed/18442389 http://dx.doi.org/10.1186/1471-2105-9-217

author	Hoff, Katharina J; Tech, Maike; Lingner, Thomas; Daniel, Rolf; Morgenstern, Burkhard; Meinicke, Peter
collection	PubMed
description	BACKGROUND: Metagenomics is an approach to the characterization of microbial genomes via the direct isolation of genomic sequences from the environment without prior cultivation. The amount of metagenomic sequence data is growing fast while computational methods for metagenome analysis are still in their infancy. In contrast to genomic sequences of single species, which can usually be assembled and analyzed by many available methods, a large proportion of metagenome data remains as unassembled anonymous sequencing reads. One of the aims of all metagenomic sequencing projects is the identification of novel genes. The short length of the fragments (Sanger sequencing, for example, yields fragments of about 700 bp on average) and the unknown phylogenetic origin of most of them require approaches to gene prediction that differ from the methods currently available for single-species genomes. In particular, the large size of metagenomic samples requires fast and accurate methods with small numbers of false positive predictions. RESULTS: We introduce a novel gene prediction algorithm for metagenomic fragments based on a two-stage machine learning approach. In the first stage, we use linear discriminants for monocodon usage, dicodon usage and translation initiation sites to extract features from DNA sequences. In the second stage, an artificial neural network combines these features with open reading frame length and fragment GC-content to compute the probability that a given open reading frame encodes a protein. This probability is used for the classification and scoring of gene candidates. With large scale training, our method provides fast single fragment predictions with good sensitivity and specificity on artificially fragmented genomic DNA. Additionally, this method is able to predict translation initiation sites accurately and distinguishes complete from incomplete genes with high reliability. CONCLUSION: Large scale machine learning methods are well suited for gene prediction in metagenomic DNA fragments. In particular, the combination of linear discriminants and neural networks is promising and should be considered for integration into metagenomic analysis pipelines. The data sets can be downloaded from the URL provided (see Availability and requirements section).
format	Text
id	pubmed-2409338
institution	National Center for Biotechnology Information
language	English
publishDate	2008
publisher	BioMed Central
record_format	MEDLINE/PubMed
spelling	pubmed-2409338 2008-06-04 Gene prediction in metagenomic fragments: A large scale machine learning approach. Hoff, Katharina J; Tech, Maike; Lingner, Thomas; Daniel, Rolf; Morgenstern, Burkhard; Meinicke, Peter. BMC Bioinformatics. Methodology Article. BioMed Central 2008-04-28. /pmc/articles/PMC2409338/ /pubmed/18442389 http://dx.doi.org/10.1186/1471-2105-9-217 Text en. Copyright © 2008 Hoff et al; licensee BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.
title	Gene prediction in metagenomic fragments: A large scale machine learning approach
topic	Methodology Article
url	https://www.ncbi.nlm.nih.gov/pmc/articles/PMC2409338/ https://www.ncbi.nlm.nih.gov/pubmed/18442389 http://dx.doi.org/10.1186/1471-2105-9-217
