Combining gene prediction methods to improve metagenomic gene annotation
BACKGROUND: Traditional gene annotation methods rely on characteristics that may not be available in short reads generated from next generation technology, resulting in suboptimal performance for metagenomic (environmental) samples. Therefore, in recent years, new programs have been developed that o...
Main authors: | Yok, Non G; Rosen, Gail L |
---|---|
Format: | Text |
Language: | English |
Published: | BioMed Central 2011 |
Subjects: | |
Online access: | https://www.ncbi.nlm.nih.gov/pmc/articles/PMC3042383/ https://www.ncbi.nlm.nih.gov/pubmed/21232129 http://dx.doi.org/10.1186/1471-2105-12-20 |
_version_ | 1782198537661972480 |
---|---|
author | Yok, Non G Rosen, Gail L |
author_facet | Yok, Non G Rosen, Gail L |
author_sort | Yok, Non G |
collection | PubMed |
description | BACKGROUND: Traditional gene annotation methods rely on characteristics that may not be available in short reads generated from next generation technology, resulting in suboptimal performance for metagenomic (environmental) samples. Therefore, in recent years, new programs have been developed that optimize performance on short reads. In this work, we benchmark three metagenomic gene prediction programs and combine their predictions to improve metagenomic read gene annotation. RESULTS: We not only analyze the programs' performance at different read-lengths like similar studies, but also separate different types of reads, including intra- and intergenic regions, for analysis. The main deficiencies are in the algorithms' ability to predict non-coding regions and gene edges, resulting in more false-positives and false-negatives than desired. In fact, the specificities of the algorithms are notably worse than the sensitivities. By combining the programs' predictions, we show significant improvement in specificity at minimal cost to sensitivity, resulting in 4% improvement in accuracy for 100 bp reads with ~1% improvement in accuracy for 200 bp reads and above. To correctly annotate the start and stop of the genes, we find that a consensus of all the predictors performs best for shorter read lengths while a unanimous agreement is better for longer read lengths, boosting annotation accuracy by 1-8%. We also demonstrate use of the classifier combinations on a real dataset. CONCLUSIONS: To optimize the performance for both prediction and annotation accuracies, we conclude that the consensus of all methods (or a majority vote) is the best for reads 400 bp and shorter, while using the intersection of GeneMark and Orphelia predictions is the best for reads 500 bp and longer. We demonstrate that most methods predict over 80% coding (including partially coding) reads on a real human gut sample sequenced by Illumina technology. |
format | Text |
id | pubmed-3042383 |
institution | National Center for Biotechnology Information |
language | English |
publishDate | 2011 |
publisher | BioMed Central |
record_format | MEDLINE/PubMed |
spelling | pubmed-3042383 2011-02-25 Combining gene prediction methods to improve metagenomic gene annotation Yok, Non G Rosen, Gail L BMC Bioinformatics Research Article BioMed Central 2011-01-13 /pmc/articles/PMC3042383/ /pubmed/21232129 http://dx.doi.org/10.1186/1471-2105-12-20 Text en Copyright ©2011 Yok and Rosen; licensee BioMed Central Ltd. http://creativecommons.org/licenses/by/2.0 This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited. |
spellingShingle | Research Article Yok, Non G Rosen, Gail L Combining gene prediction methods to improve metagenomic gene annotation |
title | Combining gene prediction methods to improve metagenomic gene annotation |
title_full | Combining gene prediction methods to improve metagenomic gene annotation |
title_fullStr | Combining gene prediction methods to improve metagenomic gene annotation |
title_full_unstemmed | Combining gene prediction methods to improve metagenomic gene annotation |
title_short | Combining gene prediction methods to improve metagenomic gene annotation |
title_sort | combining gene prediction methods to improve metagenomic gene annotation |
topic | Research Article |
url | https://www.ncbi.nlm.nih.gov/pmc/articles/PMC3042383/ https://www.ncbi.nlm.nih.gov/pubmed/21232129 http://dx.doi.org/10.1186/1471-2105-12-20 |
work_keys_str_mv | AT yoknong combininggenepredictionmethodstoimprovemetagenomicgeneannotation AT rosengaill combininggenepredictionmethodstoimprovemetagenomicgeneannotation |
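
The combination rule stated in the description field (majority vote for reads of 400 bp and shorter, intersection of GeneMark and Orphelia for reads of 500 bp and longer) can be illustrated with a minimal Python sketch. This is not the authors' code. The function names, the dictionary keys, and the `third_predictor` label are hypothetical, since the abstract does not name the third program. Two further assumptions: read lengths strictly between 400 and 500 bp, which the abstract does not cover, fall back to the majority vote, and sensitivity and specificity use the common definitions TP/(TP+FN) and TN/(TN+FP), which may differ from the definitions used in the paper.

```python
from typing import Dict, Sequence, Tuple


def majority_vote(calls: Dict[str, bool]) -> bool:
    """Return True when more than half of the predictors call the read coding."""
    return 2 * sum(calls.values()) > len(calls)


def combine_calls(calls: Dict[str, bool], read_length_bp: int) -> bool:
    """Combine per-read coding calls following the rule in the abstract.

    Keys "genemark" and "orphelia" are hypothetical labels for the two
    programs the abstract names; any other key is treated as one more voter.
    """
    if read_length_bp >= 500:
        # Intersection: both GeneMark and Orphelia must call the read coding.
        return calls["genemark"] and calls["orphelia"]
    # Assumption: lengths between 400 and 500 bp are handled like short reads.
    return majority_vote(calls)


def sensitivity_specificity(
    predicted: Sequence[bool], actual: Sequence[bool]
) -> Tuple[float, float]:
    """Compute (sensitivity, specificity) with the common definitions (assumed)."""
    tp = sum(p and a for p, a in zip(predicted, actual))
    tn = sum((not p) and (not a) for p, a in zip(predicted, actual))
    fp = sum(p and (not a) for p, a in zip(predicted, actual))
    fn = sum((not p) and a for p, a in zip(predicted, actual))
    sensitivity = tp / (tp + fn) if (tp + fn) else float("nan")
    specificity = tn / (tn + fp) if (tn + fp) else float("nan")
    return sensitivity, specificity


if __name__ == "__main__":
    # Made-up calls for one read, purely for illustration.
    example = {"genemark": True, "orphelia": False, "third_predictor": True}
    print(combine_calls(example, read_length_bp=100))
    print(combine_calls(example, read_length_bp=500))
```

With the made-up calls above, two of three predictors agree, so the short read is labeled coding, while the 500 bp rule rejects it because Orphelia disagrees. This mirrors the trade-off the abstract reports: stricter agreement raises specificity at some cost to sensitivity.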