MGC: a metagenomic gene caller
BACKGROUND: Computational gene finding algorithms have proven their robustness in identifying genes in complete genomes. However, metagenomic sequencing has presented new challenges due to the incomplete and fragmented nature of the data. During the last few years, attempts have been made to extract...
Main Authors: | El Allali, Achraf; Rose, John R |
---|---|
Format: | Online Article Text |
Language: | English |
Published: | BioMed Central 2013 |
Subjects: | |
Online Access: | https://www.ncbi.nlm.nih.gov/pmc/articles/PMC3698006/ https://www.ncbi.nlm.nih.gov/pubmed/23901840 http://dx.doi.org/10.1186/1471-2105-14-S9-S6 |
_version_ | 1782275221983592448 |
---|---|
author | El Allali, Achraf; Rose, John R |
author_facet | El Allali, Achraf; Rose, John R |
author_sort | El Allali, Achraf |
collection | PubMed |
description | BACKGROUND: Computational gene finding algorithms have proven their robustness in identifying genes in complete genomes. However, metagenomic sequencing has presented new challenges due to the incomplete and fragmented nature of the data. During the last few years, attempts have been made to extract complete and incomplete open reading frames (ORFs) directly from short reads and identify the coding ORFs, bypassing other challenging tasks such as the assembly of the metagenome. RESULTS: In this paper we introduce a metagenomics gene caller (MGC) which is an improvement over the state-of-the-art prediction algorithm Orphelia. Orphelia uses a two-stage machine learning approach and computes a model that classifies extracted ORFs from fragmented sequences. We hypothesise and demonstrate evidence that sequences need separate models based on their local GC-content in order to avoid the noise introduced to a single model computed with sequences from the entire GC spectrum. We have also added two amino-acid features based on the benefit of amino-acid usage shown in our previous research. Our algorithm is able to predict genes and translation initiation sites (TIS) more accurately than Orphelia which uses a single model. CONCLUSIONS: Learning separate models for several pre-defined GC-content regions as opposed to a single model approach improves the performance of the neural network as demonstrated by the experimental results presented in this paper. The inclusion of amino-acid usage features also helps improve the overall accuracy of our algorithm. MGC's improvement sets the ground for further investigation into the use of GC-content to separate data for training models in machine learning based gene finders. |
format | Online Article Text |
id | pubmed-3698006 |
institution | National Center for Biotechnology Information |
language | English |
publishDate | 2013 |
publisher | BioMed Central |
record_format | MEDLINE/PubMed |
spelling | pubmed-3698006 2013-07-02 MGC: a metagenomic gene caller El Allali, Achraf; Rose, John R BMC Bioinformatics Methodology Article BioMed Central 2013-06-28 /pmc/articles/PMC3698006/ /pubmed/23901840 http://dx.doi.org/10.1186/1471-2105-14-S9-S6 Text en Copyright © 2013 El Allali and Rose; licensee BioMed Central Ltd. http://creativecommons.org/licenses/by/2.0 This is an open access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited. |
spellingShingle | Methodology Article El Allali, Achraf Rose, John R MGC: a metagenomic gene caller |
title | MGC: a metagenomic gene caller |
title_full | MGC: a metagenomic gene caller |
title_fullStr | MGC: a metagenomic gene caller |
title_full_unstemmed | MGC: a metagenomic gene caller |
title_short | MGC: a metagenomic gene caller |
title_sort | mgc: a metagenomic gene caller |
topic | Methodology Article |
url | https://www.ncbi.nlm.nih.gov/pmc/articles/PMC3698006/ https://www.ncbi.nlm.nih.gov/pubmed/23901840 http://dx.doi.org/10.1186/1471-2105-14-S9-S6 |
work_keys_str_mv | AT elallaliachraf mgcametagenomicgenecaller AT rosejohnr mgcametagenomicgenecaller |
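
The description field states that MGC learns separate models for several pre-defined GC-content regions rather than one model for the entire GC spectrum. The routing step this implies could be sketched roughly as follows. This is only an illustration: the boundary values in `GC_BOUNDARIES` and the function names are hypothetical, since this record does not give the ranges or the implementation the authors used.

```python
from bisect import bisect_right

# Hypothetical upper bounds (as GC fractions) separating the GC-content ranges.
# The record does not state the boundaries MGC actually uses.
GC_BOUNDARIES = [0.35, 0.45, 0.55, 0.65]


def gc_content(sequence: str) -> float:
    """Return the fraction of G and C among the A/C/G/T bases of a sequence."""
    seq = sequence.upper()
    counted = sum(seq.count(base) for base in "ACGT")
    if counted == 0:
        raise ValueError("sequence contains no A, C, G, or T bases")
    return (seq.count("G") + seq.count("C")) / counted


def select_model_index(sequence: str, boundaries=GC_BOUNDARIES) -> int:
    """Map a fragment to the index of the GC range (and so the model) it falls in."""
    return bisect_right(boundaries, gc_content(sequence))
```

With these assumed boundaries, a fragment would be scored by the model at index `select_model_index(fragment)`, so that each model only sees ORFs from fragments in its own GC range during both training and prediction.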