
MGC: a metagenomic gene caller

BACKGROUND: Computational gene finding algorithms have proven their robustness in identifying genes in complete genomes. However, metagenomic sequencing has presented new challenges due to the incomplete and fragmented nature of the data. During the last few years, attempts have been made to extract...


Bibliographic Details
Main authors: El Allali, Achraf, Rose, John R
Format: Online Article Text
Language: English
Published: BioMed Central 2013
Subjects:
Online access: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC3698006/
https://www.ncbi.nlm.nih.gov/pubmed/23901840
http://dx.doi.org/10.1186/1471-2105-14-S9-S6
_version_ 1782275221983592448
author El Allali, Achraf
Rose, John R
author_facet El Allali, Achraf
Rose, John R
author_sort El Allali, Achraf
collection PubMed
description BACKGROUND: Computational gene finding algorithms have proven their robustness in identifying genes in complete genomes. However, metagenomic sequencing has presented new challenges due to the incomplete and fragmented nature of the data. During the last few years, attempts have been made to extract complete and incomplete open reading frames (ORFs) directly from short reads and identify the coding ORFs, bypassing other challenging tasks such as the assembly of the metagenome. RESULTS: In this paper we introduce a metagenomic gene caller (MGC), which improves on the state-of-the-art prediction algorithm Orphelia. Orphelia uses a two-stage machine learning approach and computes a single model that classifies ORFs extracted from fragmented sequences. We hypothesise, and present evidence, that sequences need separate models based on their local GC-content in order to avoid the noise introduced into a single model computed with sequences from the entire GC spectrum. We have also added two amino-acid features, based on the benefit of amino-acid usage shown in our previous research. Our algorithm predicts genes and translation initiation sites (TIS) more accurately than Orphelia, which uses a single model. CONCLUSIONS: Learning separate models for several pre-defined GC-content regions, as opposed to a single-model approach, improves the performance of the neural network, as demonstrated by the experimental results presented in this paper. The inclusion of amino-acid usage features also helps improve the overall accuracy of our algorithm. MGC's improvement sets the ground for further investigation into the use of GC-content to separate data for training models in machine-learning-based gene finders.
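The idea of routing each fragment to a model trained for its own GC-content range can be illustrated with a minimal Python sketch. This is not the authors' implementation: the function names, the cut points in HYPOTHETICAL_EDGES, and the example fragment are assumptions made for illustration only, since this record does not state the GC ranges MGC actually uses.

```python
from bisect import bisect_right
from typing import Sequence


def gc_content(seq: str) -> float:
    """Return the percentage of G and C among the unambiguous bases A, C, G, T."""
    seq = seq.upper()
    gc = seq.count("G") + seq.count("C")
    total = gc + seq.count("A") + seq.count("T")
    if total == 0:
        raise ValueError("sequence contains no A, C, G or T bases")
    return 100.0 * gc / total


def gc_bin(gc_percent: float, edges: Sequence[float]) -> int:
    """Return the index of the GC range containing gc_percent.

    `edges` holds ascending cut points; a value equal to a cut point
    falls into the higher range.
    """
    return bisect_right(edges, gc_percent)


# Hypothetical cut points (assumption, not taken from the paper):
# four ranges, below 40, 40 to 50, 50 to 60, and 60 or above.
HYPOTHETICAL_EDGES = (40.0, 50.0, 60.0)

# Hypothetical fragment used only to exercise the functions.
fragment = "ATGGCGCCGTTAGCGGCCTGA"
model_index = gc_bin(gc_content(fragment), HYPOTHETICAL_EDGES)
```

In such a scheme, model_index would select which of the separately trained classifiers scores the ORFs extracted from the fragment, so that each classifier only sees sequences from its own part of the GC spectrum.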
format Online
Article
Text
id pubmed-3698006
institution National Center for Biotechnology Information
language English
publishDate 2013
publisher BioMed Central
record_format MEDLINE/PubMed
spelling pubmed-3698006 2013-07-02 MGC: a metagenomic gene caller El Allali, Achraf Rose, John R BMC Bioinformatics Methodology Article
BioMed Central 2013-06-28 /pmc/articles/PMC3698006/ /pubmed/23901840 http://dx.doi.org/10.1186/1471-2105-14-S9-S6 Text en Copyright © 2013 El Allali and Rose; licensee BioMed Central Ltd. http://creativecommons.org/licenses/by/2.0 This is an open access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.
spellingShingle Methodology Article
El Allali, Achraf
Rose, John R
MGC: a metagenomic gene caller
title MGC: a metagenomic gene caller
title_full MGC: a metagenomic gene caller
title_fullStr MGC: a metagenomic gene caller
title_full_unstemmed MGC: a metagenomic gene caller
title_short MGC: a metagenomic gene caller
title_sort mgc: a metagenomic gene caller
topic Methodology Article
url https://www.ncbi.nlm.nih.gov/pmc/articles/PMC3698006/
https://www.ncbi.nlm.nih.gov/pubmed/23901840
http://dx.doi.org/10.1186/1471-2105-14-S9-S6
work_keys_str_mv AT elallaliachraf mgcametagenomicgenecaller
AT rosejohnr mgcametagenomicgenecaller