MGC: a metagenomic gene caller

BACKGROUND: Computational gene finding algorithms have proven their robustness in identifying genes in complete genomes. However, metagenomic sequencing has presented new challenges due to the incomplete and fragmented nature of the data. During the last few years, attempts have been made to extract...

Bibliographic Details
Main authors:	El Allali, Achraf, Rose, John R
Format:	Online Article Text
Language:	English
Published:	BioMed Central 2013
Subjects:	Methodology Article
Online access:	https://www.ncbi.nlm.nih.gov/pmc/articles/PMC3698006/ https://www.ncbi.nlm.nih.gov/pubmed/23901840 http://dx.doi.org/10.1186/1471-2105-14-S9-S6

_version_	1782275221983592448
author	El Allali, Achraf Rose, John R
author_facet	El Allali, Achraf Rose, John R
author_sort	El Allali, Achraf
collection	PubMed
description	BACKGROUND: Computational gene finding algorithms have proven their robustness in identifying genes in complete genomes. However, metagenomic sequencing has presented new challenges due to the incomplete and fragmented nature of the data. During the last few years, attempts have been made to extract complete and incomplete open reading frames (ORFs) directly from short reads and identify the coding ORFs, bypassing other challenging tasks such as the assembly of the metagenome. RESULTS: In this paper we introduce a metagenomics gene caller (MGC) which is an improvement over the state-of-the-art prediction algorithm Orphelia. Orphelia uses a two-stage machine learning approach and computes a model that classifies extracted ORFs from fragmented sequences. We hypothesise and demonstrate evidence that sequences need separate models based on their local GC-content in order to avoid the noise introduced to a single model computed with sequences from the entire GC spectrum. We have also added two amino-acid features based on the benefit of amino-acid usage shown in our previous research. Our algorithm is able to predict genes and translation initiation sites (TIS) more accurately than Orphelia which uses a single model. CONCLUSIONS: Learning separate models for several pre-defined GC-content regions as opposed to a single model approach improves the performance of the neural network as demonstrated by the experimental results presented in this paper. The inclusion of amino-acid usage features also helps improve the overall accuracy of our algorithm. MGC's improvement sets the ground for further investigation into the use of GC-content to separate data for training models in machine learning based gene finders.
format	Online Article Text
id	pubmed-3698006
institution	National Center for Biotechnology Information
language	English
publishDate	2013
publisher	BioMed Central
record_format	MEDLINE/PubMed
spelling	pubmed-3698006 2013-07-02 MGC: a metagenomic gene caller El Allali, Achraf Rose, John R BMC Bioinformatics Methodology Article
BioMed Central 2013-06-28 /pmc/articles/PMC3698006/ /pubmed/23901840 http://dx.doi.org/10.1186/1471-2105-14-S9-S6 Text en Copyright © 2013 El Allali and Rose; licensee BioMed Central Ltd. http://creativecommons.org/licenses/by/2.0 This is an open access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.
spellingShingle	Methodology Article El Allali, Achraf Rose, John R MGC: a metagenomic gene caller
title	MGC: a metagenomic gene caller
topic	Methodology Article
url	https://www.ncbi.nlm.nih.gov/pmc/articles/PMC3698006/ https://www.ncbi.nlm.nih.gov/pubmed/23901840 http://dx.doi.org/10.1186/1471-2105-14-S9-S6
work_keys_str_mv	AT elallaliachraf mgcametagenomicgenecaller AT rosejohnr mgcametagenomicgenecaller
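
The description field above states that MGC learns separate models for several pre-defined GC-content ranges rather than one model for the whole GC spectrum. A minimal sketch of that routing step follows, assuming hypothetical range boundaries and hypothetical function names (`gc_content`, `model_index`); this record does not give the ranges, features, or network details that MGC actually uses.

```python
from bisect import bisect_right

# Hypothetical GC-content boundaries in percent. The catalog record does not
# state which ranges MGC uses; these values are for illustration only.
GC_BOUNDARIES = [30.0, 40.0, 50.0, 60.0, 70.0]


def gc_content(seq: str) -> float:
    """Return the percentage of G and C among the unambiguous bases A, C, G, T."""
    seq = seq.upper()
    gc = seq.count("G") + seq.count("C")
    total = gc + seq.count("A") + seq.count("T")
    # An empty or fully ambiguous fragment is reported as 0.0 by convention here.
    return 100.0 * gc / total if total else 0.0


def model_index(seq: str, boundaries=GC_BOUNDARIES) -> int:
    """Return the index of the GC range, and hence of the model, for a fragment."""
    # bisect_right places a value equal to a boundary into the higher range.
    return bisect_right(boundaries, gc_content(seq))
```

With these assumed boundaries there are six ranges (indices 0 to 5), and a classifier trained only on fragments from the matching range would then score the ORFs extracted from the fragment.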