Feature selection for gene prediction in metagenomic fragments

BACKGROUND: Computational approaches, specifically machine-learning techniques, play an important role in many metagenomic analysis algorithms, such as gene prediction. Due to the large feature space, current de novo gene prediction algorithms use different combinations of classification algorithms...

Bibliographic Details
Main authors: Al-Ajlan, Amani; El Allali, Achraf
Format: Online Article Text
Language: English
Published: BioMed Central 2018
Subjects: Methodology
Online access: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC6047368/
https://www.ncbi.nlm.nih.gov/pubmed/30026811
http://dx.doi.org/10.1186/s13040-018-0170-z
author Al-Ajlan, Amani
El Allali, Achraf
collection PubMed
description BACKGROUND: Computational approaches, specifically machine-learning techniques, play an important role in many metagenomic analysis algorithms, such as gene prediction. Due to the large feature space, current de novo gene prediction algorithms use different combinations of classification algorithms to distinguish between coding and non-coding sequences. RESULTS: In this study, we apply a filter method to select relevant features from a large set of known features instead of combining them using linear classifiers or ignoring their individual coding potential. We use minimum redundancy maximum relevance (mRMR) to select the most relevant features. Support vector machines (SVM) are trained using these features, and the classification score is transformed into the posterior probability of the coding class. A greedy algorithm uses the probabilities of overlapping candidate genes to select the final genes. Instead of using one model for all sequences, we train an ensemble of SVM models on mutually exclusive datasets based on GC content and use the appropriate model to classify candidate genes based on their read’s GC content. CONCLUSION: Our proposed algorithm achieves an improvement over some existing algorithms. mRMR produces promising results in gene prediction. It improves classification performance and feature interpretation. Our research serves as a basis for future studies on feature selection for gene prediction.
format Online
Article
Text
id pubmed-6047368
institution National Center for Biotechnology Information
language English
publishDate 2018
publisher BioMed Central
record_format MEDLINE/PubMed
spelling pubmed-6047368 2018-07-19 Feature selection for gene prediction in metagenomic fragments Al-Ajlan, Amani El Allali, Achraf BioData Min Methodology
BioMed Central 2018-06-07 /pmc/articles/PMC6047368/ /pubmed/30026811 http://dx.doi.org/10.1186/s13040-018-0170-z Text en © The Author(s) 2018 Open Access This article is distributed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated.
title Feature selection for gene prediction in metagenomic fragments
topic Methodology
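
The description field above states that minimum redundancy maximum relevance (mRMR) is used to select the most relevant features. As a rough illustration of that idea only, the sketch below implements a greedy mRMR in its common "difference" form (relevance minus mean redundancy) on discrete features. It is not the authors' implementation: the feature names, the toy values, and the choice of the difference criterion are all assumptions made for this example, and the record does not say which mRMR variant or which features the paper uses.

```python
from collections import Counter
from math import log2


def mutual_information(xs, ys):
    """Mutual information in bits between two equal-length discrete sequences."""
    n = len(xs)
    px, py, pxy = Counter(xs), Counter(ys), Counter(zip(xs, ys))
    # Sum over observed pairs of p(x,y) * log2(p(x,y) / (p(x) * p(y))).
    return sum((c / n) * log2(c * n / (px[x] * py[y])) for (x, y), c in pxy.items())


def mrmr_score(name, features, relevance, selected):
    """Relevance of `name` minus its mean redundancy with already selected features."""
    if not selected:
        return relevance[name]
    redundancy = sum(
        mutual_information(features[name], features[s]) for s in selected
    ) / len(selected)
    return relevance[name] - redundancy


def mrmr_select(features, labels, k):
    """Greedily pick up to k feature names; `features` maps name -> discrete values."""
    relevance = {name: mutual_information(col, labels) for name, col in features.items()}
    selected, remaining = [], set(features)
    while remaining and len(selected) < k:
        # sorted() only makes tie-breaking deterministic.
        best = max(
            sorted(remaining),
            key=lambda name: mrmr_score(name, features, relevance, selected),
        )
        selected.append(best)
        remaining.remove(best)
    return selected


# Hypothetical toy data: 1 = coding, 0 = non-coding; feature values are made-up bins.
labels = [0, 0, 0, 1, 1, 1, 1, 1]
features = {
    "codon_usage": [0, 0, 0, 0, 1, 1, 1, 1],
    "codon_usage_noisy_copy": [0, 0, 0, 0, 1, 1, 1, 0],
    "orf_length": [0, 1, 0, 1, 0, 1, 0, 1],
}

chosen = mrmr_select(features, labels, k=2)
print(chosen)
```

On this toy data, ranking by relevance alone would place the noisy copy second, whereas the redundancy penalty makes the second pick the feature that is independent of the first. In a pipeline like the one the description outlines, the selected features would then be passed to the classifier; that step is not shown here.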