Feature selection for gene prediction in metagenomic fragments
BACKGROUND: Computational approaches, specifically machine-learning techniques, play an important role in many metagenomic analysis algorithms, such as gene prediction. Due to the large feature space, current de novo gene prediction algorithms use different combinations of classification algorithms...
Main authors: | Al-Ajlan, Amani; El Allali, Achraf |
---|---|
Format: | Online Article Text |
Language: | English |
Published: |
BioMed Central
2018
|
Subjects: | Methodology |
Online access: | https://www.ncbi.nlm.nih.gov/pmc/articles/PMC6047368/ https://www.ncbi.nlm.nih.gov/pubmed/30026811 http://dx.doi.org/10.1186/s13040-018-0170-z |
_version_ | 1783339938876489728 |
---|---|
author | Al-Ajlan, Amani El Allali, Achraf |
author_facet | Al-Ajlan, Amani El Allali, Achraf |
author_sort | Al-Ajlan, Amani |
collection | PubMed |
description | BACKGROUND: Computational approaches, specifically machine-learning techniques, play an important role in many metagenomic analysis algorithms, such as gene prediction. Due to the large feature space, current de novo gene prediction algorithms use different combinations of classification algorithms to distinguish between coding and non-coding sequences. RESULTS: In this study, we apply a filter method to select relevant features from a large set of known features instead of combining them using linear classifiers or ignoring their individual coding potential. We use minimum redundancy maximum relevance (mRMR) to select the most relevant features. Support vector machines (SVM) are trained using these features, and the classification score is transformed into the posterior probability of the coding class. A greedy algorithm uses the probability of overlapping candidate genes to select the final genes. Instead of using one model for all sequences, we train an ensemble of SVM models on mutually exclusive datasets based on GC content and use the appropriate model to classify candidate genes based on their read’s GC content. CONCLUSION: Our proposed algorithm achieves an improvement over some existing algorithms. mRMR produces promising results in gene prediction. It improves classification performance and feature interpretation. Our research serves as a basis for future studies on feature selection for gene prediction. |
format | Online Article Text |
id | pubmed-6047368 |
institution | National Center for Biotechnology Information |
language | English |
publishDate | 2018 |
publisher | BioMed Central |
record_format | MEDLINE/PubMed |
spelling | pubmed-60473682018-07-19 Feature selection for gene prediction in metagenomic fragments Al-Ajlan, Amani El Allali, Achraf BioData Min Methodology BACKGROUND: Computational approaches, specifically machine-learning techniques, play an important role in many metagenomic analysis algorithms, such as gene prediction. Due to the large feature space, current de novo gene prediction algorithms use different combinations of classification algorithms to distinguish between coding and non-coding sequences. RESULTS: In this study, we apply a filter method to select relevant features from a large set of known features instead of combining them using linear classifiers or ignoring their individual coding potential. We use minimum redundancy maximum relevance (mRMR) to select the most relevant features. Support vector machines (SVM) are trained using these features, and the classification score is transformed into the posterior probability of the coding class. A greedy algorithm uses the probability of overlapping candidate genes to select the final genes. Instead of using one model for all sequences, we train an ensemble of SVM models on mutually exclusive datasets based on GC content and use the appropriate model to classify candidate genes based on their read’s GC content. CONCLUSION: Our proposed algorithm achieves an improvement over some existing algorithms. mRMR produces promising results in gene prediction. It improves classification performance and feature interpretation. Our research serves as a basis for future studies on feature selection for gene prediction. 
BioMed Central 2018-06-07 /pmc/articles/PMC6047368/ /pubmed/30026811 http://dx.doi.org/10.1186/s13040-018-0170-z Text en © The Author(s) 2018 Open Access This article is distributed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated. |
spellingShingle | Methodology Al-Ajlan, Amani El Allali, Achraf Feature selection for gene prediction in metagenomic fragments |
title | Feature selection for gene prediction in metagenomic fragments |
title_full | Feature selection for gene prediction in metagenomic fragments |
title_fullStr | Feature selection for gene prediction in metagenomic fragments |
title_full_unstemmed | Feature selection for gene prediction in metagenomic fragments |
title_short | Feature selection for gene prediction in metagenomic fragments |
title_sort | feature selection for gene prediction in metagenomic fragments |
topic | Methodology |
url | https://www.ncbi.nlm.nih.gov/pmc/articles/PMC6047368/ https://www.ncbi.nlm.nih.gov/pubmed/30026811 http://dx.doi.org/10.1186/s13040-018-0170-z |
work_keys_str_mv | AT alajlanamani featureselectionforgenepredictioninmetagenomicfragments AT elallaliachraf featureselectionforgenepredictioninmetagenomicfragments |
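
The description field above mentions two steps that a short example may make more concrete: choosing a model by a read's GC content, and greedily picking final genes among overlapping candidates by their coding probability. The Python sketch below is only an illustration under stated assumptions, not the authors' implementation. The `Candidate` class, the GC bin bounds, the 0.5 probability threshold, and the rule that any shared position counts as an overlap are all hypothetical; the record does not specify them.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Candidate:
    """Hypothetical candidate gene on a fragment (not from the record)."""
    start: int    # inclusive start position
    end: int      # exclusive end position
    prob: float   # posterior probability of the coding class


def gc_content(seq: str) -> float:
    """Fraction of G and C bases in a read (0.0 for an empty read)."""
    seq = seq.upper()
    return (seq.count("G") + seq.count("C")) / len(seq) if seq else 0.0


def pick_model(gc: float, upper_bounds: list[float]) -> int:
    """Index of the first GC bin whose upper bound is >= gc.

    The bin bounds are an assumption; the record only says the models
    are trained on mutually exclusive GC-content ranges.
    """
    for i, bound in enumerate(upper_bounds):
        if gc <= bound:
            return i
    return len(upper_bounds) - 1


def overlaps(a: Candidate, b: Candidate) -> bool:
    """True if the two half-open intervals share at least one position."""
    return a.start < b.end and b.start < a.end


def greedy_select(cands: list[Candidate], threshold: float = 0.5) -> list[Candidate]:
    """Keep the most probable candidates that do not overlap a kept one."""
    chosen: list[Candidate] = []
    for c in sorted(cands, key=lambda c: c.prob, reverse=True):
        if c.prob < threshold:
            break  # remaining candidates are even less probable
        if not any(overlaps(c, k) for k in chosen):
            chosen.append(c)
    return sorted(chosen, key=lambda c: c.start)


if __name__ == "__main__":
    cands = [
        Candidate(0, 300, 0.9),
        Candidate(200, 500, 0.8),
        Candidate(300, 600, 0.7),
        Candidate(700, 900, 0.3),
    ]
    print(pick_model(gc_content("ATGC"), [0.4, 0.6, 1.0]))
    print(greedy_select(cands))
```

In this sketch the candidate with probability 0.8 is dropped because it overlaps the higher-scoring one, and the candidate with probability 0.3 falls below the assumed threshold. A real gene finder would likely tolerate short overlaps between neighbouring genes, which this simplification does not model.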