
Identification of Orphan Genes in Unbalanced Datasets Based on Ensemble Learning

Orphan genes are associated with regulatory patterns, but experimental methods for identifying orphan genes are both time-consuming and expensive. Designing an accurate and robust classification model to detect orphan and non-orphan genes in unbalanced distribution datasets poses a particularly huge...


Bibliographic Details
Main authors:	Gao, Qijuan; Jin, Xiu; Xia, Enhua; Wu, Xiangwei; Gu, Lichuan; Yan, Hanwei; Xia, Yingchun; Li, Shaowen
Format:	Online Article Text
Language:	English
Published:	Frontiers Media S.A. 2020
Subjects:	Genetics
Online access:	https://www.ncbi.nlm.nih.gov/pmc/articles/PMC7567012/ https://www.ncbi.nlm.nih.gov/pubmed/33133122 http://dx.doi.org/10.3389/fgene.2020.00820

author	Gao, Qijuan; Jin, Xiu; Xia, Enhua; Wu, Xiangwei; Gu, Lichuan; Yan, Hanwei; Xia, Yingchun; Li, Shaowen
collection	PubMed
description	Orphan genes are associated with regulatory patterns, but experimental methods for identifying orphan genes are both time-consuming and expensive. Designing an accurate and robust classification model to detect orphan and non-orphan genes in unbalanced distribution datasets poses a particularly huge challenge. Synthetic minority over-sampling algorithms (SMOTE) are selected in a preliminary step to deal with unbalanced gene datasets. To identify orphan genes in balanced and unbalanced Arabidopsis thaliana gene datasets, SMOTE algorithms were then combined with traditional and advanced ensemble classified algorithms respectively, using Support Vector Machine, Random Forest (RF), AdaBoost (adaptive boosting), GBDT (gradient boosting decision tree), and XGBoost (extreme gradient boosting). After comparing the performance of these ensemble models, SMOTE algorithms with XGBoost achieved an F1 score of 0.94 with the balanced A. thaliana gene datasets, but a lower score with the unbalanced datasets. The proposed ensemble method combines different balanced data algorithms including Borderline SMOTE (BSMOTE), Adaptive Synthetic Sampling (ADSYN), SMOTE-Tomek, and SMOTE-ENN with the XGBoost model separately. The performances of the SMOTE-ENN-XGBoost model, which combined over-sampling and under-sampling algorithms with XGBoost, achieved higher predictive accuracy than the other balanced algorithms with XGBoost models. Thus, SMOTE-ENN-XGBoost provides a theoretical basis for developing evaluation criteria for identifying orphan genes in unbalanced and biological datasets.
format	Online Article Text
id	pubmed-7567012
institution	National Center for Biotechnology Information
language	English
publishDate	2020
publisher	Frontiers Media S.A.
record_format	MEDLINE/PubMed
spelling	pubmed-7567012 2020-10-30 Front Genet Frontiers Media S.A. 2020-10-02 /pmc/articles/PMC7567012/ /pubmed/33133122 http://dx.doi.org/10.3389/fgene.2020.00820 Text en Copyright © 2020 Gao, Jin, Xia, Wu, Gu, Yan, Xia and Li. http://creativecommons.org/licenses/by/4.0/ This is an open-access article distributed under the terms of the Creative Commons Attribution License (CC BY). The use, distribution or reproduction in other forums is permitted, provided the original author(s) and the copyright owner(s) are credited and that the original publication in this journal is cited, in accordance with accepted academic practice. No use, distribution or reproduction is permitted which does not comply with these terms.
title	Identification of Orphan Genes in Unbalanced Datasets Based on Ensemble Learning
topic	Genetics
url	https://www.ncbi.nlm.nih.gov/pmc/articles/PMC7567012/ https://www.ncbi.nlm.nih.gov/pubmed/33133122 http://dx.doi.org/10.3389/fgene.2020.00820
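Illustrative note on the description field: the record names SMOTE-ENN as the resampling step that precedes XGBoost. The sketch below is not the authors' code. It is a minimal pure-Python illustration of the two ideas, assuming Euclidean distance, k = 3, and a majority-vote cleaning rule; the function names (nearest, smote, enn) and the toy points are hypothetical. SMOTE adds synthetic minority samples by interpolating between a minority sample and one of its nearest minority neighbours; ENN then drops samples whose label disagrees with most of their neighbours.

```python
import math
import random


def nearest(idx, points, k):
    """Return indices of the k points closest to points[idx], excluding idx."""
    others = [j for j in range(len(points)) if j != idx]
    others.sort(key=lambda j: math.dist(points[idx], points[j]))
    return others[:k]


def smote(minority, n_new, k=3, seed=0):
    """Create n_new synthetic samples, each on the segment between a random
    minority sample and one of its k nearest minority neighbours."""
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        i = rng.randrange(len(minority))
        j = rng.choice(nearest(i, minority, k))
        gap = rng.random()  # position along the segment, in [0, 1)
        synthetic.append(tuple(a + gap * (b - a)
                               for a, b in zip(minority[i], minority[j])))
    return synthetic


def enn(X, y, k=3):
    """Edited Nearest Neighbours (majority-vote variant, an assumption here):
    keep a sample only if most of its k nearest neighbours share its label."""
    keep = []
    for i in range(len(X)):
        votes = [y[j] for j in nearest(i, X, k)]
        if max(set(votes), key=votes.count) == y[i]:
            keep.append(i)
    return [X[i] for i in keep], [y[i] for i in keep]


# Toy data (made up): label 1 = minority class, label 0 = majority class.
minority = [(1.0, 1.0), (1.2, 0.9), (0.9, 1.3), (1.1, 1.1)]
majority = [(3.0, 3.0), (3.2, 2.9), (2.8, 3.1), (3.1, 3.3),
            (2.9, 2.7), (3.3, 3.1), (3.0, 2.8), (1.05, 1.05)]

# Over-sample the minority class until both classes have the same size.
new_points = smote(minority, n_new=len(majority) - len(minority))
X = minority + new_points + majority
y = [1] * (len(minority) + len(new_points)) + [0] * len(majority)

# Clean the balanced set; the lone majority point inside the minority
# cluster is outvoted by its neighbours and removed.
X_clean, y_clean = enn(X, y)
print(len(X), len(X_clean))
```

In practice a maintained implementation would normally be used instead of hand-written resampling, and the cleaned training set would then be passed to the classifier. Resampling should be applied to the training split only, so that evaluation scores such as F1 are measured on untouched data.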
