Classifying Incomplete Gene-Expression Data: Ensemble Learning with Non-Pre-Imputation Feature Filtering and Best-First Search Technique
(1) Background: Gene-expression data usually contain missing values (MVs). Numerous methods focused on how to estimate MVs have been proposed in the past few years. Recent studies show that those imputation algorithms made little difference in classification. Thus, some scholars believe that how to...
Main authors: | Yan, Yuanting; Dai, Tao; Yang, Meili; Du, Xiuquan; Zhang, Yiwen; Zhang, Yanping |
---|---|
Format: | Online Article Text |
Language: | English |
Published: | MDPI 2018 |
Online access: | https://www.ncbi.nlm.nih.gov/pmc/articles/PMC6274900/ https://www.ncbi.nlm.nih.gov/pubmed/30380746 http://dx.doi.org/10.3390/ijms19113398 |
author | Yan, Yuanting Dai, Tao Yang, Meili Du, Xiuquan Zhang, Yiwen Zhang, Yanping |
collection | PubMed |
description | (1) Background: Gene-expression data usually contain missing values (MVs). Numerous methods for estimating MVs have been proposed in the past few years. Recent studies show that these imputation algorithms make little difference in classification. Thus, some scholars believe that selecting the informative genes for downstream classification is more important than how MVs are imputed. However, most feature-selection (FS) algorithms require prior imputation, and the impact of prior MV imputation on downstream FS performance is seldom considered. (2) Method: A modified chi-square test-based FS is introduced for gene-expression data. To deal with the small sample size of gene-expression data, a heuristic method called recursive element aggregation is proposed in this study. Our approach can directly handle incomplete data without any imputation methods or missing-data assumptions. The most informative genes can be selected through a threshold. After that, the best-first search strategy is used to find optimal feature subsets for classification. (3) Results: We compare our method with several FS algorithms. Evaluation is performed on twelve original incomplete cancer gene-expression datasets. We demonstrate that MV imputation on an incomplete dataset affects subsequent FS in terms of classification tasks. By conducting FS directly on incomplete data, our method can avoid potential disturbances to subsequent FS procedures caused by MV imputation. An experiment on the small round blue cell tumor (SRBCT) dataset showed that our method found additional genes, besides many genes in common with the two existing methods compared. |
id | pubmed-6274900 |
institution | National Center for Biotechnology Information |
record_format | MEDLINE/PubMed |
spelling | pubmed-6274900 2018-12-15 Int J Mol Sci Article MDPI 2018-10-30 /pmc/articles/PMC6274900/ /pubmed/30380746 http://dx.doi.org/10.3390/ijms19113398 Text en © 2018 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (http://creativecommons.org/licenses/by/4.0/). |
topic | Article |