Classifying Incomplete Gene-Expression Data: Ensemble Learning with Non-Pre-Imputation Feature Filtering and Best-First Search Technique

(1) Background: Gene-expression data usually contain missing values (MVs). Numerous methods focused on how to estimate MVs have been proposed in the past few years. Recent studies show that those imputation algorithms made little difference in classification. Thus, some scholars believe that how to...

Bibliographic Details
Main Authors: Yan, Yuanting, Dai, Tao, Yang, Meili, Du, Xiuquan, Zhang, Yiwen, Zhang, Yanping
Format: Online Article Text
Language: English
Published: MDPI 2018
Subjects:
Online Access: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC6274900/
https://www.ncbi.nlm.nih.gov/pubmed/30380746
http://dx.doi.org/10.3390/ijms19113398
_version_ 1783377714838765568
author Yan, Yuanting
Dai, Tao
Yang, Meili
Du, Xiuquan
Zhang, Yiwen
Zhang, Yanping
author_facet Yan, Yuanting
Dai, Tao
Yang, Meili
Du, Xiuquan
Zhang, Yiwen
Zhang, Yanping
author_sort Yan, Yuanting
collection PubMed
description (1) Background: Gene-expression data usually contain missing values (MVs). Numerous methods focused on how to estimate MVs have been proposed in the past few years. Recent studies show that those imputation algorithms made little difference in classification. Thus, some scholars believe that how to select the informative genes for downstream classification is more important than how to impute MVs. However, most feature-selection (FS) algorithms need beforehand imputation, and the impact of beforehand MV imputation on downstream FS performance is seldom considered. (2) Method: A modified chi-square test-based FS is introduced for gene-expression data. To deal with the challenge of a small sample size of gene-expression data, a heuristic method called recursive element aggregation is proposed in this study. Our approach can directly handle incomplete data without any imputation methods or missing-data assumptions. The most informative genes can be selected through a threshold. After that, the best-first search strategy is utilized to find optimal feature subsets for classification. (3) Results: We compare our method with several FS algorithms. Evaluation is performed on twelve original incomplete cancer gene-expression datasets. We demonstrate that MV imputation on an incomplete dataset impacts subsequent FS in terms of classification tasks. Through directly conducting FS on incomplete data, our method can avoid potential disturbances on subsequent FS procedures caused by MV imputation. An experiment on small, round blue cell tumor (SRBCT) dataset showed that our method found additional genes besides many common genes with the two compared existing methods.
format Online
Article
Text
id pubmed-6274900
institution National Center for Biotechnology Information
language English
publishDate 2018
publisher MDPI
record_format MEDLINE/PubMed
spelling pubmed-6274900 2018-12-15 Classifying Incomplete Gene-Expression Data: Ensemble Learning with Non-Pre-Imputation Feature Filtering and Best-First Search Technique Yan, Yuanting Dai, Tao Yang, Meili Du, Xiuquan Zhang, Yiwen Zhang, Yanping Int J Mol Sci Article (1) Background: Gene-expression data usually contain missing values (MVs). Numerous methods focused on how to estimate MVs have been proposed in the past few years. Recent studies show that those imputation algorithms made little difference in classification. Thus, some scholars believe that how to select the informative genes for downstream classification is more important than how to impute MVs. However, most feature-selection (FS) algorithms need beforehand imputation, and the impact of beforehand MV imputation on downstream FS performance is seldom considered. (2) Method: A modified chi-square test-based FS is introduced for gene-expression data. To deal with the challenge of a small sample size of gene-expression data, a heuristic method called recursive element aggregation is proposed in this study. Our approach can directly handle incomplete data without any imputation methods or missing-data assumptions. The most informative genes can be selected through a threshold. After that, the best-first search strategy is utilized to find optimal feature subsets for classification. (3) Results: We compare our method with several FS algorithms. Evaluation is performed on twelve original incomplete cancer gene-expression datasets. We demonstrate that MV imputation on an incomplete dataset impacts subsequent FS in terms of classification tasks. Through directly conducting FS on incomplete data, our method can avoid potential disturbances on subsequent FS procedures caused by MV imputation. An experiment on small, round blue cell tumor (SRBCT) dataset showed that our method found additional genes besides many common genes with the two compared existing methods.
MDPI 2018-10-30 /pmc/articles/PMC6274900/ /pubmed/30380746 http://dx.doi.org/10.3390/ijms19113398 Text en © 2018 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (http://creativecommons.org/licenses/by/4.0/).
spellingShingle Article
Yan, Yuanting
Dai, Tao
Yang, Meili
Du, Xiuquan
Zhang, Yiwen
Zhang, Yanping
Classifying Incomplete Gene-Expression Data: Ensemble Learning with Non-Pre-Imputation Feature Filtering and Best-First Search Technique
title Classifying Incomplete Gene-Expression Data: Ensemble Learning with Non-Pre-Imputation Feature Filtering and Best-First Search Technique
title_full Classifying Incomplete Gene-Expression Data: Ensemble Learning with Non-Pre-Imputation Feature Filtering and Best-First Search Technique
title_fullStr Classifying Incomplete Gene-Expression Data: Ensemble Learning with Non-Pre-Imputation Feature Filtering and Best-First Search Technique
title_full_unstemmed Classifying Incomplete Gene-Expression Data: Ensemble Learning with Non-Pre-Imputation Feature Filtering and Best-First Search Technique
title_short Classifying Incomplete Gene-Expression Data: Ensemble Learning with Non-Pre-Imputation Feature Filtering and Best-First Search Technique
title_sort classifying incomplete gene-expression data: ensemble learning with non-pre-imputation feature filtering and best-first search technique
topic Article
url https://www.ncbi.nlm.nih.gov/pmc/articles/PMC6274900/
https://www.ncbi.nlm.nih.gov/pubmed/30380746
http://dx.doi.org/10.3390/ijms19113398
work_keys_str_mv AT yanyuanting classifyingincompletegeneexpressiondataensemblelearningwithnonpreimputationfeaturefilteringandbestfirstsearchtechnique
AT daitao classifyingincompletegeneexpressiondataensemblelearningwithnonpreimputationfeaturefilteringandbestfirstsearchtechnique
AT yangmeili classifyingincompletegeneexpressiondataensemblelearningwithnonpreimputationfeaturefilteringandbestfirstsearchtechnique
AT duxiuquan classifyingincompletegeneexpressiondataensemblelearningwithnonpreimputationfeaturefilteringandbestfirstsearchtechnique
AT zhangyiwen classifyingincompletegeneexpressiondataensemblelearningwithnonpreimputationfeaturefilteringandbestfirstsearchtechnique
AT zhangyanping classifyingincompletegeneexpressiondataensemblelearningwithnonpreimputationfeaturefilteringandbestfirstsearchtechnique