
Impact of missing data imputation methods on gene expression clustering and classification

BACKGROUND: Several missing value imputation methods for gene expression data have been proposed in the literature. In the past few years, researchers have been putting a great deal of effort into presenting systematic evaluations of the different imputation algorithms. Initially, most algorithms we...


Bibliographic Details
Main Authors:	de Souto, Marcilio CP; Jaskowiak, Pablo A; Costa, Ivan G
Format:	Online Article Text
Language:	English
Published:	BioMed Central 2015
Subjects:	Research Article
Online Access:	https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4350881/ https://www.ncbi.nlm.nih.gov/pubmed/25888091 http://dx.doi.org/10.1186/s12859-015-0494-3

collection	PubMed
description	BACKGROUND: Several missing value imputation methods for gene expression data have been proposed in the literature. In the past few years, researchers have been putting a great deal of effort into presenting systematic evaluations of the different imputation algorithms. Initially, most algorithms were assessed with an emphasis on the accuracy of the imputation, using metrics such as the root mean squared error. However, it has become clear that the success of the estimation of the expression value should be evaluated in more practical terms as well. One can consider, for example, the ability of the method to preserve the significant genes in the dataset, or its discriminative/predictive power for classification/clustering purposes. RESULTS AND CONCLUSIONS: We performed a broad analysis of the impact of five well-known missing value imputation methods on three clustering and four classification methods, in the context of 12 cancer gene expression datasets. We employed a statistical framework, for the first time in this field, to assess whether different imputation methods improve the performance of the clustering/classification methods. Our results suggest that the imputation methods evaluated have a minor impact on the classification and downstream clustering analyses. Simple methods such as replacing the missing values by mean or the median values performed as well as more complex strategies. The datasets analyzed in this study are available at http://costalab.org/Imputation/. ELECTRONIC SUPPLEMENTARY MATERIAL: The online version of this article (doi:10.1186/s12859-015-0494-3) contains supplementary material, which is available to authorized users.
id	pubmed-4350881
institution	National Center for Biotechnology Information
record_format	MEDLINE/PubMed
spelling	pubmed-4350881 2015-03-06 BMC Bioinformatics Research Article BioMed Central 2015-02-26 /pmc/articles/PMC4350881/ /pubmed/25888091 http://dx.doi.org/10.1186/s12859-015-0494-3 Text en © de Souto et al.; licensee BioMed Central. 2015 This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/4.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly credited. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated.
