Combining data discretization and missing value imputation for incomplete medical datasets

Data discretization aims to transform a set of continuous features into discrete features, thus simplifying the representation of information and making it easier to understand, use, and explain. In practice, users can take advantage of the discretization process to improve knowledge discovery and d...

Bibliographic Details
Main authors:	Huang, Min-Wei; Tsai, Chih-Fong; Tsui, Shu-Ching; Lin, Wei-Chao
Format:	Online Article Text
Language:	English
Published:	Public Library of Science 2023
Subjects:	Research Article
Online access:	https://www.ncbi.nlm.nih.gov/pmc/articles/PMC10688879/ https://www.ncbi.nlm.nih.gov/pubmed/38033140 http://dx.doi.org/10.1371/journal.pone.0295032

_version_	1785152258707554304
author	Huang, Min-Wei Tsai, Chih-Fong Tsui, Shu-Ching Lin, Wei-Chao
author_facet	Huang, Min-Wei Tsai, Chih-Fong Tsui, Shu-Ching Lin, Wei-Chao
author_sort	Huang, Min-Wei
collection	PubMed
description	Data discretization transforms a set of continuous features into discrete features, thus simplifying the representation of information and making it easier to understand, use, and explain. In practice, users can take advantage of the discretization process to improve knowledge discovery and data analysis on medical domain problem datasets containing continuous features. However, certain feature values are frequently missing, and many data-mining algorithms cannot handle incomplete datasets. In this study, we considered the use of both discretization and missing-value imputation to process incomplete medical datasets, examining how the order in which discretization and missing-value imputation are combined influences performance. The experiments used seven different medical domain problem datasets; two discretizers, namely the minimum description length principle (MDLP) and ChiMerge; three imputation methods, namely mean/mode, classification and regression tree (CART), and k-nearest neighbor (KNN); and two classifiers, namely support vector machines (SVM) and the C4.5 decision tree. The results show that better performance can be obtained by first performing discretization followed by imputation, rather than vice versa. Furthermore, the highest classification accuracy rate was achieved by combining ChiMerge and KNN with SVM.
format	Online Article Text
id	pubmed-10688879
institution	National Center for Biotechnology Information
language	English
publishDate	2023
publisher	Public Library of Science
record_format	MEDLINE/PubMed
spelling	pubmed-10688879 2023-12-01 Combining data discretization and missing value imputation for incomplete medical datasets Huang, Min-Wei Tsai, Chih-Fong Tsui, Shu-Ching Lin, Wei-Chao PLoS One Research Article Public Library of Science 2023-11-30 /pmc/articles/PMC10688879/ /pubmed/38033140 http://dx.doi.org/10.1371/journal.pone.0295032 Text en © 2023 Huang et al https://creativecommons.org/licenses/by/4.0/ This is an open access article distributed under the terms of the Creative Commons Attribution License (https://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.
spellingShingle	Research Article Huang, Min-Wei Tsai, Chih-Fong Tsui, Shu-Ching Lin, Wei-Chao Combining data discretization and missing value imputation for incomplete medical datasets
title	Combining data discretization and missing value imputation for incomplete medical datasets
topic	Research Article
url	https://www.ncbi.nlm.nih.gov/pmc/articles/PMC10688879/ https://www.ncbi.nlm.nih.gov/pubmed/38033140 http://dx.doi.org/10.1371/journal.pone.0295032
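
The description in this record compares two orderings of the same preprocessing steps: discretize first and then impute, or impute first and then discretize. The following is a minimal sketch of that contrast on a single feature, not the authors' implementation. It assumes equal-width binning as a stand-in for MDLP or ChiMerge, mean imputation for continuous values, and mode imputation for discrete values; all function names and the sample data are hypothetical.

```python
from collections import Counter
from statistics import mean

MISSING = None  # assumption: missing values are represented as None


def equal_width_cuts(values, n_bins=3):
    """Cut points from the observed values (stand-in for MDLP/ChiMerge)."""
    observed = [v for v in values if v is not MISSING]
    lo, hi = min(observed), max(observed)
    width = (hi - lo) / n_bins
    return [lo + width * i for i in range(1, n_bins)]


def discretize(values, cuts):
    """Map each observed value to a bin index; missing entries stay missing."""
    return [MISSING if v is MISSING else sum(v > c for c in cuts)
            for v in values]


def impute_mean(values):
    """Fill missing continuous values with the mean of the observed ones."""
    fill = mean(v for v in values if v is not MISSING)
    return [fill if v is MISSING else v for v in values]


def impute_mode(values):
    """Fill missing discrete values with the most frequent observed one."""
    observed = [v for v in values if v is not MISSING]
    fill = Counter(observed).most_common(1)[0][0]
    return [fill if v is MISSING else v for v in values]


def discretize_then_impute(values, n_bins=3):
    """Order 1: bin the observed values, then fill gaps with the modal bin."""
    cuts = equal_width_cuts(values, n_bins)
    return impute_mode(discretize(values, cuts))


def impute_then_discretize(values, n_bins=3):
    """Order 2: fill gaps with the mean, then bin the completed feature."""
    filled = impute_mean(values)
    cuts = equal_width_cuts(filled, n_bins)
    return discretize(filled, cuts)


# Hypothetical feature with two missing entries.
feature = [1.0, 1.2, 1.1, None, 9.0, 8.0, None]
print(discretize_then_impute(feature))
print(impute_then_discretize(feature))
```

On this toy feature the two orderings assign the missing entries to different bins, which is the kind of difference the study evaluates. The sketch says nothing about which ordering yields better classification accuracy; that depends on the datasets, discretizers, imputation methods, and classifiers reported in the article.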
