
Self–Training With Quantile Errors for Multivariate Missing Data Imputation for Regression Problems in Electronic Medical Records: Algorithm Development Study

BACKGROUND: When using machine learning in the real world, missing values are the first problem encountered. Methods to impute missing values include statistical methods such as mean imputation, expectation-maximization, and multiple imputation by chained equations (MICE), as well as machine lear...


Bibliographic Details
Main authors: Gwon, Hansle, Ahn, Imjin, Kim, Yunha, Kang, Hee Jun, Seo, Hyeram, Cho, Ha Na, Choi, Heejung, Jun, Tae Joon, Kim, Young-Hak
Format: Online Article Text
Language: English
Published: JMIR Publications 2021
Subjects:
Online access: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC8552097/
https://www.ncbi.nlm.nih.gov/pubmed/34643539
http://dx.doi.org/10.2196/30824
author Gwon, Hansle
Ahn, Imjin
Kim, Yunha
Kang, Hee Jun
Seo, Hyeram
Cho, Ha Na
Choi, Heejung
Jun, Tae Joon
Kim, Young-Hak
collection PubMed
description BACKGROUND: When using machine learning in the real world, missing values are the first problem encountered. Methods to impute missing values include statistical methods such as mean imputation, expectation-maximization, and multiple imputation by chained equations (MICE), as well as machine learning methods such as multilayer perceptron, k-nearest neighbor, and decision tree. OBJECTIVE: The objective of this study was to impute numeric medical data such as physical data and laboratory data. We aimed to effectively impute data using a progressive method called self-training in the medical field, where training data are scarce. METHODS: In this paper, we propose a self-training method that gradually increases the available data. Models trained with complete data predict the missing values in incomplete data. Among the incomplete data, the records whose missing values are validly predicted are incorporated into the complete data. Using the predicted value as the actual value is called pseudolabeling. This process is repeated until the stopping condition is satisfied. The most important part of this process is how to evaluate the accuracy of pseudolabels. They can be evaluated by observing the effect of the pseudolabeled data on the performance of the model. RESULTS: In self-training using random forest (RF), mean squared error was up to 12% lower than pure RF, and the Pearson correlation coefficient was 0.1% higher. This difference was confirmed statistically. In the Friedman test performed on MICE and RF, self-training showed a P value between .003 and .02. A Wilcoxon signed-rank test performed on the mean imputation showed the lowest possible P value, 3.05e-5, in all situations. CONCLUSIONS: Self-training showed significant results in comparing the predicted values and actual values, but it needs to be verified in an actual machine learning system. Self-training also has the potential to improve performance depending on the pseudolabel evaluation method, which will be the main subject of our future research.
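The self-training loop described under METHODS can be illustrated with a minimal sketch. It is not the authors' implementation: the abstract does not specify the stopping condition or the quantile-error criterion named in the title, so this sketch assumes a simple acceptance rule (a pseudolabel is kept only if adding it does not increase mean squared error on a held-out set of complete records) and stops when no pseudolabel is accepted. A small k-nearest neighbor regressor stands in for random forest to keep the example dependency-free. All names (`knn_predict`, `holdout_mse`, `self_train`) and the synthetic data are hypothetical.

```python
import random


def knn_predict(train, x, k=3):
    """Predict a target as the mean target of the k nearest training rows."""
    nearest = sorted(
        train,
        key=lambda row: sum((a - b) ** 2 for a, b in zip(row[0], x)),
    )[:k]
    return sum(y for _, y in nearest) / len(nearest)


def holdout_mse(train, holdout, k=3):
    """Mean squared error of the k-NN model on held-out complete rows."""
    return sum((knn_predict(train, x, k) - y) ** 2 for x, y in holdout) / len(holdout)


def self_train(complete, incomplete, holdout, k=3, max_rounds=10):
    """Grow the labeled set with pseudolabels that pass the acceptance rule.

    complete:   list of (features, target) rows with observed targets
    incomplete: list of feature tuples whose target is missing
    holdout:    list of (features, target) rows used only for evaluation
    """
    labeled = list(complete)
    pending = list(incomplete)
    for _ in range(max_rounds):
        baseline = holdout_mse(labeled, holdout, k)
        accepted, rejected = [], []
        for x in pending:
            pseudo = knn_predict(labeled, x, k)  # pseudolabel for the missing value
            # Assumed acceptance rule: keep the pseudolabel only if the
            # held-out error does not get worse when this row is added.
            if holdout_mse(labeled + [(x, pseudo)], holdout, k) <= baseline:
                accepted.append((x, pseudo))
            else:
                rejected.append(x)
        if not accepted:  # assumed stopping condition
            break
        labeled.extend(accepted)
        pending = rejected
    return labeled, pending


# Synthetic demonstration data (hypothetical, not from the study).
random.seed(0)


def make_row():
    x = (random.random(), random.random())
    return x, 2 * x[0] + x[1]


complete = [make_row() for _ in range(20)]
holdout = [make_row() for _ in range(10)]
incomplete = [make_row()[0] for _ in range(15)]

labeled, pending = self_train(complete, incomplete, holdout)
```

After the call, `labeled` holds the original complete rows followed by any accepted pseudolabeled rows, and `pending` holds the rows whose missing value was never accepted. Evaluating each candidate against a held-out set follows the abstract's idea of judging pseudolabels by their effect on model performance, but the specific rule here is an assumption.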
format Online
Article
Text
id pubmed-8552097
institution National Center for Biotechnology Information
language English
publishDate 2021
publisher JMIR Publications
record_format MEDLINE/PubMed
spelling pubmed-8552097 2021-11-10 Self–Training With Quantile Errors for Multivariate Missing Data Imputation for Regression Problems in Electronic Medical Records: Algorithm Development Study Gwon, Hansle Ahn, Imjin Kim, Yunha Kang, Hee Jun Seo, Hyeram Cho, Ha Na Choi, Heejung Jun, Tae Joon Kim, Young-Hak JMIR Public Health Surveill Original Paper JMIR Publications 2021-10-13 /pmc/articles/PMC8552097/ /pubmed/34643539 http://dx.doi.org/10.2196/30824 Text en ©Hansle Gwon, Imjin Ahn, Yunha Kim, Hee Jun Kang, Hyeram Seo, Ha Na Cho, Heejung Choi, Tae Joon Jun, Young-Hak Kim. Originally published in JMIR Public Health and Surveillance (https://publichealth.jmir.org), 13.10.2021. https://creativecommons.org/licenses/by/4.0/ This is an open-access article distributed under the terms of the Creative Commons Attribution License (https://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work, first published in JMIR Public Health and Surveillance, is properly cited. The complete bibliographic information, a link to the original publication on https://publichealth.jmir.org, as well as this copyright and license information must be included.
title Self–Training With Quantile Errors for Multivariate Missing Data Imputation for Regression Problems in Electronic Medical Records: Algorithm Development Study
topic Original Paper