Increasing the Density of Laboratory Measures for Machine Learning Applications

Bibliographic Details
Main Authors: Abedi, Vida; Li, Jiang; Shivakumar, Manu K.; Avula, Venkatesh; Chaudhary, Durgesh P.; Shellenberger, Matthew J.; Khara, Harshit S.; Zhang, Yanfei; Lee, Ming Ta Michael; Wolk, Donna M.; Yeasin, Mohammed; Hontecillas, Raquel; Bassaganya-Riera, Josep; Zand, Ramin
Format: Online Article Text
Language: English
Published: MDPI 2020
Subjects:
Online Access: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC7795258/
https://www.ncbi.nlm.nih.gov/pubmed/33396741
http://dx.doi.org/10.3390/jcm10010103
_version_ 1783634402699378688
author Abedi, Vida
Li, Jiang
Shivakumar, Manu K.
Avula, Venkatesh
Chaudhary, Durgesh P.
Shellenberger, Matthew J.
Khara, Harshit S.
Zhang, Yanfei
Lee, Ming Ta Michael
Wolk, Donna M.
Yeasin, Mohammed
Hontecillas, Raquel
Bassaganya-Riera, Josep
Zand, Ramin
author_sort Abedi, Vida
collection PubMed
description Background. The imputation of missingness is a key step in Electronic Health Records (EHR) mining, as it can significantly affect the conclusions derived from the downstream analysis in translational medicine. The missingness of laboratory values in EHR is not at random, yet imputation techniques tend to disregard this key distinction. Consequently, the development of an adaptive imputation strategy designed specifically for EHR is an important step in improving the data imbalance and enhancing the predictive power of modeling tools for healthcare applications.
Method. We analyzed the laboratory measures derived from Geisinger’s EHR on patients in three distinct cohorts: patients tested for Clostridioides difficile (Cdiff) infection, patients with a diagnosis of inflammatory bowel disease (IBD), and patients with a diagnosis of hip or knee osteoarthritis (OA). We extracted Logical Observation Identifiers Names and Codes (LOINC), from which we excluded those with 75% or more missingness. The comorbidities, primary or secondary diagnosis, as well as active problem lists, were also extracted. The adaptive imputation strategy was designed based on a hybrid approach. The comorbidity patterns of patients were transformed into latent patterns and then clustered. Imputation was performed on a cluster of patients for each cohort independently to show the generalizability of the method. The results were compared with imputation applied to the complete dataset without incorporating the information from comorbidity patterns.
Results. We analyzed a total of 67,445 patients (11,230 IBD patients, 10,000 OA patients, and 46,215 patients tested for C. difficile infection). We extracted 495 LOINC and 11,230 diagnosis codes for the IBD cohort, 8160 diagnosis codes for the Cdiff cohort, and 2042 diagnosis codes for the OA cohort based on the primary/secondary diagnosis and active problem list in the EHR. Overall, the most improvement from this strategy was observed when the laboratory measures had a higher level of missingness. The best root mean square error (RMSE) difference for each dataset was recorded as −35.5 for the Cdiff, −8.3 for the IBD, and −11.3 for the OA dataset.
Conclusions. An adaptive imputation strategy designed specifically for EHR that uses complementary information from the clinical profile of the patient can be used to improve the imputation of missing laboratory values, especially when laboratory codes with high levels of missingness are included in the analysis.
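The comparison described in the abstract (imputing within comorbidity-based patient clusters versus imputing over the whole dataset, after excluding laboratory codes with 75% or more missingness) can be illustrated with a small sketch. This is not the authors' implementation: the abstract does not name the imputation algorithm or the clustering method, so simple mean imputation is swapped in as a stand-in, the cluster labels are assumed to be given, and all function names, variable names, and toy values are hypothetical. The RMSE difference is assumed here to be the cluster-based error minus the whole-dataset error, so that a negative value favors the cluster-based fill.

```python
import math

def missing_fraction(values):
    """Share of entries that are missing (None)."""
    return sum(v is None for v in values) / len(values)

def keep_dense_codes(labs, threshold=0.75):
    """Drop lab codes whose missingness is at or above the threshold."""
    return {code: vals for code, vals in labs.items()
            if missing_fraction(vals) < threshold}

def mean_impute(values, labels=None):
    """Fill None with the mean of the observed values.

    With labels, the mean is taken within each patient's cluster and
    falls back to the global mean if a cluster has no observed value.
    Mean imputation is a stand-in, not the method used in the paper.
    """
    observed = [v for v in values if v is not None]
    global_mean = sum(observed) / len(observed)
    if labels is None:
        return [global_mean if v is None else v for v in values]
    sums, counts = {}, {}
    for v, lab in zip(values, labels):
        if v is not None:
            sums[lab] = sums.get(lab, 0.0) + v
            counts[lab] = counts.get(lab, 0) + 1
    filled = []
    for v, lab in zip(values, labels):
        if v is None:
            v = sums[lab] / counts[lab] if lab in counts else global_mean
        filled.append(v)
    return filled

def rmse(truth, filled, hidden):
    """Root mean square error over the artificially hidden positions."""
    return math.sqrt(
        sum((truth[i] - filled[i]) ** 2 for i in hidden) / len(hidden)
    )

# Hypothetical toy data: one lab measure for six patients in two clusters.
truth = [1.0, 1.2, 0.8, 5.0, 5.2, 4.8]
clusters = [0, 0, 0, 1, 1, 1]   # assumed output of a clustering step
hidden = [1, 4]                 # positions masked to evaluate imputation
masked = [None if i in hidden else v for i, v in enumerate(truth)]

rmse_global = rmse(truth, mean_impute(masked), hidden)
rmse_cluster = rmse(truth, mean_impute(masked, clusters), hidden)
rmse_difference = rmse_cluster - rmse_global  # negative favors clusters

# Missingness filter: the second code is 75% missing and is excluded.
labs = {"dense_code": masked, "sparse_code": [None, None, None, 2.0]}
kept = keep_dense_codes(labs)
```

In this toy case the two clusters have clearly different value ranges, so filling from the cluster mean gives a lower error than filling from the pooled mean. Real laboratory data would need a stronger imputer and a check for clusters or codes with no observed values at all.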
format Online
Article
Text
id pubmed-7795258
institution National Center for Biotechnology Information
language English
publishDate 2020
publisher MDPI
record_format MEDLINE/PubMed
spelling pubmed-7795258 2021-01-10 Increasing the Density of Laboratory Measures for Machine Learning Applications J Clin Med Article MDPI 2020-12-30 /pmc/articles/PMC7795258/ /pubmed/33396741 http://dx.doi.org/10.3390/jcm10010103 Text en © 2020 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (http://creativecommons.org/licenses/by/4.0/).
title Increasing the Density of Laboratory Measures for Machine Learning Applications
topic Article
url https://www.ncbi.nlm.nih.gov/pmc/articles/PMC7795258/
https://www.ncbi.nlm.nih.gov/pubmed/33396741
http://dx.doi.org/10.3390/jcm10010103