
Exploiting mutual information for the imputation of static and dynamic mixed-type clinical data with an adaptive k-nearest neighbours approach

BACKGROUND: Clinical registers constitute an invaluable resource in the medical data-driven decision making context. Accurate machine learning and data mining approaches on these data can lead to faster diagnosis, definition of tailored interventions, and improved outcome prediction. A typical issue...


Bibliographic Details
Main Authors:	Tavazzi, Erica; Daberdaku, Sebastian; Vasta, Rosario; Calvo, Andrea; Chiò, Adriano; Di Camillo, Barbara
Format:	Online Article Text
Language:	English
Published:	BioMed Central 2020
Subjects:	Research
Online Access:	https://www.ncbi.nlm.nih.gov/pmc/articles/PMC7439551/ https://www.ncbi.nlm.nih.gov/pubmed/32819346 http://dx.doi.org/10.1186/s12911-020-01166-2

_version_	1783573005767540736
author	Tavazzi, Erica; Daberdaku, Sebastian; Vasta, Rosario; Calvo, Andrea; Chiò, Adriano; Di Camillo, Barbara
author_sort	Tavazzi, Erica
collection	PubMed
description	BACKGROUND: Clinical registers constitute an invaluable resource in the medical data-driven decision making context. Accurate machine learning and data mining approaches on these data can lead to faster diagnosis, definition of tailored interventions, and improved outcome prediction. A typical issue when implementing such approaches is the almost unavoidable presence of missing values in the collected data. In this work, we propose an imputation algorithm based on a mutual information-weighted k-nearest neighbours approach, able to handle the simultaneous presence of missing information in different types of variables. We developed and validated the method on a clinical register, constituted by the information collected over subsequent screening visits of a cohort of patients affected by amyotrophic lateral sclerosis. METHODS: For each subject with missing data to be imputed, we create a feature vector constituted by the information collected over his/her first three months of visits. This vector is used as a sample in a k-nearest neighbours procedure, in order to select, among the other patients, the ones with the most similar temporal evolution of the disease. An ad hoc similarity metric was implemented for the sample comparison, capable of handling the different nature of the data and the presence of multiple missing values, and of including the cross-information among features captured by the mutual information statistic. RESULTS: We validated the proposed imputation method on an independent test set, comparing its performance with those of three state-of-the-art competitors, resulting in better performance. We further assessed the validity of our algorithm by comparing the performance of a survival classifier built on the data imputed with our method versus the one built on the data imputed with the best-performing competitor. CONCLUSIONS: Imputation of missing data is a crucial –and often mandatory– step when working with real-world datasets. The algorithm proposed in this work could effectively impute an amyotrophic lateral sclerosis clinical dataset, by handling the temporal and the mixed-type nature of the data and by exploiting the cross-information among features. We also showed how the imputation quality can affect a machine learning task.
format	Online Article Text
id	pubmed-7439551
institution	National Center for Biotechnology Information
language	English
publishDate	2020
publisher	BioMed Central
record_format	MEDLINE/PubMed
spelling	pubmed-7439551 2020-08-24 Exploiting mutual information for the imputation of static and dynamic mixed-type clinical data with an adaptive k-nearest neighbours approach Tavazzi, Erica; Daberdaku, Sebastian; Vasta, Rosario; Calvo, Andrea; Chiò, Adriano; Di Camillo, Barbara BMC Med Inform Decis Mak Research BioMed Central 2020-08-20 /pmc/articles/PMC7439551/ /pubmed/32819346 http://dx.doi.org/10.1186/s12911-020-01166-2 Text en © The Author(s) 2020 Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article’s Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated in a credit line to the data.
title	Exploiting mutual information for the imputation of static and dynamic mixed-type clinical data with an adaptive k-nearest neighbours approach
topic	Research
url	https://www.ncbi.nlm.nih.gov/pmc/articles/PMC7439551/ https://www.ncbi.nlm.nih.gov/pubmed/32819346 http://dx.doi.org/10.1186/s12911-020-01166-2
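
The abstract above describes a mutual information-weighted k-nearest neighbours imputation for mixed-type data. The following is only a minimal sketch of that general idea, not the authors' implementation: the function names (mutual_information, impute), the histogram-based mutual information estimate, the per-feature distance, and the mean/mode aggregation are all assumptions made for illustration. The temporal feature vectors (first three months of visits) and any adaptive choice of k are not modelled here.

```python
import numpy as np


def mutual_information(x, y, bins=4):
    """Plug-in mutual information (in nats) from a 2-D histogram.

    Only rows where both variables are observed are used. The binning is an
    illustrative assumption, not taken from the paper.
    """
    ok = ~np.isnan(x) & ~np.isnan(y)
    if ok.sum() < 2:
        return 0.0
    joint, _, _ = np.histogram2d(x[ok], y[ok], bins=bins)
    p = joint / joint.sum()
    px = p.sum(axis=1, keepdims=True)
    py = p.sum(axis=0, keepdims=True)
    nz = p > 0
    mi = np.sum(p[nz] * np.log(p[nz] / (px @ py)[nz]))
    return max(0.0, float(mi))  # clip tiny negative rounding errors


def impute(X, is_cat, k=3):
    """Fill NaNs in X (rows = subjects) using MI-weighted nearest neighbours.

    is_cat[f] is True when column f holds numeric category codes.
    Assumes every column has at least one observed value.
    """
    X = np.asarray(X, dtype=float)
    is_cat = np.asarray(is_cat, dtype=bool)
    n_features = X.shape[1]
    out = X.copy()
    # Range of each column, used to scale numeric differences into [0, 1].
    span = np.nanmax(X, axis=0) - np.nanmin(X, axis=0)
    span[span == 0] = 1.0
    for j in range(n_features):
        missing = np.where(np.isnan(X[:, j]))[0]
        donors = np.where(~np.isnan(X[:, j]))[0]
        if missing.size == 0 or donors.size == 0:
            continue
        # Weight each other feature by its mutual information with target j.
        w = np.array([0.0 if f == j else mutual_information(X[:, f], X[:, j])
                      for f in range(n_features)])
        if w.sum() == 0:  # fall back to uniform weights
            w = np.ones(n_features)
            w[j] = 0.0
        for i in missing:
            # Compare only on features observed in both subjects.
            shared = ~np.isnan(X[donors]) & ~np.isnan(X[i])
            diff = np.where(is_cat, X[donors] != X[i],
                            np.abs(X[donors] - X[i]) / span)
            num = (np.where(shared, diff, 0.0) * w).sum(axis=1)
            den = (shared * w).sum(axis=1)
            dist = np.full(donors.size, np.inf)
            np.divide(num, den, out=dist, where=den > 0)
            nearest = donors[np.argsort(dist, kind="stable")[:k]]
            vals = X[nearest, j]
            if is_cat[j]:
                cats, counts = np.unique(vals, return_counts=True)
                out[i, j] = cats[np.argmax(counts)]  # most frequent category
            else:
                out[i, j] = vals.mean()
    return out


# Toy data: two numeric columns and one binary categorical column (column 1).
X_demo = np.array([
    [1.00, 0.0, 10.0],
    [1.10, 0.0, 11.0],
    [5.00, 1.0, 50.0],
    [5.20, 1.0, 52.0],
    [1.05, 0.0, np.nan],
    [5.10, np.nan, 51.0],
])
X_filled = impute(X_demo, is_cat=[False, True, False], k=2)
```

In this toy example distances are always computed on the original observed values, so previously imputed entries never act as evidence for later ones; whether the published method makes the same choice is not stated in the abstract.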
