
Detecting the impact of subject characteristics on machine learning-based diagnostic applications

Collection of high-dimensional, longitudinal digital health data has the potential to support a wide variety of research and clinical applications including diagnostics and longitudinal health tracking. Algorithms that process these data and inform digital diagnostics are typically developed using t...


Bibliographic Details
Main authors: Chaibub Neto, Elias, Pratap, Abhishek, Perumal, Thanneer M., Tummalacherla, Meghasyam, Snyder, Phil, Bot, Brian M., Trister, Andrew D., Friend, Stephen H., Mangravite, Lara, Omberg, Larsson
Format: Online Article Text
Language: English
Published: Nature Publishing Group UK 2019
Subjects:
Online access: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC6789029/
https://www.ncbi.nlm.nih.gov/pubmed/31633058
http://dx.doi.org/10.1038/s41746-019-0178-x
author Chaibub Neto, Elias
Pratap, Abhishek
Perumal, Thanneer M.
Tummalacherla, Meghasyam
Snyder, Phil
Bot, Brian M.
Trister, Andrew D.
Friend, Stephen H.
Mangravite, Lara
Omberg, Larsson
collection PubMed
description Collection of high-dimensional, longitudinal digital health data has the potential to support a wide variety of research and clinical applications including diagnostics and longitudinal health tracking. Algorithms that process these data and inform digital diagnostics are typically developed using training and test sets generated from multiple repeated measures collected across a set of individuals. However, the inclusion of repeated measurements is not always appropriately taken into account in the analytical evaluations of predictive performance. The assignment of repeated measurements from each individual to both the training and the test sets (“record-wise” data split) is a common practice and can lead to massive underestimation of the prediction error due to the presence of “identity confounding.” In essence, these models learn to identify subjects, in addition to diagnostic signal. Here, we present a method that can be used to effectively calculate the amount of identity confounding learned by classifiers developed using a record-wise data split. By applying this method to several real datasets, we demonstrate that identity confounding is a serious issue in digital health studies and that record-wise data splits for machine learning-based applications need to be avoided.
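The difference between the two splitting strategies named in the abstract can be illustrated with a minimal sketch. It is not the authors' method for quantifying identity confounding; it only shows, under stated assumptions, how a record-wise split lets the same subject appear in both sets while a subject-wise split does not. The function names (`subject_wise_split`, `record_wise_split`, `shared_subjects`) and the toy subject identifiers are hypothetical and chosen for illustration.

```python
import random
from collections import defaultdict


def subject_wise_split(subject_ids, test_fraction=0.3, seed=0):
    """Split record indices so that every subject lands in exactly one set."""
    by_subject = defaultdict(list)
    for idx, sid in enumerate(subject_ids):
        by_subject[sid].append(idx)
    subjects = sorted(by_subject)          # deterministic order before shuffling
    random.Random(seed).shuffle(subjects)
    n_test = max(1, round(len(subjects) * test_fraction))
    test_idx = [i for s in subjects[:n_test] for i in by_subject[s]]
    train_idx = [i for s in subjects[n_test:] for i in by_subject[s]]
    return sorted(train_idx), sorted(test_idx)


def record_wise_split(n_records, test_fraction=0.3, seed=0):
    """Split record indices at random, ignoring which subject they belong to."""
    idx = list(range(n_records))
    random.Random(seed).shuffle(idx)
    n_test = max(1, round(n_records * test_fraction))
    return sorted(idx[n_test:]), sorted(idx[:n_test])


def shared_subjects(subject_ids, train_idx, test_idx):
    """Return the subjects that contribute records to both sets."""
    train_subjects = {subject_ids[i] for i in train_idx}
    test_subjects = {subject_ids[i] for i in test_idx}
    return train_subjects & test_subjects


# Toy data (an assumption): 5 subjects with 4 repeated records each.
subject_ids = [s for s in ["s1", "s2", "s3", "s4", "s5"] for _ in range(4)]

sw_train, sw_test = subject_wise_split(subject_ids)
rw_train, rw_test = record_wise_split(len(subject_ids))

print("subject-wise overlap:", shared_subjects(subject_ids, sw_train, sw_test))
print("record-wise overlap: ", shared_subjects(subject_ids, rw_train, rw_test))
```

With the subject-wise split the overlap is empty by construction, so a classifier cannot score well on the test set merely by recognizing individuals it saw during training. Libraries such as scikit-learn offer group-aware splitters for the same purpose; the standard-library version here is only meant to make the idea concrete.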
format Online
Article
Text
id pubmed-6789029
institution National Center for Biotechnology Information
language English
publishDate 2019
publisher Nature Publishing Group UK
record_format MEDLINE/PubMed
spelling pubmed-6789029 2019-10-18 NPJ Digit Med Article
Nature Publishing Group UK 2019-10-11 /pmc/articles/PMC6789029/ /pubmed/31633058 http://dx.doi.org/10.1038/s41746-019-0178-x Text en © The Author(s) 2019 Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made. The images or other third party material in this article are included in the article’s Creative Commons license, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons license and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this license, visit http://creativecommons.org/licenses/by/4.0/.
topic Article