Detecting the impact of subject characteristics on machine learning-based diagnostic applications
Collection of high-dimensional, longitudinal digital health data has the potential to support a wide variety of research and clinical applications including diagnostics and longitudinal health tracking. Algorithms that process these data and inform digital diagnostics are typically developed using t...
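Main Authors: | Chaibub Neto, Elias; Pratap, Abhishek; Perumal, Thanneer M.; Tummalacherla, Meghasyam; Snyder, Phil; Bot, Brian M.; Trister, Andrew D.; Friend, Stephen H.; Mangravite, Lara; Omberg, Larsson |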
---|---|
Format: | Online Article Text |
Language: | English |
Published: | Nature Publishing Group UK 2019 |
Subjects: | |
Online Access: | https://www.ncbi.nlm.nih.gov/pmc/articles/PMC6789029/ https://www.ncbi.nlm.nih.gov/pubmed/31633058 http://dx.doi.org/10.1038/s41746-019-0178-x |
_version_ | 1783458557964845056 |
---|---|
author | Chaibub Neto, Elias; Pratap, Abhishek; Perumal, Thanneer M.; Tummalacherla, Meghasyam; Snyder, Phil; Bot, Brian M.; Trister, Andrew D.; Friend, Stephen H.; Mangravite, Lara; Omberg, Larsson |
author_sort | Chaibub Neto, Elias |
collection | PubMed |
description | Collection of high-dimensional, longitudinal digital health data has the potential to support a wide variety of research and clinical applications including diagnostics and longitudinal health tracking. Algorithms that process these data and inform digital diagnostics are typically developed using training and test sets generated from multiple repeated measures collected across a set of individuals. However, the inclusion of repeated measurements is not always appropriately taken into account in the analytical evaluations of predictive performance. The assignment of repeated measurements from each individual to both the training and the test sets (“record-wise” data split) is a common practice and can lead to massive underestimation of the prediction error due to the presence of “identity confounding.” In essence, these models learn to identify subjects, in addition to diagnostic signal. Here, we present a method that can be used to effectively calculate the amount of identity confounding learned by classifiers developed using a record-wise data split. By applying this method to several real datasets, we demonstrate that identity confounding is a serious issue in digital health studies and that record-wise data splits for machine learning-based applications need to be avoided. |
format | Online Article Text |
id | pubmed-6789029 |
institution | National Center for Biotechnology Information |
language | English |
publishDate | 2019 |
publisher | Nature Publishing Group UK |
record_format | MEDLINE/PubMed |
spelling | pubmed-6789029 2019-10-18 Detecting the impact of subject characteristics on machine learning-based diagnostic applications Chaibub Neto, Elias; Pratap, Abhishek; Perumal, Thanneer M.; Tummalacherla, Meghasyam; Snyder, Phil; Bot, Brian M.; Trister, Andrew D.; Friend, Stephen H.; Mangravite, Lara; Omberg, Larsson NPJ Digit Med Article
Nature Publishing Group UK 2019-10-11 /pmc/articles/PMC6789029/ /pubmed/31633058 http://dx.doi.org/10.1038/s41746-019-0178-x Text en © The Author(s) 2019 Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made. The images or other third party material in this article are included in the article’s Creative Commons license, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons license and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this license, visit http://creativecommons.org/licenses/by/4.0/. |
title | Detecting the impact of subject characteristics on machine learning-based diagnostic applications |
topic | Article |
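
The description in this record contrasts a "record-wise" data split, where repeated measurements from one individual can fall into both the training and the test set, with a split that keeps each individual on one side only. The following is a minimal sketch of that distinction, not the authors' method for quantifying identity confounding. The toy data, the field name `subject_id`, and the function names are hypothetical and chosen only for illustration.

```python
import random


def record_wise_split(records, test_fraction=0.25, seed=0):
    """Shuffle individual records; one subject can end up in both sets."""
    rng = random.Random(seed)
    shuffled = list(records)
    rng.shuffle(shuffled)
    n_test = int(round(len(shuffled) * test_fraction))
    return shuffled[n_test:], shuffled[:n_test]


def subject_wise_split(records, test_fraction=0.25, seed=0):
    """Assign whole subjects to one set so no subject appears in both."""
    rng = random.Random(seed)
    subjects = sorted({r["subject_id"] for r in records})
    rng.shuffle(subjects)
    n_test = max(1, int(round(len(subjects) * test_fraction)))
    test_subjects = set(subjects[:n_test])
    train = [r for r in records if r["subject_id"] not in test_subjects]
    test = [r for r in records if r["subject_id"] in test_subjects]
    return train, test


def shared_subjects(train, test):
    """Return the subjects that contribute records to both sets."""
    return {r["subject_id"] for r in train} & {r["subject_id"] for r in test}


# Hypothetical toy data: 10 subjects, 20 repeated measurements each.
records = [
    {"subject_id": s, "record": i, "label": s % 2}
    for s in range(10)
    for i in range(20)
]

rw_train, rw_test = record_wise_split(records)
sw_train, sw_test = subject_wise_split(records)

print("record-wise, subjects in both sets:", len(shared_subjects(rw_train, rw_test)))
print("subject-wise, subjects in both sets:", len(shared_subjects(sw_train, sw_test)))
```

In the record-wise case a classifier can score well on the test set partly by recognizing individuals it has already seen, which is the identity confounding the description warns about. The subject-wise variant removes that overlap by construction.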