
Investigating the impact of development and internal validation design when training prognostic models using a retrospective cohort in big US observational healthcare data

Bibliographic Details
Main Authors: Reps, Jenna M, Ryan, Patrick, Rijnbeek, P R
Format: Online Article Text
Language: English
Published: BMJ Publishing Group 2021
Subjects: Health Informatics
Online Access: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC8710861/
https://www.ncbi.nlm.nih.gov/pubmed/34952871
http://dx.doi.org/10.1136/bmjopen-2021-050146
author Reps, Jenna M
Ryan, Patrick
Rijnbeek, P R
collection PubMed
description OBJECTIVE: The internal validation of prediction models aims to quantify the generalisability of a model. We aim to determine the impact, if any, that the choice of development and internal validation design has on the internal performance bias and model generalisability in big data (n~500 000). DESIGN: Retrospective cohort. SETTING: Primary and secondary care; three US claims databases. PARTICIPANTS: 1 200 769 patients pharmaceutically treated for their first occurrence of depression. METHODS: We investigated the impact of the development/validation design across 21 real-world prediction questions. Model discrimination and calibration were assessed. We trained LASSO logistic regression models using US claims data and internally validated the models using eight different designs: ‘no test/validation set’, ‘test/validation set’ and cross validation with 3-fold, 5-fold or 10-fold with and without a test set. We then externally validated each model in two new US claims databases. We estimated the internal validation bias per design by empirically comparing the differences between the estimated internal performance and external performance. RESULTS: The differences between the models’ internal estimated performances and external performances were largest for the ‘no test/validation set’ design. This indicates even with large data the ‘no test/validation set’ design causes models to overfit. The seven alternative designs included some validation process to select the hyperparameters and a fair testing process to estimate internal performance. These designs had similar internal performance estimates and performed similarly when externally validated in the two external databases. CONCLUSIONS: Even with big data, it is important to use some validation process to select the optimal hyperparameters and fairly assess internal validation using a test set or cross-validation.
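The designs compared in the abstract can be illustrated with a minimal sketch, assuming scikit-learn and synthetic placeholder data (the study itself used patient-level claims data and is not reproduced here): LASSO (L1-penalised) logistic regression, 5-fold cross-validation on the training portion to choose the regularisation strength, and a held-out test set for the internal performance estimate. All names and parameters below are illustrative assumptions, not the authors' code.

```python
# Illustrative sketch only: assumed scikit-learn API, with synthetic data standing
# in for the claims-derived patient-level features used in the study.
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegressionCV
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in for a binary-outcome prediction problem with a rare outcome.
X, y = make_classification(n_samples=5000, n_features=100, weights=[0.9],
                           random_state=0)

# Hold out a test set that plays no part in model fitting or hyperparameter choice.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=0)

# LASSO logistic regression: 5-fold cross-validation on the training data selects
# the regularisation strength (one of the validation processes the paper compares).
model = LogisticRegressionCV(Cs=10, cv=5, penalty="l1", solver="liblinear",
                             scoring="roc_auc", max_iter=1000)
model.fit(X_train, y_train)

# Fair internal estimate of discrimination: AUC on the untouched test set.
test_auc = roc_auc_score(y_test, model.predict_proba(X_test)[:, 1])
print(f"internal test AUC: {test_auc:.3f}")
```

Under the 'no test/validation set' design that the abstract flags as biased, the final score would instead be computed on the training data, which is what inflates the internal estimate relative to external validation.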
format Online
Article
Text
id pubmed-8710861
institution National Center for Biotechnology Information
language English
publishDate 2021
publisher BMJ Publishing Group
record_format MEDLINE/PubMed
spelling pubmed-8710861 2022-01-10 BMJ Open, Health Informatics. BMJ Publishing Group 2021-12-24 /pmc/articles/PMC8710861/ /pubmed/34952871 http://dx.doi.org/10.1136/bmjopen-2021-050146 Text en © Author(s) (or their employer(s)) 2021. Re-use permitted under CC BY-NC; no commercial re-use. See rights and permissions. Published by BMJ. Open access under the Creative Commons Attribution Non Commercial (CC BY-NC 4.0) licence: https://creativecommons.org/licenses/by-nc/4.0/
title Investigating the impact of development and internal validation design when training prognostic models using a retrospective cohort in big US observational healthcare data
topic Health Informatics
url https://www.ncbi.nlm.nih.gov/pmc/articles/PMC8710861/
https://www.ncbi.nlm.nih.gov/pubmed/34952871
http://dx.doi.org/10.1136/bmjopen-2021-050146