A strategy for validation of variables derived from large-scale electronic health record data

PURPOSE: Standardized approaches for rigorous validation of phenotyping from large-scale electronic health record (EHR) data have not been widely reported. We proposed a methodologically rigorous and efficient approach to guide such validation, including strategies for sampling cases and controls, d...

Bibliographic Details
Main authors:	Liu, Lin, Bustamante, Ranier, Earles, Ashley, Demb, Joshua, Messer, Karen, Gupta, Samir
Format:	Online Article Text
Language:	English
Published:	2021
Subjects:	Article
Online access:	https://www.ncbi.nlm.nih.gov/pmc/articles/PMC9615095/ https://www.ncbi.nlm.nih.gov/pubmed/34329789 http://dx.doi.org/10.1016/j.jbi.2021.103879

collection	PubMed
description	PURPOSE: Standardized approaches for rigorous validation of phenotyping from large-scale electronic health record (EHR) data have not been widely reported. We propose a methodologically rigorous and efficient approach to guide such validation, the San Diego Approach to Variable Validation (SDAVV), which includes strategies for sampling cases and controls, determining sample sizes, estimating algorithm performance, and terminating the validation process. METHODS: We propose sample size formulae to be applied before chart review, based on pre-specified critical lower bounds for positive predictive value (PPV) and negative predictive value (NPV). We also propose a stepwise strategy for iterative algorithm development/validation cycles, updating sample sizes for data abstraction until both PPV and NPV achieve target performance. RESULTS: We applied the SDAVV to a Department of Veterans Affairs study in which we created two phenotyping algorithms, one for distinguishing normal colonoscopy cases from abnormal colonoscopy controls and one for identifying aspirin exposure. For identifying normal colonoscopy cases, estimated PPV and NPV both reached 0.970 (95% confidence lower bound of 0.915), with an estimated sensitivity of 0.963 and specificity of 0.975. The phenotyping algorithm for identifying aspirin exposure reached a PPV of 0.990 (95% lower bound of 0.950), an NPV of 0.980 (95% lower bound of 0.930), a sensitivity of 0.960, and a specificity of 1.000. CONCLUSIONS: A structured approach for prospectively developing and validating phenotyping algorithms from large-scale EHR data can be successfully implemented, and should be considered to improve the quality of “big data” research.
id	pubmed-9615095
institution	National Center for Biotechnology Information
record_format	MEDLINE/PubMed
spelling	pubmed-9615095 2022-10-28 J Biomed Inform Article 2021-09 2021-07-27 /pmc/articles/PMC9615095/ /pubmed/34329789 http://dx.doi.org/10.1016/j.jbi.2021.103879 Text en This is an open access article under the CC BY license (https://creativecommons.org/licenses/by/4.0/).
