
Stratified split sampling of electronic health records


Bibliographic Details
Main authors: Huo, Tianyao, Glueck, Deborah H., Shenkman, Elizabeth A., Muller, Keith E.
Format: Online Article Text
Language: English
Published: BioMed Central 2023
Subjects:
Online access: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC10210417/
https://www.ncbi.nlm.nih.gov/pubmed/37231360
http://dx.doi.org/10.1186/s12874-023-01938-0
author Huo, Tianyao
Glueck, Deborah H.
Shenkman, Elizabeth A.
Muller, Keith E.
collection PubMed
description Although superficially similar to data from clinical research, data extracted from electronic health records may require fundamentally different approaches for model building and analysis. Because electronic health record data is designed for clinical, rather than scientific use, researchers must first provide clear definitions of outcome and predictor variables. Yet an iterative process of defining outcomes and predictors, assessing association, and then repeating the process may increase Type I error rates, and thus decrease the chance of replicability, defined by the National Academy of Sciences as the chance of “obtaining consistent results across studies aimed at answering the same scientific question, each of which has obtained its own data.”[1] In addition, failure to account for subgroups may mask heterogeneous associations between predictor and outcome by subgroups, and decrease the generalizability of the findings. To increase chances of replicability and generalizability, we recommend using a stratified split sample approach for studies using electronic health records. A split sample approach divides the data randomly into an exploratory set for iterative variable definition, iterative analyses of association, and consideration of subgroups. The confirmatory set is used only to replicate results found in the first set. The addition of the word ‘stratified’ indicates that rare subgroups are oversampled randomly by including them in the exploratory sample at higher rates than appear in the population. The stratified sampling provides a sufficient sample size for assessing heterogeneity of association by testing for effect modification by group membership. 
An electronic health record study of the associations between socio-demographic factors and uptake of hepatic cancer screening, and potential heterogeneity of association in subgroups defined by gender, self-identified race and ethnicity, census-tract level poverty and insurance type illustrates the recommended approach.
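The stratified split described in the abstract can be sketched in a few lines of Python. This is a minimal illustration, not the authors' implementation: the function name `stratified_split`, the toy records, and the per-stratum exploratory fractions (0.8 for the rare group, 0.3 for the common group) are all assumptions chosen for the example. Each stratum is shuffled separately, and a larger share of the rare stratum is sent to the exploratory set, so the rare group appears there at a higher rate than in the full data.

```python
import random
from collections import defaultdict


def stratified_split(records, stratum_of, exploratory_fraction,
                     default_fraction=0.5, seed=0):
    """Split records into exploratory and confirmatory sets, stratum by stratum.

    exploratory_fraction maps a stratum label to the share of that stratum
    sent to the exploratory set; rare subgroups can be given a larger share.
    (All names and fractions here are hypothetical, for illustration only.)
    """
    rng = random.Random(seed)

    # Group the records by stratum label.
    by_stratum = defaultdict(list)
    for rec in records:
        by_stratum[stratum_of(rec)].append(rec)

    exploratory, confirmatory = [], []
    for label, members in by_stratum.items():
        rng.shuffle(members)  # random assignment within the stratum
        share = exploratory_fraction.get(label, default_fraction)
        cut = round(len(members) * share)
        exploratory.extend(members[:cut])
        confirmatory.extend(members[cut:])
    return exploratory, confirmatory


# Hypothetical toy data: 100 records in a rare group, 900 in a common group.
records = [{"id": i, "group": "rare" if i < 100 else "common"}
           for i in range(1000)]

exploratory, confirmatory = stratified_split(
    records,
    stratum_of=lambda r: r["group"],
    exploratory_fraction={"rare": 0.8, "common": 0.3},
)

rare_in_exploratory = sum(r["group"] == "rare" for r in exploratory)
print(len(exploratory), len(confirmatory), rare_in_exploratory)
```

In this toy setup the rare group makes up 10% of all records but 80 of the 350 exploratory records, which is the oversampling the abstract refers to. The confirmatory set holds the remaining records and is kept untouched until the exploratory work is finished.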
format Online
Article
Text
id pubmed-10210417
institution National Center for Biotechnology Information
language English
publishDate 2023
publisher BioMed Central
record_format MEDLINE/PubMed
spelling pubmed-10210417 2023-05-26 Stratified split sampling of electronic health records Huo, Tianyao Glueck, Deborah H. Shenkman, Elizabeth A. Muller, Keith E. BMC Med Res Methodol Research BioMed Central 2023-05-25 /pmc/articles/PMC10210417/ /pubmed/37231360 http://dx.doi.org/10.1186/s12874-023-01938-0 Text en © The Author(s) 2023 https://creativecommons.org/licenses/by/4.0/ Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article’s Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated in a credit line to the data.
title Stratified split sampling of electronic health records
topic Research