
Health record hiccups—5,526 real-world time series with change points labelled by crowdsourced visual inspection


Bibliographic Details
Main authors: Quan, T Phuong, Lacey, Ben, Peto, Tim E A, Walker, A Sarah
Format: Online Article Text
Language: English
Published: Oxford University Press 2023
Subjects:
Online access: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC10375518/
https://www.ncbi.nlm.nih.gov/pubmed/37503960
http://dx.doi.org/10.1093/gigascience/giad060
author Quan, T Phuong
Lacey, Ben
Peto, Tim E A
Walker, A Sarah
collection PubMed
description BACKGROUND: Large routinely collected data such as electronic health records (EHRs) are increasingly used in research, but the statistical methods and processes used to check such data for temporal data quality issues have not moved beyond manual, ad hoc production and visual inspection of graphs. With the prospect of EHR data being used for disease surveillance via automated pipelines and public-facing dashboards, automation of data quality checks will become increasingly valuable. FINDINGS: We generated 5,526 time series from 8 different EHR datasets and engaged >2,000 citizen-science volunteers to label the locations of all suspicious-looking change points in the resulting graphs. Consensus labels were produced using density-based clustering with noise, with validation conducted using 956 images containing labels produced by an experienced data scientist. Parameter tuning was done against 670 images and performance calculated against 286 images, resulting in a final sensitivity of 80.4% (95% CI, 77.1%–83.3%), specificity of 99.8% (99.7%–99.8%), positive predictive value of 84.5% (81.4%–87.2%), and negative predictive value of 99.7% (99.6%–99.7%). In total, 12,745 change points were found within 3,687 of the time series. CONCLUSIONS: This large collection of labelled EHR time series can be used to validate automated methods for change point detection in real-world settings, encouraging the development of methods that can successfully be applied in practice. It is particularly valuable since change point detection methods are typically validated using synthetic data, so their performance in real-world settings cannot be assumed to be comparable. While the dataset focusses on EHRs and data quality, it should also be applicable in other fields.
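The consensus step in the description (density-based clustering with noise applied to volunteer labels) can be illustrated with a minimal sketch. It assumes each volunteer label is a single position along the time axis and clusters those positions in the style of DBSCAN. The function name `consensus_change_points`, the parameters `eps` and `min_samples` and their values, and the use of the cluster median as the consensus location are all hypothetical choices for illustration; the record does not state the authors' implementation or tuned settings.

```python
from statistics import median


def consensus_change_points(clicks, eps=5.0, min_samples=3):
    """Cluster 1-D label positions DBSCAN-style and return one
    consensus location (the median) per cluster. Isolated labels
    are treated as noise and dropped.

    eps and min_samples are illustrative, not the published values.
    """
    pts = sorted(clicks)
    n = len(pts)

    def neighbours(i):
        # Indices of all points within eps of point i (includes i itself).
        return [j for j in range(n) if abs(pts[j] - pts[i]) <= eps]

    # A core point has at least min_samples points in its eps-neighbourhood.
    is_core = [len(neighbours(i)) >= min_samples for i in range(n)]

    labels = [None] * n  # None means unassigned; leftovers are noise
    cluster = -1
    for i in range(n):
        if labels[i] is not None or not is_core[i]:
            continue
        # Start a new cluster and grow it outwards from core points only.
        cluster += 1
        labels[i] = cluster
        stack = [i]
        while stack:
            k = stack.pop()
            for j in neighbours(k):
                if labels[j] is None:
                    labels[j] = cluster
                    if is_core[j]:
                        stack.append(j)

    clusters = {}
    for p, lab in zip(pts, labels):
        if lab is not None:
            clusters.setdefault(lab, []).append(p)
    return [median(v) for _, v in sorted(clusters.items())]


# Made-up labels from several volunteers on one graph: two groups of
# agreeing labels and one stray label at position 50.
example = consensus_change_points(
    [10, 11, 12, 13, 50, 100, 101, 102], eps=2, min_samples=3
)
print(example)  # [11.5, 101]
```

In this toy input the stray label at 50 has no neighbours within `eps`, so it is discarded as noise, while the two groups of agreeing labels each yield one consensus change point. Comparing such consensus points against expert labels is what would produce figures like the sensitivity and positive predictive value quoted in the description.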
format Online
Article
Text
id pubmed-10375518
institution National Center for Biotechnology Information
language English
publishDate 2023
publisher Oxford University Press
record_format MEDLINE/PubMed
spelling pubmed-10375518 2023-07-29 Health record hiccups—5,526 real-world time series with change points labelled by crowdsourced visual inspection Quan, T Phuong Lacey, Ben Peto, Tim E A Walker, A Sarah Gigascience Data Note
Oxford University Press 2023-07-28 /pmc/articles/PMC10375518/ /pubmed/37503960 http://dx.doi.org/10.1093/gigascience/giad060 Text en © The Author(s) 2023. Published by Oxford University Press GigaScience. https://creativecommons.org/licenses/by/4.0/ This is an Open Access article distributed under the terms of the Creative Commons Attribution License (https://creativecommons.org/licenses/by/4.0/), which permits unrestricted reuse, distribution, and reproduction in any medium, provided the original work is properly cited.
title Health record hiccups—5,526 real-world time series with change points labelled by crowdsourced visual inspection
topic Data Note