
Automated data cleaning of paediatric anthropometric data from longitudinal electronic health records: protocol and application to a large patient cohort

‘Big data’ in healthcare encompass measurements collated from multiple sources with various degrees of data quality. These data require quality control assessment to optimise quality for clinical management and for robust large-scale data analysis in healthcare research. Height and weight data repre...


Bibliographic Details
Main Authors: Phan, Hang T. T., Borca, Florina, Cable, David, Batchelor, James, Davies, Justin H., Ennis, Sarah
Format: Online Article Text
Language: English
Published: Nature Publishing Group UK 2020
Subjects:
Online Access: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC7311482/
https://www.ncbi.nlm.nih.gov/pubmed/32576940
http://dx.doi.org/10.1038/s41598-020-66925-7
author Phan, Hang T. T.
Borca, Florina
Cable, David
Batchelor, James
Davies, Justin H.
Ennis, Sarah
collection PubMed
description ‘Big data’ in healthcare encompass measurements collated from multiple sources with various degrees of data quality. These data require quality control assessment to optimise quality for clinical management and for robust large-scale data analysis in healthcare research. Height and weight data represent one of the most abundantly recorded health statistics. The shift to electronic recording of anthropometric measurements in electronic healthcare records has rapidly inflated the number of measurements. WHO guidelines inform removal of population-based extreme outliers but an absence of tools limits cleaning of longitudinal anthropometric measurements. We developed and optimised a protocol for cleaning paediatric height and weight data that incorporates outlier detection based on robust linear regression, using a manually curated set of 6,279 patients’ longitudinal measurements. The protocol was then applied to a cohort of 200,000 patient records collected from 60,000 paediatric patients attending a regional teaching hospital in South England. WHO guidelines detected biologically implausible data in <1% of records. Additional error rates of 3% and 0.2% for height and weight respectively were detected using the protocol. Inflated error rates for height measurements were largely due to small but physiologically implausible decreases in height. Lowest error rates were observed when data were measured and digitally recorded by staff routinely required to do so. The protocol successfully automates the parsing of implausible and poor quality height and weight data from a voluminous longitudinal dataset and standardises the quality assessment of data for clinical and research applications.
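The description above mentions two kinds of check on a patient's longitudinal series: outlier detection by robust linear regression, and detection of physiologically implausible decreases in height. The following is a minimal sketch of both ideas, not the authors' protocol. It assumes a Theil-Sen fit (median of pairwise slopes) as the robust regression, since the record does not say which estimator was used, and the function names, the cut-off `k` and the 1 cm tolerance are all hypothetical.

```python
from itertools import combinations
from statistics import median


def theil_sen(xs, ys):
    """Robust line fit: median of pairwise slopes, then median intercept."""
    slopes = [
        (y2 - y1) / (x2 - x1)
        for (x1, y1), (x2, y2) in combinations(zip(xs, ys), 2)
        if x2 != x1
    ]
    slope = median(slopes)
    intercept = median([y - slope * x for x, y in zip(xs, ys)])
    return slope, intercept


def flag_regression_outliers(ages, values, k=5.0):
    """Flag points whose residual lies more than k robust SDs from the median.

    k is an illustrative cut-off, not a value taken from the paper.
    """
    slope, intercept = theil_sen(ages, values)
    residuals = [v - (slope * a + intercept) for a, v in zip(ages, values)]
    centre = median(residuals)
    mad = median([abs(r - centre) for r in residuals])
    scale = 1.4826 * mad or 1e-9  # avoid division by zero on a perfect fit
    return [abs(r - centre) / scale > k for r in residuals]


def flag_height_decreases(heights_cm, tolerance_cm=1.0):
    """Flag a measurement that is lower than the previous one by more than
    tolerance_cm (a hypothetical allowance for measurement error)."""
    flags = [False]
    for prev, curr in zip(heights_cm, heights_cm[1:]):
        flags.append(curr < prev - tolerance_cm)
    return flags


# Made-up series for one child: age in years, height in cm.
ages = [2.0, 2.5, 3.0, 3.5, 4.0, 4.5]
heights = [86.0, 90.7, 95.1, 59.5, 104.2, 108.4]  # 59.5 mimics a typing error

print(flag_regression_outliers(ages, heights))
print(flag_height_decreases(heights))
```

A straight line is only a reasonable approximation of growth over a short age window, so a real implementation would need to handle longer series differently; the sketch is meant only to show how a robust fit keeps a single bad value from distorting the trend used to judge it.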
format Online
Article
Text
id pubmed-7311482
institution National Center for Biotechnology Information
language English
publishDate 2020
publisher Nature Publishing Group UK
record_format MEDLINE/PubMed
spelling pubmed-7311482 2020-06-25 Sci Rep Article Nature Publishing Group UK 2020-06-23 /pmc/articles/PMC7311482/ /pubmed/32576940 http://dx.doi.org/10.1038/s41598-020-66925-7 Text en © The Author(s) 2020 Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made. The images or other third party material in this article are included in the article’s Creative Commons license, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons license and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this license, visit http://creativecommons.org/licenses/by/4.0/.
title Automated data cleaning of paediatric anthropometric data from longitudinal electronic health records: protocol and application to a large patient cohort
topic Article