Using set visualisation to find and explain patterns of missing values: a case study with NHS hospital episode statistics data
OBJECTIVES: Missing data is the most common data quality issue in electronic health records (EHRs). Missing data checks implemented in common analytical software are typically limited to counting the number of missing values in individual fields, but researchers and organisations also need to unders...
Main authors: | Ruddle, Roy A; Adnan, Muhammad; Hall, Marlous |
---|---|
Format: | Online Article Text |
Language: | English |
Published: | BMJ Publishing Group 2022 |
Subjects: | Health Informatics |
Online access: | https://www.ncbi.nlm.nih.gov/pmc/articles/PMC9680176/ https://www.ncbi.nlm.nih.gov/pubmed/36410820 http://dx.doi.org/10.1136/bmjopen-2022-064887 |
_version_ | 1784834353846550528 |
---|---|
author | Ruddle, Roy A Adnan, Muhammad Hall, Marlous |
author_facet | Ruddle, Roy A Adnan, Muhammad Hall, Marlous |
author_sort | Ruddle, Roy A |
collection | PubMed |
description | OBJECTIVES: Missing data is the most common data quality issue in electronic health records (EHRs). Missing data checks implemented in common analytical software are typically limited to counting the number of missing values in individual fields, but researchers and organisations also need to understand multifield missing data patterns to better inform advanced missing data strategies for which counts or numerical summaries are poorly suited. This study shows how set-based visualisation enables multifield missing data patterns to be discovered and investigated. DESIGN: Development and evaluation of interactive set visualisation techniques to find patterns of missing data and generate actionable insights. The visualisations comprised easily interpretable bar charts for sets, heatmaps for set intersections and histograms for distributions of both sets and intersections. SETTING AND PARTICIPANTS: Anonymised admitted patient care health records for National Health Service (NHS) hospitals and independent sector providers in England. The visualisation and data mining software was run over 16 million records and 86 fields in the dataset. RESULTS: The dataset contained 960 million missing values. Set visualisation bar charts showed how those values were distributed across the fields, including several fields that, unexpectedly, were not complete. Set intersection heatmaps revealed unexpected gaps in diagnosis, operation and date fields because diagnosis and operation fields were not filled up sequentially and some operations did not have corresponding dates. Information gain ratio and entropy calculations allowed us to identify the origin of each unexpected pattern, in terms of the values of other fields. CONCLUSIONS: Our findings show how set visualisation reveals important insights about multifield missing data patterns in large EHR datasets. 
The study revealed both rare and widespread data quality issues that were previously unknown, and allowed a particular part of a specific hospital to be pinpointed as the origin of rare issues that NHS Digital did not know exist. |
format | Online Article Text |
id | pubmed-9680176 |
institution | National Center for Biotechnology Information |
language | English |
publishDate | 2022 |
publisher | BMJ Publishing Group |
record_format | MEDLINE/PubMed |
spelling | pubmed-9680176 2022-11-23 Using set visualisation to find and explain patterns of missing values: a case study with NHS hospital episode statistics data Ruddle, Roy A Adnan, Muhammad Hall, Marlous BMJ Open Health Informatics OBJECTIVES: Missing data is the most common data quality issue in electronic health records (EHRs). Missing data checks implemented in common analytical software are typically limited to counting the number of missing values in individual fields, but researchers and organisations also need to understand multifield missing data patterns to better inform advanced missing data strategies for which counts or numerical summaries are poorly suited. This study shows how set-based visualisation enables multifield missing data patterns to be discovered and investigated. DESIGN: Development and evaluation of interactive set visualisation techniques to find patterns of missing data and generate actionable insights. The visualisations comprised easily interpretable bar charts for sets, heatmaps for set intersections and histograms for distributions of both sets and intersections. SETTING AND PARTICIPANTS: Anonymised admitted patient care health records for National Health Service (NHS) hospitals and independent sector providers in England. The visualisation and data mining software was run over 16 million records and 86 fields in the dataset. RESULTS: The dataset contained 960 million missing values. Set visualisation bar charts showed how those values were distributed across the fields, including several fields that, unexpectedly, were not complete. Set intersection heatmaps revealed unexpected gaps in diagnosis, operation and date fields because diagnosis and operation fields were not filled up sequentially and some operations did not have corresponding dates. Information gain ratio and entropy calculations allowed us to identify the origin of each unexpected pattern, in terms of the values of other fields. 
CONCLUSIONS: Our findings show how set visualisation reveals important insights about multifield missing data patterns in large EHR datasets. The study revealed both rare and widespread data quality issues that were previously unknown, and allowed a particular part of a specific hospital to be pinpointed as the origin of rare issues that NHS Digital did not know exist. BMJ Publishing Group 2022-11-21 /pmc/articles/PMC9680176/ /pubmed/36410820 http://dx.doi.org/10.1136/bmjopen-2022-064887 Text en © Author(s) (or their employer(s)) 2022. Re-use permitted under CC BY. Published by BMJ. https://creativecommons.org/licenses/by/4.0/ This is an open access article distributed in accordance with the Creative Commons Attribution 4.0 Unported (CC BY 4.0) license, which permits others to copy, redistribute, remix, transform and build upon this work for any purpose, provided the original work is properly cited, a link to the licence is given, and indication of whether changes were made. See: https://creativecommons.org/licenses/by/4.0/. |
spellingShingle | Health Informatics Ruddle, Roy A Adnan, Muhammad Hall, Marlous Using set visualisation to find and explain patterns of missing values: a case study with NHS hospital episode statistics data |
title | Using set visualisation to find and explain patterns of missing values: a case study with NHS hospital episode statistics data |
title_full | Using set visualisation to find and explain patterns of missing values: a case study with NHS hospital episode statistics data |
title_fullStr | Using set visualisation to find and explain patterns of missing values: a case study with NHS hospital episode statistics data |
title_full_unstemmed | Using set visualisation to find and explain patterns of missing values: a case study with NHS hospital episode statistics data |
title_short | Using set visualisation to find and explain patterns of missing values: a case study with NHS hospital episode statistics data |
title_sort | using set visualisation to find and explain patterns of missing values: a case study with nhs hospital episode statistics data |
topic | Health Informatics |
url | https://www.ncbi.nlm.nih.gov/pmc/articles/PMC9680176/ https://www.ncbi.nlm.nih.gov/pubmed/36410820 http://dx.doi.org/10.1136/bmjopen-2022-064887 |
work_keys_str_mv | AT ruddleroya usingsetvisualisationtofindandexplainpatternsofmissingvaluesacasestudywithnhshospitalepisodestatisticsdata AT adnanmuhammad usingsetvisualisationtofindandexplainpatternsofmissingvaluesacasestudywithnhshospitalepisodestatisticsdata AT hallmarlous usingsetvisualisationtofindandexplainpatternsofmissingvaluesacasestudywithnhshospitalepisodestatisticsdata |
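
The abstract names three quantities: per-field counts of missing values (the sets), counts of records missing in two fields at once (the set intersections), and an information gain ratio that relates a missingness pattern to the values of another field. The sketch below illustrates those quantities on a toy table; it is not the authors' software. The field names (`site`, `opertn_01`, `opdate_01`), the records and all function names are hypothetical, and `None` is assumed to mark a missing value.

```python
from collections import Counter
from itertools import combinations
from math import log2


def missing_counts(records, fields):
    """Size of each 'missing' set: number of records with no value in a field."""
    return {f: sum(r.get(f) is None for r in records) for f in fields}


def co_missing_counts(records, fields):
    """Pairwise set intersections: records in which both fields are missing."""
    return {
        (a, b): sum(r.get(a) is None and r.get(b) is None for r in records)
        for a, b in combinations(fields, 2)
    }


def entropy(labels):
    """Shannon entropy (in bits) of a list of labels."""
    total = len(labels)
    return -sum((n / total) * log2(n / total) for n in Counter(labels).values())


def gain_ratio(records, target_field, explain_field):
    """Information gain ratio of explain_field for 'target_field is missing'."""
    labels = [r.get(target_field) is None for r in records]
    # Group the missing/present labels by the value of the explaining field.
    groups = {}
    for record, label in zip(records, labels):
        groups.setdefault(record.get(explain_field), []).append(label)
    conditional = sum(len(g) / len(labels) * entropy(g) for g in groups.values())
    split_info = entropy([r.get(explain_field) for r in records])
    gain = entropy(labels) - conditional
    return gain / split_info if split_info > 0 else 0.0


# Hypothetical toy records; None marks a missing value.
records = [
    {"site": "A", "opertn_01": "X1", "opdate_01": None},
    {"site": "A", "opertn_01": "X2", "opdate_01": None},
    {"site": "B", "opertn_01": "X3", "opdate_01": "2020-01-01"},
    {"site": "B", "opertn_01": None, "opdate_01": None},
]
fields = ["opertn_01", "opdate_01"]

counts = missing_counts(records, fields)
pairs = co_missing_counts(records, fields)
ratio = gain_ratio(records, "opdate_01", "site")
```

In this toy data every record from site `A` lacks an operation date, so `site` carries some information about that gap and the gain ratio is above zero. A field whose values are unrelated to the gap would score close to zero.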