Predicting Causes of Data Quality Issues in a Clinical Data Research Network

Clinical data research networks (CDRNs) invest substantially in identifying and investigating data quality problems. While identification is largely automated, the investigation and resolution are carried out manually at individual institutions. In the PEDSnet CDRN, we found that only approximately...

Bibliographic Details
Main authors: Khare, Ritu, Ruth, Byron J., Miller, Matthew, Tucker, Joshua, Utidjian, Levon H., Razzaghi, Hanieh, Patibandla, Nandan, Burrows, Evanette K., Bailey, L. Charles
Format: Online Article Text
Language: English
Published: American Medical Informatics Association 2018
Subjects:
Online access: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC5961770/
https://www.ncbi.nlm.nih.gov/pubmed/29888053
author Khare, Ritu
Ruth, Byron J.
Miller, Matthew
Tucker, Joshua
Utidjian, Levon H.
Razzaghi, Hanieh
Patibandla, Nandan
Burrows, Evanette K.
Bailey, L. Charles
author_sort Khare, Ritu
collection PubMed
description Clinical data research networks (CDRNs) invest substantially in identifying and investigating data quality problems. While identification is largely automated, the investigation and resolution are carried out manually at individual institutions. In the PEDSnet CDRN, we found that only approximately 35% of the identified data quality issues are resolvable as they are caused by errors in the extract-transform-load (ETL) code. Nonetheless, with no prior knowledge of issue causes, partner institutions end up spending significant time investigating issues that represent either inherent data characteristics or false alarms. This work investigates whether the causes (ETL, Characteristic, or False alarm) can be predicted before spending time investigating issues. We trained a classifier on the metadata from 10,281 real-world data quality issues, and achieved a cause prediction F1-measure of up to 90%. While initially tested on PEDSnet, the proposed methodology is applicable to other CDRNs facing similar bottlenecks in handling data quality results.
format Online
Article
Text
id pubmed-5961770
institution National Center for Biotechnology Information
language English
publishDate 2018
publisher American Medical Informatics Association
record_format MEDLINE/PubMed
spelling pubmed-59617702018-06-08 Predicting Causes of Data Quality Issues in a Clinical Data Research Network Khare, Ritu Ruth, Byron J. Miller, Matthew Tucker, Joshua Utidjian, Levon H. Razzaghi, Hanieh Patibandla, Nandan Burrows, Evanette K. Bailey, L. Charles AMIA Jt Summits Transl Sci Proc Articles Clinical data research networks (CDRNs) invest substantially in identifying and investigating data quality problems. While identification is largely automated, the investigation and resolution are carried out manually at individual institutions. In the PEDSnet CDRN, we found that only approximately 35% of the identified data quality issues are resolvable as they are caused by errors in the extract-transform-load (ETL) code. Nonetheless, with no prior knowledge of issue causes, partner institutions end up spending significant time investigating issues that represent either inherent data characteristics or false alarms. This work investigates whether the causes (ETL, Characteristic, or False alarm) can be predicted before spending time investigating issues. We trained a classifier on the metadata from 10,281 real-world data quality issues, and achieved a cause prediction F1-measure of up to 90%. While initially tested on PEDSnet, the proposed methodology is applicable to other CDRNs facing similar bottlenecks in handling data quality results. American Medical Informatics Association 2018-05-18 /pmc/articles/PMC5961770/ /pubmed/29888053 Text en ©2018 AMIA - All rights reserved. This is an Open Access article: verbatim copying and redistribution of this article are permitted in all media for any purpose
title Predicting Causes of Data Quality Issues in a Clinical Data Research Network
topic Articles