Predicting Causes of Data Quality Issues in a Clinical Data Research Network
Clinical data research networks (CDRNs) invest substantially in identifying and investigating data quality problems. While identification is largely automated, the investigation and resolution are carried out manually at individual institutions. In the PEDSnet CDRN, we found that only approximately...
Main authors: | Khare, Ritu; Ruth, Byron J.; Miller, Matthew; Tucker, Joshua; Utidjian, Levon H.; Razzaghi, Hanieh; Patibandla, Nandan; Burrows, Evanette K.; Bailey, L. Charles |
---|---|
Format: | Online Article Text |
Language: | English |
Published: | American Medical Informatics Association, 2018 |
Subjects: | Articles |
Online access: | https://www.ncbi.nlm.nih.gov/pmc/articles/PMC5961770/ https://www.ncbi.nlm.nih.gov/pubmed/29888053 |
_version_ | 1783324776454946816 |
---|---|
author | Khare, Ritu; Ruth, Byron J.; Miller, Matthew; Tucker, Joshua; Utidjian, Levon H.; Razzaghi, Hanieh; Patibandla, Nandan; Burrows, Evanette K.; Bailey, L. Charles |
author_facet | Khare, Ritu; Ruth, Byron J.; Miller, Matthew; Tucker, Joshua; Utidjian, Levon H.; Razzaghi, Hanieh; Patibandla, Nandan; Burrows, Evanette K.; Bailey, L. Charles |
author_sort | Khare, Ritu |
collection | PubMed |
description | Clinical data research networks (CDRNs) invest substantially in identifying and investigating data quality problems. While identification is largely automated, the investigation and resolution are carried out manually at individual institutions. In the PEDSnet CDRN, we found that only approximately 35% of the identified data quality issues are resolvable as they are caused by errors in the extract-transform-load (ETL) code. Nonetheless, with no prior knowledge of issue causes, partner institutions end up spending significant time investigating issues that represent either inherent data characteristics or false alarms. This work investigates whether the causes (ETL, Characteristic, or False alarm) can be predicted before spending time investigating issues. We trained a classifier on the metadata from 10,281 real-world data quality issues, and achieved a cause prediction F1-measure of up to 90%. While initially tested on PEDSnet, the proposed methodology is applicable to other CDRNs facing similar bottlenecks in handling data quality results. |
format | Online Article Text |
id | pubmed-5961770 |
institution | National Center for Biotechnology Information |
language | English |
publishDate | 2018 |
publisher | American Medical Informatics Association |
record_format | MEDLINE/PubMed |
spelling | pubmed-5961770 2018-06-08 Predicting Causes of Data Quality Issues in a Clinical Data Research Network Khare, Ritu; Ruth, Byron J.; Miller, Matthew; Tucker, Joshua; Utidjian, Levon H.; Razzaghi, Hanieh; Patibandla, Nandan; Burrows, Evanette K.; Bailey, L. Charles AMIA Jt Summits Transl Sci Proc Articles Clinical data research networks (CDRNs) invest substantially in identifying and investigating data quality problems. While identification is largely automated, the investigation and resolution are carried out manually at individual institutions. In the PEDSnet CDRN, we found that only approximately 35% of the identified data quality issues are resolvable as they are caused by errors in the extract-transform-load (ETL) code. Nonetheless, with no prior knowledge of issue causes, partner institutions end up spending significant time investigating issues that represent either inherent data characteristics or false alarms. This work investigates whether the causes (ETL, Characteristic, or False alarm) can be predicted before spending time investigating issues. We trained a classifier on the metadata from 10,281 real-world data quality issues, and achieved a cause prediction F1-measure of up to 90%. While initially tested on PEDSnet, the proposed methodology is applicable to other CDRNs facing similar bottlenecks in handling data quality results. American Medical Informatics Association 2018-05-18 /pmc/articles/PMC5961770/ /pubmed/29888053 Text en ©2018 AMIA - All rights reserved. This is an Open Access article: verbatim copying and redistribution of this article are permitted in all media for any purpose |
title | Predicting Causes of Data Quality Issues in a Clinical Data Research Network |
title_full | Predicting Causes of Data Quality Issues in a Clinical Data Research Network |
title_fullStr | Predicting Causes of Data Quality Issues in a Clinical Data Research Network |
title_full_unstemmed | Predicting Causes of Data Quality Issues in a Clinical Data Research Network |
title_short | Predicting Causes of Data Quality Issues in a Clinical Data Research Network |
title_sort | predicting causes of data quality issues in a clinical data research network |
topic | Articles |
url | https://www.ncbi.nlm.nih.gov/pmc/articles/PMC5961770/ https://www.ncbi.nlm.nih.gov/pubmed/29888053 |
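
The description above reports a cause prediction F1-measure over three cause labels (ETL, Characteristic, False alarm). As a minimal sketch of how such a score can be computed, the Python below derives per-class F1 and an unweighted (macro) average from true and predicted labels. The record does not say which averaging the authors used, so the macro average is an assumption, and the function names and toy labels are hypothetical illustrations rather than data from the study.

```python
from collections import Counter

# The three cause labels named in the abstract.
CAUSES = ("ETL", "Characteristic", "False alarm")


def per_class_f1(y_true, y_pred, labels=CAUSES):
    """Return {label: F1} computed from paired true/predicted labels."""
    tp, fp, fn = Counter(), Counter(), Counter()
    for truth, guess in zip(y_true, y_pred):
        if truth == guess:
            tp[truth] += 1
        else:
            fp[guess] += 1  # predicted this label, but it was wrong
            fn[truth] += 1  # missed the true label
    scores = {}
    for label in labels:
        # F1 = 2*TP / (2*TP + FP + FN); defined as 0.0 when the label never occurs.
        denom = 2 * tp[label] + fp[label] + fn[label]
        scores[label] = 2 * tp[label] / denom if denom else 0.0
    return scores


def macro_f1(y_true, y_pred, labels=CAUSES):
    """Unweighted mean of per-class F1 (an assumed averaging choice)."""
    scores = per_class_f1(y_true, y_pred, labels)
    return sum(scores.values()) / len(labels)


# Toy labels for illustration only; not taken from the PEDSnet data.
toy_true = ["ETL", "ETL", "Characteristic", "False alarm", "Characteristic", "ETL"]
toy_pred = ["ETL", "Characteristic", "Characteristic", "False alarm", "Characteristic", "ETL"]

print(per_class_f1(toy_true, toy_pred))
print(macro_f1(toy_true, toy_pred))
```

A macro average weights the three causes equally; if the classes are imbalanced (the abstract notes only about 35% of issues are ETL-caused), a support-weighted average would give a different number.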