Predicting Causes of Data Quality Issues in a Clinical Data Research Network

Bibliographic Details
Main authors: Khare, Ritu; Ruth, Byron J.; Miller, Matthew; Tucker, Joshua; Utidjian, Levon H.; Razzaghi, Hanieh; Patibandla, Nandan; Burrows, Evanette K.; Bailey, L. Charles
Format: Online Article Text
Language: English
Published: American Medical Informatics Association 2018
Subjects:
Online access: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC5961770/
https://www.ncbi.nlm.nih.gov/pubmed/29888053
Description
Summary: Clinical data research networks (CDRNs) invest substantially in identifying and investigating data quality problems. While identification is largely automated, the investigation and resolution are carried out manually at individual institutions. In the PEDSnet CDRN, we found that only approximately 35% of the identified data quality issues are resolvable, as they are caused by errors in the extract-transform-load (ETL) code. Nonetheless, with no prior knowledge of issue causes, partner institutions end up spending significant time investigating issues that represent either inherent data characteristics or false alarms. This work investigates whether the causes (ETL, Characteristic, or False alarm) can be predicted before spending time investigating issues. We trained a classifier on the metadata from 10,281 real-world data quality issues, and achieved a cause prediction F1-measure of up to 90%. While initially tested on PEDSnet, the proposed methodology is applicable to other CDRNs facing similar bottlenecks in handling data quality results.