A data driven learning approach for the assessment of data quality

BACKGROUND: Data quality assessment is important but complex and task dependent. Identifying suitable measurement methods and reference ranges for assessing their results is challenging. Manually inspecting the measurement results and current data driven approaches for learning which results indicat...

Bibliographic Details
Main authors: Tute, Erik, Ganapathy, Nagarajan, Wulff, Antje
Format: Online Article Text
Language: English
Published: BioMed Central 2021
Subjects:
Online access: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC8561935/
https://www.ncbi.nlm.nih.gov/pubmed/34724930
http://dx.doi.org/10.1186/s12911-021-01656-x
_version_ 1784593168773152768
author Tute, Erik
Ganapathy, Nagarajan
Wulff, Antje
author_facet Tute, Erik
Ganapathy, Nagarajan
Wulff, Antje
author_sort Tute, Erik
collection PubMed
description BACKGROUND: Data quality assessment is important but complex and task dependent. Identifying suitable measurement methods and reference ranges for assessing their results is challenging. Both manual inspection of measurement results and current data driven approaches for learning which results indicate data quality issues have considerable limitations, e.g. in identifying task dependent thresholds for measurement results that indicate data quality issues. OBJECTIVES: To explore the applicability and potential benefits of a data driven approach to learn task dependent knowledge about suitable measurement methods and assessment of their results. Such knowledge could be useful for others to determine whether a local data stock is suitable for a given task. METHODS: We started by creating artificial data with previously defined data quality issues and applied a set of generic measurement methods to this data (e.g. a method that counts the number of values in a certain variable or computes their mean). We trained decision trees on the exported measurement methods’ results and corresponding outcome data (data that indicated the data’s suitability for a use case). For evaluation, we derived rules for potential measurement methods and reference values from the decision trees and compared these regarding their coverage of the true data quality issues artificially created in the dataset. Three researchers independently derived these rules, one with knowledge of the data quality issues present and two without. RESULTS: Our self-trained decision trees were able to indicate rules for 12 of 19 previously defined data quality issues. Learned knowledge about measurement methods and their assessment was complementary to manual interpretation of measurement methods’ results. CONCLUSIONS: Our data driven approach derives sensible knowledge for task dependent data quality assessment and complements other current approaches. Based on labeled measurement methods’ results as training data, our approach successfully suggested applicable rules for checking data quality characteristics that determine whether a dataset is suitable for a given task. SUPPLEMENTARY INFORMATION: The online version contains supplementary material available at 10.1186/s12911-021-01656-x.
format Online
Article
Text
id pubmed-8561935
institution National Center for Biotechnology Information
language English
publishDate 2021
publisher BioMed Central
record_format MEDLINE/PubMed
spelling pubmed-8561935 2021-11-03 A data driven learning approach for the assessment of data quality Tute, Erik Ganapathy, Nagarajan Wulff, Antje BMC Med Inform Decis Mak Research BioMed Central 2021-11-01 /pmc/articles/PMC8561935/ /pubmed/34724930 http://dx.doi.org/10.1186/s12911-021-01656-x Text en © The Author(s) 2021 https://creativecommons.org/licenses/by/4.0/ Open Access. This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated in a credit line to the data.
spellingShingle Research
Tute, Erik
Ganapathy, Nagarajan
Wulff, Antje
A data driven learning approach for the assessment of data quality
title A data driven learning approach for the assessment of data quality
title_full A data driven learning approach for the assessment of data quality
title_fullStr A data driven learning approach for the assessment of data quality
title_full_unstemmed A data driven learning approach for the assessment of data quality
title_short A data driven learning approach for the assessment of data quality
title_sort data driven learning approach for the assessment of data quality
topic Research
url https://www.ncbi.nlm.nih.gov/pmc/articles/PMC8561935/
https://www.ncbi.nlm.nih.gov/pubmed/34724930
http://dx.doi.org/10.1186/s12911-021-01656-x
work_keys_str_mv AT tuteerik adatadrivenlearningapproachfortheassessmentofdataquality
AT ganapathynagarajan adatadrivenlearningapproachfortheassessmentofdataquality
AT wulffantje adatadrivenlearningapproachfortheassessmentofdataquality
AT tuteerik datadrivenlearningapproachfortheassessmentofdataquality
AT ganapathynagarajan datadrivenlearningapproachfortheassessmentofdataquality
AT wulffantje datadrivenlearningapproachfortheassessmentofdataquality
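
The METHODS paragraph in the description field above can be illustrated with a small sketch. This is not the authors' implementation: it assumes a single injected issue (missing values), uses only the two generic measurement methods the abstract names (value count and mean), and swaps a full decision tree for a depth-1 tree (a decision stump) written in plain Python. All function and variable names are hypothetical.

```python
import random
import statistics


def measure(values):
    """Generic measurement methods: count of present values and their mean."""
    present = [v for v in values if v is not None]
    return {
        "count": len(present),
        "mean": statistics.fmean(present) if present else 0.0,
    }


def make_dataset(rng, n_rows=100, missing_rate=0.0):
    """Artificial variable with an injected missingness issue (assumption)."""
    return [
        None if rng.random() < missing_rate else rng.gauss(37.0, 0.5)
        for _ in range(n_rows)
    ]


def gini(labels):
    """Gini impurity of a list of 0/1 labels."""
    if not labels:
        return 0.0
    p = sum(labels) / len(labels)
    return 2 * p * (1 - p)


def best_split(rows, labels):
    """Return (impurity, feature, threshold) of the best single split."""
    best = None
    n = len(rows)
    for feature in rows[0]:
        values = sorted({r[feature] for r in rows})
        # Candidate thresholds are midpoints between adjacent observed values.
        for lo, hi in zip(values, values[1:]):
            thr = (lo + hi) / 2
            left = [y for r, y in zip(rows, labels) if r[feature] <= thr]
            right = [y for r, y in zip(rows, labels) if r[feature] > thr]
            score = (len(left) * gini(left) + len(right) * gini(right)) / n
            if best is None or score < best[0]:
                best = (score, feature, thr)
    return best


rng = random.Random(0)
rows, labels = [], []
for i in range(40):
    clean = i % 2 == 0
    # Clean datasets lose at most 5% of values; flawed ones lose 40-60%.
    rate = rng.uniform(0.0, 0.05) if clean else rng.uniform(0.4, 0.6)
    rows.append(measure(make_dataset(rng, missing_rate=rate)))
    labels.append(1 if clean else 0)  # outcome data: suitable for the task?

score, feature, threshold = best_split(rows, labels)

# Read the rule direction off the data instead of assuming it.
above = [y for r, y in zip(rows, labels) if r[feature] > threshold]
suitable_above = sum(above) * 2 > len(above)
relation = ">" if suitable_above else "<="
print(f"rule: suitable if {feature} {relation} {threshold:.1f} "
      f"(weighted Gini {score:.3f})")
```

The learned split plays the role of a derived rule: the chosen feature is the suggested measurement method and the threshold is its reference value. A real study would use many issue types and deeper trees, so several such rules would be read off along each path.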