Method for Data Quality Assessment of Synthetic Industrial Data

Bibliographic Details
Main authors:	Iantovics, László Barna; Enăchescu, Călin
Format:	Online Article Text
Language:	English
Published:	MDPI 2022
Subjects:	Article
Online access:	https://www.ncbi.nlm.nih.gov/pmc/articles/PMC8876977/ https://www.ncbi.nlm.nih.gov/pubmed/35214509 http://dx.doi.org/10.3390/s22041608

collection	PubMed
description	Sometimes it is difficult, or even impossible, to acquire the real data from sensors and machines that research requires. Modern industrial platforms, for example, are frequently reluctant to share data. In such situations, the only option is to work with synthetic data obtained by simulation. One limitation of simulated data is that they may be inappropriate for research because of poor quality or limited quantity. In such cases, algorithms designed and tested on those data do not give credible results. To avoid such situations, we consider that mathematically grounded data-quality assessments should be designed according to the specific type of problem that must be solved. In this paper, we approach a multivariate type of prediction whose results can ultimately be used for binary classification. We propose the use of a mathematically grounded data-quality assessment, which includes, among other things, an analysis of the predictive power of the independent variables used for prediction. We present the assumptions that the synthetic data should satisfy. Different threshold values are established by a human assessor. In the case of research data, if all the assumptions pass, then we can consider that the data are appropriate for research and can be used even with other methods for solving the same type of problem. The applied method finally delivers a classification table to which any indicator of classification quality can be applied, such as sensitivity, specificity, accuracy, F1 score, area under the curve (AUC), receiver operating characteristic (ROC), true skill statistic (TSS) and Kappa coefficient. The values of these indicators make it possible to compare the results obtained with the considered method against the results of any other method applied to the same type of problem. For evaluation and validation purposes, we performed an experimental case study on a novel synthetic dataset provided by the well-known UCI data repository.
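As an illustration of how several of the indicators named in the description can be read off a binary classification table, the following is a minimal sketch. It uses the standard textbook definitions of sensitivity, specificity, accuracy, F1 score, TSS and Cohen's Kappa; it is not taken from the article itself. The names `ClassificationTable` and `quality_indicators` are hypothetical, and the counts in the usage section are invented purely for demonstration.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass(frozen=True)
class ClassificationTable:
    """2x2 classification table (hypothetical helper, not from the article)."""
    tp: int  # true positives
    fp: int  # false positives
    fn: int  # false negatives
    tn: int  # true negatives

    @property
    def total(self) -> int:
        return self.tp + self.fp + self.fn + self.tn


def quality_indicators(t: ClassificationTable) -> dict[str, float]:
    """Compute standard indicators from a binary classification table.

    Assumes both classes occur among the actual labels and that chance
    agreement is below 1; otherwise a division by zero is raised.
    """
    sensitivity = t.tp / (t.tp + t.fn)        # true positive rate
    specificity = t.tn / (t.tn + t.fp)        # true negative rate
    accuracy = (t.tp + t.tn) / t.total
    f1 = 2 * t.tp / (2 * t.tp + t.fp + t.fn)
    tss = sensitivity + specificity - 1       # true skill statistic

    # Cohen's Kappa: observed agreement corrected for chance agreement.
    p_observed = accuracy
    p_chance = (
        (t.tp + t.fn) * (t.tp + t.fp) + (t.tn + t.fp) * (t.tn + t.fn)
    ) / t.total**2
    kappa = (p_observed - p_chance) / (1 - p_chance)

    return {
        "sensitivity": sensitivity,
        "specificity": specificity,
        "accuracy": accuracy,
        "f1": f1,
        "tss": tss,
        "kappa": kappa,
    }


if __name__ == "__main__":
    # Invented counts, for demonstration only.
    table = ClassificationTable(tp=40, fp=10, fn=5, tn=45)
    for name, value in quality_indicators(table).items():
        print(f"{name}: {value:.4f}")
```

AUC and ROC are left out of the sketch because they need the underlying prediction scores across thresholds rather than a single classification table.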
id	pubmed-8876977
institution	National Center for Biotechnology Information
record_format	MEDLINE/PubMed
spelling	pubmed-8876977 2022-02-26 Method for Data Quality Assessment of Synthetic Industrial Data Iantovics, László Barna Enăchescu, Călin Sensors (Basel) Article MDPI 2022-02-18 /pmc/articles/PMC8876977/ /pubmed/35214509 http://dx.doi.org/10.3390/s22041608 Text en © 2022 by the authors. https://creativecommons.org/licenses/by/4.0/ Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).
