
Detecting Errors with Zero-Shot Learning

Error detection is a critical step in data cleaning. Most traditional error detection methods rely on rules and external information, which are costly to obtain, especially for large-scale data. Recently, with advances in deep learning, some researchers have turned to learning t...


Bibliographic Details
Main authors:	Wu, Xiaoyu; Wang, Ning
Format:	Online Article Text
Language:	English
Published:	MDPI 2022
Subjects:	Article
Online access:	https://www.ncbi.nlm.nih.gov/pmc/articles/PMC9317027/ https://www.ncbi.nlm.nih.gov/pubmed/35885159 http://dx.doi.org/10.3390/e24070936

_version_	1784754957604356096
author	Wu, Xiaoyu; Wang, Ning
collection	PubMed
description	Error detection is a critical step in data cleaning. Most traditional error detection methods rely on rules and external information, which are costly to obtain, especially for large-scale data. Recently, with advances in deep learning, some researchers have turned to learning the semantic distribution of data for error detection; however, the low error rate in real datasets makes it hard to collect negative samples for training supervised deep learning models. Most existing deep-learning-based error detection algorithms address the class imbalance problem through data augmentation. Because negative samples are inadequately sampled, the features learned by those methods may be biased. In this paper, we propose an AEGAN (Auto-Encoder Generative Adversarial Network)-based deep learning model named SAT-GAN (Self-Attention Generative Adversarial Network) to detect errors in relational datasets. By combining the self-attention mechanism with a pre-trained language model, our model captures semantic features of the dataset, specifically the functional dependencies between attributes, so that SAT-GAN needs no rules or constraints to identify inconsistent data. To address the lack of negative samples, we propose training our model via zero-shot learning. As a model tailored to clean data, SAT-GAN recognizes erroneous data as outliers by learning the latent features of clean data. In our evaluation, SAT-GAN achieves an average F1-score of 0.95 on five datasets, an improvement of at least 46.2% in F1-score over rule-based methods, and outperforms state-of-the-art deep learning approaches in the absence of rules and negative samples.
format	Online Article Text
id	pubmed-9317027
institution	National Center for Biotechnology Information
language	English
publishDate	2022
publisher	MDPI
record_format	MEDLINE/PubMed
spelling	pubmed-9317027 2022-07-27 Detecting Errors with Zero-Shot Learning Wu, Xiaoyu Wang, Ning Entropy (Basel) Article MDPI 2022-07-06 /pmc/articles/PMC9317027/ /pubmed/35885159 http://dx.doi.org/10.3390/e24070936 Text en © 2022 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).
title	Detecting Errors with Zero-Shot Learning
topic	Article
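
The description above outlines detecting errors as outliers after learning only from clean data. The following is a minimal sketch of that general idea, not the authors' SAT-GAN: a simple mean and standard deviation profile stands in for the learned latent features, and the function names, sample values, and the threshold of 3 standard deviations are all hypothetical. It also shows how an F1-score, a common metric for evaluating error detection, is computed from the resulting flags.

```python
from statistics import mean, stdev


def fit_clean_profile(clean_values):
    """Learn a simple profile (mean, sample std) from clean values only.

    Assumption: this stands in for the latent features a deep model would learn.
    """
    return mean(clean_values), stdev(clean_values)


def is_outlier(value, profile, threshold=3.0):
    """Flag a value that lies more than `threshold` std devs from the clean mean."""
    mu, sigma = profile
    return abs(value - mu) > threshold * sigma


def f1_score(predicted, actual):
    """Harmonic mean of precision and recall over boolean error flags."""
    tp = sum(p and a for p, a in zip(predicted, actual))
    fp = sum(p and not a for p, a in zip(predicted, actual))
    fn = sum(a and not p for p, a in zip(predicted, actual))
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


# Hypothetical data: no erroneous examples are used when fitting the profile.
clean_ages = [31, 29, 35, 40, 33, 38, 30, 36]
profile = fit_clean_profile(clean_ages)

# Unseen values, two of which are errors (350 and -5).
test_ages = [34, 350, 28, -5, 39]
truth = [False, True, False, True, False]

flags = [is_outlier(v, profile) for v in test_ages]
score = f1_score(flags, truth)
print(flags, score)
```

Because the profile is fitted on clean values alone, no negative samples are needed at training time; erroneous values are only encountered at detection time.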
