Qualitative Data Clustering to Detect Outliers

Detecting outliers is a widely studied problem in many disciplines, including statistics, data mining, and machine learning. All anomaly detection activities are aimed at identifying cases of unusual behavior compared to most observations. There are many methods to deal with this issue, which are ap...

Bibliographic Details
Main authors: Nowak-Brzezińska, Agnieszka; Łazarz, Weronika
Format: Online Article Text
Language: English
Published: MDPI 2021
Subjects:
Online access: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC8307081/
https://www.ncbi.nlm.nih.gov/pubmed/34356410
http://dx.doi.org/10.3390/e23070869
_version_ 1783727965214867456
author Nowak-Brzezińska, Agnieszka
Łazarz, Weronika
author_facet Nowak-Brzezińska, Agnieszka
Łazarz, Weronika
author_sort Nowak-Brzezińska, Agnieszka
collection PubMed
description Detecting outliers is a widely studied problem in many disciplines, including statistics, data mining, and machine learning. All anomaly detection activities aim to identify cases of unusual behavior compared to most observations. Many methods address this problem, and their applicability depends on the size of the data set, the way it is stored, and the type of attributes and their values. Most of them focus on traditional datasets with a large number of quantitative attributes. While solutions for detecting outliers in quantitative sets are numerous, detecting outliers in data containing only qualitative variables remains a problem with few research solutions. This article compares three categorical data clustering algorithms: the K-modes algorithm, derived from MacQueen’s K-means algorithm, and the [Formula: see text] and [Formula: see text] algorithms. The comparison concerns how each algorithm divides the set into clusters and, in particular, which outliers it detects. The authors analyzed the clusters detected by the indicated algorithms on several datasets that differ in the number of objects and variables, and ran experiments on the algorithms' parameters. The study made it possible to check whether the algorithms detect outliers in the data similarly and how much they depend on their individual parameters and on the parameters of the set, such as the number of variables, tuples, and categories of a qualitative variable.
format Online
Article
Text
id pubmed-8307081
institution National Center for Biotechnology Information
language English
publishDate 2021
publisher MDPI
record_format MEDLINE/PubMed
spelling pubmed-8307081 2021-07-25 Qualitative Data Clustering to Detect Outliers Nowak-Brzezińska, Agnieszka Łazarz, Weronika Entropy (Basel) Article MDPI 2021-07-07 /pmc/articles/PMC8307081/ /pubmed/34356410 http://dx.doi.org/10.3390/e23070869 Text en © 2021 by the authors. https://creativecommons.org/licenses/by/4.0/ Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).
spellingShingle Article
Nowak-Brzezińska, Agnieszka
Łazarz, Weronika
Qualitative Data Clustering to Detect Outliers
title Qualitative Data Clustering to Detect Outliers
title_full Qualitative Data Clustering to Detect Outliers
title_fullStr Qualitative Data Clustering to Detect Outliers
title_full_unstemmed Qualitative Data Clustering to Detect Outliers
title_short Qualitative Data Clustering to Detect Outliers
title_sort qualitative data clustering to detect outliers
topic Article
url https://www.ncbi.nlm.nih.gov/pmc/articles/PMC8307081/
https://www.ncbi.nlm.nih.gov/pubmed/34356410
http://dx.doi.org/10.3390/e23070869
work_keys_str_mv AT nowakbrzezinskaagnieszka qualitativedataclusteringtodetectoutliers
AT łazarzweronika qualitativedataclusteringtodetectoutliers