
Evaluation of robust outlier detection methods for zero-inflated complex data

Outlier detection can be seen as a pre-processing step for locating data points in a data sample, which do not conform to the majority of observations. Various techniques and methods for outlier detection can be found in the literature dealing with different types of data. However, many data sets ar...


Bibliographic Details
Main Authors:	Templ, M., Gussenbauer, J., Filzmoser, P.
Format:	Online Article Text
Language:	English
Published:	Taylor & Francis 2019
Subjects:	Articles
Online Access:	https://www.ncbi.nlm.nih.gov/pmc/articles/PMC9041731/ https://www.ncbi.nlm.nih.gov/pubmed/35707025 http://dx.doi.org/10.1080/02664763.2019.1671961

_version_	1784694565739954176
author	Templ, M. Gussenbauer, J. Filzmoser, P.
collection	PubMed
description	Outlier detection can be seen as a pre-processing step for locating data points in a data sample, which do not conform to the majority of observations. Various techniques and methods for outlier detection can be found in the literature dealing with different types of data. However, many data sets are inflated by true zeros and, in addition, some components/variables might be of compositional nature. Important examples of such data sets are the Structural Earnings Survey, the Structural Business Statistics, the European Statistics on Income and Living Conditions, tax data or – as in this contribution – household expenditure data which are used, for example, to estimate the Purchase Power Parity of a country. In this work, robust univariate and multivariate outlier detection methods are compared by a complex simulation study that considers various challenges included in data sets, namely structural (true) zeros, missing values, and compositional variables. These circumstances make it difficult or impossible to flag true outliers and influential observations by well-known outlier detection methods. Our aim is to assess the performance of outlier detection methods in terms of their effectiveness to identify outliers when applied to challenging data sets such as the household expenditures data surveyed all over the world. Moreover, different methods are evaluated through a close-to-reality simulation study. Differences in performance of univariate and multivariate robust techniques for outlier detection and their shortcomings are reported. We found that robust multivariate methods outperform robust univariate methods. The best performing methods in finding the outliers and in providing a low false discovery rate were found to be the generalized S estimators (GSE), the BACON-EEM algorithm and a compositional method (CoDa-Cov). In addition, these methods performed also best when the outliers are imputed based on the corresponding outlier detection method and indicators are estimated from the data sets.
format	Online Article Text
id	pubmed-9041731
institution	National Center for Biotechnology Information
language	English
publishDate	2019
publisher	Taylor & Francis
record_format	MEDLINE/PubMed
spelling	pubmed-9041731 2022-06-14 Evaluation of robust outlier detection methods for zero-inflated complex data Templ, M. Gussenbauer, J. Filzmoser, P. J Appl Stat Articles Taylor & Francis 2019-09-27 /pmc/articles/PMC9041731/ /pubmed/35707025 http://dx.doi.org/10.1080/02664763.2019.1671961 Text en © 2019 The Author(s). Published by Informa UK Limited, trading as Taylor & Francis Group. This is an Open Access article distributed under the terms of the Creative Commons Attribution-NonCommercial-NoDerivatives License (https://creativecommons.org/licenses/by-nc-nd/4.0/), which permits non-commercial re-use, distribution, and reproduction in any medium, provided the original work is properly cited, and is not altered, transformed, or built upon in any way.
title	Evaluation of robust outlier detection methods for zero-inflated complex data
topic	Articles
