Analyzing the fine structure of distributions

One aim of data mining is the identification of interesting structures in data. For better analytical results, the basic properties of an empirical distribution, such as skewness and possible clipping, i.e. hard limits in value ranges, need to be assessed. Of particular interest is the question of w...

Bibliographic Details
Main authors: Thrun, Michael C., Gehlert, Tino, Ultsch, Alfred
Format: Online Article Text
Language: English
Published: Public Library of Science 2020
Subjects:
Online access: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC7556505/
https://www.ncbi.nlm.nih.gov/pubmed/33052923
http://dx.doi.org/10.1371/journal.pone.0238835
_version_ 1783594232173297664
author Thrun, Michael C.
Gehlert, Tino
Ultsch, Alfred
author_facet Thrun, Michael C.
Gehlert, Tino
Ultsch, Alfred
author_sort Thrun, Michael C.
collection PubMed
description One aim of data mining is the identification of interesting structures in data. For better analytical results, the basic properties of an empirical distribution, such as skewness and possible clipping, i.e. hard limits in value ranges, need to be assessed. Of particular interest is the question of whether the data originate from one process or contain subsets related to different states of the data-producing process. Data visualization tools should deliver a clear picture of the univariate probability density function (PDF) for each feature. Visualization tools for PDFs typically use kernel density estimates and include both the classical histogram and modern tools such as ridgeline plots, bean plots and violin plots. If density estimation parameters remain at their default settings, conventional methods pose several problems when visualizing the PDF of uniform, multimodal, and skewed distributions and of distributions with clipped data. For that reason, a new visualization tool called the mirrored density plot (MD plot), which is specifically designed to discover interesting structures in continuous features, is proposed. The MD plot does not require adjusting any parameters of density estimation, which may make it particularly compelling to non-experts. The visualization tools in question are evaluated against statistical tests with regard to typical challenges of exploratory distribution analysis. The results of the evaluation are presented using bimodal Gaussian, skewed distributions and several features with already published PDFs. In an exploratory data analysis of 12 features describing quarterly financial statements, where statistical testing poses a great difficulty, only the MD plots can identify the structure of their PDFs. In sum, the MD plot outperforms the above-mentioned methods.
format Online
Article
Text
id pubmed-7556505
institution National Center for Biotechnology Information
language English
publishDate 2020
publisher Public Library of Science
record_format MEDLINE/PubMed
spelling pubmed-7556505 2020-10-21 Analyzing the fine structure of distributions Thrun, Michael C. Gehlert, Tino Ultsch, Alfred PLoS One Research Article One aim of data mining is the identification of interesting structures in data. For better analytical results, the basic properties of an empirical distribution, such as skewness and possible clipping, i.e. hard limits in value ranges, need to be assessed. Of particular interest is the question of whether the data originate from one process or contain subsets related to different states of the data-producing process. Data visualization tools should deliver a clear picture of the univariate probability density function (PDF) for each feature. Visualization tools for PDFs typically use kernel density estimates and include both the classical histogram and modern tools such as ridgeline plots, bean plots and violin plots. If density estimation parameters remain at their default settings, conventional methods pose several problems when visualizing the PDF of uniform, multimodal, and skewed distributions and of distributions with clipped data. For that reason, a new visualization tool called the mirrored density plot (MD plot), which is specifically designed to discover interesting structures in continuous features, is proposed. The MD plot does not require adjusting any parameters of density estimation, which may make it particularly compelling to non-experts. The visualization tools in question are evaluated against statistical tests with regard to typical challenges of exploratory distribution analysis. The results of the evaluation are presented using bimodal Gaussian, skewed distributions and several features with already published PDFs. In an exploratory data analysis of 12 features describing quarterly financial statements, where statistical testing poses a great difficulty, only the MD plots can identify the structure of their PDFs. In sum, the MD plot outperforms the above-mentioned methods. Public Library of Science 2020-10-14 /pmc/articles/PMC7556505/ /pubmed/33052923 http://dx.doi.org/10.1371/journal.pone.0238835 Text en © 2020 Thrun et al http://creativecommons.org/licenses/by/4.0/ This is an open access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.
spellingShingle Research Article
Thrun, Michael C.
Gehlert, Tino
Ultsch, Alfred
Analyzing the fine structure of distributions
title Analyzing the fine structure of distributions
title_full Analyzing the fine structure of distributions
title_fullStr Analyzing the fine structure of distributions
title_full_unstemmed Analyzing the fine structure of distributions
title_short Analyzing the fine structure of distributions
title_sort analyzing the fine structure of distributions
topic Research Article
url https://www.ncbi.nlm.nih.gov/pmc/articles/PMC7556505/
https://www.ncbi.nlm.nih.gov/pubmed/33052923
http://dx.doi.org/10.1371/journal.pone.0238835
work_keys_str_mv AT thrunmichaelc analyzingthefinestructureofdistributions
AT gehlerttino analyzingthefinestructureofdistributions
AT ultschalfred analyzingthefinestructureofdistributions