
Classification at the accuracy limit: facing the problem of data ambiguity

Data classification, the process of analyzing data and organizing it into categories or clusters, is a fundamental computing task of natural and artificial information processing systems. Both supervised classification and unsupervised clustering work best when the input vectors are distributed over...


Bibliographic Details
Main Authors:	Metzner, Claus; Schilling, Achim; Traxdorf, Maximilian; Tziridis, Konstantin; Maier, Andreas; Schulze, Holger; Krauss, Patrick
Format:	Online Article Text
Language:	English
Published:	Nature Publishing Group UK 2022
Subjects:	Article
Online Access:	https://www.ncbi.nlm.nih.gov/pmc/articles/PMC9772417/ https://www.ncbi.nlm.nih.gov/pubmed/36543849 http://dx.doi.org/10.1038/s41598-022-26498-z

_version_	1784854970646921216
author	Metzner, Claus; Schilling, Achim; Traxdorf, Maximilian; Tziridis, Konstantin; Maier, Andreas; Schulze, Holger; Krauss, Patrick
author_facet	Metzner, Claus; Schilling, Achim; Traxdorf, Maximilian; Tziridis, Konstantin; Maier, Andreas; Schulze, Holger; Krauss, Patrick
author_sort	Metzner, Claus
collection	PubMed
description	Data classification, the process of analyzing data and organizing it into categories or clusters, is a fundamental computing task of natural and artificial information processing systems. Both supervised classification and unsupervised clustering work best when the input vectors are distributed over the data space in a highly non-uniform way. These tasks become challenging, however, in weakly structured data sets, where a significant fraction of data points is located between the regions of high point density. We derive the theoretical limit for classification accuracy that arises from this overlap of data categories. By using a surrogate data generation model with adjustable statistical properties, we show that sufficiently powerful classifiers based on completely different principles, such as perceptrons and Bayesian models, all perform at this universal accuracy limit under ideal training conditions. Remarkably, the accuracy limit is not affected by certain non-linear transformations of the data, even if these transformations are non-reversible and drastically reduce the information content of the input data. We further compare the data embeddings that emerge from supervised and unsupervised training, using the MNIST data set and human EEG recordings during sleep. We find for MNIST that categories are significantly separated not only after supervised training with back-propagation, but also after unsupervised dimensionality reduction. A qualitatively similar cluster enhancement by unsupervised compression is observed for the EEG sleep data, but with a very small overall degree of cluster separation. We conclude that the handwritten digits in MNIST can be considered as 'natural kinds', whereas EEG sleep recordings are a relatively weakly structured data set, so that unsupervised clustering will not necessarily recover the human-defined sleep stages.
format	Online Article Text
id	pubmed-9772417
institution	National Center for Biotechnology Information
language	English
publishDate	2022
publisher	Nature Publishing Group UK
record_format	MEDLINE/PubMed
spelling	pubmed-9772417 2022-12-23 Classification at the accuracy limit: facing the problem of data ambiguity Metzner, Claus; Schilling, Achim; Traxdorf, Maximilian; Tziridis, Konstantin; Maier, Andreas; Schulze, Holger; Krauss, Patrick Sci Rep Article Nature Publishing Group UK 2022-12-21 /pmc/articles/PMC9772417/ /pubmed/36543849 http://dx.doi.org/10.1038/s41598-022-26498-z Text en © The Author(s) 2022 https://creativecommons.org/licenses/by/4.0/ Open Access: This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/.
spellingShingle	Article Metzner, Claus Schilling, Achim Traxdorf, Maximilian Tziridis, Konstantin Maier, Andreas Schulze, Holger Krauss, Patrick Classification at the accuracy limit: facing the problem of data ambiguity
title	Classification at the accuracy limit: facing the problem of data ambiguity
title_full	Classification at the accuracy limit: facing the problem of data ambiguity
title_fullStr	Classification at the accuracy limit: facing the problem of data ambiguity
title_full_unstemmed	Classification at the accuracy limit: facing the problem of data ambiguity
title_short	Classification at the accuracy limit: facing the problem of data ambiguity
title_sort	classification at the accuracy limit: facing the problem of data ambiguity
topic	Article
url	https://www.ncbi.nlm.nih.gov/pmc/articles/PMC9772417/ https://www.ncbi.nlm.nih.gov/pubmed/36543849 http://dx.doi.org/10.1038/s41598-022-26498-z
work_keys_str_mv	AT metznerclaus classificationattheaccuracylimitfacingtheproblemofdataambiguity AT schillingachim classificationattheaccuracylimitfacingtheproblemofdataambiguity AT traxdorfmaximilian classificationattheaccuracylimitfacingtheproblemofdataambiguity AT tziridiskonstantin classificationattheaccuracylimitfacingtheproblemofdataambiguity AT maierandreas classificationattheaccuracylimitfacingtheproblemofdataambiguity AT schulzeholger classificationattheaccuracylimitfacingtheproblemofdataambiguity AT krausspatrick classificationattheaccuracylimitfacingtheproblemofdataambiguity
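
The abstract's central idea, that overlapping data categories impose an upper bound on classification accuracy, can be illustrated with a textbook toy case. The sketch below is not the authors' surrogate data model. It assumes two equally likely one-dimensional Gaussian classes with a shared standard deviation, for which the accuracy limit has the closed form Φ(d / 2σ), where d is the distance between the class means and Φ is the standard normal cumulative distribution. The function names `bayes_accuracy_limit` and `simulate_accuracy` are hypothetical and chosen only for this illustration.

```python
import math
import random


def bayes_accuracy_limit(distance, sigma=1.0):
    """Best achievable accuracy for two equally likely 1-D Gaussian classes.

    Assumption (not from the record): class means are `distance` apart and
    both classes share the standard deviation `sigma`.
    """
    z = distance / (2.0 * sigma)
    # Standard normal CDF expressed through the error function.
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))


def simulate_accuracy(distance, sigma=1.0, n_samples=200_000, seed=0):
    """Empirical accuracy of the optimal threshold rule on sampled data."""
    rng = random.Random(seed)
    correct = 0
    for _ in range(n_samples):
        label = rng.randrange(2)  # equal class priors
        mean = distance / 2.0 if label == 1 else -distance / 2.0
        x = rng.gauss(mean, sigma)
        predicted = 1 if x > 0.0 else 0  # midpoint between the two means
        correct += int(predicted == label)
    return correct / n_samples


if __name__ == "__main__":
    for d in (0.0, 1.0, 2.0, 4.0):
        limit = bayes_accuracy_limit(d)
        measured = simulate_accuracy(d)
        print(f"distance={d:.1f}  limit={limit:.4f}  simulated={measured:.4f}")
```

With fully overlapping classes (distance 0) the limit is chance level, 0.5, and it approaches 1 as the classes separate. In this toy setting no classifier can exceed the limit, because the labels of points in the overlap region are genuinely ambiguous.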
