Data segmentation based on the local intrinsic dimension

One of the founding paradigms of machine learning is that a small number of variables is often sufficient to describe high-dimensional data. The minimum number of variables required is called the intrinsic dimension (ID) of the data. Contrary to common intuition, there are cases where the ID varies...

Bibliographic Details
Main Authors:	Allegra, Michele, Facco, Elena, Denti, Francesco, Laio, Alessandro, Mira, Antonietta
Format:	Online Article Text
Language:	English
Published:	Nature Publishing Group UK 2020
Subjects:	Article
Online Access:	https://www.ncbi.nlm.nih.gov/pmc/articles/PMC7536196/ https://www.ncbi.nlm.nih.gov/pubmed/33020515 http://dx.doi.org/10.1038/s41598-020-72222-0

_version_	1783590512992714752
author	Allegra, Michele Facco, Elena Denti, Francesco Laio, Alessandro Mira, Antonietta
author_facet	Allegra, Michele Facco, Elena Denti, Francesco Laio, Alessandro Mira, Antonietta
author_sort	Allegra, Michele
collection	PubMed
description	One of the founding paradigms of machine learning is that a small number of variables is often sufficient to describe high-dimensional data. The minimum number of variables required is called the intrinsic dimension (ID) of the data. Contrary to common intuition, there are cases where the ID varies within the same data set. This fact has been highlighted in technical discussions, but seldom exploited to analyze large data sets and obtain insight into their structure. Here we develop a robust approach to discriminate regions with different local IDs and segment the points accordingly. Our approach is computationally efficient and can be proficiently used even on large data sets. We find that many real-world data sets contain regions with widely heterogeneous dimensions. These regions host points differing in core properties: folded versus unfolded configurations in a protein molecular dynamics trajectory, active versus non-active regions in brain imaging data, and firms with different financial risk in company balance sheets. A simple topological feature, the local ID, is thus sufficient to achieve an unsupervised segmentation of high-dimensional data, complementary to the one given by clustering algorithms.
format	Online Article Text
id	pubmed-7536196
institution	National Center for Biotechnology Information
language	English
publishDate	2020
publisher	Nature Publishing Group UK
record_format	MEDLINE/PubMed
spelling	pubmed-7536196 2020-10-06 Data segmentation based on the local intrinsic dimension Allegra, Michele Facco, Elena Denti, Francesco Laio, Alessandro Mira, Antonietta Sci Rep Article One of the founding paradigms of machine learning is that a small number of variables is often sufficient to describe high-dimensional data. The minimum number of variables required is called the intrinsic dimension (ID) of the data. Contrary to common intuition, there are cases where the ID varies within the same data set. This fact has been highlighted in technical discussions, but seldom exploited to analyze large data sets and obtain insight into their structure. Here we develop a robust approach to discriminate regions with different local IDs and segment the points accordingly. Our approach is computationally efficient and can be proficiently used even on large data sets. We find that many real-world data sets contain regions with widely heterogeneous dimensions. These regions host points differing in core properties: folded versus unfolded configurations in a protein molecular dynamics trajectory, active versus non-active regions in brain imaging data, and firms with different financial risk in company balance sheets. A simple topological feature, the local ID, is thus sufficient to achieve an unsupervised segmentation of high-dimensional data, complementary to the one given by clustering algorithms. Nature Publishing Group UK 2020-10-05 /pmc/articles/PMC7536196/ /pubmed/33020515 http://dx.doi.org/10.1038/s41598-020-72222-0 Text en © The Author(s) 2020 Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made.
The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/.
spellingShingle	Article Allegra, Michele Facco, Elena Denti, Francesco Laio, Alessandro Mira, Antonietta Data segmentation based on the local intrinsic dimension
title	Data segmentation based on the local intrinsic dimension
title_full	Data segmentation based on the local intrinsic dimension
title_fullStr	Data segmentation based on the local intrinsic dimension
title_full_unstemmed	Data segmentation based on the local intrinsic dimension
title_short	Data segmentation based on the local intrinsic dimension
title_sort	data segmentation based on the local intrinsic dimension
topic	Article
url	https://www.ncbi.nlm.nih.gov/pmc/articles/PMC7536196/ https://www.ncbi.nlm.nih.gov/pubmed/33020515 http://dx.doi.org/10.1038/s41598-020-72222-0
work_keys_str_mv	AT allegramichele datasegmentationbasedonthelocalintrinsicdimension AT faccoelena datasegmentationbasedonthelocalintrinsicdimension AT dentifrancesco datasegmentationbasedonthelocalintrinsicdimension AT laioalessandro datasegmentationbasedonthelocalintrinsicdimension AT miraantonietta datasegmentationbasedonthelocalintrinsicdimension
