
The generalized ratios intrinsic dimension estimator

Modern datasets are characterized by numerous features related by complex dependency structures. To deal with these data, dimensionality reduction techniques are essential. Many of these techniques rely on the concept of intrinsic dimension (id), a measure of the complexity of the dataset. However,...

Bibliographic Details
Main authors: Denti, Francesco; Doimo, Diego; Laio, Alessandro; Mira, Antonietta
Format: Online Article Text
Language: English
Published: Nature Publishing Group UK 2022
Online access: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC9678878/
https://www.ncbi.nlm.nih.gov/pubmed/36411305
http://dx.doi.org/10.1038/s41598-022-20991-1
author Denti, Francesco
Doimo, Diego
Laio, Alessandro
Mira, Antonietta
collection PubMed
description Modern datasets are characterized by numerous features related by complex dependency structures. To deal with these data, dimensionality reduction techniques are essential. Many of these techniques rely on the concept of intrinsic dimension (id), a measure of the complexity of the dataset. However, the estimation of this quantity is not trivial: often, the id depends rather dramatically on the scale of the distances among data points. At short distances, the id can be grossly overestimated due to the presence of noise, becoming smaller and approximately scale-independent only at large distances. An immediate approach to examining the scale dependence consists in decimating the dataset, which unavoidably induces non-negligible statistical errors at large scale. This article introduces a novel statistical method, Gride, that allows estimating the id as an explicit function of the scale without performing any decimation. Our approach is based on rigorous distributional results that enable the quantification of uncertainty of the estimates. Moreover, our method is simple and computationally efficient since it relies only on the distances among data points. Through simulation studies, we show that Gride is asymptotically unbiased, provides comparable estimates to other state-of-the-art methods, and is more robust to short-scale noise than other likelihood-based approaches.
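The idea of estimating the id at several scales from neighbor distances alone can be illustrated with a minimal sketch. This is not the authors' implementation. It assumes that, under a locally uniform density, the ratio mu = r_n2 / r_n1 between the distances to the n2-th and n1-th nearest neighbors has density proportional to d (mu^d - 1)^(n2-n1-1) / mu^((n2-1)d+1), which reduces to a Pareto law when n1 = 1 and n2 = 2. The function names, the toy dataset, and the choice n2 = 2 * n1 are hypothetical and chosen only for illustration.

```python
import math
import random


def sorted_neighbor_distances(points):
    """For each point, return the ascending distances to all other points."""
    result = []
    for i, p in enumerate(points):
        dists = [math.dist(p, q) for j, q in enumerate(points) if j != i]
        dists.sort()
        result.append(dists)
    return result


def distance_ratios(sorted_dists, n1, n2):
    """mu_i = r_{i,n2} / r_{i,n1}; points with tied distances are dropped."""
    return [d[n2 - 1] / d[n1 - 1] for d in sorted_dists if d[n2 - 1] > d[n1 - 1] > 0]


def log_likelihood(dim, ratios, n1, n2):
    """Log-likelihood of `dim` under the assumed density, up to a constant."""
    k = n2 - n1 - 1
    total = 0.0
    for mu in ratios:
        x = dim * math.log(mu)
        # log(mu^dim - 1) written as x + log(1 - exp(-x)) to avoid overflow
        log_term = x + math.log(-math.expm1(-x)) if k > 0 else 0.0
        total += math.log(dim) + k * log_term - (n2 - 1) * x - math.log(mu)
    return total


def mle_dimension(ratios, n1, n2, lo=1e-3, hi=100.0, tol=1e-6):
    """Maximize the log-likelihood over `dim` with a golden-section search."""
    inv_phi = (math.sqrt(5) - 1) / 2
    a, b = lo, hi
    while b - a > tol:
        c = b - inv_phi * (b - a)
        e = a + inv_phi * (b - a)
        if log_likelihood(c, ratios, n1, n2) > log_likelihood(e, ratios, n1, n2):
            b = e
        else:
            a = c
    return (a + b) / 2


# Toy data (an assumption for this sketch): 500 points drawn uniformly on a
# two-dimensional rectangle that is linearly embedded in five dimensions.
random.seed(0)
points = []
for _ in range(500):
    u, v = random.random(), random.random()
    points.append((u, v, u + v, u - v, 2.0 * u))

sorted_dists = sorted_neighbor_distances(points)

# Larger n1 probes larger distances, using every point (no decimation).
estimates = {}
for n1 in (1, 2, 4):
    ratios = distance_ratios(sorted_dists, n1, 2 * n1)
    estimates[n1] = mle_dimension(ratios, n1, 2 * n1)
    print(f"n1={n1}, n2={2 * n1}: estimated id = {estimates[n1]:.2f}")
```

Because the per-point log-likelihood is concave in the dimension, a one-dimensional bracketing search is enough here. The brute-force distance computation is quadratic in the number of points and is only meant for small examples.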
format Online
Article
Text
id pubmed-9678878
institution National Center for Biotechnology Information
language English
publishDate 2022
publisher Nature Publishing Group UK
record_format MEDLINE/PubMed
spelling pubmed-9678878 2022-11-23 The generalized ratios intrinsic dimension estimator Denti, Francesco; Doimo, Diego; Laio, Alessandro; Mira, Antonietta Sci Rep Article
Nature Publishing Group UK 2022-11-21 /pmc/articles/PMC9678878/ /pubmed/36411305 http://dx.doi.org/10.1038/s41598-022-20991-1 Text en © The Author(s) 2022 https://creativecommons.org/licenses/by/4.0/ Open Access: This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/.