The generalized ratios intrinsic dimension estimator
Modern datasets are characterized by numerous features related by complex dependency structures. To deal with these data, dimensionality reduction techniques are essential. Many of these techniques rely on the concept of intrinsic dimension (id), a measure of the complexity of the dataset. However,...
Main authors: | Denti, Francesco; Doimo, Diego; Laio, Alessandro; Mira, Antonietta |
---|---|
Format: | Online Article Text |
Language: | English |
Published: | Nature Publishing Group UK, 2022 |
Subjects: | |
Online access: | https://www.ncbi.nlm.nih.gov/pmc/articles/PMC9678878/ https://www.ncbi.nlm.nih.gov/pubmed/36411305 http://dx.doi.org/10.1038/s41598-022-20991-1 |
_version_ | 1784834085736153088 |
---|---|
author | Denti, Francesco Doimo, Diego Laio, Alessandro Mira, Antonietta |
author_facet | Denti, Francesco Doimo, Diego Laio, Alessandro Mira, Antonietta |
author_sort | Denti, Francesco |
collection | PubMed |
description | Modern datasets are characterized by numerous features related by complex dependency structures. To deal with these data, dimensionality reduction techniques are essential. Many of these techniques rely on the concept of intrinsic dimension (id), a measure of the complexity of the dataset. However, the estimation of this quantity is not trivial: often, the id depends rather dramatically on the scale of the distances among data points. At short distances, the id can be grossly overestimated due to the presence of noise, becoming smaller and approximately scale-independent only at large distances. An immediate approach to examining the scale dependence consists in decimating the dataset, which unavoidably induces non-negligible statistical errors at large scale. This article introduces a novel statistical method, Gride, that allows estimating the id as an explicit function of the scale without performing any decimation. Our approach is based on rigorous distributional results that enable the quantification of uncertainty of the estimates. Moreover, our method is simple and computationally efficient since it relies only on the distances among data points. Through simulation studies, we show that Gride is asymptotically unbiased, provides comparable estimates to other state-of-the-art methods, and is more robust to short-scale noise than other likelihood-based approaches. |
format | Online Article Text |
id | pubmed-9678878 |
institution | National Center for Biotechnology Information |
language | English |
publishDate | 2022 |
publisher | Nature Publishing Group UK |
record_format | MEDLINE/PubMed |
spelling | pubmed-9678878 2022-11-23 The generalized ratios intrinsic dimension estimator Denti, Francesco Doimo, Diego Laio, Alessandro Mira, Antonietta Sci Rep Article Modern datasets are characterized by numerous features related by complex dependency structures. To deal with these data, dimensionality reduction techniques are essential. Many of these techniques rely on the concept of intrinsic dimension (id), a measure of the complexity of the dataset. However, the estimation of this quantity is not trivial: often, the id depends rather dramatically on the scale of the distances among data points. At short distances, the id can be grossly overestimated due to the presence of noise, becoming smaller and approximately scale-independent only at large distances. An immediate approach to examining the scale dependence consists in decimating the dataset, which unavoidably induces non-negligible statistical errors at large scale. This article introduces a novel statistical method, Gride, that allows estimating the id as an explicit function of the scale without performing any decimation. Our approach is based on rigorous distributional results that enable the quantification of uncertainty of the estimates. Moreover, our method is simple and computationally efficient since it relies only on the distances among data points. Through simulation studies, we show that Gride is asymptotically unbiased, provides comparable estimates to other state-of-the-art methods, and is more robust to short-scale noise than other likelihood-based approaches. Nature Publishing Group UK 2022-11-21 /pmc/articles/PMC9678878/ /pubmed/36411305 http://dx.doi.org/10.1038/s41598-022-20991-1 Text en © The Author(s) 2022 https://creativecommons.org/licenses/by/4.0/ Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit https://creativecommons.org/licenses/by/4.0/. |
spellingShingle | Article Denti, Francesco Doimo, Diego Laio, Alessandro Mira, Antonietta The generalized ratios intrinsic dimension estimator |
title | The generalized ratios intrinsic dimension estimator |
title_sort | generalized ratios intrinsic dimension estimator |
topic | Article |
url | https://www.ncbi.nlm.nih.gov/pmc/articles/PMC9678878/ https://www.ncbi.nlm.nih.gov/pubmed/36411305 http://dx.doi.org/10.1038/s41598-022-20991-1 |
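The description field above says the estimator relies only on distances among data points and yields the id as a function of scale. The following Python sketch illustrates one way such a ratio-based estimate can be computed. It is not the authors' implementation. It assumes a locally uniform (Poisson) density, under which the ratio mu of the n2-th to the n1-th nearest-neighbor distance satisfies mu^(-d) ~ Beta(n1, n2 - n1), and it maximizes the resulting likelihood in d by bisection. The function names (`knn_distance_ratios`, `id_mle_from_ratios`, `sample_plane`) and the search bounds are hypothetical choices made for this example.

```python
import math
import random


def knn_distance_ratios(points, n1, n2):
    """Return mu_i = r_{i,n2} / r_{i,n1}: the ratio of each point's n2-th to
    n1-th nearest-neighbor distance (brute force, for illustration only)."""
    if not 1 <= n1 < n2 < len(points):
        raise ValueError("need 1 <= n1 < n2 < number of points")
    ratios = []
    for i, p in enumerate(points):
        dists = sorted(math.dist(p, q) for j, q in enumerate(points) if j != i)
        r1, r2 = dists[n1 - 1], dists[n2 - 1]
        if r2 > r1 > 0:  # skip ties and duplicates, where log(mu) is 0 or undefined
            ratios.append(r2 / r1)
    return ratios


def id_mle_from_ratios(ratios, n1, n2, d_lo=1e-6, d_hi=1e4, iters=200):
    """Maximum-likelihood intrinsic dimension under the assumed model
    mu^(-d) ~ Beta(n1, n2 - n1). The score (derivative of the log-likelihood
    in d) is strictly decreasing, so bisection finds its unique root.
    If the root lies outside [d_lo, d_hi], the nearest bound is returned."""
    logs = [math.log(mu) for mu in ratios]
    n, s = len(logs), sum(logs)
    if n == 0:
        raise ValueError("no usable distance ratios")

    def score(d):
        # sum of log(mu) / (1 - mu^(-d)), written with expm1 for stability
        extra = sum(-l / math.expm1(-d * l) for l in logs)
        return n / d + (n2 - n1 - 1) * extra - (n2 - 1) * s

    for _ in range(iters):
        mid = 0.5 * (d_lo + d_hi)
        if score(mid) > 0:
            d_lo = mid
        else:
            d_hi = mid
    return 0.5 * (d_lo + d_hi)


def sample_plane(n, seed=0):
    """Toy data: n points on a flat 2-dimensional sheet embedded in 5 dimensions."""
    rng = random.Random(seed)
    pts = []
    for _ in range(n):
        x, y = rng.random(), rng.random()
        pts.append((x, y, x + y, x - y, 2 * x))
    return pts


if __name__ == "__main__":
    pts = sample_plane(300)
    # Larger neighbor orders probe larger distance scales without decimating the data.
    for n1, n2 in [(1, 2), (2, 4), (4, 8)]:
        mu = knn_distance_ratios(pts, n1, n2)
        print(n1, n2, round(id_mle_from_ratios(mu, n1, n2), 2))
```

In this sketch, the choice n1 = 1, n2 = 2 reduces to the closed form N / sum(log mu_i), and increasing the neighbor orders is what moves the estimate to larger scales. For real data a tree-based neighbor search would replace the brute-force loop.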