Ranking the information content of distance measures

Real-world data typically contain a large number of features that are often heterogeneous in nature, relevance, and also units of measure. When assessing the similarity between data points, one can build various distance measures using subsets of these features. Finding a small set of features that...

Bibliographic Details
Main Authors: Glielmo, Aldo; Zeni, Claudio; Cheng, Bingqing; Csányi, Gábor; Laio, Alessandro
Format: Online Article Text
Language: English
Published: Oxford University Press 2022
Subjects: Physical Sciences and Engineering
Online Access: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC9802303/
https://www.ncbi.nlm.nih.gov/pubmed/36713323
http://dx.doi.org/10.1093/pnasnexus/pgac039
author Glielmo, Aldo
Zeni, Claudio
Cheng, Bingqing
Csányi, Gábor
Laio, Alessandro
author_sort Glielmo, Aldo
collection PubMed
description Real-world data typically contain a large number of features that are often heterogeneous in nature, relevance, and also units of measure. When assessing the similarity between data points, one can build various distance measures using subsets of these features. Finding a small set of features that still retains sufficient information about the dataset is important for the successful application of many statistical learning approaches. We introduce a statistical test that can assess the relative information retained when using 2 different distance measures, and determine if they are equivalent, independent, or if one is more informative than the other. This ranking can in turn be used to identify the most informative distance measure and, therefore, the most informative set of features, out of a pool of candidates. To illustrate the general applicability of our approach, we show that it reproduces the known importance ranking of policy variables for Covid-19 control, and also identifies compact yet informative descriptors for atomic structures. We further provide initial evidence that the information asymmetry measured by the proposed test can be used to infer relationships of causality between the features of a dataset. The method is general and should be applicable to many branches of science.
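The description above does not spell out how the test is computed. As a rough illustration only, the sketch below shows one rank-based way to compare two distance measures A and B: for each point, find its nearest neighbour under A and look up the rank of that same neighbour under B. A low average rank suggests that A retains the information in B, and comparing the two directions gives an asymmetry of the kind the description mentions. The function names, the choice of Euclidean distance, the normalisation by 2/N, and the synthetic data are all assumptions made for this example, not details taken from this record.

```python
import numpy as np


def rank_matrix(X):
    """Return R where R[i, j] is the rank of point j among the neighbours
    of point i (1 = nearest), using Euclidean distance (an assumption)."""
    X = np.asarray(X, dtype=float)
    if X.ndim == 1:
        X = X[:, None]  # treat a 1-D array as a single feature
    n = len(X)
    d = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)
    np.fill_diagonal(d, np.inf)  # a point is never its own neighbour
    order = np.argsort(d, axis=1)
    ranks = np.empty_like(order)
    ranks[np.arange(n)[:, None], order] = np.arange(1, n + 1)
    return ranks


def information_imbalance(XA, XB):
    """Hypothetical helper: scaled mean rank, under distance B, of each
    point's nearest neighbour under distance A.
    Values near 0: A predicts B well. Values near 1: A tells little about B."""
    rA = rank_matrix(XA)
    rB = rank_matrix(XB)
    n = len(rA)
    nearest_in_A = np.argmin(rA, axis=1)  # column holding rank 1 in each row
    return 2.0 / n * rB[np.arange(n), nearest_in_A].mean()


# Synthetic example: two independent features and the space containing both.
rng = np.random.default_rng(0)
x = rng.normal(size=300)
y = rng.normal(size=300)
xy = np.column_stack([x, y])

print("xy -> x :", information_imbalance(xy, x))
print("x  -> xy:", information_imbalance(x, xy))
print("x  -> y :", information_imbalance(x, y))
```

In this synthetic setup the distance built on both features should score lower towards the single-feature distance than the reverse, while the two independent features should score close to 1 in either direction. The pairwise distance matrix makes this sketch quadratic in memory, so it is only meant for small datasets.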
format Online
Article
Text
id pubmed-9802303
institution National Center for Biotechnology Information
language English
publishDate 2022
publisher Oxford University Press
record_format MEDLINE/PubMed
spelling pubmed-9802303 2023-02-03 Ranking the information content of distance measures Glielmo, Aldo; Zeni, Claudio; Cheng, Bingqing; Csányi, Gábor; Laio, Alessandro PNAS Nexus Physical Sciences and Engineering Oxford University Press 2022-04-14 /pmc/articles/PMC9802303/ /pubmed/36713323 http://dx.doi.org/10.1093/pnasnexus/pgac039 Text en © The Author(s) 2022. Published by Oxford University Press on behalf of the National Academy of Sciences.
https://creativecommons.org/licenses/by-nc-nd/4.0/ This is an Open Access article distributed under the terms of the Creative Commons Attribution-NonCommercial-NoDerivs licence (https://creativecommons.org/licenses/by-nc-nd/4.0/), which permits non-commercial reproduction and distribution of the work, in any medium, provided the original work is not altered or transformed in any way, and that the work is properly cited. For commercial re-use, please contact journals.permissions@oup.com
title Ranking the information content of distance measures
topic Physical Sciences and Engineering
url https://www.ncbi.nlm.nih.gov/pmc/articles/PMC9802303/
https://www.ncbi.nlm.nih.gov/pubmed/36713323
http://dx.doi.org/10.1093/pnasnexus/pgac039