CLARITY: comparing heterogeneous data using dissimilarity

Integrating datasets from different disciplines is hard because the data are often qualitatively different in meaning, scale and reliability. When two datasets describe the same entities, many scientific questions can be phrased around whether the (dis)similarities between entities are conserved acr...

Bibliographic Details
Main Authors: Lawson, Daniel J.; Solanki, Vinesh; Yanovich, Igor; Dellert, Johannes; Ruck, Damian; Endicott, Phillip
Format: Online Article Text
Language: English
Published: The Royal Society 2021
Subjects: Mathematics
Online Access: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC8652278/
https://www.ncbi.nlm.nih.gov/pubmed/34909208
http://dx.doi.org/10.1098/rsos.202182
author Lawson, Daniel J.
Solanki, Vinesh
Yanovich, Igor
Dellert, Johannes
Ruck, Damian
Endicott, Phillip
collection PubMed
description Integrating datasets from different disciplines is hard because the data are often qualitatively different in meaning, scale and reliability. When two datasets describe the same entities, many scientific questions can be phrased around whether the (dis)similarities between entities are conserved across such different data. Our method, CLARITY, quantifies consistency across datasets, identifies where inconsistencies arise and aids in their interpretation. We illustrate this using three diverse comparisons: gene methylation versus expression, evolution of language sounds versus word use, and country-level economic metrics versus cultural beliefs. The non-parametric approach is robust to noise and differences in scaling, and makes only weak assumptions about how the data were generated. It operates by decomposing similarities into two components: a ‘structural’ component analogous to a clustering, and an underlying ‘relationship’ between those structures. This allows a ‘structural comparison’ between two similarity matrices using their predictability from ‘structure’. Significance is assessed with the help of re-sampling appropriate for each dataset. The software, CLARITY, is available as an R package from github.com/danjlawson/CLARITY.
format Online
Article
Text
id pubmed-8652278
institution National Center for Biotechnology Information
language English
publishDate 2021
publisher The Royal Society
record_format MEDLINE/PubMed
spelling pubmed-8652278 2021-12-13 CLARITY: comparing heterogeneous data using dissimilarity. Lawson, Daniel J.; Solanki, Vinesh; Yanovich, Igor; Dellert, Johannes; Ruck, Damian; Endicott, Phillip. R Soc Open Sci. Mathematics. The Royal Society 2021-12-08. /pmc/articles/PMC8652278/ /pubmed/34909208 http://dx.doi.org/10.1098/rsos.202182 Text en © 2021 The Authors. Published by the Royal Society under the terms of the Creative Commons Attribution License (https://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, provided the original author and source are credited.
title CLARITY: comparing heterogeneous data using dissimilarity
topic Mathematics
url https://www.ncbi.nlm.nih.gov/pmc/articles/PMC8652278/
https://www.ncbi.nlm.nih.gov/pubmed/34909208
http://dx.doi.org/10.1098/rsos.202182
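
The description in this record says the method splits a similarity matrix into a 'structural' component analogous to a clustering and a 'relationship' between those structures, then asks how well a second matrix is predicted from the same structure. The Python sketch below illustrates that idea only. It is not the CLARITY R package or its API. It assumes the structure is already known as hard cluster labels (the record does not say how the structure is learned), and it takes the relationship between clusters to be the mean dissimilarity for each pair of clusters, which is the least-squares fit for hard clusters. The function names, labels and toy matrices are all hypothetical.

```python
from itertools import product


def block_means(Y, labels, k):
    """Relationship matrix X: mean off-diagonal dissimilarity per cluster pair."""
    n = len(Y)
    sums = [[0.0] * k for _ in range(k)]
    counts = [[0] * k for _ in range(k)]
    for i, j in product(range(n), repeat=2):
        if i == j:
            continue  # self-dissimilarities carry no structural information
        sums[labels[i]][labels[j]] += Y[i][j]
        counts[labels[i]][labels[j]] += 1
    return [
        [sums[a][b] / counts[a][b] if counts[a][b] else 0.0 for b in range(k)]
        for a in range(k)
    ]


def structural_residuals(Y, labels, k):
    """Per-entity squared error when Y is predicted from cluster structure alone."""
    X = block_means(Y, labels, k)
    n = len(Y)
    return [
        sum((Y[i][j] - X[labels[i]][labels[j]]) ** 2 for j in range(n) if j != i)
        for i in range(n)
    ]


# Hypothetical structure: entities 0,1 form one cluster and 2,3 another.
labels = [0, 0, 1, 1]

# First dataset: fits the two-cluster structure exactly.
Y1 = [[0, 1, 5, 5],
      [1, 0, 5, 5],
      [5, 5, 0, 1],
      [5, 5, 1, 0]]

# Second dataset: different scale, and entities 0 and 3 are unexpectedly close.
Y2 = [[0, 2, 8, 2],
      [2, 0, 8, 8],
      [8, 8, 0, 2],
      [2, 8, 2, 0]]

res1 = structural_residuals(Y1, labels, 2)
res2 = structural_residuals(Y2, labels, 2)
```

Because the relationship matrix is refitted for each dataset, a pure change of scale between Y1 and Y2 leaves no residual. Only entities whose dissimilarities depart from the shared structure stand out, here entities 0 and 3. Judging whether such residuals are significant would need the re-sampling step mentioned in the description, which this sketch omits.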