
Visualization of very large high-dimensional data sets as minimum spanning trees

The chemical sciences are producing an unprecedented amount of large, high-dimensional data sets containing chemical structures and associated properties. However, there are currently no algorithms to visualize such data while preserving both global and local features with a sufficient level of deta...


Bibliographic Details
Main authors: Probst, Daniel; Reymond, Jean-Louis
Format: Online Article Text
Language: English
Published: Springer International Publishing, 2020
Subjects:
Online access: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC7015965/
https://www.ncbi.nlm.nih.gov/pubmed/33431043
http://dx.doi.org/10.1186/s13321-020-0416-x
author Probst, Daniel
Reymond, Jean-Louis
collection PubMed
description The chemical sciences are producing an unprecedented amount of large, high-dimensional data sets containing chemical structures and associated properties. However, there are currently no algorithms to visualize such data while preserving both global and local features with a sufficient level of detail to allow for human inspection and interpretation. Here, we propose a solution to this problem with a new data visualization method, TMAP, capable of representing data sets of up to millions of data points and arbitrary high dimensionality as a two-dimensional tree (http://tmap.gdb.tools). Visualizations based on TMAP are better suited than t-SNE or UMAP for the exploration and interpretation of large data sets due to their tree-like nature, increased local and global neighborhood and structure preservation, and the transparency of the methods the algorithm is based on. We apply TMAP to the most used chemistry data sets including databases of molecules such as ChEMBL, FDB17, the Natural Products Atlas, DSSTox, as well as to the MoleculeNet benchmark collection of data sets. We also show its broad applicability with further examples from biology, particle physics, and literature.
format Online
Article
Text
id pubmed-7015965
institution National Center for Biotechnology Information
language English
publishDate 2020
publisher Springer International Publishing
record_format MEDLINE/PubMed
spelling pubmed-7015965 2020-02-20 J Cheminform Research Article Springer International Publishing 2020-02-12 /pmc/articles/PMC7015965/ /pubmed/33431043 http://dx.doi.org/10.1186/s13321-020-0416-x Text en
© The Author(s) 2020. Open Access. This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated in a credit line to the data.
title Visualization of very large high-dimensional data sets as minimum spanning trees
topic Research Article