Visualization of very large high-dimensional data sets as minimum spanning trees
The chemical sciences are producing an unprecedented amount of large, high-dimensional data sets containing chemical structures and associated properties. However, there are currently no algorithms to visualize such data while preserving both global and local features with a sufficient level of deta...
Main Authors: | Probst, Daniel; Reymond, Jean-Louis |
---|---|
Format: | Online Article Text |
Language: | English |
Published: | Springer International Publishing 2020 |
Subjects: | |
Online Access: | https://www.ncbi.nlm.nih.gov/pmc/articles/PMC7015965/ https://www.ncbi.nlm.nih.gov/pubmed/33431043 http://dx.doi.org/10.1186/s13321-020-0416-x |
_version_ | 1783496891769552896 |
---|---|
author | Probst, Daniel; Reymond, Jean-Louis |
author_facet | Probst, Daniel; Reymond, Jean-Louis |
author_sort | Probst, Daniel |
collection | PubMed |
description | The chemical sciences are producing an unprecedented amount of large, high-dimensional data sets containing chemical structures and associated properties. However, there are currently no algorithms to visualize such data while preserving both global and local features with a sufficient level of detail to allow for human inspection and interpretation. Here, we propose a solution to this problem with a new data visualization method, TMAP, capable of representing data sets of up to millions of data points and arbitrary high dimensionality as a two-dimensional tree (http://tmap.gdb.tools). Visualizations based on TMAP are better suited than t-SNE or UMAP for the exploration and interpretation of large data sets due to their tree-like nature, increased local and global neighborhood and structure preservation, and the transparency of the methods the algorithm is based on. We apply TMAP to the most used chemistry data sets including databases of molecules such as ChEMBL, FDB17, the Natural Products Atlas, DSSTox, as well as to the MoleculeNet benchmark collection of data sets. We also show its broad applicability with further examples from biology, particle physics, and literature. |
format | Online Article Text |
id | pubmed-7015965 |
institution | National Center for Biotechnology Information |
language | English |
publishDate | 2020 |
publisher | Springer International Publishing |
record_format | MEDLINE/PubMed |
spelling | pubmed-7015965 2020-02-20 Visualization of very large high-dimensional data sets as minimum spanning trees Probst, Daniel Reymond, Jean-Louis J Cheminform Research Article The chemical sciences are producing an unprecedented amount of large, high-dimensional data sets containing chemical structures and associated properties. However, there are currently no algorithms to visualize such data while preserving both global and local features with a sufficient level of detail to allow for human inspection and interpretation. Here, we propose a solution to this problem with a new data visualization method, TMAP, capable of representing data sets of up to millions of data points and arbitrary high dimensionality as a two-dimensional tree (http://tmap.gdb.tools). Visualizations based on TMAP are better suited than t-SNE or UMAP for the exploration and interpretation of large data sets due to their tree-like nature, increased local and global neighborhood and structure preservation, and the transparency of the methods the algorithm is based on. We apply TMAP to the most used chemistry data sets including databases of molecules such as ChEMBL, FDB17, the Natural Products Atlas, DSSTox, as well as to the MoleculeNet benchmark collection of data sets. We also show its broad applicability with further examples from biology, particle physics, and literature. Springer International Publishing 2020-02-12 /pmc/articles/PMC7015965/ /pubmed/33431043 http://dx.doi.org/10.1186/s13321-020-0416-x Text en © The Author(s) 2020. Open Access: This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated in a credit line to the data. |
spellingShingle | Research Article Probst, Daniel Reymond, Jean-Louis Visualization of very large high-dimensional data sets as minimum spanning trees |
title | Visualization of very large high-dimensional data sets as minimum spanning trees |
title_full | Visualization of very large high-dimensional data sets as minimum spanning trees |
title_fullStr | Visualization of very large high-dimensional data sets as minimum spanning trees |
title_full_unstemmed | Visualization of very large high-dimensional data sets as minimum spanning trees |
title_short | Visualization of very large high-dimensional data sets as minimum spanning trees |
title_sort | visualization of very large high-dimensional data sets as minimum spanning trees |
topic | Research Article |
url | https://www.ncbi.nlm.nih.gov/pmc/articles/PMC7015965/ https://www.ncbi.nlm.nih.gov/pubmed/33431043 http://dx.doi.org/10.1186/s13321-020-0416-x |
work_keys_str_mv | AT probstdaniel visualizationofverylargehighdimensionaldatasetsasminimumspanningtrees AT reymondjeanlouis visualizationofverylargehighdimensionaldatasetsasminimumspanningtrees |
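
The description in this record says that TMAP represents a data set as a two-dimensional tree, and the title refers to minimum spanning trees. The record does not describe how TMAP builds its tree, so the following is only a generic, minimal sketch of what a minimum spanning tree over a handful of points looks like. It assumes Euclidean distance and uses Prim's algorithm; the function name `minimum_spanning_tree` and the sample points are hypothetical and are not taken from the article or from the TMAP software.

```python
import math


def minimum_spanning_tree(points):
    """Return MST edges as (i, j, distance) tuples using Prim's algorithm.

    Illustration only: an O(n^2) construction over all pairwise Euclidean
    distances. This is an assumption-laden sketch, not TMAP's method.
    """
    n = len(points)
    if n == 0:
        return []
    in_tree = [False] * n
    best_dist = [math.inf] * n  # cheapest known link from each point to the tree
    best_from = [-1] * n        # tree point that provides that cheapest link
    best_dist[0] = 0.0          # start growing the tree from point 0
    edges = []
    for _ in range(n):
        # Pick the point outside the tree that is closest to it.
        u = min((i for i in range(n) if not in_tree[i]), key=lambda i: best_dist[i])
        in_tree[u] = True
        if best_from[u] != -1:
            edges.append((best_from[u], u, best_dist[u]))
        # Update the cheapest links of the remaining points via the new point.
        for v in range(n):
            if not in_tree[v]:
                d = math.dist(points[u], points[v])
                if d < best_dist[v]:
                    best_dist[v] = d
                    best_from[v] = u
    return edges


# Hypothetical sample data: three nearby points and one outlier.
points = [(0.0, 0.0), (1.0, 0.0), (1.0, 1.0), (5.0, 5.0)]
edges = minimum_spanning_tree(points)
for i, j, d in edges:
    print(f"{i} -- {j}  (distance {d:.3f})")
```

A tree over n points always has n - 1 edges and no cycles, which is what makes it possible to draw as a branching layout. The all-pairs approach above would not scale to the millions of data points mentioned in the description.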