Cargando…
Minimum Spanning vs. Principal Trees for Structured Approximations of Multi-Dimensional Datasets
Construction of graph-based approximations for multi-dimensional data point clouds is widely used in a variety of areas. Notable examples of applications of such approximators are cellular trajectory inference in single-cell data analysis, analysis of clinical trajectories from synchronic datasets,...
Autores principales: | , , |
---|---|
Formato: | Online Artículo Texto |
Lenguaje: | English |
Publicado: |
MDPI
2020
|
Materias: | |
Acceso en línea: | https://www.ncbi.nlm.nih.gov/pmc/articles/PMC7711596/ https://www.ncbi.nlm.nih.gov/pubmed/33287042 http://dx.doi.org/10.3390/e22111274 |
_version_ | 1783618181150015488 |
---|---|
author | Chervov, Alexander Bac, Jonathan Zinovyev, Andrei |
author_facet | Chervov, Alexander Bac, Jonathan Zinovyev, Andrei |
author_sort | Chervov, Alexander |
collection | PubMed |
description | Construction of graph-based approximations for multi-dimensional data point clouds is widely used in a variety of areas. Notable examples of applications of such approximators are cellular trajectory inference in single-cell data analysis, analysis of clinical trajectories from synchronic datasets, and skeletonization of images. Several methods have been proposed to construct such approximating graphs, with some based on computation of minimum spanning trees and some based on principal graphs generalizing principal curves. In this article we propose a methodology to compare and benchmark these two graph-based data approximation approaches, as well as to define their hyperparameters. The main idea is to avoid comparing graphs directly, but at first to induce clustering of the data point cloud from the graph approximation and, secondly, to use well-established methods to compare and score the data cloud partitioning induced by the graphs. In particular, mutual information-based approaches prove to be useful in this context. The induced clustering is based on decomposing a graph into non-branching segments, and then clustering the data point cloud by the nearest segment. Such a method allows efficient comparison of graph-based data approximations of arbitrary topology and complexity. The method is implemented in Python using the standard scikit-learn library which provides high speed and efficiency. As a demonstration of the methodology we analyse and compare graph-based data approximation methods using synthetic as well as real-life single cell datasets. |
format | Online Article Text |
id | pubmed-7711596 |
institution | National Center for Biotechnology Information |
language | English |
publishDate | 2020 |
publisher | MDPI |
record_format | MEDLINE/PubMed |
spelling | pubmed-77115962021-02-24 Minimum Spanning vs. Principal Trees for Structured Approximations of Multi-Dimensional Datasets Chervov, Alexander Bac, Jonathan Zinovyev, Andrei Entropy (Basel) Article Construction of graph-based approximations for multi-dimensional data point clouds is widely used in a variety of areas. Notable examples of applications of such approximators are cellular trajectory inference in single-cell data analysis, analysis of clinical trajectories from synchronic datasets, and skeletonization of images. Several methods have been proposed to construct such approximating graphs, with some based on computation of minimum spanning trees and some based on principal graphs generalizing principal curves. In this article we propose a methodology to compare and benchmark these two graph-based data approximation approaches, as well as to define their hyperparameters. The main idea is to avoid comparing graphs directly, but at first to induce clustering of the data point cloud from the graph approximation and, secondly, to use well-established methods to compare and score the data cloud partitioning induced by the graphs. In particular, mutual information-based approaches prove to be useful in this context. The induced clustering is based on decomposing a graph into non-branching segments, and then clustering the data point cloud by the nearest segment. Such a method allows efficient comparison of graph-based data approximations of arbitrary topology and complexity. The method is implemented in Python using the standard scikit-learn library which provides high speed and efficiency. As a demonstration of the methodology we analyse and compare graph-based data approximation methods using synthetic as well as real-life single cell datasets. MDPI 2020-11-11 /pmc/articles/PMC7711596/ /pubmed/33287042 http://dx.doi.org/10.3390/e22111274 Text en © 2020 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (http://creativecommons.org/licenses/by/4.0/). |
spellingShingle | Article Chervov, Alexander Bac, Jonathan Zinovyev, Andrei Minimum Spanning vs. Principal Trees for Structured Approximations of Multi-Dimensional Datasets |
title | Minimum Spanning vs. Principal Trees for Structured Approximations of Multi-Dimensional Datasets |
title_full | Minimum Spanning vs. Principal Trees for Structured Approximations of Multi-Dimensional Datasets |
title_fullStr | Minimum Spanning vs. Principal Trees for Structured Approximations of Multi-Dimensional Datasets |
title_full_unstemmed | Minimum Spanning vs. Principal Trees for Structured Approximations of Multi-Dimensional Datasets |
title_short | Minimum Spanning vs. Principal Trees for Structured Approximations of Multi-Dimensional Datasets |
title_sort | minimum spanning vs. principal trees for structured approximations of multi-dimensional datasets |
topic | Article |
url | https://www.ncbi.nlm.nih.gov/pmc/articles/PMC7711596/ https://www.ncbi.nlm.nih.gov/pubmed/33287042 http://dx.doi.org/10.3390/e22111274 |
work_keys_str_mv | AT chervovalexander minimumspanningvsprincipaltreesforstructuredapproximationsofmultidimensionaldatasets AT bacjonathan minimumspanningvsprincipaltreesforstructuredapproximationsofmultidimensionaldatasets AT zinovyevandrei minimumspanningvsprincipaltreesforstructuredapproximationsofmultidimensionaldatasets |