Cargando…

Minimum Spanning vs. Principal Trees for Structured Approximations of Multi-Dimensional Datasets

Construction of graph-based approximations for multi-dimensional data point clouds is widely used in a variety of areas. Notable examples of applications of such approximators are cellular trajectory inference in single-cell data analysis, analysis of clinical trajectories from synchronic datasets,...

Descripción completa

Detalles Bibliográficos
Autores principales: Chervov, Alexander, Bac, Jonathan, Zinovyev, Andrei
Formato: Online Artículo Texto
Lenguaje:English
Publicado: MDPI 2020
Materias:
Acceso en línea:https://www.ncbi.nlm.nih.gov/pmc/articles/PMC7711596/
https://www.ncbi.nlm.nih.gov/pubmed/33287042
http://dx.doi.org/10.3390/e22111274
_version_ 1783618181150015488
author Chervov, Alexander
Bac, Jonathan
Zinovyev, Andrei
author_facet Chervov, Alexander
Bac, Jonathan
Zinovyev, Andrei
author_sort Chervov, Alexander
collection PubMed
description Construction of graph-based approximations for multi-dimensional data point clouds is widely used in a variety of areas. Notable examples of applications of such approximators are cellular trajectory inference in single-cell data analysis, analysis of clinical trajectories from synchronic datasets, and skeletonization of images. Several methods have been proposed to construct such approximating graphs, with some based on computation of minimum spanning trees and some based on principal graphs generalizing principal curves. In this article we propose a methodology to compare and benchmark these two graph-based data approximation approaches, as well as to define their hyperparameters. The main idea is to avoid comparing graphs directly, but at first to induce clustering of the data point cloud from the graph approximation and, secondly, to use well-established methods to compare and score the data cloud partitioning induced by the graphs. In particular, mutual information-based approaches prove to be useful in this context. The induced clustering is based on decomposing a graph into non-branching segments, and then clustering the data point cloud by the nearest segment. Such a method allows efficient comparison of graph-based data approximations of arbitrary topology and complexity. The method is implemented in Python using the standard scikit-learn library which provides high speed and efficiency. As a demonstration of the methodology we analyse and compare graph-based data approximation methods using synthetic as well as real-life single cell datasets.
format Online
Article
Text
id pubmed-7711596
institution National Center for Biotechnology Information
language English
publishDate 2020
publisher MDPI
record_format MEDLINE/PubMed
spelling pubmed-77115962021-02-24 Minimum Spanning vs. Principal Trees for Structured Approximations of Multi-Dimensional Datasets Chervov, Alexander Bac, Jonathan Zinovyev, Andrei Entropy (Basel) Article Construction of graph-based approximations for multi-dimensional data point clouds is widely used in a variety of areas. Notable examples of applications of such approximators are cellular trajectory inference in single-cell data analysis, analysis of clinical trajectories from synchronic datasets, and skeletonization of images. Several methods have been proposed to construct such approximating graphs, with some based on computation of minimum spanning trees and some based on principal graphs generalizing principal curves. In this article we propose a methodology to compare and benchmark these two graph-based data approximation approaches, as well as to define their hyperparameters. The main idea is to avoid comparing graphs directly, but at first to induce clustering of the data point cloud from the graph approximation and, secondly, to use well-established methods to compare and score the data cloud partitioning induced by the graphs. In particular, mutual information-based approaches prove to be useful in this context. The induced clustering is based on decomposing a graph into non-branching segments, and then clustering the data point cloud by the nearest segment. Such a method allows efficient comparison of graph-based data approximations of arbitrary topology and complexity. The method is implemented in Python using the standard scikit-learn library which provides high speed and efficiency. As a demonstration of the methodology we analyse and compare graph-based data approximation methods using synthetic as well as real-life single cell datasets. MDPI 2020-11-11 /pmc/articles/PMC7711596/ /pubmed/33287042 http://dx.doi.org/10.3390/e22111274 Text en © 2020 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (http://creativecommons.org/licenses/by/4.0/).
spellingShingle Article
Chervov, Alexander
Bac, Jonathan
Zinovyev, Andrei
Minimum Spanning vs. Principal Trees for Structured Approximations of Multi-Dimensional Datasets
title Minimum Spanning vs. Principal Trees for Structured Approximations of Multi-Dimensional Datasets
title_full Minimum Spanning vs. Principal Trees for Structured Approximations of Multi-Dimensional Datasets
title_fullStr Minimum Spanning vs. Principal Trees for Structured Approximations of Multi-Dimensional Datasets
title_full_unstemmed Minimum Spanning vs. Principal Trees for Structured Approximations of Multi-Dimensional Datasets
title_short Minimum Spanning vs. Principal Trees for Structured Approximations of Multi-Dimensional Datasets
title_sort minimum spanning vs. principal trees for structured approximations of multi-dimensional datasets
topic Article
url https://www.ncbi.nlm.nih.gov/pmc/articles/PMC7711596/
https://www.ncbi.nlm.nih.gov/pubmed/33287042
http://dx.doi.org/10.3390/e22111274
work_keys_str_mv AT chervovalexander minimumspanningvsprincipaltreesforstructuredapproximationsofmultidimensionaldatasets
AT bacjonathan minimumspanningvsprincipaltreesforstructuredapproximationsofmultidimensionaldatasets
AT zinovyevandrei minimumspanningvsprincipaltreesforstructuredapproximationsofmultidimensionaldatasets