Topological Information Data Analysis

This paper presents methods that quantify the structure of statistical interactions within a given data set; the methods were applied in a previous article. It establishes new results on the k-multivariate mutual information ([Formula: see text]), inspired by the topological formulation of information introd...

Bibliographic Details
Main authors: Baudot, Pierre, Tapia, Monica, Bennequin, Daniel, Goaillard, Jean-Marc
Format: Online Article Text
Language: English
Published: MDPI 2019
Subjects:
Online access: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC7515398/
http://dx.doi.org/10.3390/e21090869
_version_ 1783586808501633024
author Baudot, Pierre
Tapia, Monica
Bennequin, Daniel
Goaillard, Jean-Marc
author_facet Baudot, Pierre
Tapia, Monica
Bennequin, Daniel
Goaillard, Jean-Marc
author_sort Baudot, Pierre
collection PubMed
description This paper presents methods that quantify the structure of statistical interactions within a given data set; the methods were applied in a previous article. It establishes new results on the k-multivariate mutual information ([Formula: see text]), inspired by the topological formulation of information introduced in a series of studies. In particular, we show that the vanishing of all [Formula: see text] for [Formula: see text] of n random variables is equivalent to their statistical independence. Pursuing the work of Hu Kuo Ting and Te Sun Han, we show that information functions provide coordinates for binary variables, and that they are analytically independent from the probability simplex for any set of finite variables. The maximal positive [Formula: see text] identifies the variables that co-vary the most in the population, whereas the minimal negative [Formula: see text] identifies synergistic clusters and the variables that differentiate–segregate the most in the population. Finite data size effects and estimation biases severely constrain the effective computation of the information topology on data, and we provide simple statistical tests for the undersampling bias and the k-dependences. We give an example application of these methods to genetic expression and unsupervised cell-type classification. The methods unravel biologically relevant subtypes, with a sample size of 41 genes and with few errors. This work establishes generic basic methods to quantify epigenetic information storage and a unified epigenetic unsupervised learning formalism. We propose that higher-order statistical interactions and non-identically distributed variables are constitutive characteristics of biological systems that should be estimated in order to unravel their significant statistical structure and diversity. The topological information data analysis presented here allows for precisely estimating this higher-order structure characteristic of biological systems.
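The formulas in the description above are elided in this record, so the following is only a minimal sketch of how a k-multivariate mutual information can be estimated from samples. It assumes the standard alternating-sum definition over joint entropies (the convention associated with Hu Kuo Ting) and a plain plug-in entropy estimator; the function names `joint_entropy` and `multivariate_mi` are hypothetical and are not taken from the authors' software.

```python
from collections import Counter
from itertools import combinations
from math import log2


def joint_entropy(samples, idx):
    """Plug-in (empirical) Shannon entropy, in bits, of the columns listed in idx."""
    counts = Counter(tuple(row[i] for i in idx) for row in samples)
    n = len(samples)
    return -sum((c / n) * log2(c / n) for c in counts.values())


def multivariate_mi(samples, idx):
    """Alternating sum of joint entropies over all non-empty subsets of idx.

    Assumed convention: subsets of size i contribute with sign (-1)**(i - 1).
    """
    total = 0.0
    for size in range(1, len(idx) + 1):
        sign = (-1) ** (size - 1)
        for subset in combinations(idx, size):
            total += sign * joint_entropy(samples, subset)
    return total


# Toy data (illustrative only, not from the paper):
# XOR triple: any two variables are independent, all three are jointly dependent.
xor_rows = [(x, y, x ^ y) for x in (0, 1) for y in (0, 1)]
# Three identical copies of one fair bit: fully redundant variables.
copy_rows = [(x, x, x) for x in (0, 1)]

i3_xor = multivariate_mi(xor_rows, (0, 1, 2))    # negative: synergistic triple
i3_copy = multivariate_mi(copy_rows, (0, 1, 2))  # positive: co-varying triple
print(i3_xor, i3_copy)
```

Under this convention the XOR triple yields a negative value and the redundant triple a positive one, which matches the description's reading of negative values as synergy and positive values as covariation. The plug-in estimator is biased for small samples, which is the undersampling issue the description mentions; this sketch does not correct for it.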
format Online
Article
Text
id pubmed-7515398
institution National Center for Biotechnology Information
language English
publishDate 2019
publisher MDPI
record_format MEDLINE/PubMed
spelling pubmed-7515398 2020-11-09 Topological Information Data Analysis Baudot, Pierre Tapia, Monica Bennequin, Daniel Goaillard, Jean-Marc Entropy (Basel) Article This paper presents methods that quantify the structure of statistical interactions within a given data set; the methods were applied in a previous article. It establishes new results on the k-multivariate mutual information ([Formula: see text]), inspired by the topological formulation of information introduced in a series of studies. In particular, we show that the vanishing of all [Formula: see text] for [Formula: see text] of n random variables is equivalent to their statistical independence. Pursuing the work of Hu Kuo Ting and Te Sun Han, we show that information functions provide coordinates for binary variables, and that they are analytically independent from the probability simplex for any set of finite variables. The maximal positive [Formula: see text] identifies the variables that co-vary the most in the population, whereas the minimal negative [Formula: see text] identifies synergistic clusters and the variables that differentiate–segregate the most in the population. Finite data size effects and estimation biases severely constrain the effective computation of the information topology on data, and we provide simple statistical tests for the undersampling bias and the k-dependences. We give an example application of these methods to genetic expression and unsupervised cell-type classification. The methods unravel biologically relevant subtypes, with a sample size of 41 genes and with few errors. This work establishes generic basic methods to quantify epigenetic information storage and a unified epigenetic unsupervised learning formalism. We propose that higher-order statistical interactions and non-identically distributed variables are constitutive characteristics of biological systems that should be estimated in order to unravel their significant statistical structure and diversity. The topological information data analysis presented here allows for precisely estimating this higher-order structure characteristic of biological systems. MDPI 2019-09-06 /pmc/articles/PMC7515398/ http://dx.doi.org/10.3390/e21090869 Text en © 2019 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (http://creativecommons.org/licenses/by/4.0/).
spellingShingle Article
Baudot, Pierre
Tapia, Monica
Bennequin, Daniel
Goaillard, Jean-Marc
Topological Information Data Analysis
title Topological Information Data Analysis
title_full Topological Information Data Analysis
title_fullStr Topological Information Data Analysis
title_full_unstemmed Topological Information Data Analysis
title_short Topological Information Data Analysis
title_sort topological information data analysis
topic Article
url https://www.ncbi.nlm.nih.gov/pmc/articles/PMC7515398/
http://dx.doi.org/10.3390/e21090869
work_keys_str_mv AT baudotpierre topologicalinformationdataanalysis
AT tapiamonica topologicalinformationdataanalysis
AT bennequindaniel topologicalinformationdataanalysis
AT goaillardjeanmarc topologicalinformationdataanalysis