
Redundancy-aware unsupervised ranking based on game theory: Ranking pathways in collections of gene sets

In Genetics, gene sets are grouped in collections concerning their biological function. This often leads to high-dimensional, overlapping, and redundant families of sets, thus precluding a straightforward interpretation of their biological meaning. In Data Mining, it is often argued that techniques...

Bibliographic Details
Main authors:	Balestra, Chiara; Maj, Carlo; Müller, Emmanuel; Mayr, Andreas
Format:	Online Article Text
Language:	English
Published:	Public Library of Science 2023
Subjects:	Research Article
Online access:	https://www.ncbi.nlm.nih.gov/pmc/articles/PMC9997904/ https://www.ncbi.nlm.nih.gov/pubmed/36893181 http://dx.doi.org/10.1371/journal.pone.0282699

_version_	1784903355918712832
author	Balestra, Chiara; Maj, Carlo; Müller, Emmanuel; Mayr, Andreas
author_facet	Balestra, Chiara; Maj, Carlo; Müller, Emmanuel; Mayr, Andreas
author_sort	Balestra, Chiara
collection	PubMed
description	In genetics, gene sets are grouped into collections according to their biological function. This often leads to high-dimensional, overlapping, and redundant families of sets, which precludes a straightforward interpretation of their biological meaning. In data mining, it is often argued that techniques for reducing the dimensionality of data can increase the maneuverability, and consequently the interpretability, of large data sets. Moreover, in recent years the machine learning and bioinformatics communities have become increasingly aware of the importance of understanding data and of interpretable models. On the one hand, there exist techniques that aggregate overlapping gene sets to create larger pathways. While these methods can partly address the large size of the collections, modifying biological pathways is hardly justifiable in this biological context. On the other hand, the representation methods proposed so far to increase the interpretability of collections of gene sets have proved insufficient. Inspired by this bioinformatics context, we propose a method to rank sets within a family of sets based on the distribution of the singletons and their size. We obtain importance scores for the sets by computing Shapley values; by making use of microarray games, we do not incur the typical exponential computational complexity. Moreover, we address the challenge of constructing redundancy-aware rankings where, in our case, redundancy is a quantity proportional to the size of the intersections among the sets in the collections. We use the obtained rankings to reduce the dimension of the families, thereby obtaining lower redundancy among sets while still preserving a high coverage of their elements. We finally evaluate our approach on collections of gene sets and apply Gene Set Enrichment Analysis techniques to the now smaller collections: as expected, given the unsupervised nature of the proposed rankings, the differences in the number of significant gene sets for specific phenotypic traits are unremarkable. In contrast, the number of statistical tests performed can be drastically reduced. The proposed rankings show practical utility in bioinformatics for increasing the interpretability of collections of gene sets and represent a step forward toward including redundancy-awareness in Shapley value computations.
format	Online Article Text
id	pubmed-9997904
institution	National Center for Biotechnology Information
language	English
publishDate	2023
publisher	Public Library of Science
record_format	MEDLINE/PubMed
spelling	pubmed-9997904 2023-03-10 PLoS One Research Article Public Library of Science 2023-03-09 /pmc/articles/PMC9997904/ /pubmed/36893181 http://dx.doi.org/10.1371/journal.pone.0282699 Text en © 2023 Balestra et al. https://creativecommons.org/licenses/by/4.0/ This is an open access article distributed under the terms of the Creative Commons Attribution License (https://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.
title	Redundancy-aware unsupervised ranking based on game theory: Ranking pathways in collections of gene sets
topic	Research Article
url	https://www.ncbi.nlm.nih.gov/pmc/articles/PMC9997904/ https://www.ncbi.nlm.nih.gov/pubmed/36893181 http://dx.doi.org/10.1371/journal.pone.0282699
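
The abstract in this record states that Shapley values are obtained through microarray games without the usual exponential cost. The following is a minimal sketch of why that is possible, not the authors' implementation. It assumes the standard microarray-game construction in which each element (gene) contributes a unanimity game over the sets that contain it, so that the Shapley value of a set is the average, over all elements, of 1 divided by the number of sets sharing that element. The function name `shapley_microarray` and the toy family are hypothetical, and the redundancy-aware re-ranking described in the abstract is not covered.

```python
from collections import Counter
from fractions import Fraction


def shapley_microarray(family):
    """Closed-form Shapley values for a microarray-style game.

    family: dict mapping a set name to a Python set of elements.
    Assumption (not taken from the record): each element defines a
    unanimity game on the sets containing it, and the game is the
    average of these unanimity games over all elements.
    """
    # How many sets of the family contain each element.
    counts = Counter(e for members in family.values() for e in members)
    n_elements = len(counts)
    if n_elements == 0:
        # No elements at all: every set gets a score of zero.
        return {name: Fraction(0) for name in family}
    # An element shared by k sets gives 1/k to each of those sets.
    return {
        name: sum((Fraction(1, counts[e]) for e in members), Fraction(0)) / n_elements
        for name, members in family.items()
    }


# Toy family of three overlapping sets (hypothetical data).
toy_family = {
    "A": {"g1", "g2"},
    "B": {"g2", "g3"},
    "C": {"g3"},
}
scores = shapley_microarray(toy_family)
ranking = sorted(scores, key=scores.get, reverse=True)
print(ranking, scores)
```

The cost is linear in the total size of the family, because no coalition of sets is ever enumerated. Exact fractions are used so that the scores can be checked to sum to 1, which is the efficiency property of the Shapley value under the stated assumption.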
