
A metric for evaluating biological information in gene sets and its application to identify co-expressed gene clusters in PBMC

Recent technological advances have made the gathering of comprehensive gene expression datasets a commodity. This has shifted the limiting step of transcriptomic studies from the accumulation of data to their analyses and interpretation. The main problem in analyzing transcriptomics data is that the...


Bibliographic Details
Main authors: Bennett, Jason; Pomaznoy, Mikhail; Singhania, Akul; Peters, Bjoern
Format: Online Article Text
Language: English
Published: Public Library of Science, 2021
Subjects: Research Article
Online access: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC8523066/
https://www.ncbi.nlm.nih.gov/pubmed/34613979
http://dx.doi.org/10.1371/journal.pcbi.1009459
collection PubMed
description Recent technological advances have made the gathering of comprehensive gene expression datasets a commodity. This has shifted the limiting step of transcriptomic studies from the accumulation of data to their analyses and interpretation. The main problem in analyzing transcriptomics data is that the number of independent samples is typically much lower (<100) than the number of genes whose expression is quantified (typically >14,000). To address this, it would be desirable to reduce the gathered data’s dimensionality without losing information. Clustering genes into discrete modules is one of the most commonly used tools to accomplish this task. While there are multiple clustering approaches, there is a lack of informative metrics available to evaluate the resultant clusters’ biological quality. Here we present a metric that incorporates known ground truth gene sets to quantify gene clusters’ biological quality derived from standard clustering techniques. The GECO (Ground truth Evaluation of Clustering Outcomes) metric demonstrates that quantitative and repeatable scoring of gene clusters is not only possible but computationally lightweight and robust. Unlike current methods, it allows direct comparison between gene clusters generated by different clustering techniques. It also reveals that current cluster analysis techniques often underestimate the number of clusters that should be formed from a dataset, which leads to fewer clusters of lower quality. As a test case, we applied GECO combined with k-means clustering to derive an optimal set of co-expressed gene modules derived from PBMC, which we show to be superior to previously generated modules generated on whole-blood. Overall, GECO provides a rational metric to test and compare different clustering approaches to analyze high-dimensional transcriptomic data.
id pubmed-8523066
institution National Center for Biotechnology Information
record_format MEDLINE/PubMed
journal PLoS Comput Biol
publication_date 2021-10-06
license © 2021 Bennett et al. This is an open access article distributed under the terms of the Creative Commons Attribution License (https://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.
topic Research Article