A metric for evaluating biological information in gene sets and its application to identify co-expressed gene clusters in PBMC
Recent technological advances have made the gathering of comprehensive gene expression datasets a commodity. This has shifted the limiting step of transcriptomic studies from the accumulation of data to their analyses and interpretation. The main problem in analyzing transcriptomics data is that the...
Main authors: | Bennett, Jason; Pomaznoy, Mikhail; Singhania, Akul; Peters, Bjoern |
---|---|
Format: | Online Article Text |
Language: | English |
Published: | Public Library of Science 2021 |
Subjects: | |
Online access: | https://www.ncbi.nlm.nih.gov/pmc/articles/PMC8523066/ https://www.ncbi.nlm.nih.gov/pubmed/34613979 http://dx.doi.org/10.1371/journal.pcbi.1009459 |
_version_ | 1784585216975699968 |
---|---|
author | Bennett, Jason; Pomaznoy, Mikhail; Singhania, Akul; Peters, Bjoern |
author_facet | Bennett, Jason; Pomaznoy, Mikhail; Singhania, Akul; Peters, Bjoern |
author_sort | Bennett, Jason |
collection | PubMed |
description | Recent technological advances have made the gathering of comprehensive gene expression datasets a commodity. This has shifted the limiting step of transcriptomic studies from the accumulation of data to their analyses and interpretation. The main problem in analyzing transcriptomics data is that the number of independent samples is typically much lower (<100) than the number of genes whose expression is quantified (typically >14,000). To address this, it would be desirable to reduce the gathered data’s dimensionality without losing information. Clustering genes into discrete modules is one of the most commonly used tools to accomplish this task. While there are multiple clustering approaches, there is a lack of informative metrics available to evaluate the resultant clusters’ biological quality. Here we present a metric that incorporates known ground truth gene sets to quantify gene clusters’ biological quality derived from standard clustering techniques. The GECO (Ground truth Evaluation of Clustering Outcomes) metric demonstrates that quantitative and repeatable scoring of gene clusters is not only possible but computationally lightweight and robust. Unlike current methods, it allows direct comparison between gene clusters generated by different clustering techniques. It also reveals that current cluster analysis techniques often underestimate the number of clusters that should be formed from a dataset, which leads to fewer clusters of lower quality. As a test case, we applied GECO combined with k-means clustering to derive an optimal set of co-expressed gene modules derived from PBMC, which we show to be superior to previously generated modules generated on whole-blood. Overall, GECO provides a rational metric to test and compare different clustering approaches to analyze high-dimensional transcriptomic data. |
format | Online Article Text |
id | pubmed-8523066 |
institution | National Center for Biotechnology Information |
language | English |
publishDate | 2021 |
publisher | Public Library of Science |
record_format | MEDLINE/PubMed |
spelling | pubmed-8523066 2021-10-19 A metric for evaluating biological information in gene sets and its application to identify co-expressed gene clusters in PBMC Bennett, Jason; Pomaznoy, Mikhail; Singhania, Akul; Peters, Bjoern PLoS Comput Biol Research Article Public Library of Science 2021-10-06 /pmc/articles/PMC8523066/ /pubmed/34613979 http://dx.doi.org/10.1371/journal.pcbi.1009459 Text en © 2021 Bennett et al https://creativecommons.org/licenses/by/4.0/ This is an open access article distributed under the terms of the Creative Commons Attribution License (https://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited. |
spellingShingle | Research Article Bennett, Jason Pomaznoy, Mikhail Singhania, Akul Peters, Bjoern A metric for evaluating biological information in gene sets and its application to identify co-expressed gene clusters in PBMC |
title | A metric for evaluating biological information in gene sets and its application to identify co-expressed gene clusters in PBMC |
title_sort | metric for evaluating biological information in gene sets and its application to identify co-expressed gene clusters in pbmc |
topic | Research Article |
url | https://www.ncbi.nlm.nih.gov/pmc/articles/PMC8523066/ https://www.ncbi.nlm.nih.gov/pubmed/34613979 http://dx.doi.org/10.1371/journal.pcbi.1009459 |
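
The description above says the metric scores gene clusters against known ground-truth gene sets, and that forming too few clusters tends to lower cluster quality. The record does not give the GECO formula, so the following is only a minimal sketch of the general idea, not the authors' method: it assumes a size-weighted mean of each cluster's best Jaccard overlap with any reference set. The function names (`jaccard`, `score_clustering`) and the toy gene groupings are hypothetical and chosen purely for illustration.

```python
from typing import Dict, Set


def jaccard(a: Set[str], b: Set[str]) -> float:
    """Jaccard index: |intersection| / |union| (0.0 if both sets are empty)."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def score_clustering(
    clusters: Dict[str, Set[str]],
    ground_truth: Dict[str, Set[str]],
) -> float:
    """Illustrative score (an assumption, NOT the published GECO formula):
    size-weighted mean of each cluster's best Jaccard overlap with any
    ground-truth gene set."""
    total_genes = sum(len(genes) for genes in clusters.values())
    if total_genes == 0:
        return 0.0
    weighted = 0.0
    for genes in clusters.values():
        # Best match of this cluster against all reference sets.
        best = max(
            (jaccard(genes, ref) for ref in ground_truth.values()),
            default=0.0,
        )
        weighted += best * len(genes)
    return weighted / total_genes


# Toy reference sets (hypothetical groupings for illustration only).
ground_truth = {
    "interferon": {"IFI6", "ISG15", "MX1"},
    "b_cell": {"CD19", "CD79A", "MS4A1"},
}

# One coarse cluster that merges both groups vs. two finer clusters.
coarse = {"c1": {"IFI6", "ISG15", "MX1", "CD19", "CD79A", "MS4A1"}}
fine = {
    "c1": {"IFI6", "ISG15", "MX1"},
    "c2": {"CD19", "CD79A", "MS4A1"},
}

coarse_score = score_clustering(coarse, ground_truth)
fine_score = score_clustering(fine, ground_truth)
print(coarse_score, fine_score)
```

Because the score depends only on gene membership, the same function can be applied to the output of any clustering technique, which is the kind of cross-method comparison the description refers to. In this toy case the merged cluster scores lower than the two finer clusters.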