Leveraging global gene expression patterns to predict expression of unmeasured genes

BACKGROUND: Large collections of paraffin-embedded tissue represent a rich resource to test hypotheses based on gene expression patterns; however, measurement of genome-wide expression is cost-prohibitive on a large scale. Using the known expression correlation structure within a given disease type...

Bibliographic Details
Main authors: Rudd, James; Zelaya, René A.; Demidenko, Eugene; Goode, Ellen L.; Greene, Casey S.; Doherty, Jennifer A.
Format: Online Article Text
Language: English
Published: BioMed Central 2015
Subjects:
Online access: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4678722/
https://www.ncbi.nlm.nih.gov/pubmed/26666289
http://dx.doi.org/10.1186/s12864-015-2250-5
author Rudd, James
Zelaya, René A.
Demidenko, Eugene
Goode, Ellen L.
Greene, Casey S.
Doherty, Jennifer A.
collection PubMed
description BACKGROUND: Large collections of paraffin-embedded tissue represent a rich resource to test hypotheses based on gene expression patterns; however, measurement of genome-wide expression is cost-prohibitive on a large scale. Using the known expression correlation structure within a given disease type (in this case, high grade serous ovarian cancer; HGSC), we sought to identify reduced sets of directly measured (DM) genes which could accurately predict the expression of a maximized number of unmeasured genes. RESULTS: We developed a greedy gene set selection (GGS) algorithm which returns a DM set of user specified size based on a specific correlation threshold (|r(P)|) and minimum number of DM genes that must be correlated to an unmeasured gene in order to infer the value of the unmeasured gene (redundancy). We evaluated GGS in the Cancer Genome Atlas (TCGA) HGSC data across 144 combinations of DM size, redundancy (1–3), and |r(P)| (0.60, 0.65, 0.70). Across the parameter sweep, GGS allows on average 9 times more gene expression information to be captured compared to the DM set alone. GGS successfully augments prognostic HGSC gene sets; the addition of 20 GGS selected genes more than doubles the number of genes whose expression is predictable. Moreover, the expression prediction is highly accurate. After training regression models for the predictable gene set using 2/3 of the TCGA data, the average accuracy (ranked correlation of true and predicted values) in the 1/3 testing partition and four independent populations is above 0.65 and approaches 0.8 for conservative parameter sets. We observe similar accuracies in the TCGA HGSC RNA-sequencing data. Specifically, the prediction accuracy increases with increasing redundancy and increasing |r(P)|. 
CONCLUSIONS: GGS-selected genes, which maximize expression information about unmeasured genes, can be combined with candidate gene sets as a cost effective way to increase the amount of gene expression information obtained in large studies. This method can be applied to any organism, model system, disease, or tissue type for which whole genome gene expression data exists. ELECTRONIC SUPPLEMENTARY MATERIAL: The online version of this article (doi:10.1186/s12864-015-2250-5) contains supplementary material, which is available to authorized users.
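The greedy selection described in the abstract can be illustrated with a minimal sketch. This is not the authors' GGS implementation: the function name `greedy_select`, the `seed_genes` parameter (standing in for a candidate gene set to be augmented), the gain rule, the tie-breaking order, and the toy data are all assumptions made for illustration. Only the three inputs (DM set size, the |r(P)| threshold, and redundancy) come from the abstract.

```python
from itertools import combinations
from math import sqrt


def pearson(x, y):
    """Pearson correlation of two equal-length numeric sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        return 0.0  # a constant profile carries no correlation information
    return sxy / sqrt(sxx * syy)


def greedy_select(expr, dm_size, r_threshold=0.60, redundancy=1, seed_genes=()):
    """Pick up to `dm_size` directly measured (DM) genes greedily.

    expr: dict mapping gene name -> list of expression values per sample.
    A non-DM gene counts as predictable once at least `redundancy` DM genes
    are correlated with it at |r| >= r_threshold.
    Hypothetical gain rule: the number of unmeasured, not-yet-predictable
    genes that a candidate is correlated with.
    """
    genes = sorted(expr)
    # neighbors[g]: genes whose absolute correlation with g meets the threshold
    neighbors = {g: set() for g in genes}
    for a, b in combinations(genes, 2):
        if abs(pearson(expr[a], expr[b])) >= r_threshold:
            neighbors[a].add(b)
            neighbors[b].add(a)

    selected = list(seed_genes)
    # hits[g]: how many selected DM genes are correlated with g
    hits = {g: 0 for g in genes}
    for s in selected:
        for n in neighbors[s]:
            hits[n] += 1

    while len(selected) < dm_size:
        chosen = set(selected)
        best, best_gain = None, -1
        for cand in genes:  # sorted order, so ties go to the first name
            if cand in chosen:
                continue
            gain = sum(1 for n in neighbors[cand]
                       if n not in chosen and hits[n] < redundancy)
            if gain > best_gain:
                best, best_gain = cand, gain
        if best is None:  # every gene is already in the DM set
            break
        selected.append(best)
        for n in neighbors[best]:
            hits[n] += 1

    chosen = set(selected)
    predictable = {g for g in genes if g not in chosen and hits[g] >= redundancy}
    return selected, predictable


# Toy data (made up): A, B, C move together; D and E move together; F is unrelated.
toy = {
    "A": [1, 2, 3, 4, 5],
    "B": [2, 4, 6, 8, 10],
    "C": [5, 4, 3, 2, 1],
    "D": [2, 5, 1, 4, 3],
    "E": [3, 6, 2, 5, 4],
    "F": [17, 7, 2, 7, 17],
}
dm, predictable = greedy_select(toy, dm_size=2, r_threshold=0.60, redundancy=1)
print(dm, sorted(predictable))
```

In this toy case, raising `redundancy` to 2 with the same DM size shrinks the predictable set, because each unmeasured gene then needs two correlated DM genes before it counts.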
format Online
Article
Text
id pubmed-4678722
institution National Center for Biotechnology Information
language English
publishDate 2015
publisher BioMed Central
record_format MEDLINE/PubMed
spelling pubmed-4678722 2015-12-16 Leveraging global gene expression patterns to predict expression of unmeasured genes Rudd, James Zelaya, René A. Demidenko, Eugene Goode, Ellen L. Greene, Casey S. Doherty, Jennifer A. BMC Genomics Methodology Article BioMed Central 2015-12-15 /pmc/articles/PMC4678722/ /pubmed/26666289 http://dx.doi.org/10.1186/s12864-015-2250-5 Text en © Rudd et al. 2015 Open Access This article is distributed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated.
title Leveraging global gene expression patterns to predict expression of unmeasured genes
topic Methodology Article
url https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4678722/
https://www.ncbi.nlm.nih.gov/pubmed/26666289
http://dx.doi.org/10.1186/s12864-015-2250-5