
Strategies for aggregating gene expression data: The collapseRows R function

BACKGROUND: Genomic and other high dimensional analyses often require one to summarize multiple related variables by a single representative. This task is also variously referred to as collapsing, combining, reducing, or aggregating variables. Examples include summarizing several probe measurements...


Bibliographic Details
Main authors: Miller, Jeremy A, Cai, Chaochao, Langfelder, Peter, Geschwind, Daniel H, Kurian, Sunil M, Salomon, Daniel R, Horvath, Steve
Format: Online Article Text
Language: English
Published: BioMed Central 2011
Subjects:
Online access: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC3166942/
https://www.ncbi.nlm.nih.gov/pubmed/21816037
http://dx.doi.org/10.1186/1471-2105-12-322
_version_ 1782211211071324160
author Miller, Jeremy A
Cai, Chaochao
Langfelder, Peter
Geschwind, Daniel H
Kurian, Sunil M
Salomon, Daniel R
Horvath, Steve
author_facet Miller, Jeremy A
Cai, Chaochao
Langfelder, Peter
Geschwind, Daniel H
Kurian, Sunil M
Salomon, Daniel R
Horvath, Steve
author_sort Miller, Jeremy A
collection PubMed
description BACKGROUND: Genomic and other high dimensional analyses often require one to summarize multiple related variables by a single representative. This task is also variously referred to as collapsing, combining, reducing, or aggregating variables. Examples include summarizing several probe measurements corresponding to a single gene, representing the expression profiles of a co-expression module by a single expression profile, and aggregating cell-type marker information to de-convolute expression data. Several standard statistical summary techniques can be used, but network methods also provide useful alternative methods to find representatives. Currently few collapsing functions are developed and widely applied.
RESULTS: We introduce the R function collapseRows that implements several collapsing methods and evaluate its performance in three applications. First, we study a crucial step of the meta-analysis of microarray data: the merging of independent gene expression data sets, which may have been measured on different platforms. Toward this end, we collapse multiple microarray probes for a single gene and then merge the data by gene identifier. We find that choosing the probe with the highest average expression leads to best between-study consistency. Second, we study methods for summarizing the gene expression profiles of a co-expression module. Several gene co-expression network analysis applications show that the optimal collapsing strategy depends on the analysis goal. Third, we study aggregating the information of cell type marker genes when the aim is to predict the abundance of cell types in a tissue sample based on gene expression data ("expression deconvolution"). We apply different collapsing methods to predict cell type abundances in peripheral human blood and in mixtures of blood cell lines. Interestingly, the most accurate prediction method involves choosing the most highly connected "hub" marker gene. Finally, to facilitate biological interpretation of collapsed gene lists, we introduce the function userListEnrichment, which assesses the enrichment of gene lists for known brain and blood cell type markers, and for other published biological pathways.
CONCLUSIONS: The R function collapseRows implements several standard and network-based collapsing methods. In various genomic applications we provide evidence that both types of methods are robust and biologically relevant tools.
format Online
Article
Text
id pubmed-3166942
institution National Center for Biotechnology Information
language English
publishDate 2011
publisher BioMed Central
record_format MEDLINE/PubMed
spelling pubmed-3166942 2011-09-06 Strategies for aggregating gene expression data: The collapseRows R function Miller, Jeremy A Cai, Chaochao Langfelder, Peter Geschwind, Daniel H Kurian, Sunil M Salomon, Daniel R Horvath, Steve BMC Bioinformatics Methodology Article BioMed Central 2011-08-04 /pmc/articles/PMC3166942/ /pubmed/21816037 http://dx.doi.org/10.1186/1471-2105-12-322 Text en Copyright ©2011 Miller et al; licensee BioMed Central Ltd. http://creativecommons.org/licenses/by/2.0 This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.
spellingShingle Methodology Article
Miller, Jeremy A
Cai, Chaochao
Langfelder, Peter
Geschwind, Daniel H
Kurian, Sunil M
Salomon, Daniel R
Horvath, Steve
Strategies for aggregating gene expression data: The collapseRows R function
title Strategies for aggregating gene expression data: The collapseRows R function
title_full Strategies for aggregating gene expression data: The collapseRows R function
title_fullStr Strategies for aggregating gene expression data: The collapseRows R function
title_full_unstemmed Strategies for aggregating gene expression data: The collapseRows R function
title_short Strategies for aggregating gene expression data: The collapseRows R function
title_sort strategies for aggregating gene expression data: the collapserows r function
topic Methodology Article
url https://www.ncbi.nlm.nih.gov/pmc/articles/PMC3166942/
https://www.ncbi.nlm.nih.gov/pubmed/21816037
http://dx.doi.org/10.1186/1471-2105-12-322
work_keys_str_mv AT millerjeremya strategiesforaggregatinggeneexpressiondatathecollapserowsrfunction
AT caichaochao strategiesforaggregatinggeneexpressiondatathecollapserowsrfunction
AT langfelderpeter strategiesforaggregatinggeneexpressiondatathecollapserowsrfunction
AT geschwinddanielh strategiesforaggregatinggeneexpressiondatathecollapserowsrfunction
AT kuriansunilm strategiesforaggregatinggeneexpressiondatathecollapserowsrfunction
AT salomondanielr strategiesforaggregatinggeneexpressiondatathecollapserowsrfunction
AT horvathsteve strategiesforaggregatinggeneexpressiondatathecollapserowsrfunction
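
Illustrative note: the first application in the abstract (collapsing several microarray probes per gene by keeping the probe with the highest average expression, then merging data sets by gene identifier) can be sketched in a few lines. This is a minimal Python sketch under stated assumptions, not the collapseRows R implementation: the function names `collapse_max_mean` and `merge_by_gene`, the dictionary-based data layout, and the toy values are all hypothetical and chosen only for illustration.

```python
from collections import defaultdict
from statistics import mean


def collapse_max_mean(expr, probe_to_gene):
    """Keep, for each gene, the probe with the highest mean expression.

    expr: dict mapping probe ID -> list of expression values (one per sample).
    probe_to_gene: dict mapping probe ID -> gene identifier.
    Returns (collapsed, selected): gene -> values, and gene -> chosen probe.
    Hypothetical helper; not the collapseRows API.
    """
    by_gene = defaultdict(list)
    for probe in expr:
        gene = probe_to_gene.get(probe)
        if gene is not None:  # probes without a gene annotation are dropped
            by_gene[gene].append(probe)

    collapsed, selected = {}, {}
    for gene, probes in by_gene.items():
        # On ties, max() keeps the first probe encountered.
        best = max(probes, key=lambda p: mean(expr[p]))
        selected[gene] = best
        collapsed[gene] = expr[best]
    return collapsed, selected


def merge_by_gene(study_a, study_b):
    """Restrict two collapsed data sets to the gene identifiers they share."""
    shared = sorted(set(study_a) & set(study_b))
    return {gene: (study_a[gene], study_b[gene]) for gene in shared}


# Toy data, made up for illustration: two probes for GENE_A, one for GENE_B.
expr = {
    "p1": [5.0, 6.0, 7.0],
    "p2": [8.0, 9.0, 10.0],
    "p3": [2.0, 2.5, 3.0],
}
probe_to_gene = {"p1": "GENE_A", "p2": "GENE_A", "p3": "GENE_B"}

collapsed, selected = collapse_max_mean(expr, probe_to_gene)
print(selected)
```

Once each study is collapsed to one row per gene, data sets measured on different platforms can be aligned on the shared gene identifiers, which is the merging step the abstract refers to.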