
Integrative analysis of gene expression and copy number alterations using canonical correlation analysis

BACKGROUND: With the rapid development of new genetic measurement methods, several types of genetic alterations can be quantified in a high-throughput manner. While the initial focus has been on investigating each data set separately, there is an increasing interest in studying the correlation struc...


Bibliographic Details
Main authors:	Soneson, Charlotte, Lilljebjörn, Henrik, Fioretos, Thoas, Fontes, Magnus
Format:	Text
Language:	English
Published:	BioMed Central 2010
Subjects:	Research article
Online access:	https://www.ncbi.nlm.nih.gov/pmc/articles/PMC2873536/ https://www.ncbi.nlm.nih.gov/pubmed/20398334 http://dx.doi.org/10.1186/1471-2105-11-191

author	Soneson, Charlotte Lilljebjörn, Henrik Fioretos, Thoas Fontes, Magnus
collection	PubMed
description	BACKGROUND: With the rapid development of new genetic measurement methods, several types of genetic alterations can be quantified in a high-throughput manner. While the initial focus has been on investigating each data set separately, there is an increasing interest in studying the correlation structure between two or more data sets. Multivariate methods based on Canonical Correlation Analysis (CCA) have been proposed for integrating paired genetic data sets. The high dimensionality of microarray data imposes computational difficulties, which have been addressed for instance by studying the covariance structure of the data, or by reducing the number of variables prior to applying the CCA. In this work, we propose a new method for analyzing high-dimensional paired genetic data sets, which mainly emphasizes the correlation structure and still permits efficient application to very large data sets. The method is implemented by translating a regularized CCA to its dual form, where the computational complexity depends mainly on the number of samples instead of the number of variables. The optimal regularization parameters are chosen by cross-validation. We apply the regularized dual CCA, as well as a classical CCA preceded by a dimension-reducing Principal Components Analysis (PCA), to a paired data set of gene expression changes and copy number alterations in leukemia. RESULTS: Using the correlation-maximizing methods, regularized dual CCA and PCA+CCA, we show that without pre-selection of known disease-relevant genes, and without using information about clinical class membership, an exploratory analysis singles out two patient groups, corresponding to well-known leukemia subtypes. Furthermore, the variables showing the highest relevance to the extracted features agree with previous biological knowledge concerning copy number alterations and gene expression changes in these subtypes. Finally, the correlation-maximizing methods are shown to yield results which are more biologically interpretable than those resulting from a covariance-maximizing method, and provide different insight compared to when each variable set is studied separately using PCA. CONCLUSIONS: We conclude that regularized dual CCA as well as PCA+CCA are useful methods for exploratory analysis of paired genetic data sets, and can be efficiently implemented also when the number of variables is very large.
format	Text
id	pubmed-2873536
institution	National Center for Biotechnology Information
language	English
publishDate	2010
publisher	BioMed Central
record_format	MEDLINE/PubMed
spelling	pubmed-2873536 2010-05-20 BMC Bioinformatics Research article BioMed Central 2010-04-15 /pmc/articles/PMC2873536/ /pubmed/20398334 http://dx.doi.org/10.1186/1471-2105-11-191 Text en Copyright ©2010 Soneson et al; licensee BioMed Central Ltd. http://creativecommons.org/licenses/by/2.0 This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.
title	Integrative analysis of gene expression and copy number alterations using canonical correlation analysis
topic	Research article
url	https://www.ncbi.nlm.nih.gov/pmc/articles/PMC2873536/ https://www.ncbi.nlm.nih.gov/pubmed/20398334 http://dx.doi.org/10.1186/1471-2105-11-191
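
The abstract describes translating a regularized CCA to its dual form, so that the cost depends mainly on the number of samples rather than the number of variables. The block below is a minimal sketch of that general idea, not the authors' implementation. It assumes a ridge-type constraint, alpha^T (Kx^2 + lam_x Kx) alpha = 1, on linear Gram matrices Kx = X X^T and Ky = Y Y^T; the function name `regularized_dual_cca` and the parameters `lam_x`, `lam_y` and `tol` are hypothetical, and the cross-validation used to choose the regularization parameters is not shown.

```python
import numpy as np


def regularized_dual_cca(X, Y, lam_x, lam_y, n_components=1, tol=1e-10):
    """Sketch of ridge-regularized CCA solved in the sample (dual) space.

    X: (n, p) and Y: (n, q) hold paired samples in rows.
    Returns (rho, alpha, beta), where alpha and beta are (n, k) dual
    coefficients; primal weights would be Xc.T @ alpha and Yc.T @ beta.
    Assumed constraint: alpha^T (Kx^2 + lam_x * Kx) alpha = 1 (same for beta).
    """
    # Center each variable, then form the n x n linear Gram matrices.
    Xc = X - X.mean(axis=0)
    Yc = Y - Y.mean(axis=0)
    Kx = Xc @ Xc.T
    Ky = Yc @ Yc.T

    def whitened_basis(K, lam):
        # Eigendecompose the symmetric Gram matrix and drop null directions.
        d, U = np.linalg.eigh(K)
        keep = d > tol * d.max()
        d, U = d[keep], U[:, keep]
        # P maps a unit vector u to the score vector K @ alpha.
        P = U * np.sqrt(d / (d + lam))
        # B maps the same unit vector u to the dual coefficients alpha.
        B = U / np.sqrt(d * (d + lam))
        return P, B

    Px, Bx = whitened_basis(Kx, lam_x)
    Py, By = whitened_basis(Ky, lam_y)

    # Singular values of Px^T Py are the maximized regularized objective
    # alpha^T Kx Ky beta; they shrink below the raw score correlation
    # as lam_x and lam_y grow.
    U, s, Vt = np.linalg.svd(Px.T @ Py, full_matrices=False)
    k = n_components
    alpha = Bx @ U[:, :k]
    beta = By @ Vt[:k].T
    return s[:k], alpha, beta
```

All decompositions here act on n x n or smaller matrices, so the number of variables enters only through the two Gram-matrix products. Sample scores for the first pair can be obtained as `Kx @ alpha[:, 0]` and `Ky @ beta[:, 0]`, with the Gram matrices recomputed from the centered data.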


Similar items