Group sparse canonical correlation analysis for genomic data integration

BACKGROUND: The emergence of high-throughput genomic datasets from different sources and platforms (e.g., gene expression, single nucleotide polymorphisms (SNP), and copy number variation (CNV)) has greatly enhanced our understanding of the interplay of these genomic factors as well as their influe...

Bibliographic Details
Main authors:	Lin, Dongdong; Zhang, Jigang; Li, Jingyao; Calhoun, Vince D; Deng, Hong-Wen; Wang, Yu-Ping
Format:	Online Article Text
Language:	English
Published:	BioMed Central 2013
Subjects:	Methodology Article
Online access:	https://www.ncbi.nlm.nih.gov/pmc/articles/PMC3751310/ https://www.ncbi.nlm.nih.gov/pubmed/23937249 http://dx.doi.org/10.1186/1471-2105-14-245

_version_	1782281574068256768
author	Lin, Dongdong Zhang, Jigang Li, Jingyao Calhoun, Vince D Deng, Hong-Wen Wang, Yu-Ping
author_facet	Lin, Dongdong Zhang, Jigang Li, Jingyao Calhoun, Vince D Deng, Hong-Wen Wang, Yu-Ping
author_sort	Lin, Dongdong
collection	PubMed
description	BACKGROUND: The emergence of high-throughput genomic datasets from different sources and platforms (e.g., gene expression, single nucleotide polymorphisms (SNP), and copy number variation (CNV)) has greatly enhanced our understanding of the interplay of these genomic factors as well as their influences on complex diseases. It is challenging to explore the relationship between these different types of genomic data sets. In this paper, we focus on a multivariate statistical method, canonical correlation analysis (CCA), for this problem. The conventional CCA method does not work effectively if the number of data samples is significantly less than that of biomarkers, which is a typical case for genomic data (e.g., SNPs). Sparse CCA (sCCA) methods were introduced to overcome this difficulty, mostly using penalization with the l-1 norm (CCA-l1) or a combination of the l-1 and l-2 norms (CCA-elastic net). However, they overlook the structural or group effects within genomic data, which often exist and are important (e.g., SNPs spanning a gene interact and work together as a group). RESULTS: We propose a new group sparse CCA method (CCA-sparse group) along with an effective numerical algorithm to study the mutual relationship between two different types of genomic data (i.e., SNP and gene expression). We then extend the model to a more general formulation that can include the existing sCCA models. We apply the model to feature/variable selection from two data sets and compare our group sparse CCA method with existing sCCA methods on both simulated data and two real datasets (human gliomas data and NCI60 data). We use a graphical representation of the samples with a pair of canonical variates to demonstrate the discriminating characteristic of the selected features. Pathway analysis is further performed for biological interpretation of those features. CONCLUSIONS: The CCA-sparse group method incorporates group effects of features into the correlation analysis while simultaneously performing individual feature selection. It outperforms the two sCCA methods (CCA-l1 and CCA-group) by identifying the correlated features with more true positives while controlling total discordance at a lower level on the simulated data, even if the group effect does not exist or there are irrelevant features grouped with truly correlated features. Compared with our proposed CCA-sparse group model, CCA-l1 tends to select fewer truly correlated features, while CCA-group tends to select more redundant features.
format	Online Article Text
id	pubmed-3751310
institution	National Center for Biotechnology Information
language	English
publishDate	2013
publisher	BioMed Central
record_format	MEDLINE/PubMed
spelling	pubmed-3751310 2013-08-28 Group sparse canonical correlation analysis for genomic data integration Lin, Dongdong Zhang, Jigang Li, Jingyao Calhoun, Vince D Deng, Hong-Wen Wang, Yu-Ping BMC Bioinformatics Methodology Article BACKGROUND: The emergence of high-throughput genomic datasets from different sources and platforms (e.g., gene expression, single nucleotide polymorphisms (SNP), and copy number variation (CNV)) has greatly enhanced our understanding of the interplay of these genomic factors as well as their influences on complex diseases. It is challenging to explore the relationship between these different types of genomic data sets. In this paper, we focus on a multivariate statistical method, canonical correlation analysis (CCA), for this problem. The conventional CCA method does not work effectively if the number of data samples is significantly less than that of biomarkers, which is a typical case for genomic data (e.g., SNPs). Sparse CCA (sCCA) methods were introduced to overcome this difficulty, mostly using penalization with the l-1 norm (CCA-l1) or a combination of the l-1 and l-2 norms (CCA-elastic net). However, they overlook the structural or group effects within genomic data, which often exist and are important (e.g., SNPs spanning a gene interact and work together as a group). RESULTS: We propose a new group sparse CCA method (CCA-sparse group) along with an effective numerical algorithm to study the mutual relationship between two different types of genomic data (i.e., SNP and gene expression). We then extend the model to a more general formulation that can include the existing sCCA models. We apply the model to feature/variable selection from two data sets and compare our group sparse CCA method with existing sCCA methods on both simulated data and two real datasets (human gliomas data and NCI60 data). We use a graphical representation of the samples with a pair of canonical variates to demonstrate the discriminating characteristic of the selected features. Pathway analysis is further performed for biological interpretation of those features. CONCLUSIONS: The CCA-sparse group method incorporates group effects of features into the correlation analysis while simultaneously performing individual feature selection. It outperforms the two sCCA methods (CCA-l1 and CCA-group) by identifying the correlated features with more true positives while controlling total discordance at a lower level on the simulated data, even if the group effect does not exist or there are irrelevant features grouped with truly correlated features. Compared with our proposed CCA-sparse group model, CCA-l1 tends to select fewer truly correlated features, while CCA-group tends to select more redundant features. BioMed Central 2013-08-12 /pmc/articles/PMC3751310/ /pubmed/23937249 http://dx.doi.org/10.1186/1471-2105-14-245 Text en Copyright © 2013 Lin et al.; licensee BioMed Central Ltd. http://creativecommons.org/licenses/by/2.0 This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.
spellingShingle	Methodology Article Lin, Dongdong Zhang, Jigang Li, Jingyao Calhoun, Vince D Deng, Hong-Wen Wang, Yu-Ping Group sparse canonical correlation analysis for genomic data integration
title	Group sparse canonical correlation analysis for genomic data integration
title_full	Group sparse canonical correlation analysis for genomic data integration
title_fullStr	Group sparse canonical correlation analysis for genomic data integration
title_full_unstemmed	Group sparse canonical correlation analysis for genomic data integration
title_short	Group sparse canonical correlation analysis for genomic data integration
title_sort	group sparse canonical correlation analysis for genomic data integration
topic	Methodology Article
url	https://www.ncbi.nlm.nih.gov/pmc/articles/PMC3751310/ https://www.ncbi.nlm.nih.gov/pubmed/23937249 http://dx.doi.org/10.1186/1471-2105-14-245
work_keys_str_mv	AT lindongdong groupsparsecanonicalcorrelationanalysisforgenomicdataintegration AT zhangjigang groupsparsecanonicalcorrelationanalysisforgenomicdataintegration AT lijingyao groupsparsecanonicalcorrelationanalysisforgenomicdataintegration AT calhounvinced groupsparsecanonicalcorrelationanalysisforgenomicdataintegration AT denghongwen groupsparsecanonicalcorrelationanalysisforgenomicdataintegration AT wangyuping groupsparsecanonicalcorrelationanalysisforgenomicdataintegration
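
The abstract in this record describes combining an l-1 penalty (sparsity of individual features) with a group penalty (sparsity of whole groups, such as the SNPs spanning a gene). As a rough illustration of how such a combined penalty can zero out both single features and entire groups, the sketch below applies the standard sparse-group shrinkage step to a weight vector: elementwise soft-thresholding followed by group-wise scaling. This is not the authors' algorithm. The function names, the `groups` labels, the penalty values, and the unweighted group penalty are all assumptions made for the example.

```python
import numpy as np

def soft_threshold(v, lam):
    """Elementwise soft-thresholding: proximal operator of lam * ||v||_1."""
    return np.sign(v) * np.maximum(np.abs(v) - lam, 0.0)

def sparse_group_shrink(v, groups, lam_l1, lam_group):
    """Shrink v under an l-1 penalty plus an (unweighted) per-group l-2 penalty.

    groups: hypothetical integer label per feature (e.g. SNPs mapped to a gene).
    """
    groups = np.asarray(groups)
    # Step 1: individual sparsity via the l-1 penalty.
    out = soft_threshold(np.asarray(v, dtype=float), lam_l1)
    # Step 2: group sparsity; each group is scaled towards zero as a block.
    for g in np.unique(groups):
        idx = groups == g
        norm = np.linalg.norm(out[idx])
        # A group whose norm is at most lam_group is removed entirely.
        scale = max(0.0, 1.0 - lam_group / norm) if norm > 0 else 0.0
        out[idx] *= scale
    return out

# Toy weight vector: three groups of two features each (illustrative values).
v = np.array([3.0, -0.2, 0.1, 0.05, 4.0, 0.0])
groups = np.array([0, 0, 1, 1, 2, 2])
w = sparse_group_shrink(v, groups, lam_l1=0.5, lam_group=1.0)
print(w)
```

In this toy input the second group contains only small weights, so the whole group is dropped, while the first and third groups each keep one strong feature. That is the behaviour the abstract attributes to combining group-level and individual-level selection.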

Group sparse canonical correlation analysis for genomic data integration

Similar items