
Subtype identification from heterogeneous TCGA datasets on a genomic scale by multi-view clustering with enhanced consensus

BACKGROUND: The Cancer Genome Atlas (TCGA) has collected transcriptome, genome and epigenome information for over 20 cancers from thousands of patients. The availability of these diverse data types makes it necessary to combine these data to capture the heterogeneity of biological processes and phen...


Bibliographic Details
Main authors:	Cai, Menglan; Li, Limin
Format:	Online Article Text
Language:	English
Published:	BioMed Central 2017
Subjects:	Research
Online access:	https://www.ncbi.nlm.nih.gov/pmc/articles/PMC5763310/ https://www.ncbi.nlm.nih.gov/pubmed/29322925 http://dx.doi.org/10.1186/s12920-017-0306-x

_version_	1783291859331710976
author	Cai, Menglan Li, Limin
collection	PubMed
description	BACKGROUND: The Cancer Genome Atlas (TCGA) has collected transcriptome, genome and epigenome information for over 20 cancers from thousands of patients. The availability of these diverse data types makes it necessary to combine them to capture the heterogeneity of biological processes and phenotypes, and to identify homogeneous subtypes for cancers such as breast cancer. Many multi-view clustering approaches have been proposed to discover clusters across different data types. The problem is challenging when different data types show poor agreement in clustering structure. RESULTS: In this work, we first propose a multi-view clustering approach with consensus (CMC), which tries to find consensus kernels among views by using the Hilbert-Schmidt Independence Criterion (HSIC). To handle poor agreement among views, we further propose a multi-view clustering approach with enhanced consensus (ECMC), which decomposes the kernel information in each view into a consensus part and a disagreement part. The consensus parts for different views are supposed to be similar, and the disagreement parts should be independent of the consensus parts. Both the CMC and ECMC models can be solved by alternating updates with semi-definite programming. Our experiments on both simulation datasets and real-world benchmark datasets show that the ECMC model can achieve higher clustering accuracy than other state-of-the-art multi-view clustering approaches. We also apply the ECMC model to integrate mRNA expression, DNA methylation and microRNA (miRNA) expression data for five cancer data sets, and the survival analysis shows that the ECMC model outperforms other methods when identifying cancer subtypes. Using Fisher’s combination test, we found that three computed subtypes roughly correspond to three known breast cancer subtypes: luminal B, HER2 and basal-like. CONCLUSION: Integrating heterogeneous TCGA datasets with the proposed multi-view clustering approach, ECMC, can effectively identify cancer subtypes.
format	Online Article Text
id	pubmed-5763310
institution	National Center for Biotechnology Information
language	English
publishDate	2017
publisher	BioMed Central
record_format	MEDLINE/PubMed
spelling	pubmed-5763310 2018-01-17 Subtype identification from heterogeneous TCGA datasets on a genomic scale by multi-view clustering with enhanced consensus Cai, Menglan Li, Limin BMC Med Genomics Research BioMed Central 2017-12-21 /pmc/articles/PMC5763310/ /pubmed/29322925 http://dx.doi.org/10.1186/s12920-017-0306-x Text en © The Author(s) 2017 Open Access This article is distributed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated.
title	Subtype identification from heterogeneous TCGA datasets on a genomic scale by multi-view clustering with enhanced consensus
topic	Research
url	https://www.ncbi.nlm.nih.gov/pmc/articles/PMC5763310/ https://www.ncbi.nlm.nih.gov/pubmed/29322925 http://dx.doi.org/10.1186/s12920-017-0306-x
work_keys_str_mv	AT caimenglan subtypeidentificationfromheterogeneoustcgadatasetsonagenomicscalebymultiviewclusteringwithenhancedconsensus AT lilimin subtypeidentificationfromheterogeneoustcgadatasetsonagenomicscalebymultiviewclusteringwithenhancedconsensus
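
Note on the description field: the abstract says that CMC looks for consensus kernels among views by using the Hilbert-Schmidt Independence Criterion. As a rough illustration of that criterion only, here is a minimal Python sketch of the standard biased empirical HSIC estimator, tr(K H L H) / (n - 1)^2, where H is the centering matrix. It is not the authors' CMC or ECMC optimization. The function names `label_kernel` and `hsic`, the toy label-indicator kernel, and the eight-sample label vectors are assumptions made for this example and do not come from the record.

```python
import numpy as np


def label_kernel(labels):
    """Toy kernel: entry (i, j) is 1.0 if samples i and j share a label, else 0.0."""
    y = np.asarray(labels)
    return (y[:, None] == y[None, :]).astype(float)


def hsic(K, L):
    """Biased empirical HSIC of two n x n kernel matrices: tr(K H L H) / (n - 1)^2."""
    n = K.shape[0]
    H = np.eye(n) - np.ones((n, n)) / n  # centering matrix
    return float(np.trace(K @ H @ L @ H)) / (n - 1) ** 2


# Hypothetical views of the same 8 samples (illustrative labels, not TCGA data).
K_a = label_kernel([0, 0, 0, 0, 1, 1, 1, 1])
K_b = label_kernel([1, 1, 1, 1, 0, 0, 0, 0])  # same grouping as K_a, label names swapped
K_c = label_kernel([0, 1, 0, 1, 0, 1, 0, 1])  # grouping statistically unrelated to K_a

print("agreeing views:   ", hsic(K_a, K_b))
print("disagreeing views:", hsic(K_a, K_c))
```

In this toy setting the two views that share a grouping score 16/49 (about 0.327), while the view with an unrelated grouping scores zero. This is the sense in which a high HSIC value between kernels indicates agreement among views and a value near zero indicates independence.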


Similar items