Critical limitations of consensus clustering in class discovery
Consensus clustering (CC) has been adopted for unsupervised class discovery in many genomic studies. It calculates how frequently two samples are grouped together in repeated clustering runs, and uses the resulting pairwise "consensus rates" for visual demonstration that clusters exist, fo...
Main authors: | Șenbabaoğlu, Yasin; Michailidis, George; Li, Jun Z. |
---|---|
Format: | Online Article Text |
Language: | English |
Published: | Nature Publishing Group, 2014 |
Subjects: | |
Online access: | https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4145288/ https://www.ncbi.nlm.nih.gov/pubmed/25158761 http://dx.doi.org/10.1038/srep06207 |
_version_ | 1782332145755226112 |
---|---|
author | Șenbabaoğlu, Yasin Michailidis, George Li, Jun Z. |
author_facet | Șenbabaoğlu, Yasin Michailidis, George Li, Jun Z. |
author_sort | Șenbabaoğlu, Yasin |
collection | PubMed |
description | Consensus clustering (CC) has been adopted for unsupervised class discovery in many genomic studies. It calculates how frequently two samples are grouped together in repeated clustering runs, and uses the resulting pairwise "consensus rates" for visual demonstration that clusters exist, for comparing cluster stability, and for estimating the optimal cluster number (K). However, the sensitivity and specificity of CC have not been systemically assessed. Through simulations we find that CC is able to divide randomly generated unimodal data into apparently stable clusters for a range of K, essentially reporting chance partitions of cluster-less data. For data with known structure, the common implementations of CC perform poorly in identifying the true K. These results suggest that CC should be applied and interpreted with caution. We found that a new metric based on CC, the proportion of ambiguously clustered pairs (PAC), infers K equally or more reliably than similar methods in simulated data with known K. Our overall approach involves the use of realistic null distributions based on the observed gene-gene correlation structure in a given study, and the implementation of PAC to more accurately estimate K. We discuss the strength of our approach in the context of other ensemble-based methods. |
format | Online Article Text |
id | pubmed-4145288 |
institution | National Center for Biotechnology Information |
language | English |
publishDate | 2014 |
publisher | Nature Publishing Group |
record_format | MEDLINE/PubMed |
spelling | pubmed-41452882014-09-02 Critical limitations of consensus clustering in class discovery Șenbabaoğlu, Yasin Michailidis, George Li, Jun Z. Sci Rep Article Consensus clustering (CC) has been adopted for unsupervised class discovery in many genomic studies. It calculates how frequently two samples are grouped together in repeated clustering runs, and uses the resulting pairwise "consensus rates" for visual demonstration that clusters exist, for comparing cluster stability, and for estimating the optimal cluster number (K). However, the sensitivity and specificity of CC have not been systemically assessed. Through simulations we find that CC is able to divide randomly generated unimodal data into apparently stable clusters for a range of K, essentially reporting chance partitions of cluster-less data. For data with known structure, the common implementations of CC perform poorly in identifying the true K. These results suggest that CC should be applied and interpreted with caution. We found that a new metric based on CC, the proportion of ambiguously clustered pairs (PAC), infers K equally or more reliably than similar methods in simulated data with known K. Our overall approach involves the use of realistic null distributions based on the observed gene-gene correlation structure in a given study, and the implementation of PAC to more accurately estimate K. We discuss the strength of our approach in the context of other ensemble-based methods. Nature Publishing Group 2014-08-27 /pmc/articles/PMC4145288/ /pubmed/25158761 http://dx.doi.org/10.1038/srep06207 Text en Copyright © 2014, Macmillan Publishers Limited. All rights reserved http://creativecommons.org/licenses/by-nc-sa/4.0/ This work is licensed under a Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International License. 
The images or other third party material in this article are included in the article's Creative Commons license, unless indicated otherwise in the credit line; if the material is not included under the Creative Commons license, users will need to obtain permission from the license holder in order to reproduce the material. To view a copy of this license, visit http://creativecommons.org/licenses/by-nc-sa/4.0/ |
spellingShingle | Article Șenbabaoğlu, Yasin Michailidis, George Li, Jun Z. Critical limitations of consensus clustering in class discovery |
title | Critical limitations of consensus clustering in class discovery |
title_full | Critical limitations of consensus clustering in class discovery |
title_fullStr | Critical limitations of consensus clustering in class discovery |
title_full_unstemmed | Critical limitations of consensus clustering in class discovery |
title_short | Critical limitations of consensus clustering in class discovery |
title_sort | critical limitations of consensus clustering in class discovery |
topic | Article |
url | https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4145288/ https://www.ncbi.nlm.nih.gov/pubmed/25158761 http://dx.doi.org/10.1038/srep06207 |
work_keys_str_mv | AT senbabaogluyasin criticallimitationsofconsensusclusteringinclassdiscovery AT michailidisgeorge criticallimitationsofconsensusclusteringinclassdiscovery AT lijunz criticallimitationsofconsensusclusteringinclassdiscovery |
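
The description in this record mentions pairwise "consensus rates" and the proportion of ambiguously clustered pairs (PAC) without showing how either is computed. The following is a minimal sketch, not the authors' implementation. It assumes a tiny NumPy k-means as the base clusterer, 80% subsampling, 50 runs, and a PAC interval of (0.1, 0.9); none of these settings are stated in the record, and the function names `kmeans_labels`, `consensus_matrix`, and `pac` are hypothetical.

```python
import numpy as np


def kmeans_labels(X, k, rng, n_iter=20):
    """Small Lloyd's k-means (illustrative base clusterer, an assumption)."""
    centers = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(n_iter):
        # Squared distance from every sample to every center.
        d = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
        labels = d.argmin(axis=1)
        for j in range(k):
            if np.any(labels == j):  # leave a center unchanged if its cluster is empty
                centers[j] = X[labels == j].mean(axis=0)
    return labels


def consensus_matrix(X, k, n_runs=50, frac=0.8, seed=0):
    """Fraction of runs in which two co-sampled samples share a cluster."""
    rng = np.random.default_rng(seed)
    n = len(X)
    together = np.zeros((n, n))  # times a pair landed in the same cluster
    sampled = np.zeros((n, n))   # times a pair was drawn in the same subsample
    m = max(k, int(round(frac * n)))
    for _ in range(n_runs):
        idx = rng.choice(n, size=m, replace=False)
        labels = kmeans_labels(X[idx], k, rng)
        same = (labels[:, None] == labels[None, :]).astype(float)
        sampled[np.ix_(idx, idx)] += 1.0
        together[np.ix_(idx, idx)] += same
    # Pairs that were never co-sampled get a consensus rate of 0.
    return np.divide(together, sampled, out=np.zeros_like(together), where=sampled > 0)


def pac(consensus, lower=0.1, upper=0.9):
    """Share of sample pairs whose consensus rate lies strictly between the
    two thresholds (the thresholds are an assumption of this sketch)."""
    iu = np.triu_indices_from(consensus, k=1)  # each pair once, no diagonal
    vals = consensus[iu]
    return float(np.mean((vals > lower) & (vals < upper)))


# Toy data: two well-separated Gaussian blobs, so the true K is 2.
rng = np.random.default_rng(1)
blobs = np.vstack([
    rng.normal(0.0, 1.0, size=(20, 2)),
    rng.normal(20.0, 1.0, size=(20, 2)),
])
for k in (2, 3, 4):
    print(k, pac(consensus_matrix(blobs, k)))
```

Under this sketch, a candidate K with a low PAC is one where most pairs are either almost always or almost never clustered together. The record's description also calls for comparing against a null distribution built from the observed gene-gene correlation structure, which this example does not attempt.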