Cluster analysis on high dimensional RNA-seq data with applications to cancer research - An evaluation study
BACKGROUND: Clustering of gene expression data is widely used to identify novel subtypes of cancer. Plenty of clustering approaches have been proposed, but there is a lack of knowledge regarding their relative merits and how data characteristics influence the performance. We evaluate how cluster ana...
Main authors: | Vidman, Linda; Källberg, David; Rydén, Patrik |
---|---|
Format: | Online Article Text |
Language: | English |
Published: | Public Library of Science 2019 |
Subjects: | Research Article |
Online access: | https://www.ncbi.nlm.nih.gov/pmc/articles/PMC6894875/ https://www.ncbi.nlm.nih.gov/pubmed/31805048 http://dx.doi.org/10.1371/journal.pone.0219102 |
_version_ | 1783476477519462400 |
---|---|
author | Vidman, Linda; Källberg, David; Rydén, Patrik |
author_sort | Vidman, Linda |
collection | PubMed |
description | BACKGROUND: Clustering of gene expression data is widely used to identify novel subtypes of cancer. Plenty of clustering approaches have been proposed, but there is a lack of knowledge regarding their relative merits and how data characteristics influence the performance. We evaluate how cluster analysis choices affect the performance by studying four publicly available human cancer data sets: breast, brain, kidney and stomach cancer. In particular, we focus on how the sample size, distribution of subtypes and sample heterogeneity affect the performance. RESULTS: In general, increasing the sample size had a limited effect on the clustering performance, e.g. for the breast cancer data, similar performance was obtained for n = 40 as for n = 330. The relative distribution of the subtypes had a noticeable effect on the ability to identify the disease subtypes, and data with disproportionate cluster sizes turned out to be difficult to cluster. Both the choice of clustering method and selection method affected the ability to identify the subtypes, but the relative performance varied between data sets, making it difficult to rank the approaches. For some data sets, the performance was substantially higher when the clustering was based on data from only one sex compared to data from a mixed population. This suggests that homogeneous data are easier to cluster than heterogeneous data and that clustering males and females individually may be beneficial and increase the chance of detecting novel subtypes. It was also observed that the performance often differed substantially between females and males. CONCLUSIONS: The number of samples seems to have a limited effect on the performance, while the heterogeneity, at least with respect to sex, is important for the performance. Hence, by analyzing the genders separately, the possible loss caused by having fewer samples could be outweighed by the benefit of more homogeneous data. |
format | Online Article Text |
id | pubmed-6894875 |
institution | National Center for Biotechnology Information |
language | English |
publishDate | 2019 |
publisher | Public Library of Science |
record_format | MEDLINE/PubMed |
spelling | pubmed-6894875 2019-12-14 Cluster analysis on high dimensional RNA-seq data with applications to cancer research - An evaluation study Vidman, Linda; Källberg, David; Rydén, Patrik PLoS One Research Article Public Library of Science 2019-12-05 /pmc/articles/PMC6894875/ /pubmed/31805048 http://dx.doi.org/10.1371/journal.pone.0219102 Text en © 2019 Vidman et al http://creativecommons.org/licenses/by/4.0/ This is an open access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited. |
title | Cluster analysis on high dimensional RNA-seq data with applications to cancer research - An evaluation study |
topic | Research Article |
url | https://www.ncbi.nlm.nih.gov/pmc/articles/PMC6894875/ https://www.ncbi.nlm.nih.gov/pubmed/31805048 http://dx.doi.org/10.1371/journal.pone.0219102 |