Cluster analysis on high dimensional RNA-seq data with applications to cancer research - An evaluation study

BACKGROUND: Clustering of gene expression data is widely used to identify novel subtypes of cancer. Plenty of clustering approaches have been proposed, but there is a lack of knowledge regarding their relative merits and how data characteristics influence the performance. We evaluate how cluster ana...

Bibliographic Details
Main authors: Vidman, Linda, Källberg, David, Rydén, Patrik
Format: Online Article Text
Language: English
Published: Public Library of Science 2019
Subjects:
Online access: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC6894875/
https://www.ncbi.nlm.nih.gov/pubmed/31805048
http://dx.doi.org/10.1371/journal.pone.0219102
_version_ 1783476477519462400
author Vidman, Linda
Källberg, David
Rydén, Patrik
author_facet Vidman, Linda
Källberg, David
Rydén, Patrik
author_sort Vidman, Linda
collection PubMed
description BACKGROUND: Clustering of gene expression data is widely used to identify novel subtypes of cancer. Plenty of clustering approaches have been proposed, but there is a lack of knowledge regarding their relative merits and how data characteristics influence the performance. We evaluate how cluster analysis choices affect the performance by studying four publicly available human cancer data sets: breast, brain, kidney and stomach cancer. In particular, we focus on how the sample size, distribution of subtypes and sample heterogeneity affect the performance. RESULTS: In general, increasing the sample size had a limited effect on the clustering performance; for the breast cancer data, for example, similar performance was obtained for n = 40 as for n = 330. The relative distribution of the subtypes had a noticeable effect on the ability to identify the disease subtypes, and data with disproportionate cluster sizes turned out to be difficult to cluster. Both the choice of clustering method and selection method affected the ability to identify the subtypes, but the relative performance varied between data sets, making it difficult to rank the approaches. For some data sets, the performance was substantially higher when the clustering was based on data from only one sex compared to data from a mixed population. This suggests that homogeneous data are easier to cluster than heterogeneous data and that clustering males and females individually may be beneficial and increase the chance of detecting novel subtypes. It was also observed that the performance often differed substantially between females and males. CONCLUSIONS: The number of samples seems to have a limited effect on the performance, while the heterogeneity, at least with respect to sex, is important for the performance. Hence, by analyzing the genders separately, the possible loss caused by having fewer samples could be outweighed by the benefit of more homogeneous data.
format Online
Article
Text
id pubmed-6894875
institution National Center for Biotechnology Information
language English
publishDate 2019
publisher Public Library of Science
record_format MEDLINE/PubMed
spelling pubmed-68948752019-12-14 Cluster analysis on high dimensional RNA-seq data with applications to cancer research - An evaluation study Vidman, Linda Källberg, David Rydén, Patrik PLoS One Research Article BACKGROUND: Clustering of gene expression data is widely used to identify novel subtypes of cancer. Plenty of clustering approaches have been proposed, but there is a lack of knowledge regarding their relative merits and how data characteristics influence the performance. We evaluate how cluster analysis choices affect the performance by studying four publicly available human cancer data sets: breast, brain, kidney and stomach cancer. In particular, we focus on how the sample size, distribution of subtypes and sample heterogeneity affect the performance. RESULTS: In general, increasing the sample size had limited effect on the clustering performance, e.g. for the breast cancer data similar performance was obtained for n = 40 as for n = 330. The relative distribution of the subtypes had a noticeable effect on the ability to identify the disease subtypes and data with disproportionate cluster sizes turned out to be difficult to cluster. Both the choice of clustering method and selection method affected the ability to identify the subtypes, but the relative performance varied between data sets, making it difficult to rank the approaches. For some data sets, the performance was substantially higher when the clustering was based on data from only one sex compared to data from a mixed population. This suggests that homogeneous data are easier to cluster than heterogeneous data and that clustering males and females individually may be beneficial and increase the chance to detect novel subtypes. It was also observed that the performance often differed substantially between females and males. CONCLUSIONS: The number of samples seems to have a limited effect on the performance while the heterogeneity, at least with respect to sex, is important for the performance. 
Hence, by analyzing the genders separately, the possible loss caused by having fewer samples could be outweighed by the benefit of more homogeneous data. Public Library of Science 2019-12-05 /pmc/articles/PMC6894875/ /pubmed/31805048 http://dx.doi.org/10.1371/journal.pone.0219102 Text en © 2019 Vidman et al http://creativecommons.org/licenses/by/4.0/ This is an open access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.
spellingShingle Research Article
Vidman, Linda
Källberg, David
Rydén, Patrik
Cluster analysis on high dimensional RNA-seq data with applications to cancer research - An evaluation study
title Cluster analysis on high dimensional RNA-seq data with applications to cancer research - An evaluation study
title_full Cluster analysis on high dimensional RNA-seq data with applications to cancer research - An evaluation study
title_fullStr Cluster analysis on high dimensional RNA-seq data with applications to cancer research - An evaluation study
title_full_unstemmed Cluster analysis on high dimensional RNA-seq data with applications to cancer research - An evaluation study
title_short Cluster analysis on high dimensional RNA-seq data with applications to cancer research - An evaluation study
title_sort cluster analysis on high dimensional rna-seq data with applications to cancer research - an evaluation study
topic Research Article
url https://www.ncbi.nlm.nih.gov/pmc/articles/PMC6894875/
https://www.ncbi.nlm.nih.gov/pubmed/31805048
http://dx.doi.org/10.1371/journal.pone.0219102
work_keys_str_mv AT vidmanlinda clusteranalysisonhighdimensionalrnaseqdatawithapplicationstocancerresearchanevaluationstudy
AT kallbergdavid clusteranalysisonhighdimensionalrnaseqdatawithapplicationstocancerresearchanevaluationstudy
AT rydenpatrik clusteranalysisonhighdimensionalrnaseqdatawithapplicationstocancerresearchanevaluationstudy