
On the selection of appropriate distances for gene expression data clustering

BACKGROUND: Clustering is crucial for gene expression data analysis. As an unsupervised exploratory procedure, its results can help researchers gain insights and formulate new hypotheses about biological data from microarrays. Given different settings of microarray experiments, clustering proves i...


Bibliographic Details
Main authors: Jaskowiak, Pablo A, Campello, Ricardo JGB, Costa, Ivan G
Format: Online Article Text
Language: English
Published: BioMed Central 2014
Subjects:
Online access: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4072854/
https://www.ncbi.nlm.nih.gov/pubmed/24564555
http://dx.doi.org/10.1186/1471-2105-15-S2-S2
author Jaskowiak, Pablo A
Campello, Ricardo JGB
Costa, Ivan G
collection PubMed
description BACKGROUND: Clustering is crucial for gene expression data analysis. As an unsupervised exploratory procedure, its results can help researchers gain insights and formulate new hypotheses about biological data from microarrays. Given different settings of microarray experiments, clustering proves itself a versatile exploratory tool. It can help to unveil new cancer subtypes or to identify groups of genes that respond similarly to a specific experimental condition. To obtain useful clustering results, however, different parameters of the clustering procedure must be properly tuned. Besides the selection of the clustering method itself, determining which distance to employ between data objects is probably one of the most difficult decisions. RESULTS AND CONCLUSIONS: We analyze how different distances and clustering methods interact regarding their ability to cluster gene expression, i.e., microarray data. We study 15 distances along with four common clustering methods from the literature on a total of 52 gene expression microarray datasets. Distances are evaluated in a number of different scenarios, including clustering of cancer tissues and of genes from short time-series expression data, the two main clustering applications in gene expression. Our results indicate that the selection of an appropriate distance depends on the scenario at hand. Moreover, in each scenario, given the very same clustering method, significant differences in quality may arise from the selection of distinct distance measures. In fact, the selection of an appropriate distance measure can make the difference between meaningful and poor clustering outcomes, even for a suitable clustering method.
format Online
Article
Text
id pubmed-4072854
institution National Center for Biotechnology Information
language English
publishDate 2014
publisher BioMed Central
record_format MEDLINE/PubMed
spelling pubmed-4072854 2014-07-01 On the selection of appropriate distances for gene expression data clustering Jaskowiak, Pablo A Campello, Ricardo JGB Costa, Ivan G BMC Bioinformatics Proceedings
BioMed Central 2014-01-24 /pmc/articles/PMC4072854/ /pubmed/24564555 http://dx.doi.org/10.1186/1471-2105-15-S2-S2 Text en Copyright © 2014 Jaskowiak et al.; licensee BioMed Central Ltd. http://creativecommons.org/licenses/by/2.0 This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated.
title On the selection of appropriate distances for gene expression data clustering
topic Proceedings
url https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4072854/
https://www.ncbi.nlm.nih.gov/pubmed/24564555
http://dx.doi.org/10.1186/1471-2105-15-S2-S2