
Clustering cancer gene expression data: a comparative study

BACKGROUND: The use of clustering methods for the discovery of cancer subtypes has drawn a great deal of attention in the scientific community. While bioinformaticians have proposed new clustering methods that take advantage of characteristics of the gene expression data, the medical community has a...


Bibliographic Details
Main Authors: de Souto, Marcilio CP; Costa, Ivan G; de Araujo, Daniel SA; Ludermir, Teresa B; Schliep, Alexander
Format: Text
Language: English
Published: BioMed Central 2008
Subjects:
Online Access: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC2632677/
https://www.ncbi.nlm.nih.gov/pubmed/19038021
http://dx.doi.org/10.1186/1471-2105-9-497
author de Souto, Marcilio CP
Costa, Ivan G
de Araujo, Daniel SA
Ludermir, Teresa B
Schliep, Alexander
collection PubMed
description BACKGROUND: The use of clustering methods for the discovery of cancer subtypes has drawn a great deal of attention in the scientific community. While bioinformaticians have proposed new clustering methods that take advantage of characteristics of the gene expression data, the medical community has a preference for using "classic" clustering methods. There have been no studies thus far performing a large-scale evaluation of different clustering methods in this context. RESULTS/CONCLUSION: We present the first large-scale analysis of seven different clustering methods and four proximity measures for the analysis of 35 cancer gene expression data sets. Our results reveal that the finite mixture of Gaussians, followed closely by k-means, exhibited the best performance in terms of recovering the true structure of the data sets. These methods also exhibited, on average, the smallest difference between the actual number of classes in the data sets and the best number of clusters as indicated by our validation criteria. Furthermore, hierarchical methods, which have been widely used by the medical community, exhibited a poorer recovery performance than that of the other methods evaluated. Moreover, as a stable basis for the assessment and comparison of different clustering methods for cancer gene expression data, this study provides a common group of data sets (benchmark data sets) to be shared among researchers and used for comparisons with new methods. The data sets analyzed in this study are available at .
format Text
id pubmed-2632677
institution National Center for Biotechnology Information
language English
publishDate 2008
publisher BioMed Central
record_format MEDLINE/PubMed
spelling pubmed-2632677 2009-01-30 BMC Bioinformatics Research Article BioMed Central 2008-11-27 /pmc/articles/PMC2632677/ /pubmed/19038021 http://dx.doi.org/10.1186/1471-2105-9-497 Text en Copyright © 2008 de Souto et al; licensee BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.
title Clustering cancer gene expression data: a comparative study
topic Research Article