
Clustering of gene expression data: performance and similarity analysis

BACKGROUND: DNA Microarray technology is an innovative methodology in experimental molecular biology, which has produced huge amounts of valuable data in the profile of gene expression. Many clustering algorithms have been proposed to analyze gene expression data, but little guidance is available to...


Bibliographic Details
Main authors: Yin, Longde, Huang, Chun-Hsi, Ni, Jun
Format: Text
Language: English
Published: BioMed Central 2006
Subjects:
Online access: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC1780119/
https://www.ncbi.nlm.nih.gov/pubmed/17217511
http://dx.doi.org/10.1186/1471-2105-7-S4-S19
_version_ 1782131848679260160
author Yin, Longde
Huang, Chun-Hsi
Ni, Jun
author_facet Yin, Longde
Huang, Chun-Hsi
Ni, Jun
author_sort Yin, Longde
collection PubMed
description BACKGROUND: DNA microarray technology is an innovative methodology in experimental molecular biology that has produced large amounts of valuable gene expression profile data. Many clustering algorithms have been proposed to analyze gene expression data, but little guidance is available to help choose among them. The evaluation of feasible and applicable clustering algorithms is becoming an important issue in today's bioinformatics research. RESULTS: In this paper we first experimentally study three major clustering algorithms, Hierarchical Clustering (HC), Self-Organizing Map (SOM), and Self-Organizing Tree Algorithm (SOTA), using yeast (Saccharomyces cerevisiae) gene expression data, and compare their performance. We then introduce Cluster Diff, a new data mining tool, to conduct the similarity analysis of clusters generated by different algorithms. The performance study shows that SOTA is more efficient than SOM, while HC is the least efficient. The results of the similarity analysis show that, given a target cluster, Cluster Diff can efficiently determine the closest match from a set of clusters. Therefore, it is an effective approach for evaluating different clustering algorithms. CONCLUSION: HC methods allow a visual, convenient representation of genes. However, they are neither robust nor efficient. The SOM is more robust against noise. A disadvantage of SOM is that the number of clusters has to be fixed beforehand. The SOTA combines the advantages of both hierarchical and SOM clustering. It allows a visual representation of the clusters and their structure and is not sensitive to noise. The SOTA is also more flexible than the other two clustering methods. By using our data mining tool, Cluster Diff, it is possible to analyze the similarity of clusters generated by different algorithms and thereby enable comparisons of different clustering methods.
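The closest-match step mentioned in the abstract (given a target cluster, find the most similar cluster in a set) can be illustrated with a minimal sketch. This record does not say how Cluster Diff measures similarity, so the Jaccard index used here is an assumption chosen for illustration, not the authors' method. The function names `jaccard` and `closest_cluster`, the cluster labels, and the gene identifiers are all hypothetical.

```python
from typing import Dict, Set, Tuple


def jaccard(a: Set[str], b: Set[str]) -> float:
    """Jaccard index: genes shared by both clusters / genes in either cluster."""
    union = a | b
    # Two empty clusters have no genes in common; define their similarity as 0.
    return len(a & b) / len(union) if union else 0.0


def closest_cluster(
    target: Set[str], candidates: Dict[str, Set[str]]
) -> Tuple[str, float]:
    """Return the label and score of the candidate most similar to the target."""
    if not candidates:
        raise ValueError("candidates must not be empty")
    best = max(candidates, key=lambda label: jaccard(target, candidates[label]))
    return best, jaccard(target, candidates[best])


# Toy data invented for illustration (not taken from the paper).
target = {"g1", "g2", "g3", "g4"}           # e.g. a cluster from one algorithm
candidates = {                               # e.g. clusters from other algorithms
    "som_1": {"g1", "g2", "g9"},
    "som_2": {"g1", "g2", "g3", "g5"},
    "sota_1": {"g7", "g8"},
}

best_name, best_score = closest_cluster(target, candidates)
print(best_name, round(best_score, 2))
```

With this toy input, `som_2` shares three of the five genes in the union with the target, so it is selected. A set-overlap measure like this ignores expression values entirely; a tool comparing clusterings in practice might weigh other criteria.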
format Text
id pubmed-1780119
institution National Center for Biotechnology Information
language English
publishDate 2006
publisher BioMed Central
record_format MEDLINE/PubMed
spelling pubmed-1780119 2007-01-24 Clustering of gene expression data: performance and similarity analysis Yin, Longde Huang, Chun-Hsi Ni, Jun BMC Bioinformatics Research BioMed Central 2006-12-12 /pmc/articles/PMC1780119/ /pubmed/17217511 http://dx.doi.org/10.1186/1471-2105-7-S4-S19 Text en Copyright © 2006 Yin et al; licensee BioMed Central Ltd. This is an open access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.
spellingShingle Research
Yin, Longde
Huang, Chun-Hsi
Ni, Jun
Clustering of gene expression data: performance and similarity analysis
title Clustering of gene expression data: performance and similarity analysis
topic Research
url https://www.ncbi.nlm.nih.gov/pmc/articles/PMC1780119/
https://www.ncbi.nlm.nih.gov/pubmed/17217511
http://dx.doi.org/10.1186/1471-2105-7-S4-S19