Challenges in microarray class discovery: a comprehensive examination of normalization, gene selection and clustering

BACKGROUND: Cluster analysis, and in particular hierarchical clustering, is widely used to extract information from gene expression data. The aim is to discover new classes, or sub-classes, of either individuals or genes. Performing a cluster analysis commonly involves decisions on how to handle mis...

Bibliographic Details
Main authors:	Freyhult, Eva, Landfors, Mattias, Önskog, Jenny, Hvidsten, Torgeir R, Rydén, Patrik
Format:	Text
Language:	English
Published:	BioMed Central 2010
Subjects:	Research Article
Online access:	https://www.ncbi.nlm.nih.gov/pmc/articles/PMC3098084/ https://www.ncbi.nlm.nih.gov/pubmed/20937082 http://dx.doi.org/10.1186/1471-2105-11-503

_version_	1782203913769844736
author	Freyhult, Eva Landfors, Mattias Önskog, Jenny Hvidsten, Torgeir R Rydén, Patrik
collection	PubMed
description	BACKGROUND: Cluster analysis, and in particular hierarchical clustering, is widely used to extract information from gene expression data. The aim is to discover new classes, or sub-classes, of either individuals or genes. Performing a cluster analysis commonly involves decisions on how to handle missing values, standardize the data and select genes. In addition, pre-processing, involving various types of filtration and normalization procedures, can affect the ability to discover biologically relevant classes. Here we consider cluster analysis in a broad sense and perform a comprehensive evaluation that covers several aspects of cluster analyses, including normalization. RESULTS: We evaluated 2780 cluster analysis methods on seven publicly available 2-channel microarray data sets with common reference designs. Each cluster analysis method differed in data normalization (5 normalizations were considered), missing value imputation (2), standardization of data (2), gene selection (19) or clustering method (11). The cluster analyses are evaluated using known classes, such as cancer types, and the adjusted Rand index. The performance of the different analyses varies between the data sets, and it is difficult to give general recommendations. However, normalization, gene selection and clustering method are all variables that have a significant impact on the performance. In particular, gene selection is important, and it is generally necessary to include a relatively large number of genes in order to get good performance. Selecting genes with high standard deviation or using principal component analysis are shown to be the preferred gene selection methods. Hierarchical clustering using Ward's method, k-means clustering and Mclust are the clustering methods considered in this paper that achieve the highest adjusted Rand index. Normalization can have a significant positive impact on the ability to cluster individuals, and there are indications that background correction is preferable, in particular if the gene selection is successful. However, this is an area that needs to be studied further in order to draw any general conclusions. CONCLUSIONS: The choice of cluster analysis, and in particular gene selection, has a large impact on the ability to cluster individuals correctly based on expression profiles. Normalization has a positive effect, but the relative performance of different normalizations is an area that needs more research. In summary, although clustering, gene selection and normalization are considered standard methods in bioinformatics, our comprehensive analysis shows that selecting the right methods, and the right combinations of methods, is far from trivial and that much is still unexplored in what is considered to be the most basic analysis of genomic data.
format	Text
id	pubmed-3098084
institution	National Center for Biotechnology Information
language	English
publishDate	2010
publisher	BioMed Central
record_format	MEDLINE/PubMed
spelling	pubmed-3098084 2011-07-08 Challenges in microarray class discovery: a comprehensive examination of normalization, gene selection and clustering Freyhult, Eva Landfors, Mattias Önskog, Jenny Hvidsten, Torgeir R Rydén, Patrik BMC Bioinformatics Research Article BioMed Central 2010-10-11 /pmc/articles/PMC3098084/ /pubmed/20937082 http://dx.doi.org/10.1186/1471-2105-11-503 Text en Copyright ©2010 Freyhult et al; licensee BioMed Central Ltd. http://creativecommons.org/licenses/by/2.0 This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.
title	Challenges in microarray class discovery: a comprehensive examination of normalization, gene selection and clustering
topic	Research Article
url	https://www.ncbi.nlm.nih.gov/pmc/articles/PMC3098084/ https://www.ncbi.nlm.nih.gov/pubmed/20937082 http://dx.doi.org/10.1186/1471-2105-11-503
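
The description above names two concrete building blocks: selecting genes with high standard deviation, and scoring a clustering against known classes with the adjusted Rand index. The following is a minimal Python sketch of both ideas, not the authors' implementation. The function names `select_high_sd_genes` and `adjusted_rand_index`, the list-of-lists layout (one row per gene, one column per individual), and the use of the population standard deviation are assumptions made here for illustration only.

```python
from collections import Counter
from math import comb
from statistics import pstdev


def select_high_sd_genes(expression, k):
    """Return the row indices of the k genes with the largest standard deviation.

    `expression` is assumed to be a list of rows, one row of values per gene.
    Ties keep their original order because Python's sort is stable.
    """
    ranked = sorted(range(len(expression)),
                    key=lambda g: pstdev(expression[g]),
                    reverse=True)
    return sorted(ranked[:k])


def adjusted_rand_index(labels_true, labels_pred):
    """Adjusted Rand index between two labelings of the same individuals."""
    if len(labels_true) != len(labels_pred):
        raise ValueError("both labelings must cover the same individuals")
    total_pairs = comb(len(labels_true), 2)
    if total_pairs == 0:
        return 1.0  # fewer than two individuals: nothing to disagree on

    # Pairs placed together in both partitions, in the known classes only,
    # and in the predicted clusters only.
    together_both = sum(comb(c, 2)
                        for c in Counter(zip(labels_true, labels_pred)).values())
    together_true = sum(comb(c, 2) for c in Counter(labels_true).values())
    together_pred = sum(comb(c, 2) for c in Counter(labels_pred).values())

    expected = together_true * together_pred / total_pairs
    maximum = (together_true + together_pred) / 2
    if maximum == expected:
        return 1.0  # degenerate case: both partitions are trivial and identical
    return (together_both - expected) / (maximum - expected)
```

The index depends only on which individuals are grouped together, so renaming the clusters does not change the score. A value of 1 means the clustering reproduces the known classes exactly, and values near 0 mean agreement no better than chance.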
