Importance of data structure in comparing two dimension reduction methods for classification of microarray gene expression data

BACKGROUND: With the advance of microarray technology, several methods for gene classification and prognosis have already been designed. However, some of these methods, although named differently, take similar approaches. This study evaluates the influence of gene expression variance structure on th...

Bibliographic Details
Main Authors:	Truntzer, Caroline; Mercier, Catherine; Estève, Jacques; Gautier, Christian; Roy, Pascal
Format:	Text
Language:	English
Published:	BioMed Central 2007
Subjects:	Research Article
Online Access:	https://www.ncbi.nlm.nih.gov/pmc/articles/PMC1831790/ https://www.ncbi.nlm.nih.gov/pubmed/17355634 http://dx.doi.org/10.1186/1471-2105-8-90

_version_	1782132799962087424
author	Truntzer, Caroline; Mercier, Catherine; Estève, Jacques; Gautier, Christian; Roy, Pascal
author_facet	Truntzer, Caroline; Mercier, Catherine; Estève, Jacques; Gautier, Christian; Roy, Pascal
author_sort	Truntzer, Caroline
collection	PubMed
description	BACKGROUND: With the advance of microarray technology, several methods for gene classification and prognosis have already been designed. However, some of these methods, although named differently, take similar approaches. This study evaluates the influence of gene expression variance structure on the performance of methods that describe the relationship between gene expression levels and a given phenotype through projection of data onto discriminant axes. RESULTS: We compared Between-Group Analysis and Discriminant Analysis (with prior dimension reduction through Partial Least Squares or Principal Components Analysis). A geometric approach showed that these two methods are strongly related, but differ in the way they handle data structure. Yet data structure helps in understanding the predictive efficiency of these methods. Three main structure situations may be identified. When the clusters of points are clearly split, both methods perform equally well. When the clusters overlap, both methods fail to give interesting predictions. In intermediate situations, the configuration of the clusters of points has to be handled by the projection to improve prediction. For this, we recommend Discriminant Analysis. In addition, an innovative simulation approach generated the three main structures by modelling different partitions of the whole variance into within-group and between-group variances. These simulated datasets were used alongside some well-known public datasets to investigate the methods' behaviour across a wide range of structure situations. To examine the structure of a dataset before analysis and preselect an a priori appropriate method for its analysis, we proposed a two-graph preliminary visualization tool: plotting patients on the Between-Group Analysis discriminant axis (x-axis) and on the first and the second within-group Principal Components Analysis components (y-axis), respectively.
CONCLUSION: Discriminant Analysis outperformed Between-Group Analysis because it accounts for the dataset structure. A priori knowledge of that structure may guide the choice of the analysis method. Simulated datasets with known properties are valuable to assess and compare the performance of analysis methods; implementation on real datasets then checks and validates the results. Thus, we warn against the use of unchallenging datasets for method comparison, such as the Golub dataset, because their structure is such that any method would be efficient.
format	Text
id	pubmed-1831790
institution	National Center for Biotechnology Information
language	English
publishDate	2007
publisher	BioMed Central
record_format	MEDLINE/PubMed
spelling	pubmed-1831790 2007-03-26 Importance of data structure in comparing two dimension reduction methods for classification of microarray gene expression data Truntzer, Caroline; Mercier, Catherine; Estève, Jacques; Gautier, Christian; Roy, Pascal BMC Bioinformatics Research Article BioMed Central 2007-03-13 /pmc/articles/PMC1831790/ /pubmed/17355634 http://dx.doi.org/10.1186/1471-2105-8-90 Text en Copyright © 2007 Truntzer et al; licensee BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.
title	Importance of data structure in comparing two dimension reduction methods for classification of microarray gene expression data
topic	Research Article
url	https://www.ncbi.nlm.nih.gov/pmc/articles/PMC1831790/ https://www.ncbi.nlm.nih.gov/pubmed/17355634 http://dx.doi.org/10.1186/1471-2105-8-90
