Importance of data structure in comparing two dimension reduction methods for classification of microarray gene expression data

BACKGROUND: With the advance of microarray technology, several methods for gene classification and prognosis have already been designed. However, under various denominations, some of these methods have similar approaches. This study evaluates the influence of gene expression variance structure on th...

Bibliographic Details
Main authors: Truntzer, Caroline; Mercier, Catherine; Estève, Jacques; Gautier, Christian; Roy, Pascal
Format: Text
Language: English
Published: BioMed Central 2007
Subjects:
Online access: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC1831790/
https://www.ncbi.nlm.nih.gov/pubmed/17355634
http://dx.doi.org/10.1186/1471-2105-8-90
author Truntzer, Caroline
Mercier, Catherine
Estève, Jacques
Gautier, Christian
Roy, Pascal
collection PubMed
description BACKGROUND: With the advance of microarray technology, several methods for gene classification and prognosis have already been designed. However, under various denominations, some of these methods have similar approaches. This study evaluates the influence of gene expression variance structure on the performance of methods that describe the relationship between gene expression levels and a given phenotype through projection of data onto discriminant axes. RESULTS: We compared Between-Group Analysis and Discriminant Analysis (with prior dimension reduction through Partial Least Squares or Principal Components Analysis). A geometric approach showed that these two methods are strongly related, but differ in the way they handle data structure. Yet, data structure helps in understanding the predictive efficiency of these methods. Three main structure situations may be identified. When the clusters of points are clearly split, both methods perform equally well. When the clusters superpose, both methods fail to give interesting predictions. In intermediate situations, the configuration of the clusters of points has to be handled by the projection to improve prediction. For this, we recommend Discriminant Analysis. Besides, an innovative way of simulation generated the three main structures by modelling different partitions of the whole variance into within-group and between-group variances. These simulated datasets were used in complement to some well-known public datasets to investigate the methods' behaviour in a large diversity of structure situations. To examine the structure of a dataset before analysis and preselect an a priori appropriate method for its analysis, we proposed a two-graph preliminary visualization tool: plotting patients on the Between-Group Analysis discriminant axis (x-axis) and on the first and the second within-group Principal Components Analysis component (y-axis), respectively.
CONCLUSION: Discriminant Analysis outperformed Between-Group Analysis because it allows for the dataset structure. An a priori knowledge of that structure may guide the choice of the analysis method. Simulated datasets with known properties are valuable to assess and compare the performance of analysis methods, then implementation on real datasets checks and validates the results. Thus, we warn against the use of unchallenging datasets for method comparison, such as the Golub dataset, because their structure is such that any method would be efficient.
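
The partition of total variance into within-group and between-group parts, and the two-graph visualization mentioned in the abstract, can be illustrated with a minimal sketch. This is not the authors' code: it assumes NumPy, a two-group phenotype, and toy data simulated here. The function names (variance_partition, bga_axis, within_group_components, two_graph_coordinates) are hypothetical, and the Between-Group Analysis axis is taken, for the two-group case only, as the direction joining the two group centroids.

```python
import numpy as np


def variance_partition(X, y):
    """Split the total sum of squares into between- and within-group parts."""
    grand_mean = X.mean(axis=0)
    total = ((X - grand_mean) ** 2).sum()
    between = 0.0
    within = 0.0
    for label in np.unique(y):
        group = X[y == label]
        centroid = group.mean(axis=0)
        between += len(group) * ((centroid - grand_mean) ** 2).sum()
        within += ((group - centroid) ** 2).sum()
    return total, between, within


def bga_axis(X, y):
    """Unit vector joining the two group centroids (two-group case only)."""
    labels = np.unique(y)
    if len(labels) != 2:
        raise ValueError("this sketch handles exactly two groups")
    diff = X[y == labels[1]].mean(axis=0) - X[y == labels[0]].mean(axis=0)
    return diff / np.linalg.norm(diff)


def within_group_components(X, y, n_components=2):
    """Leading principal axes of the data after removing each group's centroid."""
    residuals = X.astype(float).copy()
    for label in np.unique(y):
        mask = y == label
        residuals[mask] -= residuals[mask].mean(axis=0)
    # Rows of vt are orthonormal principal axes, ordered by decreasing variance.
    _, _, vt = np.linalg.svd(residuals, full_matrices=False)
    return vt[:n_components]


def two_graph_coordinates(X, y):
    """Coordinates for the two plots: (BGA, within PC1) and (BGA, within PC2)."""
    centered = X - X.mean(axis=0)
    x_coord = centered @ bga_axis(X, y)
    pcs = centered @ within_group_components(X, y).T
    return x_coord, pcs[:, 0], pcs[:, 1]


# Toy data (simulated here, not taken from the article): 20 patients, 50 genes.
rng = np.random.default_rng(0)
y = np.repeat([0, 1], 10)
X = rng.normal(size=(20, 50))
X[y == 1, :5] += 2.0  # shift five genes in the second group

total, between, within = variance_partition(X, y)
bga, pc1, pc2 = two_graph_coordinates(X, y)
```

Plotting bga against pc1, and then bga against pc2, with points coloured by group gives the two graphs described in the abstract. The ratio between / total indicates how much of the variance separates the groups.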
format Text
id pubmed-1831790
institution National Center for Biotechnology Information
language English
publishDate 2007
publisher BioMed Central
record_format MEDLINE/PubMed
spelling pubmed-18317902007-03-26 Importance of data structure in comparing two dimension reduction methods for classification of microarray gene expression data Truntzer, Caroline Mercier, Catherine Estève, Jacques Gautier, Christian Roy, Pascal BMC Bioinformatics Research Article BACKGROUND: With the advance of microarray technology, several methods for gene classification and prognosis have been already designed. However, under various denominations, some of these methods have similar approaches. This study evaluates the influence of gene expression variance structure on the performance of methods that describe the relationship between gene expression levels and a given phenotype through projection of data onto discriminant axes. RESULTS: We compared Between-Group Analysis and Discriminant Analysis (with prior dimension reduction through Partial Least Squares or Principal Components Analysis). A geometric approach showed that these two methods are strongly related, but differ in the way they handle data structure. Yet, data structure helps understanding the predictive efficiency of these methods. Three main structure situations may be identified. When the clusters of points are clearly split, both methods perform equally well. When the clusters superpose, both methods fail to give interesting predictions. In intermediate situations, the configuration of the clusters of points has to be handled by the projection to improve prediction. For this, we recommend Discriminant Analysis. Besides, an innovative way of simulation generated the three main structures by modelling different partitions of the whole variance into within-group and between-group variances. These simulated datasets were used in complement to some well-known public datasets to investigate the methods behaviour in a large diversity of structure situations. 
To examine the structure of a dataset before analysis and preselect an a priori appropriate method for its analysis, we proposed a two-graph preliminary visualization tool: plotting patients on the Between-Group Analysis discriminant axis (x-axis) and on the first and the second within-group Principal Components Analysis component (y-axis), respectively. CONCLUSION: Discriminant Analysis outperformed Between-Group Analysis because it allows for the dataset structure. An a priori knowledge of that structure may guide the choice of the analysis method. Simulated datasets with known properties are valuable to assess and compare the performance of analysis methods, then implementation on real datasets checks and validates the results. Thus, we warn against the use of unchallenging datasets for method comparison, such as the Golub dataset, because their structure is such that any method would be efficient. BioMed Central 2007-03-13 /pmc/articles/PMC1831790/ /pubmed/17355634 http://dx.doi.org/10.1186/1471-2105-8-90 Text en Copyright © 2007 Truntzer et al; licensee BioMed Central Ltd. http://creativecommons.org/licenses/by/2.0 This is an Open Access article distributed under the terms of the Creative Commons Attribution License ( (http://creativecommons.org/licenses/by/2.0) ), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.
title Importance of data structure in comparing two dimension reduction methods for classification of microarray gene expression data
topic Research Article
url https://www.ncbi.nlm.nih.gov/pmc/articles/PMC1831790/
https://www.ncbi.nlm.nih.gov/pubmed/17355634
http://dx.doi.org/10.1186/1471-2105-8-90