
On the statistical assessment of classifiers using DNA microarray data

BACKGROUND: In this paper we present a method for the statistical assessment of cancer predictors which make use of gene expression profiles. The methodology is applied to a new data set of microarray gene expression data collected in Casa Sollievo della Sofferenza Hospital, Foggia – Italy. The data...


Bibliographic Details
Main Authors:	Ancona, N; Maglietta, R; Piepoli, A; D'Addabbo, A; Cotugno, R; Savino, M; Liuni, S; Carella, M; Pesole, G; Perri, F
Format:	Text
Language:	English
Published:	BioMed Central 2006
Subjects:	Research Article
Online Access:	https://www.ncbi.nlm.nih.gov/pmc/articles/PMC1564153/ https://www.ncbi.nlm.nih.gov/pubmed/16919171 http://dx.doi.org/10.1186/1471-2105-7-387

author	Ancona, N Maglietta, R Piepoli, A D'Addabbo, A Cotugno, R Savino, M Liuni, S Carella, M Pesole, G Perri, F
author_sort	Ancona, N
collection	PubMed
description	BACKGROUND: In this paper we present a method for the statistical assessment of cancer predictors that use gene expression profiles. The methodology is applied to a new data set of microarray gene expression data collected at Casa Sollievo della Sofferenza Hospital, Foggia, Italy. The data set comprises 22 normal and 25 tumor specimens extracted from 25 patients affected by colon cancer. We aim to answer several questions relevant to the automatic diagnosis of cancer: Is the size of the available data set sufficient to build accurate classifiers? What is the statistical significance of the associated error rates? How does accuracy depend on the adopted classification scheme? How many genes are correlated with the pathology, and how many are sufficient for accurate colon cancer classification? The method we propose answers these questions while avoiding the potential pitfalls hidden in the analysis and interpretation of microarray data. RESULTS: We estimate the generalization error, evaluated through the Leave-K-Out Cross Validation error, for three different classification schemes by varying the number of training examples and the number of genes used. The statistical significance of the error rate is measured with a permutation test. We provide a statistical analysis in terms of the frequencies of the genes involved in the classification. Using the whole set of genes, we found that the Weighted Voting Algorithm (WVA) classifier learns the distinction between normal and tumor specimens with 25 training examples, with an error rate of e = 21% (p = 0.045). This error rate remains constant even when the number of examples increases. Regularized Least Squares (RLS) and Support Vector Machines (SVM) classifiers can learn with only 15 training examples, with error rates of e = 19% (p = 0.035) and e = 18% (p = 0.037) respectively. For these two classifiers the error rate decreases as the training set size increases, reaching its best performance with 35 training examples. In this case, RLS and SVM have error rates of e = 14% (p = 0.027) and e = 11% (p = 0.019) respectively. Concerning the number of genes, we found about 6000 genes (p < 0.05) correlated with the pathology, as determined by the signal-to-noise statistic. The performance of the RLS and SVM classifiers does not change when 74% of the genes are used. It degrades progressively as fewer genes are used, reaching e = 16% (p < 0.05) when only 2 genes are employed. The biological relevance of a set of genes determined by our statistical analysis and the major roles they play in colorectal tumorigenesis are discussed. CONCLUSIONS: The proposed method provides statistically significant answers to precise questions relevant to the diagnosis and prognosis of cancer. We found that, with as few as 15 examples, it is possible to train statistically significant classifiers for colon cancer diagnosis. As for the number of genes sufficient for a reliable classification of colon cancer, our results suggest that it depends on the accuracy required.
format	Text
id	pubmed-1564153
institution	National Center for Biotechnology Information
language	English
publishDate	2006
publisher	BioMed Central
record_format	MEDLINE/PubMed
spelling	pubmed-1564153 2006-09-14 On the statistical assessment of classifiers using DNA microarray data Ancona, N Maglietta, R Piepoli, A D'Addabbo, A Cotugno, R Savino, M Liuni, S Carella, M Pesole, G Perri, F BMC Bioinformatics Research Article BioMed Central 2006-08-19 /pmc/articles/PMC1564153/ /pubmed/16919171 http://dx.doi.org/10.1186/1471-2105-7-387 Text en Copyright © 2006 Ancona et al; licensee BioMed Central Ltd. http://creativecommons.org/licenses/by/2.0 This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.
title	On the statistical assessment of classifiers using DNA microarray data
topic	Research Article
url	https://www.ncbi.nlm.nih.gov/pmc/articles/PMC1564153/ https://www.ncbi.nlm.nih.gov/pubmed/16919171 http://dx.doi.org/10.1186/1471-2105-7-387

Similar Items