Prediction potential of candidate biomarker sets identified and validated on gene expression data from multiple datasets

BACKGROUND: Independently derived expression profiles of the same biological condition often have few genes in common. In this study, we created populations of expression profiles from publicly available microarray datasets of cancer (breast, lymphoma and renal) samples linked to clinical informatio...

Bibliographic Details
Main Authors:	Gormley, Michael; Dampier, William; Ertel, Adam; Karacali, Bilge; Tozeren, Aydin
Format:	Text
Language:	English
Published:	BioMed Central 2007
Subjects:	Research Article
Online Access:	https://www.ncbi.nlm.nih.gov/pmc/articles/PMC2211325/ https://www.ncbi.nlm.nih.gov/pubmed/17963508 http://dx.doi.org/10.1186/1471-2105-8-415

_version_	1782148512538951680
author	Gormley, Michael; Dampier, William; Ertel, Adam; Karacali, Bilge; Tozeren, Aydin
author_facet	Gormley, Michael; Dampier, William; Ertel, Adam; Karacali, Bilge; Tozeren, Aydin
author_sort	Gormley, Michael
collection	PubMed
description	BACKGROUND: Independently derived expression profiles of the same biological condition often have few genes in common. In this study, we created populations of expression profiles from publicly available microarray datasets of cancer (breast, lymphoma and renal) samples linked to clinical information with an iterative machine learning algorithm. ROC curves were used to assess the prediction error of each profile for classification. We compared the prediction error of profiles correlated with molecular phenotype against profiles correlated with relapse-free status. Prediction error of profiles identified with supervised univariate feature selection algorithms was compared to that of profiles selected randomly from a) all genes on the microarray platform and b) a list of known disease-related genes (a priori selection). We also determined the relevance of expression profiles on test arrays from independent datasets, measured on either the same or different microarray platforms. RESULTS: Highly discriminative expression profiles were produced on both simulated gene expression data and expression data from breast cancer and lymphoma datasets on the basis of ER and BCL-6 expression, respectively. Use of relapse-free status to identify profiles for prognosis prediction resulted in poorly discriminative decision rules. Supervised feature selection resulted in more accurate classifications than random or a priori selection; however, the difference in prediction error decreased as the number of features increased. These results held when decision rules were applied across datasets to samples profiled on the same microarray platform. CONCLUSION: Our results show that many gene sets predict molecular phenotypes accurately. Given this, expression profiles identified using different training datasets should be expected to show little agreement. In addition, we demonstrate the difficulty in predicting relapse directly from microarray data using supervised machine learning approaches. These findings are relevant to the use of molecular profiling for the identification of candidate biomarker panels.
format	Text
id	pubmed-2211325
institution	National Center for Biotechnology Information
language	English
publishDate	2007
publisher	BioMed Central
record_format	MEDLINE/PubMed
spelling	pubmed-2211325 2008-01-23 Prediction potential of candidate biomarker sets identified and validated on gene expression data from multiple datasets Gormley, Michael; Dampier, William; Ertel, Adam; Karacali, Bilge; Tozeren, Aydin BMC Bioinformatics Research Article BACKGROUND: Independently derived expression profiles of the same biological condition often have few genes in common. In this study, we created populations of expression profiles from publicly available microarray datasets of cancer (breast, lymphoma and renal) samples linked to clinical information with an iterative machine learning algorithm. ROC curves were used to assess the prediction error of each profile for classification. We compared the prediction error of profiles correlated with molecular phenotype against profiles correlated with relapse-free status. Prediction error of profiles identified with supervised univariate feature selection algorithms was compared to that of profiles selected randomly from a) all genes on the microarray platform and b) a list of known disease-related genes (a priori selection). We also determined the relevance of expression profiles on test arrays from independent datasets, measured on either the same or different microarray platforms. RESULTS: Highly discriminative expression profiles were produced on both simulated gene expression data and expression data from breast cancer and lymphoma datasets on the basis of ER and BCL-6 expression, respectively. Use of relapse-free status to identify profiles for prognosis prediction resulted in poorly discriminative decision rules. Supervised feature selection resulted in more accurate classifications than random or a priori selection; however, the difference in prediction error decreased as the number of features increased. These results held when decision rules were applied across datasets to samples profiled on the same microarray platform. CONCLUSION: Our results show that many gene sets predict molecular phenotypes accurately. Given this, expression profiles identified using different training datasets should be expected to show little agreement. In addition, we demonstrate the difficulty in predicting relapse directly from microarray data using supervised machine learning approaches. These findings are relevant to the use of molecular profiling for the identification of candidate biomarker panels. BioMed Central 2007-10-26 /pmc/articles/PMC2211325/ /pubmed/17963508 http://dx.doi.org/10.1186/1471-2105-8-415 Text en Copyright © 2007 Gormley et al; licensee BioMed Central Ltd. http://creativecommons.org/licenses/by/2.0 This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.
spellingShingle	Research Article Gormley, Michael Dampier, William Ertel, Adam Karacali, Bilge Tozeren, Aydin Prediction potential of candidate biomarker sets identified and validated on gene expression data from multiple datasets
title	Prediction potential of candidate biomarker sets identified and validated on gene expression data from multiple datasets
title_full	Prediction potential of candidate biomarker sets identified and validated on gene expression data from multiple datasets
title_fullStr	Prediction potential of candidate biomarker sets identified and validated on gene expression data from multiple datasets
title_full_unstemmed	Prediction potential of candidate biomarker sets identified and validated on gene expression data from multiple datasets
title_short	Prediction potential of candidate biomarker sets identified and validated on gene expression data from multiple datasets
title_sort	prediction potential of candidate biomarker sets identified and validated on gene expression data from multiple datasets
topic	Research Article
url	https://www.ncbi.nlm.nih.gov/pmc/articles/PMC2211325/ https://www.ncbi.nlm.nih.gov/pubmed/17963508 http://dx.doi.org/10.1186/1471-2105-8-415
work_keys_str_mv	AT gormleymichael predictionpotentialofcandidatebiomarkersetsidentifiedandvalidatedongeneexpressiondatafrommultipledatasets AT dampierwilliam predictionpotentialofcandidatebiomarkersetsidentifiedandvalidatedongeneexpressiondatafrommultipledatasets AT erteladam predictionpotentialofcandidatebiomarkersetsidentifiedandvalidatedongeneexpressiondatafrommultipledatasets AT karacalibilge predictionpotentialofcandidatebiomarkersetsidentifiedandvalidatedongeneexpressiondatafrommultipledatasets AT tozerenaydin predictionpotentialofcandidatebiomarkersetsidentifiedandvalidatedongeneexpressiondatafrommultipledatasets
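
The comparison described in the abstract (profiles chosen by supervised univariate feature selection versus profiles drawn at random, each scored with ROC analysis on independent test samples) can be illustrated with a small sketch. This is not the authors' iterative algorithm: the simulated Gaussian data, the signed-sum decision rule, and every name below (`simulate`, `gene_stat`, `evaluate`, the gene counts and the shift of 2.0) are assumptions made only for illustration.

```python
import random
import statistics


def auc(scores, labels):
    """Area under the ROC curve by pairwise comparison (ties count 0.5)."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def simulate(n_samples, n_genes, n_informative, shift, rng):
    """Toy expression matrix (assumption): the first n_informative genes
    have their mean shifted in class 1; all other genes are pure noise."""
    labels = [i % 2 for i in range(n_samples)]
    data = [
        [rng.gauss(shift if (y == 1 and g < n_informative) else 0.0, 1.0)
         for g in range(n_genes)]
        for y in labels
    ]
    return data, labels


def gene_stat(data, labels, g):
    """Univariate statistic for gene g: difference of class means
    divided by the combined spread of the two classes."""
    a = [row[g] for row, y in zip(data, labels) if y == 1]
    b = [row[g] for row, y in zip(data, labels) if y == 0]
    spread = (statistics.variance(a) + statistics.variance(b)) ** 0.5
    return (statistics.mean(a) - statistics.mean(b)) / spread


def evaluate(train, y_train, test, y_test, genes):
    """Learn each gene's direction on the training set, then score test
    samples with a signed sum and report the test-set AUC."""
    signs = [1.0 if gene_stat(train, y_train, g) >= 0 else -1.0 for g in genes]
    scores = [sum(s * row[g] for g, s in zip(genes, signs)) for row in test]
    return auc(scores, y_test)


rng = random.Random(0)  # fixed seed so the run is repeatable
n_genes, n_informative, k = 200, 10, 5
train, y_train = simulate(40, n_genes, n_informative, 2.0, rng)
test, y_test = simulate(40, n_genes, n_informative, 2.0, rng)

# Supervised univariate selection: keep the k genes with the largest |statistic|.
ranked = sorted(range(n_genes),
                key=lambda g: abs(gene_stat(train, y_train, g)),
                reverse=True)
supervised_genes = ranked[:k]

# Random selection from all genes on the (simulated) platform.
random_genes = rng.sample(range(n_genes), k)

auc_supervised = evaluate(train, y_train, test, y_test, supervised_genes)
auc_random = evaluate(train, y_train, test, y_test, random_genes)
print(f"supervised AUC={auc_supervised:.2f}, random AUC={auc_random:.2f}")
```

Raising `k` in this toy setup makes it more likely that a random draw includes informative genes, which is one simple way to see why the gap between supervised and random selection can narrow as the number of features grows.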

Similar Items