
Can Survival Prediction Be Improved By Merging Gene Expression Data Sets?

BACKGROUND: High-throughput gene expression profiling technologies, which generate a wealth of data, are increasingly used to characterize tumor biopsies for clinical trials. By applying machine learning algorithms to such clinically documented data sets, one hopes to improve tumor diagnosis, pro...


Bibliographic Details
Main authors:	Yasrebi, Haleh; Sperisen, Peter; Praz, Viviane; Bucher, Philipp
Format:	Text
Language:	English
Published:	Public Library of Science 2009
Subjects:	Research Article
Online access:	https://www.ncbi.nlm.nih.gov/pmc/articles/PMC2761544/ https://www.ncbi.nlm.nih.gov/pubmed/19851466 http://dx.doi.org/10.1371/journal.pone.0007431

_version_	1782172841030975488
author	Yasrebi, Haleh Sperisen, Peter Praz, Viviane Bucher, Philipp
author_facet	Yasrebi, Haleh Sperisen, Peter Praz, Viviane Bucher, Philipp
author_sort	Yasrebi, Haleh
collection	PubMed
description	BACKGROUND: High-throughput gene expression profiling technologies, which generate a wealth of data, are increasingly used to characterize tumor biopsies for clinical trials. By applying machine learning algorithms to such clinically documented data sets, one hopes to improve tumor diagnosis, prognosis, and prediction of treatment response. However, the limited number of patients enrolled in a single trial limits the power of machine learning approaches because of over-fitting. One could partially overcome this limitation by merging data from different studies. Nevertheless, such data sets differ from each other with regard to technical biases, patient selection criteria and follow-up treatment. It is therefore unclear whether the advantage of increased sample size outweighs the disadvantage of higher heterogeneity in merged data sets. Here, we present a systematic study to answer this question specifically for breast cancer data sets. We use survival prediction based on Cox regression as an assay to measure the added value of merged data sets. RESULTS: Using time-dependent Receiver Operating Characteristic-Area Under the Curve (ROC-AUC) and hazard ratio as performance measures, we see overall no significant improvement or deterioration of survival prediction with merged data sets as compared to individual data sets. This was apparently because a few genes with strong prognostic power were not available on all microarray platforms and thus were not retained in the merged data sets. Surprisingly, we found that the overall best performance was achieved with a single-gene predictor consisting of CYB5D1. CONCLUSIONS: Merging did not deteriorate performance on average despite (a) the diversity of microarray platforms used, (b) the heterogeneity of patient cohorts, (c) the heterogeneity of breast cancer disease, (d) substantial variation of time to death or relapse, and (e) the reduced number of genes in the merged data sets. Predictors derived from the merged data sets were more robust, consistent and reproducible across microarray platforms. Moreover, merging data sets from different studies helps to better understand the biases of individual studies and can lead to the identification of strong survival factors like CYB5D1 expression.
format	Text
id	pubmed-2761544
institution	National Center for Biotechnology Information
language	English
publishDate	2009
publisher	Public Library of Science
record_format	MEDLINE/PubMed
spelling	pubmed-2761544 2009-10-23 Can Survival Prediction Be Improved By Merging Gene Expression Data Sets? Yasrebi, Haleh Sperisen, Peter Praz, Viviane Bucher, Philipp PLoS One Research Article Public Library of Science 2009-10-23 /pmc/articles/PMC2761544/ /pubmed/19851466 http://dx.doi.org/10.1371/journal.pone.0007431 Text en Yasrebi et al. http://creativecommons.org/licenses/by/4.0/ This is an open-access article distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are properly credited.
spellingShingle	Research Article Yasrebi, Haleh Sperisen, Peter Praz, Viviane Bucher, Philipp Can Survival Prediction Be Improved By Merging Gene Expression Data Sets?
title	Can Survival Prediction Be Improved By Merging Gene Expression Data Sets?
title_sort	can survival prediction be improved by merging gene expression data sets?
topic	Research Article
url	https://www.ncbi.nlm.nih.gov/pmc/articles/PMC2761544/ https://www.ncbi.nlm.nih.gov/pubmed/19851466 http://dx.doi.org/10.1371/journal.pone.0007431
