
Pooling breast cancer datasets has a synergetic effect on classification performance and improves signature stability

BACKGROUND: Michiels et al. (Lancet 2005; 365: 488–92) employed a resampling strategy to show that the genes identified as predictors of prognosis from resamplings of a single gene expression dataset are highly variable. The genes most frequently identified in the separate resamplings were put forwa...


Bibliographic Details
Main authors: van Vliet, Martin H; Reyal, Fabien; Horlings, Hugo M; van de Vijver, Marc J; Reinders, Marcel JT; Wessels, Lodewyk FA
Format: Text
Language: English
Published: BioMed Central 2008
Online access: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC2527336/
https://www.ncbi.nlm.nih.gov/pubmed/18684329
http://dx.doi.org/10.1186/1471-2164-9-375
author van Vliet, Martin H
Reyal, Fabien
Horlings, Hugo M
van de Vijver, Marc J
Reinders, Marcel JT
Wessels, Lodewyk FA
collection PubMed
description BACKGROUND: Michiels et al. (Lancet 2005; 365: 488–92) employed a resampling strategy to show that the genes identified as predictors of prognosis from resamplings of a single gene expression dataset are highly variable. The genes most frequently identified in the separate resamplings were put forward as a 'gold standard'. On a higher level, breast cancer datasets collected by different institutions can be considered as resamplings from the underlying breast cancer population. The limited overlap between published prognostic signatures confirms the trend of signature instability identified by the resampling strategy. Six breast cancer datasets, totaling 947 samples, all measured on the Affymetrix platform, are currently available. This provides a unique opportunity to employ a substantial dataset to investigate the effects of pooling datasets on classifier accuracy, signature stability and enrichment of functional categories. RESULTS: We show that the resampling strategy produces a suboptimal ranking of genes, which cannot be considered to be a 'gold standard'. When pooling breast cancer datasets, we observed a synergetic effect on the classification performance in 73% of the cases. We also observed a significant positive correlation between the number of datasets that are pooled, the validation performance, the number of genes selected, and the enrichment of specific functional categories. In addition, we have evaluated the support for five explanations that have been postulated for the limited overlap of signatures. CONCLUSION: The limited overlap of current signature genes can be attributed to small sample size. Pooling datasets results in more accurate classification and a convergence of signature genes. We therefore advocate the analysis of new data within the context of a compendium, rather than analysis in isolation.
format Text
id pubmed-2527336
institution National Center for Biotechnology Information
language English
publishDate 2008
publisher BioMed Central
record_format MEDLINE/PubMed
spelling pubmed-2527336 2008-09-02 BMC Genomics Research Article BioMed Central 2008-08-06 /pmc/articles/PMC2527336/ /pubmed/18684329 http://dx.doi.org/10.1186/1471-2164-9-375 Text en Copyright © 2008 van Vliet et al; licensee BioMed Central Ltd. http://creativecommons.org/licenses/by/2.0 This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.
title Pooling breast cancer datasets has a synergetic effect on classification performance and improves signature stability
topic Research Article