Improved high-dimensional prediction with Random Forests by the use of co-data

BACKGROUND: Prediction in high dimensional settings is difficult due to the large number of variables relative to the sample size. We demonstrate how auxiliary ‘co-data’ can be used to improve the performance of a Random Forest in such a setting. RESULTS: Co-data are incorporated in the Random Fores...

Bibliographic Details
Main authors:	te Beest, Dennis E., Mes, Steven W., Wilting, Saskia M., Brakenhoff, Ruud H., van de Wiel, Mark A.
Format:	Online Article Text
Language:	English
Published:	BioMed Central 2017
Subjects:	Methodology Article
Online access:	https://www.ncbi.nlm.nih.gov/pmc/articles/PMC5745983/ https://www.ncbi.nlm.nih.gov/pubmed/29281963 http://dx.doi.org/10.1186/s12859-017-1993-1

_version_	1783289017757859840
author	te Beest, Dennis E. Mes, Steven W. Wilting, Saskia M. Brakenhoff, Ruud H. van de Wiel, Mark A.
author_facet	te Beest, Dennis E. Mes, Steven W. Wilting, Saskia M. Brakenhoff, Ruud H. van de Wiel, Mark A.
author_sort	te Beest, Dennis E.
collection	PubMed
description	BACKGROUND: Prediction in high-dimensional settings is difficult due to the large number of variables relative to the sample size. We demonstrate how auxiliary ‘co-data’ can be used to improve the performance of a Random Forest in such a setting. RESULTS: Co-data are incorporated in the Random Forest by replacing the uniform sampling probabilities that are used to draw candidate variables by co-data moderated sampling probabilities. Co-data here are defined as any type of information that is available on the variables of the primary data, but does not use its response labels. Inspired by empirical Bayes, these moderated sampling probabilities are learned from the data at hand. We demonstrate the co-data moderated Random Forest (CoRF) with two examples. In the first example we aim to predict the presence of a lymph node metastasis with gene expression data. We demonstrate how a set of external p-values, a gene signature, and the correlation between gene expression and DNA copy number can improve the predictive performance. In the second example we demonstrate how the prediction of cervical (pre-)cancer with methylation data can be improved by including the location of the probe relative to the known CpG islands, the number of CpG sites targeted by a probe, and a set of p-values from a related study. CONCLUSION: The proposed method is able to utilize auxiliary co-data to improve the performance of a Random Forest. ELECTRONIC SUPPLEMENTARY MATERIAL: The online version of this article (doi:10.1186/s12859-017-1993-1) contains supplementary material, which is available to authorized users.
format	Online Article Text
id	pubmed-5745983
institution	National Center for Biotechnology Information
language	English
publishDate	2017
publisher	BioMed Central
record_format	MEDLINE/PubMed
spelling	pubmed-5745983 2018-01-03 Improved high-dimensional prediction with Random Forests by the use of co-data te Beest, Dennis E. Mes, Steven W. Wilting, Saskia M. Brakenhoff, Ruud H. van de Wiel, Mark A. BMC Bioinformatics Methodology Article BACKGROUND: Prediction in high-dimensional settings is difficult due to the large number of variables relative to the sample size. We demonstrate how auxiliary ‘co-data’ can be used to improve the performance of a Random Forest in such a setting. RESULTS: Co-data are incorporated in the Random Forest by replacing the uniform sampling probabilities that are used to draw candidate variables by co-data moderated sampling probabilities. Co-data here are defined as any type of information that is available on the variables of the primary data, but does not use its response labels. Inspired by empirical Bayes, these moderated sampling probabilities are learned from the data at hand. We demonstrate the co-data moderated Random Forest (CoRF) with two examples. In the first example we aim to predict the presence of a lymph node metastasis with gene expression data. We demonstrate how a set of external p-values, a gene signature, and the correlation between gene expression and DNA copy number can improve the predictive performance. In the second example we demonstrate how the prediction of cervical (pre-)cancer with methylation data can be improved by including the location of the probe relative to the known CpG islands, the number of CpG sites targeted by a probe, and a set of p-values from a related study. CONCLUSION: The proposed method is able to utilize auxiliary co-data to improve the performance of a Random Forest. ELECTRONIC SUPPLEMENTARY MATERIAL: The online version of this article (doi:10.1186/s12859-017-1993-1) contains supplementary material, which is available to authorized users.
BioMed Central 2017-12-28 /pmc/articles/PMC5745983/ /pubmed/29281963 http://dx.doi.org/10.1186/s12859-017-1993-1 Text en © The Author(s) 2017 Open Access This article is distributed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated.
spellingShingle	Methodology Article te Beest, Dennis E. Mes, Steven W. Wilting, Saskia M. Brakenhoff, Ruud H. van de Wiel, Mark A. Improved high-dimensional prediction with Random Forests by the use of co-data
title	Improved high-dimensional prediction with Random Forests by the use of co-data
title_full	Improved high-dimensional prediction with Random Forests by the use of co-data
title_fullStr	Improved high-dimensional prediction with Random Forests by the use of co-data
title_full_unstemmed	Improved high-dimensional prediction with Random Forests by the use of co-data
title_short	Improved high-dimensional prediction with Random Forests by the use of co-data
title_sort	improved high-dimensional prediction with random forests by the use of co-data
topic	Methodology Article
url	https://www.ncbi.nlm.nih.gov/pmc/articles/PMC5745983/ https://www.ncbi.nlm.nih.gov/pubmed/29281963 http://dx.doi.org/10.1186/s12859-017-1993-1
work_keys_str_mv	AT tebeestdennise improvedhighdimensionalpredictionwithrandomforestsbytheuseofcodata AT messtevenw improvedhighdimensionalpredictionwithrandomforestsbytheuseofcodata AT wiltingsaskiam improvedhighdimensionalpredictionwithrandomforestsbytheuseofcodata AT brakenhoffruudh improvedhighdimensionalpredictionwithrandomforestsbytheuseofcodata AT vandewielmarka improvedhighdimensionalpredictionwithrandomforestsbytheuseofcodata
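
Illustrative note on the description field: the abstract says that CoRF replaces the uniform probabilities used to draw candidate variables with co-data moderated probabilities. The Python sketch below shows only that sampling step and is not the authors' procedure. The mapping from external p-values to weights (negative log plus a small offset) and the names `codata_sampling_probs` and `draw_candidates` are assumptions made for illustration; the article learns its probabilities from the data in a way this record does not detail.

```python
import numpy as np


def codata_sampling_probs(external_pvalues, floor=1e-12):
    """Hypothetical mapping from external p-values to sampling probabilities.

    Assumption for illustration only: weight = -log(p) plus a small offset.
    """
    p = np.clip(np.asarray(external_pvalues, dtype=float), floor, 1.0)
    # Smaller p-value gives a larger weight; the offset keeps every weight positive.
    weights = -np.log(p) + 1e-6
    return weights / weights.sum()


def draw_candidates(probs, mtry, rng):
    """Draw mtry distinct candidate variables for one split."""
    return rng.choice(len(probs), size=mtry, replace=False, p=probs)


rng = np.random.default_rng(0)
pvals = np.array([0.001, 0.5, 0.9, 0.02, 0.7, 0.3])  # made-up co-data values

uniform = np.full(len(pvals), 1.0 / len(pvals))  # standard Random Forest draw
probs = codata_sampling_probs(pvals)             # co-data moderated draw

candidates = draw_candidates(probs, mtry=2, rng=rng)
print("uniform probabilities:  ", np.round(uniform, 3))
print("moderated probabilities:", np.round(probs, 3))
print("candidate variables for one split:", candidates)
```

In this sketch, variables with stronger external evidence are offered as split candidates more often, while every variable keeps a nonzero chance of being drawn.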
