
Splitting random forest (SRF) for determining compact sets of genes that distinguish between cancer subtypes

BACKGROUND: The identification of very small subsets of predictive variables is an important topic that has not often been considered in the literature. In order to discover highly predictive yet compact gene set classifiers from whole genome expression data, a non-parametric, iterative algorithm, Sp...


Bibliographic Details
Main authors:	Guan, Xiaowei, Chance, Mark R, Barnholtz-Sloan, Jill S
Format:	Online Article Text
Language:	English
Published:	BioMed Central 2012
Subjects:	Methodology
Online access:	https://www.ncbi.nlm.nih.gov/pmc/articles/PMC3444418/ https://www.ncbi.nlm.nih.gov/pubmed/22616791 http://dx.doi.org/10.1186/2043-9113-2-13

_version_	1782243682082095104
author	Guan, Xiaowei Chance, Mark R Barnholtz-Sloan, Jill S
author_facet	Guan, Xiaowei Chance, Mark R Barnholtz-Sloan, Jill S
author_sort	Guan, Xiaowei
collection	PubMed
description	BACKGROUND: The identification of very small subsets of predictive variables is an important topic that has not often been considered in the literature. In order to discover highly predictive yet compact gene set classifiers from whole genome expression data, a non-parametric, iterative algorithm, Splitting Random Forest (SRF), was developed to robustly identify genes that distinguish between molecular subtypes. The goal is to improve the prediction accuracy while considering sparsity. RESULTS: The optimal SRF 50 run (SRF50) gene classifiers for glioblastoma (GB), breast (BC) and ovarian cancer (OC) subtypes had overall prediction rates comparable to those from published datasets upon validation (80.1%-91.7%). The SRF50 sets outperformed other methods by identifying compact gene sets needed for distinguishing between tested cancer subtypes (10–200 fold fewer genes than ANOVA or published gene sets). The SRF50 sets achieved superior and robust overall and subtype prediction accuracies when compared with single random forest (RF) and the Top 50 ANOVA results (80.1% vs 77.8% for GB; 84.0% vs 74.1% for BC; 89.8% vs 88.9% for OC in SRF50 vs single RF comparison; 80.1% vs 77.2% for GB; 84.0% vs 82.7% for BC; 89.8% vs 87.0% for OC in SRF50 vs Top 50 ANOVA comparison). There was significant overlap between SRF50 and published gene sets, showing that SRF identifies the relevant sub-sets of important gene lists. Through Ingenuity Pathway Analysis (IPA), the overlap in “hub” genes between the SRF50 and published gene sets were RB1, PIK3R1, PDGFBB and ERK1/2 for GB; ESR1, MYC, NFkB and ERK1/2 for BC; and Akt, FN1, NFkB, PDGFBB and ERK1/2 for OC. CONCLUSIONS: The SRF approach is an effective driver of biomarker discovery research that reduces the number of genes needed for robust classification, dissects complex, high dimensional “omic” data and provides novel insights into the cellular mechanisms that define cancer subtypes.
format	Online Article Text
id	pubmed-3444418
institution	National Center for Biotechnology Information
language	English
publishDate	2012
publisher	BioMed Central
record_format	MEDLINE/PubMed
spelling	pubmed-3444418 2012-09-20 Splitting random forest (SRF) for determining compact sets of genes that distinguish between cancer subtypes Guan, Xiaowei Chance, Mark R Barnholtz-Sloan, Jill S J Clin Bioinforma Methodology BACKGROUND: The identification of very small subsets of predictive variables is an important topic that has not often been considered in the literature. In order to discover highly predictive yet compact gene set classifiers from whole genome expression data, a non-parametric, iterative algorithm, Splitting Random Forest (SRF), was developed to robustly identify genes that distinguish between molecular subtypes. The goal is to improve the prediction accuracy while considering sparsity. RESULTS: The optimal SRF 50 run (SRF50) gene classifiers for glioblastoma (GB), breast (BC) and ovarian cancer (OC) subtypes had overall prediction rates comparable to those from published datasets upon validation (80.1%-91.7%). The SRF50 sets outperformed other methods by identifying compact gene sets needed for distinguishing between tested cancer subtypes (10–200 fold fewer genes than ANOVA or published gene sets). The SRF50 sets achieved superior and robust overall and subtype prediction accuracies when compared with single random forest (RF) and the Top 50 ANOVA results (80.1% vs 77.8% for GB; 84.0% vs 74.1% for BC; 89.8% vs 88.9% for OC in SRF50 vs single RF comparison; 80.1% vs 77.2% for GB; 84.0% vs 82.7% for BC; 89.8% vs 87.0% for OC in SRF50 vs Top 50 ANOVA comparison). There was significant overlap between SRF50 and published gene sets, showing that SRF identifies the relevant sub-sets of important gene lists. Through Ingenuity Pathway Analysis (IPA), the overlap in “hub” genes between the SRF50 and published gene sets were RB1, PIK3R1, PDGFBB and ERK1/2 for GB; ESR1, MYC, NFkB and ERK1/2 for BC; and Akt, FN1, NFkB, PDGFBB and ERK1/2 for OC.
CONCLUSIONS: The SRF approach is an effective driver of biomarker discovery research that reduces the number of genes needed for robust classification, dissects complex, high dimensional “omic” data and provides novel insights into the cellular mechanisms that define cancer subtypes. BioMed Central 2012-05-22 /pmc/articles/PMC3444418/ /pubmed/22616791 http://dx.doi.org/10.1186/2043-9113-2-13 Text en Copyright ©2012 Guan et al.; licensee BioMed Central Ltd. http://creativecommons.org/licenses/by/2.0 This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.
spellingShingle	Methodology Guan, Xiaowei Chance, Mark R Barnholtz-Sloan, Jill S Splitting random forest (SRF) for determining compact sets of genes that distinguish between cancer subtypes
title	Splitting random forest (SRF) for determining compact sets of genes that distinguish between cancer subtypes
title_full	Splitting random forest (SRF) for determining compact sets of genes that distinguish between cancer subtypes
title_fullStr	Splitting random forest (SRF) for determining compact sets of genes that distinguish between cancer subtypes
title_full_unstemmed	Splitting random forest (SRF) for determining compact sets of genes that distinguish between cancer subtypes
title_short	Splitting random forest (SRF) for determining compact sets of genes that distinguish between cancer subtypes
title_sort	splitting random forest (srf) for determining compact sets of genes that distinguish between cancer subtypes
topic	Methodology
url	https://www.ncbi.nlm.nih.gov/pmc/articles/PMC3444418/ https://www.ncbi.nlm.nih.gov/pubmed/22616791 http://dx.doi.org/10.1186/2043-9113-2-13
work_keys_str_mv	AT guanxiaowei splittingrandomforestsrffordeterminingcompactsetsofgenesthatdistinguishbetweencancersubtypes AT chancemarkr splittingrandomforestsrffordeterminingcompactsetsofgenesthatdistinguishbetweencancersubtypes AT barnholtzsloanjills splittingrandomforestsrffordeterminingcompactsetsofgenesthatdistinguishbetweencancersubtypes
