
Integrating biological knowledge and gene expression data using pathway-guided random forests: a benchmarking study

MOTIVATION: High-throughput technologies allow comprehensive characterization of individuals on many molecular levels. However, training computational models to predict disease status based on omics data is challenging. A promising solution is the integration of external knowledge about structural a...


Bibliographic Details
Main Authors: Seifert, Stephan; Gundlach, Sven; Junge, Olaf; Szymczak, Silke
Format: Online Article Text
Language: English
Published: Oxford University Press 2020
Subjects:
Online Access: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC7520048/
https://www.ncbi.nlm.nih.gov/pubmed/32399562
http://dx.doi.org/10.1093/bioinformatics/btaa483
author Seifert, Stephan
Gundlach, Sven
Junge, Olaf
Szymczak, Silke
collection PubMed
description MOTIVATION: High-throughput technologies allow comprehensive characterization of individuals on many molecular levels. However, training computational models to predict disease status based on omics data is challenging. A promising solution is the integration of external knowledge about structural and functional relationships into the modeling process. We compared four published random forest-based approaches using two simulation studies and nine experimental datasets. RESULTS: The self-sufficient prediction error approach should be applied when large numbers of relevant pathways are expected. The competing methods hunting and learner of functional enrichment should be used when low numbers of relevant pathways are expected or the most strongly associated pathways are of interest. The hybrid approach synthetic features is not recommended because of its high false discovery rate. AVAILABILITY AND IMPLEMENTATION: An R package providing functions for data analysis and simulation is available at GitHub (https://github.com/szymczak-lab/PathwayGuidedRF). An accompanying R data package (https://github.com/szymczak-lab/DataPathwayGuidedRF) stores the processed and quality controlled experimental datasets downloaded from Gene Expression Omnibus (GEO). SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.
format Online
Article
Text
id pubmed-7520048
institution National Center for Biotechnology Information
language English
publishDate 2020
publisher Oxford University Press
record_format MEDLINE/PubMed
spelling pubmed-7520048 2020-09-30. Journal: Bioinformatics (Original Papers). Oxford University Press, 2020-05-12. Text, en. © The Author(s) 2020. Published by Oxford University Press.
This is an Open Access article distributed under the terms of the Creative Commons Attribution Non-Commercial License (http://creativecommons.org/licenses/by-nc/4.0/), which permits non-commercial re-use, distribution, and reproduction in any medium, provided the original work is properly cited. For commercial re-use, please contact journals.permissions@oup.com
title Integrating biological knowledge and gene expression data using pathway-guided random forests: a benchmarking study
topic Original Papers