Data mining in the Life Sciences with Random Forest: a walk in the park or lost in the jungle?

In the Life Sciences ‘omics’ data is increasingly generated by different high-throughput technologies. Often only the integration of these data allows uncovering biological insights that can be experimentally validated or mechanistically modelled, i.e. sophisticated computational approaches are requ...

Bibliographic Details
Main Authors: Touw, Wouter G., Bayjanov, Jumamurat R., Overmars, Lex, Backus, Lennart, Boekhorst, Jos, Wels, Michiel, van Hijum, Sacha A. F. T.
Format: Online Article Text
Language: English
Published: Oxford University Press 2013
Subjects:
Online Access: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC3659301/
https://www.ncbi.nlm.nih.gov/pubmed/22786785
http://dx.doi.org/10.1093/bib/bbs034
collection PubMed
description In the Life Sciences ‘omics’ data is increasingly generated by different high-throughput technologies. Often only the integration of these data allows uncovering biological insights that can be experimentally validated or mechanistically modelled, i.e. sophisticated computational approaches are required to extract the complex non-linear trends present in omics data. Classification techniques allow training a model based on variables (e.g. SNPs in genetic association studies) to separate different classes (e.g. healthy subjects versus patients). Random Forest (RF) is a versatile classification algorithm suited for the analysis of these large data sets. In the Life Sciences, RF is popular because RF classification models have a high prediction accuracy and provide information on the importance of variables for classification. For omics data, variables or conditional relations between variables are typically important for a subset of samples of the same class. For example, within a class of cancer patients certain SNP combinations may be important for a subset of patients that have a specific subtype of cancer, but not important for a different subset of patients. These conditional relationships can in principle be uncovered from the data with RF, as they are implicitly taken into account by the algorithm during the creation of the classification model. This review details some RF properties that, to the best of our knowledge, are rarely or never used and that allow maximizing the biological insights that can be extracted from complex omics data sets.
format Online
Article
Text
id pubmed-3659301
institution National Center for Biotechnology Information
language English
publishDate 2013
publisher Oxford University Press
record_format MEDLINE/PubMed
spelling pubmed-3659301 2013-05-21 Brief Bioinform Papers Oxford University Press 2013-05 2012-07-10 /pmc/articles/PMC3659301/ /pubmed/22786785 http://dx.doi.org/10.1093/bib/bbs034 Text en © The Author 2012. Published by Oxford University Press. This is an Open Access article distributed under the terms of the Creative Commons Attribution Non-Commercial License (http://creativecommons.org/licenses/by-nc/3.0), which permits unrestricted non-commercial use, distribution, and reproduction in any medium, provided the original work is properly cited.
title Data mining in the Life Sciences with Random Forest: a walk in the park or lost in the jungle?
topic Papers