
Evaluation of tree-based statistical learning methods for constructing genetic risk scores

BACKGROUND: Genetic risk scores (GRS) summarize genetic features such as single nucleotide polymorphisms (SNPs) in a single statistic with respect to a given trait. So far, GRS are typically built using generalized linear models or regularized extensions. However, these linear methods are usually no...


Bibliographic Details
Main authors: Lau, Michael, Wigmann, Claudia, Kress, Sara, Schikowski, Tamara, Schwender, Holger
Format: Online Article Text
Language: English
Published: BioMed Central 2022
Subjects:
Online access: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC8935722/
https://www.ncbi.nlm.nih.gov/pubmed/35313824
http://dx.doi.org/10.1186/s12859-022-04634-w
_version_ 1784672088597987328
author Lau, Michael
Wigmann, Claudia
Kress, Sara
Schikowski, Tamara
Schwender, Holger
author_facet Lau, Michael
Wigmann, Claudia
Kress, Sara
Schikowski, Tamara
Schwender, Holger
author_sort Lau, Michael
collection PubMed
description BACKGROUND: Genetic risk scores (GRS) summarize genetic features such as single nucleotide polymorphisms (SNPs) in a single statistic with respect to a given trait. So far, GRS are typically built using generalized linear models or regularized extensions. However, these linear methods are usually not able to incorporate gene-gene interactions or non-linear SNP-response relationships. Tree-based statistical learning methods such as random forests and logic regression may be an alternative to such regularized-regression-based methods and are investigated in this article. Moreover, we consider modifications of random forests and logic regression for the construction of GRS. RESULTS: In an extensive simulation study and an application to a real data set from a German cohort study, we show that both tree-based approaches can outperform elastic net when constructing GRS for binary traits. In particular, a modification of logic regression called logic bagging could induce comparatively high predictive power as measured by the area under the curve and the statistical power. Even when considering no epistatic interaction effects but only marginal genetic effects, the regularized regression method led in most cases to inferior results. CONCLUSIONS: When constructing GRS, we recommend taking random forests and logic bagging into account, in particular if it can be assumed that possibly unknown epistasis between SNPs is present. To develop the best possible prediction models, extensive joint hyperparameter optimizations should be conducted. SUPPLEMENTARY INFORMATION: The online version contains supplementary material available at 10.1186/s12859-022-04634-w.
format Online
Article
Text
id pubmed-8935722
institution National Center for Biotechnology Information
language English
publishDate 2022
publisher BioMed Central
record_format MEDLINE/PubMed
spelling pubmed-8935722 2022-03-23 Evaluation of tree-based statistical learning methods for constructing genetic risk scores Lau, Michael Wigmann, Claudia Kress, Sara Schikowski, Tamara Schwender, Holger BMC Bioinformatics Research BACKGROUND: Genetic risk scores (GRS) summarize genetic features such as single nucleotide polymorphisms (SNPs) in a single statistic with respect to a given trait. So far, GRS are typically built using generalized linear models or regularized extensions. However, these linear methods are usually not able to incorporate gene-gene interactions or non-linear SNP-response relationships. Tree-based statistical learning methods such as random forests and logic regression may be an alternative to such regularized-regression-based methods and are investigated in this article. Moreover, we consider modifications of random forests and logic regression for the construction of GRS. RESULTS: In an extensive simulation study and an application to a real data set from a German cohort study, we show that both tree-based approaches can outperform elastic net when constructing GRS for binary traits. In particular, a modification of logic regression called logic bagging could induce comparatively high predictive power as measured by the area under the curve and the statistical power. Even when considering no epistatic interaction effects but only marginal genetic effects, the regularized regression method led in most cases to inferior results. CONCLUSIONS: When constructing GRS, we recommend taking random forests and logic bagging into account, in particular if it can be assumed that possibly unknown epistasis between SNPs is present. To develop the best possible prediction models, extensive joint hyperparameter optimizations should be conducted. SUPPLEMENTARY INFORMATION: The online version contains supplementary material available at 10.1186/s12859-022-04634-w.
BioMed Central 2022-03-21 /pmc/articles/PMC8935722/ /pubmed/35313824 http://dx.doi.org/10.1186/s12859-022-04634-w Text en © The Author(s) 2022 https://creativecommons.org/licenses/by/4.0/ Open Access: This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated in a credit line to the data.
spellingShingle Research
Lau, Michael
Wigmann, Claudia
Kress, Sara
Schikowski, Tamara
Schwender, Holger
Evaluation of tree-based statistical learning methods for constructing genetic risk scores
title Evaluation of tree-based statistical learning methods for constructing genetic risk scores
title_full Evaluation of tree-based statistical learning methods for constructing genetic risk scores
title_fullStr Evaluation of tree-based statistical learning methods for constructing genetic risk scores
title_full_unstemmed Evaluation of tree-based statistical learning methods for constructing genetic risk scores
title_short Evaluation of tree-based statistical learning methods for constructing genetic risk scores
title_sort evaluation of tree-based statistical learning methods for constructing genetic risk scores
topic Research
url https://www.ncbi.nlm.nih.gov/pmc/articles/PMC8935722/
https://www.ncbi.nlm.nih.gov/pubmed/35313824
http://dx.doi.org/10.1186/s12859-022-04634-w