Evaluation of tree-based statistical learning methods for constructing genetic risk scores
BACKGROUND: Genetic risk scores (GRS) summarize genetic features such as single nucleotide polymorphisms (SNPs) in a single statistic with respect to a given trait. So far, GRS are typically built using generalized linear models or regularized extensions. However, these linear methods are usually no...
Main Authors: | Lau, Michael; Wigmann, Claudia; Kress, Sara; Schikowski, Tamara; Schwender, Holger |
---|---|
Format: | Online Article Text |
Language: | English |
Published: | BioMed Central 2022 |
Subjects: | |
Online Access: | https://www.ncbi.nlm.nih.gov/pmc/articles/PMC8935722/ https://www.ncbi.nlm.nih.gov/pubmed/35313824 http://dx.doi.org/10.1186/s12859-022-04634-w |
_version_ | 1784672088597987328 |
---|---|
author | Lau, Michael; Wigmann, Claudia; Kress, Sara; Schikowski, Tamara; Schwender, Holger |
author_facet | Lau, Michael; Wigmann, Claudia; Kress, Sara; Schikowski, Tamara; Schwender, Holger |
author_sort | Lau, Michael |
collection | PubMed |
description | BACKGROUND: Genetic risk scores (GRS) summarize genetic features such as single nucleotide polymorphisms (SNPs) in a single statistic with respect to a given trait. So far, GRS are typically built using generalized linear models or regularized extensions. However, these linear methods are usually not able to incorporate gene-gene interactions or non-linear SNP-response relationships. Tree-based statistical learning methods such as random forests and logic regression may be an alternative to such regularized-regression-based methods and are investigated in this article. Moreover, we consider modifications of random forests and logic regression for the construction of GRS. RESULTS: In an extensive simulation study and an application to a real data set from a German cohort study, we show that both tree-based approaches can outperform elastic net when constructing GRS for binary traits. In particular, a modification of logic regression called logic bagging could induce comparatively high predictive power as measured by the area under the curve and the statistical power. Even when considering no epistatic interaction effects but only marginal genetic effects, the regularized regression method led in most cases to inferior results. CONCLUSIONS: When constructing GRS, we recommend taking random forests and logic bagging into account, in particular, if it can be assumed that possibly unknown epistasis between SNPs is present. To develop the best possible prediction models, extensive joint hyperparameter optimizations should be conducted. SUPPLEMENTARY INFORMATION: The online version contains supplementary material available at 10.1186/s12859-022-04634-w. |
format | Online Article Text |
id | pubmed-8935722 |
institution | National Center for Biotechnology Information |
language | English |
publishDate | 2022 |
publisher | BioMed Central |
record_format | MEDLINE/PubMed |
spelling | pubmed-8935722 2022-03-23 Evaluation of tree-based statistical learning methods for constructing genetic risk scores Lau, Michael Wigmann, Claudia Kress, Sara Schikowski, Tamara Schwender, Holger BMC Bioinformatics Research BACKGROUND: Genetic risk scores (GRS) summarize genetic features such as single nucleotide polymorphisms (SNPs) in a single statistic with respect to a given trait. So far, GRS are typically built using generalized linear models or regularized extensions. However, these linear methods are usually not able to incorporate gene-gene interactions or non-linear SNP-response relationships. Tree-based statistical learning methods such as random forests and logic regression may be an alternative to such regularized-regression-based methods and are investigated in this article. Moreover, we consider modifications of random forests and logic regression for the construction of GRS. RESULTS: In an extensive simulation study and an application to a real data set from a German cohort study, we show that both tree-based approaches can outperform elastic net when constructing GRS for binary traits. In particular, a modification of logic regression called logic bagging could induce comparatively high predictive power as measured by the area under the curve and the statistical power. Even when considering no epistatic interaction effects but only marginal genetic effects, the regularized regression method led in most cases to inferior results. CONCLUSIONS: When constructing GRS, we recommend taking random forests and logic bagging into account, in particular, if it can be assumed that possibly unknown epistasis between SNPs is present. To develop the best possible prediction models, extensive joint hyperparameter optimizations should be conducted. SUPPLEMENTARY INFORMATION: The online version contains supplementary material available at 10.1186/s12859-022-04634-w. BioMed Central 2022-03-21 /pmc/articles/PMC8935722/ /pubmed/35313824 http://dx.doi.org/10.1186/s12859-022-04634-w Text en © The Author(s) 2022 https://creativecommons.org/licenses/by/4.0/ Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated in a credit line to the data. |
spellingShingle | Research Lau, Michael Wigmann, Claudia Kress, Sara Schikowski, Tamara Schwender, Holger Evaluation of tree-based statistical learning methods for constructing genetic risk scores |
title | Evaluation of tree-based statistical learning methods for constructing genetic risk scores |
topic | Research |
url | https://www.ncbi.nlm.nih.gov/pmc/articles/PMC8935722/ https://www.ncbi.nlm.nih.gov/pubmed/35313824 http://dx.doi.org/10.1186/s12859-022-04634-w |