Lost in Translation: On the Problem of Data Coding in Penalized Whole Genome Regression with Interactions

Mixed models can be considered as a type of penalized regression and are everyday tools in statistical genetics. The standard mixed model for whole genome regression (WGR) is ridge regression best linear unbiased prediction (RRBLUP) which is based on an additive marker effect model. Many publication...

Bibliographic Details
Main Authors:	Martini, Johannes W. R.; Rosales, Francisco; Ha, Ngoc-Thuy; Heise, Johannes; Wimmer, Valentin; Kneib, Thomas
Format:	Online Article Text
Language:	English
Published:	Genetics Society of America 2019
Subjects:	Genomic Prediction
Online Access:	https://www.ncbi.nlm.nih.gov/pmc/articles/PMC6469405/ https://www.ncbi.nlm.nih.gov/pubmed/30760541 http://dx.doi.org/10.1534/g3.118.200961

_version_	1783411634978422784
author	Martini, Johannes W. R.; Rosales, Francisco; Ha, Ngoc-Thuy; Heise, Johannes; Wimmer, Valentin; Kneib, Thomas
author_facet	Martini, Johannes W. R.; Rosales, Francisco; Ha, Ngoc-Thuy; Heise, Johannes; Wimmer, Valentin; Kneib, Thomas
author_sort	Martini, Johannes W. R.
collection	PubMed
description	Mixed models can be considered a type of penalized regression and are everyday tools in statistical genetics. The standard mixed model for whole genome regression (WGR) is ridge regression best linear unbiased prediction (RRBLUP), which is based on an additive marker effect model. Many publications have extended the additive WGR approach by incorporating interactions between loci or between genes and environment. In this context of penalized regressions with interactions, it has been reported that translating the coding of single nucleotide polymorphisms (for instance, from -1,0,1 to 0,1,2) has an impact on the prediction of genetic values and interaction effects. In this work, we identify the reason for the relevance of variable coding in the general context of penalized polynomial regression. We show that in many cases, predictions of the genetic values are not invariant to translations of the variable coding, with an exception when only the sizes of the coefficients of monomials of highest total degree are penalized. The invariance of RRBLUP can be considered a special case of this setting, with a polynomial of total degree 1, penalizing additive effects (total degree 1) but not the fixed effect (total degree 0). The extended RRBLUP (eRRBLUP), which includes interactions, is not invariant to translations because it penalizes not only interactions (total degree 2) but also additive effects (total degree 1). This observation implies that translation invariance can be maintained in a pairwise epistatic WGR if only the interaction effects are penalized, but not the additive effects. In this regard, approaches of pre-selecting loci may not only reduce computation time but can also help to avoid the variable coding issue. To illustrate the practical relevance, we compare different regressions on a publicly available wheat data set. We show that for an eRRBLUP, the relevance of the marker coding for interaction effect estimates increases with the number of variables included in the model. A biological interpretation of estimated interaction effects may therefore become more difficult. Consequently, when comparing reproducing kernel Hilbert space (RKHS) approaches with WGR approaches that model effects explicitly, the supposed advantage of increased interpretability of the latter may not be real. Our theoretical results are generally valid for penalized regressions, for instance also for the least absolute shrinkage and selection operator (LASSO). Moreover, they apply to any type of interaction modeled by products of predictor variables in a penalized regression approach or by Hadamard products of covariance matrices in a mixed model.
format	Online Article Text
id	pubmed-6469405
institution	National Center for Biotechnology Information
language	English
publishDate	2019
publisher	Genetics Society of America
record_format	MEDLINE/PubMed
spelling	pubmed-6469405 2019-04-23 Lost in Translation: On the Problem of Data Coding in Penalized Whole Genome Regression with Interactions Martini, Johannes W. R. Rosales, Francisco Ha, Ngoc-Thuy Heise, Johannes Wimmer, Valentin Kneib, Thomas G3 (Bethesda) Genomic Prediction Genetics Society of America 2019-02-21 /pmc/articles/PMC6469405/ /pubmed/30760541 http://dx.doi.org/10.1534/g3.118.200961 Text en Copyright © 2019 Martini et al. http://creativecommons.org/licenses/by/4.0/ This is an open-access article distributed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.
spellingShingle	Genomic Prediction Martini, Johannes W. R. Rosales, Francisco Ha, Ngoc-Thuy Heise, Johannes Wimmer, Valentin Kneib, Thomas Lost in Translation: On the Problem of Data Coding in Penalized Whole Genome Regression with Interactions
title	Lost in Translation: On the Problem of Data Coding in Penalized Whole Genome Regression with Interactions
title_full	Lost in Translation: On the Problem of Data Coding in Penalized Whole Genome Regression with Interactions
title_fullStr	Lost in Translation: On the Problem of Data Coding in Penalized Whole Genome Regression with Interactions
title_full_unstemmed	Lost in Translation: On the Problem of Data Coding in Penalized Whole Genome Regression with Interactions
title_short	Lost in Translation: On the Problem of Data Coding in Penalized Whole Genome Regression with Interactions
title_sort	lost in translation: on the problem of data coding in penalized whole genome regression with interactions
topic	Genomic Prediction
url	https://www.ncbi.nlm.nih.gov/pmc/articles/PMC6469405/ https://www.ncbi.nlm.nih.gov/pubmed/30760541 http://dx.doi.org/10.1534/g3.118.200961
work_keys_str_mv	AT martinijohanneswr lostintranslationontheproblemofdatacodinginpenalizedwholegenomeregressionwithinteractions AT rosalesfrancisco lostintranslationontheproblemofdatacodinginpenalizedwholegenomeregressionwithinteractions AT hangocthuy lostintranslationontheproblemofdatacodinginpenalizedwholegenomeregressionwithinteractions AT heisejohannes lostintranslationontheproblemofdatacodinginpenalizedwholegenomeregressionwithinteractions AT wimmervalentin lostintranslationontheproblemofdatacodinginpenalizedwholegenomeregressionwithinteractions AT kneibthomas lostintranslationontheproblemofdatacodinginpenalizedwholegenomeregressionwithinteractions
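
The invariance claim in the description field can be checked numerically on a toy example. The following is a minimal sketch, not the authors' implementation. It assumes simulated genotypes rather than the wheat data set, a plain ridge regression with a fixed penalty `lam` standing in for RRBLUP and eRRBLUP, and interactions modeled as pairwise products of distinct markers. All names (`design`, `ridge_fitted`, `coding_gap`) are hypothetical.

```python
import itertools
import numpy as np

def design(M):
    """Columns: intercept, additive markers, pairwise products x_j * x_k (j < k)."""
    n, p = M.shape
    pairs = itertools.combinations(range(p), 2)
    inter = np.column_stack([M[:, j] * M[:, k] for j, k in pairs])
    return np.column_stack([np.ones(n), M, inter])

def ridge_fitted(X, y, penalized, lam):
    """Fitted values of a ridge regression that penalizes only the flagged columns."""
    D = np.diag(np.asarray(penalized, dtype=float))
    beta = np.linalg.solve(X.T @ X + lam * D, X.T @ y)
    return X @ beta

# Simulated data (an assumption for illustration only).
rng = np.random.default_rng(0)
n, p, lam = 60, 4, 10.0
q = p * (p - 1) // 2                      # number of pairwise interactions
M012 = rng.integers(0, 3, size=(n, p)).astype(float)   # coding 0,1,2
Mc = M012 - 1.0                                        # translated coding -1,0,1
y = (M012 @ rng.normal(size=p)
     + 0.5 * M012[:, 0] * M012[:, 1]
     + rng.normal(size=n))

def coding_gap(n_cols, penalized):
    """Largest absolute difference in fitted values between the two codings."""
    fits = [ridge_fitted(design(M)[:, :n_cols], y, penalized, lam)
            for M in (M012, Mc)]
    return float(np.max(np.abs(fits[0] - fits[1])))

# Additive model, intercept unpenalized (RRBLUP-like).
gap_additive = coding_gap(1 + p, [False] + [True] * p)
# Additive and interaction effects both penalized (eRRBLUP-like).
gap_all = coding_gap(1 + p + q, [False] + [True] * (p + q))
# Only the highest-degree terms (interactions) penalized.
gap_inter_only = coding_gap(1 + p + q, [False] * (1 + p) + [True] * q)

print("additive only:            ", gap_additive)
print("additive + interactions:  ", gap_all)
print("interactions penalized only:", gap_inter_only)
```

Under these assumptions, the first and third gaps should be zero up to floating-point error, while the second should not. This matches the statement that invariance holds when only the coefficients of highest total degree are penalized.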
