High dimensional model representation of log likelihood ratio: binary classification with SNP data

BACKGROUND: Developing binary classification rules based on SNP observations has been a major challenge for many modern bioinformatics applications, e.g., predicting risk of future disease events in complex conditions such as cancer. Small-sample, high-dimensional nature of SNP data, weak effect of...

Bibliographic Details
Main authors: pour, Ali Foroughi, Pietrzak, Maciej, Sucheston-Campbell, Lara E., Karaesmen, Ezgi, Dalton, Lori A., Rempała, Grzegorz A.
Format: Online Article Text
Language: English
Published: BioMed Central 2020
Subjects:
Online access: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC7504683/
https://www.ncbi.nlm.nih.gov/pubmed/32957998
http://dx.doi.org/10.1186/s12920-020-00774-1
author pour, Ali Foroughi
Pietrzak, Maciej
Sucheston-Campbell, Lara E.
Karaesmen, Ezgi
Dalton, Lori A.
Rempała, Grzegorz A.
collection PubMed
description BACKGROUND: Developing binary classification rules based on SNP observations has been a major challenge for many modern bioinformatics applications, e.g., predicting risk of future disease events in complex conditions such as cancer. The small-sample, high-dimensional nature of SNP data, the weak effect of each SNP on the outcome, and highly non-linear SNP interactions are several key factors complicating the analysis. Additionally, SNPs take a finite number of values, which may be best understood as ordinal or categorical variables but are treated as continuous ones by many algorithms.
METHODS: We use the theory of high dimensional model representation (HDMR) to build appropriate low dimensional glass-box models, allowing us to account for the effects of feature interactions. We compute the second order HDMR expansion of the log-likelihood ratio to account for the effects of single SNPs and their pairwise interactions. We propose a regression-based approach, called linear approximation for block second order HDMR expansion of categorical observations (LABS-HDMR-CO), to approximate the HDMR coefficients. We show how HDMR can be used to detect pairwise SNP interactions, and propose the fixed pattern test (FPT) to identify statistically significant pairwise interactions.
RESULTS: We apply LABS-HDMR-CO and FPT to synthetically generated HAPGEN2 data as well as to two GWAS cancer datasets. In these examples LABS-HDMR-CO enjoys superior accuracy compared with several algorithms used for SNP classification, while also taking pairwise interactions into account. FPT declares very few significant interactions in the small-sample GWAS datasets when bounding the false discovery rate (FDR) by 5%, due to the large number of tests performed. On the other hand, LABS-HDMR-CO utilizes a large number of SNP pairs to improve its prediction accuracy. In the larger HAPGEN2 dataset, FPT declares a larger portion of the SNP pairs used by LABS-HDMR-CO as significant.
CONCLUSION: LABS-HDMR-CO and FPT are interesting methods to design prediction rules and detect pairwise feature interactions for SNP data. Reliably detecting pairwise SNP interactions and taking advantage of potential interactions to improve prediction accuracy are two different objectives addressed by these methods. While the large number of potential SNP interactions may result in low power of detection, potentially interacting SNP pairs, of which many might be false alarms, can still be used to improve prediction accuracy.
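The second order expansion mentioned in the abstract can be illustrated on a toy case. The following is a minimal sketch, not the authors' LABS-HDMR-CO procedure: it assumes two SNPs coded as minor-allele counts 0, 1, 2, a uniform reference measure over the 3 x 3 genotype grid, and fully known class-conditional probabilities (the tables `P_CASE` and `P_CONTROL` are hypothetical numbers chosen for illustration). Under those assumptions the log-likelihood ratio splits exactly into a constant, two single-SNP terms, and one pairwise term; the function and variable names are illustrative only.

```python
from itertools import product
from math import log

GENOTYPES = (0, 1, 2)  # minor-allele counts of a single SNP (assumed coding)


def hdmr_second_order(g):
    """Split g[a][b] into constant, single-SNP, and pairwise terms.

    Assumes a uniform reference measure over the 3 x 3 genotype grid.
    """
    n = len(GENOTYPES)
    # Constant term: grand mean of g over all genotype pairs.
    f0 = sum(g[a][b] for a, b in product(GENOTYPES, repeat=2)) / (n * n)
    # First order terms: marginal means with the constant removed.
    f1 = [sum(g[a][b] for b in GENOTYPES) / n - f0 for a in GENOTYPES]
    f2 = [sum(g[a][b] for a in GENOTYPES) / n - f0 for b in GENOTYPES]
    # Second order term: what the additive part fails to explain.
    f12 = [[g[a][b] - f0 - f1[a] - f2[b] for b in GENOTYPES] for a in GENOTYPES]
    return f0, f1, f2, f12


def log_likelihood_ratio(p1, p0):
    """Log ratio of two class-conditional probability tables."""
    return [[log(p1[a][b] / p0[a][b]) for b in GENOTYPES] for a in GENOTYPES]


# Hypothetical class-conditional genotype probabilities (each table sums to 1).
P_CONTROL = [[1 / 9] * 3 for _ in GENOTYPES]
P_CASE = [
    [0.16, 0.10, 0.06],
    [0.10, 0.12, 0.10],
    [0.06, 0.10, 0.20],
]

llr = log_likelihood_ratio(P_CASE, P_CONTROL)
f0, f1, f2, f12 = hdmr_second_order(llr)
interaction_strength = max(abs(v) for row in f12 for v in row)
print(f"f0 = {f0:.4f}, max |f12| = {interaction_strength:.4f}")
```

A pairwise term that is zero everywhere means the log-likelihood ratio is additive in the two SNPs, so a nonzero `f12` is the quantity an interaction test would target. In practice the probabilities are unknown and must be estimated from small samples, which is the problem the regression approach and the FPT described in the abstract address.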
format Online
Article
Text
id pubmed-7504683
institution National Center for Biotechnology Information
language English
publishDate 2020
publisher BioMed Central
record_format MEDLINE/PubMed
spelling pubmed-7504683 2020-09-23 High dimensional model representation of log likelihood ratio: binary classification with SNP data BMC Med Genomics Research BioMed Central 2020-09-21 /pmc/articles/PMC7504683/ /pubmed/32957998 http://dx.doi.org/10.1186/s12920-020-00774-1 Text en © The Author(s) 2020 Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article’s Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated in a credit line to the data.
title High dimensional model representation of log likelihood ratio: binary classification with SNP data
topic Research