High dimensional model representation of log-likelihood ratio: binary classification with expression data

Bibliographic Details
Main Authors: Foroughi pour, Ali; Pietrzak, Maciej; Dalton, Lori A; Rempała, Grzegorz A.
Format: Online Article Text
Language: English
Published: BioMed Central 2020
Online Access: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC7183128/
https://www.ncbi.nlm.nih.gov/pubmed/32334509
http://dx.doi.org/10.1186/s12859-020-3486-x
collection PubMed
description BACKGROUND: Binary classification rules based on small samples of high-dimensional data (for instance, gene expression data) are ubiquitous in modern bioinformatics. Constructing such classifiers is challenging due to (a) the complex nature of the underlying biological traits, such as gene interactions, and (b) the need for highly interpretable glass-box models. We use the theory of high dimensional model representation (HDMR) to build interpretable low-dimensional approximations of the log-likelihood ratio that account for the effect of each individual gene as well as gene-gene interactions. We propose two algorithms that approximate the second-order HDMR expansion, and a hypothesis test based on the HDMR formulation that identifies significantly dysregulated pairwise interactions. The theory is flexible and requires only a mild set of assumptions.
RESULTS: We apply our approach to gene expression data from both synthetic and real (breast and lung cancer) datasets and compare it against several popular state-of-the-art methods. The analyses suggest that the proposed algorithms can be used to obtain interpretable prediction rules with high prediction accuracy and to extract significantly dysregulated gene-gene interactions from the data. The algorithms also compare favorably against their competitors across multiple synthetic data scenarios.
CONCLUSION: The proposed HDMR-based approach appears to produce a reliable classifier that additionally allows one to describe how individual genes or gene-gene interactions affect classification decisions. Both real and synthetic data analyses suggest that our methods can be used to identify gene networks with dysregulated pairwise interactions, and are therefore appropriate for differential network analysis.
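The idea of writing a log-likelihood ratio as a constant plus single-gene terms plus pairwise terms can be illustrated with a toy case. The sketch below is not the authors' algorithm: it assumes two genes with bivariate Gaussian class-conditional densities, for which the log-likelihood ratio is exactly quadratic and so splits into terms of order at most two. The function names, parameter values, and the labels `f0`, `f1`, `f2`, `f12` are hypothetical, and the split does not enforce the orthogonality constraints of a formal HDMR.

```python
import math

def gaussian_terms(x, mean, sd, rho):
    """Split a bivariate Gaussian log-density into constant,
    single-gene, and pairwise parts (illustrative only)."""
    z = [(x[i] - mean[i]) / sd[i] for i in range(2)]  # standardized expression
    k = 1.0 - rho ** 2
    const = -math.log(2.0 * math.pi * sd[0] * sd[1] * math.sqrt(k))
    single = [-z[i] ** 2 / (2.0 * k) for i in range(2)]
    pair = rho * z[0] * z[1] / k
    return const, single, pair

def llr_terms(x, class1, class0):
    """Term-by-term difference log p1(x) - log p0(x)."""
    c1, s1, p1 = gaussian_terms(x, **class1)
    c0, s0, p0 = gaussian_terms(x, **class0)
    return {
        "f0": c1 - c0,          # constant term
        "f1": s1[0] - s0[0],    # effect of gene 1 alone
        "f2": s1[1] - s0[1],    # effect of gene 2 alone
        "f12": p1 - p0,         # gene-gene interaction term
    }

# Hypothetical parameters: the classes share means and variances
# and differ only in the sign of the gene-gene correlation.
class0 = {"mean": (0.0, 0.0), "sd": (1.0, 1.0), "rho": 0.8}
class1 = {"mean": (0.0, 0.0), "sd": (1.0, 1.0), "rho": -0.8}

terms = llr_terms((1.0, -1.0), class1, class0)
llr = sum(terms.values())
label = 1 if llr > 0 else 0  # positive log-likelihood ratio favors class 1
```

In this toy setting the constant and single-gene terms cancel, so the decision rests entirely on `f12`. That is the kind of case where reading the pairwise term separately is informative.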
id pubmed-7183128
institution National Center for Biotechnology Information
record_format MEDLINE/PubMed
spelling pubmed-7183128 2020-04-28
Journal: BMC Bioinformatics (Methodology Article)
Published: BioMed Central, 2020-04-25
Text, en
License: © The Author(s) 2020. Open Access. This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article’s Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated in a credit line to the data.
topic Methodology Article