High dimensional model representation of log-likelihood ratio: binary classification with expression data
BACKGROUND: Binary classification rules based on a small-sample of high-dimensional data (for instance, gene expression data) are ubiquitous in modern bioinformatics. Constructing such classifiers is challenging due to (a) the complex nature of underlying biological traits, such as gene interactions...
Main authors: | Foroughi pour, Ali; Pietrzak, Maciej; Dalton, Lori A; Rempała, Grzegorz A. |
---|---|
Format: | Online Article Text |
Language: | English |
Published: | BioMed Central 2020 |
Subjects: | Methodology Article |
Online access: | https://www.ncbi.nlm.nih.gov/pmc/articles/PMC7183128/ https://www.ncbi.nlm.nih.gov/pubmed/32334509 http://dx.doi.org/10.1186/s12859-020-3486-x |
_version_ | 1783526371517005824 |
---|---|
author | Foroughi pour, Ali Pietrzak, Maciej Dalton, Lori A Rempała, Grzegorz A. |
author_sort | Foroughi pour, Ali |
collection | PubMed |
description | BACKGROUND: Binary classification rules based on a small-sample of high-dimensional data (for instance, gene expression data) are ubiquitous in modern bioinformatics. Constructing such classifiers is challenging due to (a) the complex nature of underlying biological traits, such as gene interactions, and (b) the need for highly interpretable glass-box models. We use the theory of high dimensional model representation (HDMR) to build interpretable low dimensional approximations of the log-likelihood ratio accounting for the effects of each individual gene as well as gene-gene interactions. We propose two algorithms approximating the second order HDMR expansion, and a hypothesis test based on the HDMR formulation to identify significantly dysregulated pairwise interactions. The theory is seen as flexible and requiring only a mild set of assumptions. RESULTS: We apply our approach to gene expression data from both synthetic and real (breast and lung cancer) datasets comparing it also against several popular state-of-the-art methods. The analyses suggest the proposed algorithms can be used to obtain interpretable prediction rules with high prediction accuracies and to successfully extract significantly dysregulated gene-gene interactions from the data. They also compare favorably against their competitors across multiple synthetic data scenarios. CONCLUSION: The proposed HDMR-based approach appears to produce a reliable classifier that additionally allows one to describe how individual genes or gene-gene interactions affect classification decisions. Both real and synthetic data analyses suggest that our methods can be used to identify gene networks with dysregulated pairwise interactions, and are therefore appropriate for differential networks analysis. |
format | Online Article Text |
id | pubmed-7183128 |
institution | National Center for Biotechnology Information |
language | English |
publishDate | 2020 |
publisher | BioMed Central |
record_format | MEDLINE/PubMed |
spelling | pubmed-7183128 2020-04-28 High dimensional model representation of log-likelihood ratio: binary classification with expression data Foroughi pour, Ali Pietrzak, Maciej Dalton, Lori A Rempała, Grzegorz A. BMC Bioinformatics Methodology Article BioMed Central 2020-04-25 /pmc/articles/PMC7183128/ /pubmed/32334509 http://dx.doi.org/10.1186/s12859-020-3486-x Text en © The Author(s) 2020 Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article’s Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated in a credit line to the data. |
spellingShingle | Methodology Article Foroughi pour, Ali Pietrzak, Maciej Dalton, Lori A Rempała, Grzegorz A. High dimensional model representation of log-likelihood ratio: binary classification with expression data |
title | High dimensional model representation of log-likelihood ratio: binary classification with expression data |
title_sort | high dimensional model representation of log-likelihood ratio: binary classification with expression data |
topic | Methodology Article |
url | https://www.ncbi.nlm.nih.gov/pmc/articles/PMC7183128/ https://www.ncbi.nlm.nih.gov/pubmed/32334509 http://dx.doi.org/10.1186/s12859-020-3486-x |