Two-way learning with one-way supervision for gene expression data

BACKGROUND: A family of parsimonious Gaussian mixture models for the biclustering of gene expression data is introduced. Biclustering is accommodated by adopting a mixture of factor analyzers model with a binary, row-stochastic factor loadings matrix. This particular form of factor loadings matrix r...
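
The abstract's link between a binary, row-stochastic factor loadings matrix and a block-diagonal covariance matrix can be illustrated with a small numerical sketch. It assumes the standard factor-analysis covariance form (loadings times their transpose, plus a diagonal noise matrix), which may differ from the exact parameterization used in the article. The gene-to-factor assignment, the dimensions, and the noise variances are all hypothetical values chosen for illustration.

```python
import numpy as np

# Hypothetical setup: 6 genes (rows) and 3 latent factors (columns).
# membership[i] is the single factor assigned to gene i (assumed values).
membership = np.array([0, 0, 1, 1, 1, 2])

# Binary, row-stochastic loadings: each row holds exactly one 1.
loadings = np.eye(3, dtype=int)[membership]  # shape (6, 3)

# Assumed covariance form: Lambda Lambda^T + Psi, with Psi diagonal.
psi = np.diag(np.full(6, 0.5))  # assumed noise variances
covariance = loadings @ loadings.T + psi

# An entry is nonzero only when two genes share the same factor,
# so the covariance is block-diagonal when genes are ordered by factor.
same_block = membership[:, None] == membership[None, :]
is_block_diagonal = bool(np.all((covariance != 0) == same_block))

print(covariance)
print(is_block_diagonal)
```

Genes assigned to the same factor form one block, and every entry linking genes from different factors is zero, which is the structure the abstract describes as useful for grouping genes.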

Bibliographic Details
Main Authors:	Wong, Monica H. T.; Mutch, David M.; McNicholas, Paul D.
Format:	Online Article Text
Language:	English
Published:	BioMed Central 2017
Subjects:	Methodology Article
Online Access:	https://www.ncbi.nlm.nih.gov/pmc/articles/PMC5336648/ https://www.ncbi.nlm.nih.gov/pubmed/28257645 http://dx.doi.org/10.1186/s12859-017-1564-5

_version_	1782512230387941376
author	Wong, Monica H. T.; Mutch, David M.; McNicholas, Paul D.
author_facet	Wong, Monica H. T.; Mutch, David M.; McNicholas, Paul D.
author_sort	Wong, Monica H. T.
collection	PubMed
description	BACKGROUND: A family of parsimonious Gaussian mixture models for the biclustering of gene expression data is introduced. Biclustering is accommodated by adopting a mixture of factor analyzers model with a binary, row-stochastic factor loadings matrix. This particular form of factor loadings matrix results in a block-diagonal covariance matrix, which is a useful property in gene expression analyses, specifically in biomarker discovery scenarios where blood can potentially act as a surrogate tissue for other less accessible tissues. Prior knowledge of the factor loadings matrix is useful in this application and is reflected in the one-way supervised nature of the algorithm. Additionally, the factor loadings matrix can be assumed to be constant across all components because of the relationship desired between the various types of tissue samples. Parameter estimates are obtained through a variant of the expectation-maximization algorithm and the best-fitting model is selected using the Bayesian information criterion. The family of models is demonstrated using simulated data and two real microarray data sets. The first real data set is from a rat study that investigated the influence of diabetes on gene expression in different tissues. The second real data set is from a human transcriptomics study that focused on blood and immune tissues. The microarray data sets illustrate the biclustering family’s performance in biomarker discovery involving peripheral blood as surrogate biopsy material. RESULTS: The simulation studies indicate that the algorithm identifies the correct biclusters, most optimally when the number of observation clusters is known. Moreover, the biclustering algorithm identified biclusters comprised of biologically meaningful data related to insulin resistance and immune function in the rat and human real data sets, respectively. 
CONCLUSIONS: Initial results using real data show that this biclustering technique provides a novel approach for biomarker discovery by enabling blood to be used as a surrogate for hard-to-obtain tissues. ELECTRONIC SUPPLEMENTARY MATERIAL: The online version of this article (doi:10.1186/s12859-017-1564-5) contains supplementary material, which is available to authorized users.
format	Online Article Text
id	pubmed-5336648
institution	National Center for Biotechnology Information
language	English
publishDate	2017
publisher	BioMed Central
record_format	MEDLINE/PubMed
spelling	pubmed-53366482017-03-07 Two-way learning with one-way supervision for gene expression data Wong, Monica H. T. Mutch, David M. McNicholas, Paul D. BMC Bioinformatics Methodology Article BACKGROUND: A family of parsimonious Gaussian mixture models for the biclustering of gene expression data is introduced. Biclustering is accommodated by adopting a mixture of factor analyzers model with a binary, row-stochastic factor loadings matrix. This particular form of factor loadings matrix results in a block-diagonal covariance matrix, which is a useful property in gene expression analyses, specifically in biomarker discovery scenarios where blood can potentially act as a surrogate tissue for other less accessible tissues. Prior knowledge of the factor loadings matrix is useful in this application and is reflected in the one-way supervised nature of the algorithm. Additionally, the factor loadings matrix can be assumed to be constant across all components because of the relationship desired between the various types of tissue samples. Parameter estimates are obtained through a variant of the expectation-maximization algorithm and the best-fitting model is selected using the Bayesian information criterion. The family of models is demonstrated using simulated data and two real microarray data sets. The first real data set is from a rat study that investigated the influence of diabetes on gene expression in different tissues. The second real data set is from a human transcriptomics study that focused on blood and immune tissues. The microarray data sets illustrate the biclustering family’s performance in biomarker discovery involving peripheral blood as surrogate biopsy material. RESULTS: The simulation studies indicate that the algorithm identifies the correct biclusters, most optimally when the number of observation clusters is known. 
Moreover, the biclustering algorithm identified biclusters comprised of biologically meaningful data related to insulin resistance and immune function in the rat and human real data sets, respectively. CONCLUSIONS: Initial results using real data show that this biclustering technique provides a novel approach for biomarker discovery by enabling blood to be used as a surrogate for hard-to-obtain tissues. ELECTRONIC SUPPLEMENTARY MATERIAL: The online version of this article (doi:10.1186/s12859-017-1564-5) contains supplementary material, which is available to authorized users. BioMed Central 2017-03-04 /pmc/articles/PMC5336648/ /pubmed/28257645 http://dx.doi.org/10.1186/s12859-017-1564-5 Text en © The Author(s) 2017 Open Access This article is distributed under the terms of the Creative Commons Attribution 4.0 International License(http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made. The Creative Commons Public Domain Dedication waiver(http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated.
spellingShingle	Methodology Article Wong, Monica H. T. Mutch, David M. McNicholas, Paul D. Two-way learning with one-way supervision for gene expression data
title	Two-way learning with one-way supervision for gene expression data
title_full	Two-way learning with one-way supervision for gene expression data
title_fullStr	Two-way learning with one-way supervision for gene expression data
title_full_unstemmed	Two-way learning with one-way supervision for gene expression data
title_short	Two-way learning with one-way supervision for gene expression data
title_sort	two-way learning with one-way supervision for gene expression data
topic	Methodology Article
url	https://www.ncbi.nlm.nih.gov/pmc/articles/PMC5336648/ https://www.ncbi.nlm.nih.gov/pubmed/28257645 http://dx.doi.org/10.1186/s12859-017-1564-5
work_keys_str_mv	AT wongmonicaht twowaylearningwithonewaysupervisionforgeneexpressiondata AT mutchdavidm twowaylearningwithonewaysupervisionforgeneexpressiondata AT mcnicholaspauld twowaylearningwithonewaysupervisionforgeneexpressiondata
