Variational infinite heterogeneous mixture model for semi-supervised clustering of heart enhancers

Bibliographic Details
Main authors: Mehdi, Tahmid F; Singh, Gurdeep; Mitchell, Jennifer A; Moses, Alan M
Format: Online Article Text
Language: English
Published: Oxford University Press 2019
Subjects: Original Papers
Online access: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC6748727/
https://www.ncbi.nlm.nih.gov/pubmed/30753279
http://dx.doi.org/10.1093/bioinformatics/btz064
author Mehdi, Tahmid F
Singh, Gurdeep
Mitchell, Jennifer A
Moses, Alan M
collection PubMed
description MOTIVATION: Mammalian genomes can contain thousands of enhancers but only a subset are actively driving gene expression in a given cellular context. Integrated genomic datasets can be harnessed to predict active enhancers. One challenge in integration of large genomic datasets is the increasing heterogeneity: continuous, binary and discrete features may all be relevant. Coupled with the typically small numbers of training examples, semi-supervised approaches for heterogeneous data are needed; however, current enhancer prediction methods are not designed to handle heterogeneous data in the semi-supervised paradigm. RESULTS: We implemented a Dirichlet Process Heterogeneous Mixture model that infers Gaussian, Bernoulli and Poisson distributions over features. We derived a novel variational inference algorithm to handle semi-supervised learning tasks where certain observations are forced to cluster together. We applied this model to enhancer candidates in mouse heart tissues based on heterogeneous features. We constrained a small number of known active enhancers to appear in the same cluster, and 47 additional regions clustered with them. Many of these are located near heart-specific genes. The model also predicted 1176 active promoters, suggesting that it can discover new enhancers and promoters. AVAILABILITY AND IMPLEMENTATION: We created the ‘dphmix’ Python package: https://pypi.org/project/dphmix/. SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.
format Online
Article
Text
id pubmed-6748727
institution National Center for Biotechnology Information
language English
publishDate 2019
publisher Oxford University Press
record_format MEDLINE/PubMed
spelling pubmed-6748727 2019-09-23 Bioinformatics Original Papers Oxford University Press 2019-09-15 2019-02-07 /pmc/articles/PMC6748727/ /pubmed/30753279 http://dx.doi.org/10.1093/bioinformatics/btz064 Text en © The Author(s) 2019. Published by Oxford University Press.
http://creativecommons.org/licenses/by-nc/4.0/ This is an Open Access article distributed under the terms of the Creative Commons Attribution Non-Commercial License (http://creativecommons.org/licenses/by-nc/4.0/), which permits non-commercial re-use, distribution, and reproduction in any medium, provided the original work is properly cited. For commercial re-use, please contact journals.permissions@oup.com
title Variational infinite heterogeneous mixture model for semi-supervised clustering of heart enhancers
topic Original Papers
url https://www.ncbi.nlm.nih.gov/pmc/articles/PMC6748727/
https://www.ncbi.nlm.nih.gov/pubmed/30753279
http://dx.doi.org/10.1093/bioinformatics/btz064