Variational infinite heterogeneous mixture model for semi-supervised clustering of heart enhancers

Bibliographic Details
Main Authors: Mehdi, Tahmid F; Singh, Gurdeep; Mitchell, Jennifer A; Moses, Alan M
Format: Online Article Text
Language: English
Published: Oxford University Press 2019
Online Access: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC6748727/
https://www.ncbi.nlm.nih.gov/pubmed/30753279
http://dx.doi.org/10.1093/bioinformatics/btz064
Description
Summary: MOTIVATION: Mammalian genomes can contain thousands of enhancers but only a subset are actively driving gene expression in a given cellular context. Integrated genomic datasets can be harnessed to predict active enhancers. One challenge in integration of large genomic datasets is the increasing heterogeneity: continuous, binary and discrete features may all be relevant. Coupled with the typically small numbers of training examples, semi-supervised approaches for heterogeneous data are needed; however, current enhancer prediction methods are not designed to handle heterogeneous data in the semi-supervised paradigm. RESULTS: We implemented a Dirichlet Process Heterogeneous Mixture model that infers Gaussian, Bernoulli and Poisson distributions over features. We derived a novel variational inference algorithm to handle semi-supervised learning tasks where certain observations are forced to cluster together. We applied this model to enhancer candidates in mouse heart tissues based on heterogeneous features. We constrained a small number of known active enhancers to appear in the same cluster, and 47 additional regions clustered with them. Many of these are located near heart-specific genes. The model also predicted 1176 active promoters, suggesting that it can discover new enhancers and promoters. AVAILABILITY AND IMPLEMENTATION: We created the ‘dphmix’ Python package: https://pypi.org/project/dphmix/. SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.
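
As an illustration of the idea in the summary, the sketch below scores an observation with continuous, binary and count features under per-cluster Gaussian, Bernoulli and Poisson distributions, and gives a group of observations that must cluster together one shared set of cluster responsibilities. It is a minimal sketch under stated assumptions, not the dphmix API or the authors' variational algorithm: it uses a finite mixture with fixed weights and fixed parameters, assumes features are independent within a cluster, and every name and number in it (`cluster_loglik`, `shared_responsibilities`, the example values) is hypothetical.

```python
import math

def gaussian_logpdf(x, mean, var):
    # Log density of a univariate Gaussian.
    return -0.5 * (math.log(2.0 * math.pi * var) + (x - mean) ** 2 / var)

def bernoulli_logpmf(x, p):
    # Log probability of a binary feature.
    return math.log(p) if x == 1 else math.log(1.0 - p)

def poisson_logpmf(k, rate):
    # Log probability of a count feature.
    return k * math.log(rate) - rate - math.lgamma(k + 1)

def cluster_loglik(obs, cluster):
    # Assumption: features are independent given the cluster,
    # so the log-likelihood is a sum over feature types.
    total = 0.0
    for x, (mean, var) in zip(obs["gaussian"], cluster["gaussian"]):
        total += gaussian_logpdf(x, mean, var)
    for x, p in zip(obs["bernoulli"], cluster["bernoulli"]):
        total += bernoulli_logpmf(x, p)
    for k, rate in zip(obs["poisson"], cluster["poisson"]):
        total += poisson_logpmf(k, rate)
    return total

def shared_responsibilities(group, clusters, weights):
    # Assumption: observations forced to cluster together share one
    # assignment, so their log-likelihoods are pooled per cluster.
    scores = [
        math.log(w) + sum(cluster_loglik(o, c) for o in group)
        for c, w in zip(clusters, weights)
    ]
    top = max(scores)  # subtract the max for numerical stability
    unnorm = [math.exp(s - top) for s in scores]
    z = sum(unnorm)
    return [u / z for u in unnorm]

# Hypothetical parameters for two clusters (illustrative values only).
clusters = [
    {"gaussian": [(2.0, 1.0)], "bernoulli": [0.9], "poisson": [8.0]},
    {"gaussian": [(0.0, 1.0)], "bernoulli": [0.1], "poisson": [1.0]},
]
weights = [0.5, 0.5]

# Two hypothetical observations constrained to the same cluster.
known_active = [
    {"gaussian": [1.8], "bernoulli": [1], "poisson": [7]},
    {"gaussian": [2.3], "bernoulli": [1], "poisson": [9]},
]
resp = shared_responsibilities(known_active, clusters, weights)
```

An unconstrained observation can be scored the same way by passing it as a group of one. A Dirichlet process version would replace the fixed `weights` with stick-breaking weights and update the cluster parameters during inference, which this sketch does not attempt.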