Increased peak detection accuracy in over-dispersed ChIP-seq data with supervised segmentation models

BACKGROUND: Histone modification constitutes a basic mechanism for the genetic regulation of gene expression. In the early 2000s, a powerful technique emerged that couples chromatin immunoprecipitation with high-throughput sequencing (ChIP-seq). This technique provides a direct survey of the DNA reg...

Bibliographic Details
Main authors:	Liehrmann, Arnaud; Rigaill, Guillem; Hocking, Toby Dylan
Format:	Online Article Text
Language:	English
Published:	BioMed Central 2021
Subjects:	Research
Online access:	https://www.ncbi.nlm.nih.gov/pmc/articles/PMC8201703/ https://www.ncbi.nlm.nih.gov/pubmed/34126932 http://dx.doi.org/10.1186/s12859-021-04221-5

_version_	1783707855076982784
author	Liehrmann, Arnaud; Rigaill, Guillem; Hocking, Toby Dylan
author_facet	Liehrmann, Arnaud; Rigaill, Guillem; Hocking, Toby Dylan
author_sort	Liehrmann, Arnaud
collection	PubMed
description	BACKGROUND: Histone modification constitutes a basic mechanism for the genetic regulation of gene expression. In the early 2000s, a powerful technique emerged that couples chromatin immunoprecipitation with high-throughput sequencing (ChIP-seq). This technique provides a direct survey of the DNA regions associated with these modifications. In order to realize the full potential of this technique, increasingly sophisticated statistical algorithms have been developed or adapted to analyze the massive amount of data it generates. Many of these algorithms were built around natural assumptions such as the Poisson distribution to model the noise in the count data. In this work we start from these natural assumptions and show that it is possible to improve upon them. RESULTS: Our comparisons on seven reference datasets of histone modifications (H3K36me3 & H3K4me3) suggest that natural assumptions are not always realistic under application conditions. We show that the unconstrained multiple changepoint detection model with alternative noise assumptions and supervised learning of the penalty parameter reduces the over-dispersion exhibited by count data. These models, implemented in the R package CROCS (https://github.com/aLiehrmann/CROCS), detect the peaks more accurately than algorithms that rely on natural assumptions. CONCLUSION: The segmentation models we propose can benefit researchers in the field of epigenetics by providing new high-quality peak prediction tracks for H3K36me3 and H3K4me3 histone modifications. SUPPLEMENTARY INFORMATION: The online version contains supplementary material available at 10.1186/s12859-021-04221-5.
format	Online Article Text
id	pubmed-8201703
institution	National Center for Biotechnology Information
language	English
publishDate	2021
publisher	BioMed Central
record_format	MEDLINE/PubMed
spelling	pubmed-8201703 2021-06-15 Increased peak detection accuracy in over-dispersed ChIP-seq data with supervised segmentation models Liehrmann, Arnaud Rigaill, Guillem Hocking, Toby Dylan BMC Bioinformatics Research
BioMed Central 2021-06-14 /pmc/articles/PMC8201703/ /pubmed/34126932 http://dx.doi.org/10.1186/s12859-021-04221-5 Text en © The Author(s) 2021 https://creativecommons.org/licenses/by/4.0/ Open Access: This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated in a credit line to the data.
spellingShingle	Research Liehrmann, Arnaud Rigaill, Guillem Hocking, Toby Dylan Increased peak detection accuracy in over-dispersed ChIP-seq data with supervised segmentation models
title	Increased peak detection accuracy in over-dispersed ChIP-seq data with supervised segmentation models
title_full	Increased peak detection accuracy in over-dispersed ChIP-seq data with supervised segmentation models
title_fullStr	Increased peak detection accuracy in over-dispersed ChIP-seq data with supervised segmentation models
title_full_unstemmed	Increased peak detection accuracy in over-dispersed ChIP-seq data with supervised segmentation models
title_short	Increased peak detection accuracy in over-dispersed ChIP-seq data with supervised segmentation models
title_sort	increased peak detection accuracy in over-dispersed chip-seq data with supervised segmentation models
topic	Research
url	https://www.ncbi.nlm.nih.gov/pmc/articles/PMC8201703/ https://www.ncbi.nlm.nih.gov/pubmed/34126932 http://dx.doi.org/10.1186/s12859-021-04221-5
work_keys_str_mv	AT liehrmannarnaud increasedpeakdetectionaccuracyinoverdispersedchipseqdatawithsupervisedsegmentationmodels AT rigaillguillem increasedpeakdetectionaccuracyinoverdispersedchipseqdatawithsupervisedsegmentationmodels AT hockingtobydylan increasedpeakdetectionaccuracyinoverdispersedchipseqdatawithsupervisedsegmentationmodels

Similar items