Identification of factors associated with duplicate rate in ChIP-seq data

Chromatin immunoprecipitation and sequencing (ChIP-seq) has been widely used to map DNA-binding proteins, histone proteins and their modifications. ChIP-seq data contains redundant reads termed duplicates, referring to those mapping to the same genomic location and strand. There are two main sources...

Bibliographic Details
Main Authors: Tian, Shulan, Peng, Shuxia, Kalmbach, Michael, Gaonkar, Krutika S., Bhagwate, Aditya, Ding, Wei, Eckel-Passow, Jeanette, Yan, Huihuang, Slager, Susan L.
Format: Online Article Text
Language: English
Published: Public Library of Science 2019
Subjects:
Online Access: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC6447195/
https://www.ncbi.nlm.nih.gov/pubmed/30943272
http://dx.doi.org/10.1371/journal.pone.0214723
author Tian, Shulan
Peng, Shuxia
Kalmbach, Michael
Gaonkar, Krutika S.
Bhagwate, Aditya
Ding, Wei
Eckel-Passow, Jeanette
Yan, Huihuang
Slager, Susan L.
collection PubMed
description Chromatin immunoprecipitation and sequencing (ChIP-seq) has been widely used to map DNA-binding proteins, histone proteins and their modifications. ChIP-seq data contains redundant reads termed duplicates, referring to those mapping to the same genomic location and strand. There are two main sources of duplicates: polymerase chain reaction (PCR) duplicates and natural duplicates. Unlike natural duplicates that represent true signals from sequencing of independent DNA templates, PCR duplicates are artifacts originating from sequencing of identical copies amplified from the same DNA template. In analysis, duplicates are removed from peak calling and signal quantification. Nevertheless, a significant portion of the duplicates is believed to represent true signals. Obviously, removing all duplicates will underestimate the signal level in peaks and impact the identification of signal changes across samples. Therefore, an in-depth evaluation of the impact from duplicate removal is needed. Using eight public ChIP-seq datasets from three narrow-peak and two broad-peak marks, we tried to understand the distribution of duplicates in the genome, the extent by which duplicate removal impacts peak calling and signal estimation, and the factors associated with duplicate level in peaks. The three PCR-free histone H3 lysine 4 trimethylation (H3K4me3) ChIP-seq data had about 40% duplicates and 97% of them were within peaks. For the other datasets generated with PCR amplification of ChIP DNA, as expected, the narrow-peak marks have a much higher proportion of duplicates than the broad-peak marks. We found that duplicates are enriched in peaks and largely represent true signals, more conspicuous in those with high confidence. Furthermore, duplicate level in peaks is strongly correlated with the target enrichment level estimated using nonredundant reads, which provides the basis to properly allocate duplicates between noise and signal. 
Our analysis supports the feasibility of retaining the portion of signal duplicates into downstream analysis, thus alleviating the limitation of complete deduplication.
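The definition of duplicates used in the description (reads mapping to the same genomic location and strand) can be illustrated with a minimal sketch. This is not the authors' pipeline. The function names, the (chrom, position, strand) read tuples, and the half-open (chrom, start, end) peak intervals are all assumptions made for illustration, and the first read at each position is counted as the nonredundant read.

```python
from collections import Counter


def duplicate_rate(reads):
    """Fraction of reads that are duplicates.

    `reads` is an iterable of hypothetical (chrom, position, strand) tuples.
    For a position/strand seen n times, n - 1 reads count as duplicates.
    """
    counts = Counter(reads)
    total = sum(counts.values())
    if total == 0:
        return 0.0
    return (total - len(counts)) / total


def fraction_of_duplicates_in_peaks(reads, peaks):
    """Fraction of duplicate reads whose position lies inside any peak.

    `peaks` is a list of hypothetical half-open (chrom, start, end) intervals.
    """
    counts = Counter(reads)
    in_peaks = 0
    all_duplicates = 0
    for (chrom, pos, _strand), n in counts.items():
        extra = n - 1  # reads beyond the first at this position and strand
        if extra == 0:
            continue
        all_duplicates += extra
        # Linear scan over peaks; fine for a small illustrative input.
        if any(c == chrom and start <= pos < end for c, start, end in peaks):
            in_peaks += extra
    if all_duplicates == 0:
        return 0.0
    return in_peaks / all_duplicates


if __name__ == "__main__":
    # Toy input, not data from the study.
    reads = (
        [("chr1", 100, "+")] * 3
        + [("chr1", 100, "-")]
        + [("chr1", 500, "+")] * 2
        + [("chr2", 50, "+")]
    )
    peaks = [("chr1", 90, 110)]
    print(duplicate_rate(reads))
    print(fraction_of_duplicates_in_peaks(reads, peaks))
```

Note that reads at the same position on opposite strands are not treated as duplicates of each other, which follows the location-and-strand definition in the description. Real data would normally be read from an alignment file rather than from in-memory tuples.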
format Online
Article
Text
id pubmed-6447195
institution National Center for Biotechnology Information
language English
publishDate 2019
publisher Public Library of Science
record_format MEDLINE/PubMed
spelling pubmed-6447195 2019-04-17 Identification of factors associated with duplicate rate in ChIP-seq data Tian, Shulan Peng, Shuxia Kalmbach, Michael Gaonkar, Krutika S. Bhagwate, Aditya Ding, Wei Eckel-Passow, Jeanette Yan, Huihuang Slager, Susan L. PLoS One Research Article Public Library of Science 2019-04-03 /pmc/articles/PMC6447195/ /pubmed/30943272 http://dx.doi.org/10.1371/journal.pone.0214723 Text en © 2019 Tian et al http://creativecommons.org/licenses/by/4.0/ This is an open access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.
title Identification of factors associated with duplicate rate in ChIP-seq data
topic Research Article
url https://www.ncbi.nlm.nih.gov/pmc/articles/PMC6447195/
https://www.ncbi.nlm.nih.gov/pubmed/30943272
http://dx.doi.org/10.1371/journal.pone.0214723