Identification of factors associated with duplicate rate in ChIP-seq data
Chromatin immunoprecipitation and sequencing (ChIP-seq) has been widely used to map DNA-binding proteins, histone proteins and their modifications. ChIP-seq data contains redundant reads termed duplicates, referring to those mapping to the same genomic location and strand. There are two main sources...
Main authors: | Tian, Shulan; Peng, Shuxia; Kalmbach, Michael; Gaonkar, Krutika S.; Bhagwate, Aditya; Ding, Wei; Eckel-Passow, Jeanette; Yan, Huihuang; Slager, Susan L. |
---|---|
Format: | Online Article Text |
Language: | English |
Published: | Public Library of Science 2019 |
Subjects: | Research Article |
Online access: | https://www.ncbi.nlm.nih.gov/pmc/articles/PMC6447195/ https://www.ncbi.nlm.nih.gov/pubmed/30943272 http://dx.doi.org/10.1371/journal.pone.0214723 |
_version_ | 1783408474427752448 |
---|---|
author | Tian, Shulan Peng, Shuxia Kalmbach, Michael Gaonkar, Krutika S. Bhagwate, Aditya Ding, Wei Eckel-Passow, Jeanette Yan, Huihuang Slager, Susan L. |
author_sort | Tian, Shulan |
collection | PubMed |
description | Chromatin immunoprecipitation and sequencing (ChIP-seq) has been widely used to map DNA-binding proteins, histone proteins and their modifications. ChIP-seq data contains redundant reads termed duplicates, referring to those mapping to the same genomic location and strand. There are two main sources of duplicates: polymerase chain reaction (PCR) duplicates and natural duplicates. Unlike natural duplicates, which represent true signals from sequencing of independent DNA templates, PCR duplicates are artifacts originating from sequencing of identical copies amplified from the same DNA template. In analysis, duplicates are removed from peak calling and signal quantification. Nevertheless, a significant portion of the duplicates is believed to represent true signals. Removing all duplicates will therefore underestimate the signal level in peaks and affect the identification of signal changes across samples, so an in-depth evaluation of the impact of duplicate removal is needed. Using eight public ChIP-seq datasets from three narrow-peak and two broad-peak marks, we tried to understand the distribution of duplicates in the genome, the extent to which duplicate removal impacts peak calling and signal estimation, and the factors associated with duplicate level in peaks. The three PCR-free histone H3 lysine 4 trimethylation (H3K4me3) ChIP-seq datasets had about 40% duplicates, and 97% of them were within peaks. For the other datasets, generated with PCR amplification of ChIP DNA, the narrow-peak marks had, as expected, a much higher proportion of duplicates than the broad-peak marks. We found that duplicates are enriched in peaks and largely represent true signals, which is more conspicuous in high-confidence peaks. Furthermore, duplicate level in peaks is strongly correlated with the target enrichment level estimated using nonredundant reads, which provides the basis to properly allocate duplicates between noise and signal. Our analysis supports the feasibility of retaining the signal portion of duplicates in downstream analysis, thus alleviating the limitation of complete deduplication. |
format | Online Article Text |
id | pubmed-6447195 |
institution | National Center for Biotechnology Information |
language | English |
publishDate | 2019 |
publisher | Public Library of Science |
record_format | MEDLINE/PubMed |
spelling | pubmed-6447195 2019-04-17 Identification of factors associated with duplicate rate in ChIP-seq data Tian, Shulan Peng, Shuxia Kalmbach, Michael Gaonkar, Krutika S. Bhagwate, Aditya Ding, Wei Eckel-Passow, Jeanette Yan, Huihuang Slager, Susan L. PLoS One Research Article Public Library of Science 2019-04-03 /pmc/articles/PMC6447195/ /pubmed/30943272 http://dx.doi.org/10.1371/journal.pone.0214723 Text en © 2019 Tian et al http://creativecommons.org/licenses/by/4.0/ This is an open access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited. |
title | Identification of factors associated with duplicate rate in ChIP-seq data |
topic | Research Article |
url | https://www.ncbi.nlm.nih.gov/pmc/articles/PMC6447195/ https://www.ncbi.nlm.nih.gov/pubmed/30943272 http://dx.doi.org/10.1371/journal.pone.0214723 |
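
The description above defines duplicates as reads mapping to the same genomic location and strand. As a rough illustration of that definition only, the following Python sketch computes a duplicate rate from a list of mapped reads. The `Read` record and the `duplicate_rate` function are hypothetical names introduced here, not part of the article or of any library, and the sketch assumes single-end reads keyed by their 5' mapping position. It cannot tell PCR duplicates from natural duplicates; it only counts redundancy.

```python
from collections import Counter
from typing import Iterable, NamedTuple


class Read(NamedTuple):
    """Hypothetical minimal alignment record (an assumption, not from the article)."""
    chrom: str
    start: int   # 5' mapping position of the read
    strand: str  # "+" or "-"


def duplicate_rate(reads: Iterable[Read]) -> float:
    """Return the fraction of reads that are redundant copies.

    Reads sharing (chrom, start, strand) form one group; one read per group
    is counted as nonredundant and the remaining reads as duplicates.
    """
    counts = Counter((r.chrom, r.start, r.strand) for r in reads)
    total = sum(counts.values())
    if total == 0:
        return 0.0
    duplicates = total - len(counts)
    return duplicates / total


reads = [
    Read("chr1", 100, "+"),
    Read("chr1", 100, "+"),  # same position and strand: counted as a duplicate
    Read("chr1", 100, "-"),  # same position, opposite strand: not a duplicate
    Read("chr2", 250, "+"),
]
print(duplicate_rate(reads))  # 0.25
```

Restricting `reads` to those overlapping called peaks, or to those outside them, would give the per-region duplicate levels that the description compares, under the same assumptions.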