Large-Scale Quality Analysis of Published ChIP-seq Data

Bibliographic Details
Main Authors: Marinov, Georgi K., Kundaje, Anshul, Park, Peter J., Wold, Barbara J.
Format: Online Article Text
Language: English
Published: Genetics Society of America 2013
Subjects:
Online Access: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC3931556/
https://www.ncbi.nlm.nih.gov/pubmed/24347632
http://dx.doi.org/10.1534/g3.113.008680
_version_ 1782304673268498432
author Marinov, Georgi K.
Kundaje, Anshul
Park, Peter J.
Wold, Barbara J.
author_facet Marinov, Georgi K.
Kundaje, Anshul
Park, Peter J.
Wold, Barbara J.
author_sort Marinov, Georgi K.
collection PubMed
description ChIP-seq has become the primary method for identifying in vivo protein–DNA interactions on a genome-wide scale, with nearly 800 publications involving the technique appearing in PubMed as of December 2012. Individually and in aggregate, these data are an important and information-rich resource. However, uncertainties about data quality confound their use by the wider research community. Recently, the Encyclopedia of DNA Elements (ENCODE) project developed and applied metrics to objectively measure ChIP-seq data quality. The ENCODE quality analysis was useful for flagging datasets for closer inspection, eliminating or replacing poor data, and for driving changes in experimental pipelines. There had been no similarly systematic quality analysis of the large and disparate body of published ChIP-seq profiles. Here, we report a uniform analysis of vertebrate transcription factor ChIP-seq datasets in the Gene Expression Omnibus (GEO) repository as of April 1, 2012. The majority (55%) of datasets scored as being highly successful, but a substantial minority (20%) were of apparently poor quality, and another ∼25% were of intermediate quality. We discuss how different uses of ChIP-seq data are affected by specific aspects of data quality, and we highlight exceptional instances for which the metric values should not be taken at face value. Unexpectedly, we discovered that a significant subset of control datasets (i.e., no immunoprecipitation and mock immunoprecipitation samples) display an enrichment structure similar to successful ChIP-seq data. This can, in turn, affect peak calling and data interpretation. Published datasets identified here as high-quality comprise a large group that users can draw on for large-scale integrated analysis. 
In the future, ChIP-seq quality assessment similar to that used here could guide experimentalists at early stages in a study, provide useful input in the publication process, and be used to stratify ChIP-seq data for different community-wide uses.
format Online
Article
Text
id pubmed-3931556
institution National Center for Biotechnology Information
language English
publishDate 2013
publisher Genetics Society of America
record_format MEDLINE/PubMed
spelling pubmed-3931556 2014-02-24 Large-Scale Quality Analysis of Published ChIP-seq Data Marinov, Georgi K. Kundaje, Anshul Park, Peter J. Wold, Barbara J. G3 (Bethesda) Investigations Genetics Society of America 2013-12-17 /pmc/articles/PMC3931556/ /pubmed/24347632 http://dx.doi.org/10.1534/g3.113.008680 Text en Copyright © 2014 Marinov et al. http://creativecommons.org/licenses/by/3.0/ This is an open-access article distributed under the terms of the Creative Commons Attribution Unported License (http://creativecommons.org/licenses/by/3.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.
spellingShingle Investigations
Marinov, Georgi K.
Kundaje, Anshul
Park, Peter J.
Wold, Barbara J.
Large-Scale Quality Analysis of Published ChIP-seq Data
title Large-Scale Quality Analysis of Published ChIP-seq Data
title_sort large-scale quality analysis of published chip-seq data
topic Investigations
url https://www.ncbi.nlm.nih.gov/pmc/articles/PMC3931556/
https://www.ncbi.nlm.nih.gov/pubmed/24347632
http://dx.doi.org/10.1534/g3.113.008680
work_keys_str_mv AT marinovgeorgik largescalequalityanalysisofpublishedchipseqdata
AT kundajeanshul largescalequalityanalysisofpublishedchipseqdata
AT parkpeterj largescalequalityanalysisofpublishedchipseqdata
AT woldbarbaraj largescalequalityanalysisofpublishedchipseqdata