
Statistics of protein-DNA binding and the total number of binding sites for a transcription factor in the mammalian genome

BACKGROUND: Transcription factor (TF)-DNA binding loci are explored by analyzing massive datasets generated with application of Chromatin Immuno-Precipitation (ChIP)-based high-throughput sequencing technologies. These datasets suffer from a bias in the information about binding loci availability, s...


Bibliographic Details
Main Authors: Kuznetsov, Vladimir A, Singh, Onkar, Jenjaroenpun, Piroon
Format: Text
Language: English
Published: BioMed Central 2010
Subjects:
Online Access: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC2822526/
https://www.ncbi.nlm.nih.gov/pubmed/20158869
http://dx.doi.org/10.1186/1471-2164-11-S1-S12
author Kuznetsov, Vladimir A
Singh, Onkar
Jenjaroenpun, Piroon
collection PubMed
description BACKGROUND: Transcription factor (TF)-DNA binding loci are explored by analyzing massive datasets generated with Chromatin Immuno-Precipitation (ChIP)-based high-throughput sequencing technologies. These datasets suffer from bias in the information about binding loci availability, sample incompleteness, and diverse sources of technical and biological noise. Therefore, adequate mathematical models of ChIP-based high-throughput assays and statistical tools are required for robust identification of specific and reliable TF binding sites (TFBSs), precise characterization of the TFBS avidity distribution, and plausible estimation of the total number of specific TFBSs for a given TF in the genome of a given cell type. RESULTS: We developed an exploratory mixture probabilistic model for specific and non-specific transcription factor-DNA (TF-DNA) binding. Within ChIP-seq data sets, the statistics of specific and non-specific DNA-protein binding are defined by a mixture of sample size-dependent skewed functions described by the Kolmogorov-Waring (K-W) function (Kuznetsov, 2003) and an exponential function, respectively. Using available ChIP-seq data for eleven TFs essential for self-maintenance and differentiation of mouse embryonic stem cells (SC) (Nanog, Oct4, Sox2, Klf4, STAT3, E2F1, Tcfcp2l1, ZFX, n-Myc, c-Myc and Esrrb) reported in Chen et al (2008), we estimated (i) the specificity and the sensitivity of the ChIP-seq binding assays and (ii) the number of specific binding sites (BSs) in the genome of mouse embryonic stem cells that were not identified in the current experiments. Motif finding analysis applied to the identified c-Myc TFBSs supports our results and allowed us to predict many novel c-Myc target genes. CONCLUSION: We provide a novel methodology for estimating the specificity and the sensitivity of TF-DNA binding in massively parallel ChIP sequencing (ChIP-seq) binding assays. Goodness-of-fit analysis of K-W functions suggests that a large fraction of low- and moderate-avidity TFBSs cannot be identified by ChIP-based methods. Thus the task of identifying the binding sensitivity of a TF cannot yet be technically resolved by current ChIP-seq, compared to former experimental techniques. Considering our improvement in measuring the sensitivity and the specificity of the TFs obtained from the ChIP-seq data, the models of transcriptional regulatory networks in embryonic cells and other cell types derived from the given ChIP-seq data should be carefully revised.
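The mixture idea in the abstract (a skewed component for specific binding plus an exponential component for non-specific binding) can be illustrated with a minimal sketch. This is not the authors' model: the record does not give the form of the K-W function, so a shifted power law is used here purely as a stand-in, and every function name and parameter value below is hypothetical. The sketch shows how a read-count threshold trades sensitivity against specificity under such a mixture.

```python
import math


def normalized(weights):
    """Scale non-negative weights so they sum to 1."""
    total = sum(weights)
    return [w / total for w in weights]


def exponential_pmf(rate, k_max):
    """Discretized exponential over k = 1..k_max (non-specific binding)."""
    return normalized([math.exp(-rate * k) for k in range(1, k_max + 1)])


def heavy_tail_pmf(shift, exponent, k_max):
    """Shifted power law over k = 1..k_max.

    ASSUMPTION: a stand-in for the skewed specific-binding component;
    it is NOT the Kolmogorov-Waring function from the paper.
    """
    return normalized([(k + shift) ** (-exponent) for k in range(1, k_max + 1)])


def tail_mass(pmf, threshold):
    """P(K >= threshold), where pmf[0] holds the probability of k = 1."""
    return sum(pmf[threshold - 1:])


def assess_threshold(specific, nonspecific, specific_weight, threshold):
    """Return (sensitivity, specificity, precision) for calls with k >= threshold."""
    sensitivity = tail_mass(specific, threshold)
    specificity = 1.0 - tail_mass(nonspecific, threshold)
    called_specific = specific_weight * sensitivity
    called_noise = (1.0 - specific_weight) * (1.0 - specificity)
    precision = called_specific / (called_specific + called_noise)
    return sensitivity, specificity, precision


# Hypothetical parameters chosen only for illustration.
specific = heavy_tail_pmf(shift=1.0, exponent=2.0, k_max=200)
noise = exponential_pmf(rate=1.0, k_max=200)
for t in (2, 5, 10):
    sens, spec, prec = assess_threshold(specific, noise, 0.3, t)
    print(t, round(sens, 3), round(spec, 3), round(prec, 3))
```

Raising the threshold removes most of the exponential noise but also discards the low-count part of the skewed component, which is the qualitative point behind the abstract's remark that many low- and moderate-avidity sites go undetected.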
format Text
id pubmed-2822526
institution National Center for Biotechnology Information
language English
publishDate 2010
publisher BioMed Central
record_format MEDLINE/PubMed
spelling pubmed-2822526 2010-02-17 Statistics of protein-DNA binding and the total number of binding sites for a transcription factor in the mammalian genome Kuznetsov, Vladimir A Singh, Onkar Jenjaroenpun, Piroon BMC Genomics Research BioMed Central 2010-02-10 /pmc/articles/PMC2822526/ /pubmed/20158869 http://dx.doi.org/10.1186/1471-2164-11-S1-S12 Text en Copyright ©2010 Kuznetsov et al; licensee BioMed Central Ltd. http://creativecommons.org/licenses/by/2.0 This is an open access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.
title Statistics of protein-DNA binding and the total number of binding sites for a transcription factor in the mammalian genome
topic Research