
THiCweed: fast, sensitive detection of sequence features by clustering big datasets

We present THiCweed, a new approach to analyzing transcription factor binding data from high-throughput chromatin immunoprecipitation-sequencing (ChIP-seq) experiments. THiCweed clusters bound regions based on sequence similarity using a divisive hierarchical clustering approach based on sequence si...


Bibliographic Details
Main Authors:	Agrawal, Ankit; Sambare, Snehal V; Narlikar, Leelavati; Siddharthan, Rahul
Format:	Online Article Text
Language:	English
Published:	Oxford University Press 2018
Subjects:	Methods Online
Online Access:	https://www.ncbi.nlm.nih.gov/pmc/articles/PMC5861420/ https://www.ncbi.nlm.nih.gov/pubmed/29267972 http://dx.doi.org/10.1093/nar/gkx1251

author	Agrawal, Ankit; Sambare, Snehal V; Narlikar, Leelavati; Siddharthan, Rahul
collection	PubMed
description	We present THiCweed, a new approach to analyzing transcription factor binding data from high-throughput chromatin immunoprecipitation-sequencing (ChIP-seq) experiments. THiCweed clusters bound regions using a divisive hierarchical clustering approach based on sequence similarity within sliding windows, while exploring both strands. THiCweed is specially geared toward data containing mixtures of motifs, which present a challenge to traditional motif-finders. Our implementation is significantly faster than standard motif-finding programs, able to process 30 000 peaks in 1–2 h on a single CPU core of a desktop computer. On synthetic data containing mixtures of motifs it is as accurate as or more accurate than all other tested programs. THiCweed performs best with large ‘window’ sizes (≥50 bp), much longer than typical binding sites (7–15 bp). On real data it successfully recovers literature motifs, but also uncovers complex sequence characteristics in flanking DNA, variant motifs and secondary motifs even when they occur in <5% of the input, all of which appear biologically relevant. We also find recurring sequence patterns across diverse ChIP-seq datasets, possibly related to chromatin architecture and looping. THiCweed thus goes beyond traditional motif finding to give new insights into genomic transcription factor-binding complexity.
format	Online Article Text
id	pubmed-5861420
institution	National Center for Biotechnology Information
language	English
publishDate	2018
publisher	Oxford University Press
record_format	MEDLINE/PubMed
spelling	pubmed-5861420 2018-03-28 Nucleic Acids Res Methods Online Oxford University Press 2018-03-16 2017-12-18 /pmc/articles/PMC5861420/ /pubmed/29267972 http://dx.doi.org/10.1093/nar/gkx1251 Text en © The Author(s) 2017. Published by Oxford University Press on behalf of Nucleic Acids Research. http://creativecommons.org/licenses/by/4.0/ This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted reuse, distribution, and reproduction in any medium, provided the original work is properly cited.
title	THiCweed: fast, sensitive detection of sequence features by clustering big datasets
topic	Methods Online
url	https://www.ncbi.nlm.nih.gov/pmc/articles/PMC5861420/ https://www.ncbi.nlm.nih.gov/pubmed/29267972 http://dx.doi.org/10.1093/nar/gkx1251
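
The description mentions comparing sequences within sliding windows while exploring both strands. As a rough illustration of that one idea only, the Python sketch below slides a window along both strands of a sequence and reports the placement most similar to a reference string. It is not the THiCweed algorithm or its scoring: the function names (`reverse_complement`, `identity`, `best_window`) and the simple fraction-of-matches score are assumptions made for this example, and the divisive clustering step is not shown.

```python
# Illustrative sketch only; names and scoring are assumptions, not THiCweed's code.
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    """Return the reverse complement of a DNA string (A, C, G, T only)."""
    return seq.translate(COMPLEMENT)[::-1]


def identity(a, b):
    """Fraction of positions at which two equal-length strings agree."""
    return sum(x == y for x, y in zip(a, b)) / len(a)


def best_window(seq, reference):
    """Slide a window of len(reference) along both strands of seq.

    Returns (score, strand, offset) for the window most similar to
    reference; offsets on the '-' strand index the reverse complement.
    Ties keep the first window found (forward strand first).
    """
    w = len(reference)
    best = (-1.0, "+", 0)
    for strand, s in (("+", seq), ("-", reverse_complement(seq))):
        for offset in range(len(s) - w + 1):
            score = identity(s[offset:offset + w], reference)
            if score > best[0]:
                best = (score, strand, offset)
    return best


if __name__ == "__main__":
    # The reference occurs only on the reverse strand of this toy sequence.
    print(best_window("CCCCTGTAATCGG", "GATTACA"))
```

In a clustering setting, a score of this kind could be used to choose the strand and shift of each sequence before it is compared with a cluster, so that the same motif is recognized regardless of orientation or position in the peak.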
