
Set cover-based methods for motif selection

MOTIVATION: De novo motif discovery algorithms find statistically over-represented sequence motifs that may function as transcription factor binding sites. Current methods often report large numbers of motifs, making it difficult to perform further analyses and experimental validation. The motif sel...


Bibliographic Details
Main authors: Li, Yichao, Liu, Yating, Juedes, David, Drews, Frank, Bunescu, Razvan, Welch, Lonnie
Format: Online Article Text
Language: English
Published: Oxford University Press 2020
Subjects:
Online access: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC7703758/
https://www.ncbi.nlm.nih.gov/pubmed/31665223
http://dx.doi.org/10.1093/bioinformatics/btz697
_version_ 1783616689544364032
author Li, Yichao
Liu, Yating
Juedes, David
Drews, Frank
Bunescu, Razvan
Welch, Lonnie
collection PubMed
description MOTIVATION: De novo motif discovery algorithms find statistically over-represented sequence motifs that may function as transcription factor binding sites. Current methods often report large numbers of motifs, making it difficult to perform further analyses and experimental validation. The motif selection problem seeks to identify a minimal set of putative regulatory motifs that characterize sequences of interest (e.g. ChIP-Seq binding regions). RESULTS: In this study, the motif selection problem is mapped to variants of the set cover problem that are solved via tabu search and by relaxed integer linear programming (RILP). The algorithms are employed to analyze 349 ChIP-Seq experiments from the ENCODE project, yielding a small number of high-quality motifs that represent putative binding sites of primary factors and cofactors. Specifically, when compared with the motifs reported by Kheradpour and Kellis, the set cover-based algorithms produced motif sets covering 35% more peaks for 11 TFs and identified 4 more putative cofactors for 6 TFs. Moreover, a systematic evaluation using nested cross-validation revealed that the RILP algorithm selected fewer motifs and was able to cover 6% more peaks and 3% fewer background regions, which reduced the error rate by 7%. AVAILABILITY AND IMPLEMENTATION: The source code of the algorithms and all the datasets are available at https://github.com/YichaoOU/Set_cover_tools. SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.
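The mapping from motif selection to set cover described in the abstract can be illustrated with a small sketch. The code below uses the classic greedy set cover heuristic, which is a simpler stand-in and not the tabu search or RILP method of the paper. The function name `greedy_motif_cover`, the motif labels and the peak identifiers are all hypothetical and chosen only for illustration.

```python
from typing import Dict, List, Set


def greedy_motif_cover(motif_hits: Dict[str, Set[str]], peaks: Set[str]) -> List[str]:
    """Pick motifs greedily until every coverable peak is covered.

    motif_hits maps a motif name to the set of peaks in which it occurs.
    This is the textbook greedy heuristic, used here as an assumption-laden
    illustration; it is not the algorithm from the paper.
    """
    uncovered = set(peaks)
    selected: List[str] = []
    while uncovered:
        # Choose the motif that covers the most still-uncovered peaks;
        # sorting the names first makes tie-breaking deterministic.
        best = max(
            sorted(motif_hits),
            key=lambda m: len(motif_hits[m] & uncovered),
            default=None,
        )
        if best is None:
            break  # no candidate motifs at all
        gained = motif_hits[best] & uncovered
        if not gained:
            break  # remaining peaks cannot be covered by any motif
        selected.append(best)
        uncovered -= gained
    return selected


# Hypothetical toy input: four candidate motifs over five peaks.
motif_hits = {
    "M1": {"p1", "p2", "p3"},
    "M2": {"p3", "p4"},
    "M3": {"p4", "p5"},
    "M4": {"p1"},
}
peaks = {"p1", "p2", "p3", "p4", "p5"}
selected = greedy_motif_cover(motif_hits, peaks)
print(selected)
```

In this toy input, two motifs suffice to cover all five peaks, which mirrors the goal of reporting a small motif set. A fuller treatment would also penalize coverage of background regions, as the abstract indicates.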
format Online
Article
Text
id pubmed-7703758
institution National Center for Biotechnology Information
language English
publishDate 2020
publisher Oxford University Press
record_format MEDLINE/PubMed
spelling pubmed-7703758 2020-12-07 Set cover-based methods for motif selection Li, Yichao Liu, Yating Juedes, David Drews, Frank Bunescu, Razvan Welch, Lonnie Bioinformatics Original Papers Oxford University Press 2020-02-15 2019-09-17 /pmc/articles/PMC7703758/ /pubmed/31665223 http://dx.doi.org/10.1093/bioinformatics/btz697 Text en © The Author(s) 2019. Published by Oxford University Press.
This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted reuse, distribution, and reproduction in any medium, provided the original work is properly cited.
title Set cover-based methods for motif selection
topic Original Papers
url https://www.ncbi.nlm.nih.gov/pmc/articles/PMC7703758/
https://www.ncbi.nlm.nih.gov/pubmed/31665223
http://dx.doi.org/10.1093/bioinformatics/btz697