SamSelect: a sample sequence selection algorithm for quorum planted motif search on large DNA datasets

BACKGROUND: Given a set of t n-length DNA sequences, q satisfying 0 < q ≤ 1, and l and d satisfying 0 ≤ d < l < n, the quorum planted motif search (qPMS) finds l-length strings that occur in at least qt input sequences with up to d mismatches and is mainly used to locate transcription facto...


Bibliographic Details
Main Authors: Yu, Qiang; Wei, Dingbang; Huo, Hongwei
Format: Online Article Text
Language: English
Published: BioMed Central 2018
Subjects:
Online Access: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC6006848/
https://www.ncbi.nlm.nih.gov/pubmed/29914360
http://dx.doi.org/10.1186/s12859-018-2242-y
author Yu, Qiang
Wei, Dingbang
Huo, Hongwei
collection PubMed
description BACKGROUND: Given a set of t n-length DNA sequences, q satisfying 0 < q ≤ 1, and l and d satisfying 0 ≤ d < l < n, the quorum planted motif search (qPMS) finds l-length strings that occur in at least qt input sequences with up to d mismatches and is mainly used to locate transcription factor binding sites in DNA sequences. Existing qPMS algorithms have been able to efficiently process small standard datasets (e.g., t = 20 and n = 600), but they are too time consuming to process large DNA datasets, such as ChIP-seq datasets that contain thousands of sequences or more. RESULTS: We analyze the effects of t and q on the time performance of qPMS algorithms and find that a large t or a small q causes a longer computation time. Based on this information, we improve the time performance of existing qPMS algorithms by selecting a sample sequence set D’ with a small t and a large q from the large input dataset D and then executing qPMS algorithms on D’. A sample sequence selection algorithm named SamSelect is proposed. The experimental results on both simulated and real data show (1) that SamSelect can select D’ efficiently and (2) that the qPMS algorithms executed on D’ can find implanted or real motifs in a significantly shorter time than when executed on D. CONCLUSIONS: We improve the ability of existing qPMS algorithms to process large DNA datasets from the perspective of selecting high-quality sample sequence sets so that the qPMS algorithms can find motifs in a short time in the selected sample sequence set D’, rather than take an unfeasibly long time to search the original sequence set D. Our motif discovery method is an approximate algorithm.
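The qPMS definition in the abstract can be illustrated with a small brute-force check. This is only a sketch of the problem statement, not SamSelect and not any of the qPMS algorithms the article evaluates; the function names (`hamming`, `occurs_in`, `is_quorum_motif`, `brute_force_qpms`) and the toy sequences are assumptions made for illustration.

```python
from itertools import product


def hamming(a: str, b: str) -> int:
    """Count the positions at which two equal-length strings differ."""
    return sum(x != y for x, y in zip(a, b))


def occurs_in(motif: str, seq: str, d: int) -> bool:
    """True if some l-length window of seq is within d mismatches of motif."""
    l = len(motif)
    return any(hamming(motif, seq[i:i + l]) <= d
               for i in range(len(seq) - l + 1))


def is_quorum_motif(motif: str, seqs: list[str], q: float, d: int) -> bool:
    """True if motif occurs (up to d mismatches) in at least q*t sequences."""
    hits = sum(occurs_in(motif, s, d) for s in seqs)
    return hits >= q * len(seqs)


def brute_force_qpms(seqs: list[str], l: int, d: int, q: float) -> list[str]:
    """Enumerate all 4^l candidate l-mers; only feasible for very small l."""
    candidates = ("".join(p) for p in product("ACGT", repeat=l))
    return [m for m in candidates if is_quorum_motif(m, seqs, q, d)]


# Hypothetical toy input: t = 4 sequences, l = 4, d = 1.
seqs = ["ACGTACGT", "TTACGATT", "GGGGGGGG", "ACCTAAAA"]
print(is_quorum_motif("ACGT", seqs, q=0.75, d=1))  # 3 of 4 sequences match
print(is_quorum_motif("ACGT", seqs, q=1.0, d=1))   # the all-G sequence fails
```

In this toy input, "ACGT" satisfies the quorum at q = 0.75 but not at q = 1. The check visits every window of every sequence for every candidate, so its cost grows with t, which is consistent with the abstract's motivation for searching a smaller sample set D’ instead of the full dataset D.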
format Online
Article
Text
id pubmed-6006848
institution National Center for Biotechnology Information
language English
publishDate 2018
publisher BioMed Central
record_format MEDLINE/PubMed
spelling pubmed-6006848 2018-06-26 SamSelect: a sample sequence selection algorithm for quorum planted motif search on large DNA datasets Yu, Qiang Wei, Dingbang Huo, Hongwei BMC Bioinformatics Research Article
BioMed Central 2018-06-18 /pmc/articles/PMC6006848/ /pubmed/29914360 http://dx.doi.org/10.1186/s12859-018-2242-y Text en © The Author(s). 2018 Open Access. This article is distributed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated.
title SamSelect: a sample sequence selection algorithm for quorum planted motif search on large DNA datasets
topic Research Article