Cargando…

A parallel and incremental algorithm for efficient unique signature discovery on DNA databases

BACKGROUND: DNA signatures are distinct short nucleotide sequences that provide valuable information that is used for various purposes, such as the design of Polymerase Chain Reaction primers and microarray experiments. Biologists usually use a discovery algorithm to find unique signatures from DNA...

Descripción completa

Detalles Bibliográficos
Autores principales: Lee, Hsiao Ping, Sheu, Tzu-Fang, Tang, Chuan Yi
Formato: Texto
Lenguaje:English
Publicado: BioMed Central 2010
Materias:
Acceso en línea:https://www.ncbi.nlm.nih.gov/pmc/articles/PMC2848650/
https://www.ncbi.nlm.nih.gov/pubmed/20230647
http://dx.doi.org/10.1186/1471-2105-11-132
_version_ 1782179703032905728
author Lee, Hsiao Ping
Sheu, Tzu-Fang
Tang, Chuan Yi
author_facet Lee, Hsiao Ping
Sheu, Tzu-Fang
Tang, Chuan Yi
author_sort Lee, Hsiao Ping
collection PubMed
description BACKGROUND: DNA signatures are distinct short nucleotide sequences that provide valuable information that is used for various purposes, such as the design of Polymerase Chain Reaction primers and microarray experiments. Biologists usually use a discovery algorithm to find unique signatures from DNA databases, and then apply the signatures to microarray experiments. Such discovery algorithms require to set some input factors, such as signature length l and mismatch tolerance d, which affect the discovery results. However, suggestions about how to select proper factor values are rare, especially when an unfamiliar DNA database is used. In most cases, biologists typically select factor values based on experience, or even by guessing. If the discovered result is unsatisfactory, biologists change the input factors of the algorithm to obtain a new result. This process is repeated until a proper result is obtained. Implicit signatures under the discovery condition (l, d) are defined as the signatures of length ≤ l with mismatch tolerance ≥ d. A discovery algorithm that could discover all implicit signatures, such that those that meet the requirements concerning the results, would be more helpful than one that depends on trial and error. However, existing discovery algorithms do not address the need to discover all implicit signatures. RESULTS: This work proposes two discovery algorithms - the consecutive multiple discovery (CMD) algorithm and the parallel and incremental signature discovery (PISD) algorithm. The PISD algorithm is designed for efficiently discovering signatures under a certain discovery condition. The algorithm finds new results by using previously discovered results as candidates, rather than by using the whole database. The PISD algorithm further increases discovery efficiency by applying parallel computing. The CMD algorithm is designed to discover implicit signatures efficiently. It uses the PISD algorithm as a kernel routine to discover implicit signatures efficiently under every feasible discovery condition. CONCLUSIONS: The proposed algorithms discover implicit signatures efficiently. The presented CMD algorithm has up to 97% less execution time than typical sequential discovery algorithms in the discovery of implicit signatures in experiments, when eight processing cores are used.
format Text
id pubmed-2848650
institution National Center for Biotechnology Information
language English
publishDate 2010
publisher BioMed Central
record_format MEDLINE/PubMed
spelling pubmed-28486502010-04-02 A parallel and incremental algorithm for efficient unique signature discovery on DNA databases Lee, Hsiao Ping Sheu, Tzu-Fang Tang, Chuan Yi BMC Bioinformatics Methodology article BACKGROUND: DNA signatures are distinct short nucleotide sequences that provide valuable information that is used for various purposes, such as the design of Polymerase Chain Reaction primers and microarray experiments. Biologists usually use a discovery algorithm to find unique signatures from DNA databases, and then apply the signatures to microarray experiments. Such discovery algorithms require to set some input factors, such as signature length l and mismatch tolerance d, which affect the discovery results. However, suggestions about how to select proper factor values are rare, especially when an unfamiliar DNA database is used. In most cases, biologists typically select factor values based on experience, or even by guessing. If the discovered result is unsatisfactory, biologists change the input factors of the algorithm to obtain a new result. This process is repeated until a proper result is obtained. Implicit signatures under the discovery condition (l, d) are defined as the signatures of length ≤ l with mismatch tolerance ≥ d. A discovery algorithm that could discover all implicit signatures, such that those that meet the requirements concerning the results, would be more helpful than one that depends on trial and error. However, existing discovery algorithms do not address the need to discover all implicit signatures. RESULTS: This work proposes two discovery algorithms - the consecutive multiple discovery (CMD) algorithm and the parallel and incremental signature discovery (PISD) algorithm. The PISD algorithm is designed for efficiently discovering signatures under a certain discovery condition. The algorithm finds new results by using previously discovered results as candidates, rather than by using the whole database. The PISD algorithm further increases discovery efficiency by applying parallel computing. The CMD algorithm is designed to discover implicit signatures efficiently. It uses the PISD algorithm as a kernel routine to discover implicit signatures efficiently under every feasible discovery condition. CONCLUSIONS: The proposed algorithms discover implicit signatures efficiently. The presented CMD algorithm has up to 97% less execution time than typical sequential discovery algorithms in the discovery of implicit signatures in experiments, when eight processing cores are used. BioMed Central 2010-03-16 /pmc/articles/PMC2848650/ /pubmed/20230647 http://dx.doi.org/10.1186/1471-2105-11-132 Text en Copyright ©2010 Lee et al; licensee BioMed Central Ltd. http://creativecommons.org/licenses/by/2.0 This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.
spellingShingle Methodology article
Lee, Hsiao Ping
Sheu, Tzu-Fang
Tang, Chuan Yi
A parallel and incremental algorithm for efficient unique signature discovery on DNA databases
title A parallel and incremental algorithm for efficient unique signature discovery on DNA databases
title_full A parallel and incremental algorithm for efficient unique signature discovery on DNA databases
title_fullStr A parallel and incremental algorithm for efficient unique signature discovery on DNA databases
title_full_unstemmed A parallel and incremental algorithm for efficient unique signature discovery on DNA databases
title_short A parallel and incremental algorithm for efficient unique signature discovery on DNA databases
title_sort parallel and incremental algorithm for efficient unique signature discovery on dna databases
topic Methodology article
url https://www.ncbi.nlm.nih.gov/pmc/articles/PMC2848650/
https://www.ncbi.nlm.nih.gov/pubmed/20230647
http://dx.doi.org/10.1186/1471-2105-11-132
work_keys_str_mv AT leehsiaoping aparallelandincrementalalgorithmforefficientuniquesignaturediscoveryondnadatabases
AT sheutzufang aparallelandincrementalalgorithmforefficientuniquesignaturediscoveryondnadatabases
AT tangchuanyi aparallelandincrementalalgorithmforefficientuniquesignaturediscoveryondnadatabases
AT leehsiaoping parallelandincrementalalgorithmforefficientuniquesignaturediscoveryondnadatabases
AT sheutzufang parallelandincrementalalgorithmforefficientuniquesignaturediscoveryondnadatabases
AT tangchuanyi parallelandincrementalalgorithmforefficientuniquesignaturediscoveryondnadatabases