
Suffix tree searcher: exploration of common substrings in large DNA sequence sets

BACKGROUND: Large DNA sequence data sets require special bioinformatics tools to search and compare them. Such tools should be easy to use so that the data can be easily accessed by a wide array of researchers. In the past, the use of suffix trees for searching DNA sequences has been limited by a pr...


Bibliographic Details
Main authors:	Minkley, David; Whitney, Michael J; Lin, Song-Han; Barsky, Marina G; Kelly, Chris; Upton, Chris
Format:	Online Article Text
Language:	English
Published:	BioMed Central 2014
Subjects:	Research Article
Online access:	https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4118789/ https://www.ncbi.nlm.nih.gov/pubmed/25053142 http://dx.doi.org/10.1186/1756-0500-7-466

author	Minkley, David Whitney, Michael J Lin, Song-Han Barsky, Marina G Kelly, Chris Upton, Chris
collection	PubMed
description	BACKGROUND: Large DNA sequence data sets require special bioinformatics tools to search and compare them. Such tools should be easy to use so that the data can be easily accessed by a wide array of researchers. In the past, the use of suffix trees for searching DNA sequences has been limited by a practical need to keep the trees in RAM. Newer algorithms solve this problem by using disk-based approaches. However, none of the fastest suffix tree algorithms have been implemented with a graphical user interface, preventing their incorporation into a feasible laboratory workflow. RESULTS: Suffix Tree Searcher (STS) is designed as an easy-to-use tool to index, search, and analyze very large DNA sequence datasets. The program accommodates very large numbers of very large sequences, with aggregate size reaching tens of billions of nucleotides. The program makes use of pre-sorted persistent "building blocks" to reduce the time required to construct new trees. STS is comprised of a graphical user interface written in Java, and four C modules. All components are automatically downloaded when a web link is clicked. The underlying suffix tree data structure permits extremely fast searching for specific nucleotide strings, with wild cards or mismatches allowed. Complete tree traversals for detecting common substrings are also very fast. The graphical user interface allows the user to transition seamlessly between building, traversing, and searching the dataset. CONCLUSIONS: Thus, STS provides a new resource for the detection of substrings common to multiple DNA sequences or within a single sequence, for truly huge data sets. The re-searching of sequence hits, allowing wild card positions or mismatched nucleotides, together with the ability to rapidly retrieve large numbers of sequence hits from the DNA sequence files, provides the user with an efficient method of evaluating the similarity between nucleotide sequences by multiple alignment or use of Logos. The ability to re-use existing suffix tree pieces considerably shortens index generation time. The graphical user interface enables quick mastery of the analysis functions, easy access to the generated data, and seamless workflow integration.
format	Online Article Text
id	pubmed-4118789
institution	National Center for Biotechnology Information
language	English
publishDate	2014
publisher	BioMed Central
record_format	MEDLINE/PubMed
spelling	pubmed-4118789 2014-08-02 Suffix tree searcher: exploration of common substrings in large DNA sequence sets Minkley, David Whitney, Michael J Lin, Song-Han Barsky, Marina G Kelly, Chris Upton, Chris BMC Res Notes Research Article BioMed Central 2014-07-23 /pmc/articles/PMC4118789/ /pubmed/25053142 http://dx.doi.org/10.1186/1756-0500-7-466 Text en Copyright © 2014 Minkley et al.; licensee BioMed Central Ltd. http://creativecommons.org/licenses/by/4.0 This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/4.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly credited. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated.
title	Suffix tree searcher: exploration of common substrings in large DNA sequence sets
topic	Research Article

