
Simrank: Rapid and sensitive general-purpose k-mer search tool

BACKGROUND: Terabyte-scale collections of string-encoded data are expected from consortia efforts such as the Human Microbiome Project http://nihroadmap.nih.gov/hmp. Intra- and inter-project data similarity searches are enabled by rapid k-mer matching strategies. Software applications for sequence d...


Bibliographic Details
Main authors: DeSantis, Todd Z, Keller, Keith, Karaoz, Ulas, Alekseyenko, Alexander V, Singh, Navjeet NS, Brodie, Eoin L, Pei, Zhiheng, Andersen, Gary L, Larsen, Niels
Format: Text
Language: English
Published: BioMed Central 2011
Subjects: Software
Online access: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC3097142/
https://www.ncbi.nlm.nih.gov/pubmed/21524302
http://dx.doi.org/10.1186/1472-6785-11-11
author DeSantis, Todd Z
Keller, Keith
Karaoz, Ulas
Alekseyenko, Alexander V
Singh, Navjeet NS
Brodie, Eoin L
Pei, Zhiheng
Andersen, Gary L
Larsen, Niels
author_sort DeSantis, Todd Z
collection PubMed
description BACKGROUND: Terabyte-scale collections of string-encoded data are expected from consortia efforts such as the Human Microbiome Project http://nihroadmap.nih.gov/hmp. Intra- and inter-project data similarity searches are enabled by rapid k-mer matching strategies. Software applications for sequence database partitioning, guide tree estimation, molecular classification and alignment acceleration have benefited from embedded k-mer searches as sub-routines. However, a rapid, general-purpose, open-source, flexible, stand-alone k-mer tool has not been available. RESULTS: Here we present a stand-alone utility, Simrank, which allows users to rapidly identify database strings the most similar to query strings. Performance testing of Simrank and related tools against DNA, RNA, protein and human-languages found Simrank 10X to 928X faster depending on the dataset. CONCLUSIONS: Simrank provides molecular ecologists with a high-throughput, open source choice for comparing large sequence sets to find similarity.
format Text
id pubmed-3097142
institution National Center for Biotechnology Information
language English
publishDate 2011
publisher BioMed Central
record_format MEDLINE/PubMed
spelling pubmed-3097142 2011-05-19 Simrank: Rapid and sensitive general-purpose k-mer search tool DeSantis, Todd Z Keller, Keith Karaoz, Ulas Alekseyenko, Alexander V Singh, Navjeet NS Brodie, Eoin L Pei, Zhiheng Andersen, Gary L Larsen, Niels BMC Ecol Software BACKGROUND: Terabyte-scale collections of string-encoded data are expected from consortia efforts such as the Human Microbiome Project http://nihroadmap.nih.gov/hmp. Intra- and inter-project data similarity searches are enabled by rapid k-mer matching strategies. Software applications for sequence database partitioning, guide tree estimation, molecular classification and alignment acceleration have benefited from embedded k-mer searches as sub-routines. However, a rapid, general-purpose, open-source, flexible, stand-alone k-mer tool has not been available. RESULTS: Here we present a stand-alone utility, Simrank, which allows users to rapidly identify database strings the most similar to query strings. Performance testing of Simrank and related tools against DNA, RNA, protein and human-languages found Simrank 10X to 928X faster depending on the dataset. CONCLUSIONS: Simrank provides molecular ecologists with a high-throughput, open source choice for comparing large sequence sets to find similarity. BioMed Central 2011-04-27 /pmc/articles/PMC3097142/ /pubmed/21524302 http://dx.doi.org/10.1186/1472-6785-11-11 Text en Copyright ©2011 DeSantis et al; licensee BioMed Central Ltd. http://creativecommons.org/licenses/by/2.0 This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.
title Simrank: Rapid and sensitive general-purpose k-mer search tool
topic Software
url https://www.ncbi.nlm.nih.gov/pmc/articles/PMC3097142/
https://www.ncbi.nlm.nih.gov/pubmed/21524302
http://dx.doi.org/10.1186/1472-6785-11-11