Simrank: Rapid and sensitive general-purpose k-mer search tool
BACKGROUND: Terabyte-scale collections of string-encoded data are expected from consortia efforts such as the Human Microbiome Project http://nihroadmap.nih.gov/hmp. Intra- and inter-project data similarity searches are enabled by rapid k-mer matching strategies. Software applications for sequence d...
Main authors: | DeSantis, Todd Z; Keller, Keith; Karaoz, Ulas; Alekseyenko, Alexander V; Singh, Navjeet NS; Brodie, Eoin L; Pei, Zhiheng; Andersen, Gary L; Larsen, Niels |
---|---|
Format: | Text |
Language: | English |
Published: | BioMed Central, 2011 |
Subjects: | Software |
Online access: | https://www.ncbi.nlm.nih.gov/pmc/articles/PMC3097142/ https://www.ncbi.nlm.nih.gov/pubmed/21524302 http://dx.doi.org/10.1186/1472-6785-11-11 |
_version_ | 1782203787982667776 |
---|---|
author | DeSantis, Todd Z Keller, Keith Karaoz, Ulas Alekseyenko, Alexander V Singh, Navjeet NS Brodie, Eoin L Pei, Zhiheng Andersen, Gary L Larsen, Niels |
author_sort | DeSantis, Todd Z |
collection | PubMed |
description | BACKGROUND: Terabyte-scale collections of string-encoded data are expected from consortia efforts such as the Human Microbiome Project (http://nihroadmap.nih.gov/hmp). Intra- and inter-project data similarity searches are enabled by rapid k-mer matching strategies. Software applications for sequence database partitioning, guide tree estimation, molecular classification and alignment acceleration have benefited from embedded k-mer searches as sub-routines. However, a rapid, general-purpose, open-source, flexible, stand-alone k-mer tool has not been available. RESULTS: Here we present a stand-alone utility, Simrank, which allows users to rapidly identify the database strings most similar to query strings. Performance testing of Simrank and related tools against DNA, RNA, protein and human-language datasets found Simrank to be 10X to 928X faster, depending on the dataset. CONCLUSIONS: Simrank provides molecular ecologists with a high-throughput, open-source choice for comparing large sequence sets to find similarity. |
format | Text |
id | pubmed-3097142 |
institution | National Center for Biotechnology Information |
language | English |
publishDate | 2011 |
publisher | BioMed Central |
record_format | MEDLINE/PubMed |
spelling | pubmed-3097142 2011-05-19 BMC Ecol Software BioMed Central 2011-04-27 /pmc/articles/PMC3097142/ /pubmed/21524302 http://dx.doi.org/10.1186/1472-6785-11-11 Text en Copyright ©2011 DeSantis et al; licensee BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited. |
title | Simrank: Rapid and sensitive general-purpose k-mer search tool |
topic | Software |
url | https://www.ncbi.nlm.nih.gov/pmc/articles/PMC3097142/ https://www.ncbi.nlm.nih.gov/pubmed/21524302 http://dx.doi.org/10.1186/1472-6785-11-11 |
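
The description above refers to k-mer matching as a way to find the database strings most similar to a query. A minimal Python sketch of that general idea follows. It is not Simrank's code: the function names (`kmers`, `build_index`, `rank`), the toy sequences, and the scoring rule (shared distinct k-mers divided by the smaller of the two k-mer set sizes) are all assumptions made for illustration.

```python
from collections import Counter


def kmers(s, k):
    """Return the set of distinct length-k substrings of s."""
    return {s[i:i + k] for i in range(len(s) - k + 1)}


def build_index(database, k):
    """Map each k-mer to the set of database entry names that contain it."""
    index = {}
    for name, seq in database.items():
        for km in kmers(seq, k):
            index.setdefault(km, set()).add(name)
    return index


def rank(query, database, index, k, top=3):
    """Rank database entries by the fraction of distinct k-mers shared with query.

    Assumed score (hypothetical, for illustration only):
    shared distinct k-mers / min(|query k-mers|, |entry k-mers|).
    """
    q = kmers(query, k)
    if not q:
        return []
    # Count, per database entry, how many query k-mers it contains.
    shared = Counter()
    for km in q:
        for name in index.get(km, ()):
            shared[name] += 1
    scores = []
    for name, n in shared.items():
        denom = min(len(q), len(kmers(database[name], k)))
        scores.append((n / denom, name))
    # Highest score first; ties broken by name for a stable order.
    scores.sort(key=lambda t: (-t[0], t[1]))
    return scores[:top]


# Toy data, invented for this example.
database = {"a": "ACGTACGT", "b": "TTTTGGGG", "c": "ACGTTTTT"}
index = build_index(database, k=4)
result = rank("ACGTAC", database, index, k=4)
```

Building the inverted index once means each query touches only the entries that share at least one k-mer with it, which is the usual reason k-mer lookups are faster than comparing the query against every database string.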