SANS: high-throughput retrieval of protein sequences allowing 50% mismatches

Motivation: The genomic era in molecular biology has brought on a rapidly widening gap between the amount of sequence data and first-hand experimental characterization of proteins. Fortunately, the theory of evolution provides a simple solution: functional and structural information can be transferr...

Bibliographic Details
Main Authors:	Koskinen, J. Patrik, Holm, Liisa
Format:	Online Article Text
Language:	English
Published:	Oxford University Press 2012
Subjects:	Original Papers
Online Access:	https://www.ncbi.nlm.nih.gov/pmc/articles/PMC3436844/ https://www.ncbi.nlm.nih.gov/pubmed/22962464 http://dx.doi.org/10.1093/bioinformatics/bts417

author	Koskinen, J. Patrik Holm, Liisa
collection	PubMed
description	Motivation: The genomic era in molecular biology has brought on a rapidly widening gap between the amount of sequence data and first-hand experimental characterization of proteins. Fortunately, the theory of evolution provides a simple solution: functional and structural information can be transferred between homologous proteins. Sequence similarity searching followed by k-nearest neighbor classification is the most widely used tool to predict the function or structure of anonymous gene products that come out of genome sequencing projects. Results: We present a novel word filter, suffix array neighborhood search (SANS), to identify protein sequence similarities in the range of 50–100% identity with sensitivity comparable to BLAST and 10 times the speed of USEARCH. In contrast to these previous approaches, the complexity of the search is proportional only to the length of the query sequence and independent of database size, enabling fast searching and functional annotation into the future despite rapidly expanding databases. Availability and implementation: The software is freely available to non-commercial users from our website http://ekhidna.biocenter.helsinki.fi/downloads/sans. Contact: liisa.holm@helsinki.fi.
format	Online Article Text
id	pubmed-3436844
institution	National Center for Biotechnology Information
language	English
publishDate	2012
publisher	Oxford University Press
record_format	MEDLINE/PubMed
spelling	pubmed-3436844 2012-12-12 SANS: high-throughput retrieval of protein sequences allowing 50% mismatches Koskinen, J. Patrik Holm, Liisa Bioinformatics Original Papers Oxford University Press 2012-09-15 2012-09-03 /pmc/articles/PMC3436844/ /pubmed/22962464 http://dx.doi.org/10.1093/bioinformatics/bts417 Text en © The Author(s) (2012). Published by Oxford University Press. http://creativecommons.org/licenses/by/3.0 This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/3.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.
title	SANS: high-throughput retrieval of protein sequences allowing 50% mismatches
topic	Original Papers
url	https://www.ncbi.nlm.nih.gov/pmc/articles/PMC3436844/ https://www.ncbi.nlm.nih.gov/pubmed/22962464 http://dx.doi.org/10.1093/bioinformatics/bts417
work_keys_str_mv	AT koskinenjpatrik sanshighthroughputretrievalofproteinsequencesallowing50mismatches AT holmliisa sanshighthroughputretrievalofproteinsequencesallowing50mismatches
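
The abstract in this record names the method (a word filter based on suffix array neighborhood search) without showing its mechanics. The toy sketch below illustrates one plausible reading of the idea: sort all database suffixes, locate where each query suffix would fall in that order, and let the nearby database suffixes vote for the sequences they come from. Everything here is an assumption made for illustration: the function names, the `$` separator, the `window` parameter, and the vote-counting score are hypothetical and are not taken from the SANS software. The naive sorting and binary search also do not reproduce the query-length-only complexity claimed in the abstract.

```python
from bisect import bisect_left
from collections import Counter


def build_index(sequences):
    """Return the sorted suffixes of all sequences and the owner of each suffix.

    Toy construction: suffixes are materialised as strings and sorted naively,
    which is only practical for very small inputs.
    """
    # Concatenate the sequences, terminating each one with "$" (hypothetical separator).
    text = "".join(seq + "$" for seq in sequences)
    # owner[p] is the index of the sequence that covers text position p.
    owner = []
    for idx, seq in enumerate(sequences):
        owner.extend([idx] * (len(seq) + 1))
    # Suffix start positions, skipping the separators, in lexicographic order.
    starts = [p for p in range(len(text)) if text[p] != "$"]
    starts.sort(key=lambda p: text[p:])
    suffixes = [text[p:] for p in starts]
    return suffixes, [owner[p] for p in starts]


def neighborhood_search(query, suffixes, owners, window=2):
    """Rank database sequences by votes from suffixes near each query suffix."""
    votes = Counter()
    for q in range(len(query)):
        # Position where this query suffix would be inserted in sorted order.
        pos = bisect_left(suffixes, query[q:])
        # Up to `window` database suffixes on each side cast one vote each.
        for seq_id in owners[max(0, pos - window):pos + window]:
            votes[seq_id] += 1
    # Highest vote count first.
    return votes.most_common()
```

With the made-up database `["MKVLAAGIVG", "MKVLSAGIVG", "PQRSTWYHHH"]` and the query `"MKVLAAGIVG"`, the identical sequence ranks first, the one-mismatch variant second, and the unrelated sequence last. A real tool would follow such a filter with an alignment or scoring step.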
