
rasbhari: Optimizing Spaced Seeds for Database Searching, Read Mapping and Alignment-Free Sequence Comparison

Many algorithms for sequence analysis rely on word matching or word statistics. Often, these approaches can be improved if binary patterns representing match and don’t-care positions are used as a filter, such that only those positions of words are considered that correspond to the match positions o...


Bibliographic Details
Main authors: Hahn, Lars, Leimeister, Chris-André, Ounit, Rachid, Lonardi, Stefano, Morgenstern, Burkhard
Format: Online Article Text
Language: English
Published: Public Library of Science 2016
Subjects:
Online access: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC5070788/
https://www.ncbi.nlm.nih.gov/pubmed/27760124
http://dx.doi.org/10.1371/journal.pcbi.1005107
_version_ 1782461197995474944
author Hahn, Lars
Leimeister, Chris-André
Ounit, Rachid
Lonardi, Stefano
Morgenstern, Burkhard
author_sort Hahn, Lars
collection PubMed
description Many algorithms for sequence analysis rely on word matching or word statistics. Often, these approaches can be improved if binary patterns representing match and don’t-care positions are used as a filter, such that only those positions of words are considered that correspond to the match positions of the patterns. The performance of these approaches, however, depends on the underlying patterns. Herein, we show that the overlap complexity of a pattern set that was introduced by Ilie and Ilie is closely related to the variance of the number of matches between two evolutionarily related sequences with respect to this pattern set. We propose a modified hill-climbing algorithm to optimize pattern sets for database searching, read mapping and alignment-free sequence comparison of nucleic-acid sequences; our implementation of this algorithm is called rasbhari. Depending on the application at hand, rasbhari can either minimize the overlap complexity of pattern sets, maximize their sensitivity in database searching or minimize the variance of the number of pattern-based matches in alignment-free sequence comparison. We show that, for database searching, rasbhari generates pattern sets with slightly higher sensitivity than existing approaches. In our Spaced Words approach to alignment-free sequence comparison, pattern sets calculated with rasbhari led to more accurate estimates of phylogenetic distances than the randomly generated pattern sets that we previously used. Finally, we used rasbhari to generate patterns for short read classification with CLARK-S. Here too, the sensitivity of the results could be improved, compared to the default patterns of the program. We integrated rasbhari into Spaced Words; the source code of rasbhari is freely available at http://rasbhari.gobics.de/
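The ideas in the description above can be illustrated with a small sketch. It is not the rasbhari implementation: the function names are hypothetical, the overlap complexity follows one common formulation attributed to Ilie and Ilie (a sum of 2 to the power of the number of shared match positions, taken over all pattern pairs and all relative shifts), and the optimizer is a plain hill climb rather than the modified variant the authors describe. Patterns are assumed to be strings over '1' (match) and '0' (don't-care).

```python
import random
from itertools import combinations_with_replacement


def spaced_word(seq, pos, pattern):
    """Characters of seq at the match ('1') positions of pattern, starting at pos."""
    return "".join(seq[pos + k] for k, c in enumerate(pattern) if c == "1")


def spaced_word_matches(seq_a, seq_b, pattern):
    """Count position pairs (i, j) whose spaced words under pattern are identical."""
    counts = {}
    for i in range(len(seq_a) - len(pattern) + 1):
        word = spaced_word(seq_a, i, pattern)
        counts[word] = counts.get(word, 0) + 1
    return sum(
        counts.get(spaced_word(seq_b, j, pattern), 0)
        for j in range(len(seq_b) - len(pattern) + 1)
    )


def shared_ones(p, q, shift):
    """Positions where p and q (q shifted right by shift) both have a '1'."""
    return sum(
        1
        for k, c in enumerate(p)
        if c == "1" and 0 <= k - shift < len(q) and q[k - shift] == "1"
    )


def overlap_complexity(patterns):
    """Assumed formulation: sum of 2**shared_ones over all pairs and all shifts."""
    total = 0
    for p, q in combinations_with_replacement(patterns, 2):
        for shift in range(1 - len(q), len(p)):
            total += 2 ** shared_ones(p, q, shift)
    return total


def hill_climb(patterns, steps=200, seed=0):
    """Plain hill climbing: swap an interior '1' and '0', keep strict improvements."""
    rng = random.Random(seed)
    best = list(patterns)
    best_oc = overlap_complexity(best)
    for _ in range(steps):
        cand = list(best)
        idx = rng.randrange(len(cand))
        p = list(cand[idx])
        # Only interior positions move, so each pattern keeps '1' at both ends.
        ones = [k for k in range(1, len(p) - 1) if p[k] == "1"]
        zeros = [k for k in range(1, len(p) - 1) if p[k] == "0"]
        if not ones or not zeros:
            continue
        a, b = rng.choice(ones), rng.choice(zeros)
        p[a], p[b] = "0", "1"
        cand[idx] = "".join(p)
        oc = overlap_complexity(cand)
        if oc < best_oc:
            best, best_oc = cand, oc
    return best, best_oc
```

For example, `spaced_word("ACGT", 0, "101")` ignores the middle position and yields `"AG"`. Calling `hill_climb(["1110001", "1101001"])` returns a pattern set of the same weight whose overlap complexity, under the assumed definition, is no higher than that of the input.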
format Online
Article
Text
id pubmed-5070788
institution National Center for Biotechnology Information
language English
publishDate 2016
publisher Public Library of Science
record_format MEDLINE/PubMed
spelling pubmed-5070788 2016-10-27 PLoS Comput Biol Research Article Public Library of Science 2016-10-19 /pmc/articles/PMC5070788/ /pubmed/27760124 http://dx.doi.org/10.1371/journal.pcbi.1005107 Text en © 2016 Hahn et al http://creativecommons.org/licenses/by/4.0/ This is an open access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.
title rasbhari: Optimizing Spaced Seeds for Database Searching, Read Mapping and Alignment-Free Sequence Comparison
topic Research Article
url https://www.ncbi.nlm.nih.gov/pmc/articles/PMC5070788/
https://www.ncbi.nlm.nih.gov/pubmed/27760124
http://dx.doi.org/10.1371/journal.pcbi.1005107