Cargando…

Fast alignment-free sequence comparison using spaced-word frequencies

Motivation: Alignment-free methods for sequence comparison are increasingly used for genome analysis and phylogeny reconstruction; they circumvent various difficulties of traditional alignment-based approaches. In particular, alignment-free methods are much faster than pairwise or multiple alignment...

Descripción completa

Detalles Bibliográficos
Autores principales: Leimeister, Chris-Andre, Boden, Marcus, Horwege, Sebastian, Lindner, Sebastian, Morgenstern, Burkhard
Formato: Online Artículo Texto
Lenguaje:English
Publicado: Oxford University Press 2014
Materias:
Acceso en línea:https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4080745/
https://www.ncbi.nlm.nih.gov/pubmed/24700317
http://dx.doi.org/10.1093/bioinformatics/btu177
_version_ 1782324031918178304
author Leimeister, Chris-Andre
Boden, Marcus
Horwege, Sebastian
Lindner, Sebastian
Morgenstern, Burkhard
author_facet Leimeister, Chris-Andre
Boden, Marcus
Horwege, Sebastian
Lindner, Sebastian
Morgenstern, Burkhard
author_sort Leimeister, Chris-Andre
collection PubMed
description Motivation: Alignment-free methods for sequence comparison are increasingly used for genome analysis and phylogeny reconstruction; they circumvent various difficulties of traditional alignment-based approaches. In particular, alignment-free methods are much faster than pairwise or multiple alignments. They are, however, less accurate than methods based on sequence alignment. Most alignment-free approaches work by comparing the word composition of sequences. A well-known problem with these methods is that neighbouring word matches are far from independent. Results: To reduce the statistical dependency between adjacent word matches, we propose to use ‘spaced words’, defined by patterns of ‘match’ and ‘don’t care’ positions, for alignment-free sequence comparison. We describe a fast implementation of this approach using recursive hashing and bit operations, and we show that further improvements can be achieved by using multiple patterns instead of single patterns. To evaluate our approach, we use spaced-word frequencies as a basis for fast phylogeny reconstruction. Using real-world and simulated sequence data, we demonstrate that our multiple-pattern approach produces better phylogenies than approaches relying on contiguous words. Availability and implementation: Our program is freely available at http://spaced.gobics.de/. Contact: chris.leimeister@stud.uni-goettingen.de Supplementary information: Supplementary data are available at Bioinformatics online.
format Online
Article
Text
id pubmed-4080745
institution National Center for Biotechnology Information
language English
publishDate 2014
publisher Oxford University Press
record_format MEDLINE/PubMed
spelling pubmed-40807452014-07-03 Fast alignment-free sequence comparison using spaced-word frequencies Leimeister, Chris-Andre Boden, Marcus Horwege, Sebastian Lindner, Sebastian Morgenstern, Burkhard Bioinformatics Original Papers Motivation: Alignment-free methods for sequence comparison are increasingly used for genome analysis and phylogeny reconstruction; they circumvent various difficulties of traditional alignment-based approaches. In particular, alignment-free methods are much faster than pairwise or multiple alignments. They are, however, less accurate than methods based on sequence alignment. Most alignment-free approaches work by comparing the word composition of sequences. A well-known problem with these methods is that neighbouring word matches are far from independent. Results: To reduce the statistical dependency between adjacent word matches, we propose to use ‘spaced words’, defined by patterns of ‘match’ and ‘don’t care’ positions, for alignment-free sequence comparison. We describe a fast implementation of this approach using recursive hashing and bit operations, and we show that further improvements can be achieved by using multiple patterns instead of single patterns. To evaluate our approach, we use spaced-word frequencies as a basis for fast phylogeny reconstruction. Using real-world and simulated sequence data, we demonstrate that our multiple-pattern approach produces better phylogenies than approaches relying on contiguous words. Availability and implementation: Our program is freely available at http://spaced.gobics.de/. Contact: chris.leimeister@stud.uni-goettingen.de Supplementary information: Supplementary data are available at Bioinformatics online. Oxford University Press 2014-07-15 2014-04-03 /pmc/articles/PMC4080745/ /pubmed/24700317 http://dx.doi.org/10.1093/bioinformatics/btu177 Text en © The Author 2014. Published by Oxford University Press. http://creativecommons.org/licenses/by/3.0/ This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/3.0/), which permits unrestricted reuse, distribution, and reproduction in any medium, provided the original work is properly cited.
spellingShingle Original Papers
Leimeister, Chris-Andre
Boden, Marcus
Horwege, Sebastian
Lindner, Sebastian
Morgenstern, Burkhard
Fast alignment-free sequence comparison using spaced-word frequencies
title Fast alignment-free sequence comparison using spaced-word frequencies
title_full Fast alignment-free sequence comparison using spaced-word frequencies
title_fullStr Fast alignment-free sequence comparison using spaced-word frequencies
title_full_unstemmed Fast alignment-free sequence comparison using spaced-word frequencies
title_short Fast alignment-free sequence comparison using spaced-word frequencies
title_sort fast alignment-free sequence comparison using spaced-word frequencies
topic Original Papers
url https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4080745/
https://www.ncbi.nlm.nih.gov/pubmed/24700317
http://dx.doi.org/10.1093/bioinformatics/btu177
work_keys_str_mv AT leimeisterchrisandre fastalignmentfreesequencecomparisonusingspacedwordfrequencies
AT bodenmarcus fastalignmentfreesequencecomparisonusingspacedwordfrequencies
AT horwegesebastian fastalignmentfreesequencecomparisonusingspacedwordfrequencies
AT lindnersebastian fastalignmentfreesequencecomparisonusingspacedwordfrequencies
AT morgensternburkhard fastalignmentfreesequencecomparisonusingspacedwordfrequencies