Cargando…
Fast and accurate phylogeny reconstruction using filtered spaced-word matches
MOTIVATION: Word-based or ‘alignment-free’ algorithms are increasingly used for phylogeny reconstruction and genome comparison, since they are much faster than traditional approaches that are based on full sequence alignments. Existing alignment-free programs, however, are less accurate than alignme...
Autores principales: | , , |
---|---|
Formato: | Online Artículo Texto |
Lenguaje: | English |
Publicado: |
Oxford University Press
2017
|
Materias: | |
Acceso en línea: | https://www.ncbi.nlm.nih.gov/pmc/articles/PMC5409309/ https://www.ncbi.nlm.nih.gov/pubmed/28073754 http://dx.doi.org/10.1093/bioinformatics/btw776 |
_version_ | 1783232454085050368 |
---|---|
author | Leimeister, Chris-André Sohrabi-Jahromi, Salma Morgenstern, Burkhard |
author_facet | Leimeister, Chris-André Sohrabi-Jahromi, Salma Morgenstern, Burkhard |
author_sort | Leimeister, Chris-André |
collection | PubMed |
description | MOTIVATION: Word-based or ‘alignment-free’ algorithms are increasingly used for phylogeny reconstruction and genome comparison, since they are much faster than traditional approaches that are based on full sequence alignments. Existing alignment-free programs, however, are less accurate than alignment-based methods. RESULTS: We propose Filtered Spaced Word Matches (FSWM), a fast alignment-free approach to estimate phylogenetic distances between large genomic sequences. For a pre-defined binary pattern of match and don’t-care positions, FSWM rapidly identifies spaced word-matches between input sequences, i.e. gap-free local alignments with matching nucleotides at the match positions and with mismatches allowed at the don’t-care positions. We then estimate the number of nucleotide substitutions per site by considering the nucleotides aligned at the don’t-care positions of the identified spaced-word matches. To reduce the noise from spurious random matches, we use a filtering procedure where we discard all spaced-word matches for which the overall similarity between the aligned segments is below a threshold. We show that our approach can accurately estimate substitution frequencies even for distantly related sequences that cannot be analyzed with existing alignment-free methods; phylogenetic trees constructed with FSWM distances are of high quality. A program run on a pair of eukaryotic genomes of a few hundred Mb each takes a few minutes. AVAILABILITY AND IMPLEMENTATION: The program source code for FSWM including a documentation, as well as the software that we used to generate artificial genome sequences are freely available at http://fswm.gobics.de/ SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online. |
format | Online Article Text |
id | pubmed-5409309 |
institution | National Center for Biotechnology Information |
language | English |
publishDate | 2017 |
publisher | Oxford University Press |
record_format | MEDLINE/PubMed |
spelling | pubmed-54093092017-05-03 Fast and accurate phylogeny reconstruction using filtered spaced-word matches Leimeister, Chris-André Sohrabi-Jahromi, Salma Morgenstern, Burkhard Bioinformatics Original Papers MOTIVATION: Word-based or ‘alignment-free’ algorithms are increasingly used for phylogeny reconstruction and genome comparison, since they are much faster than traditional approaches that are based on full sequence alignments. Existing alignment-free programs, however, are less accurate than alignment-based methods. RESULTS: We propose Filtered Spaced Word Matches (FSWM), a fast alignment-free approach to estimate phylogenetic distances between large genomic sequences. For a pre-defined binary pattern of match and don’t-care positions, FSWM rapidly identifies spaced word-matches between input sequences, i.e. gap-free local alignments with matching nucleotides at the match positions and with mismatches allowed at the don’t-care positions. We then estimate the number of nucleotide substitutions per site by considering the nucleotides aligned at the don’t-care positions of the identified spaced-word matches. To reduce the noise from spurious random matches, we use a filtering procedure where we discard all spaced-word matches for which the overall similarity between the aligned segments is below a threshold. We show that our approach can accurately estimate substitution frequencies even for distantly related sequences that cannot be analyzed with existing alignment-free methods; phylogenetic trees constructed with FSWM distances are of high quality. A program run on a pair of eukaryotic genomes of a few hundred Mb each takes a few minutes. AVAILABILITY AND IMPLEMENTATION: The program source code for FSWM including a documentation, as well as the software that we used to generate artificial genome sequences are freely available at http://fswm.gobics.de/ SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online. Oxford University Press 2017-04-01 2017-01-04 /pmc/articles/PMC5409309/ /pubmed/28073754 http://dx.doi.org/10.1093/bioinformatics/btw776 Text en © The Author 2017. Published by Oxford University Press. http://creativecommons.org/licenses/by-nc/4.0/ This is an Open Access article distributed under the terms of the Creative Commons Attribution-NonCommercial-NoDerivs licence (http://creativecommons.org/licenses/by-nc/4.0/), which permits non-commercial reproduction and distribution of the work, in any medium, provided the original work is not altered or transformed in any way, and that the work properly cited. For commercial re-use, please contact journals.permissions@oup.com |
spellingShingle | Original Papers Leimeister, Chris-André Sohrabi-Jahromi, Salma Morgenstern, Burkhard Fast and accurate phylogeny reconstruction using filtered spaced-word matches |
title | Fast and accurate phylogeny reconstruction using filtered spaced-word matches |
title_full | Fast and accurate phylogeny reconstruction using filtered spaced-word matches |
title_fullStr | Fast and accurate phylogeny reconstruction using filtered spaced-word matches |
title_full_unstemmed | Fast and accurate phylogeny reconstruction using filtered spaced-word matches |
title_short | Fast and accurate phylogeny reconstruction using filtered spaced-word matches |
title_sort | fast and accurate phylogeny reconstruction using filtered spaced-word matches |
topic | Original Papers |
url | https://www.ncbi.nlm.nih.gov/pmc/articles/PMC5409309/ https://www.ncbi.nlm.nih.gov/pubmed/28073754 http://dx.doi.org/10.1093/bioinformatics/btw776 |
work_keys_str_mv | AT leimeisterchrisandre fastandaccuratephylogenyreconstructionusingfilteredspacedwordmatches AT sohrabijahromisalma fastandaccuratephylogenyreconstructionusingfilteredspacedwordmatches AT morgensternburkhard fastandaccuratephylogenyreconstructionusingfilteredspacedwordmatches |