Cargando…

Fast and accurate phylogeny reconstruction using filtered spaced-word matches

MOTIVATION: Word-based or ‘alignment-free’ algorithms are increasingly used for phylogeny reconstruction and genome comparison, since they are much faster than traditional approaches that are based on full sequence alignments. Existing alignment-free programs, however, are less accurate than alignme...

Descripción completa

Detalles Bibliográficos
Autores principales: Leimeister, Chris-André, Sohrabi-Jahromi, Salma, Morgenstern, Burkhard
Formato: Online Artículo Texto
Lenguaje:English
Publicado: Oxford University Press 2017
Materias:
Acceso en línea:https://www.ncbi.nlm.nih.gov/pmc/articles/PMC5409309/
https://www.ncbi.nlm.nih.gov/pubmed/28073754
http://dx.doi.org/10.1093/bioinformatics/btw776
_version_ 1783232454085050368
author Leimeister, Chris-André
Sohrabi-Jahromi, Salma
Morgenstern, Burkhard
author_facet Leimeister, Chris-André
Sohrabi-Jahromi, Salma
Morgenstern, Burkhard
author_sort Leimeister, Chris-André
collection PubMed
description MOTIVATION: Word-based or ‘alignment-free’ algorithms are increasingly used for phylogeny reconstruction and genome comparison, since they are much faster than traditional approaches that are based on full sequence alignments. Existing alignment-free programs, however, are less accurate than alignment-based methods. RESULTS: We propose Filtered Spaced Word Matches (FSWM), a fast alignment-free approach to estimate phylogenetic distances between large genomic sequences. For a pre-defined binary pattern of match and don’t-care positions, FSWM rapidly identifies spaced word-matches between input sequences, i.e. gap-free local alignments with matching nucleotides at the match positions and with mismatches allowed at the don’t-care positions. We then estimate the number of nucleotide substitutions per site by considering the nucleotides aligned at the don’t-care positions of the identified spaced-word matches. To reduce the noise from spurious random matches, we use a filtering procedure where we discard all spaced-word matches for which the overall similarity between the aligned segments is below a threshold. We show that our approach can accurately estimate substitution frequencies even for distantly related sequences that cannot be analyzed with existing alignment-free methods; phylogenetic trees constructed with FSWM distances are of high quality. A program run on a pair of eukaryotic genomes of a few hundred Mb each takes a few minutes. AVAILABILITY AND IMPLEMENTATION: The program source code for FSWM including a documentation, as well as the software that we used to generate artificial genome sequences are freely available at http://fswm.gobics.de/ SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.
format Online
Article
Text
id pubmed-5409309
institution National Center for Biotechnology Information
language English
publishDate 2017
publisher Oxford University Press
record_format MEDLINE/PubMed
spelling pubmed-54093092017-05-03 Fast and accurate phylogeny reconstruction using filtered spaced-word matches Leimeister, Chris-André Sohrabi-Jahromi, Salma Morgenstern, Burkhard Bioinformatics Original Papers MOTIVATION: Word-based or ‘alignment-free’ algorithms are increasingly used for phylogeny reconstruction and genome comparison, since they are much faster than traditional approaches that are based on full sequence alignments. Existing alignment-free programs, however, are less accurate than alignment-based methods. RESULTS: We propose Filtered Spaced Word Matches (FSWM), a fast alignment-free approach to estimate phylogenetic distances between large genomic sequences. For a pre-defined binary pattern of match and don’t-care positions, FSWM rapidly identifies spaced word-matches between input sequences, i.e. gap-free local alignments with matching nucleotides at the match positions and with mismatches allowed at the don’t-care positions. We then estimate the number of nucleotide substitutions per site by considering the nucleotides aligned at the don’t-care positions of the identified spaced-word matches. To reduce the noise from spurious random matches, we use a filtering procedure where we discard all spaced-word matches for which the overall similarity between the aligned segments is below a threshold. We show that our approach can accurately estimate substitution frequencies even for distantly related sequences that cannot be analyzed with existing alignment-free methods; phylogenetic trees constructed with FSWM distances are of high quality. A program run on a pair of eukaryotic genomes of a few hundred Mb each takes a few minutes. AVAILABILITY AND IMPLEMENTATION: The program source code for FSWM including a documentation, as well as the software that we used to generate artificial genome sequences are freely available at http://fswm.gobics.de/ SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online. Oxford University Press 2017-04-01 2017-01-04 /pmc/articles/PMC5409309/ /pubmed/28073754 http://dx.doi.org/10.1093/bioinformatics/btw776 Text en © The Author 2017. Published by Oxford University Press. http://creativecommons.org/licenses/by-nc/4.0/ This is an Open Access article distributed under the terms of the Creative Commons Attribution-NonCommercial-NoDerivs licence (http://creativecommons.org/licenses/by-nc/4.0/), which permits non-commercial reproduction and distribution of the work, in any medium, provided the original work is not altered or transformed in any way, and that the work properly cited. For commercial re-use, please contact journals.permissions@oup.com
spellingShingle Original Papers
Leimeister, Chris-André
Sohrabi-Jahromi, Salma
Morgenstern, Burkhard
Fast and accurate phylogeny reconstruction using filtered spaced-word matches
title Fast and accurate phylogeny reconstruction using filtered spaced-word matches
title_full Fast and accurate phylogeny reconstruction using filtered spaced-word matches
title_fullStr Fast and accurate phylogeny reconstruction using filtered spaced-word matches
title_full_unstemmed Fast and accurate phylogeny reconstruction using filtered spaced-word matches
title_short Fast and accurate phylogeny reconstruction using filtered spaced-word matches
title_sort fast and accurate phylogeny reconstruction using filtered spaced-word matches
topic Original Papers
url https://www.ncbi.nlm.nih.gov/pmc/articles/PMC5409309/
https://www.ncbi.nlm.nih.gov/pubmed/28073754
http://dx.doi.org/10.1093/bioinformatics/btw776
work_keys_str_mv AT leimeisterchrisandre fastandaccuratephylogenyreconstructionusingfilteredspacedwordmatches
AT sohrabijahromisalma fastandaccuratephylogenyreconstructionusingfilteredspacedwordmatches
AT morgensternburkhard fastandaccuratephylogenyreconstructionusingfilteredspacedwordmatches