
The number of k-mer matches between two DNA sequences as a function of k and applications to estimate phylogenetic distances

We study the number N(k) of length-k word matches between pairs of evolutionarily related DNA sequences, as a function of k. We show that the Jukes-Cantor distance between two genome sequences—i.e. the number of substitutions per site that occurred since they evolved from their last common ancestor—...


Bibliographic Details
Main Authors: Röhling, Sophie, Linne, Alexander, Schellhorn, Jendrik, Hosseini, Morteza, Dencker, Thomas, Morgenstern, Burkhard
Format: Online Article Text
Language: English
Published: Public Library of Science 2020
Subjects:
Online Access: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC7010260/
https://www.ncbi.nlm.nih.gov/pubmed/32040534
http://dx.doi.org/10.1371/journal.pone.0228070
_version_ 1783495848715354112
author Röhling, Sophie
Linne, Alexander
Schellhorn, Jendrik
Hosseini, Morteza
Dencker, Thomas
Morgenstern, Burkhard
author_facet Röhling, Sophie
Linne, Alexander
Schellhorn, Jendrik
Hosseini, Morteza
Dencker, Thomas
Morgenstern, Burkhard
author_sort Röhling, Sophie
collection PubMed
description We study the number N(k) of length-k word matches between pairs of evolutionarily related DNA sequences, as a function of k. We show that the Jukes-Cantor distance between two genome sequences—i.e. the number of substitutions per site that occurred since they evolved from their last common ancestor—can be estimated from the slope of a function F that depends on N(k) and that is affine-linear within a certain range of k. Integers k(min) and k(max) can be calculated depending on the length of the input sequences, such that the slope of F in the relevant range can be estimated from the values F(k(min)) and F(k(max)). This approach can be generalized to so-called Spaced-word Matches (SpaM), where mismatches are allowed at positions specified by a user-defined binary pattern. Based on these theoretical results, we implemented a prototype software program for alignment-free sequence comparison called Slope-SpaM. Test runs on real and simulated sequence data show that Slope-SpaM can accurately estimate phylogenetic distances for distances up to around 0.5 substitutions per position. The statistical stability of our results is improved if spaced words are used instead of contiguous words. Unlike previous alignment-free methods that are based on the number of (spaced) word matches, Slope-SpaM produces accurate results, even if sequences share only local homologies.
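The idea in the abstract can be illustrated with a minimal Python sketch. It is not the Slope-SpaM implementation, and several details are assumptions that the record does not state: the function F is taken here to be F(k) = ln(N(k) - B(k)), where B(k) is the expected number of random background matches for a per-position background match probability q = 0.25; the slope of F is read as ln(p), with p the per-position match probability between homologous positions; and the values of k_min and k_max are picked by hand rather than calculated from the sequence lengths. All function names are hypothetical.

```python
import math
import random
from collections import Counter


def count_kmer_matches(seq_a: str, seq_b: str, k: int) -> int:
    """Return N(k): the number of pairs (i, j) with seq_a[i:i+k] == seq_b[j:j+k]."""
    counts_a = Counter(seq_a[i:i + k] for i in range(len(seq_a) - k + 1))
    counts_b = Counter(seq_b[j:j + k] for j in range(len(seq_b) - k + 1))
    return sum(n * counts_b[word] for word, n in counts_a.items())


def f_value(seq_a: str, seq_b: str, k: int, q: float = 0.25) -> float:
    """Assumed form of F(k): log of N(k) minus the expected background matches."""
    n_k = count_kmer_matches(seq_a, seq_b, k)
    background = (len(seq_a) - k + 1) * (len(seq_b) - k + 1) * q ** k
    if n_k <= background:
        raise ValueError(f"no match signal above background at k={k}")
    return math.log(n_k - background)


def slope_distance(seq_a: str, seq_b: str, k_min: int, k_max: int,
                   q: float = 0.25) -> float:
    """Estimate the Jukes-Cantor distance from the two-point slope of F."""
    slope = (f_value(seq_a, seq_b, k_max, q)
             - f_value(seq_a, seq_b, k_min, q)) / (k_max - k_min)
    p_match = min(math.exp(slope), 1.0)  # slope is interpreted as ln(p)
    mismatch = 1.0 - p_match
    if mismatch >= 0.75:
        raise ValueError("mismatch rate too high for the Jukes-Cantor correction")
    # Standard Jukes-Cantor correction: d = -3/4 * ln(1 - 4/3 * mismatch rate).
    return -0.75 * math.log(1.0 - 4.0 / 3.0 * mismatch)


def mutate(seq: str, mismatch_rate: float, rng: random.Random) -> str:
    """Substitute each base independently with probability mismatch_rate (no indels)."""
    out = []
    for base in seq:
        if rng.random() < mismatch_rate:
            out.append(rng.choice([b for b in "ACGT" if b != base]))
        else:
            out.append(base)
    return "".join(out)


if __name__ == "__main__":
    rng = random.Random(1)
    ancestor = "".join(rng.choice("ACGT") for _ in range(5000))
    descendant = mutate(ancestor, 0.09, rng)
    # k_min = 12 and k_max = 20 are hand-picked illustrative values.
    print(slope_distance(ancestor, descendant, 12, 20))
```

The two-point slope mirrors the abstract's statement that the slope can be estimated from F(k_min) and F(k_max) alone. The simulated pair contains substitutions only, so the sketch says nothing about behavior on sequences with indels or only local homology.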
format Online
Article
Text
id pubmed-7010260
institution National Center for Biotechnology Information
language English
publishDate 2020
publisher Public Library of Science
record_format MEDLINE/PubMed
spelling pubmed-70102602020-02-21 The number of k-mer matches between two DNA sequences as a function of k and applications to estimate phylogenetic distances Röhling, Sophie Linne, Alexander Schellhorn, Jendrik Hosseini, Morteza Dencker, Thomas Morgenstern, Burkhard PLoS One Research Article We study the number N(k) of length-k word matches between pairs of evolutionarily related DNA sequences, as a function of k. We show that the Jukes-Cantor distance between two genome sequences—i.e. the number of substitutions per site that occurred since they evolved from their last common ancestor—can be estimated from the slope of a function F that depends on N(k) and that is affine-linear within a certain range of k. Integers k(min) and k(max) can be calculated depending on the length of the input sequences, such that the slope of F in the relevant range can be estimated from the values F(k(min)) and F(k(max)). This approach can be generalized to so-called Spaced-word Matches (SpaM), where mismatches are allowed at positions specified by a user-defined binary pattern. Based on these theoretical results, we implemented a prototype software program for alignment-free sequence comparison called Slope-SpaM. Test runs on real and simulated sequence data show that Slope-SpaM can accurately estimate phylogenetic distances for distances up to around 0.5 substitutions per position. The statistical stability of our results is improved if spaced words are used instead of contiguous words. Unlike previous alignment-free methods that are based on the number of (spaced) word matches, Slope-SpaM produces accurate results, even if sequences share only local homologies. 
Public Library of Science 2020-02-10 /pmc/articles/PMC7010260/ /pubmed/32040534 http://dx.doi.org/10.1371/journal.pone.0228070 Text en © 2020 Röhling et al http://creativecommons.org/licenses/by/4.0/ This is an open access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/4.0/) , which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.
spellingShingle Research Article
Röhling, Sophie
Linne, Alexander
Schellhorn, Jendrik
Hosseini, Morteza
Dencker, Thomas
Morgenstern, Burkhard
The number of k-mer matches between two DNA sequences as a function of k and applications to estimate phylogenetic distances
title The number of k-mer matches between two DNA sequences as a function of k and applications to estimate phylogenetic distances
title_full The number of k-mer matches between two DNA sequences as a function of k and applications to estimate phylogenetic distances
title_fullStr The number of k-mer matches between two DNA sequences as a function of k and applications to estimate phylogenetic distances
title_full_unstemmed The number of k-mer matches between two DNA sequences as a function of k and applications to estimate phylogenetic distances
title_short The number of k-mer matches between two DNA sequences as a function of k and applications to estimate phylogenetic distances
title_sort number of k-mer matches between two dna sequences as a function of k and applications to estimate phylogenetic distances
topic Research Article
url https://www.ncbi.nlm.nih.gov/pmc/articles/PMC7010260/
https://www.ncbi.nlm.nih.gov/pubmed/32040534
http://dx.doi.org/10.1371/journal.pone.0228070