
Phylogeny Estimation Given Sequence Length Heterogeneity

Phylogeny estimation is a major step in many biological studies, and has many well-known challenges. With the dropping cost of sequencing technologies, biologists now have increasingly large datasets available for use in phylogeny estimation. Here we address the challenge of estimating a tree given...

Bibliographic Details
Main Authors:	Smirnov, Vladimir, Warnow, Tandy
Format:	Online Article Text
Language:	English
Published:	Oxford University Press 2020
Subjects:	Regular Articles
Online Access:	https://www.ncbi.nlm.nih.gov/pmc/articles/PMC7875441/ https://www.ncbi.nlm.nih.gov/pubmed/32692823 http://dx.doi.org/10.1093/sysbio/syaa058

_version_	1783649774828781568
author	Smirnov, Vladimir Warnow, Tandy
author_facet	Smirnov, Vladimir Warnow, Tandy
author_sort	Smirnov, Vladimir
collection	PubMed
description	Phylogeny estimation is a major step in many biological studies, and has many well-known challenges. With the dropping cost of sequencing technologies, biologists now have increasingly large datasets available for use in phylogeny estimation. Here we address the challenge of estimating a tree given large datasets with a combination of full-length sequences and fragmentary sequences, which can arise for a variety of reasons, including sample collection, sequencing technologies, and analytical pipelines. We compare two basic approaches: (1) computing an alignment on the full dataset and then computing a maximum likelihood tree on the alignment, or (2) constructing an alignment and tree on the full-length sequences and then using phylogenetic placement to add the remaining sequences (which will generally be fragmentary) into the tree. We explore these two approaches on a range of simulated datasets, each with 1000 sequences and varying in rates of evolution, and two biological datasets. Our study shows some striking performance differences between methods, especially when there is substantial sequence length heterogeneity and high rates of evolution. We find in particular that using UPP to align sequences and RAxML to compute a tree on the alignment provides the best accuracy, substantially outperforming trees computed using phylogenetic placement methods. We also find that FastTree has poor accuracy on alignments containing fragmentary sequences. Overall, our study provides insights into the literature comparing different methods and pipelines for phylogenetic estimation, and suggests directions for future method development. [Phylogeny estimation, sequence length heterogeneity, phylogenetic placement.]
format	Online Article Text
id	pubmed-7875441
institution	National Center for Biotechnology Information
language	English
publishDate	2020
publisher	Oxford University Press
record_format	MEDLINE/PubMed
spelling	pubmed-7875441 2021-02-16 Phylogeny Estimation Given Sequence Length Heterogeneity Smirnov, Vladimir Warnow, Tandy Syst Biol Regular Articles
Oxford University Press 2020-07-21 /pmc/articles/PMC7875441/ /pubmed/32692823 http://dx.doi.org/10.1093/sysbio/syaa058 Text en © The Author(s) 2020. Published by Oxford University Press, on behalf of the Society of Systematic Biologists. http://creativecommons.org/licenses/by-nc/4.0/ This is an Open Access article distributed under the terms of the Creative Commons Attribution Non-Commercial License (http://creativecommons.org/licenses/by-nc/4.0/), which permits non-commercial re-use, distribution, and reproduction in any medium, provided the original work is properly cited.
spellingShingle	Regular Articles Smirnov, Vladimir Warnow, Tandy Phylogeny Estimation Given Sequence Length Heterogeneity
title	Phylogeny Estimation Given Sequence Length Heterogeneity
title_full	Phylogeny Estimation Given Sequence Length Heterogeneity
title_fullStr	Phylogeny Estimation Given Sequence Length Heterogeneity
title_full_unstemmed	Phylogeny Estimation Given Sequence Length Heterogeneity
title_short	Phylogeny Estimation Given Sequence Length Heterogeneity
title_sort	phylogeny estimation given sequence length heterogeneity
topic	Regular Articles
url	https://www.ncbi.nlm.nih.gov/pmc/articles/PMC7875441/ https://www.ncbi.nlm.nih.gov/pubmed/32692823 http://dx.doi.org/10.1093/sysbio/syaa058
work_keys_str_mv	AT smirnovvladimir phylogenyestimationgivensequencelengthheterogeneity AT warnowtandy phylogenyestimationgivensequencelengthheterogeneity
