
Fully automated sequence alignment methods are comparable to, and much faster than, traditional methods in large data sets: an example with hepatitis B virus

Aligning sequences for phylogenetic analysis (multiple sequence alignment; MSA) is an important, but increasingly computationally expensive step with the recent surge in DNA sequence data. Much of this sequence data is publicly available, but can be extremely fragmentary (i.e., a combination of full...


Bibliographic Details
Main authors:	Catanach, Therese A., Sweet, Andrew D., Nguyen, Nam-phuong D., Peery, Rhiannon M., Debevec, Andrew H., Thomer, Andrea K., Owings, Amanda C., Boyd, Bret M., Katz, Aron D., Soto-Adames, Felipe N., Allen, Julie M.
Format:	Online Article Text
Language:	English
Published:	PeerJ Inc. 2019
Subjects:	Bioinformatics
Online access:	https://www.ncbi.nlm.nih.gov/pmc/articles/PMC6321758/ https://www.ncbi.nlm.nih.gov/pubmed/30627489 http://dx.doi.org/10.7717/peerj.6142

collection	PubMed
description	Aligning sequences for phylogenetic analysis (multiple sequence alignment; MSA) is an important but increasingly computationally expensive step, given the recent surge in DNA sequence data. Much of this sequence data is publicly available, but can be extremely fragmentary (i.e., a combination of full genomes and genomic fragments), which can compound the computational issues related to MSA. Traditionally, alignments are produced with automated algorithms and then checked and/or corrected “by eye” prior to phylogenetic inference. However, this manual curation is inefficient at the data scales required of modern phylogenetics and results in alignments that are not reproducible. Recently, methods have been developed for fully automating alignments of large data sets, but it is unclear if these methods produce alignments that result in compatible phylogenies when compared to more traditional alignment approaches that combine automated and manual methods. Here we use approximately 33,000 publicly available sequences from the hepatitis B virus (HBV), a globally distributed and rapidly evolving virus, to compare different alignment approaches. Using one data set composed exclusively of whole genomes and a second that also included sequence fragments, we compared three MSA methods: (1) a purely automated approach using traditional software, (2) an automated approach including by-eye manual editing, and (3) more recent fully automated approaches. To understand how these methods affect phylogenetic results, we compared resulting tree topologies based on these different alignment methods using multiple metrics. We further determined if the monophyly of existing HBV genotypes was supported in phylogenies estimated from each alignment type and under different statistical support thresholds. Traditional and fully automated alignments produced similar HBV phylogenies. Although there was variability between branch support thresholds, allowing lower support thresholds tended to result in more differences among trees. Therefore, differences between the trees could be best explained by phylogenetic uncertainty unrelated to the MSA method used. Nevertheless, automated alignment approaches did not require human intervention and were therefore considerably less time-intensive than traditional approaches. Because of this, we conclude that fully automated algorithms for MSA are fully compatible with older methods even in extremely difficult-to-align data sets. Additionally, we found that most HBV diagnostic genotypes did not correspond to evolutionarily sound groups, regardless of alignment type and support threshold. This suggests there may be errors in genotype classification in the database or that HBV genotypes may need a revision.
format	Online Article Text
id	pubmed-6321758
institution	National Center for Biotechnology Information
language	English
publishDate	2019
publisher	PeerJ Inc.
record_format	MEDLINE/PubMed
spelling	pubmed-6321758 2019-01-09 PeerJ Bioinformatics PeerJ Inc. 2019-01-03 /pmc/articles/PMC6321758/ /pubmed/30627489 http://dx.doi.org/10.7717/peerj.6142 Text en ©2019 Catanach et al. http://creativecommons.org/licenses/by/4.0/ This is an open access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, reproduction and adaptation in any medium and for any purpose provided that it is properly attributed. For attribution, the original author(s), title, publication source (PeerJ) and either DOI or URL of the article must be cited.
title	Fully automated sequence alignment methods are comparable to, and much faster than, traditional methods in large data sets: an example with hepatitis B virus
topic	Bioinformatics
url	https://www.ncbi.nlm.nih.gov/pmc/articles/PMC6321758/ https://www.ncbi.nlm.nih.gov/pubmed/30627489 http://dx.doi.org/10.7717/peerj.6142

