UPP2: fast and accurate alignment of datasets with fragmentary sequences

MOTIVATION: Multiple sequence alignment (MSA) is a basic step in many bioinformatics pipelines. However, achieving highly accurate alignments on large datasets, especially those with sequence length heterogeneity, is a challenging task. Ultra-large multiple sequence alignment using Phylogeny-aware P...

Bibliographic Details
Main authors:	Park, Minhyuk; Ivanovic, Stefan; Chu, Gillian; Shen, Chengze; Warnow, Tandy
Format:	Online Article Text
Language:	English
Published:	Oxford University Press 2023
Subjects:	Original Paper
Online access:	https://www.ncbi.nlm.nih.gov/pmc/articles/PMC9846425/ https://www.ncbi.nlm.nih.gov/pubmed/36625535 http://dx.doi.org/10.1093/bioinformatics/btad007

author	Park, Minhyuk; Ivanovic, Stefan; Chu, Gillian; Shen, Chengze; Warnow, Tandy
author_sort	Park, Minhyuk
collection	PubMed
description	MOTIVATION: Multiple sequence alignment (MSA) is a basic step in many bioinformatics pipelines. However, achieving highly accurate alignments on large datasets, especially those with sequence length heterogeneity, is a challenging task. Ultra-large multiple sequence alignment using Phylogeny-aware Profiles (UPP) is a method for MSA estimation that builds an ensemble of Hidden Markov Models (eHMM) to represent an estimated alignment on the full-length sequences in the input, and then adds the remaining sequences into the alignment using selected HMMs in the ensemble. Although UPP provides good accuracy, it is computationally intensive on large datasets. RESULTS: We present UPP2, a direct improvement on UPP. The main advance is a fast technique for selecting HMMs in the ensemble that allows us to achieve the same accuracy as UPP but with greatly reduced runtime. We show that UPP2 produces more accurate alignments compared to leading MSA methods on datasets exhibiting substantial sequence length heterogeneity and is among the most accurate otherwise. AVAILABILITY AND IMPLEMENTATION: https://github.com/gillichu/sepp. SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.
format	Online Article Text
id	pubmed-9846425
institution	National Center for Biotechnology Information
language	English
publishDate	2023
publisher	Oxford University Press
record_format	MEDLINE/PubMed
spelling	pubmed-9846425; 2023-01-20; Bioinformatics; Original Paper; Oxford University Press; 2023-01-10; /pmc/articles/PMC9846425/; /pubmed/36625535; http://dx.doi.org/10.1093/bioinformatics/btad007; Text; en; © The Author(s) 2023. Published by Oxford University Press. https://creativecommons.org/licenses/by/4.0/ This is an Open Access article distributed under the terms of the Creative Commons Attribution License (https://creativecommons.org/licenses/by/4.0/), which permits unrestricted reuse, distribution, and reproduction in any medium, provided the original work is properly cited.
title	UPP2: fast and accurate alignment of datasets with fragmentary sequences
topic	Original Paper
url	https://www.ncbi.nlm.nih.gov/pmc/articles/PMC9846425/ https://www.ncbi.nlm.nih.gov/pubmed/36625535 http://dx.doi.org/10.1093/bioinformatics/btad007
