
learnMSA: learning and aligning large protein families

BACKGROUND: The alignment of large numbers of protein sequences is a challenging task and its importance grows rapidly along with the size of biological datasets. State-of-the-art algorithms have a tendency to produce less accurate alignments with an increasing number of sequences. This is a fundame...


Bibliographic Details
Main Authors:	Becker, Felix; Stanke, Mario
Format:	Online Article Text
Language:	English
Published:	Oxford University Press 2022
Subjects:	Technical Note
Online Access:	https://www.ncbi.nlm.nih.gov/pmc/articles/PMC9673500/ https://www.ncbi.nlm.nih.gov/pubmed/36399060 http://dx.doi.org/10.1093/gigascience/giac104

_version_	1784832953736495104
author	Becker, Felix Stanke, Mario
collection	PubMed
description	BACKGROUND: The alignment of large numbers of protein sequences is a challenging task and its importance grows rapidly along with the size of biological datasets. State-of-the-art algorithms have a tendency to produce less accurate alignments with an increasing number of sequences. This is a fundamental problem since many downstream tasks rely on accurate alignments. RESULTS: We present learnMSA, a novel statistical learning approach of profile hidden Markov models (pHMMs) based on batch gradient descent. Fundamentally different from popular aligners, we fit a custom recurrent neural network architecture for (p)HMMs to potentially millions of sequences with respect to a maximum a posteriori objective and decode an alignment. We rely on automatic differentiation of the log-likelihood, and thus, our approach is different from existing HMM training algorithms like Baum–Welch. Our method does not involve progressive, regressive, or divide-and-conquer heuristics. We use uniform batch sampling to adapt to large datasets in linear time without the requirement of a tree. When tested on ultra-large protein families with up to 3.5 million sequences, learnMSA is both more accurate and faster than state-of-the-art tools. On the established benchmarks HomFam and BaliFam with smaller sequence sets, it matches state-of-the-art performance. All experiments were done on a standard workstation with a GPU. CONCLUSIONS: Our results show that learnMSA does not share the counterintuitive drawback of many popular heuristic aligners, which can substantially lose accuracy when many additional homologs are input. LearnMSA is a future-proof framework for large alignments with many opportunities for further improvements.
format	Online Article Text
id	pubmed-9673500
institution	National Center for Biotechnology Information
language	English
publishDate	2022
publisher	Oxford University Press
record_format	MEDLINE/PubMed
spelling	pubmed-9673500 2022-11-21 learnMSA: learning and aligning large protein families Becker, Felix Stanke, Mario Gigascience Technical Note Oxford University Press 2022-11-18 /pmc/articles/PMC9673500/ /pubmed/36399060 http://dx.doi.org/10.1093/gigascience/giac104 Text en © The Author(s) 2022. Published by Oxford University Press GigaScience. https://creativecommons.org/licenses/by/4.0/ This is an Open Access article distributed under the terms of the Creative Commons Attribution License (https://creativecommons.org/licenses/by/4.0/), which permits unrestricted reuse, distribution, and reproduction in any medium, provided the original work is properly cited.
title	learnMSA: learning and aligning large protein families
topic	Technical Note


Similar Items