
Obtaining extremely large and accurate protein multiple sequence alignments from curated hierarchical alignments

Bibliographic Details
Main Authors: Neuwald, Andrew F; Lanczycki, Christopher J; Hodges, Theresa K; Marchler-Bauer, Aron
Format: Online Article Text
Language: English
Published: Oxford University Press 2020
Subjects:
Online Access: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC7297217/
https://www.ncbi.nlm.nih.gov/pubmed/32500917
http://dx.doi.org/10.1093/database/baaa042
author Neuwald, Andrew F
Lanczycki, Christopher J
Hodges, Theresa K
Marchler-Bauer, Aron
author_sort Neuwald, Andrew F
collection PubMed
description For optimal performance, machine learning methods for protein sequence/structural analysis typically require as input a large multiple sequence alignment (MSA), which is often created using query-based iterative programs, such as PSI-BLAST or JackHMMER. However, because these programs align database sequences using a query sequence as a template, they may fail to detect, or may tend to misalign, sequences distantly related to the query. More generally, automated MSA programs often fail to align sequences correctly due to the unpredictable nature of protein evolution. Addressing this problem typically requires manual curation in the light of structural data. However, curated MSAs tend to contain too few sequences to serve as input for statistically based methods. We address these shortcomings by making publicly available a set of 252 curated hierarchical MSAs (hiMSAs), containing a total of 26 212 066 sequences, along with programs for generating extremely large MSAs from them. Each hiMSA consists of a set of hierarchically arranged MSAs representing individual subgroups within a superfamily, along with template MSAs specifying how to align each subgroup MSA against MSAs higher up the hierarchy. Central to this approach is the MAPGAPS search program, which uses a hiMSA as a query to align (potentially vast numbers of) matching database sequences with accuracy comparable to that of the curated hiMSA. We illustrate this process for the exonuclease–endonuclease–phosphatase superfamily and for pleckstrin homology domains. A set of extremely large MSAs generated from the hiMSAs in this way is available as input for deep learning and big data analyses. MAPGAPS, the auxiliary programs CDD2MGS, AddPhylum, PurgeMSA and ConvertMSA, and links to National Center for Biotechnology Information data files are available at https://www.igs.umaryland.edu/labs/neuwald/software/mapgaps/.
format Online
Article
Text
id pubmed-7297217
institution National Center for Biotechnology Information
language English
publishDate 2020
publisher Oxford University Press
record_format MEDLINE/PubMed
spelling pubmed-7297217 2020-06-22 Obtaining extremely large and accurate protein multiple sequence alignments from curated hierarchical alignments Neuwald, Andrew F; Lanczycki, Christopher J; Hodges, Theresa K; Marchler-Bauer, Aron Database (Oxford) Database Tool Oxford University Press 2020-06-08 /pmc/articles/PMC7297217/ /pubmed/32500917 http://dx.doi.org/10.1093/database/baaa042 Text en
© The Author(s) 2020. Published by Oxford University Press. http://creativecommons.org/licenses/by/4.0/ This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted reuse, distribution, and reproduction in any medium, provided the original work is properly cited.
title Obtaining extremely large and accurate protein multiple sequence alignments from curated hierarchical alignments
topic Database Tool
url https://www.ncbi.nlm.nih.gov/pmc/articles/PMC7297217/
https://www.ncbi.nlm.nih.gov/pubmed/32500917
http://dx.doi.org/10.1093/database/baaa042