
LAMPA, LArge Multidomain Protein Annotator, and its application to RNA virus polyproteins

MOTIVATION: To facilitate accurate estimation of statistical significance of sequence similarity in profile–profile searches, queries should ideally correspond to protein domains. For multidomain proteins, using domains as queries depends on delineation of domain borders, which may be unknown. Thus,...


Bibliographic Details
Main authors: Gulyaeva, Anastasia A, Sigorskih, Andrey I, Ocheredko, Elena S, Samborskiy, Dmitry V, Gorbalenya, Alexander E
Format: Online Article Text
Language: English
Published: Oxford University Press 2020
Subjects:
Online access: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC7203729/
https://www.ncbi.nlm.nih.gov/pubmed/32003788
http://dx.doi.org/10.1093/bioinformatics/btaa065
author Gulyaeva, Anastasia A
Sigorskih, Andrey I
Ocheredko, Elena S
Samborskiy, Dmitry V
Gorbalenya, Alexander E
collection PubMed
description MOTIVATION: To facilitate accurate estimation of statistical significance of sequence similarity in profile–profile searches, queries should ideally correspond to protein domains. For multidomain proteins, using domains as queries depends on delineation of domain borders, which may be unknown. Thus, whole proteins are commonly used as queries, which complicates establishing homology for similarities close to cutoff levels of statistical significance. RESULTS: In this article, we describe an iterative approach, called LAMPA (LArge Multidomain Protein Annotator), that resolves this conundrum by gradually expanding hit coverage of multidomain proteins: at each iteration, the statistical significance of hit similarity is re-evaluated using ever smaller queries. LAMPA employs TMHMM and HHsearch for recognition of transmembrane regions and homology, respectively. We used the Pfam database to annotate 2985 multidomain proteins (polyproteins) of >1000 amino acid residues, which dominate the proteomes of RNA viruses. Under strict cutoffs, LAMPA outperformed HHsearch-mediated runs using intact polyproteins as queries by three measures: the number of identified homologous regions, the coverage by these regions, and the number of hit Pfam profiles. Compared to HHsearch, LAMPA identified 507 extra homologous regions in 14.4% of polyproteins. This Pfam-based annotation of RNA virus polyproteins by LAMPA was also superior to RefSeq expert annotation by two measures, region number and annotated length, for 69.3% of RNA virus polyprotein entries. We rationalized these results in computational experiments by examining how the statistical significance that HHsearch assigns to a local alignment similarity score depends on the lengths and diversities of query-target pairs. AVAILABILITY AND IMPLEMENTATION: The LAMPA 1.0.0 R package is available on GitHub (https://github.com/Gorbalenya-Lab/LAMPA). SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.
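The iterative idea summarized in the abstract (accept the significant hits, then search again with the smaller, still-unannotated regions as queries) can be illustrated with a minimal Python sketch. This is not the LAMPA implementation and does not call HHsearch or TMHMM. The names `annotate`, `uncovered` and `toy_search`, the score scale (0 to 1, higher is more significant), and the thresholds `min_score` and `min_len` are all hypothetical, chosen only for illustration. The toy search function simply assumes that a weak hit becomes significant once the query is short enough.

```python
from typing import Callable, List, Tuple

Region = Tuple[int, int]       # half-open residue interval [start, end)
Hit = Tuple[int, int, float]   # (start, end, score); hypothetical score in [0, 1]


def uncovered(region: Region, covered: List[Region]) -> List[Region]:
    """Return the parts of `region` not covered by any interval in `covered`."""
    start, end = region
    gaps, cursor = [], start
    for c_start, c_end in sorted(covered):
        if c_start > cursor:
            gaps.append((cursor, min(c_start, end)))
        cursor = max(cursor, c_end)
    if cursor < end:
        gaps.append((cursor, end))
    return gaps


def annotate(length: int,
             search: Callable[[Region], List[Hit]],
             min_score: float = 0.95,   # hypothetical significance cutoff
             min_len: int = 50) -> List[Hit]:
    """Iteratively re-query unannotated regions with ever smaller queries."""
    accepted: List[Hit] = []
    queue: List[Region] = [(0, length)]
    while queue:
        region = queue.pop()
        if region[1] - region[0] < min_len:
            continue  # too short to be a useful query
        # Keep significant hits only, clipped to the current query region.
        hits = [(max(s, region[0]), min(e, region[1]), sc)
                for s, e, sc in search(region) if sc >= min_score]
        hits = [h for h in hits if h[0] < h[1]]
        if not hits:
            continue
        accepted.extend(hits)
        # Every gap is strictly smaller than `region`, so the loop terminates.
        queue.extend(uncovered(region, [(s, e) for s, e, _ in hits]))
    return sorted(accepted)


def toy_search(region: Region) -> List[Hit]:
    """Invented stand-in for a homology search, for demonstration only."""
    start, end = region
    hits: List[Hit] = []
    if start <= 100 and end >= 300:
        hits.append((100, 300, 0.99))              # strong hit at any query length
    if start <= 600 and end >= 800:
        score = 0.97 if end - start <= 1000 else 0.50
        hits.append((600, 800, score))             # significant only in shorter queries
    return hits


if __name__ == "__main__":
    print(annotate(1200, toy_search))
```

With the toy search, a single pass over the whole 1200-residue query accepts only the first region, while the iterative procedure also recovers the second one after the query shrinks to residues 300 to 1200.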
format Online
Article
Text
id pubmed-7203729
institution National Center for Biotechnology Information
language English
publishDate 2020
publisher Oxford University Press
record_format MEDLINE/PubMed
spelling pubmed-7203729 2020-05-11 Bioinformatics Original Papers Oxford University Press 2020-05-01 2020-01-31 /pmc/articles/PMC7203729/ /pubmed/32003788 http://dx.doi.org/10.1093/bioinformatics/btaa065 Text en © The Author(s) 2020. Published by Oxford University Press. http://creativecommons.org/licenses/by-nc/4.0/ This is an Open Access article distributed under the terms of the Creative Commons Attribution Non-Commercial License (http://creativecommons.org/licenses/by-nc/4.0/), which permits non-commercial re-use, distribution, and reproduction in any medium, provided the original work is properly cited. For commercial re-use, please contact journals.permissions@oup.com
title LAMPA, LArge Multidomain Protein Annotator, and its application to RNA virus polyproteins
topic Original Papers