
CATHe: detection of remote homologues for CATH superfamilies using embeddings from protein language models

MOTIVATION: CATH is a protein domain classification resource that exploits an automated workflow of structure and sequence comparison alongside expert manual curation to construct a hierarchical classification of evolutionary and structural relationships. The aim of this study was to develop algorit...

Bibliographic Details
Main Authors: Nallapareddy, Vamsi, Bordin, Nicola, Sillitoe, Ian, Heinzinger, Michael, Littmann, Maria, Waman, Vaishali P, Sen, Neeladri, Rost, Burkhard, Orengo, Christine
Format: Online Article Text
Language: English
Published: Oxford University Press 2023
Subjects:
Online Access: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC9887088/
https://www.ncbi.nlm.nih.gov/pubmed/36648327
http://dx.doi.org/10.1093/bioinformatics/btad029
collection PubMed
description MOTIVATION: CATH is a protein domain classification resource that exploits an automated workflow of structure and sequence comparison alongside expert manual curation to construct a hierarchical classification of evolutionary and structural relationships. The aim of this study was to develop algorithms for detecting remote homologues missed by state-of-the-art hidden Markov model (HMM)-based approaches. The method developed (CATHe) combines a neural network with sequence representations obtained from protein language models. It was assessed using a dataset of remote homologues having less than 20% sequence identity to any domain in the training set. RESULTS: The CATHe models trained on 1773 largest and 50 largest CATH superfamilies had an accuracy of 85.6 ± 0.4% and 98.2 ± 0.3%, respectively. As a further test of the power of CATHe to detect more remote homologues missed by HMMs derived from CATH domains, we used a dataset consisting of protein domains that had annotations in Pfam, but not in CATH. By using highly reliable CATHe predictions (expected error rate <0.5%), we were able to provide CATH annotations for 4.62 million Pfam domains. For a subset of these domains from Homo sapiens, we structurally validated 90.86% of the predictions by comparing their corresponding AlphaFold2 structures with structures from the CATH superfamilies to which they were assigned. AVAILABILITY AND IMPLEMENTATION: The code for the developed models is available on https://github.com/vam-sin/CATHe, and the datasets developed in this study can be accessed on https://zenodo.org/record/6327572. SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.
format Online
Article
Text
id pubmed-9887088
institution National Center for Biotechnology Information
language English
publishDate 2023
publisher Oxford University Press
record_format MEDLINE/PubMed
spelling pubmed-9887088 2023-01-31 Bioinformatics Original Paper Oxford University Press 2023-01-17 /pmc/articles/PMC9887088/ /pubmed/36648327 http://dx.doi.org/10.1093/bioinformatics/btad029 Text en © The Author(s) 2023. Published by Oxford University Press. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (https://creativecommons.org/licenses/by/4.0/), which permits unrestricted reuse, distribution, and reproduction in any medium, provided the original work is properly cited.
topic Original Paper
url https://www.ncbi.nlm.nih.gov/pmc/articles/PMC9887088/
https://www.ncbi.nlm.nih.gov/pubmed/36648327
http://dx.doi.org/10.1093/bioinformatics/btad029