CATHe: detection of remote homologues for CATH superfamilies using embeddings from protein language models
MOTIVATION: CATH is a protein domain classification resource that exploits an automated workflow of structure and sequence comparison alongside expert manual curation to construct a hierarchical classification of evolutionary and structural relationships. The aim of this study was to develop algorit...
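Main authors: | Nallapareddy, Vamsi; Bordin, Nicola; Sillitoe, Ian; Heinzinger, Michael; Littmann, Maria; Waman, Vaishali P; Sen, Neeladri; Rost, Burkhard; Orengo, Christine |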
---|---|
Format: | Online Article Text |
Language: | English |
Published: | Oxford University Press, 2023 |
Subjects: | |
Online access: | https://www.ncbi.nlm.nih.gov/pmc/articles/PMC9887088/ https://www.ncbi.nlm.nih.gov/pubmed/36648327 http://dx.doi.org/10.1093/bioinformatics/btad029 |
_version_ | 1784880261799870464 |
---|---|
author | Nallapareddy, Vamsi Bordin, Nicola Sillitoe, Ian Heinzinger, Michael Littmann, Maria Waman, Vaishali P Sen, Neeladri Rost, Burkhard Orengo, Christine |
author_facet | Nallapareddy, Vamsi Bordin, Nicola Sillitoe, Ian Heinzinger, Michael Littmann, Maria Waman, Vaishali P Sen, Neeladri Rost, Burkhard Orengo, Christine |
author_sort | Nallapareddy, Vamsi |
collection | PubMed |
description | MOTIVATION: CATH is a protein domain classification resource that exploits an automated workflow of structure and sequence comparison alongside expert manual curation to construct a hierarchical classification of evolutionary and structural relationships. The aim of this study was to develop algorithms for detecting remote homologues missed by state-of-the-art hidden Markov model (HMM)-based approaches. The method developed (CATHe) combines a neural network with sequence representations obtained from protein language models. It was assessed using a dataset of remote homologues having less than 20% sequence identity to any domain in the training set. RESULTS: The CATHe models trained on 1773 largest and 50 largest CATH superfamilies had an accuracy of 85.6 ± 0.4% and 98.2 ± 0.3%, respectively. As a further test of the power of CATHe to detect more remote homologues missed by HMMs derived from CATH domains, we used a dataset consisting of protein domains that had annotations in Pfam, but not in CATH. By using highly reliable CATHe predictions (expected error rate <0.5%), we were able to provide CATH annotations for 4.62 million Pfam domains. For a subset of these domains from Homo sapiens, we structurally validated 90.86% of the predictions by comparing their corresponding AlphaFold2 structures with structures from the CATH superfamilies to which they were assigned. AVAILABILITY AND IMPLEMENTATION: The code for the developed models is available on https://github.com/vam-sin/CATHe, and the datasets developed in this study can be accessed on https://zenodo.org/record/6327572. SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online. |
format | Online Article Text |
id | pubmed-9887088 |
institution | National Center for Biotechnology Information |
language | English |
publishDate | 2023 |
publisher | Oxford University Press |
record_format | MEDLINE/PubMed |
spelling | pubmed-9887088 2023-01-31 CATHe: detection of remote homologues for CATH superfamilies using embeddings from protein language models Nallapareddy, Vamsi Bordin, Nicola Sillitoe, Ian Heinzinger, Michael Littmann, Maria Waman, Vaishali P Sen, Neeladri Rost, Burkhard Orengo, Christine Bioinformatics Original Paper MOTIVATION: CATH is a protein domain classification resource that exploits an automated workflow of structure and sequence comparison alongside expert manual curation to construct a hierarchical classification of evolutionary and structural relationships. The aim of this study was to develop algorithms for detecting remote homologues missed by state-of-the-art hidden Markov model (HMM)-based approaches. The method developed (CATHe) combines a neural network with sequence representations obtained from protein language models. It was assessed using a dataset of remote homologues having less than 20% sequence identity to any domain in the training set. RESULTS: The CATHe models trained on 1773 largest and 50 largest CATH superfamilies had an accuracy of 85.6 ± 0.4% and 98.2 ± 0.3%, respectively. As a further test of the power of CATHe to detect more remote homologues missed by HMMs derived from CATH domains, we used a dataset consisting of protein domains that had annotations in Pfam, but not in CATH. By using highly reliable CATHe predictions (expected error rate <0.5%), we were able to provide CATH annotations for 4.62 million Pfam domains. For a subset of these domains from Homo sapiens, we structurally validated 90.86% of the predictions by comparing their corresponding AlphaFold2 structures with structures from the CATH superfamilies to which they were assigned. AVAILABILITY AND IMPLEMENTATION: The code for the developed models is available on https://github.com/vam-sin/CATHe, and the datasets developed in this study can be accessed on https://zenodo.org/record/6327572. SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online. Oxford University Press 2023-01-17 /pmc/articles/PMC9887088/ /pubmed/36648327 http://dx.doi.org/10.1093/bioinformatics/btad029 Text en © The Author(s) 2023. Published by Oxford University Press. https://creativecommons.org/licenses/by/4.0/ This is an Open Access article distributed under the terms of the Creative Commons Attribution License (https://creativecommons.org/licenses/by/4.0/), which permits unrestricted reuse, distribution, and reproduction in any medium, provided the original work is properly cited. |
spellingShingle | Original Paper Nallapareddy, Vamsi Bordin, Nicola Sillitoe, Ian Heinzinger, Michael Littmann, Maria Waman, Vaishali P Sen, Neeladri Rost, Burkhard Orengo, Christine CATHe: detection of remote homologues for CATH superfamilies using embeddings from protein language models |
title | CATHe: detection of remote homologues for CATH superfamilies using embeddings from protein language models |
title_full | CATHe: detection of remote homologues for CATH superfamilies using embeddings from protein language models |
title_fullStr | CATHe: detection of remote homologues for CATH superfamilies using embeddings from protein language models |
title_full_unstemmed | CATHe: detection of remote homologues for CATH superfamilies using embeddings from protein language models |
title_short | CATHe: detection of remote homologues for CATH superfamilies using embeddings from protein language models |
title_sort | cathe: detection of remote homologues for cath superfamilies using embeddings from protein language models |
topic | Original Paper |
url | https://www.ncbi.nlm.nih.gov/pmc/articles/PMC9887088/ https://www.ncbi.nlm.nih.gov/pubmed/36648327 http://dx.doi.org/10.1093/bioinformatics/btad029 |
work_keys_str_mv | AT nallapareddyvamsi cathedetectionofremotehomologuesforcathsuperfamiliesusingembeddingsfromproteinlanguagemodels AT bordinnicola cathedetectionofremotehomologuesforcathsuperfamiliesusingembeddingsfromproteinlanguagemodels AT sillitoeian cathedetectionofremotehomologuesforcathsuperfamiliesusingembeddingsfromproteinlanguagemodels AT heinzingermichael cathedetectionofremotehomologuesforcathsuperfamiliesusingembeddingsfromproteinlanguagemodels AT littmannmaria cathedetectionofremotehomologuesforcathsuperfamiliesusingembeddingsfromproteinlanguagemodels AT wamanvaishalip cathedetectionofremotehomologuesforcathsuperfamiliesusingembeddingsfromproteinlanguagemodels AT senneeladri cathedetectionofremotehomologuesforcathsuperfamiliesusingembeddingsfromproteinlanguagemodels AT rostburkhard cathedetectionofremotehomologuesforcathsuperfamiliesusingembeddingsfromproteinlanguagemodels AT orengochristine cathedetectionofremotehomologuesforcathsuperfamiliesusingembeddingsfromproteinlanguagemodels |