Clustering FunFams using sequence embeddings improves EC purity
MOTIVATION: Classifying proteins into functional families can improve our understanding of protein function and can allow transferring annotations within one family. For this, functional families need to be ‘pure’, i.e., contain only proteins with identical function. Functional Families (FunFams) cl...
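Main authors: | Littmann, Maria; Bordin, Nicola; Heinzinger, Michael; Schütze, Konstantin; Dallago, Christian; Orengo, Christine; Rost, Burkhard |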
---|---|
Format: | Online Article Text |
Language: | English |
Published: | Oxford University Press, 2021 |
Subjects: | Original Papers |
Online access: | https://www.ncbi.nlm.nih.gov/pmc/articles/PMC8545299/ https://www.ncbi.nlm.nih.gov/pubmed/33978744 http://dx.doi.org/10.1093/bioinformatics/btab371 |
_version_ | 1784589986399518720 |
---|---|
author | Littmann, Maria; Bordin, Nicola; Heinzinger, Michael; Schütze, Konstantin; Dallago, Christian; Orengo, Christine; Rost, Burkhard |
author_facet | Littmann, Maria; Bordin, Nicola; Heinzinger, Michael; Schütze, Konstantin; Dallago, Christian; Orengo, Christine; Rost, Burkhard |
author_sort | Littmann, Maria |
collection | PubMed |
description | MOTIVATION: Classifying proteins into functional families can improve our understanding of protein function and can allow transferring annotations within one family. For this, functional families need to be ‘pure’, i.e., contain only proteins with identical function. Functional Families (FunFams) cluster proteins within CATH superfamilies into such groups of proteins sharing function. 11% of all FunFams (22 830 of 203 639) contain EC annotations and, of those, 7% (1526 of 22 830) have inconsistent functional annotations. RESULTS: We propose an approach to further cluster FunFams into functionally more consistent sub-families by encoding their sequences through embeddings. These embeddings originate from language models transferring knowledge gained from predicting missing amino acids in a sequence (ProtBERT) and have been further optimized to distinguish between proteins belonging to the same or a different CATH superfamily (PB-Tucker). Using distances between embeddings and DBSCAN to cluster FunFams and identify outliers doubled the number of pure clusters per FunFam compared to random clustering. Our approach was not limited to FunFams but also succeeded on families created using sequence similarity alone. Complementing EC annotations, we observed similar results for binding annotations. Thus, we also expect increased purity for other aspects of function. Our results can help generate FunFams; the resulting clusters with improved functional consistency allow more reliable inference of annotations. We expect this approach to succeed equally for any other grouping of proteins by their phenotypes. AVAILABILITY AND IMPLEMENTATION: Code and embeddings are available via GitHub: https://github.com/Rostlab/FunFamsClustering. SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online. |
format | Online Article Text |
id | pubmed-8545299 |
institution | National Center for Biotechnology Information |
language | English |
publishDate | 2021 |
publisher | Oxford University Press |
record_format | MEDLINE/PubMed |
spelling | pubmed-8545299 2021-10-26 Clustering FunFams using sequence embeddings improves EC purity. Littmann, Maria; Bordin, Nicola; Heinzinger, Michael; Schütze, Konstantin; Dallago, Christian; Orengo, Christine; Rost, Burkhard. Bioinformatics. Original Papers. Oxford University Press 2021-05-12 /pmc/articles/PMC8545299/ /pubmed/33978744 http://dx.doi.org/10.1093/bioinformatics/btab371 Text en © The Author(s) 2021. Published by Oxford University Press. https://creativecommons.org/licenses/by/4.0/ This is an Open Access article distributed under the terms of the Creative Commons Attribution License (https://creativecommons.org/licenses/by/4.0/), which permits unrestricted reuse, distribution, and reproduction in any medium, provided the original work is properly cited. |
spellingShingle | Original Papers Littmann, Maria Bordin, Nicola Heinzinger, Michael Schütze, Konstantin Dallago, Christian Orengo, Christine Rost, Burkhard Clustering FunFams using sequence embeddings improves EC purity |
title | Clustering FunFams using sequence embeddings improves EC purity |
title_sort | clustering funfams using sequence embeddings improves ec purity |
topic | Original Papers |
url | https://www.ncbi.nlm.nih.gov/pmc/articles/PMC8545299/ https://www.ncbi.nlm.nih.gov/pubmed/33978744 http://dx.doi.org/10.1093/bioinformatics/btab371 |
work_keys_str_mv | AT littmannmaria clusteringfunfamsusingsequenceembeddingsimprovesecpurity AT bordinnicola clusteringfunfamsusingsequenceembeddingsimprovesecpurity AT heinzingermichael clusteringfunfamsusingsequenceembeddingsimprovesecpurity AT schutzekonstantin clusteringfunfamsusingsequenceembeddingsimprovesecpurity AT dallagochristian clusteringfunfamsusingsequenceembeddingsimprovesecpurity AT orengochristine clusteringfunfamsusingsequenceembeddingsimprovesecpurity AT rostburkhard clusteringfunfamsusingsequenceembeddingsimprovesecpurity |
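
The abstract above describes clustering the members of a family with DBSCAN on distances between embeddings and then asking whether the resulting clusters are 'pure', i.e., contain only proteins with identical function. The following is a minimal sketch of that idea, not the authors' implementation (their code is at the GitHub link in the record). Everything specific here is an assumption made for illustration: the Euclidean metric, the `eps` and `min_pts` values, the two-dimensional toy vectors standing in for embeddings, the EC labels, and the helper names `dbscan` and `pure_cluster_fraction`.

```python
from math import dist


def dbscan(points, eps, min_pts):
    """Minimal DBSCAN sketch; returns one label per point (-1 marks an outlier)."""
    n = len(points)
    # A point's neighborhood includes the point itself.
    neighbors = [
        [j for j in range(n) if dist(points[i], points[j]) <= eps]
        for i in range(n)
    ]
    labels = [None] * n
    cluster = -1
    for i in range(n):
        if labels[i] is not None:
            continue
        if len(neighbors[i]) < min_pts:
            labels[i] = -1  # provisional outlier; may later become a border point
            continue
        cluster += 1
        labels[i] = cluster
        queue = list(neighbors[i])
        while queue:
            j = queue.pop()
            if labels[j] == -1:
                labels[j] = cluster  # border point: joins the cluster, not expanded
            if labels[j] is not None:
                continue
            labels[j] = cluster
            if len(neighbors[j]) >= min_pts:
                queue.extend(neighbors[j])  # core point: keep growing the cluster
    return labels


def pure_cluster_fraction(labels, ec_numbers):
    """Fraction of clusters whose annotated members all share one EC number."""
    ecs_per_cluster = {}
    for label, ec in zip(labels, ec_numbers):
        if label != -1 and ec is not None:  # skip outliers and unannotated proteins
            ecs_per_cluster.setdefault(label, set()).add(ec)
    if not ecs_per_cluster:
        return 0.0
    pure = sum(len(ecs) == 1 for ecs in ecs_per_cluster.values())
    return pure / len(ecs_per_cluster)


# Hypothetical toy family: 2-D stand-ins for embeddings with made-up EC labels.
toy_embeddings = [
    (0.0, 0.0), (0.1, 0.0), (0.0, 0.1),
    (5.0, 5.0), (5.1, 5.0), (5.0, 5.1),
    (10.0, 0.0),
]
toy_ec = ["3.1.1.1"] * 3 + ["2.7.11.1"] * 3 + ["1.1.1.1"]

toy_labels = dbscan(toy_embeddings, eps=0.5, min_pts=3)
print(toy_labels)
print(pure_cluster_fraction(toy_labels, toy_ec))
```

In this toy family, treating all seven proteins as one group mixes three EC numbers, whereas the two clusters found by the sketch each carry a single EC number and the isolated point is flagged as an outlier. Real embeddings have far more dimensions, and suitable parameter values would need to be chosen for the data at hand.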