Clustering FunFams using sequence embeddings improves EC purity

MOTIVATION: Classifying proteins into functional families can improve our understanding of protein function and can allow transferring annotations within one family. For this, functional families need to be ‘pure’, i.e., contain only proteins with identical function. Functional Families (FunFams) cl...

Bibliographic Details
Main authors: Littmann, Maria, Bordin, Nicola, Heinzinger, Michael, Schütze, Konstantin, Dallago, Christian, Orengo, Christine, Rost, Burkhard
Format: Online Article Text
Language: English
Published: Oxford University Press 2021
Subjects:
Online access: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC8545299/
https://www.ncbi.nlm.nih.gov/pubmed/33978744
http://dx.doi.org/10.1093/bioinformatics/btab371
_version_ 1784589986399518720
author Littmann, Maria
Bordin, Nicola
Heinzinger, Michael
Schütze, Konstantin
Dallago, Christian
Orengo, Christine
Rost, Burkhard
author_facet Littmann, Maria
Bordin, Nicola
Heinzinger, Michael
Schütze, Konstantin
Dallago, Christian
Orengo, Christine
Rost, Burkhard
author_sort Littmann, Maria
collection PubMed
description MOTIVATION: Classifying proteins into functional families can improve our understanding of protein function and can allow annotations to be transferred within a family. For this, functional families need to be ‘pure’, i.e., contain only proteins with identical function. Functional Families (FunFams) cluster proteins within CATH superfamilies into such groups of proteins sharing function. 11% of all FunFams (22 830 of 203 639) contain EC annotations and, of those, 7% (1526 of 22 830) have inconsistent functional annotations. RESULTS: We propose an approach to further cluster FunFams into functionally more consistent sub-families by encoding their sequences through embeddings. These embeddings originate from language models transferring knowledge gained from predicting missing amino acids in a sequence (ProtBERT) and have been further optimized to distinguish between proteins belonging to the same or a different CATH superfamily (PB-Tucker). Using distances between embeddings and DBSCAN to cluster FunFams and identify outliers doubled the number of pure clusters per FunFam compared to random clustering. Our approach was not limited to FunFams but also succeeded on families created using sequence similarity alone. Complementing EC annotations, we observed similar results for binding annotations. Thus, we also expect increased purity for other aspects of function. Our results can help in generating FunFams; the resulting clusters with improved functional consistency allow more reliable inference of annotations. We expect this approach to succeed equally for any other grouping of proteins by their phenotypes. AVAILABILITY AND IMPLEMENTATION: Code and embeddings are available via GitHub: https://github.com/Rostlab/FunFamsClustering. SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.
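The clustering idea in the description can be illustrated with a minimal sketch. This is not the authors' implementation (that is in the GitHub repository named above); it is a small pure-Python DBSCAN written for illustration only. The two-dimensional "embeddings", the EC numbers, the `eps` and `min_samples` values, and the helper names `dbscan` and `pure_clusters` are all assumptions made up for this toy example. Here a cluster is called pure when all of its members carry the same EC number.

```python
from math import dist  # Euclidean distance, Python 3.8+


def dbscan(points, eps, min_samples):
    """Toy DBSCAN: returns one label per point (0, 1, ... or -1 for outliers)."""
    unvisited, noise = None, -1
    labels = [unvisited] * len(points)

    def neighbors(i):
        # Indices of all points within eps of point i (including i itself).
        return [j for j, q in enumerate(points) if dist(points[i], q) <= eps]

    cluster = 0
    for i in range(len(points)):
        if labels[i] is not unvisited:
            continue
        seeds = neighbors(i)
        if len(seeds) < min_samples:
            labels[i] = noise  # may later be claimed as a border point
            continue
        labels[i] = cluster
        queue = [j for j in seeds if j != i]
        while queue:
            j = queue.pop()
            if labels[j] == noise:
                labels[j] = cluster  # border point: joins, but is not expanded
            if labels[j] is not unvisited:
                continue
            labels[j] = cluster
            reachable = neighbors(j)
            if len(reachable) >= min_samples:  # j is a core point: expand
                queue.extend(reachable)
        cluster += 1
    return labels


def pure_clusters(labels, ec_numbers):
    """Return the labels of clusters whose members all share one EC number."""
    members = {}
    for label, ec in zip(labels, ec_numbers):
        if label != -1:  # outliers belong to no cluster
            members.setdefault(label, set()).add(ec)
    return [label for label, ecs in sorted(members.items()) if len(ecs) == 1]


# Hypothetical family of seven proteins with made-up 2-D embeddings.
embeddings = [
    (0.0, 0.0), (0.1, 0.0), (0.0, 0.1),  # annotated with the first EC number
    (1.0, 1.0), (1.1, 1.0), (1.0, 1.1),  # annotated with the second EC number
    (5.0, 5.0),                          # far from everything else
]
ec_numbers = ["1.1.1.1"] * 3 + ["2.7.1.1"] * 3 + ["3.4.21.4"]

labels = dbscan(embeddings, eps=0.2, min_samples=2)
print(labels)                             # cluster label per protein
print(pure_clusters(labels, ec_numbers))  # clusters with a single EC number
```

In this toy family the unclustered set mixes three EC numbers, while the sketch separates it into two single-EC clusters and flags the distant protein as an outlier. Real embeddings are high-dimensional, and suitable distance thresholds would have to be chosen per dataset.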
format Online
Article
Text
id pubmed-8545299
institution National Center for Biotechnology Information
language English
publishDate 2021
publisher Oxford University Press
record_format MEDLINE/PubMed
spelling pubmed-8545299 2021-10-26 Clustering FunFams using sequence embeddings improves EC purity Littmann, Maria Bordin, Nicola Heinzinger, Michael Schütze, Konstantin Dallago, Christian Orengo, Christine Rost, Burkhard Bioinformatics Original Papers Oxford University Press 2021-05-12 /pmc/articles/PMC8545299/ /pubmed/33978744 http://dx.doi.org/10.1093/bioinformatics/btab371 Text en © The Author(s) 2021. Published by Oxford University Press. https://creativecommons.org/licenses/by/4.0/ This is an Open Access article distributed under the terms of the Creative Commons Attribution License (https://creativecommons.org/licenses/by/4.0/), which permits unrestricted reuse, distribution, and reproduction in any medium, provided the original work is properly cited.
spellingShingle Original Papers
Littmann, Maria
Bordin, Nicola
Heinzinger, Michael
Schütze, Konstantin
Dallago, Christian
Orengo, Christine
Rost, Burkhard
Clustering FunFams using sequence embeddings improves EC purity
title Clustering FunFams using sequence embeddings improves EC purity
title_full Clustering FunFams using sequence embeddings improves EC purity
title_fullStr Clustering FunFams using sequence embeddings improves EC purity
title_full_unstemmed Clustering FunFams using sequence embeddings improves EC purity
title_short Clustering FunFams using sequence embeddings improves EC purity
title_sort clustering funfams using sequence embeddings improves ec purity
topic Original Papers
url https://www.ncbi.nlm.nih.gov/pmc/articles/PMC8545299/
https://www.ncbi.nlm.nih.gov/pubmed/33978744
http://dx.doi.org/10.1093/bioinformatics/btab371
work_keys_str_mv AT littmannmaria clusteringfunfamsusingsequenceembeddingsimprovesecpurity
AT bordinnicola clusteringfunfamsusingsequenceembeddingsimprovesecpurity
AT heinzingermichael clusteringfunfamsusingsequenceembeddingsimprovesecpurity
AT schutzekonstantin clusteringfunfamsusingsequenceembeddingsimprovesecpurity
AT dallagochristian clusteringfunfamsusingsequenceembeddingsimprovesecpurity
AT orengochristine clusteringfunfamsusingsequenceembeddingsimprovesecpurity
AT rostburkhard clusteringfunfamsusingsequenceembeddingsimprovesecpurity