Clustering FunFams using sequence embeddings improves EC purity

MOTIVATION: Classifying proteins into functional families can improve our understanding of protein function and can allow transferring annotations within one family. For this, functional families need to be ‘pure’, i.e., contain only proteins with identical function. Functional Families (FunFams) cl...

Bibliographic Details
Main Authors:	Littmann, Maria; Bordin, Nicola; Heinzinger, Michael; Schütze, Konstantin; Dallago, Christian; Orengo, Christine; Rost, Burkhard
Format:	Online Article Text
Language:	English
Published:	Oxford University Press 2021
Subjects:	Original Papers
Online Access:	https://www.ncbi.nlm.nih.gov/pmc/articles/PMC8545299/ https://www.ncbi.nlm.nih.gov/pubmed/33978744 http://dx.doi.org/10.1093/bioinformatics/btab371

_version_	1784589986399518720
author	Littmann, Maria; Bordin, Nicola; Heinzinger, Michael; Schütze, Konstantin; Dallago, Christian; Orengo, Christine; Rost, Burkhard
collection	PubMed
description	MOTIVATION: Classifying proteins into functional families can improve our understanding of protein function and can allow transferring annotations within one family. For this, functional families need to be ‘pure’, i.e., contain only proteins with identical function. Functional Families (FunFams) cluster proteins within CATH superfamilies into such groups of proteins sharing function. 11% of all FunFams (22 830 of 203 639) contain EC annotations and, of those, 7% (1526 of 22 830) have inconsistent functional annotations. RESULTS: We propose an approach to further cluster FunFams into functionally more consistent sub-families by encoding their sequences through embeddings. These embeddings originate from language models transferring knowledge gained from predicting missing amino acids in a sequence (ProtBERT) and have been further optimized to distinguish between proteins belonging to the same or a different CATH superfamily (PB-Tucker). Using distances between embeddings and DBSCAN to cluster FunFams and identify outliers doubled the number of pure clusters per FunFam compared to random clustering. Our approach was not limited to FunFams but also succeeded on families created using sequence similarity alone. Complementing EC annotations, we observed similar results for binding annotations. Thus, we also expect increased purity for other aspects of function. Our results can help in generating FunFams; the resulting clusters with improved functional consistency allow more reliable inference of annotations. We expect this approach to succeed equally for any other grouping of proteins by their phenotypes. AVAILABILITY AND IMPLEMENTATION: Code and embeddings are available via GitHub: https://github.com/Rostlab/FunFamsClustering. SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.
format	Online Article Text
id	pubmed-8545299
institution	National Center for Biotechnology Information
language	English
publishDate	2021
publisher	Oxford University Press
record_format	MEDLINE/PubMed
spelling	pubmed-8545299 2021-10-26 Clustering FunFams using sequence embeddings improves EC purity Littmann, Maria; Bordin, Nicola; Heinzinger, Michael; Schütze, Konstantin; Dallago, Christian; Orengo, Christine; Rost, Burkhard Bioinformatics Original Papers Oxford University Press 2021-05-12 /pmc/articles/PMC8545299/ /pubmed/33978744 http://dx.doi.org/10.1093/bioinformatics/btab371 Text en © The Author(s) 2021. Published by Oxford University Press. https://creativecommons.org/licenses/by/4.0/ This is an Open Access article distributed under the terms of the Creative Commons Attribution License (https://creativecommons.org/licenses/by/4.0/), which permits unrestricted reuse, distribution, and reproduction in any medium, provided the original work is properly cited.
title	Clustering FunFams using sequence embeddings improves EC purity
topic	Original Papers
url	https://www.ncbi.nlm.nih.gov/pmc/articles/PMC8545299/ https://www.ncbi.nlm.nih.gov/pubmed/33978744 http://dx.doi.org/10.1093/bioinformatics/btab371

Similar Items