DPCfam: Unsupervised protein family classification by Density Peak Clustering of large sequence datasets

Bibliographic Details
Main authors: Russo, Elena Tea; Barone, Federico; Bateman, Alex; Cozzini, Stefano; Punta, Marco; Laio, Alessandro
Format: Online Article Text
Language: English
Published: Public Library of Science 2022
Subjects:
Online access: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC9621593/
https://www.ncbi.nlm.nih.gov/pubmed/36260616
http://dx.doi.org/10.1371/journal.pcbi.1010610
author Russo, Elena Tea
Barone, Federico
Bateman, Alex
Cozzini, Stefano
Punta, Marco
Laio, Alessandro
collection PubMed
description Proteins that are known only at a sequence level outnumber those with an experimental characterization by orders of magnitude. Classifying protein regions (domains) into homologous families can generate testable functional hypotheses for yet unannotated sequences. Existing domain family resources typically use at least some degree of manual curation: they grow slowly over time and leave a large fraction of the protein sequence space unclassified. We here describe automatic clustering by Density Peak Clustering of UniRef50 v. 2017_07, a protein sequence database including approximately 23M sequences. We performed a radical re-implementation of a pipeline we previously developed in order to allow handling millions of sequences and data volumes of the order of 3 TeraBytes. The modified pipeline, which we call DPCfam, finds ∼ 45,000 protein clusters in UniRef50. Our automatic classification is in close correspondence to the ones of the Pfam and ECOD resources: in particular, about 81% of medium-large Pfam families and 72% of ECOD families can be mapped to clusters generated by DPCfam. In addition, our protocol finds more than 14,000 clusters constituted of protein regions with no Pfam annotation, which are therefore candidates for representing novel protein families. These results are made available to the scientific community through a dedicated repository.
format Online
Article
Text
id pubmed-9621593
institution National Center for Biotechnology Information
language English
publishDate 2022
publisher Public Library of Science
record_format MEDLINE/PubMed
spelling pubmed-9621593 2022-11-01 DPCfam: Unsupervised protein family classification by Density Peak Clustering of large sequence datasets. Russo, Elena Tea; Barone, Federico; Bateman, Alex; Cozzini, Stefano; Punta, Marco; Laio, Alessandro. PLoS Comput Biol. Research Article.
Public Library of Science 2022-10-19 /pmc/articles/PMC9621593/ /pubmed/36260616 http://dx.doi.org/10.1371/journal.pcbi.1010610 Text en © 2022 Russo et al. https://creativecommons.org/licenses/by/4.0/ This is an open access article distributed under the terms of the Creative Commons Attribution License (https://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.
title DPCfam: Unsupervised protein family classification by Density Peak Clustering of large sequence datasets
topic Research Article