Complet+: a computationally scalable method to improve completeness of large-scale protein sequence clustering

A major challenge for clustering algorithms is to balance the trade-off between homogeneity, i.e., the degree to which an individual cluster includes only related sequences, and completeness, i.e., the degree to which related sequences are kept together rather than broken up into multiple clusters. Most algorithms are conservative...

Bibliographic Details
Main Authors:	Nguyen, Rachel; Sokhansanj, Bahrad A.; Polikar, Robi; Rosen, Gail L.
Format:	Online Article Text
Language:	English
Published:	PeerJ Inc. 2023
Subjects:	Bioinformatics
Online Access:	https://www.ncbi.nlm.nih.gov/pmc/articles/PMC9921987/ https://www.ncbi.nlm.nih.gov/pubmed/36785708 http://dx.doi.org/10.7717/peerj.14779

_version_	1784887444651376640
author	Nguyen, Rachel; Sokhansanj, Bahrad A.; Polikar, Robi; Rosen, Gail L.
author_facet	Nguyen, Rachel; Sokhansanj, Bahrad A.; Polikar, Robi; Rosen, Gail L.
author_sort	Nguyen, Rachel
collection	PubMed
description	A major challenge for clustering algorithms is to balance the trade-off between homogeneity, i.e., the degree to which an individual cluster includes only related sequences, and completeness, i.e., the degree to which related sequences are kept together rather than broken up into multiple clusters. Most algorithms are conservative in grouping sequences with other sequences. Remote homologs may fail to be clustered together and instead form unnecessarily distinct clusters. The resulting clusters have high homogeneity but completeness that is too low. We propose Complet+, a computationally scalable post-processing method to increase the completeness of clusters without an undue cost in homogeneity. Complet+ effectively merges closely related clusters of proteins that have verified structural relationships in the SCOPe classification scheme, improving the completeness of clustering results at little cost to homogeneity. Applying Complet+ to clusters obtained using MMseqs2’s clusterupdate increases the V-measure by 0.09 and 0.05 at the SCOPe superfamily and family levels, respectively. Complet+ also creates more biologically representative clusters, as shown by a substantial increase in the Adjusted Mutual Information (AMI) and Adjusted Rand Index (ARI) metrics when comparing predicted clusters to biological classifications. Complet+ similarly improves clustering metrics when applied to other methods, such as CD-HIT and linclust. Finally, we show that Complet+ runtime scales linearly with the number of clusters being post-processed on a COG dataset of over 3 million sequences. Code and supplementary information are available on GitHub: https://github.com/EESI/Complet-Plus.
format	Online Article Text
id	pubmed-9921987
institution	National Center for Biotechnology Information
language	English
publishDate	2023
publisher	PeerJ Inc.
record_format	MEDLINE/PubMed
spelling	pubmed-9921987 2023-02-12 PeerJ Bioinformatics PeerJ Inc. 2023-02-08 /pmc/articles/PMC9921987/ /pubmed/36785708 http://dx.doi.org/10.7717/peerj.14779 Text en ©2023 Nguyen et al. https://creativecommons.org/licenses/by/4.0/ This is an open access article distributed under the terms of the Creative Commons Attribution License (https://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, reproduction and adaptation in any medium and for any purpose provided that it is properly attributed. For attribution, the original author(s), title, publication source (PeerJ) and either DOI or URL of the article must be cited.
title	Complet+: a computationally scalable method to improve completeness of large-scale protein sequence clustering
topic	Bioinformatics
url	https://www.ncbi.nlm.nih.gov/pmc/articles/PMC9921987/ https://www.ncbi.nlm.nih.gov/pubmed/36785708 http://dx.doi.org/10.7717/peerj.14779
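
The description above turns on three metrics: homogeneity, completeness, and their harmonic mean, the V-measure. The following is a minimal Python sketch of the standard entropy-based definitions, included only to make the trade-off concrete. It is not the authors' code and is not taken from the Complet+ repository; the function names and the toy family and cluster labels are hypothetical.

```python
from collections import Counter
from math import log


def entropy(labels):
    """Shannon entropy (in nats) of a label assignment."""
    n = len(labels)
    return -sum((c / n) * log(c / n) for c in Counter(labels).values())


def conditional_entropy(target, given):
    """H(target | given) for two parallel label lists."""
    n = len(target)
    joint = Counter(zip(target, given))
    given_counts = Counter(given)
    return -sum((c / n) * log(c / given_counts[g]) for (_, g), c in joint.items())


def v_measure(true_labels, cluster_labels):
    """Return (homogeneity, completeness, V-measure) for a clustering."""
    h_true = entropy(true_labels)
    h_clus = entropy(cluster_labels)
    # Homogeneity: 1 when every cluster contains members of a single class.
    hom = 1.0 if h_true == 0 else 1.0 - conditional_entropy(true_labels, cluster_labels) / h_true
    # Completeness: 1 when every class is contained in a single cluster.
    com = 1.0 if h_clus == 0 else 1.0 - conditional_entropy(cluster_labels, true_labels) / h_clus
    total = hom + com
    v = 0.0 if total == 0 else 2.0 * hom * com / total
    return hom, com, v


# Hypothetical toy data: four sequences from family "A", two from family "B".
families = ["A", "A", "A", "A", "B", "B"]
conservative = [0, 0, 1, 1, 2, 2]  # family A is split across clusters 0 and 1
merged = [0, 0, 0, 0, 2, 2]        # clusters 0 and 1 merged in post-processing

print(v_measure(families, conservative))
print(v_measure(families, merged))
```

In this toy case the conservative clustering is perfectly homogeneous but has a completeness of roughly 0.58, because family A is split in two. Merging the two A clusters raises completeness to 1 without lowering homogeneity, which is the kind of improvement a post-processing merge step aims for.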