
Alignment-free clustering of large data sets of unannotated protein conserved regions using minhashing

BACKGROUND: Clustering of protein sequences is of key importance in predicting the structure and function of newly sequenced proteins and is also of use for their annotation. With the advent of multiple high-throughput sequencing technologies, new protein sequences are becoming available at an extra...


Bibliographic Details
Main Authors:	Abnousi, Armen; Broschat, Shira L.; Kalyanaraman, Ananth
Format:	Online Article Text
Language:	English
Published:	BioMed Central 2018
Subjects:	Methodology Article
Online Access:	https://www.ncbi.nlm.nih.gov/pmc/articles/PMC5838936/ https://www.ncbi.nlm.nih.gov/pubmed/29506470 http://dx.doi.org/10.1186/s12859-018-2080-y

_version_	1783304335400108032
author	Abnousi, Armen; Broschat, Shira L.; Kalyanaraman, Ananth
collection	PubMed
description	BACKGROUND: Clustering of protein sequences is of key importance in predicting the structure and function of newly sequenced proteins and is also of use for their annotation. With the advent of multiple high-throughput sequencing technologies, new protein sequences are becoming available at an extraordinary rate. The rapid growth rate has impeded deployment of existing protein clustering/annotation tools which depend largely on pairwise sequence alignment. RESULTS: In this paper, we propose an alignment-free clustering approach, coreClust, for annotating protein sequences using detected conserved regions. The proposed algorithm uses Min-Wise Independent Hashing for identifying similar conserved regions. Min-Wise Independent Hashing works by generating a (w,c)-sketch for each document and comparing these sketches. Our algorithm fits well within the MapReduce framework, permitting scalability. We show that coreClust generates results comparable to existing known methods. In particular, we show that the clusters generated by our algorithm capture the subfamilies of the Pfam domain families for which the sequences in a cluster have a similar domain architecture. We show that for a data set of 90,000 sequences (about 250,000 domain regions), the clusters generated by our algorithm give a 75% average weighted F1 score, our accuracy metric, when compared to the clusters generated by a semi-exhaustive pairwise alignment algorithm. CONCLUSIONS: The new clustering algorithm can be used to generate meaningful clusters of conserved regions. It is a scalable method that when paired with our prior work, NADDA for detecting conserved regions, provides a complete end-to-end pipeline for annotating protein sequences. ELECTRONIC SUPPLEMENTARY MATERIAL: The online version of this article (10.1186/s12859-018-2080-y) contains supplementary material, which is available to authorized users.
format	Online Article Text
id	pubmed-5838936
institution	National Center for Biotechnology Information
language	English
publishDate	2018
publisher	BioMed Central
record_format	MEDLINE/PubMed
spelling	pubmed-5838936 2018-03-09 Alignment-free clustering of large data sets of unannotated protein conserved regions using minhashing. Abnousi, Armen; Broschat, Shira L.; Kalyanaraman, Ananth. BMC Bioinformatics. Methodology Article. BioMed Central 2018-03-05 /pmc/articles/PMC5838936/ /pubmed/29506470 http://dx.doi.org/10.1186/s12859-018-2080-y Text en © The Author(s) 2018 Open Access This article is distributed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated.
title	Alignment-free clustering of large data sets of unannotated protein conserved regions using minhashing
topic	Methodology Article
url	https://www.ncbi.nlm.nih.gov/pmc/articles/PMC5838936/ https://www.ncbi.nlm.nih.gov/pubmed/29506470 http://dx.doi.org/10.1186/s12859-018-2080-y
