Rapid identification of novel protein families using similarity searches

Protein family databases are an important tool for biologists trying to dissect the function of proteins. Comparing potential new families to the thousands of existing entries is an important task when operating a protein family database. This comparison helps to understand whether a collection of p...

Bibliographic Details
Main authors: Jeffryes, Matt; Bateman, Alex
Format: Online Article Text
Language: English
Published: F1000 Research Limited 2018
Subjects:
Online access: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC6439793/
https://www.ncbi.nlm.nih.gov/pubmed/30984369
http://dx.doi.org/10.12688/f1000research.17315.1
_version_ 1783407285343617024
author Jeffryes, Matt
Bateman, Alex
author_facet Jeffryes, Matt
Bateman, Alex
author_sort Jeffryes, Matt
collection PubMed
description Protein family databases are an important tool for biologists trying to dissect the function of proteins. Comparing potential new families to the thousands of existing entries is an important task when operating a protein family database. This comparison helps to understand whether a collection of protein regions forms a novel family or has overlaps with existing families of proteins. In this paper, we describe a method for performing this analysis with an adjustable level of accuracy, depending on the desired speed, enabling interactive comparisons. This method is based upon the MinHash algorithm, which we have further extended to calculate the Jaccard containment rather than the Jaccard index of the original MinHash technique. Testing this method with the Pfam protein family database, we are able to compare potential new families to the over 17,000 existing families in Pfam in less than a second, with little loss in accuracy.
format Online
Article
Text
id pubmed-6439793
institution National Center for Biotechnology Information
language English
publishDate 2018
publisher F1000 Research Limited
record_format MEDLINE/PubMed
spelling pubmed-64397932019-04-12 Rapid identification of novel protein families using similarity searches Jeffryes, Matt Bateman, Alex F1000Res Method Article Protein family databases are an important tool for biologists trying to dissect the function of proteins. Comparing potential new families to the thousands of existing entries is an important task when operating a protein family database. This comparison helps to understand whether a collection of protein regions forms a novel family or has overlaps with existing families of proteins. In this paper, we describe a method for performing this analysis with an adjustable level of accuracy, depending on the desired speed, enabling interactive comparisons. This method is based upon the MinHash algorithm, which we have further extended to calculate the Jaccard containment rather than the Jaccard index of the original MinHash technique. Testing this method with the Pfam protein family database, we are able to compare potential new families to the over 17,000 existing families in Pfam in less than a second, with little loss in accuracy. F1000 Research Limited 2018-12-24 /pmc/articles/PMC6439793/ /pubmed/30984369 http://dx.doi.org/10.12688/f1000research.17315.1 Text en Copyright: © 2018 Jeffryes M and Bateman A http://creativecommons.org/licenses/by/4.0/ This is an open access article distributed under the terms of the Creative Commons Attribution Licence, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.
spellingShingle Method Article
Jeffryes, Matt
Bateman, Alex
Rapid identification of novel protein families using similarity searches
title Rapid identification of novel protein families using similarity searches
topic Method Article
url https://www.ncbi.nlm.nih.gov/pmc/articles/PMC6439793/
https://www.ncbi.nlm.nih.gov/pubmed/30984369
http://dx.doi.org/10.12688/f1000research.17315.1
work_keys_str_mv AT jeffryesmatt rapididentificationofnovelproteinfamiliesusingsimilaritysearches
AT batemanalex rapididentificationofnovelproteinfamiliesusingsimilaritysearches
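
The description above mentions a MinHash-based comparison extended to estimate Jaccard containment rather than the Jaccard index. The following is a minimal, generic sketch of that idea and not the authors' implementation: it assumes each family is represented as a set of k-mers, estimates the Jaccard index from matching MinHash signature positions, and converts it to containment with the identity |A ∩ B| = J(|A| + |B|) / (1 + J). The function names, the choice of k = 3, the BLAKE2b-based hash, and the toy sequences are all assumptions made for illustration.

```python
import hashlib


def kmers(sequence, k=3):
    """Return the set of overlapping k-mers in a sequence (k is an assumed choice)."""
    return {sequence[i:i + k] for i in range(len(sequence) - k + 1)}


def seeded_hash(item, seed):
    """Deterministic 64-bit hash of a string; the seed selects the hash function."""
    digest = hashlib.blake2b(
        item.encode("utf-8"),
        digest_size=8,
        salt=seed.to_bytes(8, "little"),
    ).digest()
    return int.from_bytes(digest, "little")


def minhash_signature(items, num_hashes=128):
    """One minimum per seeded hash function; more hashes trade speed for accuracy."""
    if not items:
        raise ValueError("cannot sketch an empty set")
    return [min(seeded_hash(x, seed) for x in items) for seed in range(num_hashes)]


def jaccard_estimate(sig_a, sig_b):
    """Fraction of signature positions where both sets share the same minimum."""
    if len(sig_a) != len(sig_b):
        raise ValueError("signatures must have the same length")
    matches = sum(a == b for a, b in zip(sig_a, sig_b))
    return matches / len(sig_a)


def containment_estimate(sig_a, size_a, sig_b, size_b):
    """Estimate |A ∩ B| / |A| from the Jaccard estimate and the two set sizes."""
    j = jaccard_estimate(sig_a, sig_b)
    intersection = j * (size_a + size_b) / (1 + j)
    # Sampling noise can push the estimate slightly above 1, so clamp it.
    return min(1.0, intersection / size_a)


if __name__ == "__main__":
    # Hypothetical toy sequences, not real Pfam data.
    candidate = kmers("MKTAYIAKQRQISFVKSHFSRQ")
    existing = kmers("MKTAYIAKQRQISFVKSHFSRQLEERLGLIEVQAPILSRVGDGTQDNLSGAEKAVQ")
    sig_c = minhash_signature(candidate)
    sig_e = minhash_signature(existing)
    print("Jaccard estimate:", jaccard_estimate(sig_c, sig_e))
    print("Containment of candidate in existing:",
          containment_estimate(sig_c, len(candidate), sig_e, len(existing)))
```

Containment is asymmetric, which suits the use case in the description: a small candidate family that sits entirely inside a large existing family has a low Jaccard index but a containment close to 1. The num_hashes parameter plays the role of the adjustable accuracy/speed setting in this sketch.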