
Performance of Hidden Markov Models in Recovering the Standard Classification of Glycoside Hydrolases

Glycoside hydrolases (GHs) are carbohydrate-active enzymes that assist the hydrolysis of glycoside bonds of complex sugars into carbohydrates. The current standard GH family classification is available in the CAZy database, which is based on the similarities of amino acid sequences and curated semi-...


Bibliographic Details
Main authors: Rossi, Mariana Fonseca, Mello, Beatriz, Schrago, Carlos G
Format: Online Article Text
Language: English
Published: SAGE Publications 2017
Subjects:
Online access: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC5404901/
https://www.ncbi.nlm.nih.gov/pubmed/28469382
http://dx.doi.org/10.1177/1176934317703401
collection PubMed
description Glycoside hydrolases (GHs) are carbohydrate-active enzymes that assist the hydrolysis of glycoside bonds of complex sugars into carbohydrates. The current standard GH family classification is available in the CAZy database, which is based on amino acid sequence similarity and is curated semi-automatically. However, with the exponential increase in data availability from genome sequences, automated classification methods are required for the fast annotation of coding sequences. Currently, the dbCAN database offers automatic annotations of signature domains from CAZy-defined classifications using a statistical approach based on hidden Markov models (HMMs). However, dbCAN does not contain the entire set of CAZy GH families. Moreover, no evaluation has so far been conducted of the viability of using HMM profiles to automatically assign GH amino acid sequences to the standard CAZy GH family classification itself. In this work, we performed a meta-analysis in which amino acid sequences from CAZy-defined GH families were used to build family-specific HMM profiles. We then queried a set of ~300 000 GH sequences against our database of HMM profiles estimated from CAZy families. We conducted the same evaluation against the available dbCAN HMM profiles. Our analyses recovered 65% of matches with the standard CAZy classification, whereas dbCAN HMMs resulted in 61% of matches. We also provided an analysis of the types of errors commonly found when HMMs are used to recover CAZy-based classifications. Although the performance of HMMs was good, further developments are necessary for a fully automated classification of GHs, which would allow the standardization of GH classification among protein databases.
id pubmed-5404901
institution National Center for Biotechnology Information
record_format MEDLINE/PubMed
spelling pubmed-5404901 2017-05-03 Performance of Hidden Markov Models in Recovering the Standard Classification of Glycoside Hydrolases Rossi, Mariana Fonseca Mello, Beatriz Schrago, Carlos G Evol Bioinform Online Original Research
SAGE Publications 2017-04-20 /pmc/articles/PMC5404901/ /pubmed/28469382 http://dx.doi.org/10.1177/1176934317703401 Text en © The Author(s) 2017 http://creativecommons.org/licenses/by-nc/4.0/ This article is distributed under the terms of the Creative Commons Attribution-NonCommercial 4.0 License (http://www.creativecommons.org/licenses/by-nc/4.0/), which permits non-commercial use, reproduction and distribution of the work without further permission provided the original work is attributed as specified on the SAGE and Open Access page (https://us.sagepub.com/en-us/nam/open-access-at-sage).
topic Original Research