Machine learning reveals sequence-function relationships in family 7 glycoside hydrolases

Family 7 glycoside hydrolases (GH7) are among the principal enzymes for cellulose degradation in nature and industrially. These enzymes are often bimodular, including a catalytic domain and carbohydrate-binding module (CBM) attached via a flexible linker, and exhibit an active site that binds cello-...

Bibliographic Details
Main authors: Gado, Japheth E.; Harrison, Brent E.; Sandgren, Mats; Ståhlberg, Jerry; Beckham, Gregg T.; Payne, Christina M.
Format: Online Article Text
Language: English
Published: American Society for Biochemistry and Molecular Biology 2021
Subjects:
Online access: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC8329511/
https://www.ncbi.nlm.nih.gov/pubmed/34216620
http://dx.doi.org/10.1016/j.jbc.2021.100931
author Gado, Japheth E.
Harrison, Brent E.
Sandgren, Mats
Ståhlberg, Jerry
Beckham, Gregg T.
Payne, Christina M.
collection PubMed
description Family 7 glycoside hydrolases (GH7) are among the principal enzymes for cellulose degradation in nature and industrially. These enzymes are often bimodular, including a catalytic domain and carbohydrate-binding module (CBM) attached via a flexible linker, and exhibit an active site that binds cello-oligomers of up to ten glucosyl moieties. GH7 cellulases consist of two major subtypes: cellobiohydrolases (CBH) and endoglucanases (EG). Despite the critical importance of GH7 enzymes, there remain gaps in our understanding of how GH7 sequence and structure relate to function. Here, we employed machine learning to gain data-driven insights into relationships between sequence, structure, and function across the GH7 family. Machine-learning models, trained only on the number of residues in the active-site loops as features, were able to discriminate GH7 CBHs and EGs with up to 99% accuracy, demonstrating that the lengths of loops A4, B2, B3, and B4 strongly correlate with functional subtype across the GH7 family. Classification rules were derived such that specific residues at 42 different sequence positions each predicted the functional subtype with accuracies surpassing 87%. A random forest model trained on residues at 19 positions in the catalytic domain predicted the presence of a CBM with 89.5% accuracy. Our machine learning results recapitulate, as top-performing features, a substantial number of the sequence positions determined by previous experimental studies to play vital roles in GH7 activity. We surmise that the yet-to-be-explored sequence positions among the top-performing features also contribute to GH7 functional variation and may be exploited to understand and manipulate function.
format Online
Article
Text
id pubmed-8329511
institution National Center for Biotechnology Information
language English
publishDate 2021
publisher American Society for Biochemistry and Molecular Biology
record_format MEDLINE/PubMed
spelling pubmed-8329511 2021-08-09 Machine learning reveals sequence-function relationships in family 7 glycoside hydrolases Gado, Japheth E. Harrison, Brent E. Sandgren, Mats Ståhlberg, Jerry Beckham, Gregg T. Payne, Christina M. J Biol Chem Research Article Family 7 glycoside hydrolases (GH7) are among the principal enzymes for cellulose degradation in nature and industrially. These enzymes are often bimodular, including a catalytic domain and carbohydrate-binding module (CBM) attached via a flexible linker, and exhibit an active site that binds cello-oligomers of up to ten glucosyl moieties. GH7 cellulases consist of two major subtypes: cellobiohydrolases (CBH) and endoglucanases (EG). Despite the critical importance of GH7 enzymes, there remain gaps in our understanding of how GH7 sequence and structure relate to function. Here, we employed machine learning to gain data-driven insights into relationships between sequence, structure, and function across the GH7 family. Machine-learning models, trained only on the number of residues in the active-site loops as features, were able to discriminate GH7 CBHs and EGs with up to 99% accuracy, demonstrating that the lengths of loops A4, B2, B3, and B4 strongly correlate with functional subtype across the GH7 family. Classification rules were derived such that specific residues at 42 different sequence positions each predicted the functional subtype with accuracies surpassing 87%. A random forest model trained on residues at 19 positions in the catalytic domain predicted the presence of a CBM with 89.5% accuracy. Our machine learning results recapitulate, as top-performing features, a substantial number of the sequence positions determined by previous experimental studies to play vital roles in GH7 activity. We surmise that the yet-to-be-explored sequence positions among the top-performing features also contribute to GH7 functional variation and may be exploited to understand and manipulate function.
American Society for Biochemistry and Molecular Biology 2021-07-01 /pmc/articles/PMC8329511/ /pubmed/34216620 http://dx.doi.org/10.1016/j.jbc.2021.100931 Text en © 2021 The Authors https://creativecommons.org/licenses/by-nc-nd/4.0/ This is an open access article under the CC BY-NC-ND license (http://creativecommons.org/licenses/by-nc-nd/4.0/).
title Machine learning reveals sequence-function relationships in family 7 glycoside hydrolases
topic Research Article
url https://www.ncbi.nlm.nih.gov/pmc/articles/PMC8329511/
https://www.ncbi.nlm.nih.gov/pubmed/34216620
http://dx.doi.org/10.1016/j.jbc.2021.100931
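
The description above mentions classification rules in which the residue at a single sequence position predicts the functional subtype (CBH or EG). The following is a minimal sketch of that general idea, not the authors' procedure: the function name `best_single_position_rules`, the `Rule` tuple, and the four-column toy alignment are all hypothetical, and the accuracy is computed in-sample on made-up data purely for illustration.

```python
from collections import namedtuple

# Hypothetical container for a rule of the form
# "residue `residue` at column `position` => positive class, otherwise negative".
Rule = namedtuple("Rule", "position residue accuracy")


def best_single_position_rules(sequences, labels, positive="CBH"):
    """For each alignment column, return the residue whose presence best
    predicts the `positive` label, with its in-sample accuracy."""
    n = len(sequences)
    length = len(sequences[0])
    rules = []
    for pos in range(length):
        column = [seq[pos] for seq in sequences]
        best = None
        for aa in sorted(set(column)):
            # A sequence is classified correctly when "has this residue"
            # agrees with "belongs to the positive class".
            correct = sum(
                (res == aa) == (lab == positive)
                for res, lab in zip(column, labels)
            )
            acc = correct / n
            if best is None or acc > best.accuracy:
                best = Rule(pos, aa, acc)
        rules.append(best)
    return rules


# Made-up aligned sequences and labels (NOT real GH7 data).
toy_sequences = ["WGYD", "WGYN", "WAYD", "AGFD", "AGFN", "WGFN"]
toy_labels = ["CBH", "CBH", "CBH", "EG", "EG", "EG"]

rules = best_single_position_rules(toy_sequences, toy_labels)

# Keep only rules above an accuracy cutoff; 0.87 echoes the figure quoted
# in the description but is applied here to toy data only.
strong_rules = [r for r in rules if r.accuracy > 0.87]
for r in strong_rules:
    print(f"column {r.position}: residue {r.residue} -> CBH "
          f"(accuracy {r.accuracy:.2f})")
```

On real data, one would presumably evaluate such rules on held-out sequences rather than in-sample; the record does not describe how the reported accuracies were validated.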