
Training data composition affects performance of protein structure analysis algorithms

Bibliographic Details
Main Authors: Derry, Alexander; Carpenter, Kristy A.; Altman, Russ B.
Format: Online Article Text
Language: English
Published: 2022
Subjects:
Online Access: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC8669736/
https://www.ncbi.nlm.nih.gov/pubmed/34890132
author Derry, Alexander
Carpenter, Kristy A.
Altman, Russ B.
collection PubMed
description The three-dimensional structures of proteins are crucial for understanding their molecular mechanisms and interactions. Machine learning algorithms that are able to learn accurate representations of protein structures are therefore poised to play a key role in protein engineering and drug development. The accuracy of such models in deployment is directly influenced by training data quality. The use of different experimental methods for protein structure determination may introduce bias into the training data. In this work, we evaluate the magnitude of this effect across three distinct tasks: estimation of model accuracy, protein sequence design, and catalytic residue prediction. Most protein structures are derived from X-ray crystallography, nuclear magnetic resonance (NMR), or cryo-electron microscopy (cryo-EM); we trained each model on datasets consisting of either all three structure types or of only X-ray data. We find that across these tasks, models consistently perform worse on test sets derived from NMR and cryo-EM than they do on test sets of structures derived from X-ray crystallography, but that the difference can be mitigated when NMR and cryo-EM structures are included in the training set. Importantly, we show that including all three types of structures in the training set does not degrade test performance on X-ray structures, and in some cases even increases it. Finally, we examine the relationship between model performance and the biophysical properties of each method, and recommend that the biochemistry of the task of interest should be considered when composing training sets.
format Online
Article
Text
id pubmed-8669736
institution National Center for Biotechnology Information
language English
publishDate 2022
record_format MEDLINE/PubMed
spelling pubmed-8669736 2022-01-01 Training data composition affects performance of protein structure analysis algorithms Derry, Alexander Carpenter, Kristy A. Altman, Russ B. Pac Symp Biocomput Article
2022 /pmc/articles/PMC8669736/ /pubmed/34890132 Text en https://creativecommons.org/licenses/by-nc/4.0/ Open Access chapter published by World Scientific Publishing Company and distributed under the terms of the Creative Commons Attribution Non-Commercial (CC BY-NC) 4.0 License.
title Training data composition affects performance of protein structure analysis algorithms
topic Article
url https://www.ncbi.nlm.nih.gov/pmc/articles/PMC8669736/
https://www.ncbi.nlm.nih.gov/pubmed/34890132