Training data composition affects performance of protein structure analysis algorithms

The three-dimensional structures of proteins are crucial for understanding their molecular mechanisms and interactions. Machine learning algorithms that are able to learn accurate representations of protein structures are therefore poised to play a key role in protein engineering and drug developmen...

Bibliographic Details
Main Authors:	Derry, Alexander; Carpenter, Kristy A.; Altman, Russ B.
Format:	Online Article Text
Language:	English
Published:	2022
Subjects:	Article
Online Access:	https://www.ncbi.nlm.nih.gov/pmc/articles/PMC8669736/ https://www.ncbi.nlm.nih.gov/pubmed/34890132

author	Derry, Alexander; Carpenter, Kristy A.; Altman, Russ B.
collection	PubMed
description	The three-dimensional structures of proteins are crucial for understanding their molecular mechanisms and interactions. Machine learning algorithms that are able to learn accurate representations of protein structures are therefore poised to play a key role in protein engineering and drug development. The accuracy of such models in deployment is directly influenced by training data quality. The use of different experimental methods for protein structure determination may introduce bias into the training data. In this work, we evaluate the magnitude of this effect across three distinct tasks: estimation of model accuracy, protein sequence design, and catalytic residue prediction. Most protein structures are derived from X-ray crystallography, nuclear magnetic resonance (NMR), or cryo-electron microscopy (cryo-EM); we trained each model on datasets consisting of either all three structure types or of only X-ray data. We find that across these tasks, models consistently perform worse on test sets derived from NMR and cryo-EM than they do on test sets of structures derived from X-ray crystallography, but that the difference can be mitigated when NMR and cryo-EM structures are included in the training set. Importantly, we show that including all three types of structures in the training set does not degrade test performance on X-ray structures, and in some cases even increases it. Finally, we examine the relationship between model performance and the biophysical properties of each method, and recommend that the biochemistry of the task of interest should be considered when composing training sets.
format	Online Article Text
id	pubmed-8669736
institution	National Center for Biotechnology Information
language	English
publishDate	2022
record_format	MEDLINE/PubMed
spelling	pubmed-8669736 2022-01-01 Training data composition affects performance of protein structure analysis algorithms. Derry, Alexander; Carpenter, Kristy A.; Altman, Russ B. Pac Symp Biocomput. Article. 2022. /pmc/articles/PMC8669736/ /pubmed/34890132 Text en. https://creativecommons.org/licenses/by-nc/4.0/ Open Access chapter published by World Scientific Publishing Company and distributed under the terms of the Creative Commons Attribution Non-Commercial (CC BY-NC) 4.0 License.
title	Training data composition affects performance of protein structure analysis algorithms
topic	Article
