Training data composition affects performance of protein structure analysis algorithms
The three-dimensional structures of proteins are crucial for understanding their molecular mechanisms and interactions. Machine learning algorithms that are able to learn accurate representations of protein structures are therefore poised to play a key role in protein engineering and drug developmen...
Main authors: | Derry, Alexander; Carpenter, Kristy A.; Altman, Russ B. |
---|---|
Format: | Online Article Text |
Language: | English |
Published: | 2022 |
Subjects: | |
Online access: | https://www.ncbi.nlm.nih.gov/pmc/articles/PMC8669736/ https://www.ncbi.nlm.nih.gov/pubmed/34890132 |
_version_ | 1784614838688808960 |
---|---|
author | Derry, Alexander Carpenter, Kristy A. Altman, Russ B. |
author_facet | Derry, Alexander Carpenter, Kristy A. Altman, Russ B. |
author_sort | Derry, Alexander |
collection | PubMed |
description | The three-dimensional structures of proteins are crucial for understanding their molecular mechanisms and interactions. Machine learning algorithms that are able to learn accurate representations of protein structures are therefore poised to play a key role in protein engineering and drug development. The accuracy of such models in deployment is directly influenced by training data quality. The use of different experimental methods for protein structure determination may introduce bias into the training data. In this work, we evaluate the magnitude of this effect across three distinct tasks: estimation of model accuracy, protein sequence design, and catalytic residue prediction. Most protein structures are derived from X-ray crystallography, nuclear magnetic resonance (NMR), or cryo-electron microscopy (cryo-EM); we trained each model on datasets consisting of either all three structure types or of only X-ray data. We find that across these tasks, models consistently perform worse on test sets derived from NMR and cryo-EM than they do on test sets of structures derived from X-ray crystallography, but that the difference can be mitigated when NMR and cryo-EM structures are included in the training set. Importantly, we show that including all three types of structures in the training set does not degrade test performance on X-ray structures, and in some cases even increases it. Finally, we examine the relationship between model performance and the biophysical properties of each method, and recommend that the biochemistry of the task of interest should be considered when composing training sets. |
format | Online Article Text |
id | pubmed-8669736 |
institution | National Center for Biotechnology Information |
language | English |
publishDate | 2022 |
record_format | MEDLINE/PubMed |
spelling | pubmed-8669736 2022-01-01 Training data composition affects performance of protein structure analysis algorithms Derry, Alexander Carpenter, Kristy A. Altman, Russ B. Pac Symp Biocomput Article 2022 /pmc/articles/PMC8669736/ /pubmed/34890132 Text en https://creativecommons.org/licenses/by-nc/4.0/ Open Access chapter published by World Scientific Publishing Company and distributed under the terms of the Creative Commons Attribution Non-Commercial (CC BY-NC) 4.0 License. |
spellingShingle | Article Derry, Alexander Carpenter, Kristy A. Altman, Russ B. Training data composition affects performance of protein structure analysis algorithms |
title | Training data composition affects performance of protein structure analysis algorithms |
title_full | Training data composition affects performance of protein structure analysis algorithms |
title_fullStr | Training data composition affects performance of protein structure analysis algorithms |
title_full_unstemmed | Training data composition affects performance of protein structure analysis algorithms |
title_short | Training data composition affects performance of protein structure analysis algorithms |
title_sort | training data composition affects performance of protein structure analysis algorithms |
topic | Article |
url | https://www.ncbi.nlm.nih.gov/pmc/articles/PMC8669736/ https://www.ncbi.nlm.nih.gov/pubmed/34890132 |
work_keys_str_mv | AT derryalexander trainingdatacompositionaffectsperformanceofproteinstructureanalysisalgorithms AT carpenterkristya trainingdatacompositionaffectsperformanceofproteinstructureanalysisalgorithms AT altmanrussb trainingdatacompositionaffectsperformanceofproteinstructureanalysisalgorithms |