
Assessing structure and disorder prediction tools for de novo emerged proteins in the age of machine learning

Background: De novo protein coding genes emerge from scratch in the non-coding regions of the genome and have, by definition, no homology to other genes. Therefore, their encoded de novo proteins belong to the so-called "dark protein space". So far, only four de novo protein structures ha...


Bibliographic Details
Main Authors:	Aubel, Margaux; Eicholt, Lars; Bornberg-Bauer, Erich
Format:	Online Article Text
Language:	English
Published:	F1000 Research Limited 2023
Subjects:	Research Article
Online Access:	https://www.ncbi.nlm.nih.gov/pmc/articles/PMC10126731/ https://www.ncbi.nlm.nih.gov/pubmed/37113259 http://dx.doi.org/10.12688/f1000research.130443.1

author	Aubel, Margaux; Eicholt, Lars; Bornberg-Bauer, Erich
author_facet	Aubel, Margaux; Eicholt, Lars; Bornberg-Bauer, Erich
author_sort	Aubel, Margaux
collection	PubMed
description	Background: De novo protein-coding genes emerge from scratch in the non-coding regions of the genome and have, by definition, no homology to other genes. Therefore, their encoded de novo proteins belong to the so-called "dark protein space". So far, only four de novo protein structures have been experimentally approximated. Low homology, presumed high disorder and limited structures result in low-confidence structural predictions for de novo proteins in most cases. Here, we look at the most widely used structure and disorder predictors and assess their applicability for de novo emerged proteins. Since AlphaFold2 is based on the generation of multiple sequence alignments and was trained on solved structures of largely conserved and globular proteins, its performance on de novo proteins remains unknown. More recently, natural language models of proteins have been used for alignment-free structure predictions, potentially making them more suitable for de novo proteins than AlphaFold2. Methods: We applied different disorder predictors (IUPred3 short/long, flDPnn) and structure predictors, AlphaFold2 on the one hand and language-based models (OmegaFold, ESMFold, RGN2) on the other hand, to four de novo proteins with experimental evidence on structure. We compared the resulting predictions between the different predictors as well as to the existing experimental evidence. Results: Results from IUPred, the most widely used disorder predictor, depend heavily on the choice of parameters and differ significantly from those of flDPnn, which was recently found to outperform most other predictors in a comparative assessment study. Similarly, different structure predictors yielded varying results and confidence scores for de novo proteins.
Conclusions: We suggest that, while in some cases protein language model based approaches might be more accurate than AlphaFold2, the structure prediction of de novo emerged proteins remains a difficult task for any predictor, be it disorder or structure.
format	Online Article Text
id	pubmed-10126731
institution	National Center for Biotechnology Information
language	English
publishDate	2023
publisher	F1000 Research Limited
record_format	MEDLINE/PubMed
spelling	pubmed-10126731 2023-04-26 Assessing structure and disorder prediction tools for de novo emerged proteins in the age of machine learning Aubel, Margaux Eicholt, Lars Bornberg-Bauer, Erich F1000Res Research Article Background: De novo protein coding genes emerge from scratch in the non-coding regions of the genome and have, per definition, no homology to other genes. Therefore, their encoded de novo proteins belong to the so-called "dark protein space". So far, only four de novo protein structures have been experimentally approximated. Low homology, presumed high disorder and limited structures result in low confidence structural predictions for de novo proteins in most cases. Here, we look at the most widely used structure and disorder predictors and assess their applicability for de novo emerged proteins. Since AlphaFold2 is based on the generation of multiple sequence alignments and was trained on solved structures of largely conserved and globular proteins, its performance on de novo proteins remains unknown. More recently, natural language models of proteins have been used for alignment-free structure predictions, potentially making them more suitable for de novo proteins than AlphaFold2. Methods: We applied different disorder predictors (IUPred3 short/long, flDPnn) and structure predictors, AlphaFold2 on the one hand and language-based models (Omegafold, ESMfold, RGN2) on the other hand, to four de novo proteins with experimental evidence on structure. We compared the resulting predictions between the different predictors as well as to the existing experimental evidence. Results: Results from IUPred, the most widely used disorder predictor, depend heavily on the choice of parameters and differ significantly from flDPnn which has been found to outperform most other predictors in a comparative assessment study recently. Similarly, different structure predictors yielded varying results and confidence scores for de novo proteins.
Conclusions: We suggest that, while in some cases protein language model based approaches might be more accurate than AlphaFold2, the structure prediction of de novo emerged proteins remains a difficult task for any predictor, be it disorder or structure. F1000 Research Limited 2023-03-29 /pmc/articles/PMC10126731/ /pubmed/37113259 http://dx.doi.org/10.12688/f1000research.130443.1 Text en Copyright: © 2023 Aubel M et al. https://creativecommons.org/licenses/by/4.0/ This is an open access article distributed under the terms of the Creative Commons Attribution Licence, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.
title	Assessing structure and disorder prediction tools for de novo emerged proteins in the age of machine learning
topic	Research Article
url	https://www.ncbi.nlm.nih.gov/pmc/articles/PMC10126731/ https://www.ncbi.nlm.nih.gov/pubmed/37113259 http://dx.doi.org/10.12688/f1000research.130443.1

