Reaching alignment-profile-based accuracy in predicting protein secondary and tertiary structural properties without alignment

Bibliographic Details
Main Authors: Singh, Jaspreet; Paliwal, Kuldip; Litfin, Thomas; Singh, Jaswinder; Zhou, Yaoqi
Format: Online Article Text
Language: English
Published: Nature Publishing Group UK 2022
Subjects:
Online Access: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC9085874/
https://www.ncbi.nlm.nih.gov/pubmed/35534620
http://dx.doi.org/10.1038/s41598-022-11684-w
author Singh, Jaspreet
Paliwal, Kuldip
Litfin, Thomas
Singh, Jaswinder
Zhou, Yaoqi
collection PubMed
description Protein language models have emerged as an alternative to multiple sequence alignment for enriching sequence information and improving downstream prediction of biophysical, structural, and functional properties. Here we show that SPOT-1D-LM, a method that combines traditional one-hot encoding with embeddings from two different language models (ProtTrans and ESM-1b) as input, yields a leap in accuracy over single-sequence-based techniques in predicting protein 1D secondary and tertiary structural properties, including backbone torsion angles, solvent accessibility and contact numbers, on all six test sets (TEST2018, TEST2020, Neff1-2020, CASP12-FM, CASP13-FM and CASP14-FM). More significantly, its performance is comparable to that of profile-based methods for proteins with homologous sequences. For example, the accuracies of three-state secondary structure (SS3) prediction for TEST2018 and TEST2020 proteins are 86.7% and 79.8% with SPOT-1D-LM, compared to 74.3% and 73.4% with the single-sequence-based method SPOT-1D-Single and 86.2% and 80.5% with the profile-based method SPOT-1D, respectively. For proteins without homologous sequences (Neff1-2020), SS3 accuracy is 80.41% with SPOT-1D-LM, which is 3.8% and 8.3% higher than SPOT-1D-Single and SPOT-1D, respectively. SPOT-1D-LM is expected to be useful for genome-wide analysis given its speed. Moreover, high-accuracy prediction of both secondary and tertiary structural properties, such as backbone angles and solvent accessibility, without sequence alignment suggests that highly accurate prediction of protein structures may be possible without homologous sequences, the remaining obstacle in the post-AlphaFold2 era.
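The input construction described in the abstract (one-hot encoding joined with embeddings from two language models) can be sketched as a per-residue concatenation. This is only an illustrative sketch, not the authors' code: the embedding widths (1024 for ProtTrans, 1280 for ESM-1b) are assumptions not stated in this record, `fake_embedding` is a hypothetical placeholder that returns random numbers instead of calling a real model, and the function names are invented for the example.

```python
import random

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard residues

# Assumed per-residue embedding widths; the abstract does not state them.
PROTTRANS_DIM = 1024
ESM1B_DIM = 1280


def one_hot(sequence):
    """Return one 20-dim indicator row per residue (all zeros if unknown)."""
    index = {aa: i for i, aa in enumerate(AMINO_ACIDS)}
    rows = []
    for aa in sequence:
        row = [0.0] * len(AMINO_ACIDS)
        if aa in index:
            row[index[aa]] = 1.0
        rows.append(row)
    return rows


def fake_embedding(length, dim, seed):
    """Hypothetical stand-in for a language-model embedding (random values)."""
    rng = random.Random(seed)
    return [[rng.uniform(-1.0, 1.0) for _ in range(dim)] for _ in range(length)]


def build_input(sequence, prottrans_emb, esm_emb):
    """Concatenate the three feature sets along the feature axis, row by row."""
    encoded = one_hot(sequence)
    if not (len(encoded) == len(prottrans_emb) == len(esm_emb)):
        raise ValueError("every feature set needs exactly one row per residue")
    return [o + p + e for o, p, e in zip(encoded, prottrans_emb, esm_emb)]


sequence = "MKTAYIAKQR"  # arbitrary 10-residue example
features = build_input(
    sequence,
    fake_embedding(len(sequence), PROTTRANS_DIM, seed=0),
    fake_embedding(len(sequence), ESM1B_DIM, seed=1),
)
print(len(features), len(features[0]))  # 10 2324
```

Under these assumed widths each residue is represented by 20 + 1024 + 1280 = 2324 values; a downstream network would consume this matrix in place of an alignment profile.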
format Online
Article
Text
id pubmed-9085874
institution National Center for Biotechnology Information
language English
publishDate 2022
publisher Nature Publishing Group UK
record_format MEDLINE/PubMed
spelling pubmed-9085874 2022-05-11 Sci Rep Article Nature Publishing Group UK 2022-05-09 Text en
© The Author(s) 2022. https://creativecommons.org/licenses/by/4.0/ Open Access. This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/.
title Reaching alignment-profile-based accuracy in predicting protein secondary and tertiary structural properties without alignment
topic Article
url https://www.ncbi.nlm.nih.gov/pmc/articles/PMC9085874/
https://www.ncbi.nlm.nih.gov/pubmed/35534620
http://dx.doi.org/10.1038/s41598-022-11684-w