
Improved global protein homolog detection with major gains in function identification


Bibliographic Details
Main Authors: Kilinc, Mesih, Jia, Kejue, Jernigan, Robert L.
Format: Online Article Text
Language: English
Published: National Academy of Sciences 2023
Subjects: Biological Sciences
Online Access: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC9992864/
https://www.ncbi.nlm.nih.gov/pubmed/36827259
http://dx.doi.org/10.1073/pnas.2211823120
_version_ 1784902412845187072
author Kilinc, Mesih
Jia, Kejue
Jernigan, Robert L.
collection PubMed
description There are several hundred million protein sequences, but the relationships among them are not fully available from existing homolog detection methods. There is an essential need for an improved method to push homolog detection to lower levels of sequence identity. The method used here relies on a language model to represent proteins numerically in a matrix (an embedding) and uses discrete cosine transforms to compress the data to extract the most essential part, significantly reducing the data size. This PRotein Ortholog Search Tool (PROST) is significantly faster with linear runtimes, and most importantly, computes the distances between pairs of protein sequences to yield homologs at significantly lower levels of sequence identity than previously. The extent of allosteric effects in proteins points out the importance of global aspects of structure and sequence. PROST excels at global homology detection but not at detecting local homologs. Results are validated by strong similarities between the corresponding pairs of structures. The number of remote homologs detected increased significantly and pushes the effective sequence matches more deeply into the twilight zone. Human protein sequences presently having no assigned function now find significant numbers of putative homologs for 93% of cases and structurally verified assigned functions for 76.4% of these cases. The data compression enables massive searches for homologs with short search times while yielding significant gains in the numbers of remote homologs detected. The method is sufficiently efficient to permit whole-genome/proteome comparisons. The PROST web server is accessible at https://mesihk.github.io/prost.
format Online
Article
Text
id pubmed-9992864
institution National Center for Biotechnology Information
language English
publishDate 2023
publisher National Academy of Sciences
record_format MEDLINE/PubMed
spelling pubmed-9992864 2023-03-09 Improved global protein homolog detection with major gains in function identification Kilinc, Mesih Jia, Kejue Jernigan, Robert L. Proc Natl Acad Sci U S A Biological Sciences
National Academy of Sciences 2023-02-24 2023-02-28 /pmc/articles/PMC9992864/ /pubmed/36827259 http://dx.doi.org/10.1073/pnas.2211823120 Text en Copyright © 2023 the Author(s). Published by PNAS. This open access article is distributed under Creative Commons Attribution License 4.0 (CC BY) (https://creativecommons.org/licenses/by/4.0/).
title Improved global protein homolog detection with major gains in function identification
topic Biological Sciences
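The description field above outlines a pipeline: represent each protein as an embedding matrix, compress it with discrete cosine transforms, and compare proteins by a distance between the compressed representations. The following is a minimal sketch of that general idea only, not the PROST implementation. The random matrices stand in for language-model embeddings, and the function names, the retained block size (`keep_rows`, `keep_cols`), and the mean absolute difference used as the distance are all assumptions made for illustration.

```python
import numpy as np


def dct2_matrix(n: int) -> np.ndarray:
    """Return the orthonormal DCT-II basis as an (n, n) matrix."""
    k = np.arange(n)[:, None]   # frequency index (rows)
    i = np.arange(n)[None, :]   # sample index (columns)
    basis = np.cos(np.pi * (2 * i + 1) * k / (2 * n))
    basis[0] *= np.sqrt(1.0 / n)
    basis[1:] *= np.sqrt(2.0 / n)
    return basis


def compress(embedding: np.ndarray, keep_rows: int = 4, keep_cols: int = 8) -> np.ndarray:
    """Apply a 2D DCT and keep only the low-frequency corner, flattened.

    `embedding` has shape (sequence_length, embedding_dim). The output has
    a fixed size (keep_rows * keep_cols) regardless of sequence length.
    The block size is a hypothetical choice, not a value from the paper.
    """
    length, dim = embedding.shape
    coeffs = dct2_matrix(length) @ embedding @ dct2_matrix(dim).T
    block = np.zeros((keep_rows, keep_cols))
    r, c = min(keep_rows, length), min(keep_cols, dim)
    block[:r, :c] = coeffs[:r, :c]  # zero-pad if the input is smaller
    return block.ravel()


def distance(a: np.ndarray, b: np.ndarray) -> float:
    """Mean absolute difference between two compressed vectors (assumed metric)."""
    return float(np.abs(a - b).mean())


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    # Placeholder "embeddings" for two proteins of different lengths.
    protein_a = rng.normal(size=(120, 32))
    protein_b = rng.normal(size=(75, 32))
    vec_a, vec_b = compress(protein_a), compress(protein_b)
    print(vec_a.shape, vec_b.shape)
    print(distance(vec_a, vec_b))
```

Because every protein is reduced to a vector of the same small size, comparing a query against a database costs one fixed-size distance computation per entry. That is one plausible reading of the linear runtimes mentioned in the description.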