
Large language models improve annotation of viral proteins

Viral sequences are poorly annotated in environmental samples, a major roadblock to understanding how viruses influence microbial community structure. Current annotation approaches rely on alignment-based sequence homology methods, which are limited by available viral sequences and sequence diverge...


Bibliographic Details
Main authors: Flamholz, Zachary N., Biller, Steve J., Kelly, Libusha
Format: Online Article Text
Language: English
Published: American Journal Experts 2023
Subjects:
Online access: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC10187409/
https://www.ncbi.nlm.nih.gov/pubmed/37205395
http://dx.doi.org/10.21203/rs.3.rs-2852098/v1
_version_ 1785042731296358400
author Flamholz, Zachary N.
Biller, Steve J.
Kelly, Libusha
author_facet Flamholz, Zachary N.
Biller, Steve J.
Kelly, Libusha
author_sort Flamholz, Zachary N.
collection PubMed
description Viral sequences are poorly annotated in environmental samples, a major roadblock to understanding how viruses influence microbial community structure. Current annotation approaches rely on alignment-based sequence homology methods, which are limited by available viral sequences and sequence divergence in viral proteins. Here, we show that protein language model representations capture viral protein function beyond the limits of remote sequence homology by targeting two axes of viral sequence annotation: systematic labeling of protein families and function identification for biologic discovery. Protein language model representations capture protein functional properties specific to viruses and expand the annotated fraction of ocean virome viral protein sequences by 37%. Among unannotated viral protein families, we identify a novel DNA editing protein family that defines a new mobile element in marine picocyanobacteria. Protein language models thus significantly enhance remote homology detection of viral proteins and can be utilized to enable new biological discovery across diverse functional categories.
format Online
Article
Text
id pubmed-10187409
institution National Center for Biotechnology Information
language English
publishDate 2023
publisher American Journal Experts
record_format MEDLINE/PubMed
spelling pubmed-10187409 2023-05-17 Large language models improve annotation of viral proteins Flamholz, Zachary N. Biller, Steve J. Kelly, Libusha Res Sq Article Viral sequences are poorly annotated in environmental samples, a major roadblock to understanding how viruses influence microbial community structure. Current annotation approaches rely on alignment-based sequence homology methods, which are limited by available viral sequences and sequence divergence in viral proteins. Here, we show that protein language model representations capture viral protein function beyond the limits of remote sequence homology by targeting two axes of viral sequence annotation: systematic labeling of protein families and function identification for biologic discovery. Protein language model representations capture protein functional properties specific to viruses and expand the annotated fraction of ocean virome viral protein sequences by 37%. Among unannotated viral protein families, we identify a novel DNA editing protein family that defines a new mobile element in marine picocyanobacteria. Protein language models thus significantly enhance remote homology detection of viral proteins and can be utilized to enable new biological discovery across diverse functional categories. American Journal Experts 2023-05-02 /pmc/articles/PMC10187409/ /pubmed/37205395 http://dx.doi.org/10.21203/rs.3.rs-2852098/v1 Text en https://creativecommons.org/licenses/by/4.0/ This work is licensed under a Creative Commons Attribution 4.0 International License (https://creativecommons.org/licenses/by/4.0/), which allows reusers to distribute, remix, adapt, and build upon the material in any medium or format, so long as attribution is given to the creator. The license allows for commercial use. https://creativecommons.org/licenses/by/4.0/ License: This work is licensed under a Creative Commons Attribution 4.0 International License. Read Full License (https://creativecommons.org/licenses/by/4.0/)
spellingShingle Article
Flamholz, Zachary N.
Biller, Steve J.
Kelly, Libusha
Large language models improve annotation of viral proteins
title Large language models improve annotation of viral proteins
title_full Large language models improve annotation of viral proteins
title_fullStr Large language models improve annotation of viral proteins
title_full_unstemmed Large language models improve annotation of viral proteins
title_short Large language models improve annotation of viral proteins
title_sort large language models improve annotation of viral proteins
topic Article
url https://www.ncbi.nlm.nih.gov/pmc/articles/PMC10187409/
https://www.ncbi.nlm.nih.gov/pubmed/37205395
http://dx.doi.org/10.21203/rs.3.rs-2852098/v1
work_keys_str_mv AT flamholzzacharyn largelanguagemodelsimproveannotationofviralproteins
AT billerstevej largelanguagemodelsimproveannotationofviralproteins
AT kellylibusha largelanguagemodelsimproveannotationofviralproteins