Evaluation of large language models for discovery of gene set function

Gene set analysis is a mainstay of functional genomics, but it relies on manually curated databases of gene functions that are incomplete and unaware of biological context. Here we evaluate the ability of OpenAI’s GPT-4, a Large Language Model (LLM), to develop hypotheses about common gene functions...

Bibliographic Details
Main Authors: Hu, Mengzhou; Alkhairy, Sahar; Lee, Ingoo; Pillich, Rudolf T.; Bachelder, Robin; Ideker, Trey; Pratt, Dexter
Format: Online Article Text
Language: English
Published: Cornell University 2023
Subjects:
Online Access: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC10508824/
https://www.ncbi.nlm.nih.gov/pubmed/37731657
_version_ 1785107615672434688
author Hu, Mengzhou
Alkhairy, Sahar
Lee, Ingoo
Pillich, Rudolf T.
Bachelder, Robin
Ideker, Trey
Pratt, Dexter
author_sort Hu, Mengzhou
collection PubMed
description Gene set analysis is a mainstay of functional genomics, but it relies on manually curated databases of gene functions that are incomplete and unaware of biological context. Here we evaluate the ability of OpenAI’s GPT-4, a Large Language Model (LLM), to develop hypotheses about common gene functions from its embedded biomedical knowledge. We created a GPT-4 pipeline to label gene sets with names that summarize their consensus functions, substantiated by analysis text and citations. Benchmarking against named gene sets in the Gene Ontology, GPT-4 generated very similar names in 50% of cases, while in most remaining cases it recovered the name of a more general concept. In gene sets discovered in ‘omics data, GPT-4 names were more informative than gene set enrichment, with supporting statements and citations that largely verified in human review. The ability to rapidly synthesize common gene functions positions LLMs as valuable functional genomics assistants.
format Online
Article
Text
id pubmed-10508824
institution National Center for Biotechnology Information
language English
publishDate 2023
publisher Cornell University
record_format MEDLINE/PubMed
spelling pubmed-10508824 2023-09-20 Evaluation of large language models for discovery of gene set function Hu, Mengzhou Alkhairy, Sahar Lee, Ingoo Pillich, Rudolf T. Bachelder, Robin Ideker, Trey Pratt, Dexter ArXiv Article Gene set analysis is a mainstay of functional genomics, but it relies on manually curated databases of gene functions that are incomplete and unaware of biological context. Here we evaluate the ability of OpenAI’s GPT-4, a Large Language Model (LLM), to develop hypotheses about common gene functions from its embedded biomedical knowledge. We created a GPT-4 pipeline to label gene sets with names that summarize their consensus functions, substantiated by analysis text and citations. Benchmarking against named gene sets in the Gene Ontology, GPT-4 generated very similar names in 50% of cases, while in most remaining cases it recovered the name of a more general concept. In gene sets discovered in ‘omics data, GPT-4 names were more informative than gene set enrichment, with supporting statements and citations that largely verified in human review. The ability to rapidly synthesize common gene functions positions LLMs as valuable functional genomics assistants. Cornell University 2023-09-07 /pmc/articles/PMC10508824/ /pubmed/37731657 Text en https://creativecommons.org/licenses/by/4.0/ This work is licensed under a Creative Commons Attribution 4.0 International License (https://creativecommons.org/licenses/by/4.0/), which allows reusers to distribute, remix, adapt, and build upon the material in any medium or format, so long as attribution is given to the creator. The license allows for commercial use.
title Evaluation of large language models for discovery of gene set function
topic Article
url https://www.ncbi.nlm.nih.gov/pmc/articles/PMC10508824/
https://www.ncbi.nlm.nih.gov/pubmed/37731657