GenePT: A Simple But Hard-to-Beat Foundation Model for Genes and Cells Built From ChatGPT

There has been significant recent progress in leveraging large-scale gene expression data to develop foundation models for single-cell transcriptomes such as Geneformer [1], scGPT [2], and scBERT [3]. These models infer gene functions and interrelations from the gene expression profiles of millions...

Bibliographic Details
Main authors: Chen, Yiqun T., Zou, James
Format: Online Article Text
Language: English
Published: Cold Spring Harbor Laboratory 2023
Subjects:
Online access: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC10614824/
https://www.ncbi.nlm.nih.gov/pubmed/37905130
http://dx.doi.org/10.1101/2023.10.16.562533
author Chen, Yiqun T.
Zou, James
collection PubMed
description There has been significant recent progress in leveraging large-scale gene expression data to develop foundation models for single-cell transcriptomes such as Geneformer [1], scGPT [2], and scBERT [3]. These models infer gene functions and interrelations from the gene expression profiles of millions of cells, which requires extensive data curation and resource-intensive training. Here, we explore a much simpler alternative by leveraging ChatGPT embeddings of genes based on literature. Our proposal, GenePT, uses NCBI text descriptions of individual genes with GPT-3.5 to generate gene embeddings. From there, GenePT generates single-cell embeddings in two ways: (i) by averaging the gene embeddings, weighted by each gene’s expression level; or (ii) by creating a sentence embedding for each cell, using gene names ordered by the expression level. Without the need for dataset curation and additional pretraining, GenePT is efficient and easy to use. On many downstream tasks used to evaluate recent single-cell foundation models — e.g., classifying gene properties and cell types — GenePT achieves comparable, and often better, performance than Geneformer and other methods. GenePT demonstrates that large language model embedding of literature is a simple and effective path for biological foundation models.
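The two cell-embedding strategies named in the description can be illustrated with a minimal sketch. This is not the authors' code: the function names, the toy gene list, and the two-dimensional vectors are hypothetical stand-ins for real text-derived gene embeddings, and normalizing the weights by each cell's total expression and dropping unexpressed genes are assumptions made here. For strategy (ii) only the construction of the ordered gene-name sentence is shown; passing that sentence to a text embedding model is left out.

```python
import numpy as np


def weighted_cell_embeddings(expression, gene_embeddings):
    """Strategy (i): expression-weighted average of gene embeddings.

    expression: (n_cells, n_genes) non-negative expression values.
    gene_embeddings: (n_genes, dim) one embedding vector per gene.
    Returns an (n_cells, dim) array of cell embeddings.
    """
    expression = np.asarray(expression, dtype=float)
    gene_embeddings = np.asarray(gene_embeddings, dtype=float)
    totals = expression.sum(axis=1, keepdims=True)
    # Assumption: weights are normalized to sum to 1 per cell;
    # cells with zero total expression get an all-zero embedding.
    weights = np.divide(
        expression, totals, out=np.zeros_like(expression), where=totals > 0
    )
    return weights @ gene_embeddings


def cell_sentence(expression_row, gene_names):
    """Strategy (ii), first half: gene names ordered by decreasing expression.

    Assumption: genes with zero expression are dropped from the sentence.
    """
    values = np.asarray(expression_row, dtype=float)
    order = np.argsort(-values, kind="stable")  # ties keep original gene order
    return " ".join(gene_names[i] for i in order if values[i] > 0)


# Toy data (hypothetical): three genes, two cells, 2-dimensional embeddings.
gene_names = ["CD3E", "MS4A1", "LYZ"]
gene_embeddings = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
expression = [[2, 0, 2], [0, 3, 1]]

cell_vectors = weighted_cell_embeddings(expression, gene_embeddings)
sentences = [cell_sentence(row, gene_names) for row in expression]
```

With real data, `gene_embeddings` would hold the text-derived vector for each gene, and each string in `sentences` would be sent to the same kind of embedding model to obtain the cell's sentence embedding.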
format Online
Article
Text
id pubmed-10614824
institution National Center for Biotechnology Information
language English
publishDate 2023
publisher Cold Spring Harbor Laboratory
record_format MEDLINE/PubMed
spelling pubmed-10614824 2023-10-31 bioRxiv Article Cold Spring Harbor Laboratory 2023-10-19 /pmc/articles/PMC10614824/ /pubmed/37905130 http://dx.doi.org/10.1101/2023.10.16.562533 Text en https://creativecommons.org/licenses/by-nc-nd/4.0/ This work is licensed under a Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International License (https://creativecommons.org/licenses/by-nc-nd/4.0/), which allows reusers to copy and distribute the material in any medium or format in unadapted form only, for noncommercial purposes only, and only so long as attribution is given to the creator.
title GenePT: A Simple But Hard-to-Beat Foundation Model for Genes and Cells Built From ChatGPT
topic Article