
Harnessing large language models (LLMs) for candidate gene prioritization and selection


Bibliographic Details
Main Authors: Toufiq, Mohammed, Rinchai, Darawan, Bettacchioli, Eleonore, Kabeer, Basirudeen Syed Ahamed, Khan, Taushif, Subba, Bishesh, White, Olivia, Yurieva, Marina, George, Joshy, Jourde-Chiche, Noemie, Chiche, Laurent, Palucka, Karolina, Chaussabel, Damien
Format: Online Article Text
Language: English
Published: BioMed Central 2023
Subjects:
Online Access: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC10580627/
https://www.ncbi.nlm.nih.gov/pubmed/37845713
http://dx.doi.org/10.1186/s12967-023-04576-8
author Toufiq, Mohammed
Rinchai, Darawan
Bettacchioli, Eleonore
Kabeer, Basirudeen Syed Ahamed
Khan, Taushif
Subba, Bishesh
White, Olivia
Yurieva, Marina
George, Joshy
Jourde-Chiche, Noemie
Chiche, Laurent
Palucka, Karolina
Chaussabel, Damien
author_sort Toufiq, Mohammed
collection PubMed
description BACKGROUND: Feature selection is a critical step for translating advances afforded by systems-scale molecular profiling into actionable clinical insights. While data-driven methods are commonly utilized for selecting candidate genes, knowledge-driven methods must contend with the challenge of efficiently sifting through extensive volumes of biomedical information. This work aimed to assess the utility of large language models (LLMs) for knowledge-driven gene prioritization and selection. METHODS: In this proof of concept, we focused on 11 blood transcriptional modules associated with an Erythroid cells signature. We evaluated four leading LLMs across multiple tasks. Next, we established a workflow leveraging LLMs. The steps consisted of: (1) Selecting one of the 11 modules; (2) Identifying functional convergences among constituent genes using the LLMs; (3) Scoring candidate genes across six criteria capturing the gene’s biological and clinical relevance; (4) Prioritizing candidate genes and summarizing justifications; (5) Fact-checking justifications and identifying supporting references; (6) Selecting a top candidate gene based on validated scoring justifications; and (7) Factoring in transcriptome profiling data to finalize the selection of the top candidate gene. RESULTS: Of the four LLMs evaluated, OpenAI's GPT-4 and Anthropic's Claude demonstrated the best performance and were chosen for the implementation of the candidate gene prioritization and selection workflow. This workflow was run in parallel for each of the 11 erythroid cell modules by participants in a data mining workshop. Module M9.2 served as an illustrative use case. The 30 candidate genes forming this module were assessed, and the top five scoring genes were identified as BCL2L1, ALAS2, SLC4A1, CA1, and FECH. Researchers carefully fact-checked the summarized scoring justifications, after which the LLMs were prompted to select a top candidate based on this information. GPT-4 initially chose BCL2L1, while Claude selected ALAS2. When transcriptional profiling data from three reference datasets were provided for additional context, GPT-4 revised its initial choice to ALAS2, whereas Claude reaffirmed its original selection for this module. CONCLUSIONS: Taken together, our findings highlight the ability of LLMs to prioritize candidate genes with minimal human intervention. This suggests the potential of this technology to boost productivity, especially for tasks that require leveraging extensive biomedical knowledge. SUPPLEMENTARY INFORMATION: The online version contains supplementary material available at 10.1186/s12967-023-04576-8.
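The scoring and prioritization steps summarized in the METHODS section (steps 2 through 4) can be pictured as a simple scripted loop around an LLM. The Python sketch below is an illustration only, not the authors' implementation: `query_llm` is a hypothetical stand-in for a GPT-4 or Claude API call, and the six criterion names are placeholders, since the abstract does not list the actual criteria. Only the module identifier (M9.2), the erythroid cell signature, and the five top-scoring gene symbols are taken from the record above.

```python
"""Minimal sketch of an LLM-driven candidate gene scoring and ranking loop.

NOT the authors' code. `query_llm` is a hypothetical stand-in for a call to
GPT-4 or Claude; the criterion names are placeholders for the paper's six
biological/clinical relevance criteria, which are not named in the abstract.
"""

from dataclasses import dataclass, field

# Placeholder criteria; the workflow scores each gene on six relevance criteria.
CRITERIA = [
    "criterion_1", "criterion_2", "criterion_3",
    "criterion_4", "criterion_5", "criterion_6",
]


def query_llm(prompt: str) -> str:
    """Hypothetical LLM call. Replace with a real GPT-4 / Claude client."""
    # For illustration only: always return a fixed mid-range score.
    return "5"


@dataclass
class GeneAssessment:
    symbol: str
    scores: dict = field(default_factory=dict)  # criterion -> 0-10 score
    justification: str = ""

    @property
    def total(self) -> int:
        return sum(self.scores.values())


def score_module(module_id: str, genes: list[str], signature: str) -> list[GeneAssessment]:
    """Steps 2-4 of the workflow: score each constituent gene and rank them."""
    assessments = []
    for gene in genes:
        assessment = GeneAssessment(symbol=gene)
        for criterion in CRITERIA:
            prompt = (
                f"Module {module_id} converges on a {signature} signature. "
                f"On a 0-10 scale, score the gene {gene} for: {criterion}. "
                "Reply with a single integer."
            )
            assessment.scores[criterion] = int(query_llm(prompt))
        assessment.justification = query_llm(
            f"Summarize, with supporting references, why {gene} received "
            f"these scores: {assessment.scores}"
        )
        assessments.append(assessment)
    # Rank by total score; the top entries go on to fact-checking (step 5).
    return sorted(assessments, key=lambda a: a.total, reverse=True)


if __name__ == "__main__":
    # Module M9.2 and its top-scoring genes are taken from the abstract.
    ranked = score_module(
        "M9.2",
        ["BCL2L1", "ALAS2", "SLC4A1", "CA1", "FECH"],
        signature="erythroid cell",
    )
    for a in ranked[:5]:
        print(a.symbol, a.total)
```

Steps 5 through 7 of the workflow (fact-checking the justifications, selecting the top candidate, and factoring in transcriptome profiling data) involve human review and external datasets, so they would sit outside a loop like this.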
format Online
Article
Text
id pubmed-10580627
institution National Center for Biotechnology Information
language English
publishDate 2023
publisher BioMed Central
record_format MEDLINE/PubMed
spelling pubmed-10580627 2023-10-18 Harnessing large language models (LLMs) for candidate gene prioritization and selection. J Transl Med (Research).
BioMed Central 2023-10-16 /pmc/articles/PMC10580627/ /pubmed/37845713 http://dx.doi.org/10.1186/s12967-023-04576-8 Text en © The Author(s) 2023. Open Access: this article is licensed under a Creative Commons Attribution 4.0 International License (https://creativecommons.org/licenses/by/4.0/), which permits use, sharing, adaptation, distribution and reproduction in any medium or format, provided appropriate credit is given to the original author(s) and the source. The Creative Commons Public Domain Dedication waiver (https://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated in a credit line to the data.
title Harnessing large language models (LLMs) for candidate gene prioritization and selection
topic Research
url https://www.ncbi.nlm.nih.gov/pmc/articles/PMC10580627/
https://www.ncbi.nlm.nih.gov/pubmed/37845713
http://dx.doi.org/10.1186/s12967-023-04576-8