
Large Language Models for Epidemiological Research via Automated Machine Learning: Case Study Using Data From the British National Child Development Study

BACKGROUND: Large language models have had a huge impact on natural language processing (NLP) in recent years. However, their application in epidemiological research is still limited to the analysis of electronic health records and social media data. OBJECTIVES: To demonstrate the potential of NLP b...

Bibliographic Details
Main authors:	Wibaek, Rasmus; Andersen, Gregers Stig; Dahm, Christina C; Witte, Daniel R; Hulman, Adam
Format:	Online Article Text
Language:	English
Published:	JMIR Publications Inc 2023
Subjects:	Original Paper
Online access:	https://www.ncbi.nlm.nih.gov/pmc/articles/PMC10547934/ https://www.ncbi.nlm.nih.gov/pubmed/37787655 http://dx.doi.org/10.2196/43638

_version_	1785115165371400192
author	Wibaek, Rasmus; Andersen, Gregers Stig; Dahm, Christina C; Witte, Daniel R; Hulman, Adam
author_facet	Wibaek, Rasmus; Andersen, Gregers Stig; Dahm, Christina C; Witte, Daniel R; Hulman, Adam
author_sort	Wibaek, Rasmus
collection	PubMed
description	BACKGROUND: Large language models have had a huge impact on natural language processing (NLP) in recent years. However, their application in epidemiological research is still limited to the analysis of electronic health records and social media data. OBJECTIVES: To demonstrate the potential of NLP beyond these domains, we aimed to develop prediction models based on texts collected from an epidemiological cohort and compare their performance to classical regression methods. METHODS: We used data from the British National Child Development Study, where 10,567 children aged 11 years wrote essays about how they imagined themselves as 25-year-olds. Overall, 15% of the data set was set aside as a test set for performance evaluation. Pretrained language models were fine-tuned using AutoTrain (Hugging Face) to predict current reading comprehension score (range: 0-35) and future BMI and physical activity (active vs inactive) at the age of 33 years. We then compared their predictive performance (accuracy or discrimination) with linear and logistic regression models, including demographic and lifestyle factors of the parents and children from birth to the age of 11 years as predictors. RESULTS: NLP clearly outperformed linear regression when predicting reading comprehension scores (root mean square error: 3.89, 95% CI 3.74-4.05 for NLP vs 4.14, 95% CI 3.98-4.30 and 5.41, 95% CI 5.23-5.58 for regression models with and without general ability score as a predictor, respectively). Predictive performance for physical activity was similarly poor for the 2 methods (area under the receiver operating characteristic curve: 0.55, 95% CI 0.52-0.60 for both) but was slightly better than random assignment, whereas linear regression clearly outperformed the NLP approach when predicting BMI (root mean square error: 4.38, 95% CI 4.02-4.74 for NLP vs 3.85, 95% CI 3.54-4.16 for regression). 
The NLP approach did not perform better than simply assigning the mean BMI from the training set as a predictor. CONCLUSIONS: Our study demonstrated the potential of using large language models on text collected from epidemiological studies. The performance of the approach appeared to depend on how directly the topic of the text was related to the outcome. Open-ended questions specifically designed to capture certain health concepts and lived experiences in combination with NLP methods should receive more attention in future epidemiological studies.
format	Online Article Text
id	pubmed-10547934
institution	National Center for Biotechnology Information
language	English
publishDate	2023
publisher	JMIR Publications Inc
record_format	MEDLINE/PubMed
spelling	pubmed-10547934 2023-10-05 JMIR Med Inform Original Paper JMIR Publications Inc 2023-09-19 /pmc/articles/PMC10547934/ /pubmed/37787655 http://dx.doi.org/10.2196/43638 Text en © Rasmus Wibaek, Gregers Stig Andersen, Christina C Dahm, Daniel R Witte, Adam Hulman. Originally published in JMIR Medical Informatics (https://medinform.jmir.org), 19.9.2023. https://creativecommons.org/licenses/by/4.0/ This is an open-access article distributed under the terms of the Creative Commons Attribution License (https://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work, first published in JMIR Medical Informatics, is properly cited. The complete bibliographic information, a link to the original publication on https://medinform.jmir.org/, as well as this copyright and license information must be included.
title	Large Language Models for Epidemiological Research via Automated Machine Learning: Case Study Using Data From the British National Child Development Study
topic	Original Paper
url	https://www.ncbi.nlm.nih.gov/pmc/articles/PMC10547934/ https://www.ncbi.nlm.nih.gov/pubmed/37787655 http://dx.doi.org/10.2196/43638
