
Applying Artificial Intelligence Methods for the Estimation of Disease Incidence: The Utility of Language Models

Background: AI-driven digital health tools often rely on estimates of disease incidence or prevalence, but obtaining these estimates is costly and time-consuming. We explored the use of machine learning models that leverage contextual information about diseases from unstructured text, to estimate di...
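As a rough illustration of the approach the abstract describes (disease and country names encoded as embeddings, then a neural network trained on them to predict incidence), the following Python sketch may help. It is not the authors' implementation. `toy_embedding` is a hypothetical stand-in for a real language model such as BioBERT, GloVe, or USE; concatenating the disease and country vectors is an assumption; and the network size, learning rate, and target values are invented for the example rather than taken from the Global Burden of Disease study.

```python
import numpy as np


def toy_embedding(name: str, dim: int = 8) -> np.ndarray:
    """Hypothetical stand-in for a language-model embedding of a name.

    A real pipeline would query BioBERT, GloVe, or USE here. This version
    only derives a deterministic pseudo-random vector from the characters.
    """
    seed = sum(ord(c) * (i + 1) for i, c in enumerate(name))
    return np.random.default_rng(seed).normal(size=dim)


def featurize(disease: str, country: str, dim: int = 8) -> np.ndarray:
    # Assumption: disease and country embeddings are simply concatenated.
    return np.concatenate([toy_embedding(disease, dim), toy_embedding(country, dim)])


def train_mlp(X, y, hidden=16, lr=0.01, epochs=500, seed=0):
    """Fit a one-hidden-layer regression network with plain gradient descent."""
    rng = np.random.default_rng(seed)
    W1 = rng.normal(scale=0.1, size=(X.shape[1], hidden))
    b1 = np.zeros(hidden)
    W2 = rng.normal(scale=0.1, size=hidden)
    b2 = 0.0
    losses = []
    for _ in range(epochs):
        H = np.tanh(X @ W1 + b1)          # hidden activations
        err = (H @ W2 + b2) - y           # prediction error per sample
        losses.append(float(np.mean(err ** 2)))
        # Backpropagate the mean-squared-error gradient.
        g_pred = 2 * err / len(y)
        g_H = np.outer(g_pred, W2) * (1 - H ** 2)
        g_W2 = H.T @ g_pred
        g_b2 = g_pred.sum()
        g_W1 = X.T @ g_H
        g_b1 = g_H.sum(axis=0)
        W1 -= lr * g_W1
        b1 -= lr * g_b1
        W2 -= lr * g_W2
        b2 -= lr * g_b2
    return (W1, b1, W2, b2), losses


def predict(params, X):
    W1, b1, W2, b2 = params
    return np.tanh(X @ W1 + b1) @ W2 + b2


# Illustrative names only; the targets below are synthetic, not real incidence data.
diseases = ["asthma", "malaria", "influenza", "tuberculosis"]
countries = ["France", "Kenya", "Brazil", "Japan"]
pairs = [(d, c) for d in diseases for c in countries]
X = np.stack([featurize(d, c) for d, c in pairs])
true_w = np.random.default_rng(1).normal(size=X.shape[1])
y = 0.1 * (X @ true_w)                    # synthetic stand-in for log-incidence

params, losses = train_mlp(X, y)
print(f"training loss: {losses[0]:.4f} -> {losses[-1]:.4f}")
```

Swapping `toy_embedding` for a real encoder would leave the rest of the sketch unchanged, which is the main reason the features are kept separate from the regression step here.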


Bibliographic Details
Main Authors:	Zhang, Yuanzhao, Walecki, Robert, Winter, Joanne R., Bragman, Felix J. S., Lourenco, Sara, Hart, Christopher, Baker, Adam, Perov, Yura, Johri, Saurabh
Format:	Online Article Text
Language:	English
Published:	Frontiers Media S.A. 2020
Subjects:	Digital Health
Online Access:	https://www.ncbi.nlm.nih.gov/pmc/articles/PMC8521977/ https://www.ncbi.nlm.nih.gov/pubmed/34713043 http://dx.doi.org/10.3389/fdgth.2020.569261

_version_	1784585000566390784
author	Zhang, Yuanzhao Walecki, Robert Winter, Joanne R. Bragman, Felix J. S. Lourenco, Sara Hart, Christopher Baker, Adam Perov, Yura Johri, Saurabh
author_sort	Zhang, Yuanzhao
collection	PubMed
description	Background: AI-driven digital health tools often rely on estimates of disease incidence or prevalence, but obtaining these estimates is costly and time-consuming. We explored the use of machine learning models that leverage contextual information about diseases from unstructured text, to estimate disease incidence. Methods: We used a class of machine learning models, called language models, to extract contextual information relating to disease incidence. We evaluated three different language models: BioBERT, Global Vectors for Word Representation (GloVe), and the Universal Sentence Encoder (USE), as well as an approach which uses all jointly. The output of these models is a mathematical representation of the underlying data, known as “embeddings.” We used these to train neural network models to predict disease incidence. The neural networks were trained and validated using data from the Global Burden of Disease study, and tested using independent data sourced from the epidemiological literature. Findings: A variety of language models can be used to encode contextual information of diseases. We found that, on average, BioBERT embeddings were the best for disease names across multiple tasks. In particular, BioBERT was the best performing model when predicting specific disease-country pairs, whilst a fusion model combining BioBERT, GloVe, and USE performed best on average when predicting disease incidence in unseen countries. We also found that GloVe embeddings performed better than BioBERT embeddings when applied to country names. However, we also noticed that the models were limited in view of predicting previously unseen diseases. Further limitations were also observed with substantial variations across age groups and notably lower performance for diseases that are highly dependent on location and climate. Interpretation: We demonstrate that context-aware machine learning models can be used for estimating disease incidence. This method is quicker to implement than traditional epidemiological approaches. We therefore suggest it complements existing modeling efforts, where data is required more rapidly or at larger scale. This may particularly benefit AI-driven digital health products where the data will undergo further processing and a validated approximation of the disease incidence is adequate.
format	Online Article Text
id	pubmed-8521977
institution	National Center for Biotechnology Information
language	English
publishDate	2020
publisher	Frontiers Media S.A.
record_format	MEDLINE/PubMed
spelling	pubmed-8521977 2021-10-27 Applying Artificial Intelligence Methods for the Estimation of Disease Incidence: The Utility of Language Models Zhang, Yuanzhao Walecki, Robert Winter, Joanne R. Bragman, Felix J. S. Lourenco, Sara Hart, Christopher Baker, Adam Perov, Yura Johri, Saurabh Front Digit Health Digital Health Background: AI-driven digital health tools often rely on estimates of disease incidence or prevalence, but obtaining these estimates is costly and time-consuming. We explored the use of machine learning models that leverage contextual information about diseases from unstructured text, to estimate disease incidence. Methods: We used a class of machine learning models, called language models, to extract contextual information relating to disease incidence. We evaluated three different language models: BioBERT, Global Vectors for Word Representation (GloVe), and the Universal Sentence Encoder (USE), as well as an approach which uses all jointly. The output of these models is a mathematical representation of the underlying data, known as “embeddings.” We used these to train neural network models to predict disease incidence. The neural networks were trained and validated using data from the Global Burden of Disease study, and tested using independent data sourced from the epidemiological literature. Findings: A variety of language models can be used to encode contextual information of diseases. We found that, on average, BioBERT embeddings were the best for disease names across multiple tasks. In particular, BioBERT was the best performing model when predicting specific disease-country pairs, whilst a fusion model combining BioBERT, GloVe, and USE performed best on average when predicting disease incidence in unseen countries. We also found that GloVe embeddings performed better than BioBERT embeddings when applied to country names. However, we also noticed that the models were limited in view of predicting previously unseen diseases. Further limitations were also observed with substantial variations across age groups and notably lower performance for diseases that are highly dependent on location and climate. Interpretation: We demonstrate that context-aware machine learning models can be used for estimating disease incidence. This method is quicker to implement than traditional epidemiological approaches. We therefore suggest it complements existing modeling efforts, where data is required more rapidly or at larger scale. This may particularly benefit AI-driven digital health products where the data will undergo further processing and a validated approximation of the disease incidence is adequate. Frontiers Media S.A. 2020-12-15 /pmc/articles/PMC8521977/ /pubmed/34713043 http://dx.doi.org/10.3389/fdgth.2020.569261 Text en Copyright © 2020 Zhang, Walecki, Winter, Bragman, Lourenco, Hart, Baker, Perov and Johri. https://creativecommons.org/licenses/by/4.0/ This is an open-access article distributed under the terms of the Creative Commons Attribution License (CC BY). The use, distribution or reproduction in other forums is permitted, provided the original author(s) and the copyright owner(s) are credited and that the original publication in this journal is cited, in accordance with accepted academic practice. No use, distribution or reproduction is permitted which does not comply with these terms.
title	Applying Artificial Intelligence Methods for the Estimation of Disease Incidence: The Utility of Language Models
title_sort	applying artificial intelligence methods for the estimation of disease incidence: the utility of language models
topic	Digital Health
url	https://www.ncbi.nlm.nih.gov/pmc/articles/PMC8521977/ https://www.ncbi.nlm.nih.gov/pubmed/34713043 http://dx.doi.org/10.3389/fdgth.2020.569261

