Estimating Rare Disease Incidences With Large-scale Internet Search Data: Development and Evaluation of a Two-step Machine Learning Method

Bibliographic Details
Main authors: Li, Jiayu; He, Zhiyu; Zhang, Min; Ma, Weizhi; Jin, Ye; Zhang, Lei; Zhang, Shuyang; Liu, Yiqun; Ma, Shaoping
Format: Online Article Text
Language: English
Published: JMIR Publications 2023
Subjects:
Online access: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC10182453/
http://dx.doi.org/10.2196/42721
_version_ 1785041770389700608
author Li, Jiayu
He, Zhiyu
Zhang, Min
Ma, Weizhi
Jin, Ye
Zhang, Lei
Zhang, Shuyang
Liu, Yiqun
Ma, Shaoping
author_facet Li, Jiayu
He, Zhiyu
Zhang, Min
Ma, Weizhi
Jin, Ye
Zhang, Lei
Zhang, Shuyang
Liu, Yiqun
Ma, Shaoping
author_sort Li, Jiayu
collection PubMed
description BACKGROUND: As rare diseases (RDs) receive increasing attention, obtaining accurate RD incidence estimates has become an essential concern in public health. Since RDs are difficult to diagnose, include diverse types, and have scarce cases, traditional epidemiological methods are costly in RD registries. With the development of the internet, users have become accustomed to searching for disease-related information through search engines before seeking medical treatment. Therefore, online search data provide a new source for estimating RD incidences. OBJECTIVE: The aim of this study was to estimate the incidences of multiple RDs in distinct regions of China with online search data. METHODS: Our research scale included 15 RDs in China from 2016 to 2019. The online search data were obtained from Sogou, one of the top 3 commercial search engines in China. By matching to multilevel keywords related to 15 RDs during the 4 years, we retrieved keyword-matched RD-related queries. The queries used before and after the keyword-matched queries formed the basis of the RD-related search sessions. A two-step method was developed to estimate RD incidences with users’ intents conveyed by the sessions. In the first step, a combination of long short-term memory and multilayer perceptron algorithms was used to predict whether the intents of search sessions were RD-concerned, news-concerned, or others. The second step utilized a linear regression (LR) model to estimate the incidences of multiple RDs in distinct regions based on the RD- and news-concerned session numbers. For evaluation, the estimated incidences were compared with RD incidences collected from China’s national multicenter clinical database of RDs. The root mean square error (RMSE) and relative error rate (RER) were used as the evaluation metrics. RESULTS: The RD-related online data included 2,749,257 queries and 1,769,986 sessions from 1,380,186 users from 2016 to 2019. 
The best LR model with sessions as the input estimated the RD incidences with an RMSE of 0.017 (95% CI 0.016-0.017) and an RER of 0.365 (95% CI 0.341-0.388). The best LR model with queries as input had an RMSE of 0.023 (95% CI 0.017-0.029) and an RER of 0.511 (95% CI 0.377-0.645). Compared with queries, using session intents achieved an error decrease of 28.57% in terms of the RER (P=.01). Analysis of different RDs and regions showed that session input was more suitable for estimating the incidences of most diseases (14 of 15 RDs). Moreover, examples focusing on two RDs showed that news-concerned session intents reflected news of an outbreak and helped correct the overestimation of incidences. Experiments on RD types further indicated that type had no significant influence on the RD estimation task. CONCLUSIONS: This work sheds light on a novel method for rapid estimation of RD incidences in the internet era, and demonstrates that search session intents were especially helpful for the estimation. The proposed two-step estimation method could be a valuable supplement to the traditional registry for understanding RDs, planning policies, and allocating medical resources. The utilization of search sessions in disease detection and estimation could be transferred to infoveillance of large-scale epidemics or chronic diseases.
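The two evaluation metrics named in the abstract, and the reported 28.57% error decrease, can be illustrated with a short sketch. The abstract does not define the relative error rate (RER), so the definition used here (mean of absolute errors divided by the observed incidence) is an assumption, and the function names `rmse`, `rer`, and `relative_decrease` are hypothetical. Only the two RER values (0.511 for query input, 0.365 for session input) come from the record.

```python
import math


def rmse(estimated, observed):
    """Root mean square error between estimated and observed incidences."""
    n = len(observed)
    return math.sqrt(sum((e - o) ** 2 for e, o in zip(estimated, observed)) / n)


def rer(estimated, observed):
    """Relative error rate.

    Assumed definition (not given in the abstract): the mean of
    |estimate - observed| / observed over all disease-region pairs.
    """
    n = len(observed)
    return sum(abs(e - o) / o for e, o in zip(estimated, observed)) / n


def relative_decrease(baseline, improved):
    """Fractional reduction of `improved` relative to `baseline`."""
    return (baseline - improved) / baseline


# RER values reported in the abstract for the best LR models.
rer_queries = 0.511   # queries as input
rer_sessions = 0.365  # session intents as input

decrease = relative_decrease(rer_queries, rer_sessions)
print(f"{decrease:.2%}")  # 28.57%
```

Under this reading, the 28.57% figure is the relative reduction in RER when moving from query input to session input, (0.511 - 0.365) / 0.511, rather than a difference in percentage points.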
format Online
Article
Text
id pubmed-10182453
institution National Center for Biotechnology Information
language English
publishDate 2023
publisher JMIR Publications
record_format MEDLINE/PubMed
spelling pubmed-10182453 2023-05-14 Estimating Rare Disease Incidences With Large-scale Internet Search Data: Development and Evaluation of a Two-step Machine Learning Method Li, Jiayu He, Zhiyu Zhang, Min Ma, Weizhi Jin, Ye Zhang, Lei Zhang, Shuyang Liu, Yiqun Ma, Shaoping JMIR Infodemiology Original Paper JMIR Publications 2023-04-28 /pmc/articles/PMC10182453/ http://dx.doi.org/10.2196/42721 Text en ©Jiayu Li, Zhiyu He, Min Zhang, Weizhi Ma, Ye Jin, Lei Zhang, Shuyang Zhang, Yiqun Liu, Shaoping Ma. Originally published in JMIR Infodemiology (https://infodemiology.jmir.org), 28.04.2023.
https://creativecommons.org/licenses/by/4.0/ This is an open-access article distributed under the terms of the Creative Commons Attribution License (https://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work, first published in JMIR Infodemiology, is properly cited. The complete bibliographic information, a link to the original publication on https://infodemiology.jmir.org/, as well as this copyright and license information must be included.
spellingShingle Original Paper
Li, Jiayu
He, Zhiyu
Zhang, Min
Ma, Weizhi
Jin, Ye
Zhang, Lei
Zhang, Shuyang
Liu, Yiqun
Ma, Shaoping
Estimating Rare Disease Incidences With Large-scale Internet Search Data: Development and Evaluation of a Two-step Machine Learning Method
title Estimating Rare Disease Incidences With Large-scale Internet Search Data: Development and Evaluation of a Two-step Machine Learning Method
title_full Estimating Rare Disease Incidences With Large-scale Internet Search Data: Development and Evaluation of a Two-step Machine Learning Method
title_fullStr Estimating Rare Disease Incidences With Large-scale Internet Search Data: Development and Evaluation of a Two-step Machine Learning Method
title_full_unstemmed Estimating Rare Disease Incidences With Large-scale Internet Search Data: Development and Evaluation of a Two-step Machine Learning Method
title_short Estimating Rare Disease Incidences With Large-scale Internet Search Data: Development and Evaluation of a Two-step Machine Learning Method
title_sort estimating rare disease incidences with large-scale internet search data: development and evaluation of a two-step machine learning method
topic Original Paper
url https://www.ncbi.nlm.nih.gov/pmc/articles/PMC10182453/
http://dx.doi.org/10.2196/42721