Epidemiologic information discovery from open-access COVID-19 case reports via pretrained language model
Although open-access data are increasingly common and useful to epidemiological research, the curation of such datasets is resource-intensive and time-consuming. Despite the existence of a major source of COVID-19 data, the regularly disclosed case reports were often written in natural language with...
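Main Authors: | Wang, Zhizheng; Liu, Xiao Fan; Du, Zhanwei; Wang, Lin; Wu, Ye; Holme, Petter; Lachmann, Michael; Lin, Hongfei; Wong, Zoie S.Y.; Xu, Xiao-Ke; Sun, Yuanyuan |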
---|---|
Format: | Online Article Text |
Language: | English |
Published: | Elsevier 2022 |
Subjects: | |
Online Access: | https://www.ncbi.nlm.nih.gov/pmc/articles/PMC9441477/ https://www.ncbi.nlm.nih.gov/pubmed/36093379 http://dx.doi.org/10.1016/j.isci.2022.105079 |
_version_ | 1784782583063642112 |
---|---|
author | Wang, Zhizheng Liu, Xiao Fan Du, Zhanwei Wang, Lin Wu, Ye Holme, Petter Lachmann, Michael Lin, Hongfei Wong, Zoie S.Y. Xu, Xiao-Ke Sun, Yuanyuan |
author_sort | Wang, Zhizheng |
collection | PubMed |
description | Although open-access data are increasingly common and useful to epidemiological research, the curation of such datasets is resource-intensive and time-consuming. Despite the existence of a major source of COVID-19 data, the regularly disclosed case reports were often written in natural language with an unstructured format. Here, we propose a computational framework that can automatically extract epidemiological information from open-access COVID-19 case reports. We develop this framework by coupling a language model developed using deep neural networks with training samples compiled using an optimized data annotation strategy. When applied to the COVID-19 case reports collected from mainland China, our framework outperforms all other state-of-the-art deep learning models. The information extracted from our approach is highly consistent with that obtained from the gold-standard manual coding, with a matching rate of 80%. To disseminate our algorithm, we provide an open-access online platform that is able to estimate key epidemiological statistics in real time, with much less effort for data curation. |
format | Online Article Text |
id | pubmed-9441477 |
institution | National Center for Biotechnology Information |
language | English |
publishDate | 2022 |
publisher | Elsevier |
record_format | MEDLINE/PubMed |
spelling | pubmed-9441477 2022-09-06 Epidemiologic information discovery from open-access COVID-19 case reports via pretrained language model Wang, Zhizheng Liu, Xiao Fan Du, Zhanwei Wang, Lin Wu, Ye Holme, Petter Lachmann, Michael Lin, Hongfei Wong, Zoie S.Y. Xu, Xiao-Ke Sun, Yuanyuan iScience Article Although open-access data are increasingly common and useful to epidemiological research, the curation of such datasets is resource-intensive and time-consuming. Despite the existence of a major source of COVID-19 data, the regularly disclosed case reports were often written in natural language with an unstructured format. Here, we propose a computational framework that can automatically extract epidemiological information from open-access COVID-19 case reports. We develop this framework by coupling a language model developed using deep neural networks with training samples compiled using an optimized data annotation strategy. When applied to the COVID-19 case reports collected from mainland China, our framework outperforms all other state-of-the-art deep learning models. The information extracted from our approach is highly consistent with that obtained from the gold-standard manual coding, with a matching rate of 80%. To disseminate our algorithm, we provide an open-access online platform that is able to estimate key epidemiological statistics in real time, with much less effort for data curation. Elsevier 2022-09-05 /pmc/articles/PMC9441477/ /pubmed/36093379 http://dx.doi.org/10.1016/j.isci.2022.105079 Text en © 2022 The Authors https://creativecommons.org/licenses/by-nc-nd/4.0/ This is an open access article under the CC BY-NC-ND license (http://creativecommons.org/licenses/by-nc-nd/4.0/). |
spellingShingle | Article Wang, Zhizheng Liu, Xiao Fan Du, Zhanwei Wang, Lin Wu, Ye Holme, Petter Lachmann, Michael Lin, Hongfei Wong, Zoie S.Y. Xu, Xiao-Ke Sun, Yuanyuan Epidemiologic information discovery from open-access COVID-19 case reports via pretrained language model |
title | Epidemiologic information discovery from open-access COVID-19 case reports via pretrained language model |
title_sort | epidemiologic information discovery from open-access covid-19 case reports via pretrained language model |
topic | Article |
url | https://www.ncbi.nlm.nih.gov/pmc/articles/PMC9441477/ https://www.ncbi.nlm.nih.gov/pubmed/36093379 http://dx.doi.org/10.1016/j.isci.2022.105079 |