
Identifying COVID-19 cases and extracting patient reported symptoms from Reddit using natural language processing

We used social media data from the “covid19positive” subreddit, from 03/2020 to 03/2022, to identify COVID-19 cases and extract their reported symptoms automatically using natural language processing (NLP). We trained a Bidirectional Encoder Representations from Transformers classification model with chu...


Bibliographic Details
Main Authors: Guo, Muzhe; Ma, Yong; Eworuke, Efe; Khashei, Melissa; Song, Jaejoon; Zhao, Yueqin; Jin, Fang
Format: Online Article Text
Language: English
Published: Nature Publishing Group UK 2023
Subjects:
Online Access: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC10444846/
https://www.ncbi.nlm.nih.gov/pubmed/37607963
http://dx.doi.org/10.1038/s41598-023-39986-7
author Guo, Muzhe
Ma, Yong
Eworuke, Efe
Khashei, Melissa
Song, Jaejoon
Zhao, Yueqin
Jin, Fang
collection PubMed
description We used social media data from the “covid19positive” subreddit, from 03/2020 to 03/2022, to identify COVID-19 cases and extract their reported symptoms automatically using natural language processing (NLP). We trained a Bidirectional Encoder Representations from Transformers (BERT) classification model with chunking to identify COVID-19 cases; we also developed a novel QuadArm model, which incorporates Question-answering, dual-corpus expansion, Adaptive rotation clustering, and mapping, to extract symptoms. Our classification model achieved 91.2% accuracy for the early period (03/2020–05/2020) and was applied to the Delta (07/2021–09/2021) and Omicron (12/2021–03/2022) periods for case identification. We identified 310, 8,794, and 12,094 COVID-positive authors in the three periods, respectively. The five most common symptoms extracted in the early period were coughing (57%), fever (55%), loss of sense of smell (41%), headache (40%), and sore throat (40%). During the Delta period, these remained the top five symptoms, but the percentage of authors reporting each was reduced to half or less of that in the early period. During the Omicron period, loss of sense of smell was reported less often, while sore throat was reported more often. Our study demonstrated that NLP can be used to identify COVID-19 cases accurately and extract symptoms efficiently.
format Online
Article
Text
id pubmed-10444846
institution National Center for Biotechnology Information
language English
publishDate 2023
publisher Nature Publishing Group UK
record_format MEDLINE/PubMed
spelling pubmed-10444846 2023-08-24 Identifying COVID-19 cases and extracting patient reported symptoms from Reddit using natural language processing Guo, Muzhe Ma, Yong Eworuke, Efe Khashei, Melissa Song, Jaejoon Zhao, Yueqin Jin, Fang Sci Rep Article We used social media data from the “covid19positive” subreddit, from 03/2020 to 03/2022, to identify COVID-19 cases and extract their reported symptoms automatically using natural language processing (NLP). We trained a Bidirectional Encoder Representations from Transformers (BERT) classification model with chunking to identify COVID-19 cases; we also developed a novel QuadArm model, which incorporates Question-answering, dual-corpus expansion, Adaptive rotation clustering, and mapping, to extract symptoms. Our classification model achieved 91.2% accuracy for the early period (03/2020–05/2020) and was applied to the Delta (07/2021–09/2021) and Omicron (12/2021–03/2022) periods for case identification. We identified 310, 8,794, and 12,094 COVID-positive authors in the three periods, respectively. The five most common symptoms extracted in the early period were coughing (57%), fever (55%), loss of sense of smell (41%), headache (40%), and sore throat (40%). During the Delta period, these remained the top five symptoms, but the percentage of authors reporting each was reduced to half or less of that in the early period. During the Omicron period, loss of sense of smell was reported less often, while sore throat was reported more often. Our study demonstrated that NLP can be used to identify COVID-19 cases accurately and extract symptoms efficiently.
Nature Publishing Group UK 2023-08-22 /pmc/articles/PMC10444846/ /pubmed/37607963 http://dx.doi.org/10.1038/s41598-023-39986-7 Text en © The Author(s) 2023 https://creativecommons.org/licenses/by/4.0/ Open Access: This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/.
title Identifying COVID-19 cases and extracting patient reported symptoms from Reddit using natural language processing
topic Article
url https://www.ncbi.nlm.nih.gov/pmc/articles/PMC10444846/
https://www.ncbi.nlm.nih.gov/pubmed/37607963
http://dx.doi.org/10.1038/s41598-023-39986-7