COVID-19 Misinformation Detection: Machine-Learned Solutions to the Infodemic

BACKGROUND: The volume of COVID-19–related misinformation has long exceeded the resources available to fact checkers to effectively mitigate its ill effects. Automated and web-based approaches can provide effective deterrents to online misinformation. Machine learning–based methods have achieved rob...

Bibliographic Details
Main Authors:	Kolluri, Nikhil; Liu, Yunong; Murthy, Dhiraj
Format:	Online Article Text
Language:	English
Published:	JMIR Publications 2022
Subjects:	Original Paper
Online Access:	https://www.ncbi.nlm.nih.gov/pmc/articles/PMC9987189/ https://www.ncbi.nlm.nih.gov/pubmed/37113446 http://dx.doi.org/10.2196/38756

_version_	1784901329641013248
author	Kolluri, Nikhil; Liu, Yunong; Murthy, Dhiraj
author_facet	Kolluri, Nikhil; Liu, Yunong; Murthy, Dhiraj
author_sort	Kolluri, Nikhil
collection	PubMed
description	BACKGROUND: The volume of COVID-19–related misinformation has long exceeded the resources available to fact checkers to effectively mitigate its ill effects. Automated and web-based approaches can provide effective deterrents to online misinformation. Machine learning–based methods have achieved robust performance on text classification tasks, including potentially low-quality-news credibility assessment. Despite the progress of initial, rapid interventions, the enormity of COVID-19–related misinformation continues to overwhelm fact checkers. Therefore, improvement in automated and machine-learned methods for an infodemic response is urgently needed.
OBJECTIVE: The aim of this study was to achieve improvement in automated and machine-learned methods for an infodemic response.
METHODS: We evaluated three strategies for training a machine-learning model to determine the highest model performance: (1) COVID-19–related fact-checked data only, (2) general fact-checked data only, and (3) combined COVID-19 and general fact-checked data. We created two COVID-19–related misinformation data sets from fact-checked “false” content combined with programmatically retrieved “true” content. The first set contained ~7000 entries from July to August 2020, and the second contained ~31,000 entries from January 2020 to June 2022. We crowdsourced 31,441 votes to human-label the first data set.
RESULTS: The models achieved accuracies of 96.55% and 94.56% on the first and second external validation data sets, respectively. Our best-performing model was developed using COVID-19–specific content. We were able to successfully develop combined models that outperformed human votes of misinformation. Specifically, when we blended our model predictions with human votes, the highest accuracy we achieved on the first external validation data set was 99.1%. When we considered outputs where the machine-learning model agreed with human votes, we achieved accuracies up to 98.59% on the first validation data set. This outperformed human votes alone, which had an accuracy of only 73%.
CONCLUSIONS: External validation accuracies of 96.55% and 94.56% are evidence that machine learning can produce superior results for the difficult task of classifying the veracity of COVID-19 content. Pretrained language models performed best when fine-tuned on a topic-specific data set, while other models achieved their best accuracy when fine-tuned on a combination of topic-specific and general-topic data sets. Crucially, our study found that blended models, trained/fine-tuned on general-topic content with crowdsourced data, improved our models’ accuracies up to 99.7%. The successful use of crowdsourced data can increase the accuracy of models in situations where expert-labeled data are scarce. The 98.59% accuracy on a “high-confidence” subsection comprising machine-learned and human labels suggests that crowdsourced votes can optimize machine-learned labels to improve accuracy above human-only levels. These results support the utility of supervised machine learning to deter and combat future health-related disinformation.
format	Online Article Text
id	pubmed-9987189
institution	National Center for Biotechnology Information
language	English
publishDate	2022
publisher	JMIR Publications
record_format	MEDLINE/PubMed
spelling	pubmed-9987189 2023-04-26 COVID-19 Misinformation Detection: Machine-Learned Solutions to the Infodemic Kolluri, Nikhil Liu, Yunong Murthy, Dhiraj JMIR Infodemiology Original Paper JMIR Publications 2022-08-25 /pmc/articles/PMC9987189/ /pubmed/37113446 http://dx.doi.org/10.2196/38756 Text en ©Nikhil Kolluri, Yunong Liu, Dhiraj Murthy. Originally published in JMIR Infodemiology (https://infodemiology.jmir.org), 25.08.2022. This is an open-access article distributed under the terms of the Creative Commons Attribution License (https://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work, first published in JMIR Infodemiology, is properly cited. The complete bibliographic information, a link to the original publication on https://infodemiology.jmir.org/, as well as this copyright and license information must be included.
spellingShingle	Original Paper Kolluri, Nikhil Liu, Yunong Murthy, Dhiraj COVID-19 Misinformation Detection: Machine-Learned Solutions to the Infodemic
title	COVID-19 Misinformation Detection: Machine-Learned Solutions to the Infodemic
title_sort	covid-19 misinformation detection: machine-learned solutions to the infodemic
topic	Original Paper
url	https://www.ncbi.nlm.nih.gov/pmc/articles/PMC9987189/ https://www.ncbi.nlm.nih.gov/pubmed/37113446 http://dx.doi.org/10.2196/38756
work_keys_str_mv	AT kollurinikhil covid19misinformationdetectionmachinelearnedsolutionstotheinfodemic AT liuyunong covid19misinformationdetectionmachinelearnedsolutionstotheinfodemic AT murthydhiraj covid19misinformationdetectionmachinelearnedsolutionstotheinfodemic
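
The abstract reports accuracy on a "high-confidence" subset of items where the machine-learning model agreed with the crowdsourced human votes. The record does not say how the authors aggregated votes or built that subset, so the following is only a minimal sketch under stated assumptions: crowd votes are reduced to one label per item by simple majority, and the subset keeps the items where that label equals the model's label. All function names and the toy data are hypothetical and are not taken from the paper.

```python
from collections import Counter


def majority_vote(votes):
    """Return the most common label among the crowd votes for one item.

    Assumes an odd number of votes per item, so ties cannot occur.
    """
    return Counter(votes).most_common(1)[0][0]


def accuracy(predicted, truth):
    """Fraction of items whose predicted label equals the true label."""
    pairs = list(zip(predicted, truth))
    return sum(p == t for p, t in pairs) / len(pairs)


def agreement_subset(model_labels, human_labels, truth):
    """Keep only items where the model and the crowd agree.

    Returns (agreed_labels, matching_truth) for the retained items.
    """
    kept = [
        (m, t)
        for m, h, t in zip(model_labels, human_labels, truth)
        if m == h
    ]
    return [m for m, _ in kept], [t for _, t in kept]


# Toy data (hypothetical): six items with fact-checked ground truth.
truth = ["false", "true", "false", "true", "false", "true"]
model_labels = ["false", "true", "false", "false", "false", "true"]
crowd_votes = [
    ["false", "false", "true"],
    ["true", "true", "false"],
    ["true", "true", "false"],
    ["true", "false", "true"],
    ["true", "true", "true"],
    ["true", "true", "true"],
]

human_labels = [majority_vote(v) for v in crowd_votes]
agreed_labels, agreed_truth = agreement_subset(model_labels, human_labels, truth)

print("model accuracy:", accuracy(model_labels, truth))
print("human accuracy:", accuracy(human_labels, truth))
print("agreement-subset accuracy:", accuracy(agreed_labels, agreed_truth))
print("coverage:", len(agreed_labels) / len(truth))
```

In this toy data the agreement subset is more accurate than either source alone, but it covers only half of the items. Any real use would need to report that coverage alongside the accuracy.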
