Using weak supervision to generate training datasets from social media data: a proof of concept to identify drug mentions

Twitter has been a remarkable resource for research in pharmacovigilance in the last decade. Traditionally, rule- or lexicon-based methods have been utilized for automatically extracting drug tweets for human annotation. The process of human annotation to create labeled sets for machine learning mod...

Bibliographic Details
Main authors:	Tekumalla, Ramya; Banda, Juan M.
Format:	Online Article Text
Language:	English
Published:	Springer London 2021
Subjects:	S.I. : LatinX in AI Research
Online access:	https://www.ncbi.nlm.nih.gov/pmc/articles/PMC8554513/ https://www.ncbi.nlm.nih.gov/pubmed/34728902 http://dx.doi.org/10.1007/s00521-021-06614-2

author	Tekumalla, Ramya; Banda, Juan M.
collection	PubMed
description	Twitter has been a remarkable resource for research in pharmacovigilance in the last decade. Traditionally, rule- or lexicon-based methods have been utilized for automatically extracting drug tweets for human annotation. The process of human annotation to create labeled sets for machine learning models is laborious, time-consuming and not scalable. In this work, we demonstrate the feasibility of applying weak supervision (noisy labeling) to select drug data, and build machine learning models using large amounts of noisy labeled data instead of limited gold standard labeled sets. Our results demonstrate that models built with large amounts of noisy data achieve performance similar to that of models trained on limited gold standard datasets, hence demonstrating that weak supervision helps reduce the need to rely on manual annotation, allowing more data to be easily labeled and useful for downstream machine learning applications, in this case drug mention identification.
format	Online Article Text
id	pubmed-8554513
institution	National Center for Biotechnology Information
language	English
publishDate	2021
publisher	Springer London
record_format	MEDLINE/PubMed
spelling	pubmed-8554513 2021-10-29 Neural Comput Appl Springer London /pmc/articles/PMC8554513/ /pubmed/34728902 http://dx.doi.org/10.1007/s00521-021-06614-2 Text en © The Author(s), under exclusive licence to Springer-Verlag London Ltd., part of Springer Nature 2021 This article is made available via the PMC Open Access Subset for unrestricted research re-use and secondary analysis in any form or by any means with acknowledgement of the original source. These permissions are granted for the duration of the World Health Organization (WHO) declaration of COVID-19 as a global pandemic.
title	Using weak supervision to generate training datasets from social media data: a proof of concept to identify drug mentions
topic	S.I. : LatinX in AI Research
url	https://www.ncbi.nlm.nih.gov/pmc/articles/PMC8554513/ https://www.ncbi.nlm.nih.gov/pubmed/34728902 http://dx.doi.org/10.1007/s00521-021-06614-2
