Detecting Novel and Emerging Drug Terms Using Natural Language Processing: A Social Media Corpus Study

BACKGROUND: With the rapid development of new psychoactive substances (NPS) and changes in the use of more traditional drugs, it is increasingly difficult for researchers and public health practitioners to keep up with emerging drugs and drug terms. Substance use surveys and diagnostic tools need to...

Bibliographic Details
Main Authors:	Simpson, Sean S; Adams, Nikki; Brugman, Claudia M; Conners, Thomas J
Format:	Online Article Text
Language:	English
Published:	JMIR Publications 2018
Subjects:	Original Paper
Online Access:	https://www.ncbi.nlm.nih.gov/pmc/articles/PMC5838358/ https://www.ncbi.nlm.nih.gov/pubmed/29311050 http://dx.doi.org/10.2196/publichealth.7726

author	Simpson, Sean S; Adams, Nikki; Brugman, Claudia M; Conners, Thomas J
collection	PubMed
description	BACKGROUND: With the rapid development of new psychoactive substances (NPS) and changes in the use of more traditional drugs, it is increasingly difficult for researchers and public health practitioners to keep up with emerging drugs and drug terms. Substance use surveys and diagnostic tools need to be able to ask about substances using the terms that drug users themselves are likely to be using. Analyses of social media may offer new ways for researchers to uncover and track changes in drug terms in near real time. This study describes the initial results from an innovative collaboration between substance use epidemiologists and linguistic scientists employing techniques from the field of natural language processing to examine drug-related terms in a sample of tweets from the United States. OBJECTIVE: The objective of this study was to assess the feasibility of using distributed word-vector embeddings trained on social media data to uncover previously unknown (to researchers) drug terms. METHODS: In this pilot study, we trained a continuous bag of words (CBOW) model of distributed word-vector embeddings on a Twitter dataset collected during July 2016 (roughly 884.2 million tokens). We queried the trained word embeddings for terms with high cosine similarity (a proxy for semantic relatedness) to well-known slang terms for marijuana to produce a list of candidate terms likely to function as slang terms for this substance. This candidate list was then compared with an expert-generated list of marijuana terms to assess the accuracy and efficacy of using word-vector embeddings to search for novel drug terminology. RESULTS: The method described here produced a list of 200 candidate terms for the target substance (marijuana). Of these 200 candidates, 115 were determined to in fact relate to marijuana (65 terms for the substance itself, 50 terms related to paraphernalia). This included 30 terms which were used to refer to the target substance in the corpus yet did not appear on the expert-generated list and were therefore considered to be successful cases of uncovering novel drug terminology. Several of these novel terms appear to have been introduced as recently as 1 or 2 months before the corpus time slice used to train the word embeddings. CONCLUSIONS: Though the precision of the method described here is low enough as to still necessitate human review of any candidate term lists generated in such a manner, the fact that this process was able to detect 30 novel terms for the target substance based only on one month’s worth of Twitter data is highly promising. We see this pilot study as an important proof of concept and a first step toward producing a fully automated drug term discovery system capable of tracking emerging NPS terms in real time.
format	Online Article Text
id	pubmed-5838358
institution	National Center for Biotechnology Information
language	English
publishDate	2018
publisher	JMIR Publications
record_format	MEDLINE/PubMed
spelling	pubmed-5838358 2018-03-09 Detecting Novel and Emerging Drug Terms Using Natural Language Processing: A Social Media Corpus Study Simpson, Sean S; Adams, Nikki; Brugman, Claudia M; Conners, Thomas J JMIR Public Health Surveill Original Paper JMIR Publications 2018-01-08 /pmc/articles/PMC5838358/ /pubmed/29311050 http://dx.doi.org/10.2196/publichealth.7726 Text en ©Sean S Simpson, Nikki Adams, Claudia M Brugman, Thomas J Conners. Originally published in JMIR Public Health and Surveillance (http://publichealth.jmir.org), 08.01.2018. https://creativecommons.org/licenses/by/4.0/ This is an open-access article distributed under the terms of the Creative Commons Attribution License (https://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work, first published in JMIR Public Health and Surveillance, is properly cited. The complete bibliographic information, a link to the original publication on http://publichealth.jmir.org, as well as this copyright and license information must be included.
title	Detecting Novel and Emerging Drug Terms Using Natural Language Processing: A Social Media Corpus Study
topic	Original Paper
url	https://www.ncbi.nlm.nih.gov/pmc/articles/PMC5838358/ https://www.ncbi.nlm.nih.gov/pubmed/29311050 http://dx.doi.org/10.2196/publichealth.7726
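
The METHODS and RESULTS summarized in the description field can be illustrated with a minimal sketch. It does not reproduce the authors' pipeline: the study trained CBOW embeddings on roughly 884.2 million tokens, whereas the toy two-dimensional vectors, the placeholder term names (`seed_term`, `near_term`, and so on), and the helper functions `rank_candidates` and `novel_terms` below are all hypothetical and exist only for illustration. The sketch shows the querying step (ranking vocabulary by cosine similarity to known seed terms), the comparison against an expert list, and the precision implied by the reported counts.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def rank_candidates(embeddings, seed_terms, top_n):
    """Rank non-seed terms by their highest cosine similarity to any seed term.

    Hypothetical helper: the record does not say how similarities to
    several seed terms were combined, so taking the maximum is an assumption.
    """
    scores = {}
    for term, vector in embeddings.items():
        if term in seed_terms:
            continue
        scores[term] = max(
            cosine_similarity(vector, embeddings[seed]) for seed in seed_terms
        )
    return sorted(scores, key=scores.get, reverse=True)[:top_n]


def novel_terms(candidates, expert_list):
    """Candidates that are absent from the expert-generated list."""
    return [term for term in candidates if term not in expert_list]


def precision(relevant, total):
    """Fraction of candidates judged relevant."""
    return relevant / total


# Toy embeddings with made-up values; real word vectors have hundreds of dimensions.
embeddings = {
    "seed_term": [1.0, 0.0],
    "near_term": [0.9, 0.1],
    "mid_term": [0.5, 0.5],
    "far_term": [0.0, 1.0],
}

candidates = rank_candidates(embeddings, {"seed_term"}, top_n=2)
unlisted = novel_terms(candidates, expert_list={"near_term"})

# Counts taken from the RESULTS section of the abstract.
precision_all = precision(115, 200)        # substance plus paraphernalia terms
precision_substance = precision(65, 200)   # terms for the substance itself

print(candidates)
print(unlisted)
print(f"{precision_all:.3f} {precision_substance:.3f}")
```

With the counts from the abstract, 115 of 200 candidates gives 57.5% when paraphernalia terms count as relevant, and 65 of 200 gives 32.5% for substance terms alone. These percentages are computed here from the reported counts and are not stated in the record, and the paper itself may define precision differently.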
