Development and Validation of Machine Models Using Natural Language Processing to Classify Substances Involved in Overdose Deaths

Bibliographic Details
Main Authors: Goodman-Meza, David, Shover, Chelsea L., Medina, Jesus A., Tang, Amber B., Shoptaw, Steven, Bui, Alex A. T.
Format: Online Article Text
Language: English
Published: American Medical Association 2022
Subjects:
Online Access: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC9361079/
https://www.ncbi.nlm.nih.gov/pubmed/35939303
http://dx.doi.org/10.1001/jamanetworkopen.2022.25593
collection PubMed
description IMPORTANCE: Overdose is one of the leading causes of death in the US; however, surveillance data lag considerably from medical examiner determination of the death to reporting in national surveillance reports. OBJECTIVE: To automate the classification of deaths related to substances in medical examiner data using natural language processing (NLP) and machine learning (ML). DESIGN, SETTING, AND PARTICIPANTS: Diagnostic study comparing different natural language processing and machine learning algorithms to identify substances related to overdose in 10 health jurisdictions in the US from January 1, 2020, to December 31, 2020. Unstructured text from 35 433 medical examiner and coroners’ death records was examined. EXPOSURES: Text from each case was manually classified to a substance that was related to the death. Three feature representation methods were used and compared: text frequency–inverse document frequency (TF-IDF), global vectors for word representations (GloVe), and concept unique identifier (CUI) embeddings. Several ML algorithms were trained and best models were selected based on F-scores. The best models were tested on a hold-out test set and results were reported with 95% CIs. MAIN OUTCOMES AND MEASURES: Text data from death certificates were classified as any opioid, fentanyl, alcohol, cocaine, methamphetamine, heroin, prescription opioid, and an aggregate of other substances. Diagnostic metrics and 95% CIs were calculated for each combination of feature extraction method and machine learning classifier. RESULTS: Of 35 433 death records analyzed (decedent median age, 58 years [IQR, 41-72 years]; 24 449 [69%] were male), the most common substances related to deaths included any opioid (5739 [16%]), fentanyl (4758 [13%]), alcohol (2866 [8%]), cocaine (2247 [6%]), methamphetamine (1876 [5%]), heroin (1613 [5%]), prescription opioids (1197 [3%]), and any benzodiazepine (1076 [3%]). 
The CUI embeddings had similar or better diagnostic metrics compared with word embeddings and TF-IDF for all substances except alcohol. ML classifiers had perfect or near perfect performance in classifying deaths related to any opioids, heroin, fentanyl, prescription opioids, methamphetamine, cocaine, and alcohol. Classification of benzodiazepines was suboptimal using all 3 feature extraction methods. CONCLUSIONS AND RELEVANCE: In this diagnostic study, NLP/ML algorithms demonstrated excellent diagnostic performance at classifying substances related to overdoses. These algorithms should be integrated into workflows to decrease the lag time in reporting overdose surveillance data.
id pubmed-9361079
institution National Center for Biotechnology Information
record_format MEDLINE/PubMed
spelling pubmed-9361079 2022-08-19 JAMA Netw Open Original Investigation American Medical Association 2022-08-08 Text en Copyright 2022 Goodman-Meza D et al. JAMA Network Open. https://creativecommons.org/licenses/by/4.0/ This is an open access article distributed under the terms of the CC-BY License.
topic Original Investigation