
Identifying medical terms in patient-authored text: a crowdsourcing-based approach

BACKGROUND AND OBJECTIVE: As people increasingly engage in online health-seeking behavior and contribute to health-oriented websites, the volume of medical text authored by patients and other medical novices grows rapidly. However, we lack an effective method for automatically identifying medical terms in patient-authored text (PAT). We demonstrate that crowdsourcing PAT medical term identification tasks to non-experts is a viable method for creating large, accurately-labeled PAT datasets; moreover, such datasets can be used to train classifiers that outperform existing medical term identification tools. MATERIALS AND METHODS: To evaluate the viability of using non-expert crowds to label PAT, we compare expert (registered nurses) and non-expert (Amazon Mechanical Turk workers; Turkers) responses to a PAT medical term identification task. Next, we build a crowd-labeled dataset comprising 10 000 sentences from MedHelp. We train two models on this dataset and evaluate their performance, as well as that of MetaMap, Open Biomedical Annotator (OBA), and NaCTeM's TerMINE, against two gold standard datasets: one from MedHelp and the other from CureTogether. RESULTS: When aggregated according to a corroborative voting policy, Turker responses predict expert responses with an F1 score of 84%. A conditional random field (CRF) trained on 10 000 crowd-labeled MedHelp sentences achieves an F1 score of 78% against the CureTogether gold standard, widely outperforming OBA (47%), TerMINE (43%), and MetaMap (39%). A failure analysis of the CRF suggests that misclassified terms are likely to be either generic or rare. CONCLUSIONS: Our results show that combining statistical models sensitive to sentence-level context with crowd-labeled data is a scalable and effective technique for automatically identifying medical terms in PAT.
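
The record does not spell out the "corroborative voting policy" beyond its name, but a natural reading is that a token counts as a medical term only when at least k of the Turkers who saw it marked it. The Python sketch below is a minimal illustration under that assumption (the 2-vote threshold, function names, and toy sentence are all hypothetical, not taken from the paper); it aggregates per-token crowd labels and scores the result against expert labels with the token-level F1 the abstract reports.

def corroborative_vote(worker_labels, min_votes=2):
    """Aggregate per-token binary labels from several Turkers.

    worker_labels: one label sequence per worker; each sequence marks
    tokens as 1 (medical term) or 0 (not). A token is accepted only
    when at least `min_votes` workers corroborate it.
    """
    n_tokens = len(worker_labels[0])
    votes = [sum(seq[i] for seq in worker_labels) for i in range(n_tokens)]
    return [1 if v >= min_votes else 0 for v in votes]

def f1_score(predicted, gold):
    """Token-level F1 of predicted binary labels against a gold standard."""
    tp = sum(1 for p, g in zip(predicted, gold) if p == 1 and g == 1)
    fp = sum(1 for p, g in zip(predicted, gold) if p == 1 and g == 0)
    fn = sum(1 for p, g in zip(predicted, gold) if p == 0 and g == 1)
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)

# Toy example: three Turkers label "my migraine meds cause nausea".
turkers = [
    [0, 1, 1, 0, 1],
    [0, 1, 0, 0, 1],
    [0, 1, 1, 0, 0],
]
crowd = corroborative_vote(turkers, min_votes=2)  # -> [0, 1, 1, 0, 1]
expert = [0, 1, 0, 0, 1]                          # registered-nurse labels
print(f1_score(crowd, expert))                    # -> 0.8

Requiring corroboration trades recall for precision: no single over-eager worker can push a spurious term into the aggregate labels.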


Bibliographic Details
Main Authors: MacLean, Diana Lynn, Heer, Jeffrey
Format: Online Article Text
Language: English
Published: BMJ Publishing Group 2013
Subjects: Focus on Patient Care
Online Access: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC3822103/
https://www.ncbi.nlm.nih.gov/pubmed/23645553
http://dx.doi.org/10.1136/amiajnl-2012-001110
_version_ 1782290385536548864
author MacLean, Diana Lynn
Heer, Jeffrey
author_facet MacLean, Diana Lynn
Heer, Jeffrey
author_sort MacLean, Diana Lynn
collection PubMed
description BACKGROUND AND OBJECTIVE: As people increasingly engage in online health-seeking behavior and contribute to health-oriented websites, the volume of medical text authored by patients and other medical novices grows rapidly. However, we lack an effective method for automatically identifying medical terms in patient-authored text (PAT). We demonstrate that crowdsourcing PAT medical term identification tasks to non-experts is a viable method for creating large, accurately-labeled PAT datasets; moreover, such datasets can be used to train classifiers that outperform existing medical term identification tools. MATERIALS AND METHODS: To evaluate the viability of using non-expert crowds to label PAT, we compare expert (registered nurses) and non-expert (Amazon Mechanical Turk workers; Turkers) responses to a PAT medical term identification task. Next, we build a crowd-labeled dataset comprising 10 000 sentences from MedHelp. We train two models on this dataset and evaluate their performance, as well as that of MetaMap, Open Biomedical Annotator (OBA), and NaCTeM's TerMINE, against two gold standard datasets: one from MedHelp and the other from CureTogether. RESULTS: When aggregated according to a corroborative voting policy, Turker responses predict expert responses with an F1 score of 84%. A conditional random field (CRF) trained on 10 000 crowd-labeled MedHelp sentences achieves an F1 score of 78% against the CureTogether gold standard, widely outperforming OBA (47%), TerMINE (43%), and MetaMap (39%). A failure analysis of the CRF suggests that misclassified terms are likely to be either generic or rare. CONCLUSIONS: Our results show that combining statistical models sensitive to sentence-level context with crowd-labeled data is a scalable and effective technique for automatically identifying medical terms in PAT.
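
The description above credits the best results to a conditional random field trained on crowd-labeled sentences. This record names neither the CRF toolkit nor the feature set, so the sketch below shows only the general technique - BIO-tagged labels plus simple sentence-level context features - using the open-source sklearn-crfsuite package as one possible implementation. The features and the toy training sentence are illustrative assumptions.

import sklearn_crfsuite  # pip install sklearn-crfsuite

def token_features(sent, i):
    """Contextual features for token i (an illustrative set, not the
    paper's published features)."""
    word = sent[i]
    feats = {
        "word.lower": word.lower(),
        "word.suffix3": word[-3:],
        "word.isdigit": word.isdigit(),
    }
    if i > 0:
        feats["prev.lower"] = sent[i - 1].lower()  # left context
    else:
        feats["BOS"] = True                        # beginning of sentence
    if i < len(sent) - 1:
        feats["next.lower"] = sent[i + 1].lower()  # right context
    else:
        feats["EOS"] = True                        # end of sentence
    return feats

# Toy crowd-labeled data in BIO form: B-TERM starts a medical term.
sentences = [["my", "migraine", "meds", "cause", "nausea"]]
labels = [["O", "B-TERM", "O", "O", "B-TERM"]]

X = [[token_features(s, i) for i in range(len(s))] for s in sentences]
crf = sklearn_crfsuite.CRF(algorithm="lbfgs", c1=0.1, c2=0.1,
                           max_iterations=100)
crf.fit(X, labels)
print(crf.predict(X))

Because each token's features include its neighbors and the CRF models label transitions jointly, the classifier can exploit sentence-level context that dictionary lookups such as MetaMap and OBA ignore, which is consistent with the performance gap reported above.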
format Online
Article
Text
id pubmed-3822103
institution National Center for Biotechnology Information
language English
publishDate 2013
publisher BMJ Publishing Group
record_format MEDLINE/PubMed
spelling pubmed-3822103 2013-12-11 Identifying medical terms in patient-authored text: a crowdsourcing-based approach MacLean, Diana Lynn Heer, Jeffrey J Am Med Inform Assoc Focus on Patient Care BACKGROUND AND OBJECTIVE: As people increasingly engage in online health-seeking behavior and contribute to health-oriented websites, the volume of medical text authored by patients and other medical novices grows rapidly. However, we lack an effective method for automatically identifying medical terms in patient-authored text (PAT). We demonstrate that crowdsourcing PAT medical term identification tasks to non-experts is a viable method for creating large, accurately-labeled PAT datasets; moreover, such datasets can be used to train classifiers that outperform existing medical term identification tools. MATERIALS AND METHODS: To evaluate the viability of using non-expert crowds to label PAT, we compare expert (registered nurses) and non-expert (Amazon Mechanical Turk workers; Turkers) responses to a PAT medical term identification task. Next, we build a crowd-labeled dataset comprising 10 000 sentences from MedHelp. We train two models on this dataset and evaluate their performance, as well as that of MetaMap, Open Biomedical Annotator (OBA), and NaCTeM's TerMINE, against two gold standard datasets: one from MedHelp and the other from CureTogether. RESULTS: When aggregated according to a corroborative voting policy, Turker responses predict expert responses with an F1 score of 84%. A conditional random field (CRF) trained on 10 000 crowd-labeled MedHelp sentences achieves an F1 score of 78% against the CureTogether gold standard, widely outperforming OBA (47%), TerMINE (43%), and MetaMap (39%). A failure analysis of the CRF suggests that misclassified terms are likely to be either generic or rare. CONCLUSIONS: Our results show that combining statistical models sensitive to sentence-level context with crowd-labeled data is a scalable and effective technique for automatically identifying medical terms in PAT. BMJ Publishing Group 2013-11 2013-05-05 /pmc/articles/PMC3822103/ /pubmed/23645553 http://dx.doi.org/10.1136/amiajnl-2012-001110 Text en Published by the BMJ Publishing Group Limited. For permission to use (where not already granted under a licence) please go to http://group.bmj.com/group/rights-licensing/permissions This is an Open Access article distributed in accordance with the Creative Commons Attribution Non Commercial (CC BY-NC 3.0) license, which permits others to distribute, remix, adapt, build upon this work non-commercially, and license their derivative works on different terms, provided the original work is properly cited and the use is non-commercial. See: http://creativecommons.org/licenses/by-nc/3.0/
spellingShingle Focus on Patient Care
MacLean, Diana Lynn
Heer, Jeffrey
Identifying medical terms in patient-authored text: a crowdsourcing-based approach
title Identifying medical terms in patient-authored text: a crowdsourcing-based approach
title_full Identifying medical terms in patient-authored text: a crowdsourcing-based approach
title_fullStr Identifying medical terms in patient-authored text: a crowdsourcing-based approach
title_full_unstemmed Identifying medical terms in patient-authored text: a crowdsourcing-based approach
title_short Identifying medical terms in patient-authored text: a crowdsourcing-based approach
title_sort identifying medical terms in patient-authored text: a crowdsourcing-based approach
topic Focus on Patient Care
url https://www.ncbi.nlm.nih.gov/pmc/articles/PMC3822103/
https://www.ncbi.nlm.nih.gov/pubmed/23645553
http://dx.doi.org/10.1136/amiajnl-2012-001110
work_keys_str_mv AT macleandianalynn identifyingmedicaltermsinpatientauthoredtextacrowdsourcingbasedapproach
AT heerjeffrey identifyingmedicaltermsinpatientauthoredtextacrowdsourcingbasedapproach