
Functional evaluation of out-of-the-box text-mining tools for data-mining tasks

Objective The trade-off between the speed and simplicity of dictionary-based term recognition and the richer linguistic information provided by more advanced natural language processing (NLP) is an area of active discussion in clinical informatics. In this paper, we quantify this trade-off among tex...


Bibliographic Details
Main authors: Jung, Kenneth, LePendu, Paea, Iyer, Srinivasan, Bauer-Mehren, Anna, Percha, Bethany, Shah, Nigam H
Format: Online Article Text
Language: English
Published: Oxford University Press 2015
Subjects:
Online access: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4433377/
https://www.ncbi.nlm.nih.gov/pubmed/25336595
http://dx.doi.org/10.1136/amiajnl-2014-002902
author Jung, Kenneth
LePendu, Paea
Iyer, Srinivasan
Bauer-Mehren, Anna
Percha, Bethany
Shah, Nigam H
collection PubMed
description Objective The trade-off between the speed and simplicity of dictionary-based term recognition and the richer linguistic information provided by more advanced natural language processing (NLP) is an area of active discussion in clinical informatics. In this paper, we quantify this trade-off among text processing systems that make different trade-offs between speed and linguistic understanding. We tested both types of systems in three clinical research tasks: phase IV safety profiling of a drug, learning adverse drug–drug interactions, and learning used-to-treat relationships between drugs and indications. Materials We first benchmarked the accuracy of the NCBO Annotator and REVEAL in a manually annotated, publicly available dataset from the 2008 i2b2 Obesity Challenge. We then applied the NCBO Annotator and REVEAL to 9 million clinical notes from the Stanford Translational Research Integrated Database Environment (STRIDE) and used the resulting data for three research tasks. Results There is no significant difference between using the NCBO Annotator and REVEAL in the results of the three research tasks when using large datasets. In one subtask, REVEAL achieved higher sensitivity with smaller datasets. Conclusions For a variety of tasks, employing simple term recognition methods instead of advanced NLP methods results in little or no impact on accuracy when using large datasets. Simpler dictionary-based methods have the advantage of scaling well to very large datasets. Promoting the use of simple, dictionary-based methods for population-level analyses can advance adoption of NLP in practice.
format Online
Article
Text
id pubmed-4433377
institution National Center for Biotechnology Information
language English
publishDate 2015
publisher Oxford University Press
record_format MEDLINE/PubMed
spelling pubmed-4433377 2016-01-01 Functional evaluation of out-of-the-box text-mining tools for data-mining tasks Jung, Kenneth LePendu, Paea Iyer, Srinivasan Bauer-Mehren, Anna Percha, Bethany Shah, Nigam H J Am Med Inform Assoc Research and Applications
Oxford University Press 2015-01 2014-10-21 /pmc/articles/PMC4433377/ /pubmed/25336595 http://dx.doi.org/10.1136/amiajnl-2014-002902 Text en © The Author 2014. Published by Oxford University Press on behalf of the American Medical Informatics Association. http://creativecommons.org/licenses/by-nc/4.0 This is an Open Access article distributed under the terms of the Creative Commons Attribution Non-Commercial License (http://creativecommons.org/licenses/by-nc/4.0/), which permits non-commercial re-use, distribution, and reproduction in any medium, provided the original work is properly cited. For commercial re-use, please contact journals.permissions@oup.com. For numbered affiliations see end of article.
title Functional evaluation of out-of-the-box text-mining tools for data-mining tasks
topic Research and Applications