A UMLS-based spell checker for natural language processing in vaccine safety

BACKGROUND: The Institute of Medicine has identified patient safety as a key goal for health care in the United States. Detecting vaccine adverse events is an important public health activity that contributes to patient safety. Reports about adverse events following immunization (AEFI) from surveill...

Bibliographic Details
Main authors: Tolentino, Herman D; Matters, Michael D; Walop, Wikke; Law, Barbara; Tong, Wesley; Liu, Fang; Fontelo, Paul; Kohl, Katrin; Payne, Daniel C
Format: Text
Language: English
Published: BioMed Central 2007
Online access: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC1805499/
https://www.ncbi.nlm.nih.gov/pubmed/17295907
http://dx.doi.org/10.1186/1472-6947-7-3
collection PubMed
description BACKGROUND: The Institute of Medicine has identified patient safety as a key goal for health care in the United States. Detecting vaccine adverse events is an important public health activity that contributes to patient safety. Reports about adverse events following immunization (AEFI) from surveillance systems contain free-text components that can be analyzed using natural language processing. To extract Unified Medical Language System (UMLS) concepts from free text and classify AEFI reports based on concepts they contain, we first needed to clean the text by expanding abbreviations and shortcuts and correcting spelling errors. Our objective in this paper was to create a UMLS-based spelling error correction tool as a first step in the natural language processing (NLP) pipeline for AEFI reports.
METHODS: We developed spell checking algorithms using open source tools. We used de-identified AEFI surveillance reports to create free-text data sets for analysis. After expansion of abbreviated clinical terms and shortcuts, we performed spelling correction in four steps: (1) error detection, (2) word list generation, (3) word list disambiguation and (4) error correction. We then measured the performance of the resulting spell checker by comparing it to manual correction.
RESULTS: We used 12,056 words to train the spell checker and tested its performance on 8,131 words. During testing, sensitivity, specificity, and positive predictive value (PPV) for the spell checker were 74% (95% CI: 74–75), 100% (95% CI: 100–100), and 47% (95% CI: 46–48), respectively.
CONCLUSION: We created a prototype spell checker that can be used to process AEFI reports. We used the UMLS Specialist Lexicon as the primary source of dictionary terms and the WordNet lexicon as a secondary source. We used the UMLS as a domain-specific source of dictionary terms to compare potentially misspelled words in the corpus. The prototype sensitivity was comparable to currently available tools, but the specificity was much superior. The slow processing speed may be improved by trimming it down to the most useful component algorithms. Other investigators may find the methods we developed useful for cleaning text using lexicons specific to their area of interest.
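The four correction steps named in the abstract (error detection, word list generation, word list disambiguation, error correction) can be illustrated with a minimal Python sketch. This is not the authors' implementation: the abstract does not say which algorithms were used, so string similarity from the standard-library difflib module is swapped in here as an assumption, and the two tiny word sets are hypothetical stand-ins for the UMLS Specialist Lexicon and WordNet.

```python
import difflib
import re

# Hypothetical toy lexicons (assumptions for illustration only); the real
# UMLS Specialist Lexicon and WordNet are far larger.
PRIMARY_LEXICON = {"fever", "swelling", "injection", "site", "rash", "vaccine", "redness"}
SECONDARY_LEXICON = {"the", "at", "and", "after", "patient", "had", "of"}


def detect_errors(tokens, lexicon):
    """Step 1 (error detection): flag tokens absent from the lexicon."""
    return [t for t in tokens if t not in lexicon]


def generate_candidates(word, lexicon, n=5, cutoff=0.7):
    """Step 2 (word list generation): collect similar lexicon words."""
    return difflib.get_close_matches(word, sorted(lexicon), n=n, cutoff=cutoff)


def disambiguate(word, candidates):
    """Step 3 (disambiguation): keep the most similar candidate,
    preferring a domain (primary lexicon) term when similarity ties."""
    if not candidates:
        return None
    return max(
        candidates,
        key=lambda c: (difflib.SequenceMatcher(None, word, c).ratio(),
                       c in PRIMARY_LEXICON),
    )


def correct_text(text):
    """Step 4 (error correction): replace each flagged token that has a
    candidate; tokens without any candidate are left unchanged."""
    lexicon = PRIMARY_LEXICON | SECONDARY_LEXICON
    tokens = re.findall(r"[a-z]+", text.lower())
    corrections = {}
    for word in set(detect_errors(tokens, lexicon)):
        best = disambiguate(word, generate_candidates(word, lexicon))
        if best is not None:
            corrections[word] = best
    return " ".join(corrections.get(t, t) for t in tokens)


if __name__ == "__main__":
    print(correct_text("Patient had sweling and fevr at the injecton site"))
```

In this sketch, lowering the cutoff proposes more candidates and catches more misspellings, but it also raises the chance of "correcting" a valid word that is simply missing from the lexicon.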
format Text
id pubmed-1805499
institution National Center for Biotechnology Information
record_format MEDLINE/PubMed
spelling pubmed-1805499 2007-02-28 BMC Med Inform Decis Mak Research Article BioMed Central 2007-02-12 /pmc/articles/PMC1805499/ /pubmed/17295907 http://dx.doi.org/10.1186/1472-6947-7-3 Text en Copyright © 2007 Tolentino et al; licensee BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.
topic Research Article