
Similarity-Based Unsupervised Spelling Correction Using BioWordVec: Development and Usability Study of Bacterial Culture and Antimicrobial Susceptibility Reports

BACKGROUND: Existing bacterial culture test results for infectious diseases are written in unrefined text, resulting in many problems, including typographical errors and stop words. Effective spelling correction processes are needed to ensure the accuracy and reliability of data for the study of inf...


Bibliographic Details
Main authors: Kim, Taehyeong, Han, Sung Won, Kang, Minji, Lee, Se Ha, Kim, Jong-Ho, Joo, Hyung Joon, Sohn, Jang Wook
Format: Online Article Text
Language: English
Published: JMIR Publications 2021
Subjects:
Online access: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC7939936/
https://www.ncbi.nlm.nih.gov/pubmed/33616536
http://dx.doi.org/10.2196/25530
author Kim, Taehyeong
Han, Sung Won
Kang, Minji
Lee, Se Ha
Kim, Jong-Ho
Joo, Hyung Joon
Sohn, Jang Wook
collection PubMed
description BACKGROUND: Existing bacterial culture test results for infectious diseases are written in unrefined text, resulting in many problems, including typographical errors and stop words. Effective spelling correction processes are needed to ensure the accuracy and reliability of data for the study of infectious diseases, including medical terminology extraction. If a dictionary is established, spelling algorithms using edit distance are efficient. However, in the absence of a dictionary, traditional spelling correction algorithms that utilize only edit distances have limitations. OBJECTIVE: In this research, we proposed a similarity-based spelling correction algorithm using pretrained word embedding with the BioWordVec technique. This method uses a character-level N-grams–based distributed representation through unsupervised learning rather than the existing rule-based method. In other words, we propose a framework that detects and corrects typographical errors when a dictionary is not in place. METHODS: For detected typographical errors not mapped to Systematized Nomenclature of Medicine (SNOMED) clinical terms, a correction candidate group with high similarity considering the edit distance was generated using pretrained word embedding from the clinical database. From the embedding matrix in which the vocabulary is arranged in descending order according to frequency, a grid search was used to search for candidate groups of similar words. Thereafter, the correction candidate words were ranked in consideration of the frequency of the words, and the typographical errors were finally corrected according to the ranking. RESULTS: Bacterial identification words were extracted from 27,544 bacterial culture and antimicrobial susceptibility reports, and 16 types of spelling errors and 914 misspelled words were found. 
The similarity-based spelling correction algorithm using BioWordVec proposed in this research corrected 12 types of typographical errors and showed very high performance in correcting 97.48% (based on F1 score) of all spelling errors. CONCLUSIONS: This tool corrected spelling errors effectively in the absence of a dictionary based on bacterial identification words in bacterial culture and antimicrobial susceptibility reports. This method will help build a high-quality refined database of vast text data for electronic health records.
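The candidate-generation and ranking steps described in METHODS can be illustrated with a minimal sketch. It is not the authors' implementation. As a stand-in for the pretrained BioWordVec embedding, it compares words by the cosine similarity of their character n-gram counts. The frequency table `vocab_freq`, the thresholds `min_sim` and `max_edits`, and all function names are hypothetical, and membership in `vocab_freq` stands in for a successful SNOMED mapping.

```python
from collections import Counter
from math import sqrt


def char_ngrams(word, n_min=3, n_max=5):
    """Count character n-grams of a word, with < and > as boundary markers."""
    padded = f"<{word.lower()}>"
    grams = Counter()
    for n in range(n_min, n_max + 1):
        for i in range(len(padded) - n + 1):
            grams[padded[i:i + n]] += 1
    return grams


def cosine(a, b):
    """Cosine similarity between two n-gram count vectors."""
    dot = sum(count * b[gram] for gram, count in a.items())
    norm = sqrt(sum(c * c for c in a.values())) * sqrt(sum(c * c for c in b.values()))
    return dot / norm if norm else 0.0


def edit_distance(s, t):
    """Levenshtein distance computed row by row."""
    prev = list(range(len(t) + 1))
    for i, cs in enumerate(s, start=1):
        curr = [i]
        for j, ct in enumerate(t, start=1):
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + (cs != ct)))
        prev = curr
    return prev[-1]


def correct(token, vocab_freq, min_sim=0.3, max_edits=2):
    """Return the most frequent vocabulary word that is similar to the token.

    vocab_freq maps known words to corpus frequencies (hypothetical data).
    A token already in vocab_freq is treated as correctly spelled.
    """
    token = token.lower()
    if token in vocab_freq:
        return token
    token_vec = char_ngrams(token)
    candidates = []
    for word, freq in vocab_freq.items():
        sim = cosine(token_vec, char_ngrams(word))
        if sim >= min_sim and edit_distance(token, word) <= max_edits:
            candidates.append((freq, sim, word))
    if not candidates:
        return token  # nothing similar enough: leave the token unchanged
    # Rank by frequency first, then by similarity as a tie-breaker.
    return max(candidates)[2]


if __name__ == "__main__":
    vocab = {"staphylococcus": 120, "enterococcus": 40, "enterococci": 5}
    for raw in ["staphylococus", "enterococus", "xyz"]:
        print(raw, "->", correct(raw, vocab))
```

Ranking candidates by frequency means that, among several plausible corrections, the more common term wins. That matches the ranking step in the abstract, but it can favor a frequent term over a closer but rarer one.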
format Online
Article
Text
id pubmed-7939936
institution National Center for Biotechnology Information
language English
publishDate 2021
publisher JMIR Publications
record_format MEDLINE/PubMed
spelling pubmed-7939936 2021-03-12 JMIR Med Inform Original Paper JMIR Publications 2021-02-22 /pmc/articles/PMC7939936/ /pubmed/33616536 http://dx.doi.org/10.2196/25530 Text en ©Taehyeong Kim, Sung Won Han, Minji Kang, Se Ha Lee, Jong-Ho Kim, Hyung Joon Joo, Jang Wook Sohn. Originally published in JMIR Medical Informatics (http://medinform.jmir.org), 22.02.2021. https://creativecommons.org/licenses/by/4.0/ This is an open-access article distributed under the terms of the Creative Commons Attribution License (https://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work, first published in JMIR Medical Informatics, is properly cited. The complete bibliographic information, a link to the original publication on http://medinform.jmir.org/, as well as this copyright and license information must be included.
title Similarity-Based Unsupervised Spelling Correction Using BioWordVec: Development and Usability Study of Bacterial Culture and Antimicrobial Susceptibility Reports
topic Original Paper