Application of Efficient Data Cleaning Using Text Clustering for Semistructured Medical Reports to Large-Scale Stool Examination Reports: Methodology Study

BACKGROUND: Since medical research based on big data has become more common, the community’s interest and effort to analyze a large amount of semistructured or unstructured text data, such as examination reports, have rapidly increased. However, these large-scale text data are often not readily appl...

Bibliographic Details
Main authors: Woo, Hyunki, Kim, Kyunga, Cha, KyeongMin, Lee, Jin-Young, Mun, Hansong, Cho, Soo Jin, Chung, Ji In, Pyo, Jeung Hui, Lee, Kun-Chul, Kang, Mira
Format: Online Article Text
Language: English
Published: JMIR Publications 2019
Subjects:
Online access: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC6329435/
https://www.ncbi.nlm.nih.gov/pubmed/30622098
http://dx.doi.org/10.2196/10013
_version_ 1783386829986201600
author Woo, Hyunki
Kim, Kyunga
Cha, KyeongMin
Lee, Jin-Young
Mun, Hansong
Cho, Soo Jin
Chung, Ji In
Pyo, Jeung Hui
Lee, Kun-Chul
Kang, Mira
collection PubMed
description BACKGROUND: Since medical research based on big data has become more common, the community’s interest in and efforts toward analyzing large amounts of semistructured or unstructured text data, such as examination reports, have rapidly increased. However, these large-scale text data are often not readily applicable to analysis owing to typographical errors, inconsistencies, or data entry problems. Therefore, an efficient data cleaning process is required to ensure the veracity of such data. OBJECTIVE: In this paper, we proposed an efficient data cleaning process for large-scale medical text data, which employs text clustering methods and a value-converting technique, and evaluated its performance with medical examination text data. METHODS: The proposed data cleaning process consists of text clustering followed by value converting (merging). In the text clustering step, we suggested using key collision and nearest neighbor methods in a complementary manner. Words (called values) in the same cluster are expected to be a correct value and its erroneous representations. In the value-converting step, the wrong values in each identified cluster are converted into the correct value. We applied this data cleaning process to 574,266 stool examination reports produced for parasite analysis at Samsung Medical Center from 1995 to 2015. The performance of the proposed process was examined and compared with data cleaning processes based on a single clustering method. We used OpenRefine 2.7, an open-source application that provides various text clustering methods and an efficient user interface for value converting with common-value suggestion. RESULTS: A total of 1,167,104 words in stool examination reports were surveyed. In the data cleaning process, we discovered 30 correct words and 45 patterns of typographical errors and duplicates. We observed high correction rates for words with typographical errors (98.61%) and typographical error patterns (97.78%). The resulting data accuracy was nearly 100% based on the number of total words. CONCLUSIONS: Our data cleaning process based on the combinatorial use of key collision and nearest neighbor methods provides efficient cleaning of large-scale text data and hence improves data accuracy.
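The two-step process described in METHODS can be illustrated with a minimal Python sketch. It is not the authors' pipeline and not OpenRefine's implementation: `fingerprint` is a simplified stand-in for a key collision method, `levenshtein` stands in for a nearest neighbor distance, and `build_mapping`, the `max_distance` threshold, and the sample values are all assumptions made for illustration. The sketch also picks the most frequent spelling automatically, whereas the abstract describes a user interface that suggests a common value.

```python
import string
from collections import Counter, defaultdict


def fingerprint(value: str) -> str:
    """Simplified key: trim, lowercase, drop punctuation, sort unique tokens."""
    cleaned = value.strip().lower()
    cleaned = cleaned.translate(str.maketrans("", "", string.punctuation))
    return " ".join(sorted(set(cleaned.split())))


def levenshtein(a: str, b: str) -> int:
    """Edit distance (insertions, deletions, substitutions) between two strings."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + cost))
        prev = curr
    return prev[-1]


def build_mapping(values, max_distance=1):
    """Return a dict mapping every raw value to its cluster's chosen value."""
    counts = Counter(values)

    # Step 1 (key collision): raw values sharing a fingerprint form one group.
    groups = defaultdict(list)
    for raw in counts:
        groups[fingerprint(raw)].append(raw)

    def group_total(key):
        return sum(counts[raw] for raw in groups[key])

    # Step 2 (nearest neighbor): visit groups from most to least frequent and
    # merge a group into an already accepted key within max_distance edits.
    canonical = {}  # accepted key -> representative raw spelling
    mapping = {}
    for key in sorted(groups, key=lambda k: (-group_total(k), k)):
        target = next(
            (k for k in canonical if levenshtein(key, k) <= max_distance), None
        )
        if target is None:
            # New cluster: its most frequent raw spelling becomes the value.
            canonical[key] = max(groups[key], key=lambda r: (counts[r], r))
            target = key
        # Value converting: every member is rewritten to the cluster's value.
        for raw in groups[key]:
            mapping[raw] = canonical[target]
    return mapping


# Invented sample values, not taken from the stool examination reports.
sample = ["Negative", "Negative", "Negative", "negative ", "Negative.",
          "Negatve", "Positive"]
mapping = build_mapping(sample)
cleaned = [mapping[v] for v in sample]
```

In this sketch, key collision catches case, whitespace, and punctuation variants cheaply, and the distance pass then catches misspellings that key collision misses, which is one way to read "in a complementary manner". A greedy, frequency-ordered merge is a design shortcut here; in practice each proposed cluster would be reviewed before values are converted.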
format Online
Article
Text
id pubmed-6329435
institution National Center for Biotechnology Information
language English
publishDate 2019
publisher JMIR Publications
record_format MEDLINE/PubMed
spelling pubmed-6329435 2019-02-11 J Med Internet Res Original Paper JMIR Publications 2019-01-08 /pmc/articles/PMC6329435/ /pubmed/30622098 http://dx.doi.org/10.2196/10013 Text en ©Hyunki Woo, Kyunga Kim, KyeongMin Cha, Jin-Young Lee, Hansong Mun, Soo Jin Cho, Ji In Chung, Jeung Hui Pyo, Kun-Chul Lee, Mira Kang. Originally published in the Journal of Medical Internet Research (http://www.jmir.org), 08.01.2019. https://creativecommons.org/licenses/by/4.0/ This is an open-access article distributed under the terms of the Creative Commons Attribution License (https://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work, first published in the Journal of Medical Internet Research, is properly cited. The complete bibliographic information, a link to the original publication on http://www.jmir.org/, as well as this copyright and license information must be included.
title Application of Efficient Data Cleaning Using Text Clustering for Semistructured Medical Reports to Large-Scale Stool Examination Reports: Methodology Study
topic Original Paper
url https://www.ncbi.nlm.nih.gov/pmc/articles/PMC6329435/
https://www.ncbi.nlm.nih.gov/pubmed/30622098
http://dx.doi.org/10.2196/10013