
A scoping review of preprocessing methods for unstructured text data to assess data quality

INTRODUCTION: Unstructured text data (UTD) are increasingly found in many databases that were never intended to be used for research, including electronic medical record (EMR) databases. Data quality can impact the usefulness of UTD for research. UTD are typically prepared for analysis (i.e., prepro...


Bibliographic Details
Main authors: Nesca, Marcello, Katz, Alan, Leung, Carson K., Lix, Lisa M.
Format: Online Article Text
Language: English
Published: Swansea University 2022
Subjects:
Online access: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC10476151/
https://www.ncbi.nlm.nih.gov/pubmed/37670734
http://dx.doi.org/10.23889/ijpds.v6i1.1757
author Nesca, Marcello
Katz, Alan
Leung, Carson K.
Lix, Lisa M.
collection PubMed
description INTRODUCTION: Unstructured text data (UTD) are increasingly found in many databases that were never intended to be used for research, including electronic medical record (EMR) databases. Data quality can impact the usefulness of UTD for research. UTD are typically prepared for analysis (i.e., preprocessed) and analyzed using natural language processing (NLP) techniques. Different NLP methods are used to preprocess UTD and may affect data quality. OBJECTIVE: Our objective was to systematically document current research and practices about NLP preprocessing methods to describe or improve the quality of UTD, including UTD found in EMR databases. METHODS: A scoping review was undertaken of peer-reviewed studies published between December 2002 and January 2021. Scopus, Web of Science, ProQuest, and EBSCOhost were searched for literature relevant to the study objective. Information extracted from the studies included article characteristics (i.e., year of publication, journal discipline), data characteristics, types of preprocessing methods, and data quality topics. Study data were presented using a narrative synthesis. RESULTS: A total of 41 articles were included in the scoping review; over 50% were published between 2016 and 2021. Almost 20% of the articles were published in health science journals. Common preprocessing methods included removal of extraneous text elements such as stop words, punctuation, and numbers; word tokenization; and part-of-speech tagging. Data quality topics for articles about EMR data included misspelled words, security (i.e., de-identification), word variability, sources of noise, quality of annotations, and ambiguity of abbreviations. CONCLUSIONS: Multiple NLP techniques have been proposed to preprocess UTD, with some differences in techniques applied to EMR data. There are similarities in the data quality dimensions used to characterize structured data and UTD. While there are a few general-purpose measures of data quality that do not require external data, most of these focus on the measurement of noise.
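The common preprocessing steps named in the abstract (removal of stop words, punctuation, and numbers, followed by word tokenization) can be illustrated with a minimal Python sketch. It is not taken from the review or from any of the included studies: the stop-word list, the function names `preprocess` and `removed_fraction`, and the sample note are all hypothetical, and the "removed fraction" is only a crude illustrative proxy, not one of the noise measures the review discusses.

```python
import re
import string

# Hypothetical, deliberately tiny stop-word list (illustration only).
STOP_WORDS = {"a", "an", "and", "the", "of", "to", "in", "on", "was", "is", "for", "with"}


def preprocess(text):
    """Lowercase, strip punctuation and digits, tokenize on whitespace, drop stop words."""
    text = text.lower()
    # Remove ASCII punctuation; note this also fuses hyphenated words.
    text = text.translate(str.maketrans("", "", string.punctuation))
    # Replace runs of digits with a space so neighbouring words stay separate.
    text = re.sub(r"\d+", " ", text)
    tokens = text.split()
    return [tok for tok in tokens if tok not in STOP_WORDS]


def removed_fraction(text):
    """Share of whitespace-delimited tokens discarded by preprocess (assumed proxy)."""
    raw_tokens = text.split()
    if not raw_tokens:
        return 0.0
    return 1 - len(preprocess(text)) / len(raw_tokens)


note = "Pt. was given 20 mg of the drug on 3 occasions."
print(preprocess(note))
print(round(removed_fraction(note), 3))
```

A real pipeline would more likely rely on an established NLP library for tokenization and part-of-speech tagging; the standard-library version here only makes the individual steps visible.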
format Online
Article
Text
id pubmed-10476151
institution National Center for Biotechnology Information
language English
publishDate 2022
publisher Swansea University
record_format MEDLINE/PubMed
spelling pubmed-10476151 2023-09-05 A scoping review of preprocessing methods for unstructured text data to assess data quality Nesca, Marcello Katz, Alan Leung, Carson K. Lix, Lisa M. Int J Popul Data Sci Population Data Science Swansea University 2022-10-04 /pmc/articles/PMC10476151/ /pubmed/37670734 http://dx.doi.org/10.23889/ijpds.v6i1.1757 Text en https://creativecommons.org/licenses/by-nc-nd/4.0/ This work is licensed under a Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International License.
title A scoping review of preprocessing methods for unstructured text data to assess data quality
topic Population Data Science