Complexities, variations, and errors of numbering within clinical notes: the potential impact on information extraction and cohort-identification

BACKGROUND: Numbers and numerical concepts appear frequently in free text clinical notes from electronic health records. Knowledge of the frequent lexical variations of these numerical concepts, and their accurate identification, is important for many information extraction tasks. This paper describ...

Bibliographic Details
Main authors: Hanauer, David A., Mei, Qiaozhu, Vydiswaran, V. G. Vinod, Singh, Karandeep, Landis-Lewis, Zach, Weng, Chunhua
Format: Online Article Text
Language: English
Published: BioMed Central 2019
Subjects:
Online access: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC6448181/
https://www.ncbi.nlm.nih.gov/pubmed/30944012
http://dx.doi.org/10.1186/s12911-019-0784-1
_version_ 1783408646434062336
author Hanauer, David A.
Mei, Qiaozhu
Vydiswaran, V. G. Vinod
Singh, Karandeep
Landis-Lewis, Zach
Weng, Chunhua
collection PubMed
description BACKGROUND: Numbers and numerical concepts appear frequently in free text clinical notes from electronic health records. Knowledge of the frequent lexical variations of these numerical concepts, and their accurate identification, is important for many information extraction tasks. This paper describes an analysis of the variation in how numbers and numerical concepts are represented in clinical notes. METHODS: We used an inverted index of approximately 100 million notes to obtain the frequency of various permutations of numbers and numerical concepts, including the use of Roman numerals, numbers spelled as English words, and invalid dates, among others. Overall, twelve types of lexical variants were analyzed. RESULTS: We found substantial variation in how these concepts were represented in the notes, including multiple data quality issues. We also demonstrate that not considering these variations could have substantial real-world implications for cohort identification tasks, with one case missing > 80% of potential patients. CONCLUSIONS: Numbering within clinical notes can be variable, and not taking these variations into account could result in missing or inaccurate information for natural language processing and information retrieval tasks.
format Online
Article
Text
id pubmed-6448181
institution National Center for Biotechnology Information
language English
publishDate 2019
publisher BioMed Central
record_format MEDLINE/PubMed
spelling pubmed-6448181 2019-04-15 Complexities, variations, and errors of numbering within clinical notes: the potential impact on information extraction and cohort-identification Hanauer, David A. Mei, Qiaozhu Vydiswaran, V. G. Vinod Singh, Karandeep Landis-Lewis, Zach Weng, Chunhua BMC Med Inform Decis Mak Research BACKGROUND: Numbers and numerical concepts appear frequently in free text clinical notes from electronic health records. Knowledge of the frequent lexical variations of these numerical concepts, and their accurate identification, is important for many information extraction tasks. This paper describes an analysis of the variation in how numbers and numerical concepts are represented in clinical notes. METHODS: We used an inverted index of approximately 100 million notes to obtain the frequency of various permutations of numbers and numerical concepts, including the use of Roman numerals, numbers spelled as English words, and invalid dates, among others. Overall, twelve types of lexical variants were analyzed. RESULTS: We found substantial variation in how these concepts were represented in the notes, including multiple data quality issues. We also demonstrate that not considering these variations could have substantial real-world implications for cohort identification tasks, with one case missing > 80% of potential patients. CONCLUSIONS: Numbering within clinical notes can be variable, and not taking these variations into account could result in missing or inaccurate information for natural language processing and information retrieval tasks. BioMed Central 2019-04-04 /pmc/articles/PMC6448181/ /pubmed/30944012 http://dx.doi.org/10.1186/s12911-019-0784-1 Text en © The Author(s). 2019 Open Access This article is distributed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated.
title Complexities, variations, and errors of numbering within clinical notes: the potential impact on information extraction and cohort-identification
topic Research
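
The abstract describes querying an inverted index for the frequency of lexical variants of a number (digits, English words, Roman numerals) and for invalid dates. The following is a minimal sketch of that idea, not the authors' implementation: the three-note corpus, the function names, the 1 to 10 range, and the US-style MM/DD/YYYY date pattern are all assumptions made for illustration.

```python
import re
from collections import defaultdict
from datetime import date

# Hypothetical toy corpus (assumption); the study used about 100 million notes.
NOTES = {
    1: "Patient has stage IV disease, seen 02/30/2015.",
    2: "Stage 4 carcinoma diagnosed 03/12/2014.",
    3: "Reports four episodes; stage four disease.",
}

ROMAN = [(10, "x"), (9, "ix"), (5, "v"), (4, "iv"), (1, "i")]
WORDS = ["zero", "one", "two", "three", "four", "five",
         "six", "seven", "eight", "nine", "ten"]


def to_roman(n):
    """Lowercase Roman numeral for a small positive integer."""
    out = []
    for value, numeral in ROMAN:
        while n >= value:
            out.append(numeral)
            n -= value
    return "".join(out)


def variants(n):
    """Digit, English-word, and Roman-numeral forms of n (1 to 10 only)."""
    if not 1 <= n <= 10:
        raise ValueError("this sketch only covers 1 to 10")
    return {str(n), WORDS[n], to_roman(n)}


def build_index(notes):
    """Map each lowercase token to the set of note ids containing it."""
    index = defaultdict(set)
    for note_id, text in notes.items():
        for token in re.findall(r"[a-z0-9]+", text.lower()):
            index[token].add(note_id)
    return index


def variant_counts(index, n):
    """Number of notes containing each lexical variant of n."""
    return {v: len(index.get(v, set())) for v in variants(n)}


def invalid_dates(text):
    """Return MM/DD/YYYY strings that are not real calendar dates.

    Assumes US month/day order, which is an assumption of this sketch.
    """
    bad = []
    for m in re.finditer(r"\b(\d{1,2})/(\d{1,2})/(\d{4})\b", text):
        month, day, year = (int(g) for g in m.groups())
        try:
            date(year, month, day)
        except ValueError:
            bad.append(m.group(0))
    return bad


index = build_index(NOTES)
print(variant_counts(index, 4))
print(invalid_dates(NOTES[1]))
```

In this toy corpus each of "4", "four", and "iv" occurs in exactly one note, so a search for the digit alone would retrieve one of the three notes that mention the number. This is the kind of gap the abstract warns about for cohort identification.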