Complexities, variations, and errors of numbering within clinical notes: the potential impact on information extraction and cohort-identification
BACKGROUND: Numbers and numerical concepts appear frequently in free text clinical notes from electronic health records. Knowledge of the frequent lexical variations of these numerical concepts, and their accurate identification, is important for many information extraction tasks. This paper describ...
Main authors: | Hanauer, David A.; Mei, Qiaozhu; Vydiswaran, V. G. Vinod; Singh, Karandeep; Landis-Lewis, Zach; Weng, Chunhua |
---|---|
Format: | Online Article Text |
Language: | English |
Published: | BioMed Central 2019 |
Subjects: | |
Online access: | https://www.ncbi.nlm.nih.gov/pmc/articles/PMC6448181/ https://www.ncbi.nlm.nih.gov/pubmed/30944012 http://dx.doi.org/10.1186/s12911-019-0784-1 |
_version_ | 1783408646434062336 |
---|---|
author | Hanauer, David A.; Mei, Qiaozhu; Vydiswaran, V. G. Vinod; Singh, Karandeep; Landis-Lewis, Zach; Weng, Chunhua |
author_sort | Hanauer, David A. |
collection | PubMed |
description | BACKGROUND: Numbers and numerical concepts appear frequently in free text clinical notes from electronic health records. Knowledge of the frequent lexical variations of these numerical concepts, and their accurate identification, is important for many information extraction tasks. This paper describes an analysis of the variation in how numbers and numerical concepts are represented in clinical notes. METHODS: We used an inverted index of approximately 100 million notes to obtain the frequency of various permutations of numbers and numerical concepts, including the use of Roman numerals, numbers spelled as English words, and invalid dates, among others. Overall, twelve types of lexical variants were analyzed. RESULTS: We found substantial variation in how these concepts were represented in the notes, including multiple data quality issues. We also demonstrate that not considering these variations could have substantial real-world implications for cohort identification tasks, with one case missing > 80% of potential patients. CONCLUSIONS: Numbering within clinical notes can be variable, and not taking these variations into account could result in missing or inaccurate information for natural language processing and information retrieval tasks. |
format | Online Article Text |
id | pubmed-6448181 |
institution | National Center for Biotechnology Information |
language | English |
publishDate | 2019 |
publisher | BioMed Central |
record_format | MEDLINE/PubMed |
spelling | pubmed-6448181 2019-04-15 Complexities, variations, and errors of numbering within clinical notes: the potential impact on information extraction and cohort-identification Hanauer, David A.; Mei, Qiaozhu; Vydiswaran, V. G. Vinod; Singh, Karandeep; Landis-Lewis, Zach; Weng, Chunhua BMC Med Inform Decis Mak Research BACKGROUND: Numbers and numerical concepts appear frequently in free text clinical notes from electronic health records. Knowledge of the frequent lexical variations of these numerical concepts, and their accurate identification, is important for many information extraction tasks. This paper describes an analysis of the variation in how numbers and numerical concepts are represented in clinical notes. METHODS: We used an inverted index of approximately 100 million notes to obtain the frequency of various permutations of numbers and numerical concepts, including the use of Roman numerals, numbers spelled as English words, and invalid dates, among others. Overall, twelve types of lexical variants were analyzed. RESULTS: We found substantial variation in how these concepts were represented in the notes, including multiple data quality issues. We also demonstrate that not considering these variations could have substantial real-world implications for cohort identification tasks, with one case missing > 80% of potential patients. CONCLUSIONS: Numbering within clinical notes can be variable, and not taking these variations into account could result in missing or inaccurate information for natural language processing and information retrieval tasks. BioMed Central 2019-04-04 /pmc/articles/PMC6448181/ /pubmed/30944012 http://dx.doi.org/10.1186/s12911-019-0784-1 Text en © The Author(s). 2019 Open Access This article is distributed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated. |
title | Complexities, variations, and errors of numbering within clinical notes: the potential impact on information extraction and cohort-identification |
topic | Research |
url | https://www.ncbi.nlm.nih.gov/pmc/articles/PMC6448181/ https://www.ncbi.nlm.nih.gov/pubmed/30944012 http://dx.doi.org/10.1186/s12911-019-0784-1 |
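
The METHODS summary in the description field mentions counting how often lexical variants of a numerical concept (Arabic digits, English number words, Roman numerals) occur across notes. The following is a minimal sketch of that idea, not the authors' implementation: the variant table, the `count_variants` and `missed_fraction` helpers, and the sample notes are all hypothetical, and a plain regular-expression scan stands in for the inverted index described in the abstract.

```python
import re
from collections import Counter

# Hypothetical variant table: surface forms that may all denote the number 2.
VARIANTS = {
    "arabic": r"2",
    "word": r"two",
    "roman": r"ii",
}


def count_variants(notes, prefix="type"):
    """Count, per variant, the notes that contain '<prefix> <variant>'."""
    counts = Counter()
    for note in notes:
        for label, form in VARIANTS.items():
            pattern = rf"\b{re.escape(prefix)}\s+{form}\b"
            if re.search(pattern, note, flags=re.IGNORECASE):
                counts[label] += 1
    return counts


def missed_fraction(counts, searched="arabic"):
    """Share of variant matches missed when only one variant is searched."""
    total = sum(counts.values())
    return 0.0 if total == 0 else 1 - counts[searched] / total


# Invented sample notes, for illustration only.
notes = [
    "History of type 2 diabetes.",
    "Type II diabetes, well controlled.",
    "Patient has type two diabetes.",
    "Dx: type II DM.",
    "No diabetes.",
]

counts = count_variants(notes)
print(dict(counts))
print(missed_fraction(counts))
```

In this toy data, a search restricted to the Arabic-digit form finds only one of the four matching notes, which illustrates the kind of cohort-identification gap the abstract describes; the figures here are invented and say nothing about the paper's actual results.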