
Redundancy in electronic health record corpora: analysis, impact on text mining performance and mitigation strategies

BACKGROUND: The increasing availability of Electronic Health Record (EHR) data and specifically free-text patient notes presents opportunities for phenotype extraction. Text-mining methods in particular can help disease modeling by mapping named-entities mentions to terminologies and clustering sema...


Bibliographic Details
Main Authors:	Cohen, Raphael, Elhadad, Michael, Elhadad, Noémie
Format:	Online Article Text
Language:	English
Published:	BioMed Central 2013
Subjects:	Research Article
Online Access:	https://www.ncbi.nlm.nih.gov/pmc/articles/PMC3599108/ https://www.ncbi.nlm.nih.gov/pubmed/23323800 http://dx.doi.org/10.1186/1471-2105-14-10

_version_	1782262888055963648
author	Cohen, Raphael Elhadad, Michael Elhadad, Noémie
author_facet	Cohen, Raphael Elhadad, Michael Elhadad, Noémie
author_sort	Cohen, Raphael
collection	PubMed
description	BACKGROUND: The increasing availability of Electronic Health Record (EHR) data and specifically free-text patient notes presents opportunities for phenotype extraction. Text-mining methods in particular can help disease modeling by mapping named-entities mentions to terminologies and clustering semantically related terms. EHR corpora, however, exhibit specific statistical and linguistic characteristics when compared with corpora in the biomedical literature domain. We focus on copy-and-paste redundancy: clinicians typically copy and paste information from previous notes when documenting a current patient encounter. Thus, within a longitudinal patient record, one expects to observe heavy redundancy. In this paper, we ask three research questions: (i) How can redundancy be quantified in large-scale text corpora? (ii) Conventional wisdom is that larger corpora yield better results in text mining. But how does the observed EHR redundancy affect text mining? Does such redundancy introduce a bias that distorts learned models? Or does the redundancy introduce benefits by highlighting stable and important subsets of the corpus? (iii) How can one mitigate the impact of redundancy on text mining? RESULTS: We analyze a large-scale EHR corpus and quantify redundancy both in terms of word and semantic concept repetition. We observe redundancy levels of about 30% and non-standard distribution of both words and concepts. We measure the impact of redundancy on two standard text-mining applications: collocation identification and topic modeling. We compare the results of these methods on synthetic data with controlled levels of redundancy and observe significant performance variation. Finally, we compare two mitigation strategies to avoid redundancy-induced bias: (i) a baseline strategy, keeping only the last note for each patient in the corpus; (ii) removing redundant notes with an efficient fingerprinting-based algorithm. 
For text mining, preprocessing the EHR corpus with fingerprinting yields significantly better results. CONCLUSIONS: Before applying text-mining techniques, one must pay careful attention to the structure of the analyzed corpora. While the importance of data cleaning has been known for low-level text characteristics (e.g., encoding and spelling), high-level and difficult-to-quantify corpus characteristics, such as naturally occurring redundancy, can also hurt text mining. Fingerprinting enables text-mining techniques to leverage available data in the EHR corpus, while avoiding the bias introduced by redundancy.
format	Online Article Text
id	pubmed-3599108
institution	National Center for Biotechnology Information
language	English
publishDate	2013
publisher	BioMed Central
record_format	MEDLINE/PubMed
spelling	pubmed-3599108 2013-03-17 Redundancy in electronic health record corpora: analysis, impact on text mining performance and mitigation strategies Cohen, Raphael Elhadad, Michael Elhadad, Noémie BMC Bioinformatics Research Article BioMed Central 2013-01-16 /pmc/articles/PMC3599108/ /pubmed/23323800 http://dx.doi.org/10.1186/1471-2105-14-10 Text en Copyright ©2013 Cohen et al; licensee BioMed Central Ltd. http://creativecommons.org/licenses/by/2.0 This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.
spellingShingle	Research Article Cohen, Raphael Elhadad, Michael Elhadad, Noémie Redundancy in electronic health record corpora: analysis, impact on text mining performance and mitigation strategies
title	Redundancy in electronic health record corpora: analysis, impact on text mining performance and mitigation strategies
topic	Research Article
url	https://www.ncbi.nlm.nih.gov/pmc/articles/PMC3599108/ https://www.ncbi.nlm.nih.gov/pubmed/23323800 http://dx.doi.org/10.1186/1471-2105-14-10


Similar Items