Cargando…

Validation of an Improved Computer-Assisted Technique for Mining Free-Text Electronic Medical Records

BACKGROUND: The use of electronic medical records (EMRs) offers opportunity for clinical epidemiological research. With large EMR databases, automated analysis processes are necessary but require thorough validation before they can be routinely used. OBJECTIVE: The aim of this study was to validate...

Descripción completa

Detalles Bibliográficos
Autores principales: Duz, Marco, Marshall, John F, Parkin, Tim
Formato: Online Artículo Texto
Lenguaje:English
Publicado: JMIR Publications 2017
Materias:
Acceso en línea:https://www.ncbi.nlm.nih.gov/pmc/articles/PMC5509949/
https://www.ncbi.nlm.nih.gov/pubmed/28663163
http://dx.doi.org/10.2196/medinform.7123
_version_ 1783250099528269824
author Duz, Marco
Marshall, John F
Parkin, Tim
author_facet Duz, Marco
Marshall, John F
Parkin, Tim
author_sort Duz, Marco
collection PubMed
description BACKGROUND: The use of electronic medical records (EMRs) offers opportunity for clinical epidemiological research. With large EMR databases, automated analysis processes are necessary but require thorough validation before they can be routinely used. OBJECTIVE: The aim of this study was to validate a computer-assisted technique using commercially available content analysis software (SimStat-WordStat v.6 (SS/WS), Provalis Research) for mining free-text EMRs. METHODS: The dataset used for the validation process included life-long EMRs from 335 patients (17,563 rows of data), selected at random from a larger dataset (141,543 patients, ~2.6 million rows of data) and obtained from 10 equine veterinary practices in the United Kingdom. The ability of the computer-assisted technique to detect rows of data (cases) of colic, renal failure, right dorsal colitis, and non-steroidal anti-inflammatory drug (NSAID) use in the population was compared with manual classification. The first step of the computer-assisted analysis process was the definition of inclusion dictionaries to identify cases, including terms identifying a condition of interest. Words in inclusion dictionaries were selected from the list of all words in the dataset obtained in SS/WS. The second step consisted of defining an exclusion dictionary, including combinations of words to remove cases erroneously classified by the inclusion dictionary alone. The third step was the definition of a reinclusion dictionary to reinclude cases that had been erroneously classified by the exclusion dictionary. Finally, cases obtained by the exclusion dictionary were removed from cases obtained by the inclusion dictionary, and cases from the reinclusion dictionary were subsequently reincluded using Rv3.0.2 (R Foundation for Statistical Computing, Vienna, Austria). Manual analysis was performed as a separate process by a single experienced clinician reading through the dataset once and classifying each row of data based on the interpretation of the free-text notes. Validation was performed by comparison of the computer-assisted method with manual analysis, which was used as the gold standard. Sensitivity, specificity, negative predictive values (NPVs), positive predictive values (PPVs), and F values of the computer-assisted process were calculated by comparing them with the manual classification. RESULTS: Lowest sensitivity, specificity, PPVs, NPVs, and F values were 99.82% (1128/1130), 99.88% (16410/16429), 94.6% (223/239), 100.00% (16410/16412), and 99.0% (100×2×0.983×0.998/[0.983+0.998]), respectively. The computer-assisted process required few seconds to run, although an estimated 30 h were required for dictionary creation. Manual classification required approximately 80 man-hours. CONCLUSIONS: The critical step in this work is the creation of accurate and inclusive dictionaries to ensure that no potential cases are missed. It is significantly easier to remove false positive terms from a SS/WS selected subset of a large database than search that original database for potential false negatives. The benefits of using this method are proportional to the size of the dataset to be analyzed.
format Online
Article
Text
id pubmed-5509949
institution National Center for Biotechnology Information
language English
publishDate 2017
publisher JMIR Publications
record_format MEDLINE/PubMed
spelling pubmed-55099492017-07-26 Validation of an Improved Computer-Assisted Technique for Mining Free-Text Electronic Medical Records Duz, Marco Marshall, John F Parkin, Tim JMIR Med Inform Original Paper BACKGROUND: The use of electronic medical records (EMRs) offers opportunity for clinical epidemiological research. With large EMR databases, automated analysis processes are necessary but require thorough validation before they can be routinely used. OBJECTIVE: The aim of this study was to validate a computer-assisted technique using commercially available content analysis software (SimStat-WordStat v.6 (SS/WS), Provalis Research) for mining free-text EMRs. METHODS: The dataset used for the validation process included life-long EMRs from 335 patients (17,563 rows of data), selected at random from a larger dataset (141,543 patients, ~2.6 million rows of data) and obtained from 10 equine veterinary practices in the United Kingdom. The ability of the computer-assisted technique to detect rows of data (cases) of colic, renal failure, right dorsal colitis, and non-steroidal anti-inflammatory drug (NSAID) use in the population was compared with manual classification. The first step of the computer-assisted analysis process was the definition of inclusion dictionaries to identify cases, including terms identifying a condition of interest. Words in inclusion dictionaries were selected from the list of all words in the dataset obtained in SS/WS. The second step consisted of defining an exclusion dictionary, including combinations of words to remove cases erroneously classified by the inclusion dictionary alone. The third step was the definition of a reinclusion dictionary to reinclude cases that had been erroneously classified by the exclusion dictionary. Finally, cases obtained by the exclusion dictionary were removed from cases obtained by the inclusion dictionary, and cases from the reinclusion dictionary were subsequently reincluded using Rv3.0.2 (R Foundation for Statistical Computing, Vienna, Austria). Manual analysis was performed as a separate process by a single experienced clinician reading through the dataset once and classifying each row of data based on the interpretation of the free-text notes. Validation was performed by comparison of the computer-assisted method with manual analysis, which was used as the gold standard. Sensitivity, specificity, negative predictive values (NPVs), positive predictive values (PPVs), and F values of the computer-assisted process were calculated by comparing them with the manual classification. RESULTS: Lowest sensitivity, specificity, PPVs, NPVs, and F values were 99.82% (1128/1130), 99.88% (16410/16429), 94.6% (223/239), 100.00% (16410/16412), and 99.0% (100×2×0.983×0.998/[0.983+0.998]), respectively. The computer-assisted process required few seconds to run, although an estimated 30 h were required for dictionary creation. Manual classification required approximately 80 man-hours. CONCLUSIONS: The critical step in this work is the creation of accurate and inclusive dictionaries to ensure that no potential cases are missed. It is significantly easier to remove false positive terms from a SS/WS selected subset of a large database than search that original database for potential false negatives. The benefits of using this method are proportional to the size of the dataset to be analyzed. JMIR Publications 2017-06-29 /pmc/articles/PMC5509949/ /pubmed/28663163 http://dx.doi.org/10.2196/medinform.7123 Text en ©Marco Duz, John F Marshall, Tim Parkin. Originally published in JMIR Medical Informatics (http://medinform.jmir.org), 29.06.2017. https://creativecommons.org/licenses/by/4.0/ This is an open-access article distributed under the terms of the Creative Commons Attribution License (https://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work, first published in JMIR Medical Informatics, is properly cited. The complete bibliographic information, a link to the original publication on http://medinform.jmir.org/, as well as this copyright and license information must be included.
spellingShingle Original Paper
Duz, Marco
Marshall, John F
Parkin, Tim
Validation of an Improved Computer-Assisted Technique for Mining Free-Text Electronic Medical Records
title Validation of an Improved Computer-Assisted Technique for Mining Free-Text Electronic Medical Records
title_full Validation of an Improved Computer-Assisted Technique for Mining Free-Text Electronic Medical Records
title_fullStr Validation of an Improved Computer-Assisted Technique for Mining Free-Text Electronic Medical Records
title_full_unstemmed Validation of an Improved Computer-Assisted Technique for Mining Free-Text Electronic Medical Records
title_short Validation of an Improved Computer-Assisted Technique for Mining Free-Text Electronic Medical Records
title_sort validation of an improved computer-assisted technique for mining free-text electronic medical records
topic Original Paper
url https://www.ncbi.nlm.nih.gov/pmc/articles/PMC5509949/
https://www.ncbi.nlm.nih.gov/pubmed/28663163
http://dx.doi.org/10.2196/medinform.7123
work_keys_str_mv AT duzmarco validationofanimprovedcomputerassistedtechniqueforminingfreetextelectronicmedicalrecords
AT marshalljohnf validationofanimprovedcomputerassistedtechniqueforminingfreetextelectronicmedicalrecords
AT parkintim validationofanimprovedcomputerassistedtechniqueforminingfreetextelectronicmedicalrecords