
Improving sensitivity of machine learning methods for automated case identification from free-text electronic medical records

BACKGROUND: Distinguishing cases from non-cases in free-text electronic medical records is an important initial step in observational epidemiological studies, but manual record validation is time-consuming and cumbersome. We compared different approaches to develop an automatic case identification s...


Bibliographic Details
Main authors: Afzal, Zubair; Schuemie, Martijn J; van Blijderveen, Jan C; Sen, Elif F; Sturkenboom, Miriam CJM; Kors, Jan A
Format: Online Article Text
Language: English
Published: BioMed Central 2013
Subjects:
Online access: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC3602667/
https://www.ncbi.nlm.nih.gov/pubmed/23452306
http://dx.doi.org/10.1186/1472-6947-13-30
_version_ 1782263591656751104
author Afzal, Zubair
Schuemie, Martijn J
van Blijderveen, Jan C
Sen, Elif F
Sturkenboom, Miriam CJM
Kors, Jan A
author_sort Afzal, Zubair
collection PubMed
description BACKGROUND: Distinguishing cases from non-cases in free-text electronic medical records is an important initial step in observational epidemiological studies, but manual record validation is time-consuming and cumbersome. We compared different approaches to develop an automatic case identification system with high sensitivity to assist manual annotators. METHODS: We used four different machine-learning algorithms to build case identification systems for two data sets, one comprising hepatobiliary disease patients, the other acute renal failure patients. To improve the sensitivity of the systems, we varied the imbalance ratio between positive cases and negative cases using under- and over-sampling techniques, and applied cost-sensitive learning with various misclassification costs. RESULTS: For the hepatobiliary data set, we obtained a high sensitivity of 0.95 (on a par with manual annotators, as compared to 0.91 for a baseline classifier) with specificity 0.56. For the acute renal failure data set, sensitivity increased from 0.69 to 0.89, with specificity 0.59. Performance differences between the various machine-learning algorithms were not large. Classifiers performed best when trained on data sets with imbalance ratio below 10. CONCLUSIONS: We were able to achieve high sensitivity with moderate specificity for automatic case identification on two data sets of electronic medical records. Such a high-sensitive case identification system can be used as a pre-filter to significantly reduce the burden of manual record validation.
format Online
Article
Text
id pubmed-3602667
institution National Center for Biotechnology Information
language English
publishDate 2013
publisher BioMed Central
record_format MEDLINE/PubMed
spelling pubmed-3602667 2013-03-21 Improving sensitivity of machine learning methods for automated case identification from free-text electronic medical records Afzal, Zubair Schuemie, Martijn J van Blijderveen, Jan C Sen, Elif F Sturkenboom, Miriam CJM Kors, Jan A BMC Med Inform Decis Mak Research Article
BioMed Central 2013-03-02 /pmc/articles/PMC3602667/ /pubmed/23452306 http://dx.doi.org/10.1186/1472-6947-13-30 Text en Copyright ©2013 Afzal et al.; licensee BioMed Central Ltd. http://creativecommons.org/licenses/by/2.0 This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.
title Improving sensitivity of machine learning methods for automated case identification from free-text electronic medical records
topic Research Article