
Extracting Diagnoses and Investigation Results from Unstructured Text in Electronic Health Records by Semi-Supervised Machine Learning

BACKGROUND: Electronic health records are invaluable for medical research, but much of the information is recorded as unstructured free text which is time-consuming to review manually. AIM: To develop an algorithm to identify relevant free texts automatically based on labelled examples. METHODS: We...


Bibliographic Details
Main authors: Wang, Zhuoran, Shah, Anoop D., Tate, A. Rosemary, Denaxas, Spiros, Shawe-Taylor, John, Hemingway, Harry
Format: Online Article Text
Language: English
Published: Public Library of Science 2012
Subjects:
Online access: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC3261909/
https://www.ncbi.nlm.nih.gov/pubmed/22276193
http://dx.doi.org/10.1371/journal.pone.0030412
author Wang, Zhuoran
Shah, Anoop D.
Tate, A. Rosemary
Denaxas, Spiros
Shawe-Taylor, John
Hemingway, Harry
author_sort Wang, Zhuoran
collection PubMed
description BACKGROUND: Electronic health records are invaluable for medical research, but much of the information is recorded as unstructured free text which is time-consuming to review manually. AIM: To develop an algorithm to identify relevant free texts automatically based on labelled examples. METHODS: We developed a novel machine learning algorithm, the ‘Semi-supervised Set Covering Machine’ (S3CM), and tested its ability to detect the presence of coronary angiogram results and ovarian cancer diagnoses in free text in the General Practice Research Database. For training the algorithm, we used texts classified as positive and negative according to their associated Read diagnostic codes, rather than by manual annotation. We evaluated the precision (positive predictive value) and recall (sensitivity) of S3CM in classifying unlabelled texts against the gold standard of manual review. We compared the performance of S3CM with the Transductive Support Vector Machine (TSVM), the original fully-supervised Set Covering Machine (SCM) and our ‘Freetext Matching Algorithm’ natural language processor. RESULTS: Only 60% of texts with Read codes for angiogram actually contained angiogram results. However, the S3CM algorithm achieved 87% recall with 64% precision on detecting coronary angiogram results, outperforming the fully-supervised SCM (recall 78%, precision 60%) and TSVM (recall 2%, precision 3%). For ovarian cancer diagnoses, S3CM had higher recall than the other algorithms tested (86%). The Freetext Matching Algorithm had better precision than S3CM (85% versus 74%) but lower recall (62%). CONCLUSIONS: Our novel S3CM machine learning algorithm effectively detected free texts in primary care records associated with angiogram results and ovarian cancer diagnoses, after training on pre-classified test sets. It should be easy to adapt to other disease areas as it does not rely on linguistic rules, but needs further testing in other electronic health record datasets.
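The abstract reports precision (positive predictive value) and recall (sensitivity) measured against manual review. As a minimal sketch of how those two figures are conventionally computed, the Python below compares a classifier's per-text decisions with gold-standard labels. The function name `precision_recall` and the toy labels are hypothetical illustrations and are not taken from the study or its data.

```python
def precision_recall(predicted, gold):
    """Return (precision, recall) for two equal-length sequences of booleans.

    predicted[i] is True if the classifier flagged text i as relevant;
    gold[i] is True if manual review judged text i relevant.
    """
    if len(predicted) != len(gold):
        raise ValueError("predicted and gold must have the same length")

    pairs = list(zip(predicted, gold))
    true_pos = sum(1 for p, g in pairs if p and g)
    false_pos = sum(1 for p, g in pairs if p and not g)
    false_neg = sum(1 for p, g in pairs if not p and g)

    # Precision: fraction of flagged texts that are truly relevant.
    precision = true_pos / (true_pos + false_pos) if (true_pos + false_pos) else 0.0
    # Recall: fraction of truly relevant texts that were flagged.
    recall = true_pos / (true_pos + false_neg) if (true_pos + false_neg) else 0.0
    return precision, recall


# Hypothetical toy labels for six texts (not data from the paper).
predicted = [True, True, True, False, False, True]
gold = [True, False, True, True, False, True]

precision, recall = precision_recall(predicted, gold)
print(f"precision={precision:.2f} recall={recall:.2f}")
```

With these toy labels there are three true positives, one false positive and one false negative, so both measures come out at 0.75.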
format Online
Article
Text
id pubmed-3261909
institution National Center for Biotechnology Information
language English
publishDate 2012
publisher Public Library of Science
record_format MEDLINE/PubMed
spelling pubmed-3261909 2012-01-24 PLoS One Research Article Public Library of Science 2012-01-19 /pmc/articles/PMC3261909/ /pubmed/22276193 http://dx.doi.org/10.1371/journal.pone.0030412 Text en Wang et al. http://creativecommons.org/licenses/by/4.0/ This is an open-access article distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are properly credited.
title Extracting Diagnoses and Investigation Results from Unstructured Text in Electronic Health Records by Semi-Supervised Machine Learning
topic Research Article