
Natural language processing and machine learning to enable automatic extraction and classification of patients’ smoking status from electronic medical records

BACKGROUND: The electronic medical record (EMR) offers unique possibilities for clinical research, but some important patient attributes are not readily available due to its unstructured properties. We applied text mining using machine learning to enable automatic classification of unstructured info...


Bibliographic Details
Main Authors:	Caccamisi, Andrea; Jørgensen, Leif; Dalianis, Hercules; Rosenlund, Mats
Format:	Online Article Text
Language:	English
Published:	Taylor & Francis 2020
Subjects:	Original Articles
Online Access:	https://www.ncbi.nlm.nih.gov/pmc/articles/PMC7594865/ https://www.ncbi.nlm.nih.gov/pubmed/32696698 http://dx.doi.org/10.1080/03009734.2020.1792010

_version_	1783601716551221248
author	Caccamisi, Andrea; Jørgensen, Leif; Dalianis, Hercules; Rosenlund, Mats
author_sort	Caccamisi, Andrea
collection	PubMed
description	BACKGROUND: The electronic medical record (EMR) offers unique possibilities for clinical research, but some important patient attributes are not readily available because much of its content is unstructured. We applied text mining using machine learning to enable automatic classification of unstructured information on smoking status from Swedish EMR data. METHODS: Data on patients’ smoking status from EMRs were used to develop 32 different predictive models that were trained using Weka, varying sentence frequency, classifier type, tokenization, and attribute selection in a database of 85,000 classified sentences. The models were evaluated using F-score and accuracy on out-of-sample test data comprising 8,500 sentences. An error weight matrix was used to select the best model by assigning a weight to each type of misclassification and applying it to the models’ confusion matrices. The best-performing model was then compared to a rule-based method. RESULTS: The best-performing model was based on the Support Vector Machine (SVM) Sequential Minimal Optimization (SMO) classifier using a combination of unigrams and bigrams as tokens. Sentence frequency and attribute selection did not improve model performance. SMO achieved 98.14% accuracy and 0.981 F-score versus 79.32% and 0.756 for the rule-based model. CONCLUSION: A model using machine-learning algorithms to automatically classify patients’ smoking status was successfully developed. Such algorithms may enable automatic assessment of smoking status and other unstructured data directly from EMRs without manual classification of complete case notes.
format	Online Article Text
id	pubmed-7594865
institution	National Center for Biotechnology Information
language	English
publishDate	2020
publisher	Taylor & Francis
record_format	MEDLINE/PubMed
spelling	pubmed-7594865 2020-11-10 Ups J Med Sci Original Articles Taylor & Francis 2020-07-22 /pmc/articles/PMC7594865/ /pubmed/32696698 http://dx.doi.org/10.1080/03009734.2020.1792010 Text en © 2020 The Author(s). Published by Informa UK Limited, trading as Taylor & Francis Group. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.
title	Natural language processing and machine learning to enable automatic extraction and classification of patients’ smoking status from electronic medical records
topic	Original Articles
url	https://www.ncbi.nlm.nih.gov/pmc/articles/PMC7594865/ https://www.ncbi.nlm.nih.gov/pubmed/32696698 http://dx.doi.org/10.1080/03009734.2020.1792010
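
The model-selection step described in the abstract (weighting each type of misclassification and applying the weights to each model's confusion matrix) can be illustrated with a minimal Python sketch. This is not the authors' implementation, which used Weka. The class labels, weight values, and confusion matrices below are made up for illustration, and taking the lowest weighted error as "best" is an assumption about how the weighted totals were compared.

```python
# Hypothetical class labels; the record does not list the actual categories.
LABELS = ["current", "former", "never", "unknown"]

# Hypothetical error weights: WEIGHTS[i][j] is the cost of predicting class j
# when the true class is i. Correct predictions (the diagonal) cost nothing.
WEIGHTS = [
    [0, 1, 3, 2],
    [1, 0, 2, 2],
    [3, 2, 0, 1],
    [2, 2, 1, 0],
]

# Made-up confusion matrices (rows = true class, columns = predicted class).
MODELS = {
    "model_a": [[90, 5, 5, 0], [4, 92, 2, 2], [1, 3, 95, 1], [0, 2, 3, 95]],
    "model_b": [[92, 0, 8, 0], [0, 94, 6, 0], [8, 0, 92, 0], [0, 0, 0, 100]],
}


def weighted_error(confusion, weights):
    """Sum every confusion count multiplied by the weight of that error type."""
    return sum(
        count * weight
        for count_row, weight_row in zip(confusion, weights)
        for count, weight in zip(count_row, weight_row)
    )


def accuracy(confusion):
    """Fraction of all sentences that fall on the diagonal."""
    total = sum(sum(row) for row in confusion)
    correct = sum(confusion[i][i] for i in range(len(confusion)))
    return correct / total


def macro_f_score(confusion):
    """Unweighted mean of the per-class F1 scores."""
    n = len(confusion)
    scores = []
    for k in range(n):
        tp = confusion[k][k]
        fn = sum(confusion[k]) - tp
        fp = sum(confusion[i][k] for i in range(n)) - tp
        denom = 2 * tp + fp + fn
        scores.append(2 * tp / denom if denom else 0.0)
    return sum(scores) / n


def select_best(models, weights):
    """Return the name of the model with the lowest weighted error."""
    return min(models, key=lambda name: weighted_error(models[name], weights))


if __name__ == "__main__":
    for name, cm in MODELS.items():
        print(name, accuracy(cm), macro_f_score(cm), weighted_error(cm, WEIGHTS))
    print("selected:", select_best(MODELS, WEIGHTS))
```

In this made-up example, model_b has the higher accuracy, but its errors fall on the most heavily weighted confusions, so the weighted criterion prefers model_a. That is the kind of distinction a weight matrix can capture and plain accuracy cannot.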
