
Automatic classification of diseases from free-text death certificates for real-time surveillance

BACKGROUND: Death certificates provide an invaluable source for mortality statistics which can be used for surveillance and early warnings of increases in disease activity and to support the development and monitoring of prevention or response strategies. However, their value can be realised only if...


Bibliographic Details
Main Authors: Koopman, Bevan; Karimi, Sarvnaz; Nguyen, Anthony; McGuire, Rhydwyn; Muscatello, David; Kemp, Madonna; Truran, Donna; Zhang, Ming; Thackway, Sarah
Format: Online Article Text
Language: English
Published: BioMed Central 2015
Subjects:
Online Access: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4502908/
https://www.ncbi.nlm.nih.gov/pubmed/26174442
http://dx.doi.org/10.1186/s12911-015-0174-2
author Koopman, Bevan
Karimi, Sarvnaz
Nguyen, Anthony
McGuire, Rhydwyn
Muscatello, David
Kemp, Madonna
Truran, Donna
Zhang, Ming
Thackway, Sarah
author_sort Koopman, Bevan
collection PubMed
description BACKGROUND: Death certificates provide an invaluable source for mortality statistics which can be used for surveillance and early warnings of increases in disease activity and to support the development and monitoring of prevention or response strategies. However, their value can be realised only if accurate, quantitative data can be extracted from death certificates, an aim hampered by both the volume and variable nature of certificates written in natural language. This study aims to develop a set of machine learning and rule-based methods to automatically classify death certificates according to four high impact diseases of interest: diabetes, influenza, pneumonia and HIV.
METHODS: Two classification methods are presented: i) a machine learning approach, where detailed features (terms, term n-grams and SNOMED CT concepts) are extracted from death certificates and used to train a set of supervised machine learning models (Support Vector Machines); and ii) a set of keyword-matching rules. These methods were used to identify the presence of diabetes, influenza, pneumonia and HIV in a death certificate. An empirical evaluation was conducted using 340,142 death certificates, divided between training and test sets, covering deaths from 2000–2007 in New South Wales, Australia. Precision and recall (positive predictive value and sensitivity) were used as evaluation measures, with F-measure providing a single, overall measure of effectiveness. A detailed error analysis was performed on classification errors.
RESULTS: Classification of diabetes, influenza, pneumonia and HIV was highly accurate (F-measure 0.96). More fine-grained ICD-10 classification effectiveness was more variable but still high (F-measure 0.80). The error analysis revealed that word variations as well as certain word combinations adversely affected classification. In addition, anomalies in the ground truth likely led to an underestimation of the effectiveness.
CONCLUSIONS: The high accuracy and low cost of the classification methods allow for an effective means for automatic and real-time surveillance of diabetes, influenza, pneumonia and HIV deaths. In addition, the methods are generally applicable to other diseases of interest and to other sources of medical free-text besides death certificates.
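The keyword-matching approach and the evaluation measures named in the abstract can be illustrated with a minimal Python sketch. The rule patterns below are hypothetical examples, not the rules used in the study (this record does not list them), and the F-measure is assumed here to be the balanced F1 (harmonic mean of precision and recall).

```python
import re

# Hypothetical keyword rules for illustration only; the study's actual
# rule set is not given in this record.
KEYWORD_RULES = {
    "diabetes": [r"\bdiabet\w*", r"\bniddm\b", r"\biddm\b"],
    "influenza": [r"\binfluenza\b", r"\bflu\b"],
    "pneumonia": [r"\w*pneumonia\b"],
    "hiv": [r"\bhiv\b", r"\baids\b"],
}

# Compile each pattern once so classification of many certificates is cheap.
COMPILED = {
    disease: [re.compile(pattern) for pattern in patterns]
    for disease, patterns in KEYWORD_RULES.items()
}


def classify(certificate_text):
    """Return the set of diseases whose keyword rules match the free text."""
    lowered = certificate_text.lower()
    return {
        disease
        for disease, patterns in COMPILED.items()
        if any(pattern.search(lowered) for pattern in patterns)
    }


def precision_recall_f1(predicted, actual):
    """Compute precision, recall and balanced F-measure for one disease.

    `predicted` and `actual` are equal-length sequences of booleans,
    one entry per certificate.
    """
    tp = sum(1 for p, a in zip(predicted, actual) if p and a)
    fp = sum(1 for p, a in zip(predicted, actual) if p and not a)
    fn = sum(1 for p, a in zip(predicted, actual) if not p and a)
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1
```

For example, `classify("Bronchopneumonia; type 2 diabetes mellitus")` flags both pneumonia and diabetes, while "Haemophilus influenzae meningitis" matches nothing because the influenza pattern requires a word boundary after "influenza". Cases like these are the kind of word variation the abstract's error analysis mentions as a source of misclassification.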
format Online
Article
Text
id pubmed-4502908
institution National Center for Biotechnology Information
language English
publishDate 2015
publisher BioMed Central
record_format MEDLINE/PubMed
spelling pubmed-4502908 2015-07-16 BMC Med Inform Decis Mak Research Article BioMed Central 2015-07-15 /pmc/articles/PMC4502908/ /pubmed/26174442 http://dx.doi.org/10.1186/s12911-015-0174-2 Text en © Koopman et al. 2015 This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/4.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly credited. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated.
title Automatic classification of diseases from free-text death certificates for real-time surveillance
topic Research Article