A Supervised Learning Process to Validate Online Disease Reports for Use in Predictive Models

Bibliographic Details
Main Authors: Patching, Helena M.M., Hudson, Laurence M., Cooke, Warrick, Garcia, Andres J., Hay, Simon I., Roberts, Mark, Moyes, Catherine L.
Format: Online Article Text
Language: English
Published: Mary Ann Liebert, Inc. 2015
Subjects:
Online Access: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4722556/
https://www.ncbi.nlm.nih.gov/pubmed/26858916
http://dx.doi.org/10.1089/big.2015.0019
author Patching, Helena M.M.
Hudson, Laurence M.
Cooke, Warrick
Garcia, Andres J.
Hay, Simon I.
Roberts, Mark
Moyes, Catherine L.
collection PubMed
description Pathogen distribution models that predict spatial variation in disease occurrence require data from a large number of geographic locations to generate disease risk maps. Traditionally, this process has used data from public health reporting systems; however, using online reports of new infections could speed up the process dramatically. Data from both public health systems and online sources must be validated before they can be used, but no mechanisms exist to validate data from online media reports. We have developed a supervised learning process to validate geolocated disease outbreak data in a timely manner. The process uses three input features: the data source and two metrics derived from the location of each disease occurrence. These two location metrics are the probability of disease occurrence at that location, based on environmental and socioeconomic factors, and the distance of the location within or outside the currently known disease extent. The process also uses validation scores, generated by disease experts who review a subset of the data, to build a training data set. The aim of the supervised learning process is to generate validation scores that can be used as weights in the pathogen distribution model. After analyzing the three input features and testing the performance of alternative processes, we selected a cascade of ensembles comprising logistic regressors. Parameter values for the training data subset size, number of predictors, and number of layers in the cascade were tested before the process was deployed. The final configuration was tested using data for two contrasting diseases (dengue and cholera), and 66%–79% of data points were assigned a validation score. The remaining data points are scored by the experts, and these expert scores both inform the training data set for the next set of predictors and go to the pathogen distribution model. The new supervised learning process has been implemented within our live site and is being used to validate the data that our system uses to produce updated predictive disease maps on a weekly basis.
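The process summarized in the description (ensembles of logistic regressors arranged in a cascade, with unscored points deferred to experts) can be illustrated with a minimal sketch. Everything specific below is an assumption made for illustration and is not the authors' implementation: the numeric encoding of the three features, the synthetic training data, the gradient-descent fitting, the agreement rule (`max_spread`) that decides whether a layer assigns a score, and all parameter values.

```python
import math
import random


def sigmoid(z):
    # Numerically safe logistic function.
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def predict(w, x):
    # w = [bias, w1, w2, ...]; x = feature vector.
    return sigmoid(w[0] + sum(wi * xi for wi, xi in zip(w[1:], x)))


def fit_logistic(rows, labels, epochs=200, lr=0.5):
    """Fit one logistic regressor by batch gradient descent."""
    w = [0.0] * (len(rows[0]) + 1)
    for _ in range(epochs):
        grad = [0.0] * len(w)
        for x, y in zip(rows, labels):
            err = predict(w, x) - y
            grad[0] += err
            for j, xj in enumerate(x):
                grad[j + 1] += err * xj
        w = [wi - lr * g / len(rows) for wi, g in zip(w, grad)]
    return w


def fit_ensemble(rows, labels, n_predictors, subset_size, rng):
    """Train each predictor on a random subset of the expert-scored data."""
    ensemble = []
    for _ in range(n_predictors):
        idx = [rng.randrange(len(rows)) for _ in range(subset_size)]
        ensemble.append(
            fit_logistic([rows[i] for i in idx], [labels[i] for i in idx])
        )
    return ensemble


def cascade_score(layers, x, max_spread=0.2):
    """Return a validation score in [0, 1], or None to defer to expert review.

    Assumed rule: a layer scores the point only if its predictors agree
    to within max_spread; otherwise the point falls through to the next layer.
    """
    for ensemble in layers:
        preds = [predict(w, x) for w in ensemble]
        if max(preds) - min(preds) <= max_spread:
            return sum(preds) / len(preds)
    return None


# Synthetic stand-in for expert-reviewed points (hypothetical encoding):
# x = (source_flag, environmental_suitability, distance_from_known_extent),
# where a negative distance means the point lies inside the known extent.
rng = random.Random(0)
rows, labels = [], []
for _ in range(200):
    source = rng.choice([0.0, 1.0])
    suitability = rng.random()
    distance = rng.uniform(-1.0, 1.0)
    p_valid = sigmoid(3.0 * suitability - 3.0 * distance + source - 1.5)
    rows.append((source, suitability, distance))
    labels.append(1 if rng.random() < p_valid else 0)

# Three layers of five predictors each, every predictor trained on 50 points.
layers = [
    fit_ensemble(rows, labels, n_predictors=5, subset_size=50, rng=rng)
    for _ in range(3)
]

new_point = (1.0, 0.9, -0.8)
score = cascade_score(layers, new_point)
print("validation score:", score if score is not None else "deferred to experts")
```

In this sketch a returned score would serve as the weight passed to the pathogen distribution model, while a `None` result marks a point for expert review, whose score would then be added to the training data for the next round of predictors. The subset size, number of predictors, and number of layers are exposed as arguments because the description names them as the parameters that were tuned before deployment.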
format Online
Article
Text
id pubmed-4722556
institution National Center for Biotechnology Information
language English
publishDate 2015
publisher Mary Ann Liebert, Inc.
record_format MEDLINE/PubMed
spelling pubmed-4722556 2016-02-08 A Supervised Learning Process to Validate Online Disease Reports for Use in Predictive Models Patching, Helena M.M. Hudson, Laurence M. Cooke, Warrick Garcia, Andres J. Hay, Simon I. Roberts, Mark Moyes, Catherine L. Big Data Original Articles Mary Ann Liebert, Inc. 2015-12-01 /pmc/articles/PMC4722556/ /pubmed/26858916 http://dx.doi.org/10.1089/big.2015.0019 Text en © Helena M.M. Patching et al. 2016; Published by Mary Ann Liebert, Inc. This Open Access article is distributed under the terms of the Creative Commons License (http://creativecommons.org/licenses/by/4.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly credited.
title A Supervised Learning Process to Validate Online Disease Reports for Use in Predictive Models
topic Original Articles