
Using Change Point Detection for Monitoring the Quality of Aggregate Data

INTRODUCTION: Data consisting of counts or indicators aggregated from multiple sources pose particular problems for data quality monitoring when the users of the aggregate data are blind to the individual sources. This arises when agencies wish to share data but for privacy or contractual reasons ar...


Bibliographic Details
Main authors: Painter, Ian, Eaton, Julie, Lober, Bill
Format: Online Article Text
Language: English
Published: University of Illinois at Chicago Library 2013
Subjects:
Online access: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC3692838/
_version_ 1782274667771330560
author Painter, Ian
Eaton, Julie
Lober, Bill
author_facet Painter, Ian
Eaton, Julie
Lober, Bill
author_sort Painter, Ian
collection PubMed
description INTRODUCTION: Data consisting of counts or indicators aggregated from multiple sources pose particular problems for data quality monitoring when the users of the aggregate data are blind to the individual sources. This arises when agencies wish to share data but, for privacy or contractual reasons, are only able to share data at an aggregate level. If the aggregators of the data are unable to guarantee the quality of either the sources of the data or the aggregation process, then the quality of the aggregate data may be compromised. This situation arose in the Distribute surveillance system (1). Distribute was a national emergency department syndromic surveillance project for influenza-like illness (ILI), developed by the International Society for Disease Surveillance, that integrated data from existing state and local public health department surveillance systems and operated from 2006 until mid-2012. Distribute was designed to work solely with aggregated data: sites provided data aggregated from sources within their jurisdiction, and detailed information on the un-aggregated ‘raw’ data was unavailable. Previous work (2) on Distribute data quality identified several issues caused in part by the nature of the system: transient problems due to inconsistent uploads, problems associated with transient or long-term changes in the source makeup of the reporting sites, and a lack of data timeliness due to individual site data accruing over time rather than in batch. Data timeliness was addressed using prediction intervals to assess the reliability of the partially accrued data (3). The types of data quality issues present in the Distribute data are likely to appear to some extent in any aggregate data surveillance system where direct control over the quality of the source data is not possible. In this work we present methods for detecting both transient and long-term changes in the source data makeup.
METHODS: We examined methods to detect transient changes in data sources, which manifest as classical outliers. We found that traditional statistical process control methods did not work well for detecting transient issues because of discontinuities caused by long-term changes in the source makeup. As both transient and long-term changes in source makeup manifest as step changes, we examined the performance of change point detection methods for monitoring these data. These methods have previously been used for detecting changes in disease trends in data aggregated from Distribute (4). Following Kass-Hout (4), we used the Bayesian change point estimation procedure of Barry (5) as implemented in the R package bcp (6). We examined both offline and online detection using time series held at a constant lag.
RESULTS: We found that transient problems could be detected offline as neighboring change points with high posterior probability. When multiple outliers exist close together, detection can be improved by iteratively removing flagged data points and re-running the change point detection on the reduced data. Following the removal of outliers, the remaining change points indicated long-term changes. To enable real-time monitoring for data quality problems, we modified this offline detection process to also flag individual change points (rather than pairs of change points) detected in the most recent 5 days. [Figure: see text]
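The flagging rules described in the RESULTS can be illustrated with a minimal Python sketch. It is not the authors' implementation and does not run the Bayesian change point procedure itself: it assumes the posterior probabilities have already been computed (for example by bcp) and only applies the two rules from the abstract. The function names, the 0.9 threshold, the `max_width` parameter, and the convention that `probs[i]` is the posterior probability of a change between observation `i` and observation `i + 1` are all assumptions made for illustration. Only the 5-day window comes from the abstract.

```python
from typing import List, Tuple


def high_prob_change_points(probs: List[float], threshold: float) -> List[int]:
    """Positions i where a change between observations i and i+1 is likely."""
    return [i for i, p in enumerate(probs) if p >= threshold]


def flag_transient_runs(
    probs: List[float], threshold: float = 0.9, max_width: int = 2
) -> List[Tuple[int, int]]:
    """Offline rule: two neighboring change points bracket a transient problem.

    Returns (first, last) observation indices of each suspected outlier run.
    `threshold` and `max_width` are hypothetical tuning values.
    """
    cps = high_prob_change_points(probs, threshold)
    runs = []
    for a, b in zip(cps, cps[1:]):
        if b - a <= max_width:
            # Observations a+1 .. b lie between the two step changes.
            runs.append((a + 1, b))
    return runs


def flag_recent_change_points(
    probs: List[float], threshold: float = 0.9, window: int = 5
) -> List[int]:
    """Online rule: flag any single change point in the last `window` days."""
    start = max(0, len(probs) - window)
    return [i for i in high_prob_change_points(probs, threshold) if i >= start]


# Made-up posterior probabilities for a 12-day series (illustration only).
example_probs = [0.01, 0.02, 0.01, 0.95, 0.97, 0.03,
                 0.02, 0.01, 0.02, 0.01, 0.92, 0.04]

print(flag_transient_runs(example_probs))        # [(4, 4)]
print(flag_recent_change_points(example_probs))  # [10]
```

In this made-up series, the change points at positions 3 and 4 bracket a single outlying observation (day 4), while the lone change point at position 10 falls inside the 5-day window and would be flagged for review before a second change point could confirm whether it is transient or long-term.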
format Online
Article
Text
id pubmed-3692838
institution National Center for Biotechnology Information
language English
publishDate 2013
publisher University of Illinois at Chicago Library
record_format MEDLINE/PubMed
spelling pubmed-3692838 2013-06-26 Using Change Point Detection for Monitoring the Quality of Aggregate Data Painter, Ian Eaton, Julie Lober, Bill Online J Public Health Inform ISDS 2012 Conference Abstracts INTRODUCTION: Data consisting of counts or indicators aggregated from multiple sources pose particular problems for data quality monitoring when the users of the aggregate data are blind to the individual sources. This arises when agencies wish to share data but, for privacy or contractual reasons, are only able to share data at an aggregate level. If the aggregators of the data are unable to guarantee the quality of either the sources of the data or the aggregation process, then the quality of the aggregate data may be compromised. This situation arose in the Distribute surveillance system (1). Distribute was a national emergency department syndromic surveillance project for influenza-like illness (ILI), developed by the International Society for Disease Surveillance, that integrated data from existing state and local public health department surveillance systems and operated from 2006 until mid-2012. Distribute was designed to work solely with aggregated data: sites provided data aggregated from sources within their jurisdiction, and detailed information on the un-aggregated ‘raw’ data was unavailable. Previous work (2) on Distribute data quality identified several issues caused in part by the nature of the system: transient problems due to inconsistent uploads, problems associated with transient or long-term changes in the source makeup of the reporting sites, and a lack of data timeliness due to individual site data accruing over time rather than in batch. Data timeliness was addressed using prediction intervals to assess the reliability of the partially accrued data (3).
The types of data quality issues present in the Distribute data are likely to appear to some extent in any aggregate data surveillance system where direct control over the quality of the source data is not possible. In this work we present methods for detecting both transient and long-term changes in the source data makeup. METHODS: We examined methods to detect transient changes in data sources, which manifest as classical outliers. We found that traditional statistical process control methods did not work well for detecting transient issues because of discontinuities caused by long-term changes in the source makeup. As both transient and long-term changes in source makeup manifest as step changes, we examined the performance of change point detection methods for monitoring these data. These methods have previously been used for detecting changes in disease trends in data aggregated from Distribute (4). Following Kass-Hout (4), we used the Bayesian change point estimation procedure of Barry (5) as implemented in the R package bcp (6). We examined both offline and online detection using time series held at a constant lag. RESULTS: We found that transient problems could be detected offline as neighboring change points with high posterior probability. When multiple outliers exist close together, detection can be improved by iteratively removing flagged data points and re-running the change point detection on the reduced data. Following the removal of outliers, the remaining change points indicated long-term changes. To enable real-time monitoring for data quality problems, we modified this offline detection process to also flag individual change points (rather than pairs of change points) detected in the most recent 5 days.
[Figure: see text] University of Illinois at Chicago Library 2013-04-04 /pmc/articles/PMC3692838/ Text en ©2013 the author(s) http://www.uic.edu/htbin/cgiwrap/bin/ojs/index.php/ojphi/about/submissions#copyrightNotice This is an Open Access article. Authors own copyright of their articles appearing in the Online Journal of Public Health Informatics. Readers may copy articles without permission of the copyright owner(s), as long as the author and OJPHI are acknowledged in the copy and the copy is used for educational, not-for-profit purposes.
spellingShingle ISDS 2012 Conference Abstracts
Painter, Ian
Eaton, Julie
Lober, Bill
Using Change Point Detection for Monitoring the Quality of Aggregate Data
title Using Change Point Detection for Monitoring the Quality of Aggregate Data
title_full Using Change Point Detection for Monitoring the Quality of Aggregate Data
title_fullStr Using Change Point Detection for Monitoring the Quality of Aggregate Data
title_full_unstemmed Using Change Point Detection for Monitoring the Quality of Aggregate Data
title_short Using Change Point Detection for Monitoring the Quality of Aggregate Data
title_sort using change point detection for monitoring the quality of aggregate data
topic ISDS 2012 Conference Abstracts
url https://www.ncbi.nlm.nih.gov/pmc/articles/PMC3692838/
work_keys_str_mv AT painterian usingchangepointdetectionformonitoringthequalityofaggregatedata
AT eatonjulie usingchangepointdetectionformonitoringthequalityofaggregatedata
AT loberbill usingchangepointdetectionformonitoringthequalityofaggregatedata