
The utility of multivariate outlier detection techniques for data quality evaluation in large studies: an application within the ONDRI project


Bibliographic Details
Main authors: Sunderland, Kelly M., Beaton, Derek, Fraser, Julia, Kwan, Donna, McLaughlin, Paula M., Montero-Odasso, Manuel, Peltsch, Alicia J., Pieruccini-Faria, Frederico, Sahlas, Demetrios J., Swartz, Richard H., Strother, Stephen C., Binns, Malcolm A.
Format: Online Article Text
Language: English
Published: BioMed Central 2019
Online access: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC6521365/
https://www.ncbi.nlm.nih.gov/pubmed/31092212
http://dx.doi.org/10.1186/s12874-019-0737-5
collection PubMed
description
BACKGROUND: Large and complex studies are now routine, and quality assurance and quality control (QC) procedures ensure reliable results and conclusions. Standard procedures may comprise manual verification and double entry, but these labour-intensive methods often leave errors undetected. Outlier detection uses a data-driven approach to identify patterns exhibited by the majority of the data and highlights data points that deviate from these patterns. Univariate methods consider each variable independently, so observations that appear odd only when two or more variables are considered simultaneously remain undetected. We propose a data quality evaluation process that emphasizes the use of multivariate outlier detection for identifying errors, and show that univariate approaches alone are insufficient. Further, we establish an iterative process that uses multiple multivariate approaches, communication between teams, and visualization for other large-scale projects to follow.
METHODS: We illustrate this process with preliminary neuropsychology and gait data for the vascular cognitive impairment cohort from the Ontario Neurodegenerative Disease Research Initiative, a multi-cohort observational study that aims to characterize biomarkers within and between five neurodegenerative diseases. Each dataset was evaluated four times: with and without covariate adjustment, using two validated multivariate methods, Minimum Covariance Determinant (MCD) and Candès’ Robust Principal Component Analysis (RPCA). Results were assessed in relation to two univariate methods. Outlying participants identified by multiple multivariate analyses were compiled and communicated to the data teams for verification.
RESULTS: Of 161 and 148 participants in the neuropsychology and gait data, 44 and 43 were flagged by one or both multivariate methods and errors were identified for 8 and 5 participants, respectively. MCD identified all participants with errors, while RPCA identified 6/8 and 3/5 for the neuropsychology and gait data, respectively. Both outperformed univariate approaches. Adjusting for covariates had a minor effect on the participants identified as outliers, though did affect error detection.
CONCLUSIONS: Manual QC procedures are insufficient for large studies as many errors remain undetected. In these data, the MCD outperforms the RPCA for identifying errors, and both are more successful than univariate approaches. Therefore, data-driven multivariate outlier techniques are essential tools for QC as data become more complex.
ELECTRONIC SUPPLEMENTARY MATERIAL: The online version of this article (10.1186/s12874-019-0737-5) contains supplementary material, which is available to authorized users.
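The abstract's point that a record can look unremarkable on each variable yet be odd when variables are considered together can be illustrated with a small sketch. This is not the authors' pipeline: it uses the classical (non-robust) Mahalanobis distance on two synthetic variables rather than MCD or RPCA, and the data, the function names `z_scores` and `mahalanobis_2d`, the 3-SD univariate rule, and the 97.5% chi-square cutoff are all assumptions chosen for illustration.

```python
import math
import statistics

def z_scores(values):
    """Univariate check: standardize one variable on its own."""
    mean = statistics.fmean(values)
    sd = statistics.stdev(values)
    return [(v - mean) / sd for v in values]

def mahalanobis_2d(xs, ys):
    """Classical Mahalanobis distance of each (x, y) pair from the sample mean.

    Uses the ordinary sample covariance, NOT a robust MCD estimate.
    """
    n = len(xs)
    mx, my = statistics.fmean(xs), statistics.fmean(ys)
    sxx = sum((x - mx) ** 2 for x in xs) / (n - 1)
    syy = sum((y - my) ** 2 for y in ys) / (n - 1)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / (n - 1)
    det = sxx * syy - sxy ** 2
    dists = []
    for x, y in zip(xs, ys):
        dx, dy = x - mx, y - my
        # Quadratic form with the inverse of the 2x2 covariance matrix.
        d2 = (syy * dx * dx - 2 * sxy * dx * dy + sxx * dy * dy) / det
        dists.append(math.sqrt(d2))
    return dists

# Synthetic data (hypothetical): two tightly correlated measures.
idx = range(-20, 21)
xs = [i / 10 for i in idx]
ys = [i / 10 + (0.1 if i % 2 == 0 else -0.1) for i in idx]

# One record that breaks the joint pattern but sits inside both marginal ranges.
xs.append(1.5)
ys.append(-1.5)

zx, zy = z_scores(xs), z_scores(ys)
d = mahalanobis_2d(xs, ys)

# For 2 variables the chi-square quantile has a closed form: -2 * ln(1 - p).
cutoff = math.sqrt(-2 * math.log(1 - 0.975))

univariate_flags = [i for i in range(len(xs)) if abs(zx[i]) > 3 or abs(zy[i]) > 3]
flagged = [i for i in range(len(xs)) if d[i] > cutoff]

print("univariate flags:", univariate_flags)
print("multivariate flags:", flagged)
```

In this toy setting the univariate rule flags nothing while the distance-based rule singles out the appended record. A robust estimator such as MCD serves the same purpose but derives the centre and covariance from a clean subset of the data, so the outliers themselves do not inflate the covariance and mask one another.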
format Online
Article
Text
id pubmed-6521365
institution National Center for Biotechnology Information
language English
publishDate 2019
publisher BioMed Central
record_format MEDLINE/PubMed
spelling pubmed-6521365 2019-05-23 BMC Med Res Methodol Research Article BioMed Central 2019-05-15 /pmc/articles/PMC6521365/ /pubmed/31092212 http://dx.doi.org/10.1186/s12874-019-0737-5 Text en © The Author(s). 2019 Open Access. This article is distributed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated.
topic Research Article