Is it time to stop sweeping data cleaning under the carpet? A novel algorithm for outlier management in growth data

All data are prone to error and require data cleaning prior to analysis. An important example is longitudinal growth data, for which there are no universally agreed standard methods for identifying and removing implausible values and many existing methods have limitations that restrict their usage a...

Bibliographic Details
Main authors:	Woolley, Charlotte S. C.; Handel, Ian G.; Bronsvoort, B. Mark; Schoenebeck, Jeffrey J.; Clements, Dylan N.
Format:	Online Article Text
Language:	English
Published:	Public Library of Science 2020
Subjects:	Research Article
Online access:	https://www.ncbi.nlm.nih.gov/pmc/articles/PMC6980495/ https://www.ncbi.nlm.nih.gov/pubmed/31978151 http://dx.doi.org/10.1371/journal.pone.0228154

author	Woolley, Charlotte S. C.; Handel, Ian G.; Bronsvoort, B. Mark; Schoenebeck, Jeffrey J.; Clements, Dylan N.
collection	PubMed
description	All data are prone to error and require data cleaning prior to analysis. An important example is longitudinal growth data, for which there are no universally agreed standard methods for identifying and removing implausible values, and many existing methods have limitations that restrict their usage across different domains. A decision-making algorithm that modified or deleted growth measurements based on a combination of pre-defined cut-offs and logic rules was designed. Five data cleaning methods for growth were tested with and without the addition of the algorithm and applied to five different longitudinal growth datasets: four uncleaned canine weight or height datasets and one pre-cleaned human weight dataset with randomly simulated errors. Prior to the addition of the algorithm, data cleaning based on non-linear mixed effects models was the most effective in all datasets and had on average a minimum of 26.00% higher sensitivity and 0.12% higher specificity than other methods. Data cleaning methods using the algorithm had improved data preservation and were capable of correcting simulated errors according to the gold standard, that is, returning a value to its original state prior to error simulation. The algorithm improved the performance of all data cleaning methods and increased the average sensitivity and specificity of the non-linear mixed effects model method by 7.68% and 0.42%, respectively. Using non-linear mixed effects models combined with the algorithm to clean data allows individual growth trajectories to vary from the population by using repeated longitudinal measurements, identifies consecutive errors or those within the first data entry, avoids the requirement for a minimum number of data entries, preserves data where possible by correcting errors rather than deleting them, and removes duplications intelligently. This algorithm is broadly applicable to cleaning anthropometric data in different mammalian species and could be adapted for use in a range of other domains.
format	Online Article Text
id	pubmed-6980495
institution	National Center for Biotechnology Information
language	English
publishDate	2020
publisher	Public Library of Science
record_format	MEDLINE/PubMed
spelling	pubmed-6980495 2020-02-04 PLoS One Research Article Public Library of Science 2020-01-24 /pmc/articles/PMC6980495/ /pubmed/31978151 http://dx.doi.org/10.1371/journal.pone.0228154 Text en © 2020 Woolley et al http://creativecommons.org/licenses/by/4.0/ This is an open access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.
title	Is it time to stop sweeping data cleaning under the carpet? A novel algorithm for outlier management in growth data
topic	Research Article
