Using decision trees to understand structure in missing data

OBJECTIVES: Demonstrate the application of decision trees—classification and regression trees (CARTs), and their cousins, boosted regression trees (BRTs)—to understand structure in missing data. SETTING: Data taken from employees at 3 different industrial sites in Australia. PARTICIPANTS: 7915 obser...
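The core idea in the abstract, treating "is this value missing?" as the response and letting a tree find which other variables predict it, can be sketched in a few lines. This is a minimal stand-in, not the authors' analysis: the study used the R packages 'rpart' and 'gbm', whereas the sketch below is a hand-written single-split (depth-1) CART-style search in plain Python. The data, the variable names `visits` and `noise`, and the rule that induces missingness are all hypothetical.

```python
def gini(flags):
    """Gini impurity of a list of 0/1 missingness flags."""
    if not flags:
        return 0.0
    p = sum(flags) / len(flags)
    return 2.0 * p * (1.0 - p)


def best_split(rows, flags, names):
    """Search every variable and threshold for the single split that
    minimises the size-weighted Gini impurity of the missingness flag.
    Returns (weighted_impurity, variable_name, threshold)."""
    best = None
    n = len(rows)
    for j, name in enumerate(names):
        values = sorted({row[j] for row in rows})
        # Candidate thresholds: midpoints between consecutive distinct values.
        for lo, hi in zip(values, values[1:]):
            t = (lo + hi) / 2.0
            left = [f for row, f in zip(rows, flags) if row[j] <= t]
            right = [f for row, f in zip(rows, flags) if row[j] > t]
            score = (len(left) * gini(left) + len(right) * gini(right)) / n
            if best is None or score < best[0]:
                best = (score, name, t)
    return best


# Hypothetical synthetic data, not the study's occupational health data.
names = ["visits", "noise"]
rows = [((i % 10) + 1, (i * 37) % 101) for i in range(200)]

# Structured missingness (an assumption for illustration): the outcome is
# recorded as missing whenever a worker has 3 or fewer visits.
missing = [1 if visits <= 3 else 0 for visits, _ in rows]

score, variable, threshold = best_split(rows, missing, names)
print(variable, threshold, score)  # prints: visits 3.5 0.0
```

The split recovers both the variable and the value that induced the missingness, which is the kind of result the simulation study reports for CART. A full tree would apply the same search recursively to each side of the split, and a boosted model would combine many such shallow trees.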

Bibliographic Details
Main authors: Tierney, Nicholas J; Harden, Fiona A; Harden, Maurice J; Mengersen, Kerrie L
Format: Online Article Text
Language: English
Published: BMJ Publishing Group 2015
Subjects: Research Methods
Online access: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4486966/
https://www.ncbi.nlm.nih.gov/pubmed/26124509
http://dx.doi.org/10.1136/bmjopen-2014-007450
author Tierney, Nicholas J
Harden, Fiona A
Harden, Maurice J
Mengersen, Kerrie L
collection PubMed
description OBJECTIVES: Demonstrate the application of decision trees—classification and regression trees (CARTs), and their cousins, boosted regression trees (BRTs)—to understand structure in missing data. SETTING: Data taken from employees at 3 different industrial sites in Australia. PARTICIPANTS: 7915 observations were included. MATERIALS AND METHODS: The approach was evaluated using an occupational health data set comprising results of questionnaires, medical tests and environmental monitoring. Statistical methods included standard statistical tests and the ‘rpart’ and ‘gbm’ packages for CART and BRT analyses, respectively, from the statistical software ‘R’. A simulation study was conducted to explore the capability of decision tree models in describing data with missingness artificially introduced. RESULTS: CART and BRT models were effective in highlighting a missingness structure in the data, related to the type of data (medical or environmental), the site in which it was collected, the number of visits, and the presence of extreme values. The simulation study revealed that CART models were able to identify variables and values responsible for inducing missingness. There was greater variation in variable importance for unstructured as compared to structured missingness. DISCUSSION: Both CART and BRT models were effective in describing structural missingness in data. CART models may be preferred over BRT models for exploratory analysis of missing data, and selecting variables important for predicting missingness. BRT models can show how values of other variables influence missingness, which may prove useful for researchers. CONCLUSIONS: Researchers are encouraged to use CART and BRT models to explore and understand missing data.
format Online
Article
Text
id pubmed-4486966
institution National Center for Biotechnology Information
language English
publishDate 2015
publisher BMJ Publishing Group
record_format MEDLINE/PubMed
spelling pubmed-4486966 2015-07-20 BMJ Open Research Methods
BMJ Publishing Group 2015-06-29 /pmc/articles/PMC4486966/ /pubmed/26124509 http://dx.doi.org/10.1136/bmjopen-2014-007450 Text en Published by the BMJ Publishing Group Limited. For permission to use (where not already granted under a licence) please go to http://group.bmj.com/group/rights-licensing/permissions This is an Open Access article distributed in accordance with the Creative Commons Attribution Non Commercial (CC BY-NC 4.0) license, which permits others to distribute, remix, adapt, build upon this work non-commercially, and license their derivative works on different terms, provided the original work is properly cited and the use is non-commercial. See: http://creativecommons.org/licenses/by-nc/4.0/
title Using decision trees to understand structure in missing data
topic Research Methods