
A real data-driven simulation strategy to select an imputation method for mixed-type trait data

Missing observations in trait datasets pose an obstacle for analyses in myriad biological disciplines. Considering the mixed results of imputation, the wide variety of available methods, and the varied structure of real trait datasets, a framework for selecting a suitable imputation method is advant...

Bibliographic Details
Main Authors:	May, Jacqueline A., Feng, Zeny, Adamowicz, Sarah J.
Format:	Online Article Text
Language:	English
Published:	Public Library of Science 2023
Subjects:	Research Article
Online Access:	https://www.ncbi.nlm.nih.gov/pmc/articles/PMC10069776/ https://www.ncbi.nlm.nih.gov/pubmed/36947561 http://dx.doi.org/10.1371/journal.pcbi.1010154

_version_	1785018915424829440
author	May, Jacqueline A. Feng, Zeny Adamowicz, Sarah J.
collection	PubMed
description	Missing observations in trait datasets pose an obstacle for analyses in myriad biological disciplines. Considering the mixed results of imputation, the wide variety of available methods, and the varied structure of real trait datasets, a framework for selecting a suitable imputation method is advantageous. We invoked a real data-driven simulation strategy to select an imputation method for a given mixed-type (categorical, count, continuous) target dataset. Candidate methods included mean/mode imputation, k-nearest neighbour, random forests, and multivariate imputation by chained equations (MICE). Using a trait dataset of squamates (lizards and amphisbaenians; order: Squamata) as a target dataset, a complete-case dataset consisting of species with nearly complete information was formed for the imputation method selection. Missing data were induced by removing values from this dataset under different missingness mechanisms: missing completely at random (MCAR), missing at random (MAR), and missing not at random (MNAR). For each method, combinations with and without phylogenetic information from single gene (nuclear and mitochondrial) or multigene trees were used to impute the missing values for five numerical and two categorical traits. The performances of the methods were evaluated under each missing mechanism by determining the mean squared error and proportion falsely classified rates for numerical and categorical traits, respectively. A random forest method supplemented with a nuclear-derived phylogeny resulted in the lowest error rates for the majority of traits, and this method was used to impute missing values in the original dataset. Data with imputed values better reflected the characteristics and distributions of the original data compared to complete-case data. However, caution should be taken when imputing trait data as phylogeny did not always improve performance for every trait and in every scenario. Ultimately, these results support the use of a real data-driven simulation strategy for selecting a suitable imputation method for a given mixed-type trait dataset.
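The selection loop summarized in the description (remove values from complete data under MCAR, MAR, and MNAR, impute them with each candidate method, and compare the error) can be sketched roughly as follows. This is a minimal illustration and not the authors' implementation: the synthetic `trait` and `aux` variables, the rank-based missingness weights, the 20% missingness rate, and the one-dimensional k-nearest-neighbour imputer are all assumptions made for the example, and phylogenetic information, random forests, and MICE are not modelled here.

```python
import numpy as np

def ranks(v):
    """Ranks 1..n of the values in v (ties broken by position)."""
    return np.argsort(np.argsort(v)) + 1

def induce_missing(x, aux, rate, mechanism, rng):
    """Return a boolean mask marking which values of x to remove."""
    n = len(x)
    if mechanism == "MCAR":      # every value equally likely to go missing
        weights = np.ones(n)
    elif mechanism == "MAR":     # depends on another, fully observed trait
        weights = ranks(aux).astype(float)
    elif mechanism == "MNAR":    # depends on the removed value itself
        weights = ranks(x).astype(float)
    else:
        raise ValueError(f"unknown mechanism: {mechanism}")
    n_missing = int(round(rate * n))
    idx = rng.choice(n, size=n_missing, replace=False, p=weights / weights.sum())
    mask = np.zeros(n, dtype=bool)
    mask[idx] = True
    return mask

def impute_mean(x, mask):
    """Fill removed values with the mean of the remaining values."""
    filled = x.copy()
    filled[mask] = x[~mask].mean()
    return filled

def impute_knn(x, aux, mask, k=5):
    """Fill each removed value with the mean of its k nearest observed
    neighbours, where distance is measured on the auxiliary trait."""
    filled = x.copy()
    observed = np.flatnonzero(~mask)
    for i in np.flatnonzero(mask):
        dist = np.abs(aux[observed] - aux[i])
        nearest = observed[np.argsort(dist)[:k]]
        filled[i] = x[nearest].mean()
    return filled

def mse(truth, filled, mask):
    """Mean squared error over the removed numerical values."""
    return float(np.mean((truth[mask] - filled[mask]) ** 2))

def pfc(truth, filled, mask):
    """Proportion falsely classified over the removed categorical values."""
    return float(np.mean(truth[mask] != filled[mask]))

# Hypothetical complete-case data: one numerical trait correlated with
# an auxiliary trait that is never removed.
rng = np.random.default_rng(0)
n = 200
aux = rng.normal(size=n)
trait = 2.0 * aux + rng.normal(scale=0.5, size=n)

results = {}
for mechanism in ("MCAR", "MAR", "MNAR"):
    mask = induce_missing(trait, aux, rate=0.2, mechanism=mechanism, rng=rng)
    results[mechanism] = {
        "mean": mse(trait, impute_mean(trait, mask), mask),
        "knn": mse(trait, impute_knn(trait, aux, mask), mask),
    }

# Method with the lowest error under each missingness mechanism.
best = {m: min(errs, key=errs.get) for m, errs in results.items()}
```

Because the true values are known for every removed cell, the candidate with the lowest error under each mechanism can be picked directly from `results`; for categorical traits, `pfc` would take the place of `mse`.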
format	Online Article Text
id	pubmed-10069776
institution	National Center for Biotechnology Information
language	English
publishDate	2023
publisher	Public Library of Science
record_format	MEDLINE/PubMed
spelling	pubmed-10069776 2023-04-04 A real data-driven simulation strategy to select an imputation method for mixed-type trait data May, Jacqueline A. Feng, Zeny Adamowicz, Sarah J. PLoS Comput Biol Research Article Public Library of Science 2023-03-22 /pmc/articles/PMC10069776/ /pubmed/36947561 http://dx.doi.org/10.1371/journal.pcbi.1010154 Text en © 2023 May et al https://creativecommons.org/licenses/by/4.0/ This is an open access article distributed under the terms of the Creative Commons Attribution License (https://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.
title	A real data-driven simulation strategy to select an imputation method for mixed-type trait data
topic	Research Article
