
Distribution based nearest neighbor imputation for truncated high dimensional data with applications to pre-clinical and clinical metabolomics studies

BACKGROUND: High throughput metabolomics makes it possible to measure the relative abundances of numerous metabolites in biological samples, which is useful to many areas of biomedical research. However, missing values (MVs) in metabolomics datasets are common and can arise due to both technical and...


Bibliographic Details
Main Authors:	Shah, Jasmit S., Rai, Shesh N., DeFilippis, Andrew P., Hill, Bradford G., Bhatnagar, Aruni, Brock, Guy N.
Format:	Online Article Text
Language:	English
Published:	BioMed Central 2017
Subjects:	Methodology Article
Online Access:	https://www.ncbi.nlm.nih.gov/pmc/articles/PMC5319174/ https://www.ncbi.nlm.nih.gov/pubmed/28219348 http://dx.doi.org/10.1186/s12859-017-1547-6

_version_	1782509334705471488
author	Shah, Jasmit S. Rai, Shesh N. DeFilippis, Andrew P. Hill, Bradford G. Bhatnagar, Aruni Brock, Guy N.
author_sort	Shah, Jasmit S.
collection	PubMed
description	BACKGROUND: High throughput metabolomics makes it possible to measure the relative abundances of numerous metabolites in biological samples, which is useful to many areas of biomedical research. However, missing values (MVs) in metabolomics datasets are common and can arise due to both technical and biological reasons. Typically, such MVs are substituted by a minimum value, which may lead to different results in downstream analyses. RESULTS: Here we present a modified version of the K-nearest neighbor (KNN) approach which accounts for truncation at the minimum value, i.e., KNN truncation (KNN-TN). We compare imputation results based on KNN-TN with results from other KNN approaches such as KNN based on correlation (KNN-CR) and KNN based on Euclidean distance (KNN-EU). Our approach assumes that the data follow a truncated normal distribution with the truncation point at the detection limit (LOD). The effectiveness of each approach was analyzed by the root mean square error (RMSE) measure as well as the metabolite list concordance index (MLCI) for influence on downstream statistical testing. Through extensive simulation studies and application to three real data sets, we show that KNN-TN has lower RMSE values compared to the other two KNN procedures as well as simpler imputation methods based on substituting missing values with the metabolite mean, zero values, or the LOD. MLCI values between KNN-TN and KNN-EU were roughly equivalent, and superior to the other four methods in most cases. CONCLUSION: Our findings demonstrate that KNN-TN generally has improved performance in imputing the missing values of the different datasets compared to KNN-CR and KNN-EU when there is missingness due to missing at random combined with an LOD. The results shown in this study are in the field of metabolomics but this method could be applicable with any high throughput technology which has missing due to LOD. 
ELECTRONIC SUPPLEMENTARY MATERIAL: The online version of this article (doi:10.1186/s12859-017-1547-6) contains supplementary material, which is available to authorized users.
format	Online Article Text
id	pubmed-5319174
institution	National Center for Biotechnology Information
language	English
publishDate	2017
publisher	BioMed Central
record_format	MEDLINE/PubMed
spelling	pubmed-5319174 2017-02-24 Distribution based nearest neighbor imputation for truncated high dimensional data with applications to pre-clinical and clinical metabolomics studies Shah, Jasmit S. Rai, Shesh N. DeFilippis, Andrew P. Hill, Bradford G. Bhatnagar, Aruni Brock, Guy N. BMC Bioinformatics Methodology Article BioMed Central 2017-02-20 /pmc/articles/PMC5319174/ /pubmed/28219348 http://dx.doi.org/10.1186/s12859-017-1547-6 Text en © The Author(s). 2017 Open Access This article is distributed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated.
title	Distribution based nearest neighbor imputation for truncated high dimensional data with applications to pre-clinical and clinical metabolomics studies
topic	Methodology Article
url	https://www.ncbi.nlm.nih.gov/pmc/articles/PMC5319174/ https://www.ncbi.nlm.nih.gov/pubmed/28219348 http://dx.doi.org/10.1186/s12859-017-1547-6
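The abstract in this record names a Euclidean-distance KNN imputation (KNN-EU), LOD substitution, and RMSE as the error measure. The sketch below is only a rough illustration of those three ideas, not the authors' implementation: the function names `knn_eu_impute` and `rmse`, the layout (rows are metabolites, columns are samples), the unweighted neighbor mean, and the simulated data with a 10% quantile as the detection limit are all assumptions made for the example. The truncated-normal step that distinguishes KNN-TN is not reproduced, because the record gives no detail on it.

```python
import numpy as np


def knn_eu_impute(X, k=3):
    """Fill NaNs in X (rows = metabolites, columns = samples) with the
    mean of the k nearest rows under Euclidean distance.

    Hypothetical helper for illustration; distances use only the columns
    observed in both rows.
    """
    X = np.asarray(X, dtype=float)
    filled = X.copy()
    for i, j in zip(*np.where(np.isnan(X))):
        candidates = []
        for r in range(X.shape[0]):
            # A neighbor must be a different row with a value in column j.
            if r == i or np.isnan(X[r, j]):
                continue
            shared = ~np.isnan(X[i]) & ~np.isnan(X[r])
            if not shared.any():
                continue
            dist = np.sqrt(np.mean((X[i, shared] - X[r, shared]) ** 2))
            candidates.append((dist, X[r, j]))
        if candidates:
            candidates.sort(key=lambda c: c[0])
            filled[i, j] = np.mean([value for _, value in candidates[:k]])
    return filled


def rmse(truth, imputed, mask):
    """Root mean square error over the entries flagged in mask."""
    diff = np.asarray(truth, dtype=float)[mask] - np.asarray(imputed, dtype=float)[mask]
    return float(np.sqrt(np.mean(diff ** 2)))


# Simulated example (assumed data, not from the study): values below a
# hypothetical detection limit are set to missing, then imputed two ways.
rng = np.random.default_rng(0)
truth = rng.normal(loc=5.0, scale=1.0, size=(20, 8))
lod = np.quantile(truth, 0.10)
observed = np.where(truth < lod, np.nan, truth)
mask = np.isnan(observed)

knn_filled = knn_eu_impute(observed, k=5)
lod_filled = np.where(mask, lod, observed)

print("RMSE, KNN-EU sketch:", rmse(truth, knn_filled, mask))
print("RMSE, LOD substitution:", rmse(truth, lod_filled, mask))
```

Because the simulated rows are independent, the printed numbers say nothing about the methods' relative performance on real metabolomics data; the example only shows how the pieces fit together.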
