
Random forest-based imputation outperforms other methods for imputing LC-MS metabolomics data: a comparative study

BACKGROUND: LC-MS technology makes it possible to measure the relative abundance of numerous molecular features of a sample in a single analysis. However, especially non-targeted metabolite profiling approaches generate vast arrays of data that are prone to aberrations such as missing values. No matte...


Bibliographic Details
Main authors:	Kokla, Marietta; Virtanen, Jyrki; Kolehmainen, Marjukka; Paananen, Jussi; Hanhineva, Kati
Format:	Online Article Text
Language:	English
Published:	BioMed Central 2019
Subjects:	Research Article
Online access:	https://www.ncbi.nlm.nih.gov/pmc/articles/PMC6788053/ https://www.ncbi.nlm.nih.gov/pubmed/31601178 http://dx.doi.org/10.1186/s12859-019-3110-0

author	Kokla, Marietta; Virtanen, Jyrki; Kolehmainen, Marjukka; Paananen, Jussi; Hanhineva, Kati
author_sort	Kokla, Marietta
collection	PubMed
description	BACKGROUND: LC-MS technology makes it possible to measure the relative abundance of numerous molecular features of a sample in a single analysis. However, especially non-targeted metabolite profiling approaches generate vast arrays of data that are prone to aberrations such as missing values. No matter the reason for the missing values in the data, a coherent and complete data matrix is always a prerequisite for accurate and reliable statistical analysis. Therefore, there is a need for proper imputation strategies that account for the missingness and reduce the bias in the statistical analysis. RESULTS: Here we present our results after evaluating nine imputation methods at four different percentages of missing values of different origin. The performance of each imputation method was analyzed by Normalized Root Mean Squared Error (NRMSE). We demonstrated that random forest (RF) had the lowest NRMSE in the estimation of missing values for Missing at Random (MAR) and Missing Completely at Random (MCAR). For values Missing Not at Random (MNAR), the left-truncated data were best imputed with minimum value imputation. We also tested the different imputation methods for datasets containing missing data of various origin, and RF was the most accurate method in all cases. The results were obtained by repeating the evaluation process 100 times with the use of metabolomics datasets where the missing values were introduced to represent absent data of different origin. CONCLUSION: The type and rate of missingness affect the performance and suitability of imputation methods. The RF-based imputation method performs best in most of the tested scenarios, including combinations of different types and rates of missingness. Therefore, we recommend using random forest-based imputation for imputing missing metabolomics data, especially in situations where the types of missingness are not known in advance.
format	Online Article Text
id	pubmed-6788053
institution	National Center for Biotechnology Information
language	English
publishDate	2019
publisher	BioMed Central
record_format	MEDLINE/PubMed
spelling	pubmed-6788053 2019-10-18 Random forest-based imputation outperforms other methods for imputing LC-MS metabolomics data: a comparative study Kokla, Marietta; Virtanen, Jyrki; Kolehmainen, Marjukka; Paananen, Jussi; Hanhineva, Kati BMC Bioinformatics Research Article BioMed Central 2019-10-11 /pmc/articles/PMC6788053/ /pubmed/31601178 http://dx.doi.org/10.1186/s12859-019-3110-0 Text en © The Author(s). 2019 Open Access This article is distributed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated.
title	Random forest-based imputation outperforms other methods for imputing LC-MS metabolomics data: a comparative study
title_sort	random forest-based imputation outperforms other methods for imputing lc-ms metabolomics data: a comparative study
topic	Research Article
url	https://www.ncbi.nlm.nih.gov/pmc/articles/PMC6788053/ https://www.ncbi.nlm.nih.gov/pubmed/31601178 http://dx.doi.org/10.1186/s12859-019-3110-0
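
The evaluation described in the abstract (introduce missing values into complete data, impute them, and score the result by NRMSE) can be sketched in a few lines. This is only an illustration under stated assumptions, not the authors' code: it uses simulated log-normal values instead of metabolomics data, two simple imputers (minimum and mean) rather than random forest, and it assumes NRMSE is normalized by the variance of the true values, which this record does not specify. All function names are hypothetical.

```python
import math
import random
import statistics


def nrmse(true_vals, imputed_vals):
    """RMSE normalized by the variance of the true values (assumed normalization)."""
    mse = statistics.fmean((t - i) ** 2 for t, i in zip(true_vals, imputed_vals))
    return math.sqrt(mse / statistics.pvariance(true_vals))


def mask_mcar(values, rate, rng):
    """MCAR: every entry has the same chance of going missing."""
    n_missing = round(rate * len(values))
    return set(rng.sample(range(len(values)), n_missing))


def mask_left_truncated(values, rate):
    """Simple MNAR model: the lowest entries go missing (e.g. below a detection limit)."""
    n_missing = round(rate * len(values))
    order = sorted(range(len(values)), key=lambda i: values[i])
    return set(order[:n_missing])


def impute_constant(values, missing, how):
    """Fill missing entries with the minimum or the mean of the observed entries."""
    observed = [v for i, v in enumerate(values) if i not in missing]
    fill = min(observed) if how == "min" else statistics.fmean(observed)
    return [fill if i in missing else v for i, v in enumerate(values)]


def score(values, missing, how):
    """NRMSE computed over the masked entries only."""
    imputed = impute_constant(values, missing, how)
    idx = sorted(missing)
    return nrmse([values[i] for i in idx], [imputed[i] for i in idx])


# Simulated single feature; a real evaluation would use a complete data matrix.
rng = random.Random(0)
feature = [rng.lognormvariate(0, 1) for _ in range(200)]
mcar = mask_mcar(feature, 0.2, rng)
mnar = mask_left_truncated(feature, 0.2)

for name, mask in (("MCAR", mcar), ("MNAR", mnar)):
    print(name, {how: round(score(feature, mask, how), 3) for how in ("min", "mean")})
```

Under left truncation every hidden value lies at or below the smallest observed value, so the minimum is always closer to the truth than the mean; this is in line with the abstract's finding that minimum value imputation suits MNAR data. The abstract reports repeating the evaluation 100 times; here that would mean looping over different seeds and averaging the scores.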


Similar Items