
Random forest-based imputation outperforms other methods for imputing LC-MS metabolomics data: a comparative study


Bibliographic Details
Main authors: Kokla, Marietta, Virtanen, Jyrki, Kolehmainen, Marjukka, Paananen, Jussi, Hanhineva, Kati
Format: Online Article Text
Language: English
Published: BioMed Central 2019
Subjects:
Online access: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC6788053/
https://www.ncbi.nlm.nih.gov/pubmed/31601178
http://dx.doi.org/10.1186/s12859-019-3110-0
author Kokla, Marietta
Virtanen, Jyrki
Kolehmainen, Marjukka
Paananen, Jussi
Hanhineva, Kati
collection PubMed
description BACKGROUND: LC-MS technology makes it possible to measure the relative abundance of numerous molecular features of a sample in a single analysis. However, non-targeted metabolite profiling approaches in particular generate vast arrays of data that are prone to aberrations such as missing values. Whatever the reason for the missing values, a coherent and complete data matrix is always a prerequisite for accurate and reliable statistical analysis. Therefore, there is a need for proper imputation strategies that account for the missingness and reduce the bias in the statistical analysis. RESULTS: Here we present our results after evaluating nine imputation methods at four different percentages of missing values of different origin. The performance of each imputation method was assessed by Normalized Root Mean Squared Error (NRMSE). We demonstrated that random forest (RF) had the lowest NRMSE in the estimation of missing values for Missing at Random (MAR) and Missing Completely at Random (MCAR). For values absent due to Missing Not at Random (MNAR), the left-truncated data was best imputed with minimum value imputation. We also tested the different imputation methods on datasets containing missing data of various origin, and RF was the most accurate method in all cases. The results were obtained by repeating the evaluation process 100 times on metabolomics datasets in which missing values were introduced to represent absent data of different origin. CONCLUSION: The type and rate of missingness affect the performance and suitability of imputation methods. The RF-based imputation method performs best in most of the tested scenarios, including combinations of different types and rates of missingness. Therefore, we recommend using random forest-based imputation for imputing missing metabolomics data, especially in situations where the types of missingness are not known in advance.
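The evaluation scheme summarized in the description (mask known values, impute them, score the result with NRMSE) can be illustrated with a minimal sketch. Everything below is an assumption made for illustration and is not taken from the article: the synthetic data, the helper names (`nrmse`, `mask_mcar`, `mask_mnar`, `impute`, `evaluate`), the choice to normalize by the variance of the complete column, and the use of mean imputation as a comparison baseline. Random forest imputation itself is not shown.

```python
import math
import random


def nrmse(true_vals, imputed_vals, variance):
    """RMSE over the masked entries, normalized by a reference variance.

    Assumption: the normalizer is the variance of the complete column;
    the article may define NRMSE differently.
    """
    mse = sum((t - p) ** 2 for t, p in zip(true_vals, imputed_vals)) / len(true_vals)
    return math.sqrt(mse / variance)


def mask_mcar(column, rate, rng):
    """Pick indices to remove completely at random (MCAR)."""
    k = round(rate * len(column))
    return set(rng.sample(range(len(column)), k))


def mask_mnar(column, rate):
    """Pick the indices of the lowest values (left truncation, MNAR)."""
    k = round(rate * len(column))
    order = sorted(range(len(column)), key=lambda i: column[i])
    return set(order[:k])


def impute(column, missing, strategy):
    """Fill masked entries with the minimum or the mean of the observed values."""
    observed = [v for i, v in enumerate(column) if i not in missing]
    fill = min(observed) if strategy == "min" else sum(observed) / len(observed)
    return [fill if i in missing else v for i, v in enumerate(column)]


def evaluate(column, missing, strategy):
    """NRMSE of one imputation strategy on the masked entries of one column."""
    mean = sum(column) / len(column)
    variance = sum((v - mean) ** 2 for v in column) / len(column)
    imputed = impute(column, missing, strategy)
    idx = sorted(missing)
    return nrmse([column[i] for i in idx], [imputed[i] for i in idx], variance)


rng = random.Random(0)
# Synthetic stand-in for the log-abundances of a single metabolite feature.
column = [rng.gauss(10.0, 1.0) for _ in range(200)]
masks = {"MCAR": mask_mcar(column, 0.2, rng), "MNAR": mask_mnar(column, 0.2)}
for name, missing in masks.items():
    for strategy in ("min", "mean"):
        print(name, strategy, round(evaluate(column, missing, strategy), 3))
```

Under left truncation every masked value lies at or below the smallest observed value, so filling with the observed minimum can never score worse than filling with the observed mean. That is consistent with the description's report that minimum value imputation suits MNAR data.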
format Online
Article
Text
id pubmed-6788053
institution National Center for Biotechnology Information
language English
publishDate 2019
publisher BioMed Central
record_format MEDLINE/PubMed
spelling pubmed-6788053 2019-10-18 Random forest-based imputation outperforms other methods for imputing LC-MS metabolomics data: a comparative study Kokla, Marietta Virtanen, Jyrki Kolehmainen, Marjukka Paananen, Jussi Hanhineva, Kati BMC Bioinformatics Research Article BioMed Central 2019-10-11 /pmc/articles/PMC6788053/ /pubmed/31601178 http://dx.doi.org/10.1186/s12859-019-3110-0 Text en © The Author(s). 2019 Open Access. This article is distributed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated.
title Random forest-based imputation outperforms other methods for imputing LC-MS metabolomics data: a comparative study
topic Research Article
url https://www.ncbi.nlm.nih.gov/pmc/articles/PMC6788053/
https://www.ncbi.nlm.nih.gov/pubmed/31601178
http://dx.doi.org/10.1186/s12859-019-3110-0