Random forest-based imputation outperforms other methods for imputing LC-MS metabolomics data: a comparative study
BACKGROUND: LC-MS technology makes it possible to measure the relative abundance of numerous molecular features of a sample in a single analysis. However, non-targeted metabolite profiling approaches in particular generate vast arrays of data that are prone to aberrations such as missing values. No matte...
Main Authors: | Kokla, Marietta; Virtanen, Jyrki; Kolehmainen, Marjukka; Paananen, Jussi; Hanhineva, Kati |
---|---|
Format: | Online Article Text |
Language: | English |
Published: | BioMed Central 2019 |
Subjects: | |
Online Access: | https://www.ncbi.nlm.nih.gov/pmc/articles/PMC6788053/ https://www.ncbi.nlm.nih.gov/pubmed/31601178 http://dx.doi.org/10.1186/s12859-019-3110-0 |
_version_ | 1783458414157889536 |
---|---|
author | Kokla, Marietta; Virtanen, Jyrki; Kolehmainen, Marjukka; Paananen, Jussi; Hanhineva, Kati |
author_facet | Kokla, Marietta; Virtanen, Jyrki; Kolehmainen, Marjukka; Paananen, Jussi; Hanhineva, Kati |
author_sort | Kokla, Marietta |
collection | PubMed |
description | BACKGROUND: LC-MS technology makes it possible to measure the relative abundance of numerous molecular features of a sample in a single analysis. However, non-targeted metabolite profiling approaches in particular generate vast arrays of data that are prone to aberrations such as missing values. No matter the reason for the missing values in the data, a coherent and complete data matrix is always a prerequisite for accurate and reliable statistical analysis. Therefore, there is a need for proper imputation strategies that account for the missingness and reduce the bias in the statistical analysis. RESULTS: Here we present our results after evaluating nine imputation methods at four different percentages of missing values of different origins. The performance of each imputation method was analyzed by Normalized Root Mean Squared Error (NRMSE). We demonstrated that random forest (RF) had the lowest NRMSE in the estimation of missing values for Missing at Random (MAR) and Missing Completely at Random (MCAR). For values absent due to Missing Not at Random (MNAR), the left-truncated data were best imputed with minimum value imputation. We also tested the different imputation methods for datasets containing missing data of various origins, and RF was the most accurate method in all cases. The results were obtained by repeating the evaluation process 100 times with the use of metabolomics datasets where the missing values were introduced to represent absent data of different origins. CONCLUSION: The type and rate of missingness affect the performance and suitability of imputation methods. The RF-based imputation method performs best in most of the tested scenarios, including combinations of different types and rates of missingness. Therefore, we recommend using random forest-based imputation for imputing missing metabolomics data, especially in situations where the types of missingness are not known in advance. |
format | Online Article Text |
id | pubmed-6788053 |
institution | National Center for Biotechnology Information |
language | English |
publishDate | 2019 |
publisher | BioMed Central |
record_format | MEDLINE/PubMed |
spelling | pubmed-6788053 2019-10-18 Random forest-based imputation outperforms other methods for imputing LC-MS metabolomics data: a comparative study Kokla, Marietta; Virtanen, Jyrki; Kolehmainen, Marjukka; Paananen, Jussi; Hanhineva, Kati BMC Bioinformatics Research Article BACKGROUND: LC-MS technology makes it possible to measure the relative abundance of numerous molecular features of a sample in a single analysis. However, non-targeted metabolite profiling approaches in particular generate vast arrays of data that are prone to aberrations such as missing values. No matter the reason for the missing values in the data, a coherent and complete data matrix is always a prerequisite for accurate and reliable statistical analysis. Therefore, there is a need for proper imputation strategies that account for the missingness and reduce the bias in the statistical analysis. RESULTS: Here we present our results after evaluating nine imputation methods at four different percentages of missing values of different origins. The performance of each imputation method was analyzed by Normalized Root Mean Squared Error (NRMSE). We demonstrated that random forest (RF) had the lowest NRMSE in the estimation of missing values for Missing at Random (MAR) and Missing Completely at Random (MCAR). For values absent due to Missing Not at Random (MNAR), the left-truncated data were best imputed with minimum value imputation. We also tested the different imputation methods for datasets containing missing data of various origins, and RF was the most accurate method in all cases. The results were obtained by repeating the evaluation process 100 times with the use of metabolomics datasets where the missing values were introduced to represent absent data of different origins. CONCLUSION: The type and rate of missingness affect the performance and suitability of imputation methods. The RF-based imputation method performs best in most of the tested scenarios, including combinations of different types and rates of missingness. Therefore, we recommend using random forest-based imputation for imputing missing metabolomics data, especially in situations where the types of missingness are not known in advance. BioMed Central 2019-10-11 /pmc/articles/PMC6788053/ /pubmed/31601178 http://dx.doi.org/10.1186/s12859-019-3110-0 Text en © The Author(s). 2019 Open Access This article is distributed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated. |
spellingShingle | Research Article Kokla, Marietta; Virtanen, Jyrki; Kolehmainen, Marjukka; Paananen, Jussi; Hanhineva, Kati Random forest-based imputation outperforms other methods for imputing LC-MS metabolomics data: a comparative study |
title | Random forest-based imputation outperforms other methods for imputing LC-MS metabolomics data: a comparative study |
title_full | Random forest-based imputation outperforms other methods for imputing LC-MS metabolomics data: a comparative study |
title_fullStr | Random forest-based imputation outperforms other methods for imputing LC-MS metabolomics data: a comparative study |
title_full_unstemmed | Random forest-based imputation outperforms other methods for imputing LC-MS metabolomics data: a comparative study |
title_short | Random forest-based imputation outperforms other methods for imputing LC-MS metabolomics data: a comparative study |
title_sort | random forest-based imputation outperforms other methods for imputing lc-ms metabolomics data: a comparative study |
topic | Research Article |
url | https://www.ncbi.nlm.nih.gov/pmc/articles/PMC6788053/ https://www.ncbi.nlm.nih.gov/pubmed/31601178 http://dx.doi.org/10.1186/s12859-019-3110-0 |
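The evaluation scheme summarized in the description (remove known values, impute them, score the result by NRMSE) can be illustrated with a small sketch. This is not the authors' pipeline and it does not include random forest imputation. It is a standard-library toy that assumes one common NRMSE definition (root mean squared error divided by the variance of the true values at the masked positions), a synthetic log-normal feature, and hypothetical helper names (`mask_mcar`, `mask_mnar`, `impute`, `evaluate`). It compares only minimum value and mean imputation.

```python
import math
import random
import statistics


def nrmse(true_vals, imputed_vals):
    """RMSE normalized by the variance of the true values (assumed definition)."""
    mse = statistics.fmean((t - i) ** 2 for t, i in zip(true_vals, imputed_vals))
    return math.sqrt(mse / statistics.pvariance(true_vals))


def mask_mcar(values, rate, rng):
    """MCAR: every entry has the same chance of being removed."""
    n_missing = round(rate * len(values))
    return set(rng.sample(range(len(values)), n_missing))


def mask_mnar(values, rate):
    """MNAR by left truncation: the lowest values are removed."""
    n_missing = round(rate * len(values))
    order = sorted(range(len(values)), key=lambda k: values[k])
    return set(order[:n_missing])


def impute(values, missing, strategy):
    """Fill masked positions with the minimum or mean of the observed values."""
    observed = [v for k, v in enumerate(values) if k not in missing]
    fill = min(observed) if strategy == "min" else statistics.fmean(observed)
    return [fill if k in missing else v for k, v in enumerate(values)]


def evaluate(values, missing, strategy):
    """Score an imputation strategy on the masked positions only."""
    imputed = impute(values, missing, strategy)
    idx = sorted(missing)
    return nrmse([values[k] for k in idx], [imputed[k] for k in idx])


rng = random.Random(0)
# Synthetic stand-in for one metabolite feature (an assumption, not study data).
feature = [rng.lognormvariate(0.0, 1.0) for _ in range(200)]
mcar = mask_mcar(feature, 0.2, rng)
mnar = mask_mnar(feature, 0.2)

results = {
    (mechanism, strategy): evaluate(feature, mask, strategy)
    for mechanism, mask in (("MCAR", mcar), ("MNAR", mnar))
    for strategy in ("min", "mean")
}
for key, score in sorted(results.items()):
    print(key, round(score, 3))
```

Under left truncation every removed value lies at or below the smallest observed value, so the minimum fill is never farther from the truth than the mean fill. That is consistent with the MNAR finding reported in the description. A fuller comparison would repeat the masking many times and add model-based imputers.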