Characterization of missing values in untargeted MS-based metabolomics data and evaluation of missing data handling strategies

BACKGROUND: Untargeted mass spectrometry (MS)-based metabolomics data often contain missing values that reduce statistical power and can introduce bias in biomedical studies. However, a systematic assessment of the various sources of missing values and strategies to handle these data has received li...

Bibliographic Details
Main Authors: Do, Kieu Trinh, Wahl, Simone, Raffler, Johannes, Molnos, Sophie, Laimighofer, Michael, Adamski, Jerzy, Suhre, Karsten, Strauch, Konstantin, Peters, Annette, Gieger, Christian, Langenberg, Claudia, Stewart, Isobel D., Theis, Fabian J., Grallert, Harald, Kastenmüller, Gabi, Krumsiek, Jan
Format: Online Article Text
Language: English
Published: Springer US 2018
Subjects:
Online Access: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC6153696/
https://www.ncbi.nlm.nih.gov/pubmed/30830398
http://dx.doi.org/10.1007/s11306-018-1420-2
_version_ 1783357554685902848
author Do, Kieu Trinh
Wahl, Simone
Raffler, Johannes
Molnos, Sophie
Laimighofer, Michael
Adamski, Jerzy
Suhre, Karsten
Strauch, Konstantin
Peters, Annette
Gieger, Christian
Langenberg, Claudia
Stewart, Isobel D.
Theis, Fabian J.
Grallert, Harald
Kastenmüller, Gabi
Krumsiek, Jan
author_sort Do, Kieu Trinh
collection PubMed
description BACKGROUND: Untargeted mass spectrometry (MS)-based metabolomics data often contain missing values that reduce statistical power and can introduce bias in biomedical studies. However, a systematic assessment of the various sources of missing values and strategies to handle these data has received little attention. Missing data can occur systematically, e.g. from run day-dependent effects due to limits of detection (LOD); or it can be random as, for instance, a consequence of sample preparation. METHODS: We investigated patterns of missing data in an MS-based metabolomics experiment of serum samples from the German KORA F4 cohort (n = 1750). We then evaluated 31 imputation methods in a simulation framework and biologically validated the results by applying all imputation approaches to real metabolomics data. We examined the ability of each method to reconstruct biochemical pathways from data-driven correlation networks, and the ability of the method to increase statistical power while preserving the strength of established metabolic quantitative trait loci. RESULTS: Run day-dependent LOD-based missing data accounts for most missing values in the metabolomics dataset. Although multiple imputation by chained equations performed well in many scenarios, it is computationally and statistically challenging. K-nearest neighbors (KNN) imputation on observations with variable pre-selection showed robust performance across all evaluation schemes and is computationally more tractable. CONCLUSION: Missing data in untargeted MS-based metabolomics data occur for various reasons. Based on our results, we recommend that KNN-based imputation is performed on observations with variable pre-selection since it showed robust results in all evaluation schemes. ELECTRONIC SUPPLEMENTARY MATERIAL: The online version of this article (10.1007/s11306-018-1420-2) contains supplementary material, which is available to authorized users.
format Online
Article
Text
id pubmed-6153696
institution National Center for Biotechnology Information
language English
publishDate 2018
publisher Springer US
record_format MEDLINE/PubMed
spelling pubmed-61536962018-10-04 Characterization of missing values in untargeted MS-based metabolomics data and evaluation of missing data handling strategies Do, Kieu Trinh Wahl, Simone Raffler, Johannes Molnos, Sophie Laimighofer, Michael Adamski, Jerzy Suhre, Karsten Strauch, Konstantin Peters, Annette Gieger, Christian Langenberg, Claudia Stewart, Isobel D. Theis, Fabian J. Grallert, Harald Kastenmüller, Gabi Krumsiek, Jan Metabolomics Original Article BACKGROUND: Untargeted mass spectrometry (MS)-based metabolomics data often contain missing values that reduce statistical power and can introduce bias in biomedical studies. However, a systematic assessment of the various sources of missing values and strategies to handle these data has received little attention. Missing data can occur systematically, e.g. from run day-dependent effects due to limits of detection (LOD); or it can be random as, for instance, a consequence of sample preparation. METHODS: We investigated patterns of missing data in an MS-based metabolomics experiment of serum samples from the German KORA F4 cohort (n = 1750). We then evaluated 31 imputation methods in a simulation framework and biologically validated the results by applying all imputation approaches to real metabolomics data. We examined the ability of each method to reconstruct biochemical pathways from data-driven correlation networks, and the ability of the method to increase statistical power while preserving the strength of established metabolic quantitative trait loci. RESULTS: Run day-dependent LOD-based missing data accounts for most missing values in the metabolomics dataset. Although multiple imputation by chained equations performed well in many scenarios, it is computationally and statistically challenging. K-nearest neighbors (KNN) imputation on observations with variable pre-selection showed robust performance across all evaluation schemes and is computationally more tractable. 
CONCLUSION: Missing data in untargeted MS-based metabolomics data occur for various reasons. Based on our results, we recommend that KNN-based imputation is performed on observations with variable pre-selection since it showed robust results in all evaluation schemes. ELECTRONIC SUPPLEMENTARY MATERIAL: The online version of this article (10.1007/s11306-018-1420-2) contains supplementary material, which is available to authorized users. Springer US 2018-09-20 2018 /pmc/articles/PMC6153696/ /pubmed/30830398 http://dx.doi.org/10.1007/s11306-018-1420-2 Text en © The Author(s) 2018 Open AccessThis article is distributed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made.
spellingShingle Original Article
Do, Kieu Trinh
Wahl, Simone
Raffler, Johannes
Molnos, Sophie
Laimighofer, Michael
Adamski, Jerzy
Suhre, Karsten
Strauch, Konstantin
Peters, Annette
Gieger, Christian
Langenberg, Claudia
Stewart, Isobel D.
Theis, Fabian J.
Grallert, Harald
Kastenmüller, Gabi
Krumsiek, Jan
Characterization of missing values in untargeted MS-based metabolomics data and evaluation of missing data handling strategies
title Characterization of missing values in untargeted MS-based metabolomics data and evaluation of missing data handling strategies
topic Original Article
url https://www.ncbi.nlm.nih.gov/pmc/articles/PMC6153696/
https://www.ncbi.nlm.nih.gov/pubmed/30830398
http://dx.doi.org/10.1007/s11306-018-1420-2