PLS-Based and Regularization-Based Methods for the Selection of Relevant Variables in Non-targeted Metabolomics Data

Non-targeted metabolomics is a part of systems biology and aims to determine numerous metabolites in complex biological samples. Datasets obtained in non-targeted metabolomics studies are high-dimensional owing to the sensitivity of mass spectrometry-based detection methods as well as c...

Bibliographic Details
Main authors: Bujak, Renata, Daghir-Wojtkowiak, Emilia, Kaliszan, Roman, Markuszewski, Michał J.
Format: Online Article Text
Language: English
Published: Frontiers Media S.A. 2016
Subjects: Molecular Biosciences
Online access: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4960252/
https://www.ncbi.nlm.nih.gov/pubmed/27508208
http://dx.doi.org/10.3389/fmolb.2016.00035
author Bujak, Renata
Daghir-Wojtkowiak, Emilia
Kaliszan, Roman
Markuszewski, Michał J.
collection PubMed
description Non-targeted metabolomics is a part of systems biology and aims to determine numerous metabolites in complex biological samples. Datasets obtained in non-targeted metabolomics studies are high-dimensional owing to the sensitivity of mass spectrometry-based detection methods and the complexity of biological matrices. Therefore, proper selection of the variables that contribute to group classification is a crucial step, especially in metabolomics studies focused on searching for disease biomarker candidates. In the present study, three statistical approaches were tested using two metabolomics datasets (the RH and PH studies): orthogonal projections to latent structures-discriminant analysis (OPLS-DA) without and with multiple testing correction, and the least absolute shrinkage and selection operator (LASSO) with bootstrapping. For the RH study, the OPLS-DA model built without multiple testing correction selected 46 and 218 variables based on the VIP criteria using Pareto and UV scaling, respectively. For the PH study, 217 and 320 variables were selected based on the VIP criteria using Pareto and UV scaling, respectively. In the RH study, the OPLS-DA model built after correcting for multiple testing selected 4 and 19 variables using Pareto and UV scaling, respectively. For the PH study, 14 and 18 variables were selected based on the VIP criteria using Pareto and UV scaling, respectively. In the RH and PH studies, the LASSO selected 14 and 4 variables with reproducibility between 99.3 and 100%, respectively. For PLS-based models, the larger the search space, the higher the probability of developing models that fit the training data well but perform poorly on the validation set. The LASSO offers potential improvements over standard linear regression because of its constraint, which promotes sparse solutions. This paper is the first to date to use LASSO penalized logistic regression in untargeted metabolomics studies.
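The abstract compares variable selection under Pareto and UV (unit variance) scaling. As a minimal sketch of how the two scalings differ, assuming the usual textbook definitions (mean-centering, then division by the standard deviation for UV or by its square root for Pareto), the following may help. The function name `scale_columns` and the toy intensity matrix are hypothetical and are not taken from the paper.

```python
from statistics import mean, stdev


def scale_columns(matrix, method="uv"):
    """Mean-center each column, then divide by SD (UV) or sqrt(SD) (Pareto).

    `matrix` is a list of rows (samples); each column is one variable.
    This follows the common textbook definitions, not necessarily the
    exact preprocessing used in the study.
    """
    if method not in ("uv", "pareto"):
        raise ValueError("method must be 'uv' or 'pareto'")
    scaled_columns = []
    for col in zip(*matrix):
        mu = mean(col)
        sd = stdev(col)  # sample standard deviation (n - 1 denominator)
        if sd == 0:
            # A constant variable carries no information; map it to zeros.
            scaled_columns.append([0.0] * len(col))
            continue
        divisor = sd if method == "uv" else sd ** 0.5
        scaled_columns.append([(x - mu) / divisor for x in col])
    # Transpose back to rows = samples, columns = variables.
    return [list(row) for row in zip(*scaled_columns)]


# Toy data (hypothetical): a low-intensity and a high-intensity variable.
intensities = [[1.0, 100.0], [2.0, 200.0], [3.0, 300.0]]
uv_scaled = scale_columns(intensities, method="uv")
pareto_scaled = scale_columns(intensities, method="pareto")
```

With UV scaling both toy variables end up with the same spread, whereas Pareto scaling leaves the high-intensity variable with a larger spread. The two scalings therefore weight features differently, which is one plausible reason the number of selected variables can differ between them.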
format Online
Article
Text
id pubmed-4960252
institution National Center for Biotechnology Information
language English
publishDate 2016
publisher Frontiers Media S.A.
record_format MEDLINE/PubMed
spelling pubmed-4960252 2016-08-09 Front Mol Biosci Molecular Biosciences Frontiers Media S.A. 2016-07-26 /pmc/articles/PMC4960252/ /pubmed/27508208 http://dx.doi.org/10.3389/fmolb.2016.00035 Text en Copyright © 2016 Bujak, Daghir-Wojtkowiak, Kaliszan and Markuszewski. http://creativecommons.org/licenses/by/4.0/ This is an open-access article distributed under the terms of the Creative Commons Attribution License (CC BY). The use, distribution or reproduction in other forums is permitted, provided the original author(s) or licensor are credited and that the original publication in this journal is cited, in accordance with accepted academic practice. No use, distribution or reproduction is permitted which does not comply with these terms.
title PLS-Based and Regularization-Based Methods for the Selection of Relevant Variables in Non-targeted Metabolomics Data
topic Molecular Biosciences