
Evaluation of Classifier Performance for Multiclass Phenotype Discrimination in Untargeted Metabolomics

Statistical classification is a critical component of utilizing metabolomics data for examining the molecular determinants of phenotypes. Despite this, a comprehensive and rigorous evaluation of the accuracy of classification techniques for phenotype discrimination given metabolomics data has not be...


Bibliographic Details
Main Authors:	Trainor, Patrick J., DeFilippis, Andrew P., Rai, Shesh N.
Format:	Online Article Text
Language:	English
Published:	MDPI 2017
Subjects:	Article
Online Access:	https://www.ncbi.nlm.nih.gov/pmc/articles/PMC5488001/ https://www.ncbi.nlm.nih.gov/pubmed/28635678 http://dx.doi.org/10.3390/metabo7020030

_version_	1783246568451735552
author	Trainor, Patrick J. DeFilippis, Andrew P. Rai, Shesh N.
author_sort	Trainor, Patrick J.
collection	PubMed
description	Statistical classification is a critical component of utilizing metabolomics data for examining the molecular determinants of phenotypes. Despite this, a comprehensive and rigorous evaluation of the accuracy of classification techniques for phenotype discrimination given metabolomics data has not been conducted. We conducted such an evaluation using both simulated and real metabolomics datasets, comparing Partial Least Squares-Discriminant Analysis (PLS-DA), Sparse PLS-DA (sPLS-DA), Random Forests, Support Vector Machines (SVM), Artificial Neural Networks, k-Nearest Neighbors (k-NN), and Naïve Bayes classification techniques for discrimination. We evaluated the techniques on simulated data generated to mimic global untargeted metabolomics data by incorporating realistic block-wise correlation and partial correlation structures that mimic the correlations and metabolite clustering generated by biological processes. Over the simulation studies, covariance structures, means, and effect sizes were stochastically varied to provide consistent estimates of classifier performance over a wide range of possible scenarios. The effects of non-normal error distributions, biological and technical outliers, unbalanced phenotype allocation, missing values due to abundances below a limit of detection, and prior-significance filtering (dimension reduction) were evaluated via simulation. In each simulation, classifier parameters, such as the number of hidden nodes in a Neural Network, were optimized by cross-validation to minimize the probability of detecting spurious results due to poorly tuned classifiers. Classifier performance was then evaluated using real metabolomics datasets of varying sample medium, sample size, and experimental design. We report that in the most realistic simulation studies, which incorporated non-normal error distributions, unbalanced phenotype allocation, outliers, missing values, and dimension reduction, classifier performance (least to greatest error) was ranked as follows: SVM, Random Forest, Naïve Bayes, sPLS-DA, Neural Networks, PLS-DA, and k-NN classifiers. When non-normal error distributions were introduced, the performance of PLS-DA and k-NN classifiers deteriorated further relative to the remaining techniques. Over the real datasets, SVM and Random Forest classifiers tended to perform better than the other techniques.
format	Online Article Text
id	pubmed-5488001
institution	National Center for Biotechnology Information
language	English
publishDate	2017
publisher	MDPI
record_format	MEDLINE/PubMed
spelling	pubmed-5488001 2017-06-30 Evaluation of Classifier Performance for Multiclass Phenotype Discrimination in Untargeted Metabolomics Trainor, Patrick J. DeFilippis, Andrew P. Rai, Shesh N. Metabolites Article MDPI 2017-06-21 /pmc/articles/PMC5488001/ /pubmed/28635678 http://dx.doi.org/10.3390/metabo7020030 Text en © 2017 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (http://creativecommons.org/licenses/by/4.0/).
spellingShingle	Article Trainor, Patrick J. DeFilippis, Andrew P. Rai, Shesh N. Evaluation of Classifier Performance for Multiclass Phenotype Discrimination in Untargeted Metabolomics
title	Evaluation of Classifier Performance for Multiclass Phenotype Discrimination in Untargeted Metabolomics
title_sort	evaluation of classifier performance for multiclass phenotype discrimination in untargeted metabolomics
topic	Article
url	https://www.ncbi.nlm.nih.gov/pmc/articles/PMC5488001/ https://www.ncbi.nlm.nih.gov/pubmed/28635678 http://dx.doi.org/10.3390/metabo7020030
work_keys_str_mv	AT trainorpatrickj evaluationofclassifierperformanceformulticlassphenotypediscriminationinuntargetedmetabolomics AT defilippisandrewp evaluationofclassifierperformanceformulticlassphenotypediscriminationinuntargetedmetabolomics AT raisheshn evaluationofclassifierperformanceformulticlassphenotypediscriminationinuntargetedmetabolomics
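
The description field of this record mentions two steps that a small example may make more concrete: simulating data with block-wise correlation, and choosing a classifier parameter by cross-validation. The following is a minimal sketch in plain Python, not the authors' code. The function names, the shared-latent-factor construction of the correlated blocks, the choice of k-NN as the tuned classifier, and all numeric settings (block size, correlation, mean shift, candidate values of k) are assumptions made for illustration only.

```python
import math
import random


def simulate(n_per_class=30, n_blocks=3, block_size=5, rho=0.6, shift=2.0, seed=0):
    """Simulate features in correlated blocks (hypothetical settings).

    Features in a block share one latent factor, which gives them a pairwise
    correlation of about rho. Class c has its mean shifted in block c only.
    """
    rng = random.Random(seed)
    X, y = [], []
    for label in range(n_blocks):
        for _sample in range(n_per_class):
            row = []
            for block in range(n_blocks):
                latent = rng.gauss(0, 1)  # shared within the block
                for _feature in range(block_size):
                    noise = rng.gauss(0, 1)
                    value = math.sqrt(rho) * latent + math.sqrt(1 - rho) * noise
                    row.append(value + (shift if block == label else 0.0))
            X.append(row)
            y.append(label)
    return X, y


def knn_predict(train_X, train_y, x, k):
    """Majority vote among the k nearest training points (Euclidean distance)."""
    nearest = sorted((math.dist(t, x), lab) for t, lab in zip(train_X, train_y))[:k]
    votes = {}
    for _dist, lab in nearest:
        votes[lab] = votes.get(lab, 0) + 1
    # Ties are broken in favour of the smallest label, so results are repeatable.
    return max(sorted(votes), key=votes.get)


def cv_accuracy(X, y, k, n_folds=5, seed=1):
    """Estimate accuracy of k-NN by n_folds-fold cross-validation."""
    idx = list(range(len(X)))
    random.Random(seed).shuffle(idx)
    folds = [idx[i::n_folds] for i in range(n_folds)]
    correct = 0
    for fold in folds:
        held_out = set(fold)
        train = [i for i in idx if i not in held_out]
        train_X = [X[i] for i in train]
        train_y = [y[i] for i in train]
        correct += sum(knn_predict(train_X, train_y, X[i], k) == y[i] for i in fold)
    return correct / len(X)


def tune_k(X, y, candidates=(1, 3, 5, 7)):
    """Pick the candidate k with the highest cross-validated accuracy."""
    scores = {k: cv_accuracy(X, y, k) for k in candidates}
    best = max(candidates, key=lambda k: scores[k])
    return best, scores


X, y = simulate()
best_k, scores = tune_k(X, y)
print("selected k:", best_k)
print("cross-validated accuracy by k:", scores)
```

The accuracy reported for the selected k is optimistically biased, because the same folds were used to choose it. An unbiased comparison across classifiers would need a separate held-out set or nested cross-validation.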
