
Variable selection for binary classification using error rate p-values applied to metabolomics data

BACKGROUND: Metabolomics datasets are often high-dimensional though only a limited number of variables are expected to be informative given a specific research question. The important task of selecting informative variables can therefore become complex. In this paper we look at discriminating betwee...


Bibliographic Details
Main authors:	van Reenen, Mari; Reinecke, Carolus J.; Westerhuis, Johan A.; Venter, J. Hendrik
Format:	Online Article Text
Language:	English
Published:	BioMed Central 2016
Subjects:	Methodology Article
Online access:	https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4712617/ https://www.ncbi.nlm.nih.gov/pubmed/26763892 http://dx.doi.org/10.1186/s12859-015-0867-7

_version_	1782410100682522624
author	van Reenen, Mari Reinecke, Carolus J. Westerhuis, Johan A. Venter, J. Hendrik
author_facet	van Reenen, Mari Reinecke, Carolus J. Westerhuis, Johan A. Venter, J. Hendrik
author_sort	van Reenen, Mari
collection	PubMed
description	BACKGROUND: Metabolomics datasets are often high-dimensional though only a limited number of variables are expected to be informative given a specific research question. The important task of selecting informative variables can therefore become complex. In this paper we look at discriminating between two groups. Two tasks need to be performed: (i) finding variables which differ between the two groups; and (ii) determining how the selected variables can be used to classify new subjects. We introduce an approach using minimum classification error rates as test statistics to find discriminatory and therefore informative variables. The thresholds resulting in the minimum error rates can be used to classify new subjects. This approach transforms error rates into p-values and is referred to as ERp. RESULTS: We show that non-parametric hypothesis testing, based on minimum classification error rates as test statistics, can find statistically significantly shifted variables. The discriminatory ability of variables becomes more apparent when error rates are evaluated based on their corresponding p-values, as relatively high error rates can still be statistically significant. ERp can handle unequal and small group sizes, as well as account for the cost of misclassification. ERp retains (if known) or reveals (if unknown) the shift direction, aiding in biological interpretation. The threshold resulting in the minimum error rate can immediately be used to classify new subjects. We use NMR generated metabolomics data to illustrate how ERp is able to discriminate subjects diagnosed with Mycobacterium tuberculosis infected meningitis from a control group. The list of discriminatory variables produced by ERp contains all biologically relevant variables with appropriate shift directions discussed in the original paper from which this data is taken. 
CONCLUSIONS: ERp performs variable selection and classification, is non-parametric and aids biological interpretation while handling unequal group sizes and misclassification costs. All this is achieved by a single approach which is easy to perform and interpret. ERp has the potential to address many other characteristics of metabolomics data. Future research aims to extend ERp to account for a large proportion of observations below the detection limit, as well as expand on interactions between variables. ELECTRONIC SUPPLEMENTARY MATERIAL: The online version of this article (doi:10.1186/s12859-015-0867-7) contains supplementary material, which is available to authorized users.
format	Online Article Text
id	pubmed-4712617
institution	National Center for Biotechnology Information
language	English
publishDate	2016
publisher	BioMed Central
record_format	MEDLINE/PubMed
spelling	pubmed-4712617 2016-01-15 Variable selection for binary classification using error rate p-values applied to metabolomics data van Reenen, Mari; Reinecke, Carolus J.; Westerhuis, Johan A.; Venter, J. Hendrik BMC Bioinformatics Methodology Article BioMed Central 2016-01-14 /pmc/articles/PMC4712617/ /pubmed/26763892 http://dx.doi.org/10.1186/s12859-015-0867-7 Text en © van Reenen et al. 2016 Open Access This article is distributed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated.
spellingShingle	Methodology Article van Reenen, Mari Reinecke, Carolus J. Westerhuis, Johan A. Venter, J. Hendrik Variable selection for binary classification using error rate p-values applied to metabolomics data
title	Variable selection for binary classification using error rate p-values applied to metabolomics data
topic	Methodology Article
url	https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4712617/ https://www.ncbi.nlm.nih.gov/pubmed/26763892 http://dx.doi.org/10.1186/s12859-015-0867-7
work_keys_str_mv	AT vanreenenmari variableselectionforbinaryclassificationusingerrorratepvaluesappliedtometabolomicsdata AT reineckecarolusj variableselectionforbinaryclassificationusingerrorratepvaluesappliedtometabolomicsdata AT westerhuisjohana variableselectionforbinaryclassificationusingerrorratepvaluesappliedtometabolomicsdata AT venterjhendrik variableselectionforbinaryclassificationusingerrorratepvaluesappliedtometabolomicsdata
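
The abstract describes using the minimum classification error rate of a single variable as a test statistic, converting it to a p-value, and reusing the error-minimizing threshold to classify new subjects. The sketch below illustrates that idea under stated assumptions and is not the authors' implementation: the record does not say how the null distribution is obtained, so a plain permutation test is swapped in here, and the function names (`min_error_rate`, `erp_p_value`), the "up"/"down" direction labels, and the choice of candidate thresholds are all hypothetical. Misclassification costs, which the abstract says ERp can account for, are left out.

```python
import random


def min_error_rate(controls, cases):
    """Return (error_rate, threshold, direction) for the best one-threshold rule.

    direction "up"   : classify as case when x > threshold
    direction "down" : classify as case when x <= threshold
    """
    values = sorted(set(controls) | set(cases))
    # Candidate thresholds (an assumption): midpoints between consecutive
    # distinct values, plus one point beyond each end of the observed range.
    candidates = (
        [values[0] - 1.0]
        + [(a + b) / 2 for a, b in zip(values, values[1:])]
        + [values[-1] + 1.0]
    )
    n = len(controls) + len(cases)
    best = None
    for t in candidates:
        # Errors of the "up" rule: controls above t plus cases at or below t.
        errors_up = sum(x > t for x in controls) + sum(x <= t for x in cases)
        # The "down" rule misclassifies exactly the subjects "up" gets right.
        for direction, errors in (("up", errors_up), ("down", n - errors_up)):
            rate = errors / n
            if best is None or rate < best[0]:
                best = (rate, t, direction)
    return best


def erp_p_value(controls, cases, n_perm=2000, seed=0):
    """Permutation p-value for the observed minimum error rate (illustrative)."""
    rng = random.Random(seed)
    observed, threshold, direction = min_error_rate(controls, cases)
    pooled = list(controls) + list(cases)
    k = len(controls)
    hits = 0
    for _ in range(n_perm):
        rng.shuffle(pooled)  # break any link between group label and value
        rate = min_error_rate(pooled[:k], pooled[k:])[0]
        if rate <= observed:  # a small error rate is the extreme outcome
            hits += 1
    # Add-one correction keeps the estimated p-value strictly positive.
    p_value = (hits + 1) / (n_perm + 1)
    return observed, threshold, direction, p_value


def classify(x, threshold, direction):
    """Apply the selected threshold rule to a new subject's value."""
    is_case = x > threshold if direction == "up" else x <= threshold
    return "case" if is_case else "control"
```

Calling `erp_p_value(control_values, case_values)` on one variable returns the minimum error rate, the threshold and shift direction that achieve it, and the permutation p-value; `classify` then applies that threshold to a new observation. Because the permutation step only reshuffles group labels, the same code runs unchanged on unequal or small groups.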
