Metabolomics variable selection and classification in the presence of observations below the detection limit using an extension of ERp

Bibliographic Details
Main Authors: van Reenen, Mari, Westerhuis, Johan A., Reinecke, Carolus J., Venter, J Hendrik
Format: Online Article Text
Language: English
Published: BioMed Central 2017
Subjects:
Online Access: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC5290706/
https://www.ncbi.nlm.nih.gov/pubmed/28153039
http://dx.doi.org/10.1186/s12859-017-1480-8
author van Reenen, Mari
Westerhuis, Johan A.
Reinecke, Carolus J.
Venter, J Hendrik
collection PubMed
description BACKGROUND: ERp is a variable selection and classification method for metabolomics data. ERp uses minimized classification error rates, based on data from a control and experimental group, to test the null hypothesis of no difference between the distributions of variables over the two groups. If the associated p-values are significant they indicate discriminatory variables (i.e. informative metabolites). The p-values are calculated assuming a common continuous strictly increasing cumulative distribution under the null hypothesis. This assumption is violated when zero-valued observations can occur with positive probability, a characteristic of GC-MS metabolomics data, disqualifying ERp in this context. This paper extends ERp to address two sources of zero-valued observations: (i) zeros reflecting the complete absence of a metabolite from a sample (true zeros); and (ii) zeros reflecting a measurement below the detection limit. This is achieved by allowing the null cumulative distribution function to take the form of a mixture between a jump at zero and a continuous strictly increasing function. The extended ERp approach is referred to as XERp.
RESULTS: XERp is no longer non-parametric, but its null distributions depend only on one parameter, the true proportion of zeros. Under the null hypothesis this parameter can be estimated by the proportion of zeros in the available data. XERp is shown to perform well with regard to bias and power. To demonstrate the utility of XERp, it is applied to GC-MS data from a metabolomics study on tuberculosis meningitis in infants and children. We find that XERp is able to provide an informative shortlist of discriminatory variables, while attaining satisfactory classification accuracy for new subjects in a leave-one-out cross-validation context.
CONCLUSION: XERp takes into account the distributional structure of data with a probability mass at zero without requiring any knowledge of the detection limit of the metabolomics platform. XERp is able to identify variables that discriminate between two groups by simultaneously extracting information from the difference in the proportion of zeros and shifts in the distributions of the non-zero observations. XERp uses simple rules to classify new subjects and a weight pair to adjust for unequal sample sizes or sensitivity and specificity requirements.
ELECTRONIC SUPPLEMENTARY MATERIAL: The online version of this article (doi:10.1186/s12859-017-1480-8) contains supplementary material, which is available to authorized users.
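The ideas in the abstract above can be illustrated with a minimal Python sketch. It is not the authors' implementation. It assumes the mixture null CDF has the form F(x) = pi0 + (1 - pi0) * G(x) for x >= 0, with pi0 the proportion of zeros and G a continuous strictly increasing CDF; it assumes the classification rules are single-threshold rules; and it swaps in a permutation test for the p-value in place of the null distributions derived in the paper. All function names and the example data are hypothetical.

```python
import random


def zero_proportion(control, experimental):
    """Pooled proportion of zeros: the estimate of the jump at zero under the null."""
    pooled = list(control) + list(experimental)
    return sum(1 for v in pooled if v == 0) / len(pooled)


def mixture_cdf(x, pi0, continuous_cdf):
    """Assumed mixture form: a jump of size pi0 at zero plus a continuous part G."""
    if x < 0:
        return 0.0
    return pi0 + (1.0 - pi0) * continuous_cdf(x)


def min_weighted_error(control, experimental, w=(0.5, 0.5)):
    """Smallest weighted error rate over threshold rules in both directions.

    w[0] weights misclassified controls, w[1] misclassified experimental
    subjects. The cut at 0 separates zeros from non-zero values, so a
    difference in the proportion of zeros can also be picked up.
    """
    best = 1.0
    for c in sorted(set(control) | set(experimental)):
        # Rule "up": call a subject experimental when its value exceeds c.
        err_ctrl = sum(v > c for v in control) / len(control)
        err_exp = sum(v <= c for v in experimental) / len(experimental)
        up = w[0] * err_ctrl + w[1] * err_exp
        # Rule "down" is the mirror image of rule "up".
        down = w[0] * (1 - err_ctrl) + w[1] * (1 - err_exp)
        best = min(best, up, down)
    return best


def permutation_p_value(control, experimental, w=(0.5, 0.5), n_perm=2000, seed=0):
    """Permutation p-value for the minimized error rate (a stand-in technique)."""
    rng = random.Random(seed)
    observed = min_weighted_error(control, experimental, w)
    pooled = list(control) + list(experimental)
    n_ctrl = len(control)
    hits = 0
    for _ in range(n_perm):
        rng.shuffle(pooled)
        permuted = min_weighted_error(pooled[:n_ctrl], pooled[n_ctrl:], w)
        if permuted <= observed + 1e-12:
            hits += 1
    return (hits + 1) / (n_perm + 1)


# Hypothetical intensities for one metabolite; zeros may be true zeros
# or values below the detection limit.
control = [0, 0, 0, 0.5, 0.7, 1.1]
experimental = [0, 2.0, 2.4, 3.1, 3.3, 4.0]
pi0_hat = zero_proportion(control, experimental)
error_rate = min_weighted_error(control, experimental)
p_value = permutation_p_value(control, experimental, n_perm=500)
```

A small p-value would flag the variable as discriminatory, and the threshold that attains the minimum error gives a simple rule for classifying a new subject. The weight pair `w` can be shifted away from (0.5, 0.5) to reflect unequal sample sizes or sensitivity and specificity requirements.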
format Online
Article
Text
id pubmed-5290706
institution National Center for Biotechnology Information
language English
publishDate 2017
publisher BioMed Central
record_format MEDLINE/PubMed
spelling pubmed-5290706 2017-02-07 BMC Bioinformatics Methodology Article BioMed Central 2017-02-02 /pmc/articles/PMC5290706/ /pubmed/28153039 http://dx.doi.org/10.1186/s12859-017-1480-8 Text en © The Author(s). 2017 Open Access. This article is distributed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated.
title Metabolomics variable selection and classification in the presence of observations below the detection limit using an extension of ERp
topic Methodology Article
url https://www.ncbi.nlm.nih.gov/pmc/articles/PMC5290706/
https://www.ncbi.nlm.nih.gov/pubmed/28153039
http://dx.doi.org/10.1186/s12859-017-1480-8