Metabolomics variable selection and classification in the presence of observations below the detection limit using an extension of ERp
BACKGROUND: ERp is a variable selection and classification method for metabolomics data. ERp uses minimized classification error rates, based on data from a control and experimental group, to test the null hypothesis of no difference between the distributions of variables over the two groups. If the...
Main authors: | van Reenen, Mari; Westerhuis, Johan A.; Reinecke, Carolus J.; Venter, J Hendrik |
---|---|
Format: | Online Article Text |
Language: | English |
Published: | BioMed Central 2017 |
Subjects: | Methodology Article |
Online access: | https://www.ncbi.nlm.nih.gov/pmc/articles/PMC5290706/ https://www.ncbi.nlm.nih.gov/pubmed/28153039 http://dx.doi.org/10.1186/s12859-017-1480-8 |
_version_ | 1782504688902471680 |
---|---|
author | van Reenen, Mari Westerhuis, Johan A. Reinecke, Carolus J. Venter, J Hendrik |
author_facet | van Reenen, Mari Westerhuis, Johan A. Reinecke, Carolus J. Venter, J Hendrik |
author_sort | van Reenen, Mari |
collection | PubMed |
description | BACKGROUND: ERp is a variable selection and classification method for metabolomics data. ERp uses minimized classification error rates, based on data from a control and experimental group, to test the null hypothesis of no difference between the distributions of variables over the two groups. If the associated p-values are significant they indicate discriminatory variables (i.e. informative metabolites). The p-values are calculated assuming a common continuous strictly increasing cumulative distribution under the null hypothesis. This assumption is violated when zero-valued observations can occur with positive probability, a characteristic of GC-MS metabolomics data, disqualifying ERp in this context. This paper extends ERp to address two sources of zero-valued observations: (i) zeros reflecting the complete absence of a metabolite from a sample (true zeros); and (ii) zeros reflecting a measurement below the detection limit. This is achieved by allowing the null cumulative distribution function to take the form of a mixture between a jump at zero and a continuous strictly increasing function. The extended ERp approach is referred to as XERp. RESULTS: XERp is no longer non-parametric, but its null distributions depend only on one parameter, the true proportion of zeros. Under the null hypothesis this parameter can be estimated by the proportion of zeros in the available data. XERp is shown to perform well with regard to bias and power. To demonstrate the utility of XERp, it is applied to GC-MS data from a metabolomics study on tuberculosis meningitis in infants and children. We find that XERp is able to provide an informative shortlist of discriminatory variables, while attaining satisfactory classification accuracy for new subjects in a leave-one-out cross-validation context. CONCLUSION: XERp takes into account the distributional structure of data with a probability mass at zero without requiring any knowledge of the detection limit of the metabolomics platform. XERp is able to identify variables that discriminate between two groups by simultaneously extracting information from the difference in the proportion of zeros and shifts in the distributions of the non-zero observations. XERp uses simple rules to classify new subjects and a weight pair to adjust for unequal sample sizes or sensitivity and specificity requirements. ELECTRONIC SUPPLEMENTARY MATERIAL: The online version of this article (doi:10.1186/s12859-017-1480-8) contains supplementary material, which is available to authorized users. |
format | Online Article Text |
id | pubmed-5290706 |
institution | National Center for Biotechnology Information |
language | English |
publishDate | 2017 |
publisher | BioMed Central |
record_format | MEDLINE/PubMed |
spelling | pubmed-5290706 2017-02-07 Metabolomics variable selection and classification in the presence of observations below the detection limit using an extension of ERp van Reenen, Mari Westerhuis, Johan A. Reinecke, Carolus J. Venter, J Hendrik BMC Bioinformatics Methodology Article BACKGROUND: ERp is a variable selection and classification method for metabolomics data. ERp uses minimized classification error rates, based on data from a control and experimental group, to test the null hypothesis of no difference between the distributions of variables over the two groups. If the associated p-values are significant they indicate discriminatory variables (i.e. informative metabolites). The p-values are calculated assuming a common continuous strictly increasing cumulative distribution under the null hypothesis. This assumption is violated when zero-valued observations can occur with positive probability, a characteristic of GC-MS metabolomics data, disqualifying ERp in this context. This paper extends ERp to address two sources of zero-valued observations: (i) zeros reflecting the complete absence of a metabolite from a sample (true zeros); and (ii) zeros reflecting a measurement below the detection limit. This is achieved by allowing the null cumulative distribution function to take the form of a mixture between a jump at zero and a continuous strictly increasing function. The extended ERp approach is referred to as XERp. RESULTS: XERp is no longer non-parametric, but its null distributions depend only on one parameter, the true proportion of zeros. Under the null hypothesis this parameter can be estimated by the proportion of zeros in the available data. XERp is shown to perform well with regard to bias and power. To demonstrate the utility of XERp, it is applied to GC-MS data from a metabolomics study on tuberculosis meningitis in infants and children. We find that XERp is able to provide an informative shortlist of discriminatory variables, while attaining satisfactory classification accuracy for new subjects in a leave-one-out cross-validation context. CONCLUSION: XERp takes into account the distributional structure of data with a probability mass at zero without requiring any knowledge of the detection limit of the metabolomics platform. XERp is able to identify variables that discriminate between two groups by simultaneously extracting information from the difference in the proportion of zeros and shifts in the distributions of the non-zero observations. XERp uses simple rules to classify new subjects and a weight pair to adjust for unequal sample sizes or sensitivity and specificity requirements. ELECTRONIC SUPPLEMENTARY MATERIAL: The online version of this article (doi:10.1186/s12859-017-1480-8) contains supplementary material, which is available to authorized users. BioMed Central 2017-02-02 /pmc/articles/PMC5290706/ /pubmed/28153039 http://dx.doi.org/10.1186/s12859-017-1480-8 Text en © The Author(s). 2017 Open Access This article is distributed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated. |
spellingShingle | Methodology Article van Reenen, Mari Westerhuis, Johan A. Reinecke, Carolus J. Venter, J Hendrik Metabolomics variable selection and classification in the presence of observations below the detection limit using an extension of ERp |
title | Metabolomics variable selection and classification in the presence of observations below the detection limit using an extension of ERp |
topic | Methodology Article |
url | https://www.ncbi.nlm.nih.gov/pmc/articles/PMC5290706/ https://www.ncbi.nlm.nih.gov/pubmed/28153039 http://dx.doi.org/10.1186/s12859-017-1480-8 |
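
The abstract names two ingredients that a small example may make more concrete: the proportion of zeros, estimated under the null hypothesis from the available data, and a classification error rate minimized over a simple rule with a weight pair. The Python sketch below is illustrative only and is not the authors' XERp procedure: it does not compute p-values, the function names `pooled_zero_proportion` and `min_weighted_error_threshold` are hypothetical, pooling both groups to estimate the zero proportion is an assumption, and the rule "classify as experimental when the value exceeds a cut-off" is an assumed form chosen for simplicity.

```python
from typing import Optional, Sequence, Tuple


def pooled_zero_proportion(control: Sequence[float],
                           experimental: Sequence[float]) -> float:
    """Share of zero-valued observations in the pooled data (assumed estimator)."""
    values = list(control) + list(experimental)
    if not values:
        raise ValueError("at least one observation is required")
    return sum(1 for v in values if v == 0) / len(values)


def min_weighted_error_threshold(
    control: Sequence[float],
    experimental: Sequence[float],
    w_control: float = 0.5,
    w_experimental: float = 0.5,
) -> Tuple[Optional[float], float]:
    """Find the cut-off c that minimizes the weighted error of the assumed
    rule 'classify as experimental if x > c'.

    Only observed values are tried as cut-offs, and only an upward shift in
    the experimental group is considered; both are simplifications.
    """
    if not control or not experimental:
        raise ValueError("both groups need at least one observation")
    best_c: Optional[float] = None
    best_err = float("inf")
    for c in sorted(set(control) | set(experimental)):
        # Controls above the cut-off are misclassified as experimental.
        err_control = sum(1 for x in control if x > c) / len(control)
        # Experimental subjects at or below the cut-off are misclassified.
        err_experimental = sum(1 for x in experimental if x <= c) / len(experimental)
        err = w_control * err_control + w_experimental * err_experimental
        if err < best_err:
            best_c, best_err = c, err
    return best_c, best_err


if __name__ == "__main__":
    # Made-up numbers for illustration; zeros stand for absent or undetected metabolites.
    control = [0, 0, 0, 1.2, 0.8]
    experimental = [0, 2.5, 3.1, 2.9, 4.0]
    print(pooled_zero_proportion(control, experimental))
    print(min_weighted_error_threshold(control, experimental))
```

The weight arguments stand in for the weight pair mentioned in the abstract: shifting weight toward one group penalizes its misclassifications more heavily, which is one way to reflect unequal sample sizes or sensitivity and specificity requirements.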