Greedy feature selection for glycan chromatography data with the generalized Dirichlet distribution

BACKGROUND: Glycoproteins are involved in a diverse range of biochemical and biological processes. Changes in protein glycosylation are believed to occur in many diseases, particularly during cancer initiation and progression. The identification of biomarkers for human disease states is becoming inc...

Bibliographic Details
Main Authors: Galligan, Marie C, Saldova, Radka, Campbell, Matthew P, Rudd, Pauline M, Murphy, Thomas B
Format: Online Article Text
Language: English
Published: BioMed Central 2013
Subjects:
Online Access: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC3703279/
https://www.ncbi.nlm.nih.gov/pubmed/23651459
http://dx.doi.org/10.1186/1471-2105-14-155
_version_ 1782275881388998656
author Galligan, Marie C
Saldova, Radka
Campbell, Matthew P
Rudd, Pauline M
Murphy, Thomas B
author_facet Galligan, Marie C
Saldova, Radka
Campbell, Matthew P
Rudd, Pauline M
Murphy, Thomas B
author_sort Galligan, Marie C
collection PubMed
description BACKGROUND: Glycoproteins are involved in a diverse range of biochemical and biological processes. Changes in protein glycosylation are believed to occur in many diseases, particularly during cancer initiation and progression. The identification of biomarkers for human disease states is becoming increasingly important, as early detection is key to improving survival and recovery rates. To this end, the serum glycome has been proposed as a potential source of biomarkers for different types of cancers. High-throughput hydrophilic interaction liquid chromatography (HILIC) technology for glycan analysis allows for the detailed quantification of the glycan content in human serum. However, the experimental data from this analysis is compositional by nature. Compositional data are subject to a constant-sum constraint, which restricts the sample space to a simplex. Statistical analysis of glycan chromatography datasets should account for their unusual mathematical properties. As the volume of glycan HILIC data being produced increases, there is a considerable need for a framework to support appropriate statistical analysis. Proposed here is a methodology for feature selection in compositional data. The principal objective is to provide a template for the analysis of glycan chromatography data that may be used to identify potential glycan biomarkers. RESULTS: A greedy search algorithm, based on the generalized Dirichlet distribution, is carried out over the feature space to search for the set of “grouping variables” that best discriminate between known group structures in the data, modelling the compositional variables using beta distributions. The algorithm is applied to two glycan chromatography datasets. Statistical classification methods are used to test the ability of the selected features to differentiate between known groups in the data. 
Two well-known methods are used for comparison: correlation-based feature selection (CFS) and recursive partitioning (rpart). CFS is a feature selection method, while recursive partitioning is a learning tree algorithm that has been used for feature selection in the past. CONCLUSIONS: The proposed feature selection method performs well for both glycan chromatography datasets. It is computationally slower, but results in a lower misclassification rate and a higher sensitivity rate than both correlation-based feature selection and the classification tree method.
format Online
Article
Text
id pubmed-3703279
institution National Center for Biotechnology Information
language English
publishDate 2013
publisher BioMed Central
record_format MEDLINE/PubMed
spelling pubmed-37032792013-07-10 Greedy feature selection for glycan chromatography data with the generalized Dirichlet distribution Galligan, Marie C Saldova, Radka Campbell, Matthew P Rudd, Pauline M Murphy, Thomas B BMC Bioinformatics Methodology Article BACKGROUND: Glycoproteins are involved in a diverse range of biochemical and biological processes. Changes in protein glycosylation are believed to occur in many diseases, particularly during cancer initiation and progression. The identification of biomarkers for human disease states is becoming increasingly important, as early detection is key to improving survival and recovery rates. To this end, the serum glycome has been proposed as a potential source of biomarkers for different types of cancers. High-throughput hydrophilic interaction liquid chromatography (HILIC) technology for glycan analysis allows for the detailed quantification of the glycan content in human serum. However, the experimental data from this analysis is compositional by nature. Compositional data are subject to a constant-sum constraint, which restricts the sample space to a simplex. Statistical analysis of glycan chromatography datasets should account for their unusual mathematical properties. As the volume of glycan HILIC data being produced increases, there is a considerable need for a framework to support appropriate statistical analysis. Proposed here is a methodology for feature selection in compositional data. The principal objective is to provide a template for the analysis of glycan chromatography data that may be used to identify potential glycan biomarkers. RESULTS: A greedy search algorithm, based on the generalized Dirichlet distribution, is carried out over the feature space to search for the set of “grouping variables” that best discriminate between known group structures in the data, modelling the compositional variables using beta distributions. The algorithm is applied to two glycan chromatography datasets. 
Statistical classification methods are used to test the ability of the selected features to differentiate between known groups in the data. Two well-known methods are used for comparison: correlation-based feature selection (CFS) and recursive partitioning (rpart). CFS is a feature selection method, while recursive partitioning is a learning tree algorithm that has been used for feature selection in the past. CONCLUSIONS: The proposed feature selection method performs well for both glycan chromatography datasets. It is computationally slower, but results in a lower misclassification rate and a higher sensitivity rate than both correlation-based feature selection and the classification tree method. BioMed Central 2013-05-07 /pmc/articles/PMC3703279/ /pubmed/23651459 http://dx.doi.org/10.1186/1471-2105-14-155 Text en Copyright © 2013 Galligan et al.; licensee BioMed Central Ltd. http://creativecommons.org/licenses/by/2.0 This is an Open Access article distributed under the terms of the Creative Commons Attribution License ( http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.
spellingShingle Methodology Article
Galligan, Marie C
Saldova, Radka
Campbell, Matthew P
Rudd, Pauline M
Murphy, Thomas B
Greedy feature selection for glycan chromatography data with the generalized Dirichlet distribution
title Greedy feature selection for glycan chromatography data with the generalized Dirichlet distribution
title_full Greedy feature selection for glycan chromatography data with the generalized Dirichlet distribution
title_fullStr Greedy feature selection for glycan chromatography data with the generalized Dirichlet distribution
title_full_unstemmed Greedy feature selection for glycan chromatography data with the generalized Dirichlet distribution
title_short Greedy feature selection for glycan chromatography data with the generalized Dirichlet distribution
title_sort greedy feature selection for glycan chromatography data with the generalized dirichlet distribution
topic Methodology Article
url https://www.ncbi.nlm.nih.gov/pmc/articles/PMC3703279/
https://www.ncbi.nlm.nih.gov/pubmed/23651459
http://dx.doi.org/10.1186/1471-2105-14-155
work_keys_str_mv AT galliganmariec greedyfeatureselectionforglycanchromatographydatawiththegeneralizeddirichletdistribution
AT saldovaradka greedyfeatureselectionforglycanchromatographydatawiththegeneralizeddirichletdistribution
AT campbellmatthewp greedyfeatureselectionforglycanchromatographydatawiththegeneralizeddirichletdistribution
AT ruddpaulinem greedyfeatureselectionforglycanchromatographydatawiththegeneralizeddirichletdistribution
AT murphythomasb greedyfeatureselectionforglycanchromatographydatawiththegeneralizeddirichletdistribution