A Bayesian network approach to feature selection in mass spectrometry data

Bibliographic Details
Main authors: Kuschner, Karl W; Malyarenko, Dariya I; Cooke, William E; Cazares, Lisa H; Semmes, OJ; Tracy, Eugene R
Format: Text
Language: English
Published: BioMed Central 2010
Subjects: Methodology Article
Online access: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC3098056/
https://www.ncbi.nlm.nih.gov/pubmed/20377906
http://dx.doi.org/10.1186/1471-2105-11-177
author Kuschner, Karl W
Malyarenko, Dariya I
Cooke, William E
Cazares, Lisa H
Semmes, OJ
Tracy, Eugene R
collection PubMed
description BACKGROUND: Time-of-flight mass spectrometry (TOF-MS) has the potential to provide non-invasive, high-throughput screening for cancers and other serious diseases via detection of protein biomarkers in blood or other accessible biologic samples. Unfortunately, this potential has largely been unrealized to date due to the high variability of measurements, uncertainties in the distribution of proteins in a given population, and the difficulty of extracting repeatable diagnostic markers using current statistical tools. With studies consisting of perhaps only dozens of samples and possibly hundreds of variables, overfitting is a serious complication. To overcome these difficulties, we have developed a Bayesian inductive method that uses model-independent techniques to discover relationships between spectral features. This method appears to efficiently discover network models that not only identify connections between the disease and key features but also organize relationships between features, and it furthermore creates a stable classifier that categorizes new data at predicted error rates.
RESULTS: The method was applied to artificial data with known feature relationships and typical TOF-MS variability introduced, and was able to recover those relationships nearly perfectly. It was also applied to blood sera data from a 2004 leukemia study, and showed high stability of selected features under cross-validation. Verification of results using withheld data showed excellent predictive power. The method showed improvement over traditional techniques, and naturally incorporated measurement uncertainties. The relationships discovered between features allowed preliminary identification of a protein biomarker which was consistent with other cancer studies and later verified experimentally.
CONCLUSIONS: This method appears to avoid overfitting in biologic data and produce stable feature sets in a network model. The network structure provides additional information about the relationships among features that is useful to guide further biochemical analysis. In addition, when used to classify new data, these feature sets are far more consistent than those produced by many traditional techniques.
format Text
id pubmed-3098056
institution National Center for Biotechnology Information
language English
publishDate 2010
publisher BioMed Central
record_format MEDLINE/PubMed
spelling pubmed-3098056 2011-05-20
A Bayesian network approach to feature selection in mass spectrometry data
Kuschner, Karl W; Malyarenko, Dariya I; Cooke, William E; Cazares, Lisa H; Semmes, OJ; Tracy, Eugene R
BMC Bioinformatics
Methodology Article
BioMed Central 2010-04-08
/pmc/articles/PMC3098056/ /pubmed/20377906 http://dx.doi.org/10.1186/1471-2105-11-177
Text en
Copyright ©2010 Kuschner et al; licensee BioMed Central Ltd. http://creativecommons.org/licenses/by/2.0 This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.
title A Bayesian network approach to feature selection in mass spectrometry data
topic Methodology Article
url https://www.ncbi.nlm.nih.gov/pmc/articles/PMC3098056/
https://www.ncbi.nlm.nih.gov/pubmed/20377906
http://dx.doi.org/10.1186/1471-2105-11-177