
A comparison of machine learning and Bayesian modelling for molecular serotyping

BACKGROUND: Streptococcus pneumoniae is a human pathogen that is a major cause of infant mortality. Identifying the pneumococcal serotype is an important step in monitoring the impact of vaccines used to protect against disease. Genomic microarrays provide an effective method for molecular serotypin...


Bibliographic Details
Main Authors:	Newton, Richard; Wernisch, Lorenz
Format:	Online Article Text
Language:	English
Published:	BioMed Central 2017
Subjects:	Research Article
Online Access:	https://www.ncbi.nlm.nih.gov/pmc/articles/PMC5553679/ https://www.ncbi.nlm.nih.gov/pubmed/28800724 http://dx.doi.org/10.1186/s12864-017-3998-6

author	Newton, Richard; Wernisch, Lorenz
collection	PubMed
description	BACKGROUND: Streptococcus pneumoniae is a human pathogen that is a major cause of infant mortality. Identifying the pneumococcal serotype is an important step in monitoring the impact of vaccines used to protect against disease. Genomic microarrays provide an effective method for molecular serotyping. Previously we developed an empirical Bayesian model for the classification of serotypes from a molecular serotyping array. With only a few samples available, a model-driven approach was the only option. In the meantime, several thousand samples have been made available to us, providing an opportunity to investigate serotype classification by machine learning methods, which could complement the Bayesian model. RESULTS: We compare the performance of the original Bayesian model with two machine learning algorithms: Gradient Boosting Machines and Random Forests. We present our results as an example of a generic strategy whereby a preliminary probabilistic model is complemented or replaced by a machine learning classifier once enough data are available. Despite the availability of thousands of serotyping arrays, a problem encountered when applying machine learning methods is the lack of training data containing mixtures of serotypes, owing to the large number of possible combinations. Most of the available training data comprise samples with only a single serotype. To overcome the lack of training data we implemented an iterative analysis, creating artificial training data of serotype mixtures by combining raw data from single-serotype arrays. CONCLUSIONS: With the enhanced training set the machine learning algorithms outperform the original Bayesian model. However, for serotypes currently lacking sufficient training data the best-performing implementation was a combination of the results of the Bayesian model and the Gradient Boosting Machine. As well as being an effective method for classifying biological data, machine learning can also be used as an efficient method for revealing subtle biological insights, which we illustrate with an example.
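The abstract's idea of building artificial mixture training data from single-serotype arrays can be sketched roughly as follows. This is only an illustration under stated assumptions, not the authors' procedure: the record does not say how raw data were combined, so the weighted sum of probe intensities, the restriction to two-serotype mixtures, the weight range, and every name here (`make_mixture_examples`, `singles`, `single_labels`) are hypothetical.

```python
import numpy as np


def make_mixture_examples(intensities, labels, n_mixtures, n_serotypes, rng):
    """Build synthetic two-serotype arrays from single-serotype arrays.

    intensities: (n_arrays, n_probes) raw probe intensities, one serotype per array.
    labels:      (n_arrays,) integer serotype index for each array.
    Returns (X, Y): mixed intensities and multi-hot label rows.

    ASSUMPTION: a mixture is modelled as a convex combination of two arrays;
    the source does not specify the combination rule.
    """
    n_arrays, n_probes = intensities.shape
    X = np.empty((n_mixtures, n_probes))
    Y = np.zeros((n_mixtures, n_serotypes), dtype=int)
    for k in range(n_mixtures):
        i = rng.integers(n_arrays)
        # Pick a partner array that carries a different serotype.
        partners = np.flatnonzero(labels != labels[i])
        j = rng.choice(partners)
        w = rng.uniform(0.2, 0.8)  # hypothetical mixing proportion
        X[k] = w * intensities[i] + (1.0 - w) * intensities[j]
        Y[k, labels[i]] = 1
        Y[k, labels[j]] = 1
    return X, Y


# Toy data standing in for real arrays: 6 arrays, 5 probes, 3 serotypes.
rng = np.random.default_rng(0)
singles = rng.gamma(2.0, 100.0, size=(6, 5))
single_labels = np.array([0, 0, 1, 1, 2, 2])
X_mix, Y_mix = make_mixture_examples(singles, single_labels, 10, 3, rng)
```

The multi-hot rows in `Y_mix` could then be appended to the single-serotype training set before fitting a multi-label classifier; which classifier and how many iterations to run are left open here.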
format	Online Article Text
id	pubmed-5553679
institution	National Center for Biotechnology Information
language	English
publishDate	2017
publisher	BioMed Central
record_format	MEDLINE/PubMed
spelling	pubmed-5553679 2017-08-15. A comparison of machine learning and Bayesian modelling for molecular serotyping. Newton, Richard; Wernisch, Lorenz. BMC Genomics. Research Article. BioMed Central 2017-08-11. /pmc/articles/PMC5553679/ /pubmed/28800724 http://dx.doi.org/10.1186/s12864-017-3998-6 Text en. © The Author(s) 2017. Open Access: This article is distributed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated.
title	A comparison of machine learning and Bayesian modelling for molecular serotyping
topic	Research Article
url	https://www.ncbi.nlm.nih.gov/pmc/articles/PMC5553679/ https://www.ncbi.nlm.nih.gov/pubmed/28800724 http://dx.doi.org/10.1186/s12864-017-3998-6
