
The influence of negative training set size on machine learning-based virtual screening

BACKGROUND: The paper presents a thorough analysis of the influence of the number of negative training examples on the performance of machine learning methods. RESULTS: The impact of this rather neglected aspect of machine learning methods application was examined for sets containing a fixed number...


Bibliographic Details
Main authors:	Kurczab, Rafał; Smusz, Sabina; Bojarski, Andrzej J
Format:	Online Article Text
Language:	English
Published:	BioMed Central 2014
Subjects:	Research Article
Online access:	https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4061540/ https://www.ncbi.nlm.nih.gov/pubmed/24976867 http://dx.doi.org/10.1186/1758-2946-6-32

_version_	1782321511482261504
author	Kurczab, Rafał; Smusz, Sabina; Bojarski, Andrzej J
author_facet	Kurczab, Rafał; Smusz, Sabina; Bojarski, Andrzej J
author_sort	Kurczab, Rafał
collection	PubMed
description	BACKGROUND: The paper presents a thorough analysis of the influence of the number of negative training examples on the performance of machine learning methods. RESULTS: The impact of this rather neglected aspect of the application of machine learning methods was examined for sets containing a fixed number of positive and a varying number of negative examples randomly selected from the ZINC database. An increase in the ratio of positive to negative training instances was found to greatly influence most of the investigated evaluation parameters of ML methods in simulated virtual screening experiments. In a majority of cases, substantial increases in precision and MCC were observed in conjunction with some decreases in hit recall. The analysis of the dynamics of those variations allowed us to recommend an optimal composition of training data. The study was performed on several protein targets, 5 machine learning algorithms (SMO, Naïve Bayes, IBk, J48 and Random Forest) and 2 types of molecular fingerprints (MACCS and CDK FP). The most effective classification was provided by the combination of CDK FP with SMO or Random Forest algorithms. The Naïve Bayes models appeared to be hardly sensitive to changes in the number of negative instances in the training set. CONCLUSIONS: In conclusion, the ratio of positive to negative training instances should be taken into account during the preparation of machine learning experiments, as it might significantly influence the performance of a particular classifier. What is more, the optimization of negative training set size can be applied as a boosting-like approach in machine learning-based virtual screening.
format	Online Article Text
id	pubmed-4061540
institution	National Center for Biotechnology Information
language	English
publishDate	2014
publisher	BioMed Central
record_format	MEDLINE/PubMed
spelling	pubmed-4061540 2014-06-27 The influence of negative training set size on machine learning-based virtual screening Kurczab, Rafał Smusz, Sabina Bojarski, Andrzej J J Cheminform Research Article
BioMed Central 2014-06-11 /pmc/articles/PMC4061540/ /pubmed/24976867 http://dx.doi.org/10.1186/1758-2946-6-32 Text en Copyright © 2014 Kurczab et al.; licensee Chemistry Central Ltd. http://creativecommons.org/licenses/by/2.0 This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly credited. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated.
title	The influence of negative training set size on machine learning-based virtual screening
topic	Research Article
url	https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4061540/ https://www.ncbi.nlm.nih.gov/pubmed/24976867 http://dx.doi.org/10.1186/1758-2946-6-32
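
The description above reports precision, hit recall and MCC (Matthews correlation coefficient) as the evaluation parameters. As a minimal sketch of how these three measures follow from confusion-matrix counts, the Python snippet below may help; the function name `screening_metrics` and the counts passed to it are hypothetical illustrations, not values or code from the article.

```python
from math import sqrt


def screening_metrics(tp, fp, tn, fn):
    """Return (precision, recall, mcc) from confusion-matrix counts.

    tp/fn: actives predicted active/inactive.
    fp/tn: inactives predicted active/inactive.
    Undefined ratios (zero denominator) are reported as 0.0.
    """
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    denom = sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    mcc = (tp * tn - fp * fn) / denom if denom else 0.0
    return precision, recall, mcc


# Hypothetical counts chosen only to illustrate the arithmetic.
precision, recall, mcc = screening_metrics(tp=80, fp=20, tn=880, fn=20)
print(f"precision={precision:.3f} recall={recall:.3f} MCC={mcc:.3f}")
```

Because MCC uses all four counts, it responds to changes in the number of negatives even when precision and recall are looked at separately, which is presumably why it is reported alongside them.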
