The influence of the negative-positive ratio and screening database size on the performance of machine learning-based virtual screening

The machine learning-based virtual screening of molecular databases is a commonly used approach to identify hits. However, many aspects associated with training predictive models can influence the final performance and, consequently, the number of hits found. Thus, we performed a systematic study of...

Bibliographic Details
Main authors:	Kurczab, Rafał; Bojarski, Andrzej J.
Format:	Online Article Text
Language:	English
Published:	Public Library of Science 2017
Subjects:	Research Article
Online access:	https://www.ncbi.nlm.nih.gov/pmc/articles/PMC5383296/ https://www.ncbi.nlm.nih.gov/pubmed/28384344 http://dx.doi.org/10.1371/journal.pone.0175410

author	Kurczab, Rafał; Bojarski, Andrzej J.
collection	PubMed
description	The machine learning-based virtual screening of molecular databases is a commonly used approach to identify hits. However, many aspects associated with training predictive models can influence the final performance and, consequently, the number of hits found. Thus, we performed a systematic study of the simultaneous influence of the proportion of negatives to positives in the testing set, the size of screening databases and the type of molecular representations on the effectiveness of classification. The results obtained for eight protein targets, five machine learning algorithms (SMO, Naïve Bayes, Ibk, J48 and Random Forest), two types of molecular fingerprints (MACCS and CDK FP) and eight screening databases with different numbers of molecules confirmed our previous findings that increases in the ratio of negative to positive training instances greatly influenced most of the investigated parameters of the ML methods in simulated virtual screening experiments. However, the performance of screening was shown to also be highly dependent on the molecular library dimension. Generally, with the increasing size of the screened database, the optimal training ratio also increased, and this ratio can be rationalized using the proposed cost-effectiveness threshold approach. To increase the performance of machine learning-based virtual screening, the training set should be constructed in a way that considers the size of the screening database.
format	Online Article Text
id	pubmed-5383296
institution	National Center for Biotechnology Information
language	English
publishDate	2017
publisher	Public Library of Science
record_format	MEDLINE/PubMed
spelling	pubmed-5383296 2017-05-03 PLoS One Research Article Public Library of Science 2017-04-06 /pmc/articles/PMC5383296/ /pubmed/28384344 http://dx.doi.org/10.1371/journal.pone.0175410 Text en © 2017 Kurczab, Bojarski. This is an open access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.
title	The influence of the negative-positive ratio and screening database size on the performance of machine learning-based virtual screening
topic	Research Article
