Influence of Varying Training Set Composition and Size on Support Vector Machine-Based Prediction of Active Compounds

Support vector machine (SVM) modeling is one of the most popular machine learning approaches in chemoinformatics and drug design. The influence of training set composition and size on predictions currently is an underinvestigated issue in SVM modeling. In this study, we have derive...

Bibliographic Details
Main authors:	Rodríguez-Pérez, Raquel; Vogt, Martin; Bajorath, Jürgen
Format:	Online Article Text
Language:	English
Published:	American Chemical Society 2017
Online access:	https://www.ncbi.nlm.nih.gov/pmc/articles/PMC5417594/ https://www.ncbi.nlm.nih.gov/pubmed/28376613 http://dx.doi.org/10.1021/acs.jcim.7b00088

author	Rodríguez-Pérez, Raquel; Vogt, Martin; Bajorath, Jürgen
collection	PubMed
description	Support vector machine (SVM) modeling is one of the most popular machine learning approaches in chemoinformatics and drug design. The influence of training set composition and size on predictions currently is an underinvestigated issue in SVM modeling. In this study, we have derived SVM classification and ranking models for a variety of compound activity classes under systematic variation of the number of positive and negative training examples. With increasing numbers of negative training compounds, SVM classification calculations became increasingly accurate and stable. However, this was only the case if a required threshold of positive training examples was also reached. In addition, consideration of class weights and optimization of cost factors substantially aided in balancing the calculations for increasing numbers of negative training examples. Taken together, the results of our analysis have practical implications for SVM learning and the prediction of active compounds. For all compound classes under study, top recall performance and independence of compound recall of training set composition was achieved when 250–500 active and 500–1000 randomly selected inactive training instances were used. However, as long as ∼50 known active compounds were available for training, increasing numbers of 500–1000 randomly selected negative training examples significantly improved model performance and gave very similar results for different training sets.
format	Online Article Text
id	pubmed-5417594
institution	National Center for Biotechnology Information
language	English
publishDate	2017
publisher	American Chemical Society
record_format	MEDLINE/PubMed
spelling	pubmed-5417594 2017-05-05 J Chem Inf Model American Chemical Society 2017-04-04 2017-04-24 /pmc/articles/PMC5417594/ /pubmed/28376613 http://dx.doi.org/10.1021/acs.jcim.7b00088 Text en Copyright © 2017 American Chemical Society. This is an open access article published under an ACS AuthorChoice License (http://pubs.acs.org/page/policy/authorchoice_termsofuse.html), which permits copying and redistribution of the article or any adaptations for non-commercial purposes.
title	Influence of Varying Training Set Composition and Size on Support Vector Machine-Based Prediction of Active Compounds

Similar Items