Virtual screening of bioassay data

BACKGROUND: There are three main problems associated with the virtual screening of bioassay data. The first is access to freely-available curated data, the second is the number of false positives that occur in the physical primary screening process, and finally the data is highly-imbalanced with a l...

Bibliographic Details
Main author:	Schierz, Amanda C
Format:	Text
Language:	English
Published:	Springer 2009
Subjects:	Research Article
Online access:	https://www.ncbi.nlm.nih.gov/pmc/articles/PMC2820499/ https://www.ncbi.nlm.nih.gov/pubmed/20150999 http://dx.doi.org/10.1186/1758-2946-1-21

author	Schierz, Amanda C
collection	PubMed
description	BACKGROUND: There are three main problems associated with the virtual screening of bioassay data. The first is access to freely available curated data, the second is the number of false positives that occur in the physical primary screening process, and the third is that the data is highly imbalanced, with a low ratio of Active compounds to Inactive compounds. This paper first discusses these three problems and then applies a selection of Weka cost-sensitive classifiers (Naive Bayes, SVM, C4.5 and Random Forest) to a variety of bioassay datasets. RESULTS: Pharmaceutical bioassay data is not readily available to the academic community. The data held at PubChem is not curated and there is a lack of detailed cross-referencing between Primary and Confirmatory screening assays. With regard to the number of false positives that occur in the primary screening process, the analysis carried out has been shallow due to the lack of cross-referencing mentioned above. In the six cases found, the average percentage of false positives from the High-Throughput Primary screen is quite high at 64%. For the cost-sensitive classification, Weka's implementations of the Support Vector Machine and C4.5 decision tree learner performed relatively well. It was also found that the setting of the Weka cost matrix depends on the base classifier used and not solely on the ratio of class imbalance. CONCLUSIONS: Understandably, pharmaceutical data is hard to obtain. However, it would be beneficial to both the pharmaceutical industry and academics for curated primary screening data and the corresponding confirmatory data to be provided. Two benefits could be gained by applying virtual screening techniques to bioassay data: first, the search space of compounds to be screened could be reduced; second, analysing the false positives that occur in the primary screening process could help improve the technology. The number of false positives arising from primary screening raises the question of whether this type of data should be used for virtual screening. Care is needed when using Weka's cost-sensitive classifiers: across-the-board misclassification costs based on class ratios should not be used when comparing different classifiers on the same dataset.
format	Text
id	pubmed-2820499
institution	National Center for Biotechnology Information
language	English
publishDate	2009
publisher	Springer
record_format	MEDLINE/PubMed
spelling	pubmed-2820499 2010-02-12 Virtual screening of bioassay data Schierz, Amanda C J Cheminform Research Article Springer 2009-12-22 /pmc/articles/PMC2820499/ /pubmed/20150999 http://dx.doi.org/10.1186/1758-2946-1-21 Text en Copyright © 2009 Schierz; licensee BioMed Central Ltd. http://creativecommons.org/licenses/by/2.0 This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.
title	Virtual screening of bioassay data
topic	Research Article
url	https://www.ncbi.nlm.nih.gov/pmc/articles/PMC2820499/ https://www.ncbi.nlm.nih.gov/pubmed/20150999 http://dx.doi.org/10.1186/1758-2946-1-21
