
A novel method for mining highly imbalanced high-throughput screening data in PubChem

Motivation: The comprehensive information of small molecules and their biological activities in PubChem brings great opportunities for academic researchers. However, mining high-throughput screening (HTS) assay data remains a great challenge given the very large data volume and the highly imbalanced...


Bibliographic Details
Main Authors: Li, Qingliang, Wang, Yanli, Bryant, Stephen H.
Format: Text
Language: English
Published: Oxford University Press 2009
Subjects:
Online Access: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC2788930/
https://www.ncbi.nlm.nih.gov/pubmed/19825798
http://dx.doi.org/10.1093/bioinformatics/btp589
_version_ 1782175015367606272
author Li, Qingliang
Wang, Yanli
Bryant, Stephen H.
author_facet Li, Qingliang
Wang, Yanli
Bryant, Stephen H.
author_sort Li, Qingliang
collection PubMed
description Motivation: The comprehensive information of small molecules and their biological activities in PubChem brings great opportunities for academic researchers. However, mining high-throughput screening (HTS) assay data remains a great challenge given the very large data volume and the highly imbalanced nature with only a small number of active compounds compared to inactive compounds. Therefore, there is currently a need for better strategies to work with HTS assay data. Moreover, as luciferase-based HTS technology is frequently exploited in the assays deposited in PubChem, constructing a computational model to distinguish and filter out potential interference compounds for these assays is another motivation. Results: We used the granular support vector machines (SVMs) repetitive under sampling method (GSVM-RU) to construct an SVM from luciferase inhibition bioassay data in which the active/inactive imbalance ratio is high (1/377). The best model recognized the active and inactive compounds with accuracies of 86.60% and 88.89%, respectively, with a total accuracy of 87.74%, by cross-validation test and blind test. These results demonstrate the robustness of the model in handling the intrinsic imbalance problem in HTS data and it can be used as a virtual screening tool to identify potential interference compounds in luciferase-based HTS experiments. Additionally, this method has also proved computationally efficient by greatly reducing the computational cost and can be easily adopted in the analysis of HTS data for other biological systems. Availability: Data are publicly available in PubChem with AIDs of 773, 1006 and 1379. Contact: ywang@ncbi.nlm.nih.gov; bryant@ncbi.nlm.nih.gov Supplementary information: Supplementary data are available at Bioinformatics online.
format Text
id pubmed-2788930
institution National Center for Biotechnology Information
language English
publishDate 2009
publisher Oxford University Press
record_format MEDLINE/PubMed
spelling pubmed-27889302009-12-07 A novel method for mining highly imbalanced high-throughput screening data in PubChem Li, Qingliang Wang, Yanli Bryant, Stephen H. Bioinformatics Original Papers Motivation: The comprehensive information of small molecules and their biological activities in PubChem brings great opportunities for academic researchers. However, mining high-throughput screening (HTS) assay data remains a great challenge given the very large data volume and the highly imbalanced nature with only a small number of active compounds compared to inactive compounds. Therefore, there is currently a need for better strategies to work with HTS assay data. Moreover, as luciferase-based HTS technology is frequently exploited in the assays deposited in PubChem, constructing a computational model to distinguish and filter out potential interference compounds for these assays is another motivation. Results: We used the granular support vector machines (SVMs) repetitive under sampling method (GSVM-RU) to construct an SVM from luciferase inhibition bioassay data in which the active/inactive imbalance ratio is high (1/377). The best model recognized the active and inactive compounds with accuracies of 86.60% and 88.89%, respectively, with a total accuracy of 87.74%, by cross-validation test and blind test. These results demonstrate the robustness of the model in handling the intrinsic imbalance problem in HTS data and it can be used as a virtual screening tool to identify potential interference compounds in luciferase-based HTS experiments. Additionally, this method has also proved computationally efficient by greatly reducing the computational cost and can be easily adopted in the analysis of HTS data for other biological systems. Availability: Data are publicly available in PubChem with AIDs of 773, 1006 and 1379. Contact: ywang@ncbi.nlm.nih.gov; bryant@ncbi.nlm.nih.gov Supplementary information: Supplementary data are available at Bioinformatics online.
Oxford University Press 2009-12-15 2009-10-13 /pmc/articles/PMC2788930/ /pubmed/19825798 http://dx.doi.org/10.1093/bioinformatics/btp589 Text en http://creativecommons.org/licenses/by-nc/2.0/uk/ This is an Open Access article distributed under the terms of the Creative Commons Attribution Non-Commercial License (http://creativecommons.org/licenses/by-nc/2.5/uk/) which permits unrestricted non-commercial use, distribution, and reproduction in any medium, provided the original work is properly cited.
spellingShingle Original Papers
Li, Qingliang
Wang, Yanli
Bryant, Stephen H.
A novel method for mining highly imbalanced high-throughput screening data in PubChem
title A novel method for mining highly imbalanced high-throughput screening data in PubChem
title_full A novel method for mining highly imbalanced high-throughput screening data in PubChem
title_fullStr A novel method for mining highly imbalanced high-throughput screening data in PubChem
title_full_unstemmed A novel method for mining highly imbalanced high-throughput screening data in PubChem
title_short A novel method for mining highly imbalanced high-throughput screening data in PubChem
title_sort novel method for mining highly imbalanced high-throughput screening data in pubchem
topic Original Papers
url https://www.ncbi.nlm.nih.gov/pmc/articles/PMC2788930/
https://www.ncbi.nlm.nih.gov/pubmed/19825798
http://dx.doi.org/10.1093/bioinformatics/btp589