Efficient iterative virtual screening with Apache Spark and conformal prediction

BACKGROUND: Docking and scoring large libraries of ligands against target proteins forms the basis of structure-based virtual screening. The problem is trivially parallelizable, and calculations are generally carried out on computer clusters or on large workstations in a brute force manner, by docki...

Bibliographic Details
Main authors: Ahmed, Laeeq, Georgiev, Valentin, Capuccini, Marco, Toor, Salman, Schaal, Wesley, Laure, Erwin, Spjuth, Ola
Format: Online Article Text
Language: English
Published: Springer International Publishing 2018
Subjects:
Online access: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC5833896/
https://www.ncbi.nlm.nih.gov/pubmed/29492726
http://dx.doi.org/10.1186/s13321-018-0265-z
author Ahmed, Laeeq
Georgiev, Valentin
Capuccini, Marco
Toor, Salman
Schaal, Wesley
Laure, Erwin
Spjuth, Ola
collection PubMed
description BACKGROUND: Docking and scoring large libraries of ligands against target proteins forms the basis of structure-based virtual screening. The problem is trivially parallelizable, and calculations are generally carried out on computer clusters or large workstations in a brute-force manner, by docking and scoring all available ligands. CONTRIBUTION: In this study we propose a strategy based on iteratively docking a set of ligands to form a training set, training a ligand-based model on this set, and predicting the remainder of the ligands to exclude those predicted as ‘low-scoring’. Then another set of ligands is docked, the model is retrained, and the process is repeated until a certain model efficiency level is reached. Thereafter, the remaining ligands are docked or excluded based on this model. We use SVM and conformal prediction to deliver valid prediction intervals for ranking the predicted ligands, and Apache Spark to parallelize both the docking and the modeling. RESULTS: We show on 4 different targets that conformal-prediction-based virtual screening (CPVS) is able to reduce the number of docked molecules by 62.61% while retaining an accuracy for the top 30 hits of 94% on average and a speedup of 3.7. The implementation is available as open source via GitHub (https://github.com/laeeq80/spark-cpvs) and can be run on high-performance computers as well as on cloud resources.
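The iterative strategy described in the abstract can be illustrated with a minimal, single-machine sketch. It is not the authors' implementation (that is the Scala/Spark code in the GitHub repository named above), and several pieces are assumptions made only for illustration: `dock` is a hypothetical stand-in that returns a precomputed score where higher means better; each ligand has a single numeric descriptor `x`; a toy nearest-centroid classifier replaces the SVM; ligands are labelled high-scoring by a quantile of the scores docked so far; "model efficiency" is taken to be the fraction of single-label prediction sets; and the final step docks only ligands predicted as high-scoring. The conformal part is a class-conditional (Mondrian) inductive conformal predictor.

```python
import random

def dock(ligand):
    # Hypothetical stand-in for an expensive docking run; higher = better here.
    return ligand["true_score"]

def fit(train):
    """Toy nearest-centroid classifier on one descriptor (stand-in for the SVM)."""
    xs_all = [x for x, _ in train]
    mean = lambda v: sum(v) / len(v)
    c0 = mean([x for x, y in train if y == 0] or xs_all)  # low-scoring centroid
    c1 = mean([x for x, y in train if y == 1] or xs_all)  # high-scoring centroid
    return c1 - c0, (c0 + c1) / 2  # (weight, midpoint)

def decision(x, model):
    w, mid = model
    return w * (x - mid)  # positive values lean towards 'high-scoring'

def p_value(x, label, model, calib):
    # Nonconformity: how strongly the decision value argues against `label`.
    sign = -1 if label == 1 else 1
    alpha = sign * decision(x, model)
    ref = [sign * decision(cx, model) for cx, cy in calib if cy == label]
    return (sum(a >= alpha for a in ref) + 1) / (len(ref) + 1)

def predict_set(x, model, calib, eps):
    # Keep every label whose p-value exceeds the significance level eps.
    return {lab for lab in (0, 1) if p_value(x, lab, model, calib) > eps}

def cpvs(library, batch=100, eps=0.2, target_eff=0.8, quantile=0.7, seed=0):
    rng = random.Random(seed)
    pool = list(library)
    rng.shuffle(pool)
    docked = []  # (ligand, score) pairs
    while pool:
        # 1. Dock the next batch and add it to the labelled data.
        new, pool = pool[:batch], pool[batch:]
        docked += [(lig, dock(lig)) for lig in new]
        if not pool:
            break
        # 2. Label by score quantile, then split into training and calibration.
        scores = sorted(s for _, s in docked)
        thr = scores[int(quantile * len(scores))]
        data = [(lig["x"], int(s >= thr)) for lig, s in docked]
        rng.shuffle(data)
        cut = int(0.7 * len(data))
        model, calib = fit(data[:cut]), data[cut:]
        # 3. Predict the remaining ligands and measure efficiency.
        sets = [predict_set(lig["x"], model, calib, eps) for lig in pool]
        efficiency = sum(len(s) == 1 for s in sets) / len(sets)
        if efficiency >= target_eff:
            # Final step (assumed rule): dock only predicted high-scoring ligands.
            docked += [(lig, dock(lig)) for lig, s in zip(pool, sets) if s == {1}]
            break
        # 4. Otherwise drop predicted low-scoring ligands and iterate.
        pool = [lig for lig, s in zip(pool, sets) if s != {0}]
    return docked

def make_library(n=2000, seed=1):
    # Synthetic ligands whose score correlates with the descriptor.
    rng = random.Random(seed)
    lib = []
    for _ in range(n):
        x = rng.random()
        lib.append({"x": x, "true_score": x + rng.gauss(0, 0.05)})
    return lib

if __name__ == "__main__":
    lib = make_library()
    result = cpvs(lib)
    print(f"docked {len(result)} of {len(lib)} ligands")
```

In this sketch the docking of each batch and the prediction over the pool are plain list comprehensions; both are independent per ligand, which is the property that lets the paper distribute them with Apache Spark. The batch size should be large enough that the training and calibration splits each contain both classes.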
format Online
Article
Text
id pubmed-5833896
institution National Center for Biotechnology Information
language English
publishDate 2018
publisher Springer International Publishing
record_format MEDLINE/PubMed
spelling pubmed-5833896 2018-03-13 J Cheminform Methodology Springer International Publishing 2018-03-01 /pmc/articles/PMC5833896/ /pubmed/29492726 http://dx.doi.org/10.1186/s13321-018-0265-z Text en © The Author(s) 2018 Open Access This article is distributed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated.
title Efficient iterative virtual screening with Apache Spark and conformal prediction
topic Methodology