
Large-scale virtual screening on public cloud resources with Apache Spark


Bibliographic Details
Main Authors: Capuccini, Marco; Ahmed, Laeeq; Schaal, Wesley; Laure, Erwin; Spjuth, Ola
Format: Online Article Text
Language: English
Published: Springer International Publishing 2017
Subjects:
Online Access: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC5339264/
https://www.ncbi.nlm.nih.gov/pubmed/28316653
http://dx.doi.org/10.1186/s13321-017-0204-4
author Capuccini, Marco
Ahmed, Laeeq
Schaal, Wesley
Laure, Erwin
Spjuth, Ola
collection PubMed
description BACKGROUND: Structure-based virtual screening is an in-silico method to screen a target receptor against a virtual molecular library. Applying docking-based screening to large molecular libraries can be computationally expensive; however, it constitutes a trivially parallelizable task. Most of the available parallel implementations are based on the Message Passing Interface (MPI), relying on low-failure-rate hardware and fast network connections. Google’s MapReduce revolutionized large-scale analysis, enabling the processing of massive datasets on commodity hardware and cloud resources, and providing transparent scalability and fault tolerance at the software level. Open source implementations of MapReduce include Apache Hadoop and the more recent Apache Spark. RESULTS: We developed a method to run existing docking-based screening software on distributed cloud resources, utilizing the MapReduce approach. We benchmarked our method, which is implemented in Apache Spark, by docking a publicly available target receptor against ~2.2 M compounds. The performance experiments show good parallel efficiency (87%) when running in a public cloud environment. CONCLUSION: Our method enables parallel structure-based virtual screening on public cloud resources or commodity computer clusters. The degree of scalability that we achieve allows for trying out our method on relatively small libraries first and then scaling to larger libraries. Our implementation is named Spark-VS and is freely available as open source from GitHub (https://github.com/mcapuccini/spark-vs).
format Online
Article
Text
id pubmed-5339264
institution National Center for Biotechnology Information
language English
publishDate 2017
publisher Springer International Publishing
record_format MEDLINE/PubMed
spelling pubmed-5339264 2017-03-17 Large-scale virtual screening on public cloud resources with Apache Spark. Capuccini, Marco; Ahmed, Laeeq; Schaal, Wesley; Laure, Erwin; Spjuth, Ola. J Cheminform. Methodology. Springer International Publishing 2017-03-06 /pmc/articles/PMC5339264/ /pubmed/28316653 http://dx.doi.org/10.1186/s13321-017-0204-4 Text en © The Author(s) 2017. Open Access: This article is distributed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated.
title Large-scale virtual screening on public cloud resources with Apache Spark
topic Methodology
url https://www.ncbi.nlm.nih.gov/pmc/articles/PMC5339264/
https://www.ncbi.nlm.nih.gov/pubmed/28316653
http://dx.doi.org/10.1186/s13321-017-0204-4