Virtual screening with Parallel Cluster and Nextflow

2021-05-28


Docking is often used in virtual screening to identify potential drug leads. To be effective you typically need to screen a large number of candidate molecules, which means parallelising the process across multiple servers. This post describes how to do that using AWS Parallel Cluster and Nextflow.

Whilst docking is embarrassingly parallel in nature, with each candidate ligand being independent of the others, running it at scale can take a lot of effort in practice. We make this relatively simple by using AWS Parallel Cluster to create a compute environment on AWS EC2 and Nextflow to execute the workflow. Parallel Cluster is configured to install and use the SLURM workload manager to execute the jobs. Other workload managers are also supported, but we prefer SLURM.

The setup of the compute environment using Parallel Cluster is described in detail here. In short, these are the key steps:

Create a Python virtual environment and install the necessary modules (including aws-parallelcluster and the AWS CLI).
Set up your AWS credentials.
Set up the IAM role and policies.
Define post-install scripts that run once the nodes have been created. These get the nodes ready for action, for example by installing Singularity, which is used to run the workflow containers, and by installing and configuring Nextflow.
Define the configuration of your cluster, such as the number of worker nodes and the node flavours.
Create the cluster by running pcluster create -c ./config ${CLUSTER_NAME}
Connect to the cluster and execute a test Nextflow workflow to check that everything is working.
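
The steps that have a fixed command line can be strung together roughly as follows. This is a sketch, not the script from the repository: the cluster name docking-demo, the venv directory and the run helper are hypothetical, and the "<3" version pin is an assumption based on the pcluster create -c syntax used in this post, which belongs to the 2.x command line. The IAM, post-install and cluster configuration steps depend on your account and are left to the repository's instructions. By default the script only prints each command; set DRY_RUN=0 to execute them.

```shell
#!/usr/bin/env bash
# Sketch of the command-line steps above. With DRY_RUN=1 (the default)
# each command is printed rather than executed.
set -euo pipefail

CLUSTER_NAME="${CLUSTER_NAME:-docking-demo}"   # hypothetical cluster name
DRY_RUN="${DRY_RUN:-1}"

# Print the command in dry-run mode, otherwise execute it.
run() {
  if [ "$DRY_RUN" = "1" ]; then
    echo "+ $*"
  else
    "$@"
  fi
}

# 1. Python virtual environment with Parallel Cluster and the AWS CLI.
#    The "<3" pin is an assumption (see the note above).
run python3 -m venv venv
run ./venv/bin/pip install "aws-parallelcluster<3" awscli

# 2. AWS credentials (interactive prompt).
run ./venv/bin/aws configure

# 3-5. IAM role and policies, post-install scripts and the cluster config
#      file are account-specific; follow the repository for these.

# 6. Create the cluster from the config file.
run ./venv/bin/pcluster create -c ./config "$CLUSTER_NAME"

# 7. Log in to the master node, from where the test workflow is run.
run ./venv/bin/pcluster ssh "$CLUSTER_NAME"
```

Running it once in dry-run mode is a cheap way to check the cluster name and paths before anything is created on AWS.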

See the GitHub repository for full details. A few things are worth noting.

Firstly, the cluster comprises a single master node, from which workflows are launched, and a number of worker nodes where the computations take place. Nextflow launches the workflow and schedules the tasks using SLURM. Each task is executed as a Singularity container which is generated from a Docker image. Yes, there is a lot going on here, but much of it is hidden from view.
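
The Nextflow settings that route tasks through SLURM and run them in Singularity are small. The sketch below writes them to an example file. The post-install scripts in the repository already configure Nextflow, so the file name nextflow.config.example is hypothetical and the settings only illustrate the kind of options involved; they are not a copy of that setup.

```shell
#!/usr/bin/env bash
# Write an illustrative Nextflow configuration to an example file.
set -euo pipefail

CONFIG_FILE="${CONFIG_FILE:-nextflow.config.example}"   # hypothetical file name

cat > "$CONFIG_FILE" <<'EOF'
// Submit each task as a SLURM job instead of running it locally.
process.executor = 'slurm'

// Run task containers with Singularity; Docker images named by a
// process are pulled and converted to Singularity images.
singularity.enabled = true
singularity.autoMounts = true
EOF

echo "Wrote $CONFIG_FILE"
```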

Secondly, a volume is created and shared across the master and all worker nodes. This is how data is shared between the different tasks. We choose to use an EFS volume, but you can use an EBS one instead.

Next, you have full control over the flavours of the master and worker nodes, so you can have as many workers as you need, and you can opt for spot instances to reduce costs. The workers are created when there is work to do and terminated when the work is complete, again to reduce costs.
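
To give a feel for how these choices are expressed, here is a fragment of a cluster configuration written to an example file. It assumes the INI format of the 2.x releases of Parallel Cluster and is not the configuration from the repository: the instance types, queue sizes, section names and the file name config.example are all hypothetical, and the region, key pair, VPC and post-install settings that a working config needs are left out. Check the Parallel Cluster docs for the version you install before relying on any of these keys.

```shell
#!/usr/bin/env bash
# Write an illustrative (incomplete) Parallel Cluster 2.x config fragment.
set -euo pipefail

PCLUSTER_EXAMPLE="${PCLUSTER_EXAMPLE:-config.example}"   # hypothetical file name

cat > "$PCLUSTER_EXAMPLE" <<'EOF'
[cluster default]
scheduler = slurm
# Node flavours (hypothetical choices).
master_instance_type = t3.medium
compute_instance_type = c5.2xlarge
# Start with no workers and scale up to 10 when there is work to do.
initial_queue_size = 0
max_queue_size = 10
# Use spot instances for the workers to reduce costs.
cluster_type = spot
# Shared EFS volume, defined in the section below.
efs_settings = shared

[efs shared]
shared_dir = efs
EOF

echo "Wrote $PCLUSTER_EXAMPLE"
```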

And finally, Parallel Cluster has a whole range of options to consider. Consult the docs for full details.

So at this stage we have a cluster ready and have run a simple ‘hello world’ workflow on it. What about running a real docking workflow? We have lots of examples of this, and which one fits depends on your starting point. Have you already prepared your ligands? Do you want to enumerate microstates, tautomers or chiral centres? Which docking tool and protocol do you want to run? How do you want to post-process the generated poses so that you can filter down to the most interesting ones?

We will likely cover many of these topics at a later stage, but for simplicity we’ll run a simple docking protocol with rDock on candidate ligands that are already prepared. The data for this is in the docking directory of the repo. To run it, copy those files to the master node and run nextflow run main.nf. You should see something like this:

$ nextflow run main.nf --num_dockings 5
N E X T F L O W  ~  version 21.04.1
Launching `main.nf` [reverent_lovelace] - revision: e819a8e100
[-        ] process > sdsplit -
executor >  slurm (412)
[29/25b0bb] process > sdsplit     [100%] 1 of 1 ✔
[b8/43764b] process > rdock (410) [100%] 410 of 410 ✔
[fe/e82585] process > results     [100%] 1 of 1 ✔
Completed at: 28-May-2021 12:28:31
Duration    : 2h 44m 27s
CPU hours   : 31.9
Succeeded   : 412

The target is DHFR and the ligands are 1000 diverse molecules from the ChemSpace dataset that are similar in size to the natural ligand, enumerated with respect to charges, tautomers and undefined chiral centres. The resulting 10,226 ligands are split into 410 chunks of 25 and docked using rDock, and the individual results are then collated and filtered. This is only a simple example of what can be done, but it does illustrate how to run a computationally demanding workflow on EC2 using Parallel Cluster and Nextflow.
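
The numbers in the run summary can be checked with a little arithmetic. The sketch below assumes fixed-size chunks with a smaller final one, and treats the reported CPU hours divided by the wall-clock duration as a rough measure of how many cores were busy on average. The variable names are just for illustration.

```shell
#!/usr/bin/env bash
# Sanity-check the task count and average concurrency of the run above.
set -euo pipefail

n_ligands=10226      # enumerated ligands
chunk_size=25        # ligands per rdock task

# Ceiling division: a final partial chunk still needs its own task.
n_chunks=$(( (n_ligands + chunk_size - 1) / chunk_size ))
last_chunk=$(( n_ligands - (n_chunks - 1) * chunk_size ))

# One sdsplit task, one rdock task per chunk, one results task.
n_tasks=$(( 1 + n_chunks + 1 ))

# Average number of busy cores = CPU time / wall-clock time.
wall_seconds=$(( 2 * 3600 + 44 * 60 + 27 ))   # 2h 44m 27s
avg_cores=$(LC_ALL=C awk -v cpu_hours=31.9 -v wall="$wall_seconds" \
  'BEGIN { printf "%.1f", cpu_hours * 3600 / wall }')

echo "chunks=$n_chunks last_chunk=$last_chunk tasks=$n_tasks avg_cores=$avg_cores"
# prints: chunks=410 last_chunk=1 tasks=412 avg_cores=11.6
```

The 412 tasks match the count SLURM reports, and the figure of roughly 12 cores is an average over the whole run, including any time spent waiting for workers to start.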