Metagenomic Binning with Nextflow

How to build a metagenomic binning pipeline on AWS (Part 2)

Sixing Huang
AWS in Plain English


A metagenome is the sum of all single genomes in a habitat, be they viral, bacterial, or eukaryotic (see my introduction here). Currently, due to technical limitations, biologists have to shred all these single genomes and sequence the fragments. DNA fragments from the same organism should have similar DNA compositions and abundances. Hence computer programs can group these similar fragments into bins to represent individual organisms. This process is called metagenomic binning.

Figure 1. A metagenome is the collection of all single genomes in a habitat. Photo by National Cancer Institute on Unsplash

The binning process consists of several essential steps. First, we should quality-filter the raw sequences to remove low-quality reads and adapter sequences; otherwise, strange things ensue downstream. Afterwards, the short reads are assembled into long contigs. Then the binning itself can take place. Finally, a completeness check estimates how complete each bin is as a genome and whether it is pure. For Illumina data, a bioinformatician can chain well-established open-source software for each step (Figure 2) and build a binning pipeline.

Figure 2. Essential steps in metagenomic binning and the software for each step. Image by author.

There are many ways to build such pipelines. I built one for the DSMZ in Braunschweig, Germany. I installed all the necessary software in a Conda environment and chained the tools together with a Bash script. The pipeline is configured for Slurm and can resume aborted tasks. Although it gets the job done, it is inflexible. Firstly, it is platform-specific: I would have to modify several places to run it on my local machine, on an SGE cluster, or on AWS. Secondly, it is difficult to assign compute resources to this monolithic pipeline. I had to claim enough CPU and memory upfront so that the most resource-intensive step could survive. That meant waiting longer in the queue until enough compute resources were available, and once the job did run, it sat on excess memory most of the time. Finally, the pipeline does not log enough, which makes debugging difficult and time-consuming.

After I left the DSMZ, I tried to reproduce the pipeline, this time building it modularly on AWS. I found that FSx for Lustre can be a good way to stage data from S3 to AWS Batch, but it was daunting to build every trigger, Batch configuration, and Step Function all by myself. AWS ParallelCluster would be an interesting choice: it offers a Slurm cluster in the cloud so that I could migrate my pipeline seamlessly. But that effort still does not solve the platform-specific problem; it merely becomes AWS-specific instead.

I kept looking for alternatives, and the name "Nextflow" kept coming up. After playing around with it, Nextflow looks very promising. This Java program reads user-defined .nf and .config files and runs the workflows locally, on a cluster, or in the cloud. It executes programs written in any programming language and supports Docker, Singularity, and other container platforms. Unlike my sequential Bash script, processes in Nextflow run in parallel implicitly, which can speed up complex workflows significantly. Furthermore, I can now tailor the compute resources for each process and reduce waste. Last but not least, with the accompanying web application Nextflow Tower, we can configure and submit Nextflow jobs to a compute cluster or to the cloud easily.

Figure 3. The Nextflow home page. Screenshot by author.

For all these good things, the young Nextflow still needs time to build a larger user base. Tutorials and support are surprisingly scarce on the internet, and example pipelines are hard to find. There are many undocumented gotchas in writing and executing Nextflow scripts. And because feedback from the cloud takes time, an error may take hours to fix. All this makes it difficult for a beginner like me to take off quickly.

After a week of debugging and €30 of AWS charges, I finished my minimal Illumina metagenomic binning pipeline with Nextflow. Although the pipeline can also run locally or on a Slurm cluster, it is the setup for AWS that requires extra attention, and that knowledge should be transferable to other cloud platforms. Therefore, in this article, I am going to demonstrate the steps towards an AWS deployment, first via the command line and later via Nextflow Tower. I will point out the undocumented gotchas along the way to save you several precious hours. For this demo, I will use the paired-end read data of Enterococcus faecium from the Superbugs Project (AUSMDU00015573). Be aware that a walkthrough of this tutorial will cost you around five to ten dollars. I will use DSL version 1 in this project. The code for this project is hosted on my GitHub repository here:

1. The architecture

Figure 4. The architecture for the minimal metagenomic binning pipeline on AWS. Image by author.

The AWS architecture for this project is deceptively simple. It only involves AWS Batch and S3. In detail, AWS Batch creates EC2 instances to perform the computation, while S3 serves as the storage for both input and output data. It is noteworthy that in Nextflow Tower, users can even use FSx for Lustre or EFS for better performance (read my articles here and here to learn more about these two storage services).

Luckily for us, Nextflow takes care of most of the setup, so we only need to define a few things, such as a working bucket in S3. Nextflow coordinates the data upload and download between our local storage and S3. It also automatically moves data back and forth between EC2 and S3 without explicit user instruction. Job dependencies are handled by channels: like a good plumber, Nextflow chains processes together via channels and keeps the sludge (data) moving among them. Process parallelisation is also implicit. For example, assume Processes A and B rely on the same input data. Although the Nextflow script gives the impression that B can only start after A ends, Nextflow in fact tries to run both in parallel as soon as their input data are available. This can cut the runtime greatly for many pipelines.
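To make this concrete, here is a tiny, self-contained sketch (not taken from my pipeline) of two processes that read the same data. In DSL 1, a channel has to be duplicated with into before two consumers can read it, and Nextflow then runs both processes in parallel as soon as the items arrive:

// Toy example: processA and processB both consume the sample list.
// Nextflow launches them in parallel; neither waits for the other.
samples = Channel.from('sample1', 'sample2')
samples.into { samples_for_a; samples_for_b }   // DSL 1: duplicate the channel for two consumers

process processA {
    input:
    val id from samples_for_a

    output:
    stdout into a_results

    """
    echo "A handled ${id}"
    """
}

process processB {
    input:
    val id from samples_for_b

    output:
    stdout into b_results

    """
    echo "B handled ${id}"
    """
}

a_results.mix(b_results).view()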

As a result, we can write down our pipeline in simple scripts and then run them in AWS with local input data via just one command. All in all, Nextflow effectively takes a lot of AWS details off our shoulders and we only need to focus on our bioinformatic logic. It is a huge productivity boost.

2. Build a Nextflow AMI and configure the IAM

In AWS, Nextflow relies on AWS Batch to carry out the actual workload, and AWS Batch in turn launches EC2 instances to do the work. Each EC2 instance needs two pieces of software: Docker and the AWS CLI. The former is necessary because AWS Batch only supports containers, while the AWS CLI is needed for the S3 data transfer. Nextflow requires us to build our own AMI that provides both.

To build the AMI, first log into your EC2 service. Click Launch instances but choose the Amazon ECS-Optimized Amazon Linux 2 AMI from the AWS Marketplace because this AMI already has Docker and is optimized for AWS Batch.

Figure 5. Choose the base AMI for our Nextflow AMI. Image by author.

Next, choose a medium or a large instance type because the t3.micro (free tier eligible) did not have enough memory for the Conda installation in my test. Afterwards, click 4. Add Storage to adjust the storage for your pipeline. For the current demo, the default value of 30 GiB is fine. Configure the rest and launch the instance.

When the instance is ready, log in via SSH and install the AWS CLI by issuing the following commands:
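A typical set of commands, following the approach recommended in the Nextflow documentation for AWS Batch (a self-contained AWS CLI installed through Miniconda), looks like the sketch below; the installation path under $HOME/miniconda is a convention, not a requirement:

# Install a self-contained AWS CLI via Miniconda so it does not depend on the system Python
sudo yum install -y bzip2 wget
wget https://repo.anaconda.com/miniconda/Miniconda3-latest-Linux-x86_64.sh
bash Miniconda3-latest-Linux-x86_64.sh -b -f -p $HOME/miniconda
$HOME/miniconda/bin/conda install -c conda-forge -y awscli
rm Miniconda3-latest-Linux-x86_64.sh
$HOME/miniconda/bin/aws --version   # verify the installation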

Afterwards, stop the instance and create an AMI out of it. Copy the AMI ID somewhere for later use.

Figure 6. Create a Nextflow AMI. Image by author.

Meanwhile, we need to grant full S3 access to the ecsInstanceRole so that our AWS Batch instances can transfer data from and to S3. Go to the IAM service and search for ecsInstanceRole under Roles. If it does not exist, follow these instructions to create it. Afterwards, attach the AmazonS3FullAccess policy to the role (Figure 7).
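If you prefer the command line over the console, the same policy attachment can be done with one AWS CLI command (assuming your local credentials have IAM permissions):

# Attach the AmazonS3FullAccess managed policy to the ecsInstanceRole
aws iam attach-role-policy \
    --role-name ecsInstanceRole \
    --policy-arn arn:aws:iam::aws:policy/AmazonS3FullAccess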

Figure 7. Policies for ecsInstanceRole. Image by author.

To use Nextflow Tower, you will also need an IAM user with a specific set of policies. Create that user by following the Tower Forge route here.

3. Configure AWS Batch

Now it is time to set up AWS Batch. Open the AWS Batch console and create a compute environment. Choose Managed and name it something like "nextflow-compute". Expand Additional settings: service role, instance role, EC2 key pair so that you can choose AWSBatchServiceRole and ecsInstanceRole in the two drop-downs (if you do not see the first role, follow these instructions to create it).

Figure 8. AWS Batch compute environment. Image by author.

Under Instance configuration, choose either On-demand or Spot. Do NOT choose Fargate or Fargate Spot. Also, make sure that Minimum vCPUs is set to 0; the fine print there explains why. Under Allowed instance types, you can optionally add c5a.8xlarge because the tool CheckM needs around 40 GiB of memory. Finally, expand Additional settings: launch template, user specified AMI, check User-specified AMI ID, and fill in the ID of the AMI you created in Section 2.

Once the compute environment is created, create a Job queue and call it something like “nextflow-queue”. Connect this queue with the compute environment above.

4. Create an S3 bucket

Create an S3 bucket. This is the single place where you will find all your data: command-line Nextflow copies the input data into the bucket automatically, and all the temporary files created during the pipeline execution are stored there too. However, it appears that the web application Nextflow Tower cannot read local files. Therefore, upload the input data into the bucket now for the Nextflow Tower demo later.
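For reference, the bucket creation and the upload of the raw reads can also be done with two AWS CLI commands (the bucket name and local path below are placeholders):

# Create the working bucket and upload the paired-end reads for the Tower demo
aws s3 mb s3://your-nextflow-bucket --region ap-east-1
aws s3 cp /path/to/ausm-data/ s3://your-nextflow-bucket/raw/ \
    --recursive --exclude "*" --include "*_R?.fastq.gz"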

5. The minimal metagenomic binning pipeline

After all these preparations, we can start composing the Nextflow scripts. The bioinformatic part is defined in main.nf, while nextflow.config describes the deployment settings for the various platforms.

main.nf starts with the pairing of the forward and reverse reads; it puts each pair into the read_pairs channel. Then comes the meaty part of the script. My minimal metagenomic binning pipeline consists of fastp, MEGAHIT, MaxBin2, and CheckM (Figure 2). They are implemented in their own code blocks with similar structures (Figure 9). They all define a publishDir, the folder into which Nextflow copies the files emitted by their output channels.
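The pairing step typically relies on Nextflow's fromFilePairs factory. A simplified DSL 1 sketch of what it can look like (not the exact code from my repository) is:

// Collect forward/reverse reads into (sample_id, [R1, R2]) tuples
params.reads = "data/*_R{1,2}.fastq.gz"   // overridden on the command line with --reads

Channel
    .fromFilePairs(params.reads, checkIfExists: true)
    .ifEmpty { exit 1, "No read pairs matched the pattern: ${params.reads}" }
    .set { read_pairs }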

Figure 9. The four processes defined in the nf script. Image by author.

Next are the Docker containers. There are two gotchas. Firstly, the images should have the program ps installed, because Nextflow collects task metrics through ps and will error out without it. Secondly, it is vital that the Docker images used by Nextflow do NOT define an ENTRYPOINT; otherwise they will fail on AWS. If you want to use a certain image A, first check its Dockerfile or its image layers. If it has an ENTRYPOINT, make a new Dockerfile with just two lines: the first line, FROM A, sets A as the base, and the second line, ENTRYPOINT [], removes the ENTRYPOINT. Push your new image to Docker Hub or Quay so that Nextflow can download it later.
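Such a wrapper Dockerfile is literally just those two lines (the base image name below is a placeholder):

# Reuse image A as-is but clear its ENTRYPOINT so Nextflow can run its own command
FROM some-user/image-a:latest
ENTRYPOINT []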

The input: and output: lines define the input and output channels. The sample ID, read pairs, and FASTA files produced by one process are placed into its output channels and serve as the input channels of the next process. It is also noteworthy that the file() and path() qualifiers are necessary in both input and output declarations; otherwise the downstream processes cannot find the files. Finally, the script: sections contain the actual commands of the programs.
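Putting the pieces together, a single process block has roughly the following shape in DSL 1. This is an illustrative sketch of the fastp step, not the exact code from my repository; the container name and file names are placeholders:

process fastp {
    publishDir "results/fastp", mode: 'copy'      // Nextflow copies the declared outputs here
    container 'your-dockerhub-user/fastp:latest'  // placeholder image without an ENTRYPOINT

    input:
    tuple val(sample_id), file(reads) from read_pairs

    output:
    tuple val(sample_id), file("${sample_id}_trimmed_*.fastq.gz") into trimmed_pairs

    script:
    """
    fastp -i ${reads[0]} -I ${reads[1]} \\
          -o ${sample_id}_trimmed_R1.fastq.gz -O ${sample_id}_trimmed_R2.fastq.gz
    """
}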

The nextflow.config is just a collection of settings and profiles. Below is an excerpt of mine.

Figure 10. An excerpt of my nextflow.config file. Image by author.

The config file explicitly enables Docker for all scenarios. The memory = 60.GB line is necessary; otherwise, Nextflow runs some resource-hungry processes with just 1 GB of memory. Because CheckM devoured 35 GB of memory in my small test runs, I set the value to 60 GB for good measure. So make sure that the machine executing the pipeline, whether your local computer or the chosen AWS instance type, offers at least 60 GB of memory. In the profiles section, I defined the aws profile. In it, aws.region is set to my region, Hong Kong (ap-east-1), and process.queue is set to the queue name that we created in Section 3.
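For orientation, the settings described above translate into a config of roughly this shape (the queue name, region, and CLI path are assumptions based on the earlier sections, not a verbatim copy of my file):

docker.enabled = true            // run every process in its Docker container

process {
    memory = 60.GB               // prevent resource-hungry steps such as CheckM from getting the 1 GB default
}

profiles {
    aws {
        process.executor  = 'awsbatch'
        process.queue     = 'nextflow-queue'                     // the job queue from Section 3
        aws.region        = 'ap-east-1'                          // Hong Kong
        aws.batch.cliPath = '/home/ec2-user/miniconda/bin/aws'   // AWS CLI baked into the custom AMI
    }
}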

Once you finish the scripts, push them to GitHub so that Nextflow Tower can read them later.

6. Run the pipeline via command line

Finally, it is time to test-drive the pipeline via the command line. Make sure that you have Java and Nextflow on your system (putting the nextflow program into one of your PATH folders will save you some typing down the road). Then run this command to kick off the process:

nextflow run main.nf -resume -bucket-dir [your-s3-bucket] --reads [path-pattern-for-input-read-pairs] -profile aws

Make sure that the paired-end files are named with the R1 and R2 tags before the .fastq.gz endings. Then you can write a path pattern that captures all your input read pairs. For example, my path pattern /Users/dgg32/Downloads/ausm-data/*_R{1,2}.fastq.gz captured the R1 and R2 fastq.gz files in my local Downloads/ausm-data folder. My run looked like this:

Figure 11. An example Nextflow run. Image by author.

If the run is successful, you should see output messages similar to those in Figure 11. And because I defined publishDir as "results" in my main.nf, Nextflow automatically transferred all the desired output files from AWS into that local "results" folder. Check the folder and have a look at the results.

Congratulations! You have just finished an Illumina metagenomic binning run on AWS.

7. Run the pipeline via Nextflow Tower

Alternatively, we can use Nextflow Tower to launch the same pipeline on AWS and get the same results. Nextflow Tower greatly simplifies the setup in AWS: it takes over the creation of the Nextflow AMI and the AWS Batch resources. But so far it seems that Nextflow Tower cannot process input data from the local file system.

Log into Nextflow Tower and create a compute environment. You need to fill in the credentials of the IAM user from Section 2. Set the Pipeline work directory to your S3 bucket from Section 4, and give Max CPUs a positive value.

Figure 12. Compute environment setup in Nextflow Tower. Image by author.
Figure 13. Pipeline setup in Nextflow Tower. Image by author.

Afterwards, create a new pipeline under Launchpad. Give the pipeline a name and a description. Set Pipeline to launch to the address of the GitHub repository that hosts your Nextflow scripts. Expand Advanced options and enable Pull latest. Paste the content of our nextflow.config into Nextflow config file. Then click Add.

Afterwards, click the newly created pipeline and specify the input data in YAML style under Pipeline parameters:

reads: '[your-S3-bucket-input-folder]/*_R{1,2}.f*q.gz'

In my case, it was

reads: 's3://nextflow-tower-sixing-asset/raw/*_R{1,2}.f*q.gz'

Launch the pipeline. You can monitor the status of your submitted job under the Runs tab. The pipeline box first appears in orange; when the preparation finishes and the job runs, it turns light blue.

Figure 14. The report of a successful run in Nextflow Tower. Image by author.

After about 22 minutes, the pipeline box should turn green. Green means success and red means failure (you can see my many failed attempts in Figure 14). Check the scratch folder in your S3 bucket and see whether you have got the results. On the detail page, you can read both the aggregate and the individual CPU and memory usage. More interestingly, Nextflow Tower gives us a cost estimate. All these statistics help us optimize the pipeline further.

Conclusion

Nextflow ticks many of the buzzword boxes in the current IT landscape: cloud, Conda, containers, Kubernetes, Slurm, reproducible, language-agnostic, scalable, and asynchronous. First and foremost, its succinct declarative syntax makes pipeline construction straightforward. Tools written in any language can run under its command. It can resume failed runs. Its support for Conda and containers, together with its extensive logging, makes reproducibility easy. Finally, it has made a cloud deployment as simple as a local one. The cost estimate in Nextflow Tower can help us improve budgeting. It means that hobby bioinformaticians can now run their pipelines at work on a cluster and at home in the cloud, and still get the same results. Fabulous.

But Nextflow still has room to grow. As mentioned earlier, Nextflow needs more tutorials, documentation, and examples so that more users can come under its banner. It would also be great if Nextflow Tower could read local files and support more cloud providers such as Alibaba Cloud (important for Chinese users).

This metagenomic binning pipeline is minimal but very functional. In fact, based on my experience, I would argue that every public Illumina single-genome dataset should be checked with such a minimal pipeline to detect contamination from non-axenic cultures. Still, the pipeline is just a skeleton, and we can add more processes to it. At the DSMZ, I added MetaBAT, CONCOCT, and DAS Tool to the binning step to refine the bins, and I wrote some Python scripts to summarize the CheckM results. It is also possible to add functional annotation and taxonomic classification into the mix. Finally, a PacBio metagenomic pipeline can be constructed in a similar fashion.

Are you also using Nextflow? If not, why not give it a try right away? If yes, please tell me your experience.
