The world is moving to cloud computing fast. This is because the cloud is easy to use, cheap, accessible and secure. Cloud providers such as Amazon Web Services (AWS) take over many repetitive, tedious IT maintenance tasks for their customers. As a result, cloud users can focus on their own business logic. The cloud also provides different pricing options that may be more cost-effective than on-premises servers. In addition, unlike its on-premises counterparts, the cloud is securely accessible over the internet.

In this tutorial, I am going to show you how I managed a serverless NCBI taxonomy…

How to build a metagenomic binning pipeline on AWS (Part 1)

Bioinformatics is leaping into the cloud

In the webinar “Scaling genomics workloads using HPC on AWS” on July 14, 2021, I learned that heavyweights such as AstraZeneca and Illumina have already moved their genome analyses into the AWS cloud and have been reaping great benefits ever since. The cloud reduced both runtime and costs dramatically. For example, AstraZeneca cut the runtime for its sequence data processing pipeline by 2400% on AWS while Illumina is saving close to $400,000 in monthly compute and storage costs.

One statement from the webinar lecturer Dr. Evan Bollig stood out: “start your migration with S3 and focus on…

Three museums in Yokohama, Stavanger and Berlin taught me something unexpected

Everyone loves a good museum visit. It is an intensive learning session in our free time. Although it is the objects themselves that do all the talking, let’s not overlook the contributions of the museum curators. They carefully select exhibits to educate and entertain the visitors. It is a subtle art to organize an immersive itinerary that engages the visitors and teaches them what the museum name suggests. It is even harder to surpass the visitors’ expectations and show them something extra. But some museums have achieved just that. Here are the three museums that have gone above and…

Ridiculous sequencing results revealed how errors propagated from one research study to a global database

Garbage in, garbage out. But first you need to know what garbage looks like.

Figure 1. Carp in the soil.

Last year, when we were working on a publication about three Cyanobacteria, my colleague Pia Marter told me that our three metagenome-assembled genomes (MAGs) contained some DNA fragments from Cyprinus carpio (common carp). My first thought was that this was not surprising, because our Cyanobacteria were aquatic. But then she told me that other colleagues at DSMZ had also reported carp in their samples. And those samples came from anything but water: forest soils, fish pathogens and plants. All of these projects were sequenced on the Illumina…

Build an analytic pipeline with ElasticBLAST, SNS, and DataBrew on AWS

Photo by tian kuan on Unsplash

Bioinformatic programs come and go, but BLAST stays.

BLAST, short for Basic Local Alignment Search Tool, is the search engine for bioinformaticians. While Google takes text strings as queries and returns relevant web pages, BLAST accepts DNA or protein sequences as queries and returns similar sequences from databases such as the Non-redundant Nucleotide database and the Non-redundant Protein database from the National Center for Biotechnology Information (NCBI).

BLAST is the bread and butter of all bioinformaticians. Published in 1990 by Altschul et al., its paper has been cited 92,993 times as of this writing. For every biologist, this evergreen piece…

Combine Maps, OR-Tools, SendGrid and Cloud Functions to commandeer a delivery fleet

This article shows how to:

1. Set up a Cloud Storage bucket in GCP that triggers a Cloud Function when a file is uploaded;

2. Set up a Cloud Function that calculates the optimal routing strategies with Google Maps and Google OR-Tools;

3. Send instruction emails to the carriers via SendGrid;

4. Set up Cloud Build for continuous deployment.
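The first step in the list above can be sketched roughly as follows. This is a minimal sketch, assuming a first-generation Python background Cloud Function, whose Cloud Storage event arrives as a dictionary with `bucket` and `name` keys. The function names `describe_upload` and `on_file_uploaded` are hypothetical, and the routing and email steps are only marked with a comment, not implemented.

```python
from typing import Any, Dict, Optional


def describe_upload(event: Dict[str, Any]) -> str:
    """Build the gs:// URI of the uploaded object from the event payload."""
    return f"gs://{event['bucket']}/{event['name']}"


def on_file_uploaded(event: Dict[str, Any], context: Optional[Any] = None) -> str:
    """Hypothetical entry point, called once per uploaded file.

    `event` is assumed to carry the bucket and object name;
    `context` holds event metadata and is not used in this sketch.
    """
    uri = describe_upload(event)
    print(f"New file uploaded: {uri}")
    # The route calculation and the email step would start here (not shown).
    return uri
```

Because the handler is a plain function, it can be tried locally with a hand-written event dictionary, for example `on_file_uploaded({"bucket": "orders", "name": "today.csv"})`, before it is deployed.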

It is amazing how many modern business solutions can be built on the cloud with just a few services. Developers can combine cloud components to make a pipeline that goes from data ingestion to sending the results by email with just a…

Gene cluster finding, annotation curation and sequence management all in one

If the 21st century is the Age of Biology (1), then genome sequencing is the harbinger. Genome sequencing basically turns DNA molecules into text on computers. DNA sequences are stored in simple ASCII text files such as FASTA and FASTQ. Biologists then run programs over them to discover proteins (open reading frame predictions). The functions of these predicted proteins can be guessed via their close relatives in the databases (protein annotations, please read my article here and here for more information). Biologists save all this work into GBK or EMBL files like this:

Figure 1. Example of an EMBL file. Image by author.
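To illustrate how simple the plain-text sequence formats mentioned above are, here is a minimal sketch of a FASTA reader. It assumes the common convention that a header line starts with `>` and that the sequence may be wrapped over several lines. The function name `parse_fasta` and the toy records are made up for this example; for real work, an established library is the safer choice.

```python
from typing import Dict, Iterable, List


def parse_fasta(lines: Iterable[str]) -> Dict[str, str]:
    """Return a mapping from sequence ID to sequence for FASTA-formatted lines."""
    chunks: Dict[str, List[str]] = {}
    current = None
    for raw in lines:
        line = raw.strip()
        if not line:
            continue  # skip blank lines
        if line.startswith(">"):
            # The ID is the first word after '>'; the rest is a free-text description.
            parts = line[1:].split()
            current = parts[0] if parts else ""
            chunks[current] = []
        elif current is not None:
            chunks[current].append(line)
    # Join the wrapped sequence lines of each record into one string.
    return {name: "".join(parts) for name, parts in chunks.items()}


example = """>seq1 toy record
ATGC
GGTA
>seq2
TTAA
"""
records = parse_fasta(example.splitlines())
print(records)
```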

GBK or EMBL files are still just…

The missing manual for NCBI SRA in the Cloud

Biomedical researchers need to deposit their raw sequence data into one of the international nucleotide sequence databases before they can publish their results. The Sequence Read Archive (SRA) at the National Center for Biotechnology Information (NCBI) is one such database. Ever since its inception, SRA has served not only as the ultimate sequence storage but also as a data exchange platform. A biologist can search and download sequence data from other studies with ease and conduct all sorts of new analyses.

Since 2020, NCBI has also distributed SRA to Google Cloud Platform (GCP) and Amazon Web Services (AWS)…

Turbo-charge your carbohydrate genome analyses with Neo4j

This article shows how to:

1. Convert the CAZy database into Neo4j

2. Analyze the CAZynome of Formosa agariphila KMM 3901 to gain new insights

3. Build a GraphQL API for language-agnostic data access

4. Perform graph embedding and node classification for cellulose degradation prediction

What is the most abundant polymer in the world? The answer may surprise many: cellulose. It is a polysaccharide found in plant cell walls, and we make it into paper, T-shirts and cellophane. The second place is also taken by a polysaccharide: chitin. It is inside the cell walls of fungi, the scales…

Do bioinformatics without having your own computer cluster

We have entered the age of Massively Parallel Sequencing, also called next-generation sequencing (NGS).

Sequencers nowadays can sequence millions to billions of DNA fragments at once and cost only around $100 per 1 billion bases (our human genome contains 3 billion bases).

For example, the HiSeq X Ten System from Illumina consists of ten HiSeq X machines, and together they can sequence 18,000 human genomes per year at 30x coverage for less than $1000 each. In comparison, the first complete human genome took 13 years to finish in 2003 and cost $2.7 billion.
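The scale of these numbers is easier to grasp with a quick back-of-the-envelope calculation. The sketch below only uses the figures quoted above (3 billion bases per genome, 30x coverage, 18,000 genomes per year, $1000 and $2.7 billion per genome). The variable names are mine, and since $1000 is an upper bound, the resulting cost ratio is a lower bound.

```python
GENOME_BASES = 3_000_000_000        # bases in one human genome
COVERAGE = 30                       # each base is read about 30 times
GENOMES_PER_YEAR = 18_000           # yearly output quoted for the HiSeq X Ten
COST_PER_GENOME_USD = 1_000         # upper bound quoted for the HiSeq X Ten
FIRST_GENOME_COST_USD = 2_700_000_000

# Raw bases that must be sequenced for one genome at 30x coverage.
bases_per_genome = GENOME_BASES * COVERAGE

# Raw bases produced by the whole system in one year.
bases_per_year = bases_per_genome * GENOMES_PER_YEAR

# How many times cheaper a genome is compared with the first one.
cost_ratio = FIRST_GENOME_COST_USD // COST_PER_GENOME_USD

print(f"Bases per genome at 30x: {bases_per_genome:,}")
print(f"Bases per year: {bases_per_year:,}")
print(f"Cost reduction: at least {cost_ratio:,}-fold")
```

That is 90 billion raw bases per genome and about 1.6 quadrillion bases per year, at a per-genome cost at least 2.7 million times lower than that of the first human genome.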

Massively Parallel Sequencing demands Massively Parallel…

Sixing Huang

Certified Neo4j Professional, bioinformatician at DSMZ, Germany. I want to learn more about Cloud, machine learning and Japanese.
