The world is moving quickly toward cloud computing, because the cloud is easy to use, inexpensive, accessible and secure. Cloud providers such as Amazon Web Services (AWS) take over many repetitive, tedious IT maintenance tasks for their customers, so cloud users can focus on their own business logic. The cloud also provides different pricing options that may be more cost-effective than on-premises servers. In addition, unlike on-premises servers, cloud resources can be reached securely over the internet.

In this tutorial, I am going to show you how I managed a serverless NCBI taxonomy…

Combine Maps, OR-Tools, SendGrid and Cloud Functions to commandeer a delivery fleet

This article shows how to:

1. Set up a Cloud Storage bucket in GCP that triggers a Cloud Function when a file is uploaded;

2. Set up a Cloud Function that calculates the optimal routing strategies with Google Maps and Google OR-Tools;

3. Send instruction emails to the carriers via SendGrid;

4. Set up Cloud Build for continuous deployment.
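
Step 1 can be sketched as a Python function. This is a minimal sketch, assuming a first-generation Cloud Function with a Cloud Storage trigger, which receives the uploaded object's metadata as a dict with `bucket` and `name` keys. The names `on_upload` and `gcs_uri` and the bucket in the usage line are hypothetical, and the routing and email steps are left as a comment.

```python
def gcs_uri(event):
    """Build the gs:// URI of the uploaded object from the trigger event."""
    return f"gs://{event['bucket']}/{event['name']}"


def on_upload(event, context=None):
    """Entry point (hypothetical name) for a Cloud Storage upload trigger.

    `event` carries the object metadata; `context` carries event metadata
    and is unused in this sketch.
    """
    uri = gcs_uri(event)
    # Assumption: a real function would read the file here, compute the
    # routes and send the emails. This sketch only reports the upload.
    print(f"New file uploaded: {uri}")
    return uri


# Local usage with a hand-made event (hypothetical bucket and file name).
result = on_upload({"bucket": "example-bucket", "name": "orders/today.csv"})
```

Calling the handler locally with a hand-made dict, as in the last line, is a quick way to check the logic before deploying.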

It is amazing how many modern business solutions can be built on the cloud with just a few services. Developers can combine cloud components into a pipeline that goes from data ingestion to sending the results by email with just a…

Gene cluster finding, annotation curation and sequence management all in one

If the 21st century is the Age of Biology (1), then genome sequencing is its harbinger. Genome sequencing essentially turns DNA molecules into text on a computer. DNA sequences are stored in simple ASCII text files such as FASTA and FASTQ. Biologists then run programs over them to predict proteins (open reading frame prediction). The functions of these predicted proteins can be inferred from their close relatives in the databases (protein annotation; please read my articles here and here for more information). Biologists save all of this work in GBK or EMBL files like this:

Figure 1. Example of an EMBL file. Image by author.
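
To make the FASTA and open reading frame steps mentioned above more concrete, here is a naive sketch in plain Python. It is an illustration under simple assumptions (forward strand only, `ATG` start, standard stop codons), not the prediction software used in practice, and the function names and the toy sequence are made up for this example.

```python
def parse_fasta(text):
    """Parse FASTA-formatted text into a dict of {header: sequence}."""
    records = {}
    header = None
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            header = line[1:]
            records[header] = []
        elif header is not None:
            records[header].append(line.upper())
    return {name: "".join(parts) for name, parts in records.items()}


STOP_CODONS = {"TAA", "TAG", "TGA"}


def find_orfs(seq, min_codons=2):
    """Return forward-strand ORFs (ATG to stop codon) in all three frames.

    An ORF is kept if it has at least `min_codons` codons before the stop.
    """
    orfs = []
    for frame in range(3):
        start = None
        for i in range(frame, len(seq) - 2, 3):
            codon = seq[i:i + 3]
            if start is None and codon == "ATG":
                start = i
            elif start is not None and codon in STOP_CODONS:
                if (i - start) // 3 >= min_codons:
                    orfs.append(seq[start:i + 3])
                start = None
    return orfs


# Toy record (made up for illustration), split over two sequence lines.
records = parse_fasta(">toy\nccATGAAA\nTTTTAGgg\n")
orfs = find_orfs(records["toy"])
```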

GBK or EMBL files are still just…

The missing manual for NCBI SRA in the Cloud

Biomedical researchers need to deposit their raw sequence data into one of the international nucleotide sequence databases before they can publish their results. The Sequence Read Archive (SRA) in the National Center for Biotechnology Information (NCBI) is one such database. Ever since its inception, SRA has not only served as the ultimate sequence storage but also as a data exchange platform. A biologist can search and download sequence data from other studies with ease and conduct all sorts of new analyses.

Since 2020, NCBI has also distributed SRA to Google Cloud Platform (GCP) and Amazon Web Services (AWS)…

Turbo-charge your carbohydrate genome analyses with Neo4j

This article shows how to:

1. Convert the CAZy database into Neo4j

2. Analyze the CAZynome of Formosa agariphila KMM 3901 to gain new insights

3. Build a GraphQL API for language-agnostic data access

4. Perform graph embedding and node classification for cellulose degradation prediction

What is the most abundant polymer in the world? The answer may surprise many: cellulose. It is a polysaccharide found in plant cell walls, and we make it into paper, T-shirts and cellophane. Second place is also taken by a polysaccharide: chitin. It is inside the cell walls of fungi, the scales…

Do bioinformatics without having your own computer cluster

We have entered the age of Massively Parallel Sequencing, also called next-generation sequencing (NGS).

Sequencers nowadays can sequence millions to billions of DNA fragments at once, at a cost of only around $100 per 1 billion bases (the human genome contains 3 billion bases).

For example, the HiSeq X Ten System from Illumina consists of ten HiSeq X machines that together can sequence 18,000 human genomes per year at 30x coverage for less than $1000 each. In comparison, the first complete human genome took 13 years to sequence, was finished in 2003 and cost $2.7 billion.
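
The scale of these numbers is easier to grasp with a little arithmetic. The sketch below uses only the figures quoted above; the variable names are my own, and $1000 is treated as an upper bound per genome.

```python
GENOME_BASES = 3_000_000_000       # human genome size, from the text
COVERAGE = 30                      # 30x coverage, from the text
GENOMES_PER_YEAR = 18_000          # HiSeq X Ten throughput, from the text

# Raw bases that must be read for one genome at 30x coverage.
bases_per_genome = GENOME_BASES * COVERAGE

# Raw bases read by the whole system in one year.
bases_per_year = bases_per_genome * GENOMES_PER_YEAR

FIRST_GENOME_COST = 2_700_000_000  # USD, first complete human genome
HISEQ_GENOME_COST = 1_000          # USD, upper bound ("less than $1000")

# Lower bound on how many times cheaper a genome has become.
cost_ratio = FIRST_GENOME_COST // HISEQ_GENOME_COST

print(f"Bases per 30x genome: {bases_per_genome:,}")
print(f"Bases per year:       {bases_per_year:,}")
print(f"Cost reduction:       at least {cost_ratio:,}-fold")
```

In other words, one 30x genome means reading 90 billion bases, and the price per genome has dropped by a factor of at least 2.7 million.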

Massively Parallel Sequencing demands Massively Parallel…

New data reveal how the pandemic has hit the Japanese travel industry

This article shows how to:

1. Download the newest tourism data from official databases in Japan.

2. Use Pandas and Facebook Prophet to analyze the data.

3. Show the impact of COVID-19 on the hotel industry, foreign visitors and household spending.

Figure 1. Corona cloud over Mount Fuji, taken by the Author on 17 January, 2020 at Motosuko.

As I browsed the photos from my last trip to Japan in January 2020, one photo stood out: a naturally occurring, corona-shaped cloud covered Mount Fuji (Figure 1). In retrospect, its symbolism hits too close to home. In fact, the first case of COVID-19 in Japan was confirmed two days earlier. Less than a week later, on 23 January, Wuhan…

How to use TF-IDF, XGBoost and SHAP to classify and explain metagenomes

This article shows how to:

1. Build a TF-IDF and XGBoost pipeline with GridSearchCV for metagenome sample typing.

2. Use TF-IDF to vectorize taxonomic profile for modeling.

3. Build an XGBoost model with high F1-scores.

4. Use SHAP to see feature contributions and identify distinct taxa for each sample type
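
The TF-IDF idea in step 2 can be sketched without any library. This is a hand-rolled, textbook variant (term frequency times `log(N / df)`), not the article's pipeline. scikit-learn's `TfidfVectorizer` uses a smoothed, normalized formula by default, so its numbers will differ. The taxon names are placeholders.

```python
import math
from collections import Counter


def tf_idf(profiles):
    """Compute textbook TF-IDF weights for a list of token lists.

    Each profile is a list of taxon names, one entry per observation.
    Returns one {taxon: weight} dict per profile.
    """
    n_docs = len(profiles)
    # Document frequency: in how many profiles does each taxon occur?
    doc_freq = Counter()
    for tokens in profiles:
        doc_freq.update(set(tokens))

    weights = []
    for tokens in profiles:
        counts = Counter(tokens)
        total = len(tokens)
        weights.append({
            taxon: (count / total) * math.log(n_docs / doc_freq[taxon])
            for taxon, count in counts.items()
        })
    return weights


# Placeholder taxonomic profiles for two samples.
profiles = [
    ["taxon_a", "taxon_a", "taxon_b"],
    ["taxon_a", "taxon_c"],
]
weights = tf_idf(profiles)
```

A taxon that occurs in every sample (`taxon_a` here) gets a weight of zero, while taxa specific to one sample get positive weights. That is why TF-IDF helps to highlight the taxa that distinguish sample types.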


Ever since Antonie van Leeuwenhoek first laid eyes on the microbial world, we humans have been obsessed with it (learn more with the classic Microbe Hunters). In less than 400 years, we went from ignorance to realizing their key roles in our health and the global biochemical cycles. It was estimated…

Analyzing the KEGG Disease Data with a Graph Database

This article shows how to:

1. Use Neo4j to get quick overviews over the KEGG Disease database.

2. Identify multipurpose drugs.

3. Show details about some pathogens such as SARS-CoV-2.

4. Form disease communities with Louvain and discover the most connected diseases with PageRank. SARS and COVID-19 are isolated “islands” separated from other large disease clusters.
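
For readers unfamiliar with PageRank from step 4, the following is a library-free sketch of the power-iteration idea on a toy graph. It is not the Neo4j procedure used in the article. The node names are placeholders, the damping factor of 0.85 is the conventional default, and every neighbour is assumed to also appear as a key in the graph.

```python
def pagerank(graph, damping=0.85, iterations=50):
    """Power-iteration PageRank on a dict {node: [outgoing neighbours]}."""
    nodes = list(graph)
    n = len(nodes)
    rank = {node: 1.0 / n for node in nodes}
    for _ in range(iterations):
        # Every node receives the same small "teleport" share.
        new_rank = {node: (1.0 - damping) / n for node in nodes}
        for node, neighbours in graph.items():
            if neighbours:
                # Split this node's rank evenly among its neighbours.
                share = damping * rank[node] / len(neighbours)
                for target in neighbours:
                    new_rank[target] += share
            else:
                # Dangling node: spread its rank evenly over all nodes.
                for target in nodes:
                    new_rank[target] += damping * rank[node] / n
        rank = new_rank
    return rank


# Toy star graph: one hub linked in both directions to three leaves.
graph = {
    "hub": ["a", "b", "c"],
    "a": ["hub"],
    "b": ["hub"],
    "c": ["hub"],
}
ranks = pagerank(graph)
top_node = max(ranks, key=ranks.get)
```

The best-connected node ends up with the highest score, which is how PageRank surfaces the most connected diseases in a graph.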

Disclaimer: This article does not provide medical advice. It is intended for informational purposes only. It is not a substitute for professional medical advice, diagnosis or treatment.

COVID-19 has the world by the short hairs. This contagious disease has taken a heavy toll on…

In my previous articles “Serve NCBI Taxonomy in AWS, Serverlessly” and “Build Your Own GraphQL GenBank in AWS”, I outlined my approaches to putting up NCBI’s resources as REST or GraphQL APIs in AWS. They were straightforward, but they were also the hard way to do things: readers had to click and type a lot before the services were up and running, an error-prone and boring process.

Do it once, OK. Do it twice, no.

What if we could describe all these steps in some scripts and just let software run through them and also build all those…

Sixing Huang

Certified Neo4j Professional, bioinformatician at DSMZ, Germany. I want to learn more about Cloud, machine learning and Japanese.
