The world is moving to cloud computing fast, because the cloud is easy to use, affordable, accessible and secure. Cloud providers such as Amazon Web Services (AWS) take over many repetitive, tedious IT maintenance tasks for their customers. As a result, cloud users can focus on their own business logic. The cloud also provides different pricing options that may be more cost-effective than on-premises servers. In addition, unlike its on-premises counterparts, the cloud can be reached securely from anywhere over the internet.
This article shows how to:
1. Build a TF-IDF and XGBoost pipeline with GridSearchCV for metagenome sample typing.
2. Use TF-IDF to vectorize taxonomic profiles for modeling.
3. Build an XGBoost model with high F1-scores.
4. Use SHAP to see feature contributions and identify distinct taxa for each sample type.
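To make the TF-IDF step a bit more concrete, here is a minimal plain-Python sketch of the idea: each sample is treated as a "document" whose "words" are taxon labels, so taxa found in every sample are down-weighted and rarer ones stand out. The sample profiles below are invented for illustration, and the smoothed IDF formula with L2 normalization is one common variant, not necessarily the exact settings used in the article's pipeline.

```python
import math
from collections import Counter


def tfidf(samples):
    """Return one {taxon: weight} dict per sample, L2-normalized."""
    n = len(samples)
    # Document frequency: in how many samples each taxon appears.
    df = Counter(taxon for sample in samples for taxon in set(sample))
    # Smoothed inverse document frequency (assumed variant).
    idf = {t: math.log((1 + n) / (1 + d)) + 1 for t, d in df.items()}
    vectors = []
    for sample in samples:
        tf = Counter(sample)
        raw = {t: count * idf[t] for t, count in tf.items()}
        norm = math.sqrt(sum(w * w for w in raw.values()))
        # Guard against empty samples, which have a zero norm.
        vectors.append({t: (w / norm if norm else 0.0) for t, w in raw.items()})
    return vectors


# Hypothetical taxonomic profiles: one list of taxon labels per sample.
profiles = [
    ["Bacteroides", "Bacteroides", "Escherichia"],
    ["Prochlorococcus", "Pelagibacter", "Escherichia"],
    ["Bacteroides", "Faecalibacterium", "Escherichia"],
]
vectors = tfidf(profiles)
```

In the second made-up sample, "Prochlorococcus" ends up with a larger weight than "Escherichia", because the latter appears in all three samples. The resulting vectors are the kind of input a classifier could then be trained on.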
Ever since Antonie van Leeuwenhoek first laid eyes on the microbial world, we humans have been obsessed with it (learn more in the classic Microbe Hunters). In less than 400 years, we went from ignorance to realizing the key roles microbes play in our health and in the global biogeochemical cycles. It was estimated…
This article shows how to:
1. Use Neo4j to get quick overviews of the KEGG Disease database.
2. Identify multipurpose drugs.
3. Show details about some pathogens such as SARS-CoV-2.
4. Form disease communities with Louvain and discover the most connected diseases with PageRank. SARS and COVID-19 are isolated “islands” separated from other large disease clusters.
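As a rough illustration of what "most connected" means in the PageRank step, here is a small plain-Python sketch using power iteration on an undirected toy graph. It is not Neo4j's implementation, and the node names are placeholders rather than real KEGG entries; the graph has one hub with three neighbors plus an isolated pair, loosely mimicking an "island" next to a larger cluster.

```python
def pagerank(edges, damping=0.85, iterations=50):
    """Power-iteration PageRank on an undirected graph given as edge pairs."""
    neighbours = {}
    for a, b in edges:
        neighbours.setdefault(a, set()).add(b)
        neighbours.setdefault(b, set()).add(a)
    n = len(neighbours)
    # Start from a uniform distribution over all nodes.
    rank = {node: 1.0 / n for node in neighbours}
    for _ in range(iterations):
        new_rank = {}
        for node in neighbours:
            # Each neighbour shares its rank equally among its own neighbours.
            incoming = sum(rank[m] / len(neighbours[m]) for m in neighbours[node])
            new_rank[node] = (1 - damping) / n + damping * incoming
        rank = new_rank
    return rank


# Hypothetical graph: a hub with three neighbours and an isolated pair.
edges = [("hub", "d1"), ("hub", "d2"), ("hub", "d3"), ("x", "y")]
ranks = pagerank(edges)
top_node = max(ranks, key=ranks.get)
```

Here `top_node` is the hub, since it receives rank from three neighbors. The two isolated nodes only pass rank back and forth between themselves.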
Disclaimer: This article does not provide medical advice. It is intended for informational purposes only. It is not a substitute for professional medical advice, diagnosis or treatment.
COVID-19 has the world by the short hairs. This contagious disease has taken a heavy toll on our lives…
In my previous articles “Serve NCBI Taxonomy in AWS, Serverlessly” and “Build Your Own GraphQL GenBank in AWS”, I outlined my approaches to serving NCBI’s resources as REST or GraphQL APIs in AWS. They were straightforward, but they were also the hard way to do things: followers had to click and type a lot before the services were up and running, an error-prone and boring process.
Do it once, OK. Do it twice, no.
What if we could describe all these steps in some scripts and just let a piece of software run through them and also build all those…
As a museum goer, I have slowly become acquainted with the works of Vermeer, Rembrandt, Caravaggio and other prolific old masters. Even though I am a casual art lover and their works cover a wide range of subjects, from biblical and mythical to genre art, there seem to be certain individual styles that distinguish one from another. In fact, these unique styles became their signatures and are well documented online. But more often than not, I was surprised by how different some first-seen Rembrandts are from his other works. …
The size of a scientific study: GB of raw data get turned into MB of Excel tables, probably written up as KB of manuscript, and certainly abstracted into a few bytes of sentences.
In 2019, I worked on a project commissioned by the Convention on Biological Diversity (CBD) on digital sequence information (DSI) in public and private databases. During the data gathering phase, I needed to download both the GenBank and Whole Genome Shotgun (WGS) data from NCBI. That was four TB of compressed data in GBK format. GBK is an all-in-one format used widely in bioinformatics. It…
A genome is the sum of all genetic material inside an organism, be it a virus, a bacterium, or a human. It is the book of life written in just four letters: A, T, C, and G. However simple its alphabet is, this book contains all the instructions its owner needs to reproduce and to survive. Just like a book with many sentences, a genome contains many genes. With these genes, the cell can manufacture proteins to construct itself and fulfill various biochemical functions. …
Finding a new home is a big, daunting task. Buyers are bombarded with information. Apart from the usual data provided by the brokers, buyers usually like to know something more about the neighborhood, the so-called geospatial data or “geo”. This ranges from basic and general questions such as “how many supermarkets and schools are around?” to more personal ones like “how far is it from my workplace and how long does it take to drive there?”.
If the list is short, it is possible to answer all these questions with a few searches in Google Maps or even…
I am currently working at DSMZ in Germany as a bioinformatician. I want to learn more about the cloud, machine learning and Japanese.