Carp in the Soil

Ridiculous sequencing results revealed how errors propagated from one research study to a global database

Sixing Huang
Jun 30, 2021

Garbage in, garbage out. But first you need to know what garbage looks like.

Figure 1. Carp in the soil. https://en.wikipedia.org/wiki/File:Cyprinus_carpio.jpeg

Last year, when we were working on a publication about three Cyanobacteria, my colleague Pia Marter told me that our three metagenome-assembled genomes (MAGs) contained some DNA fragments from Cyprinus carpio (common carp). My first thought was that this was not surprising, because our Cyanobacteria were aquatic. But then she told me that other colleagues at DSMZ had also reported carp in their samples. And those samples came from anything but water: forest soils, fish pathogens and plants. All of these projects were sequenced on the Illumina platform, some on the NextSeq 550 and others on the NovaSeq 6000.

So we had carp in the soil?

It immediately reminded me of the “Phantom of Heilbronn”, a.k.a. the “Woman Without a Face”. She was a hypothesized female serial killer whose DNA appeared at over 40 crime scenes from Austria to Germany to France between 1993 and 2009. In the end, it became clear that the DNA came from a factory worker who had contaminated the cotton swabs that were later used in the crime scene investigations. The story was even fictionalized in an episode of CSI: NY. So I thought that the carp could come from some contamination during the sequencing processes on our campus.

But that hypothesis was short-lived, and what replaced it was not good. Another colleague tested his samples and told me that his contaminated samples had not been sequenced on our campus. This sounded the alarm of a large-scale contamination. Since my institute, DSMZ, relies heavily on Illumina sequencing, a rampant contamination like this could derail many of our projects. This prompted me to start an investigation into the issue immediately.

I collected samples from several colleagues as well as some samples from the Sequence Read Archive (SRA) at NCBI. They were sequenced on NextSeq and NovaSeq machines. My goal was to estimate the scope of the contamination and identify its causes. It would be great if I could find a way to decontaminate the samples, too. This article documents my investigation.

1. Collect the carp reads

I went to my colleague Selma Gomes Vieira for some of her carp-infested soil DNA samples. I opened one of her Fastq files and got hit by many of the spurious sequences inside, like these:

List 1. Some spurious sequences from one of our sequencing projects at DSMZ. The first couple of base pairs have been removed to show the poly-G tails.

They look unnatural. Firstly, the poly-G tails are conspicuous. The preceding fragments look random, yet they were remarkably similar to one another when I compared them. They are numerous too, making up nearly 70% of the sample. Because soil is biologically highly diverse, I would never have expected one sequence to dominate a sample like this. So I took some of the fragments to NCBI BLAST and here were the results:

Figure 2. The BLAST results of one of the spurious carp sequences. Image by author.

The long hits (Acc. Len) all came from one organism: Cyprinus carpio, the common carp. So it seemed that these spurious sequences were the source of our carp problem.

The tool fastp from OpenGene shows the writing on the wall as well:

Figure 3. Fastp report on one of Selma’s samples. Image by author.

Figure 3 shows the fastp statistics of Selma’s sample. The “G” curve, with excellent quality scores, towers over the other three bases, especially after position 70. Position 70 is also where many of the poly-G tails start in the sample. We can also observe downward trends in base calling quality for the other three bases, but not for G.
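The kind of per-position base content that such a report plots can be approximated in a few lines. The following is only a minimal sketch, not how fastp itself computes its statistics; the function name `per_position_base_fraction` and the toy reads are made up for illustration.

```python
from collections import Counter


def per_position_base_fraction(reads, base="G"):
    """Return, for each read position, the fraction of reads carrying `base`."""
    if not reads:
        return []
    longest = max(len(r) for r in reads)
    fractions = []
    for pos in range(longest):
        # Only reads long enough to cover this position are counted.
        column = Counter(r[pos] for r in reads if len(r) > pos)
        total = sum(column.values())
        fractions.append(column[base] / total)
    return fractions


# Toy reads (invented): a short insert followed by a poly-G tail.
toy_reads = ["ACGTGGGG", "TTCAGGGG", "CATGGGGG"]
print(per_position_base_fraction(toy_reads))
```

In a contaminated sample, the G fraction would be expected to climb sharply toward the end of the reads, which is the pattern described for position 70 onward.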

2. The “carp” were primer-adapter sequences

So why did these carp sequences contain the long, conspicuous “GGGGG” tails, and where did the carp come from?

The poly-G, as it turned out, was the symptom and not the cause. Earlier Illumina platforms such as HiSeq and MiSeq used four colors to label the four nucleotides. With the introduction of the NextSeq 550 and NovaSeq 6000, Illumina switched to a 2-channel system. That is, only two dyes are used: red for C, green for T, “red + green” for A and “no color” for G.

It is a clever design, no doubt. But it has one fatal drawback: in a sequencing run, these machines always generate 150 bases for every DNA fragment (the value of 150 can be adjusted). If a fragment has only 50 bases, no color is emitted during the last 100 cycles and the machines interpret all 100 as Gs. In fact, these Gs are so dark that they are the best Gs you can find, which prevents us from recognizing them as phantoms based on their quality scores alone. So the sequences in List 1 are in fact just 50 to 60 bases long.
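To make the mechanism concrete, here is a toy simulation of 2-channel base calling under the color scheme described above. It is a deliberately simplified sketch, not Illumina's actual base caller; `simulate_read` and the example fragment are invented for illustration.

```python
# (red, green) signal pattern -> base call, following the scheme described above.
TWO_CHANNEL_CALLS = {
    (True, False): "C",   # red only
    (False, True): "T",   # green only
    (True, True): "A",    # red + green
    (False, False): "G",  # no signal at all
}
BASE_TO_SIGNAL = {base: signal for signal, base in TWO_CHANNEL_CALLS.items()}


def simulate_read(fragment, cycles=150):
    """Call `cycles` bases for a fragment that may be shorter than the run."""
    calls = []
    for cycle in range(cycles):
        if cycle < len(fragment):
            signal = BASE_TO_SIGNAL[fragment[cycle]]
        else:
            # Nothing left to sequence: a dark cycle with no color emitted.
            signal = (False, False)
        calls.append(TWO_CHANNEL_CALLS[signal])
    return "".join(calls)


fragment = "ACGT" * 12 + "AC"  # a hypothetical 50-base fragment
read = simulate_read(fragment, cycles=150)
print(read[:50])   # the real bases
print(read[50:])   # 100 phantom Gs from the dark cycles
```

Because a dark cycle and a genuine G produce the same signal pattern in this model, the caller has no way to tell them apart.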

Then what are those first few dozen bases? Selma and my colleague Boyke Bunk figured it out: they are primer-adapter sequences from Bartram et al. They are auxiliary DNA fragments that are prepended to the target sequences and enable the sequencing. So we hypothesize that during the sequencing preparation, some of the target DNA molecules were degraded. In the machine, the sequencing then ran right into the auxiliary sequence at the opposite end of the library fragments. The built-in adapter trimming in the Illumina FASTQ file generation pipelines can perhaps only recognize terminal adapters. In Selma’s case, the adapters were embedded deeply in the sequences and therefore evaded the trimming.

Figure 4. Sequences with poly-G tails contain primer and adapter sequences. Image by author. The sequences were designed by Bartram et al.

Figure 4 shows that the Illumina TruSeq adapter read 1 AGATCGGAAGAGCACACGTCTGAACTCCAGTCA could be found in Selma’s forward reads, while adapter read 2 AGATCGGAAGAGCGTCGTGTAGGGAAAGAGTGT could be found in her reverse reads. It is also clear that they are identical to the carp sequences mentioned previously.
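A quick way to check a read for both signatures, an embedded adapter and a trailing poly-G run, is sketched below. It is a rough illustration only: it looks for exact adapter matches, whereas real trimming tools tolerate mismatches, and the function name `inspect_read`, the `min_poly_g` threshold of 10 and the toy read are assumptions made up for this example.

```python
import re

# TruSeq adapter sequences as quoted in the text above.
TRUSEQ_R1 = "AGATCGGAAGAGCACACGTCTGAACTCCAGTCA"
TRUSEQ_R2 = "AGATCGGAAGAGCGTCGTGTAGGGAAAGAGTGT"


def inspect_read(seq, adapter, min_poly_g=10):
    """Report where an exact adapter match and a trailing poly-G run start.

    A value of -1 means the feature was not found.
    """
    adapter_start = seq.find(adapter)
    # A run of at least `min_poly_g` Gs that reaches the end of the read.
    tail = re.search(r"G{%d,}$" % min_poly_g, seq)
    return {
        "adapter_start": adapter_start,
        "poly_g_start": tail.start() if tail else -1,
    }


# Invented forward read: short insert, embedded adapter, filler, poly-G tail.
toy_read = "ACGTACGTAC" + TRUSEQ_R1 + "ATCTCGTAT" + "G" * 20
print(inspect_read(toy_read, TRUSEQ_R1))
```

For reverse reads, the same function would be called with `TRUSEQ_R2`.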

I then googled a bit and found out that the issue has been known since at least 2016!

3. The scope of the contamination

After the revelation, I set out to investigate how general the problem was. I downloaded several NextSeq and NovaSeq projects from SRA at NCBI and filtered them through fastp. I estimated the contamination by observing the differences in file sizes before and after the filtering:

The sequences from ERR4031234 suffer from low quality, and many reads are discarded by fastp. The other projects are contaminated to a smaller extent. Still, this suggests that data submitted to SRA may need to be decontaminated before any further analysis.
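The file-size comparison described above can be sketched as follows. This is a rough proxy only: it assumes both files are stored in the same format (for example both uncompressed FASTQ), and the helper name `fraction_removed` and the file names in the comment are hypothetical.

```python
import os


def fraction_removed(raw_path, filtered_path):
    """Estimate the share of data removed by filtering from file sizes alone."""
    raw_size = os.path.getsize(raw_path)
    filtered_size = os.path.getsize(filtered_path)
    if raw_size == 0:
        raise ValueError("raw file is empty")
    return (raw_size - filtered_size) / raw_size


# Hypothetical usage with made-up file names:
# share = fraction_removed("sample_R1.fastq", "sample_R1_filtered.fastq")
# print(f"{share:.1%} of the data was removed by filtering")
```

Counting reads or bases before and after filtering would give a more precise estimate than file sizes, at the cost of parsing the files.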

4. How the sequences masqueraded as the carp into NCBI

So how did the sequences make their way into NCBI? The carp genome was published in Nature Genetics in 2014. The authors, Xu et al., “performed whole-genome shotgun sequencing on three next-generation sequencing platforms, Roche 454, Illumina …”. They also discarded low-quality and short reads before they mapped the sequences onto the reference genome.

One possible scenario is that Xu et al. did not filter out the adapter sequences and later submitted the contaminated sequences to NCBI, and that NCBI somehow did not red-flag the adapter sequences either. It is unclear, though, how the heavily contaminated Illumina sequences have affected the results and conclusions of Xu et al.

In fact, Xu et al. are not alone. A simple BLAST with the adapter sequence AGATCGGAAGAGCACACGTCTGAACTCCAGTCA returned a page full of contaminated sequences in NCBI’s database:

5. How to get rid of the artefacts

So here comes the remedy. We can use the tool fastp to clean up the dubious sequences very quickly.

The command trims sequences with poly-G tails and the TruSeq adapters. It also discards sequences shorter than 100 base pairs (you can change the values to suit your needs). The results are in the two “_filtered.fastq” files. As a bonus, the command also collects the failed sequences into failed.fastq.
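A fastp invocation matching that description might look like the sketch below, assembled in Python so it can be inspected before it is run. It is not necessarily the exact command used in this investigation: the input file names are hypothetical, and the options shown (`--trim_poly_g`, `--adapter_sequence`, `--adapter_sequence_r2`, `--length_required`, `--failed_out`) should be checked against the help text of your installed fastp version.

```python
import shlex

# TruSeq adapter sequences as quoted earlier in the article.
TRUSEQ_R1 = "AGATCGGAAGAGCACACGTCTGAACTCCAGTCA"
TRUSEQ_R2 = "AGATCGGAAGAGCGTCGTGTAGGGAAAGAGTGT"


def build_fastp_command(r1, r2, min_length=100):
    """Build a fastp argument list for paired-end cleanup (nothing is executed)."""
    return [
        "fastp",
        "--in1", r1,
        "--in2", r2,
        "--out1", r1.replace(".fastq", "_filtered.fastq"),
        "--out2", r2.replace(".fastq", "_filtered.fastq"),
        "--trim_poly_g",                        # cut phantom poly-G tails
        "--adapter_sequence", TRUSEQ_R1,        # adapter for forward reads
        "--adapter_sequence_r2", TRUSEQ_R2,     # adapter for reverse reads
        "--length_required", str(min_length),   # drop reads shorter than this
        "--failed_out", "failed.fastq",         # keep discarded reads for review
    ]


# Hypothetical input file names.
cmd = build_fastp_command("sample_R1.fastq", "sample_R2.fastq")
print(shlex.join(cmd))
```

The printed line can be pasted into a shell, or the list can be passed to `subprocess.run(cmd, check=True)` once fastp is installed.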

I ran some sanity checks on the results and they are indeed free of the common carp. The BLAST results of the final assembled contigs are also now “carp”-free.

Conclusion

So after all this drama, it turns out that the carp are the symptom and not the cause. They even revealed that the problem is global: not every researcher rigorously checks sequences for abnormalities before submitting them to NCBI. Although the gatekeepers at NCBI have recently implemented some quality checks at sequence submission, some new submissions still fall through the cracks. Old submissions such as the common carp genome have even been integrated into the BLAST database.

In a broader sense, our cautionary tale reiterates the fact that scientific research is an error-prone activity. The results from Xu et al. 2014 are at best questionable, at worst unusable. The lesson for the rest of us is that best practices, such as a standard quality filter pipeline, need to be enforced universally. Only when all researchers adhere to such a high standard can we run analyses on reliable data and stop errors from propagating into the global databases. In addition, the gatekeepers at NCBI need to purge the existing contaminants from the databases. They should also show clear warnings in BLAST for common contaminants instead of returning seemingly legitimate hits.

If you see “carp” in your Illumina samples, from soil or water, check your sequences!

I thank Pia Marter, Selma Gomes Vieira, Jörn Petersen and Boyke Bunk for their support with this article.

Sixing Huang

A Neo4j Ninja, German bioinformatician in Gemini Data. I like to try things: Cloud, ML, satellite imagery, Japanese, plants, and travel the world.