Last year, Gene Myers shared on GitHub his fork of GenomeScope v2.0, GENESCOPE.FK, which works on truncated k-mer spectra calculated by his new ultrafast k-mer counter FastK. I immediately started writing this blog post and then I forgot about it. So, with a year's delay, here is a test of the software on a difficult genome, compared to the tools I routinely use.
The difficult benchmarking genome I picked is the triploid marbled crayfish genome (Gutekunst et al. 2018). The genome contains several super-repetitive genomic k-mers that were messing up all the initial k-mer analyses of the genome (check this older blog post).
The aim of this blog post is to:
clarify the difference between the two approaches to calculating k-mer spectra
compare the quality of the GenomeScope fits to the two spectra
benchmark the speed of the two k-mer counters
BONUS: the effect of trimming on the quality of the fit
BONUS 2: a comparison to another, even faster k-mer counter, ntCard
To see exactly how I did the analyses for this blog post, see the Methods section below.
The practical difference between KMC and FastK
KMC is a k-mer counter that "gives up" on k-mer coverages over a threshold specified by the -cs parameter. Every k-mer with coverage higher than that is simply reported as having the capped coverage. This is what causes problems when the real coverage of plenty of genuine genomic k-mers is a lot higher than the cap (as discussed in that already linked older blog post).
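For illustration, a full KMC run with an explicit, very high cap might look roughly like this (file names, resource limits and the cap value are placeholders, not the exact commands used for this post):

```sh
# list of input fastq files, one per line (placeholder path)
ls reads/*.fastq.gz > FILES

# count 17-mers with 16 threads (-t16) and a 120 GB memory limit (-m120),
# keeping exact counts up to a very high cap (-cs)
kmc -k17 -t16 -m120 -ci1 -cs500000000 @FILES crayfish_k17 tmp/

# export the k-mer spectrum as a histogram, up to the same cap
kmc_tools transform crayfish_k17 histogram KMC_k17_full.hist -cx500000000
```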
FastK, on the other hand, uses a different strategy: it calculates exact coverages in a fixed range from 1 to ~2^15 (32768x), and for all k-mers with coverage above that, it records only the total count. Furthermore, Histex (the FastK tool for making histograms) produces histograms of the frequencies of all coverage values up to a chosen limit (by default 100x). The frequency in the last bin does not represent the number of distinct k-mers with that coverage or higher (as it would in KMC), but the theoretical number of distinct k-mers at exactly that coverage that would generate the same total coverage as the collapsed repetitive k-mers.
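The corresponding FastK/Histex run might look along these lines (names are placeholders, and the exact option set may differ between FastK versions, so treat this as a sketch rather than the exact commands):

```sh
# count 17-mers with 16 threads; -t1 keeps a table of all k-mers seen at least once
FastK -v -k17 -t1 -T16 reads/*.fastq.gz -Ncrayfish_k17

# produce a GENESCOPE.FK-ready histogram, collapsed at 100x by default
Histex -G crayfish_k17 > FastK_k17_default.hist
```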
To demonstrate what I mean, check the following R snippet:

# load the two histograms
fastK <- read.table('histograms/FastK_k17_default.hist', col.names = c('cov', 'freq'))
tail(fastK)
#     cov       freq
# 95   95    2425866
# 96   96    2374469
# 97   97    2324463
# 98   98    2274766
# 99   99    2228325
# 100 100 1548557290

KMC <- read.table('histograms/KMC_k17_full.hist', col.names = c('cov', 'freq'))
tail(KMC)
#              cov freq
# 219790 137051330    1
# 219791 137414120    1
# 219792 195670290    1
# 219793 197284708    1
# 219794 388684031    1
# 219795 393157371    1
# this histogram has a long tail of a few k-mers with extremely high coverage

# the first 99 values of the two histograms are the same
all(fastK$freq[1:99] == KMC$freq[1:99])
# [1] TRUE

# however, the high-frequency k-mers behave a bit differently
KMC_high_freq <- KMC[100:nrow(KMC), ]
sum(KMC_high_freq$cov * KMC_high_freq$freq)
# [1] 154855729017
# the total coverage of all KMC k-mers with >= 100x, divided by 100,
# is (up to rounding) the FastK frequency reported at 100x
fastK$freq[100]
# [1] 1548557290
So the difference is that while KMC can inform us about the exact frequencies of k-mers, and hence their repetitiveness, FastK is limited to 32768x. However, although we needed to count all k-mers explicitly with KMC to get the right genome size estimate, FastK's trick does not hinder the estimate, because the genome size is practically sum(cov * freq) / (1n_kmer_coverage * ploidy), and the collapsed last bin preserves that sum.
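As a minimal sketch of that last point, here is how the back-of-the-envelope genome size could be computed in R from the FastK histogram. The 1n k-mer coverage below is a made-up placeholder; in practice you would take the value fitted by GenomeScope:

```r
fastK <- read.table('histograms/FastK_k17_default.hist', col.names = c('cov', 'freq'))

# total k-mer coverage in the read set; the collapsed last bin of the
# FastK histogram contributes cov * freq just like explicit counts would
total_coverage <- sum(fastK$cov * fastK$freq)

kcov <- 30    # hypothetical 1n k-mer coverage (use the fitted GenomeScope value)
ploidy <- 3   # the marbled crayfish is triploid

genome_size_bp <- total_coverage / (kcov * ploidy)
```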
Fitted models
Alright, now we understand the difference between the approaches. We fit two models: the KMC histogram with regular GenomeScope 2.0, and the FastK histogram with the adjusted GENESCOPE.FK.
This is what the regular model looks like:
and this is what the default FastK model looks like:
It almost seems that the absence of the high-frequency k-mers (collapsed into the last bin) made for a better fit of the lower-frequency k-mers. The FastK black line looks nearly the same as the GenomeScope line. I suspect that if I extended the k-mer histogram beyond 100x, the estimates would be nearly (if not completely) the same.
So the main conclusion here is that FastK and GENESCOPE.FK are able to estimate the properties of even difficult genomes (just like KMC/GenomeScope did).
Speed
The main selling point of FastK is the speed-up. By not counting all the repetitive k-mers explicitly, it saves a lot of time.
I generated the two k-mer spectra above from ~817,000,000 read pairs (245.2 Gbp). Both programs had 16 cores and 120 GB of memory at their disposal. While it took KMC 89m29s to produce the histogram, FastK was done in 33m51s. That is nearly a 3x speed-up!
I also wanted to test fewer (4) cores, just to check the quality of the parallelization, but I ended up testing a different feature by accident: I calculated the two histograms using 4 cores on the raw reads (instead of the trimmed ones). On the raw reads, it took KMC 234m39s to produce the histogram, and FastK 146m54s. With 4 cores, the speed-up is much less impressive - just about 1.6x. How come?
I suspect the major difference comes from the ratio of super-repetitive k-mers to the total number of k-mers in the read set. The trimmed dataset must have had a lot fewer low-frequency k-mers, and therefore the relative gain of FastK over KMC was a lot more significant. I also suspect that genomes of a less repetitive nature would make the two k-mer counters more even.
BONUS: Differences between raw and trimmed data
K-mer spectrum analyses are usually quite robust to messiness in the data. In the vast majority of cases I don't bother trimming the reads before calculating the k-mer spectra. However, there are some cases where the signal-to-noise ratio is so much on the edge that trimming can help get a nice model.
This is exactly the case for the marbled crayfish:
You see the model is a lot messier. The error peak and the heterozygosity peaks are blended, and both the genome size and heterozygosity estimates are quite off compared to the fits above.
This is just for inspiration. If you have one of these "edge" datasets, you might want to try decreasing your k and trimming the reads before fitting a genome model.
BONUS 2: comparing KMC and FastK to ntCard
Rayan Chikhi asked why I did not use ntCard, and it is a fair question. Although ntCard is much less popular than Jellyfish or KMC, this tool managed to produce the histogram from the same dataset in an impressive 13m7s (16 cores and 512 GB of memory at its disposal), which is 2.5x faster than FastK.
Why not ntCard then?
Although I requested calculation of coverages up to 500,000,000x, all I got was ~65,404x (it looks like 2^16 is the limit).
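For reference, the kind of ntCard invocation I mean would look roughly like this (the option names are taken from ntCard's help text, but treat the exact flags and values as assumptions):

```sh
# count 17-mers with 16 threads, asking for coverages up to a very high cap;
# in practice the output histogram tops out around 2^16
ntcard -k 17 -t 16 -c 500000000 -p freq reads/*.fastq.gz
```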
This limit creates a big problem for a correct genome size estimate. The model based on the ntCard histogram estimated the genome size at only 2.6 Gbp, which is nearly 1 Gbp less than expected.
This probably won't be a problem for any smallish genome. So folks who know the coverage limitation is not a problem for them might want to consider ntCard as the weapon of choice. Furthermore, the reaction of the developers on Twitter gave me hope that we might see the high-coverage k-mer counting feature implemented some time.
ntCard likely uses 16-bit integers as counters, hence the limit. That sounds like a simple change to try, at the cost of memory usage.
ntCard is by far the fastest method to create a k-mer spectrum, but unfortunately it does not do well with extremely repetitive k-mers.
Among the tools that allow us to estimate the genome size correctly, FastK is quite a bit faster than KMC, and the speed-up is more significant for read sets with a higher fraction of repetitive k-mers. If you run k-mer analyses on a daily basis, you might want to consider switching to FastK.
However, that applies only to people who are not interested in the repetitive k-mers themselves. I would personally be really interested to understand what all these super-repetitive k-mers are about, and knowing their sequences and exact coverage distribution is a fine way to start.
Also, a fair disclaimer: I ran into some issues while setting up FastK on our cluster (reported here). It was not that difficult to overcome them, but if my goal were to analyse a single mid-sized genome, it would still be faster with KMC, as it is really easy to set that one up with a simple conda install -c bioconda kmc.
Methods
The marbled crayfish data were originally sequenced in Gutekunst et al. 2018 and are available in the SRA under accessions SRR5115144, SRR5115147, SRR5115146, SRR5115148, SRR5115143 and SRR5115145. There are a billion ways to get them; I like to fetch the fastq files from EBI. This should work:
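The download command itself is missing from this draft, but it would have been something like the loop below. The FTP path pattern is my assumption about how ENA lays out fastq files for 10-character accessions, so double-check it against the ENA file report for these runs:

```sh
# fetch both reads of each run from the EBI/ENA fastq mirror
for ACC in SRR5115144 SRR5115147 SRR5115146 SRR5115148 SRR5115143 SRR5115145; do
    # for 10-character accessions the subdirectory is 00 + the last digit
    SUBDIR=00${ACC: -1}
    for R in 1 2; do
        wget "ftp://ftp.sra.ebi.ac.uk/vol1/fastq/${ACC:0:6}/${SUBDIR}/${ACC}/${ACC}_${R}.fastq.gz"
    done
done
```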
Halfway through the benchmarking I recalled that the crayfish data were borderline noisy and that it was actually really helpful to trim them. Trimming also affects the k-mer counters, as it decreases quite a lot the number of distinct k-mers in the dataset.
The trimming was done using skewer for each pair of files as follows:
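The skewer command is also missing from this draft; it would have been along these lines (paired-end mode; the thread count, quality and length thresholds are placeholders, not the exact values used):

```sh
# paired-end adapter and quality trimming with 16 threads, trimming 3' ends
# below Q20 and dropping reads shorter than 30 bp, with gzipped output
skewer -t 16 -m pe -q 20 -l 30 -z \
    sample_R1.fastq.gz sample_R2.fastq.gz -o sample_trimmed
```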
Installing FastK on a cluster was not trivial, but eventually, with some help, I managed to get it working solely with programs installed via conda (the steps and problems are here). Anyway, I got FastK compiled and running.