# Optimizing Mash

Mash could well be the new BLAST. It is a great technique to compute similarity scores for sequences. However, its implementation is not optimal. In this blog post I describe some tricks to increase the performance substantially.

Addendum: Some people find this blog post and title rude. I hereby apologize to anyone who feels insulted, that was not my intention. This post will remain available as a sign of my stupidity and lack of skill with words.

Addendum 2: I changed the title.

# Simplifications

With DNA sequences, mash integrates the forward and the reverse complement strand into its sketches. To do so, it hashes only the lexicographically lesser (canonical) of the two possible kmers. Here is the (slightly simplified) code:

```cpp
bool useRevComp = true;
bool prefixEqual = true;

for ( uint64_t j = 0; j < kmerSize; j++ )
{
    char base = seq[i + j];
    char baseMinus = seqRev[length - i - kmerSize + j];

    if ( prefixEqual && baseMinus > base )
    {
        useRevComp = false;
        break;
    }

    if ( prefixEqual && baseMinus < base )
    {
        prefixEqual = false;
    }
}
```


Note that this loop is nested within another one that iterates over the whole sequence via i. You can also observe the use of a boolean variable for control flow: once prefixEqual is set to false, the remaining iterations do nothing, which has the same effect as a break. Essentially, this code skips the common prefix and sets useRevComp to false if and only if the forward-strand kmer is lexicographically smaller than the kmer on the other strand. Comparing strings is such a common task that there are a bunch of standard functions for it: std::less, operator<, strcmp, strncmp. Here we can use strncmp to simplify the code and also speed it up.
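The whole loop then collapses into a single call. The following is a sketch of that simplification, not mash's actual code; the function name and the assumption that seqRev already points at the reverse-complement kmer are mine:

```cpp
#include <cstddef>
#include <cstring>

// Sketch (hypothetical helper, not mash's actual code): decide whether to
// hash the reverse-complement kmer by comparing both spellings directly.
// seqFwd points at the kmer on the forward strand, seqRev at the matching
// kmer on the reverse-complement strand.
bool useReverseComplement(const char* seqFwd, const char* seqRev, std::size_t kmerSize)
{
    // strncmp compares at most kmerSize characters lexicographically,
    // which is exactly what the hand-rolled loop above computes. Ties
    // keep the original default of using the reverse complement.
    return std::strncmp(seqRev, seqFwd, kmerSize) <= 0;
}
```

Besides being shorter, strncmp is typically a heavily optimized libc routine, so this tends to be faster than the byte-by-byte loop as well.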

# Optimizing the Algorithms

The following code really wants to make sure that the procedure does not run off the end. It also checks whether the current kmer (e.g. the next 21 characters) contains any unusual characters; if it does, the procedure skips ahead.

```cpp
for ( uint64_t i = 0; i < length - kmerSize + 1; i++ )
{
    // repeatedly skip kmers with bad characters
    bool foundBadChar = false;

    for ( uint64_t j = i; j < i + kmerSize && i + kmerSize <= length; j++ )
    {
        if ( ! parameters.alphabet[seq[j]] )
        {
            i = j; // continue the scan after the bad character
            foundBadChar = true;
            break;
        }
    }

    if ( foundBadChar )
    {
        continue;
    }

    if ( i + kmerSize > length )
    {
        // skipped to end
        break;
    }

    // …
}
```


Again we see an example of a boolean used for control flow. However, the deeper issue with this code is that it is O(kmerSize * length): if all characters pass the test, all but one of them are checked again on the next iteration. This wastes a lot of precious cycles. For my optimization I chose a range-based approach: find a range of at least 21 consecutive valid characters, hash all kmers in it, and then search for the next valid range. With a few STL algorithms here, and some lambdas there, the code turned out to be quite readable.
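The idea can be sketched as follows. This is my illustration of the technique, not mash's actual code: the function name is hypothetical, and isValid stands in for the parameters.alphabet lookup. Each character is inspected exactly once, so the scan is O(length):

```cpp
#include <algorithm>
#include <cstdint>
#include <string>

// Sketch of the range-based scan (assumed names, not mash's actual code).
// Calls visit(ptr, kmerSize) for every kmer that lies entirely inside a
// run of valid characters.
template <typename Visit>
void forEachValidKmer(const std::string& seq, uint64_t kmerSize, Visit visit)
{
    auto isValid = [](char c) {
        return c == 'A' || c == 'C' || c == 'G' || c == 'T';
    };

    auto pos = seq.begin();
    while (pos != seq.end())
    {
        // Find the next run of consecutive valid characters...
        auto runBegin = std::find_if(pos, seq.end(), isValid);
        auto runEnd   = std::find_if_not(runBegin, seq.end(), isValid);

        // ...and visit every kmer inside it, if the run is long enough.
        if (static_cast<uint64_t>(runEnd - runBegin) >= kmerSize)
        {
            for (auto it = runBegin; it <= runEnd - kmerSize; ++it)
            {
                visit(&*it, kmerSize); // e.g. hash the kmer starting here
            }
        }
        pos = runEnd;
    }
}
```

The std::find_if / std::find_if_not pair does exactly the "find a valid range" step; the inner loop then hashes without re-validating any character.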

# Micro Optimizations

At this point most major flaws were gone and I could turn to some micro optimizations.

• Optimized reverse complement for improved vectorization. See Bit-twiddling on nucleotides for details.
• Given an if-statement, compilers take a guess which branch is more likely to be taken and lay out the code to optimize that path. Mash iterates over the sequences, hashes kmers and adds those hashes to a set, but only if the value is smaller than the largest already in the set. With increasing sequence length, adding a new hash becomes increasingly unlikely. Thus one can annotate the branch to help the compiler make the right decision.
• Inlining: Some functions do very little and are just wrappers for the programmer's convenience. For these functions inlining can be appropriate if the function call overhead starts to show up in profiles.
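The branch annotation from the second point can be sketched like this. The function and variable names are hypothetical, not mash's actual code; __builtin_expect is a GCC/Clang extension (C++20 code could use [[likely]]/[[unlikely]] instead):

```cpp
#include <cstdint>

// GCC/Clang builtin: tell the compiler this condition is usually false,
// so it lays out the cold insertion path out of line.
#define UNLIKELY(x) __builtin_expect(!!(x), 0)

// Hypothetical sketch of the hot loop's decision, not mash's actual code.
bool maybeInsert(uint64_t hash, uint64_t currentMax)
{
    // For long sequences almost every new hash is >= the current sketch
    // maximum, so the insertion branch is the rare (cold) one.
    if (UNLIKELY(hash < currentMax))
    {
        // insert hash into the sketch set (omitted in this sketch)
        return true;
    }
    return false;
}
```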

# Results

To measure the performance I used the following quick, but not too accurate approach: Run mash in matrix mode on a set of 109 E. coli genomes. With the Linux perf tools one can get a lot of interesting statistics.

```
$ perf stat -d ./mash_old matrix -p 4 Ecoli_ST131_109/* > /dev/null
…
123253,691221      task-clock:u (msec)       #    3,661 CPUs utilized
0      context-switches:u        #    0,000 K/sec
0      cpu-migrations:u          #    0,000 K/sec
42.718      page-faults:u             #    0,347 K/sec
303.729.579.173      cycles:u                  #    2,464 GHz                      (62,50%)
284.210.243.443      instructions:u            #    0,94  insn per cycle           (74,97%)
73.489.591.931      branches:u                #  596,247 M/sec                    (74,99%)
5.649.963.491      branch-misses:u           #    7,69% of all branches          (74,96%)
70.729.406.280      L1-dcache-loads:u         #  573,852 M/sec                    (74,96%)
84.967.015      L1-dcache-load-misses:u   #    0,12% of all L1-dcache hits    (75,02%)
9.152.303      LLC-loads:u               #    0,074 M/sec                    (50,05%)
3.497.725      LLC-load-misses:u         #   38,22% of all LL-cache hits     (50,02%)

33,670723038 seconds time elapsed
```

One can note that the number of instructions per cycle is suspiciously low. Every fourth instruction is a branch and 7% of branches are mispredicted. This inflicts quite a performance dip.

```
$ perf stat -d ./mash_new matrix -p 4 Ecoli_ST131_109/* > /dev/null
…
39669,019776      task-clock:u (msec)       #    3,240 CPUs utilized
0      context-switches:u        #    0,000 K/sec
0      cpu-migrations:u          #    0,000 K/sec
165.465      page-faults:u             #    0,004 M/sec
96.670.348.088      cycles:u                  #    2,437 GHz                      (62,47%)
170.284.036.548      instructions:u            #    1,76  insn per cycle           (74,98%)
17.764.262.265      branches:u                #  447,812 M/sec                    (75,06%)
22.573.892      branch-misses:u           #    0,13% of all branches          (74,91%)
32.504.470.769      L1-dcache-loads:u         #  819,392 M/sec                    (74,90%)
130.365.778      L1-dcache-load-misses:u   #    0,40% of all L1-dcache hits    (75,11%)
30.141.757      LLC-loads:u               #    0,760 M/sec                    (50,04%)
9.056.970      LLC-load-misses:u         #   30,05% of all LL-cache hits     (49,98%)

12,244687511 seconds time elapsed
```



The first thing to note is that the new version executes fewer instructions in total and more of them in parallel. Note that this is instruction-level parallelism, not threads. Only every tenth instruction now is a branch, with a mere 0,13% mispredicted. This allows the CPU to speculatively execute large parts of the program, making it much faster. Some of the other numbers have also improved (L1-dcache) while others have become worse (LLC-loads).

# Profile

To wrap up this post I will show a screenshot of the profiler for the new mash version. It was taken using $ perf record ./mash_new matrix -p 4 Ecoli_ST131_109/* > /dev/null.

The bulk of the runtime (71%) is now spent hashing. The actual code of mash—reducing sequences to a sketch—is at just 24%. Very little time is spent reading the input (ks_getuntil2) and everything else is pretty much noise. At this point one could start searching for a faster hash function, as this would have the biggest impact on program performance. All in all, with just two days of work I managed to get a 2x-3x performance improvement, which is cool and will benefit my future plans with mash.

# Replies

The following reply to this blog post is kindly republished with permission.

In a nutshell, the Mash software is a really good idea.
It is not going to replace Blast, because Blast is an aligner and Mash is not.

The implementation of Mash is in C++ and reuses the C++ STL containers. It also uses Cap'n Proto for the .msh files. In my opinion, Mash is a prime example of really good software engineering. Mash is tier 1. The best tier.