Revcomp

Table of Contents

Bit-twiddling on genomes seems to become a regular pattern in this blog; See calculating the gc-content and hashing nucleotides [GHOST_URL/hashing-nucleotides-to-two-bits/]. Today I will go into details on computing the reverse complement.

Complementing #

Ever since the detection of the DNA double helix structure by Watson and Crick, we know that nucleotides come in pairs . These are called base pairs, or even Watson-Crick pairs: A = T and C ≡ G. Mapping between the nucleotides is trivial to implement in any language.

char comp(char c)
{
    char d;
    switch (c) {
        case 'A': d = 'T'; break;
        case 'C': d = 'G'; break;
        case 'G': d = 'C'; break;
        case 'T': d = 'A'; break;
        case 'N': d = 'N'; break;
        default : d = c;
    }
    return d;
}

The above code is an obvious solution in C and there is nothing wrong with it. It complements the nucleotides, has N as a fix point and doesn’t change any other character, which might be useful for technical reasons (ie. \0 stays \0). However, if we relax the last condition, things can go a lot faster.

Let’s focus on the main task, complementing A, T and G, C. As we have already seen previously, G and C are only one bit apart in ASCII. Thus mapping either to the other becomes as simple as xoring with 4. The 4 is the same as 'G' ^ 'C'. As xor is associative and commutative 'C' ^ ('G' ^ 'C') becomes ('C' ^ 'C') ^ 'G' = 0 ^ 'G' and thus 'G'. The same reasoning is true for A, T and 21. So now complementing each pair is simply xoring with 4 or 21.

A = 0x41 = 0100 0001
C = 0x43 = 0100 0011
G = 0x47 = 0100 0111
T = 0x54 = 0101 0100

To now distinguish between the pairs we can use the fact that only in the case of C and G the second bit is set. Thus the common case boils down to

char comp(char c)
{
    return c ^= c & 2 ? 4 : 21;
}

The cool thing is that this vectorizes nicely. The compiler can generate code which complements multiple characters at once, giving a 10x performance boost.

As branching is costly in vectorized code, the the closest we can replicate the above behaviour is the following version.

char comp(char c)
{
    if (c == 'N') return 'N';
    return c ^= c & 2 ? 4 : 21;
}

It is marginally slower than the above code but still much faster than the big switch-statement.

Above we used a switch-statement to find the complement of a given base. Note how the nucleotides in the code are arranged in what looks like a table. However, this table is laid out in instructions space. Let’s convert it to data space instead.

char *revcomp_table(const char *begin, const char *end, char *dest)
{
    size_t length = end - begin;
    char table[256] = {0};
    table['A'] = 'T';
    table['C'] = 'G';
    table['G'] = 'C';
    table['T'] = 'A';

    for (size_t i = 0; i < length; i++) {
        unsigned char c = begin[length - i - 1];

        char d = table[c];

        dest[i] = d;
    }

    return dest + length;
}

Except for the loop this code no longer contains a branch, thus there is nothing that can be mispredicted. Indeed, comparing the runtime of the switch with the table-based version, the latter is easily ten times faster across a range of different CPUs. For instance here are the numbers on a laptop from 2015:

./Brevcomp --benchmark_filter='switch|table'

---------------------------------------------------------------
Benchmark                     Time             CPU   Iterations
---------------------------------------------------------------
bench/revcomp_switch    8445060 ns      8332779 ns           83
bench/revcomp_table4     740910 ns       729380 ns          936

One way to make a problem simpler is by avoiding the problem. BWA went this route. Internally they represent the nucleotides not by their Ascii values but by codes. These codes make computing the reverse complement almost trivial. Consider the following mapping.

A → 0
C → 1
G → 2
T → 3

It is easy to see that complementing a base is as simple as subtracting its value from 3. A complete reverse complement function for this encoding is as simple as following.

char *revcomp_bwa(const char *begin, const char* end, char *dest)
{
    size_t length = end - begin;
    for (size_t i = 0; i < length; i++) {
        dest[i] = 3 - begin[length - i - 1];
    }
    return dest + length;
}

However, this approach has two downsides.

Firstly, sequences need to be encoded and decoded. While internally a program can rely on this encoding many file formats rely ASCII values. One thus has to translate back and forth between the two formats.

Secondly, the encoding makes debugging more difficult. Not only have humans trouble reading the encoded values, the same goes for machines. In C a character value of 0 represents the end of a string. Thus, if one wants to print the value of an encoded genetic sequence the printout starts at the first A.

The advantage of this approach is that it is really simple and vectorizes nicely. Here are some runtime measurements.

---------------------------------------------------------------------
Benchmark                           Time             CPU   Iterations
---------------------------------------------------------------------
bwa_bench                       95643 ns        95102 ns         7124
bench/revcomp_switch          8171236 ns      8123537 ns           86
bench/revcomp_table4           673073 ns       667585 ns         1048

So on the same data the encoded BWA method is about 7 times faster than the table based method. In the next part (to be written) we explore how we can get the same speed for the Ascii representation.

Reversing #

I have so far considered two cases: Read the master string backwards and write the revcomp forwards, or vice-versa. The former is two times faster for the above twiddling strategy. I guess this is as the writes are properly aligned. More strategies such as in-place swapping should be tested.