Bit-twiddling on genomes seems to become a regular pattern in this blog; See calculating the gc-content and hashing nucleotides. Today I will go into details on computing the reverse complement.

Complementing

Ever since the detection of the DNA double helix structure by Watson and Crick, we know that nucleotides come in pairs. These are called base pairs, or even Watson-Crick pairs: A = T and C ≡ G. Mapping between the nucleotides is trivial to implement in any language.

char comp(char c)
{
char d;
switch (c) {
case 'A': d = 'T'; break;
case 'C': d = 'G'; break;
case 'G': d = 'C'; break;
case 'T': d = 'A'; break;
case 'N': d = 'N'; break;
default : d = c;
}
return d;
}

The above code is an obvious solution in C and there is nothing wrong with it. It complements the nucleotides, has N as a fix point and doesn't change any other character, which might be useful for technical reasons (ie. \0 stays \0). However, if we relax the last condition, things can go a lot faster.

Let's focus on the main task, complementing A, T and G, C. As we have already seen previously, G and C are only one bit apart in ASCII. Thus mapping either to the other becomes as simple as xoring with 4. The 4 is the same as 'G' ^ 'C'. As xor is associative and commutative 'C' ^ ('G' ^ 'C') becomes ('C' ^ 'C') ^ 'G' = 0 ^ 'G' and thus 'G'. The same reasoning is true for A, T and 21. So now complementing each pair is simply xoring with 4 or 21.

A = 0x41 = 0100 0001
C = 0x43 = 0100 0011
G = 0x47 = 0100 0111
T = 0x54 = 0101 0100

To now distinguish between the pairs we can use the fact that only in the case of C and G the second bit is set. Thus the common case boils down to

char comp(char c)
{
return c ^= c & 2 ? 4 : 21;
}

The cool thing is that this vectorizes nicely. The compiler can generate code which complements multiple characters at once, giving a 10x performance boost.

As branching is costly in vectorized code, the the closest we can replicate the above behaviour is the following version.

char comp(char c)
{
if (c == 'N') return 'N';
return c ^= c & 2 ? 4 : 21;
}

It is marginally slower than the above code but still much faster than the big switch-statement.

Reversing

I have so far considered two cases: Read the master string backwards and write the revcomp forwards, or vice-versa. The former is two times faster for the above twiddling strategy. I guess this is as the writes are properly aligned. More strategies such as in-place swapping should be tested.