Bit-twiddling on genomes seems to become a regular pattern in this blog; See calculating the gc-content and hashing nucleotides. Today I will go into details on computing the reverse complement.

# Complementing

Ever since the detection of the DNA double helix structure by Watson and Crick, we know that nucleotides come in pairs. These are called base pairs, or even Watson-Crick pairs: A = T and C ≡ G. Mapping between the nucleotides is trivial to implement in any language.

char comp(char c)
{
char d;
switch (c) {
case 'A': d = 'T'; break;
case 'C': d = 'G'; break;
case 'G': d = 'C'; break;
case 'T': d = 'A'; break;
case 'N': d = 'N'; break;
default : d = c;
}
return d;
}


The above code is an obvious solution in C and there is nothing wrong with it. It complements the nucleotides, has N as a fix point and doesn't change any other character, which might be useful for technical reasons (ie. \0 stays \0). However, if we relax the last condition, things can go a lot faster.

Let's focus on the main task, complementing A, T and G, C. As we have already seen previously, G and C are only one bit apart in ASCII. Thus mapping either to the other becomes as simple as xoring with 4. The 4 is the same as 'G' ^ 'C'. As xor is associative and commutative 'C' ^ ('G' ^ 'C') becomes ('C' ^ 'C') ^ 'G' = 0 ^ 'G' and thus 'G'. The same reasoning is true for A, T and 21. So now complementing each pair is simply xoring with 4 or 21.

A = 0x41 = 0100 0001
C = 0x43 = 0100 0011
G = 0x47 = 0100 0111
T = 0x54 = 0101 0100


To now distinguish between the pairs we can use the fact that only in the case of C and G the second bit is set. Thus the common case boils down to

char comp(char c)
{
return c ^= c & 2 ? 4 : 21;
}


The cool thing is that this vectorizes nicely. The compiler can generate code which complements multiple characters at once, giving a 10x performance boost.

As branching is costly in vectorized code, the the closest we can replicate the above behaviour is the following version.

char comp(char c)
{
if (c == 'N') return 'N';
return c ^= c & 2 ? 4 : 21;
}


It is marginally slower than the above code but still much faster than the big switch-statement.

# Reversing

I have so far considered two cases: Read the master string backwards and write the revcomp forwards, or vice-versa. The former is two times faster for the above twiddling strategy. I guess this is as the writes are properly aligned. More strategies such as in-place swapping should be tested.