Hashing Nucleotides to two Bits

In many bioinformatics applications it is useful to reduce genomic sequences to 2bit format; Hash them, in some sense. Obviously two bits are enough to uniquely identify the four canonical bases. Usually the following mapping is to be achieved: A → 0, C → 1, G → 2, T → 3.

There are different methods to achieve this mapping, via a switch or a char array. However, there is a better way using the ASCII representation of the nucleotides:

A = 0x41 = 0100 0001
C = 0x43 = 0100 0011
G = 0x47 = 0100 0111
T = 0x54 = 0101 0100

The (lower) bits 2 and 3 are unique across all nucleotides. Thus, the following code from up2bit gets us half the way.

char n; // the nucleotide
n &= 0x6; // mask all bits except 2 and 3.
n >>= 1;

A → 0, C → 1, G → 3, T → 2 is the current mapping. Now we have to fix the G, T issue. This could either be done via one or two ifs. But again, bit-twiddling saves the day.

n ^= (n >> 1); // conditionally flip lower bit

If bit 2 is 0 (A or C) nothing happens, but if it is 1 (G and T) the lowest bit gets flipped. Finally, we have achieved the mapping we set out for. In C, this code compiles down to just five bit-twiddling instructions making it the fastest order-preserving method available.