In many bioinformatics applications it is useful to reduce genomic sequences to 2bit format; Hash them, in some sense. Obviously two bits are enough to uniquely identify the four canonical bases. Usually the following mapping is to be achieved: A → 0, C → 1, G → 2, T → 3.
There are different methods to achieve this mapping, via a
switch or a char array. However, there is a better way using the ASCII representation of the nucleotides:
A = 0x41 = 0100 0001 C = 0x43 = 0100 0011 G = 0x47 = 0100 0111 T = 0x54 = 0101 0100
The (lower) bits 2 and 3 are unique across all nucleotides. Thus, the following code from up2bit gets us half the way.
char n; // the nucleotide n &= 0x6; // mask all bits except 2 and 3. n >>= 1;
A → 0, C → 1, G → 3, T → 2 is the current mapping. Now we have to fix the G, T issue. This could either be done via one or two
ifs. But again, bit-twiddling saves the day.
n ^= (n >> 1); // conditionally flip lower bit
If bit 2 is 0 (A or C) nothing happens, but if it is 1 (G and T) the lowest bit gets flipped. Finally, we have achieved the mapping we set out for. In C, this code compiles down to just five bit-twiddling instructions making it the fastest order-preserving method available.