When Watson and Crick described the two-stranded structure of DNA they stated that "if the sequence of bases on one chain is given, then the sequence on the other chain is automatically determined." This observation is crucial to much modern biology such as sequencing by synthesis. However, for a computer scientist, a natural follow-up question is given one strand, how does one compute the other? I intend to answer this question in a series of articles.

Part 1: The Problem

For these articles I decided to use the following C function signature.

char *revcomp(const char *begin, const char *end, char *dest);

The forward strand is passed by using two pointers, one to the first base and one just past the last base. Thereby [begin, end) form an open interval. This kind of argument passing is very common for instance in the STL. It allows users to easily just call it on a substring of the forward strand. The dest pointer points to space for the reverse string. So all of the memory allocation is left to the user. The strings could be malloc'ed, mmap'ed or static buffers. All work fine. Finally, revcomp returns a pointer just past the last byte written. This is useful if the reverse strand shall be turned into a proper C-string which requires a terminating nul byte.

Computing the reverse complement can be split into two separate tasks: reversing and complementing. Let's look at the latter first using the following draft:

char *revcomp(const char *begin, const char *end, char *dest)
{
	size_t length = end - begin;
	for (size_t i = 0; i < length; i++) {
		char c = begin[length - i - 1];

		char d = complement(c);

		dest[i] = d;
	}

	return dest + length;
}

So the question is how does one map the base from the forward base to its complement? A common but less than optimal solution is to use a switch statement.

char *revcomp_switch(const char *begin, const char *end, char *dest)
{
	size_t length = end - begin;
	for (size_t i = 0; i < length; i++) {
		char c = begin[length - i - 1];

		char d;
		switch (c) {
			case 'A': d = 'T'; break;
			case 'C': d = 'G'; break;
			case 'G': d = 'C'; break;
			case 'T': d = 'A'; break;
		}

		dest[i] = d;
	}

	return dest + length;
}

While the switch statement looks innocent, it is logically equivalent to a series of if-else branches. Modern CPUs try to predict which branches a program will execute. However, I'm using genomes with a million equidistributed bases. That means it is almost impossible for the CPU to predict or to learn/remember which branches will be taken. It has to guess and 75% of the time is guesses incorrectly. So theory tells us that most branches are predicted incorrectly triggering rollbacks and CPU stalls. In practice compilers make the CPUs job easier. For above code almost every third instruction is a branch of which 20% are mispredicted. Much better than the worst-case 75% but still very bad.

Part 2: The Table