In part one we used a switch statement to find the complement of a given base. Note how the nucleotides in that code are arranged in what looks like a table. However, this table is laid out in instruction space. Let's convert it to data space instead.

char *revcomp_table(const char *begin, const char *end, char *dest)
{
	size_t length = end - begin;

	/* Lookup table indexed by input byte; unlisted bytes map to 0. */
	char table[256] = {0};
	table['A'] = 'T';
	table['C'] = 'G';
	table['G'] = 'C';
	table['T'] = 'A';

	for (size_t i = 0; i < length; i++) {
		/* Read the input back to front; unsigned char keeps the index non-negative. */
		unsigned char c = begin[length - i - 1];

		char d = table[c];

		dest[i] = d;
	}

	return dest + length;
}
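
As a variation on the same idea, the table can also live in static read-only data instead of being filled in on each call. The following is only a sketch under that assumption, not the code that was benchmarked: the names complement, revcomp_static and revcomp_static_check are made up for illustration, and it relies on C99 designated initializers.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Complement lookup table, initialized at compile time.
   Entries not listed are zero, so unexpected input bytes map to '\0'. */
static const char complement[256] = {
	['A'] = 'T',
	['C'] = 'G',
	['G'] = 'C',
	['T'] = 'A',
};

/* Hypothetical variant of revcomp_table that uses the static table. */
char *revcomp_static(const char *begin, const char *end, char *dest)
{
	size_t length = end - begin;

	for (size_t i = 0; i < length; i++) {
		/* Cast to unsigned char so the table index is never negative. */
		unsigned char c = (unsigned char)begin[length - i - 1];
		dest[i] = complement[c];
	}

	return dest + length;
}

/* Small self-check: the reverse complement of "AACG" is "CGTT". */
void revcomp_static_check(void)
{
	const char in[] = "AACG";
	char out[sizeof in] = {0};
	char *end = revcomp_static(in, in + strlen(in), out);

	assert(end == out + 4);
	assert(memcmp(out, "CGTT", 4) == 0);
}
```

Whether this makes any measurable difference compared to the version above depends on the compiler, which may well hoist the table setup on its own.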

Apart from the loop condition, this code no longer contains a branch, so there is no data-dependent branch left to mispredict. Indeed, comparing the runtime of the switch-based version with the table-based version, the latter is easily ten times faster across a range of different CPUs. For instance, here are the numbers on a laptop from 2015:

./Brevcomp --benchmark_filter='switch|table'

Benchmark                     Time             CPU   Iterations
bench/revcomp_switch    8445060 ns      8332779 ns           83
bench/revcomp_table4     740910 ns       729380 ns          936

To be continued