The other day I was solving the wrong problem. Before hashing, mash has to verify that all characters in a k-mer come from one specific alphabet (either nucleotide, ACGT, or protein). I tried to speed up the checking itself, whereas the real improvement came from doing less checking. Still, because my solution (to the wrong problem) was elaborate, I want to document it here.

Given a string seq of length length, the following function find_first_not_ACGT finds the first position that is not an A, C, G, or T. Here table has a zero for all non-ACGT characters and a non-zero value only for 'A' (ASCII value 65), 'C', 'G', and 'T'.

char *
find_first_not_ACGT(const char *seq, size_t length)
{
	size_t i = 0;

	for (; i < length; i++) {
		/* cast avoids a negative index for bytes >= 128 */
		signed char c = table[(unsigned char)seq[i]];
		if (c == 0) return (char *)seq + i;
	}

	return (char *)seq + length;
}
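
The table itself is defined elsewhere in mash. As a rough sketch (the helper name init_table and the choice to store the character itself are assumptions for illustration, not mash's actual layout), it could be filled like this:

#include <string.h>

signed char table[256];

void
init_table(void)
{
	memset(table, 0, sizeof(table));
	table['A'] = 'A';
	table['C'] = 'C';
	table['G'] = 'G';
	table['T'] = 'T';
}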

This approach is simple, generic, and reasonably fast. However, it checks only one character at a time against a pretty large table. Using intrinsics, one can do better.

#include <emmintrin.h>
#include <tmmintrin.h>
#include <stdint.h>
#include <string.h>

char *find_first_not_ACGT_fast(const char *seq, uint64_t length)
{

The new function find_first_not_ACGT_fast looks up multiple characters (a "chunk") at the same time. However, the table has to be reduced in size. The chunk size and the table size are both 16 bytes, i.e. vec_size, i.e. sizeof(__m128i).

    typedef __m128i vec_type;
    size_t vec_size = sizeof(vec_type);

The length of the sequence may not be divisible by sixteen. Thus, the following line rounds the length down to the nearest multiple of the chunk size, i.e. the number of bytes covered by full chunks. For example, a length of 37 is capped to 32.

    size_t capped_length = length & ~(vec_size - 1);
    size_t chunk_offset = 0;

Next, there is a "table" that is used later for looking up characters. You should think of mask_low_nibble as basically a table of zeros. A few values are set to non-zero:

mask_low_nibble['A' & 0x0f] = 'A';
mask_low_nibble['C' & 0x0f] = 'C';
mask_low_nibble['G' & 0x0f] = 'G';
mask_low_nibble['T' & 0x0f] = 'T';

Note that a nibble is half a byte (four bits).

    // low nibbles: 'A' -> 1, 'C' -> 3, 'T' -> 4, 'G' -> 7
    vec_type mask_low_nibble = _mm_setr_epi8(0, 'A', 0, 'C', 'T', 0, 0, 'G', 0, 0, 0, 0, 0, 0, 0, 0);

    for (; chunk_offset < capped_length; chunk_offset += vec_size) {
        vec_type chunk;
        // Load a full chunk
        memcpy(&chunk, seq + chunk_offset, vec_size);

Here is where most of the magic happens. Let's assume chunk[0] is a C.
Then the shuffle instruction is basically a bitwise-and followed by a table
look-up.

mask_low_nibble[chunk[0] & 0x0f] -> 'C'

Unsurprisingly, we get back the same character that we put in. The same happens for A, G, and T; these characters end up in the variable canonicalized. If the input character is a B, however, a zero is returned and stored. Because the table only considers the low nibble, there will be collisions:

mask_low_nibble['B' & 0x0f] -> 0
mask_low_nibble['A' & 0x0f] -> 'A'
mask_low_nibble['Q' & 0x0f] -> 'A'
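
If _mm_shuffle_epi8 is unfamiliar: per byte, it zeroes the result when the high bit of the index byte is set and otherwise uses the low nibble as an index into the 16-byte table. A scalar sketch of this behaviour (for illustration only, not part of the function):

#include <stdint.h>

// Scalar model of _mm_shuffle_epi8: one output byte per input byte.
void shuffle_epi8_scalar(uint8_t out[16], const uint8_t table16[16],
                         const uint8_t idx[16])
{
	for (int i = 0; i < 16; i++)
		out[i] = (idx[i] & 0x80) ? 0 : table16[idx[i] & 0x0f];
}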

Our new variable canonicalized now has all the canonical nucleotides in
the same places as chunk. All non-canonical spots now have either a zero
or a wrong base.
(The fact that _mm_shuffle_epi8 also returns 0 if the high bit of the input byte is set can be ignored here.)

        vec_type canonicalized = _mm_shuffle_epi8(mask_low_nibble, chunk);

We now compare canonicalized to chunk. The result will yield true (0xff)
for every canonical base and false (0) otherwise. These values are stored per byte.

        vec_type equal_mask = _mm_cmpeq_epi8(canonicalized, chunk);

From this byte vector we take only the highest bit of each byte (movemask). Each canonical byte contributes a 1 bit and each non-canonical byte a 0 bit. As we are interested in the non-canonical bytes, the result needs to be bit-flipped. However, this also inverts the upper 16 bits of the 32-bit result, which have to be masked off. For example, if only the fifth character of a chunk is non-canonical, movemask returns 0xffef; flipping and masking leaves 0x0010.

        int mask = ~_mm_movemask_epi8(equal_mask) & ((1 << vec_size) - 1);
        if (mask) {

If the mask is non-zero, there is at least one non-ACGT character. We can
find its position in the chunk by counting the trailing zeros (CTZ). Note
that the least significant bit corresponds to the first character of the chunk, matching the little-endian layout on Intel.

            return (char*)seq + chunk_offset + __builtin_ctz(mask);
        }
    }

If execution exits the loop, all characters so far were ACGT. We are left with a few characters that cannot fill a whole chunk. We apply the same method as above, just to the remaining bytes, loading them into a zero-initialized chunk.

    size_t rest = length - capped_length;
    vec_type chunk = _mm_set1_epi8(0);
    memcpy(&chunk, seq + chunk_offset, rest);

    vec_type canonicalized = _mm_shuffle_epi8(mask_low_nibble, chunk);
    vec_type equal_mask = _mm_cmpeq_epi8(canonicalized, chunk);

    int mask = ~_mm_movemask_epi8(equal_mask) & ((1 << rest) - 1);
    if (mask) {
        return (char*)seq + chunk_offset + __builtin_ctz(mask);
    }

All characters were canonical nucleotides. Return a pointer just past the end of the sequence.

    return (char*)seq + length;
}

This new code is almost six times faster than the table-based function above. Admittedly, it is not simple and not something you should just copy and paste into your code. For instance, it requires the SSSE3 instruction set, which might not be available on all target machines. Instead, libdna can be used, which aims for optimal performance on each supported instruction set.
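
As a usage sketch (the example sequence is made up; on GCC or Clang the code needs SSSE3 enabled, e.g. cc -O2 -mssse3):

#include <stdio.h>

int main(void)
{
	const char seq[] = "ACGTACGTNACGT";
	size_t len = sizeof(seq) - 1;
	char *p = find_first_not_ACGT_fast(seq, len);

	if (p < seq + len)
		printf("first non-ACGT character at offset %ld\n", (long)(p - seq));
	else
		printf("all characters are ACGT\n");

	return 0;
}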