GC Content

bioinformatics

Computing the GC-content of a DNA sequence is probably the simplest “analysis” possible, right after sequence length. It also translates easily into code, for example this C function:

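#include <stddef.h>

/* Fraction of G/C characters in a NUL-terminated sequence. */
double gc(const char *seq)
{
	size_t count = 0;
	const char *ptr = seq;

	for (; *ptr; ptr++) {
		if (*ptr == 'g' || *ptr == 'G' || *ptr == 'c' || *ptr == 'C') {
			count++;
		}
	}

	/* An empty sequence would otherwise divide 0 by 0. */
	if (ptr == seq) {
		return 0.0;
	}

	return (double)count / (double)(ptr - seq);
}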

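The cool thing is that C and G differ by only one bit in ASCII: 'C' is 0x43 and 'G' is 0x47, so bit 0x04 is all that separates them. The same is true for lower and upper case letters, which differ only in bit 0x20. Thus an optimizing compiler can reduce the four comparisons to just two instructions: clear both bits with a mask, then compare against 'C'. Assume the eax register holds the character we want to check.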

and $0xffffffdb,%eax
cmp $0x43,%al
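
To make the trick easier to play with, here is a minimal C sketch of the same mask-and-compare written by hand. The names is_gc and check_mask_trick are made up for this illustration and are not part of the original code; the check simply compares the masked test against the four explicit comparisons for every possible byte value.

```c
#include <assert.h>

/* Hypothetical helper: 1 if ch is one of 'C', 'G', 'c', 'g', else 0.
 * 0x04 is the bit that separates 'C' (0x43) from 'G' (0x47) and
 * 0x20 is the ASCII case bit. Clearing both maps all four to 'C'. */
int is_gc(unsigned char ch)
{
	return (ch & ~0x24u) == 0x43;
}

/* Hypothetical self-check: the masked test must agree with the
 * four explicit comparisons for all 256 byte values. */
void check_mask_trick(void)
{
	for (int ch = 0; ch < 256; ch++) {
		int expected = ch == 'g' || ch == 'G' || ch == 'c' || ch == 'C';
		assert(is_gc((unsigned char)ch) == expected);
	}
}
```

The mask ~0x24u has the same low byte (0xdb) as the constant in the and instruction.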

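So bioinformatics got lucky that we use the GC-content, not the AT-content: 'A' (0x41) and 'T' (0x54) differ in three bits, so a single mask and compare would not be enough for them.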
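
As a quick sanity check of that claim, the following sketch counts the bits in which two ASCII characters differ. The names bit_distance and check_distances are hypothetical and exist only for this illustration.

```c
#include <assert.h>

/* Hypothetical helper: number of bits in which two characters differ. */
int bit_distance(unsigned char a, unsigned char b)
{
	unsigned x = (unsigned)(a ^ b);
	int n = 0;

	for (; x; x &= x - 1) {	/* clears the lowest set bit each pass */
		n++;
	}

	return n;
}

/* 'C' ^ 'G' is 0x04 (one bit); 'A' ^ 'T' is 0x15 (three bits). */
void check_distances(void)
{
	assert(bit_distance('C', 'G') == 1);
	assert(bit_distance('A', 'T') == 3);
}
```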