GC Content
bioinformatics
Computing the GC-content of a DNA sequence is probably the simplest “analysis” possible, right after sequence length. It also translates easily into code; for example this C code:
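#include <stddef.h>

/* Fraction of G/C bases (either case) in a NUL-terminated sequence. */
double gc(const char *seq)
{
    size_t gc = 0;
    const char *ptr = seq;
    for (; *ptr; ptr++) {
        if (*ptr == 'g' || *ptr == 'G' || *ptr == 'c' || *ptr == 'C') {
            gc++;
        }
    }
    /* An empty sequence would otherwise divide zero by zero. */
    if (ptr == seq) {
        return 0.0;
    }
    return (double)gc / (double)(ptr - seq);
}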
The cool thing is that C and G differ by only one bit in their ASCII codes (0x43 and 0x47). The same is true for lower and upper case letters. So the compiler can clear those two bits and compare once, and the four comparisons compile down to just two instructions. Assume the eax register holds the character we want to check.
and $0xffffffdb,%eax
cmp $0x43,%al
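For illustration, the same mask-and-compare can be written out by hand in C. This is only a sketch of what the two instructions do, not something you need to write yourself, since the compiler can find it on its own. The names is_gc and gc_masked are made up for this example, and returning 0.0 for an empty sequence is an assumption of the sketch.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper: 1 if ch is 'C', 'c', 'G' or 'g', else 0.
 * Clearing bit 0x20 folds lower case onto upper case; clearing
 * bit 0x04 folds 'G' (0x47) onto 'C' (0x43). */
static int is_gc(unsigned char ch)
{
    return (ch & 0xDB) == 0x43;   /* 0xDB == ~(0x20 | 0x04) & 0xFF */
}

/* Counts like gc(), with the comparison spelled out as a mask. */
double gc_masked(const char *seq)
{
    assert(seq != NULL);
    size_t count = 0;
    const char *ptr = seq;
    for (; *ptr; ptr++) {
        count += (size_t)is_gc((unsigned char)*ptr);
    }
    /* Assumption of this sketch: an empty sequence yields 0.0. */
    return ptr == seq ? 0.0 : (double)count / (double)(ptr - seq);
}
```

The low byte of 0xffffffdb is 0xDB, which is why the sketch uses that mask on a single character.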
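So bioinformatics got lucky that we use the GC-content and not the AT-content: A (0x41) and T (0x54) differ in three bits, so a single mask and compare would not single them out.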
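To see how lucky, here is a small sketch that counts the differing bits and tries the same trick on A and T. Both helper names are hypothetical. The mask 0xCA is simply what you get by clearing the case bit plus the three bits in which A and T differ.

```c
/* Hypothetical helper: number of bit positions in which a and b differ. */
static int bits_differing(unsigned char a, unsigned char b)
{
    unsigned x = (unsigned)(a ^ b);
    int n = 0;
    while (x) {
        x &= x - 1;   /* clear the lowest set bit */
        n++;
    }
    return n;
}

/* Mask-and-compare for A/T in the style of the GC check.
 * 0xCA == ~(0x20 | 0x15) & 0xFF, where 0x15 == 'A' ^ 'T'.
 * It accepts A and T, but also other letters such as 'E'. */
static int is_at_masked(unsigned char ch)
{
    return (ch & 0xCA) == 0x40;
}
```

With three bits masked away, eight upper case codes (and their lower case twins) collapse onto the same value, so the check is no longer exact.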