/ Pitiful Bioinformatics

strlen(genome): The Pitiful State of Software Engineering in Bioinformatics

Genomes are very long strings. This comes to no surprise to anybody, except maybe a few C programmers. The following bug appears only in C programs or bad C++ programs due to the quirks of C.

Strings in C are null-terminated and their length is implicit. This has two consequences: the programmer has to call strlen to retrieve the length and algorithms can be executed without knowing the length a-priori by checking for the null-terminator. However, both solutions are costly, strlen is a O(n) operation and checking for null on the fly may disable vectoring.

In delta2maf, a tool from the Mugsy suite, strlen is called repreatedly on long sequences. Infact, strlen is called on the same sequence again and again. The link above shows a patch I developed for the Debian package of MUMmer, currently hosting the delta2maf program. The patch simply caches the length of the current sequence and reuses it later. This leads to a speed up with a factor of twenty.

In order to compute a phylogeny from a MAF (Multiple Alignment Format) file, I wrote the tool maf2dist some time ago. Just last week, I was annoyed by its poor performance and fixed the slowdown quite easily. During refactoring I made the simple change, that add_compare now expects two proper C++ strings. Thus the length of the two strings is known, a simple by-index for loop with proper bounds can be used. Even though the compiler still does not vectorize the function, it is one less branch and due to instruction parallelism execution is speed up.

Both of these issues, despite being in C++ programs come from the C heritage. In other programming languages the problems simply cannot arise.

Verdict: Avoid C-style strings for genomes.