The State of Software Engineering in Bioinformatics

This post kicks off a series about the state of software engineering in bioinformatics. In each post I will take a stab at a popular bioinformatics tool, dive into its code and show some of its worst parts. The issues will range from bad style, through inefficient idioms, to plain errors.

Tool of the day: andi

To be fair, I will start this series with a bug of my own. andi is a tool for alignment-free sequence comparison that I wrote during my master's thesis. After continuous “improvement”, a particular git merge left me with the following code excerpt.

double val = 0.1 /* substitutions per position */;

if( FLAGS & F_EXTRA_VERBOSE ){
    val = D(i,j).distance;
}

if( !(FLAGS & F_RAW)){
    val = -0.75 * log(1.0- (4.0 / 3.0) * val ); // jukes cantor
}
// fix negative zero
if( val <= 0.0 ){
    val = 0.0;
}

if( !(FLAGS & F_RAW)){
    val = -0.75 * log(1.0- (4.0 / 3.0) * val ); // jukes cantor
}

// fix negative zero
if( val <= 0.0 ){
    val = 0.0;
}

It should be suspicious that some of the code is duplicated: the Jukes-Cantor correction is applied twice. As a result, evolutionary distances above 0.1 became significantly overcorrected and thus too large. The deeper issue is not that errors like this happen, but that this one wasn't caught by unit tests. It should be easy to test that, given two sequences with a certain substitution rate, my tool reproduces that rate. However, at that point in time, only the uncorrected distance was checked.

This failure to validate is all too common in bioinformatics. These days andi comes with reproducible unit tests covering 82% of the code. This is not as good as it could be, but much better than most tools in bioinformatics, which come without any tests at all.


Next post: MUMmer. Feel free to send me your examples.