Errors in the Ensembl Genomes Database

The Ensembl Bacteria database contains more than 2600 whole-genome sequences of Escherichia coli. It should not be a surprise that some of this data is flawed. As a simple verification I took my own tool for phylogenetic reconstruction, andi, and applied it to the 2673 downloadable genomes from release 36. After 3.5 hours of computation the following tree was produced.

Something sticks out: A small number of sequences are not E. coli at all, but simply mislabelled in the database. For the four most distantly related sequences, I determined their actual species by running BLAST on parts of each genome against the NCBI database. Thus, I propose the following new labels:

  • _gca_01443095: Enterobacter cloacae
  • _isc11: Citrobacter freundii
  • _isc56: Klebsiella pneumoniae
  • _gca_900092915: Klebsiella pneumoniae
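
For readers who would rather not pick outliers by eye from a tree, here is a minimal sketch of how the most distant sequences could be ranked programmatically. It is not how the four sequences above were found. It assumes the pairwise distances are available as a square, whitespace-separated PHYLIP matrix with purely numeric entries; the function names and the toy matrix are made up for illustration.

```python
from statistics import median


def read_phylip(text):
    """Parse a square PHYLIP distance matrix into (names, rows)."""
    lines = [ln for ln in text.splitlines() if ln.strip()]
    n = int(lines[0].split()[0])  # first line holds the number of sequences
    names, rows = [], []
    for ln in lines[1:n + 1]:
        fields = ln.split()
        names.append(fields[0])
        rows.append([float(x) for x in fields[1:n + 1]])
    return names, rows


def most_distant(names, rows, k=4):
    """Rank sequences by their median distance to all others, largest first."""
    scores = []
    for i, row in enumerate(rows):
        others = [d for j, d in enumerate(row) if j != i]  # skip the diagonal
        scores.append((median(others), names[i]))
    scores.sort(reverse=True)
    return [name for _, name in scores[:k]]


# Toy matrix with hypothetical names and distances, not real data.
toy = """4
seq_a 0.0 0.01 0.02 0.30
seq_b 0.01 0.0 0.02 0.31
seq_c 0.02 0.02 0.0 0.29
outlier 0.30 0.31 0.29 0.0
"""
names, rows = read_phylip(toy)
print(most_distant(names, rows, k=1))  # prints ['outlier']
```

The median is used instead of the mean so that a handful of mislabelled genomes does not inflate the score of every genuine E. coli sequence. The candidates flagged this way would still need to be checked by hand, for instance with BLAST as described above.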

Getting The Errors Fixed

This blog post is not supposed to be a rant, nor do I want to point fingers, so I'll keep this short: I contacted the EBI, showed them my analysis, and proposed the new labels. They told me that they imported the data from the ENA, so I should go there to get the issue fixed.

Thus, I opened a ticket at the ENA, showed them my analysis, and proposed the new labels. However, they answered:

In order to modify any data at ENA, the request needs to come directly from the submitter or owner of the data.

So I had to take another step and contact the original authors. Unfortunately, the datasets are not associated with an email address, or at least not publicly. So after some googling I sent an email to the original authors. Furthermore, I found out that some of the authors had already noticed their mistakes and published corrections to their papers. Thinking that this would satisfy the requirement of written consent, I forwarded those to the ENA.

Since then: nothing. None of the original authors has answered my request, the ENA has closed the ticket, and no label has changed.

The Open Source Model

Just to put things in perspective, here is how the open source development model handles such issues.

Say you are working on Ubuntu, find a bug in a distributed program, and also a possible fix. You can then contact Ubuntu, they will verify that the reported issue is indeed a bug, see if your patch fixes it, apply the fix and forward it to Debian. Debian will then also apply the fix to their package. Either of them will also notify the original author (called upstream) of the bug and the fix.

This process not only requires less work on behalf of the person reporting the bug, but also ensures that the issue actually gets fixed. Often enough the original author of a piece of software has no interest in maintaining it any further or becomes unresponsive for various reasons. Even then, the community still benefits from the patches. The latter case is especially common with scientific software, as people move between institutes and old mail addresses stop working.

Discussion

I had a nice Twitter discussion with Ewan Birney about all this. In it I expressed my frustration with this ineffective procedure. His point was:

The key thing is that the archives can't make that decision on the underlying data - it's right to push this to the submitters

Part of the reason is that the INSDC rules do not require the authors to give the archive the right to modify their submissions. The Debian Policy, however, does just that. My hope is that the databases will realise that upstream going silent is a problem, and come up with a procedure to resolve it. Otherwise, the sequences will remain mislabelled in the database forever. On the bright side, I can keep using the Ensembl E. coli genomes as an example analysis in my talks in the future.