/ Bioinformatics

PFASTA — Yet another set of FASTA tools

Recently more and more CVEs are being filed against bioinformatic tools. Validity of the bug reports aside, it just shows that the security aspects of scientific software becomes more interesting. One of the most important factors is parsing and validating input data. Correctly handling invalid data is crucial to ensure your program does not contain a flaw which can be exploited.

The FASTA format is just one of the many data formats in bioinformatics. Even though it might look simple, it is hard to parse efficiently and correctly. Among the best parsers for FASTA files is kseq which has recently become faster. However, I have always been unsatisfied with the lack of error messages in kseq. Thus, some time ago I started the pfasta project which is now getting a gorgeous new release.

Pfasta is a FASTA parser written in C designed to produce great error messages. Here are some examples:

% ./acgt foo
acgt: foo: No such file or directory
% ./format Makefile 
format: Makefile: File must start with '>'.
% ./acgt test/xfail_emptyfile.fa 
acgt: test/xfail_emptyfile.fa: File is empty.
% ./validate test/xfail_snpgenie.fa 
validate: test/xfail_snpgenie.fa: Empty name on line 1.

The parser detects simple problems such as non-existing files, empty files, and files not starting with >. But it can also detect errors in the grammar such as a missing name for a sequence and even give a hint on which line the problem is.

Pfasta is also a set of tools with an assorted set of functionality. They use the pfasta parser and thus offer a great user experience.

  • acgt: Reduce residues to the four canonical bases.
  • aln2dist: Convert an alignment to a distance matrix.
  • aln2maf: Convert an alignment to MAF.
  • cchar: Count the number of nucleotides.
  • concat: Concatenate sequences.
  • fancy_info: Print a fancy report.
  • format: Format sequences.
  • gc_content: Determine the GC content.
  • n50: Compute the N50.
  • revcomp: Compute the reverse complement.
  • shuffle: Shuffle a set of sequences.
  • sim: Simulate a set of genetic sequences.
  • split: Split a FASTA file into multiple files on a sequence basis.
  • validate: Check if a file conforms to the grammar.

Do the extra checks slow down the parsing? Somewhat. Here are the numbers:

% pv ~DISS_PATH/data/Ecoli_ST131_109/* | wc -l - > /dev/null
547MiB 0:00:00 [2,70GiB/s]
% pv ~DISS_PATH/data/Ecoli_ST131_109/* | kseq-new  - > /dev/null
547MiB 0:00:00 [2,02GiB/s]
% pv ~DISS_PATH/data/Ecoli_ST131_109/* | kseq-old  - > /dev/null
547MiB 0:00:00 [1022MiB/s]
% pv ~DISS_PATH/data/Ecoli_ST131_109/* | ./validate -
547MiB 0:00:00 [1,22GiB/s]

For each tool I measured how quickly it can parse 109 bacterial genomes amounting to 547 MiB. As a baseline I included wc -l. On my machine wc whisks at a 2.70 GiB/s. kseq with the recent patch is almost as fast at around 2 GiB/s. Pfasta is slightly slower at 1.2 GiB/s while still being faster than most versions of kseq.h being used in the wild. Note that it is unlikely for your harddrive or SSD to deliver more than 1.2 GiB per second anyways.

pfasta comes with a lot of goodies: a simple API, man pages, unit tests, and a permissive license.

Feel free to use pfasta in your own projects. Also let me know if you want to see some other tool added to the set or managed to break the parser.