Every good scientific experiment should start with a hypothesis. This is also true for sequencing in the age of NGS. Often enough, however, the data doesn’t fit the expectations. In that case it is of interest to find and mark the violating evidence.

One of the most common tasks in molecular biology is the creation of a phylogeny based on an alignment of nucleotide sequences. These days such alignments come from whole-genome data. Given such an alignment, a maximum-likelihood method is employed to search for the best tree.

The tree shape is determined by the mutations in the data. As the mutation rate varies along the genome, so does the tree shape. With alignments of whole genomes there is enough variance in the data to create any desired tree shape. In the following I will present a number of methods to “massage” an alignment so that the tree reconstruction will give you the phylogeny you desire. All of these are implemented in a new tool called AlnMassage.

## Data Formats

AlnMassage is built around the Multiple Alignment Format (MAF), that is, it expects the genomes to be aligned in blocks. Each block represents a set of homologous sequences. For some transformations AlnMassage also requires the original sequences in FASTA format. The target tree must be given in Newick format.
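To make the block structure concrete, here is a minimal sketch of reading MAF blocks in Python. It assumes the usual MAF layout, where a line starting with `a` opens a block and each `s` line carries source name, start, size, strand, source size, and the aligned text. The function name `parse_maf` and the dictionary keys are my own illustration, not part of AlnMassage.

```python
def parse_maf(text):
    """Parse MAF text into a list of blocks; each block is a list of rows.

    Hypothetical helper for illustration only. Header and comment lines
    (starting with '#') and line types other than 'a' and 's' are skipped.
    """
    blocks = []
    current = None
    for line in text.splitlines():
        fields = line.split()
        if not fields or fields[0].startswith("#"):
            continue
        if fields[0] == "a":
            # An 'a' line opens a new alignment block.
            current = []
            blocks.append(current)
        elif fields[0] == "s" and current is not None:
            # s <src> <start> <size> <strand> <srcSize> <text>
            src, start, size, strand, src_size, seq = fields[1:7]
            current.append({
                "src": src,
                "start": int(start),
                "size": int(size),
                "strand": strand,
                "src_size": int(src_size),
                "text": seq,
            })
    return blocks
```

Every transformation below can then be thought of as a function from a list of such blocks to a new list of blocks.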

## Reducing Divergence

Given a block with substitutions, AlnMassage can split it into smaller blocks that are free of mutations.

AAACC
AATCC


The two blocks below do not carry any mutations and thus the two sequences seem to be more closely related than they are. If necessary, AlnMassage removes blocks with too many substitutions altogether.

AA
AA

CC
CC
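
The split shown above can be sketched in a few lines of Python. This is only an illustration of the idea, with a hypothetical function name; it treats a block as a list of equally long strings and cuts it at every column that is not identical across all rows. Note that under this assumption a column containing a gap also counts as a mismatch.

```python
def split_at_mismatches(rows):
    """Split a block into the maximal runs of columns that are identical
    in every row. Mismatching columns are discarded.

    Hypothetical helper, not AlnMassage's actual implementation.
    """
    blocks = []
    current = []  # columns of the run currently being collected
    for column in zip(*rows):
        if len(set(column)) == 1:
            current.append(column)
        elif current:
            # A mismatch ends the run: turn the columns back into rows.
            blocks.append(["".join(chars) for chars in zip(*current)])
            current = []
    if current:
        blocks.append(["".join(chars) for chars in zip(*current)])
    return blocks


print(split_at_mismatches(["AAACC", "AATCC"]))
# [['AA', 'AA'], ['CC', 'CC']]
```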


## Increasing Divergence

A common technique among aligners is to first find exact matches and then extend the alignment until the number of homologous nucleotides drops off rapidly. AlnMassage can extend blocks until a target diversity is reached. Alternatively, it drops blocks consisting of sequences with low divergence to increase the overall distance.
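
Dropping conserved blocks is easy to sketch. The version below assumes divergence is measured as the mean pairwise p-distance (the fraction of mismatching columns, ignoring columns with a gap); that measure, the threshold parameter, and the function names are my assumptions for illustration, not a description of what AlnMassage does internally.

```python
from itertools import combinations


def p_distance(a, b):
    """Fraction of mismatching columns between two aligned rows,
    ignoring every column in which either row has a gap."""
    pairs = [(x, y) for x, y in zip(a, b) if x != "-" and y != "-"]
    if not pairs:
        return 0.0
    return sum(x != y for x, y in pairs) / len(pairs)


def block_divergence(rows):
    """Mean pairwise p-distance over all rows of a block."""
    dists = [p_distance(a, b) for a, b in combinations(rows, 2)]
    return sum(dists) / len(dists) if dists else 0.0


def drop_conserved(blocks, min_divergence):
    """Keep only blocks at least as divergent as min_divergence
    (a hypothetical threshold parameter)."""
    return [b for b in blocks if block_divergence(b) >= min_divergence]
```

For example, `drop_conserved([["AAAA", "AAAA"], ["AAAA", "AATT"]], 0.25)` keeps only the second block, whose p-distance is 0.5.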

## Regapping

Gaps give AlnMassage more material to play with. They can be used to subtly increase or decrease the substitution count.

AA-CCTGGAA
AATCCTG-AA


Removing the two gaps from above greatly increases the number of substitutions. As ML and distance methods usually do not take gaps into account, these changes mostly go unnoticed.

AACCTGGAA
AATCCTGAA


Applying the shown transformation in reverse decreases the number of substitutions at the expense of gaps.
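
The effect of regapping on the example above can be checked with a short Python sketch. The helper names are hypothetical. Stripping gaps row by row only yields a valid block when all rows end up with the same length, which the sketch asserts rather than handles.

```python
def count_substitutions(a, b):
    """Count mismatching columns, skipping columns that contain a gap."""
    return sum(x != y for x, y in zip(a, b) if x != "-" and y != "-")


def remove_gaps(rows):
    """Delete all gap characters from every row of a block."""
    stripped = [row.replace("-", "") for row in rows]
    # Assumption: every row carries the same number of gaps.
    assert len({len(row) for row in stripped}) == 1, "rows differ in length"
    return stripped


gapped = ["AA-CCTGGAA", "AATCCTG-AA"]
ungapped = remove_gaps(gapped)

print(count_substitutions(*gapped))    # 0
print(ungapped)                        # ['AACCTGGAA', 'AATCCTGAA']
print(count_substitutions(*ungapped))  # 3
```

With the gaps in place the two rows agree in every comparable column; without them, three columns differ.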