Bootstrapping Genomes

Bootstrapping has been the de-facto standard for computing support-values of phylogenies for decades; and it has served us well. However, with the rise of high-throughput sequencing its limits became apparent. Not only is it slow, but also the results become less useful, as the sequences get longer. I have had cases, where I aligned a set of bacterial genomes with two different tools, got two different topologies, but bootstrapping claimed 100% confidence in both cases. The problem is not faulty tools, but purely mathematical.

Lets have three sequences, for simplicity, all of length \(L\). With just three sequences, homolog positions can be one of five cases, all depicted in the alignment below.

S1  AAATA
S2  AATAC
S3  ATAAG

Case one and five do not influence the topology of the tree*. In the cases two, three and four, the two A sequences are grouped together, and the other is the outgroup. With such a simple case bootstrapping will give us the right answer, a confidence of roughly 33%. But real data looks more like this:

S1  AAATAAAAAAATTAAAA
S2  AATACAAAAAAAACAAA
S3  ATAAGTTAAATAAGAAT

Now more than 50% of all segregating sites have S3 as the outgroup. As a single bootstrap is a winner takes all situation, S1 will be clustered with S2 most of the time. In fact, if we append the above pattern a thousand times to itself, S1 will always be clustered with S2.

Let \(p_i\) with \(i \in {1,2,3,4,5}\) be the frequency of column-pattern \(i\) in the original sequence. So in the above example \(p_2 = 5/17\). Bootstrapping is a multi-nomial process and thus the new number of columns \(C\) of pattern 2 is distributed binomially: \(C = B(L; p_2)\). Furthermore, \(E(C) = L \cdot p_2\) and \(\mathit{Var}(C) = L \cdot p_2 \cdot (1- p_2)\).

However, the topology of the newly generated alignment is influenced by \(p_2’ = C/L\) with \(E(p_2’) = p_2\) and \(\mathit{Var}(p_2’) = p_2 \cdot (1-p_2) / L\). So the average frequency of pattern 2 is the same after bootstrapping as well as before bootstrapping. However, the variance is inverse-proportional to \(L\). As \(L\) grows, \(p_2’\) becomes fixed to \(p_2\)! So the longer the sequence, the less variance is introduced by bootstrapping. In the limit of whole genomes, barely anything wiggles, and so the confidence goes to 100%, because \(p_2 > 1/2\) and thus pattern 2 dominates the signal.

So for long sequences and especially whole-genomes, bootstrapping does not produce enough randomness for an accurate measure of confidence and other methods such as quartet analysis have to be used.

* Well, they do influence the tree, but not in the sense described here.