Читать книгу Principles of Microbial Diversity - James W. Brown - Страница 58

Generating a tree from a distance matrix

Оглавление

In the neighbor-joining method, the structure of the tree is determined first and then the branch lengths are fit to this skeleton.

Solving the tree structure

The tree starts out with a single internal node and a branch out to each sequence: an n-pointed star, where n is the number of sequences in the alignment. The pair of sequences with the smallest evolutionary distance separating them is joined onto a single branch (i.e., the neighbors are joined, hence the name of the method), and then the process is repeated after merging these two sequences in the distance matrix by averaging their distances from every other sequence in the matrix.

Using our distance matrix, the tree starts out like this (remember that we are sorting out the structure of the tree, not yet the branch lengths).


The closest neighbors in the distance matrix are A and B (0.11 evolutionary distance), so these branches are joined:


The distances to A and B from each of the other sequences are then averaged to reduce the distance matrix:


In this case, the averages are trivial; the average of 0.30 and 0.30 is, of course, 0.30, and so forth. This is not usually the case. Then the process is repeated. The closest neighbors in the reduced matrix are D with C (0.17):


Once again, the distance matrix is reduced by averaging. But be sure to average from the original distance matrix, not the previously reduced matrix:


Note that the value listed for A/B with C/D (0.265) is the average of four values in the original matrix: A to C (0.30), A to D (0.23), B to C (0.30), and B to D (0.23). If averaged AB/C (0.30) and AB/D (0.23) were averaged from the reduced matrix, the same number would be obtained in this case, but usually this is a more complex average and the numbers do not come out the same.

Once again, the smallest number in the matrix represents the nearest neighbors, in this case A/B with C/D (0.265), so these two branches are joined:


Each node on this tree has only three branches connecting to it; all of the nodes are completely resolved. This means that the structure of the tree has been determined. If there were more sequences, it would be necessary to reduce the matrix (joining A/B with C/D) and repeat the process until all of the nodes were resolved. The internal nodes have been arbitrarily labeled w, x, y, and z for reference when sorting out branch lengths (below).

Determining branch length

The next step is to determine the lengths of the branches on this tree. Basically, this is done by going through each node and finding where along the branch it is by figuring out the average difference in length along each of two branches. By choosing various sets of three sequences in a tree, the branch lengths can be sorted out just like a puzzle.

Branches w/A and w/B (w is the common node between A and B)

In our example, the distance between A and B is 0.11, and so the lengths of the two branches connecting them (A/w and w/B) must add up to 0.11. But where along this branch is the node (w)? If you look at the distance from any other sequence (C, D, or F) to A and to B, it is always the same. For example, C/A is 0.30 and so is C/B. This means that node w must be midway between A and B; each branch, then, has a length of 0.055. For example, with C used as reference:


Branches x/C and x/D

C and D are also simple neighbors, so we can easily solve these two connecting branches as well. The distance between C and D is 0.17. However, the distance to either C or D from elsewhere in the tree differs; this means that the node connecting C and D is not equidistant between them. In fact, this difference in distance between C and D varies a bit depending on which reference we use. These differences occur because we can only estimate evolutionary distance. As a result, we use the average (although most computer algorithms would use a least-squares average):


Therefore, 0.07 is the difference in branch length between A/C and A/D


This value, 0.08, is the average amount by which x/C is longer than x/D. Because the total length of C/D (think of this as C/x/D) is 0.170 and x/C is 0.08 longer than x/D, then:


x/C is the same 0.045 plus the 0.08 by which we already determined it was longer = 0.125:


Branches z/E and z/F

The distance between E and F can be calculated in the same way:


The length of E/z/F (E/F) is 0.30, and the average difference between anywhere else in the tree and E or F is −0.08. The negative sign means that the branch to F is longer than the branch to E. So z/E = (0.30 − 0.08)/2 = 0.11, and z/F − 0.11 + 0.08 = 0.1.


Internal branches w/x, x/y, and x/z

The tree so far is:


To solve the lengths of internal branches, the same process is used for collections of branches and then the average length outside the internal nodes is subtracted. To determine the lengths of w/y, x/y, and z/y, we need to determine the total lengths of w/x, x/z, and z/w and then figure out where along these to place y.

To solve for the length of w/x, the average distance between AB and CD in the reduced matrix (0.265) is used, and then the average w/A and w/B distance (AB/2 = 0.11/2 = 0.055) and the average x/C and x/D distance (CD/2 = 0.17/2 = 0.085) are subtracted to leave the w/x distance: 0.265 − 0.055 − 0.085 = 0.125.


To solve the length of x/z, the average CD/EF distance is calculated (this was not done in the reduced matrices, but it is the same process; it is 0.38) and then x/CD (CD/2 = 0.085) and x/EF (EF/2 = 0.30/2 = 0.150) are subtracted to give x/z = 0.145. Likewise, z/w = EF/AB (0.34) − w/AB (0.055) − z/EF (0.15) = 0.135.


Thus, w/x = 0.125, x/z = 0.145, and z/w = 0.135.

To determine where on any of these line segments node y is, we need to pick any of the line segments and calculate the difference to each end of this segment from the other branch. We do this just like we did for solving the lengths of the branches between A/B, C/D, or E/F. For example, for branch w/x:



Now we know all of the line segment lengths but one, y/z. We can get this by subtracting all of the known lengths of w/y from w/z, or all the known lengths of x/y from x/z; either way we get the same answer:

So, our final tree looks like this:


or like this with the branches drawn to scale:


Notice that in the final tree, the evolutionary distance between any two sequences is approximately equal (any difference is due to the need to average branch lengths) to the sum of the lengths of the line segments joining those two sequences; in other words, the tree is additive. This should be true on any quantitative phylogenetic tree.

The neighbor-joining method is very fast and so can be used on trees with much larger numbers of sequences than can other methods, which are discussed in chapter 5.

Principles of Microbial Diversity

Подняться наверх