
MB 451 Microbial Diversity

Department of Microbiology - NC State University




Molecular phylogenetic analysis (continued...)

Note: there are problem sets on tree construction and interpreting trees for this lecture.

Steps in a phylogenetic analysis

  1. Decide what gene & species to analyze (=ssu-rRNA)
  2. Determine the gene sequences (=PCR+DNA sequencing)
  3. Identify homologous residues (=sequence alignment)
  4. Phylogenetic analysis (=tree construction)

Tree construction

The most common type of phylogenetic analysis is tree construction. There are several methods for building trees, including distance matrix methods and parsimony methods. We'll discuss 'neighbor-joining' and the 'least-squares distance matrix' methods. Both of these methods are distance-matrix approaches.

Tree building starts with a sequence alignment. Here is an example alignment of 5 sequences with 25 positions in the alignment:

    Seq A   AGAUUCGUCUGUAGGUUUCCACCAA
    Seq B   ACAUUCGUGUAUAGGUUUCCACUAA
    Seq C   ACAUUCGUGUAGAGGUUUCCACUAA
    Seq D   AAGUUCGCUUGGAGGUUUCCACGAA
    Seq E   AUCGUGAGAUCCAGGUAUCCACAAU 

The first step toward building a tree is to generate a similarity matrix: Just count the fraction of identical bases in every pair of sequences in the alignment.

        Seq A   AGAUUCGUCUGUAGGUUUCCACCAA
                |X||||||X|X|||||||||||X||  21/25 = 0.84
        Seq B   ACAUUCGUGUAUAGGUUUCCACUAA
         
         
          A       B       C       D       E
    A   -----   -----   -----   -----   -----
    B   0.84    -----   -----   -----   -----
    C   0.80    0.96    -----   -----   -----
    D   0.76    0.72    0.76    -----   -----
    E   0.52    0.52    0.52    0.52    ----- 

In this example, sequences A and B are 0.84 (= 84%) similar, A and C are 0.80 similar, B and C are 0.96 similar, etc, etc.
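As a minimal sketch of this counting step, the Python below recomputes the similarity matrix from the example alignment. The names `ALIGNMENT`, `similarity` and `similarity_matrix` are made up for this illustration, and the sketch assumes the sequences are already aligned to the same length with no gaps.

```python
from itertools import combinations

# The five aligned sequences from the example alignment.
ALIGNMENT = {
    "A": "AGAUUCGUCUGUAGGUUUCCACCAA",
    "B": "ACAUUCGUGUAUAGGUUUCCACUAA",
    "C": "ACAUUCGUGUAGAGGUUUCCACUAA",
    "D": "AAGUUCGCUUGGAGGUUUCCACGAA",
    "E": "AUCGUGAGAUCCAGGUAUCCACAAU",
}


def similarity(seq1, seq2):
    """Fraction of aligned positions at which two sequences are identical."""
    if len(seq1) != len(seq2):
        raise ValueError("aligned sequences must have the same length")
    matches = sum(a == b for a, b in zip(seq1, seq2))
    return matches / len(seq1)


def similarity_matrix(alignment):
    """Map each unordered pair of sequence names to its fractional similarity."""
    return {
        (x, y): similarity(alignment[x], alignment[y])
        for x, y in combinations(alignment, 2)
    }


SIMILARITY = similarity_matrix(ALIGNMENT)
```

Each value is the number of matching positions divided by 25; for example, `SIMILARITY[("A", "B")]` is 21/25 = 0.84.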

The next step is to estimate evolutionary distances from the sequence similarities. Conversion of similarities to evolutionary distances starts with 1 - similarity (i.e. converting similarity to difference), which is then usually corrected for the underestimation caused by multiple mutations at the same position.

The observed dissimilarity is usually corrected (fudged) because it is an underestimate of the actual evolutionary distance. Counting differences between two sequences underestimates the number of changes that occurred between them, because more than one evolutionary change at a single position (e.g. A -> G -> U) counts as only one difference between two sequences, and in the case of reversion counts as no change at all (e.g. A -> G -> A). One way to correct evolutionary distances is the Jukes & Cantor method. This method is a conversion relating similarity to evolutionary distance such that difference (dis-similarity) and distance are very close when similarity is high, but the distance climbs ever more steeply as similarity falls and becomes infinite at 0.25 similarity. This makes sense; in two sequences that are very similar, the frequency of multiple changes at a single site is low, requiring only a small correction, whereas two random sequences will appear to be 25% similar, just because there are only 4 bases and 1 out of 4 will match by chance.


The equation for this curve is: estimated evolutionary distance = -3/4 × ln(1 - 4/3 × observed dissimilarity), where ln is the natural logarithm (log base e).
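The correction can be written as a short function. This is only a sketch of the equation above; the function name `jukes_cantor_distance` is an assumption for illustration, and the input is the observed dissimilarity (1 - similarity) expressed as a fraction.

```python
import math


def jukes_cantor_distance(dissimilarity):
    """Jukes & Cantor corrected distance for an observed fractional dissimilarity."""
    # The logarithm is undefined once dissimilarity reaches 0.75
    # (i.e. similarity falls to the 0.25 expected by chance).
    if not 0.0 <= dissimilarity < 0.75:
        raise ValueError("dissimilarity must be in the range [0, 0.75)")
    return -0.75 * math.log(1.0 - (4.0 / 3.0) * dissimilarity)
```

For example, the A-B dissimilarity of 0.16 gives a corrected distance of about 0.18, and the dissimilarity of 0.48 between E and each of the other sequences gives about 0.77.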

With all of the similarities converted to evolutionary distances (whether or not they are corrected, or how they are corrected), you have a distance matrix:

                evolutionary distance
         A       B       C       D       E
   A   -----   -----   -----   -----   -----
   B   0.18    -----   -----   -----   -----
   C   0.23    0.04    -----   -----   -----
   D   0.29    0.35    0.29    -----   -----
   E   0.77    0.77    0.77    0.77    -----

Generating a tree by neighbor-joining

In the neighbor-joining method, the structure of the tree is determined first, then the branch lengths are fit to this skeleton. The tree starts out with a single internal node and a branch out to each sequence. The closest relatives ("neighbors") are "joined" onto a single branch, and then the process is repeated using the average distances from the original neighbors. Using our distance matrix, the tree starts out like this (remember that we're sorting out the structure of the tree, not the branch lengths, yet).

The closest neighbors in the distance matrix are B and C (0.04), so these branches are joined:

The distances from all other sequences to B and C are then averaged to reduce the distance matrix:

           A     B/C     D     E
     A     -      -      -     -
    B/C  0.205    -      -     - 
     D   0.29   0.320    -     -
     E   0.77   0.770   0.77   -

Now the closest neighbors are A and B/C (0.205), so join them:

That's it! If there were more sequences, you'd re-reduce the matrix as before (A with B and C to make A/B/C), and repeat the process over and over until all of the nodes were resolved. Please note that if you're joining 2 branches where one or more has already been joined (e.g. if you had to join A with B/C) you need to calculate the averages with the collection of original distances, not the already partially averaged numbers. In other words, the distance E to A/B/C = [(E to A) + (E to B) + (E to C)]/3, not [(E to A) + (E to B/C)]/2.
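The joining procedure described above can be sketched in a few lines of Python. This follows the simplified averaging scheme in these notes (join the closest pair, then average the original distances); the full neighbor-joining algorithm of Saitou and Nei chooses pairs with a rate-corrected criterion instead of the raw smallest distance. The names `DIST`, `cluster_distance` and `join_closest` are hypothetical, and stopping at three remaining branches is an assumption that matches an unrooted tree.

```python
from itertools import combinations

# Corrected evolutionary distances from the distance matrix.
DIST = {
    ("A", "B"): 0.18, ("A", "C"): 0.23, ("A", "D"): 0.29, ("A", "E"): 0.77,
    ("B", "C"): 0.04, ("B", "D"): 0.35, ("B", "E"): 0.77,
    ("C", "D"): 0.29, ("C", "E"): 0.77, ("D", "E"): 0.77,
}


def original_distance(x, y, dist):
    """Look up the distance between two original sequences in either order."""
    return dist[(x, y)] if (x, y) in dist else dist[(y, x)]


def cluster_distance(c1, c2, dist):
    """Average of the ORIGINAL distances between all members of two clusters."""
    values = [original_distance(x, y, dist) for x in c1 for y in c2]
    return sum(values) / len(values)


def join_closest(names, dist):
    """Join the two closest clusters repeatedly until three branches remain."""
    clusters = [(name,) for name in names]
    joins = []
    while len(clusters) > 3:
        c1, c2 = min(
            combinations(clusters, 2),
            key=lambda pair: cluster_distance(pair[0], pair[1], dist),
        )
        joins.append((c1, c2, cluster_distance(c1, c2, dist)))
        clusters = [c for c in clusters if c not in (c1, c2)] + [c1 + c2]
    return joins, clusters
```

Calling `join_closest("ABCDE", DIST)` joins B with C first (0.04) and then A with B/C (0.205), leaving D, E and A/B/C attached to the central node. Because `cluster_distance` always goes back to the original matrix, the distance from E to A/B/C is the mean of three original values rather than a mean of partially averaged ones.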

Once the structure (topology) of the tree is determined, it is an easy computational task to calculate the lengths of all the branches that fit best to the original distance matrix by minimizing the total least-squares deviation.
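One way to carry out this least-squares fit is sketched below in plain Python. It assumes the topology from the joining steps (B and C meet at an internal node X, A joins X at node Y, and Y, D and E meet at the central node Z); the node labels X, Y and Z and all function names are made up for this illustration. Each branch is described by the set of sequences on one side of it, so a branch lies on the path between two sequences exactly when it separates them. The resulting lengths are an unconstrained fit and need not match the figure exactly.

```python
# Corrected evolutionary distances from the distance matrix.
DIST = {
    ("A", "B"): 0.18, ("A", "C"): 0.23, ("A", "D"): 0.29, ("A", "E"): 0.77,
    ("B", "C"): 0.04, ("B", "D"): 0.35, ("B", "E"): 0.77,
    ("C", "D"): 0.29, ("C", "E"): 0.77, ("D", "E"): 0.77,
}

# Assumed topology: each branch mapped to the sequences on one side of it.
BRANCHES = {
    "A-Y": {"A"}, "B-X": {"B"}, "C-X": {"C"}, "D-Z": {"D"}, "E-Z": {"E"},
    "X-Y": {"B", "C"}, "Y-Z": {"A", "B", "C"},
}


def on_path(side, p, q):
    """True if the branch with this side separates sequences p and q."""
    return (p in side) != (q in side)


def solve(matrix, rhs):
    """Solve a square linear system by Gauss-Jordan elimination with pivoting."""
    n = len(rhs)
    aug = [row[:] + [rhs[i]] for i, row in enumerate(matrix)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for r in range(n):
            if r != col:
                factor = aug[r][col] / aug[col][col]
                aug[r] = [v - factor * p for v, p in zip(aug[r], aug[col])]
    return [aug[i][n] / aug[i][i] for i in range(n)]


def least_squares_branch_lengths(dist, branches):
    """Branch lengths minimizing the squared error against the distances."""
    names = list(branches)
    pairs = list(dist)
    # One row per pair of sequences, one column per branch (1 = on the path).
    design = [
        [1.0 if on_path(branches[b], p, q) else 0.0 for b in names]
        for p, q in pairs
    ]
    observed = [dist[pair] for pair in pairs]
    k = len(names)
    # Normal equations: (X^T X) lengths = X^T d
    xtx = [[sum(row[i] * row[j] for row in design) for j in range(k)]
           for i in range(k)]
    xtd = [sum(row[i] * d for row, d in zip(design, observed))
           for i in range(k)]
    return dict(zip(names, solve(xtx, xtd)))


def tree_distance(lengths, branches, p, q):
    """Distance between p and q implied by the fitted branch lengths."""
    return sum(v for b, v in lengths.items() if on_path(branches[b], p, q))
```

With ten pairwise distances and only seven branches, the distances implied by the fitted tree (`tree_distance`) generally cannot match every entry of the matrix exactly; the fit spreads the disagreement across all of the branches.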

When you do this with our example, the tree looks like this:

Basically, what is done is to go through each node & figure out where along the branch it is by figuring out the difference in length along each branch. For example, if in the above example, the distance between A and B were the same as between A and C, then the node between B and C would be equidistant between them. If it were 0.01 shorter to B than to C, then the node would be 0.01 closer to B than to C. Since you know the distance between B and C, you know exactly where the node to A is. Of course in reality, you have to take an average using the results from all other sequences (A, D and E, in our example), because they're not likely to agree exactly. (Note that, in our example above, the estimated distances don't work very well to do this with B and C, because the distance between them is so small compared to the distance to other sequences in the tree.) By picking different sets of sequences, you can sort out the lengths of all of the branches.

For example:

          A       B       C
    A   -----   -----   -----
    B   0.25    -----   -----
    C   0.30    0.15    -----

With only 3 sequences, there is only one possible tree structure:

[Figure: a single internal node with one branch each to A, B and C]

Since the distance from A to B is 0.05 smaller than the distance from A to C, the node must be 0.05 closer to B than to C. Since the distance between B and C is 0.15, the node must be 0.05 from B and 0.10 from C. Since the distance from A to C is 0.30, and the node is 0.10 from C, the distance from the node to A is 0.20. So, the tree is:

[Figure: the same three-branch tree, with branch lengths A = 0.20, B = 0.05 and C = 0.10]

By choosing various sets of three sequences in a tree, you can sort out the branch lengths just like a puzzle.
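The three-sequence reasoning above reduces to a small formula: each branch is half of (the two distances that involve that sequence, minus the one that doesn't). A minimal sketch, with a made-up function name:

```python
def three_point_branch_lengths(d_ab, d_ac, d_bc):
    """Lengths of the branches from the single internal node to A, B and C."""
    a = (d_ab + d_ac - d_bc) / 2
    b = (d_ab + d_bc - d_ac) / 2
    c = (d_ac + d_bc - d_ab) / 2
    return a, b, c
```

With the distances from the small example, `three_point_branch_lengths(0.25, 0.30, 0.15)` gives branch lengths of about 0.20, 0.05 and 0.10 for A, B and C, the same values worked out by hand.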

Neighbor-joining is really fast, and so can be used on trees with much larger numbers of sequences than other methods. Notice that the distance between any two sequences is (approximately) equal to the sum of the length of the line segments joining those two sequences - in other words, the tree is additive.
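Additivity is easy to check in code: walk the tree from one sequence to the other and add up the branch lengths. The sketch below uses the three-sequence tree from the small example (branches of 0.20, 0.05 and 0.10 to A, B and C); the internal node label "N" and the function name `path_length` are assumptions for illustration.

```python
# Three-sequence example tree: internal node "N" with a branch to each sequence.
TREE = {
    "N": {"A": 0.20, "B": 0.05, "C": 0.10},
    "A": {"N": 0.20},
    "B": {"N": 0.05},
    "C": {"N": 0.10},
}


def path_length(tree, start, end, visited=None):
    """Sum of branch lengths on the path from start to end (None if no path)."""
    if start == end:
        return 0.0
    visited = (visited or set()) | {start}
    for neighbor, length in tree[start].items():
        if neighbor not in visited:
            rest = path_length(tree, neighbor, end, visited)
            if rest is not None:
                return length + rest
    return None
```

Here `path_length(TREE, "A", "B")` adds 0.20 and 0.05 to recover the original A to B distance of 0.25, and the A to C and B to C paths recover 0.30 and 0.15 in the same way.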

Rooting the tree with an outgroup

Whenever possible, include an outgroup sequence in the analysis; an outgroup is a sequence that is known to be outside of the group you're interested in treeing. Only by including an outgroup can you locate the root of a tree. For example, if you were building trees from mammalian sequences, you might include the sequence from a reptile as an outgroup. Outgroups provide the root to the rest of the tree - although no tree generated by these methods has a real root, if you know (from other information) that one of the sequences lies outside the group formed by the rest, wherever that branch connects to the rest of the tree defines the root (common ancestor) of that portion of the tree. In the example above, sequences A - D might be mammalian whereas E might be a reptilian sequence. If the tree included only mammalian sequences, it would be impossible to know where the root is, but the inclusion of an outgroup provides that information.
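As a rough sketch of outgroup rooting, the Python below takes the unrooted topology from the joining example (branch lengths left out) and writes it as a rooted tree in Newick format, with the root placed on the branch leading to the outgroup. The internal node labels X, Y and Z and the function names are hypothetical; the sketch assumes the outgroup is a single terminal sequence.

```python
# Unrooted topology from the joining example; X, Y and Z are internal nodes.
UNROOTED = {
    "A": ["Y"], "B": ["X"], "C": ["X"], "D": ["Z"], "E": ["Z"],
    "X": ["B", "C", "Y"],
    "Y": ["A", "X", "Z"],
    "Z": ["Y", "D", "E"],
}


def subtree(tree, node, parent):
    """Newick text for everything reached from node without going back to parent."""
    children = [n for n in tree[node] if n != parent]
    if not children:
        return node  # a terminal node is written as its own name
    return "(" + ",".join(subtree(tree, child, node) for child in children) + ")"


def root_with_outgroup(tree, outgroup):
    """Root the tree on the outgroup's branch and return it in Newick format."""
    # A terminal sequence has exactly one neighbor; unpacking fails otherwise.
    (attachment,) = tree[outgroup]
    return "(" + outgroup + "," + subtree(tree, attachment, outgroup) + ");"
```

With E as the outgroup, `root_with_outgroup(UNROOTED, "E")` returns `(E,((A,(B,C)),D));`. The point where E attaches (node Z) becomes the common ancestor of A - D, which is what the outgroup is for.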

Molecular phylogenetic trees depict the relationships between gene sequences

The last "step" in the tree-building process is the leap of faith that the tree depicting the relationships between the sequence similarities in an alignment also depicts the evolutionary history of the organisms the sequences came from. You can go wrong for many reasons at this step.

The most common reason is the choice of sequence for use in the analysis - step 1 of the whole process. What would happen if, for example, you made a tree of mammals using globin genes, but used alpha globin sequences from some species and beta globin sequences from others? What might happen if, in an analysis of plants, you used nuclear ssu-rRNA sequences from some plants, chloroplast ssu-rRNAs from others, and mitochondrial sequences from still others? What if, in an analysis of bacterial species, you used a gene like penicillin-resistance, that clearly moves readily from species to species? The trees you would build with these sequences might be perfectly valid trees, accurately representing the relationships between the genes used, but would not represent the relationships between the organisms!

So, if a tree just fundamentally doesn't look right, check the alignment first, but then think about the sequences used & how a tree of these sequences might not reflect the relationships between the 'host' organisms.


How to read a phylogenetic tree

A phylogenetic tree is a representation of the evolutionary/genealogical relationships between a collection of organisms (or molecular sequences). There are many different ways to draw these trees, but they share a common set of features (please note that the trees that follow are generalized approximations):


  • Scale. This typically is either time or evolutionary divergence. Trees with a time scale are based on some form of physical data, such as a fossil record, that provides dating information. If the scale is time, all modern organisms should obviously be shown at the same part of the scale. More often, the scale is evolutionary divergence, some measure of change in the organisms (or molecules). Because the extent of divergence is usually different in various parts of the tree, this is usually depicted by varying the lengths of the branches and providing a scale bar.
  • Terminal nodes. These are the ends of the branches of the evolutionary tree - typically the modern organisms (or molecules) that are being compared, but in some cases the ends of evolutionary branches that became extinct.
  • Internal nodes. These represent the last common ancestors of all of the organisms (or molecules) bound by this node.
  • Root. This is the 'base' of the tree - the last common ancestor of all of the organisms (or molecules) in the tree. It is not always possible to identify the root of a tree - typically, this requires either physical data (e.g. a fossil record) or data about organisms outside of the part of the tree shown.
  • Branches. These are the connections between nodes in the tree. These represent the evolutionary pathway between common ancestors (internal nodes) and modern organisms (terminal nodes). The length of these branches is defined by the scale - each branch represents a certain amount of historical time, if time is the scale used in the tree, or a certain amount of evolutionary change, if evolutionary divergence is the scale used in the tree.

There are a variety of ways of drawing trees. Here are two other trees of the same organisms that also use time as the scale:


The scales (time) in these two trees are horizontal rather than vertical, and the branches are simple diagonal lines connecting nodes, but the information in these trees is the same as in the previous tree. These trees are phenograms - the scale is read by horizontal distance. Also notice that the order of the terminal nodes is irrelevant - only the topology of the tree and the lengths of the connections count. The positions of the branches and nodes can be switched around at will, as long as the nodes and their connections are not broken and remain true to the scale.


These trees use phylogenetic distance as the scale - measured in this case by the extent of sequence divergence in the small subunit ribosomal RNA sequences of the species. Notice that the lengths of the branches are uneven, because the rate of evolutionary drift in these sequences is not constant. The tree on the right is rootless - no root is shown. In order to root a tree, you need data from the fossil record, or other physical information, or in a molecular-based tree, you need to use an outgroup in the tree to place the root. This outgroup is chosen because you know, from other data, that this organism falls outside of the group defined by the other organisms. For example, in this tree of apes, you could root the tree using data from a New World monkey.

These trees are dendrograms - the scale is measured along the branch lengths rather than horizontally or vertically. The evolutionary distance between any two organisms is the total of the lengths of all of the branches that connect them. For example, the evolutionary distance between lowland gorillas and lowland chimps would be about 0.1:


A phenogram can also be used with an evolutionary distance scale - in this case, remember that the scale (evolutionary distance) is measured only in horizontal (or vertical) distance:


Because evolutionary rates are not constant, some organisms have changed more than others since their common ancestry. In the example above, the ribosomal RNA sequences of humans have changed more than those of lowland chimps since their last common ancestor. Lowland chimps, then, are primitive relative to humans with respect to rRNA sequence - humans are more highly derived than chimps, again with respect to rRNA sequence. If the traits of an organism overall are more similar to the ancestor than in the other members of that group, that organism is thought of as a primitive organism. This is very useful information, but it can be dangerous - in most cases, the traits of an organism are not evolving at similar or constant rates, and so an organism might be primitive in most traits but highly derived in others. Sharks, for example, are primitive fish with respect to many traits (cartilaginous skeletons, dentate scales, etc) but are highly derived with respect to others (their immune system and their electrosensory system). The danger is that there is a tendency to confuse generally primitive organisms with ancestors. For example, chimps are generally (morphologically) more primitive than humans, but chimps are not ancestors of humans - the common ancestor of humans and chimps was not a chimp! Chimps are modern organisms! They just share more similarity to the common ancestor of humans and chimps than do humans.


Questions for thought:

  • What organism would you choose for an outgroup for an rRNA tree of mammals? As an outgroup for an rRNA tree of Bacteria? What about for a "universal tree" containing sequences from all kinds of organisms?
  • What would a tree of some animals look like if constructed from globin genes where some of the sequences were alpha globins & others were beta globins?
  • What would a tree (no pun intended) of plants look like if some of the sequences (rRNAs) were accidentally taken from the chloroplast instead of the nucleus? What if all of the sequences were from the chloroplasts?

Last updated April 03, 2009 by James W Brown