Molecular phylogenetic analysis (continued...)

Other treeing and evaluation methods

The process I've described so far is straightforward - in fact too much so. Phylogenetic analysis is an entire scientific field, and of course what I've presented is very highly simplified. Today I'll touch (just touch) on some of the complexities.

Protein-based trees

At first glance, you might think that alignments and trees based on protein sequences would be more informative than those based on DNA or RNA sequences, because there are only 4 bases but 20 standard amino acids (BTW, there are 2 known "non-standard" amino acids that are built into proteins during translation, selenocysteine and pyrrolysine). But despite the lower information content of nucleic acid sequences, there are fundamental reasons why non-mRNAs are generally better for phylogenetic analysis:
Alternative substitution models

In the last lecture we talked about the Jukes & Cantor method to estimate evolutionary distance from sequence similarity. This is a simple method, but there are several other, more sophisticated methods. In the Jukes & Cantor method, any change is scored equivalently - for each position in a pairwise comparison, the bases are either a match or not. A commonly-used alternative is the Kimura 2-parameter model, in which transitions (purine-to-purine or pyrimidine-to-pyrimidine changes) and transversions (purine-to-pyrimidine changes or the reverse) are scored differently, and these scores are based on sifting the alignment first to estimate their relative frequency. A refinement of this is to "weight" the score of each position in an alignment differently based on how conserved that position is - a change in a conserved base then counts for more than a change in a highly variable position. Distance matrices from protein alignments usually use a scoring table derived from the observed relative frequencies with which any amino acid is converted to another in a huge collection of aligned protein sequences, e.g. the PAM tables.

One of the things substitution models fight is an artifact called long branch attraction, which affects all treeing algorithms and is the result (primarily) of an underestimation of the evolutionary distance between distantly-related sequences. This underestimation results in a tendency for the longest branches in a tree to cluster together artifactually.

These and any other methods for estimating evolutionary distance amount to an attempt to describe how sequences change. In other words, they are mathematical models of the mechanisms of evolution of these sequences, and are therefore usually called "substitution models". Choice of an appropriate substitution model is critical, and often underappreciated.

Sometimes even rRNA sequences change adaptively - the big example here is a tendency for sequences to differ in G+C vs A+U content, either because the genome has an unusual G+C content (i.e. there is pressure toward either GC or AT richness in the DNA) or because the organism is a thermophile & so prefers G=C basepairs over A=U in its RNAs. This can wreak havoc in a tree. A way around this is to do a purine vs pyrimidine analysis, which is just a conversion of the alignment so that A and G are treated as the same base, and likewise C and U (commonly all G's are converted to A's & C's to U's). Trees are generated from these alignments in the usual fashion. These trees are, of course, based on less data - you've thrown out more than half of the phylogenetic information in the alignment - but should be free of G+C bias artifacts.

Alternative treeing algorithms

Fitch - an alternative distance-matrix treeing method

Another useful method for generating trees from distance matrices is that of Fitch and Margoliash - commonly called Fitch. This is done by starting with two of the sequences, separated by a line equal to the length of the evolutionary distance between them:
Then the next sequence is added to the tree such that the distances between A, B and C are approximately equal to the evolutionary distances:
Notice that the fit isn't perfect. If we could determine the evolutionary distances exactly, they would fit the tree exactly, but since we have to estimate these distances, the numbers are fit to the tree as closely as possible using a least-squares best fit. The next step is to add the next sequence, again re-adjusting the tree to fit the distances as well as possible:
And at last we can add the final sequence and readjust the branch lengths one last time using least-squares:
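To make the least-squares step concrete, here is a minimal Python sketch - not the actual Fitch program. It assumes the tree shape is already fixed as ((A,B),(C,D)), the distances are made up for illustration, and it uses plain unweighted least squares (the Fitch and Margoliash criterion proper weights each squared error, typically by 1/d^2). The names `solve_linear`, `fit_branch_lengths` and `PATHS` are hypothetical helpers, not part of any package.

```python
def solve_linear(matrix, rhs):
    """Solve matrix * x = rhs by Gaussian elimination with partial pivoting."""
    n = len(rhs)
    aug = [row[:] + [value] for row, value in zip(matrix, rhs)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for row in range(col + 1, n):
            factor = aug[row][col] / aug[col][col]
            for k in range(col, n + 1):
                aug[row][k] -= factor * aug[col][k]
    x = [0.0] * n
    for row in range(n - 1, -1, -1):
        tail = sum(aug[row][k] * x[k] for k in range(row + 1, n))
        x[row] = (aug[row][n] - tail) / aug[row][row]
    return x


def fit_branch_lengths(paths, distances):
    """Branch lengths minimising sum((observed - path length)^2) via normal equations."""
    n = len(paths[0])
    ata = [[sum(p[i] * p[j] for p in paths) for j in range(n)] for i in range(n)]
    aty = [sum(p[i] * d for p, d in zip(paths, distances)) for i in range(n)]
    return solve_linear(ata, aty)


# Branch order: a, b, c, d (tips A-D) and e (the internal branch) on ((A,B),(C,D)).
# Each row marks which branches lie on the path between one pair of sequences.
PATHS = [
    [1, 1, 0, 0, 0],  # A-B
    [1, 0, 1, 0, 1],  # A-C
    [1, 0, 0, 1, 1],  # A-D
    [0, 1, 1, 0, 1],  # B-C
    [0, 1, 0, 1, 1],  # B-D
    [0, 0, 1, 1, 0],  # C-D
]
observed = [0.30, 0.55, 0.62, 0.58, 0.60, 0.33]  # made-up estimated distances

branches = fit_branch_lengths(PATHS, observed)
fitted = [sum(p * b for p, b in zip(path, branches)) for path in PATHS]
residual = sum((o - f) ** 2 for o, f in zip(observed, fitted))
print("branch lengths:", [round(b, 4) for b in branches])
print("sum of squared errors:", residual)
```

Six pairwise distances have to be explained by only five branch lengths, so estimated distances generally can't be fit exactly; the leftover squared error is what the method keeps as small as possible.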
Neighbor-joining and Fitch are both least-squares distance-matrix methods, but a big difference is that neighbor-joining separates the determination of the tree structure from solving branch lengths, whereas Fitch solves them together, but does so by adding branches (sequences) one at a time.

Parsimony

This is actually a collection of related methods based on the same premise - Occam's Razor. In other words, the tree that requires the smallest number of sequence changes is the most likely tree. No distance matrix is calculated; instead, trees are searched and each ancestral sequence is calculated, then the number of "mutations" required is added up. Testing every possible tree is not usually possible, so a variety of search algorithms are used to examine only the most likely trees. Likewise, there are a variety of ways of counting (scoring) sequence changes. For example:
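One simple way of counting changes can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the four short sequences are made up, every change costs the same, and the counting rule is the standard set-intersection rule (Fitch's small-parsimony algorithm, which shares a name with, but is separate from, the distance method above). With four sequences there are only three possible unrooted trees, so all of them can be scored; `fitch_site` and `parsimony_score` are hypothetical names.

```python
def fitch_site(tree, states):
    """Return (possible ancestral bases, changes needed) for one alignment column.

    A tree is either a sequence name (a tip) or a pair (left subtree, right subtree).
    """
    if isinstance(tree, str):
        return {states[tree]}, 0
    left_set, left_cost = fitch_site(tree[0], states)
    right_set, right_cost = fitch_site(tree[1], states)
    common = left_set & right_set
    if common:
        # The two subtrees can share an ancestral base: no extra change.
        return common, left_cost + right_cost
    # No shared base: at least one change is needed at this node.
    return left_set | right_set, left_cost + right_cost + 1


def parsimony_score(tree, alignment):
    """Total number of changes the tree requires, summed over all columns."""
    length = len(next(iter(alignment.values())))
    total = 0
    for i in range(length):
        column = {name: seq[i] for name, seq in alignment.items()}
        total += fitch_site(tree, column)[1]
    return total


# Made-up aligned sequences, for illustration only.
alignment = {"A": "GGUUC", "B": "GCUUC", "C": "ACUAU", "D": "ACUAU"}

# The three possible groupings of four sequences.
trees = {
    "((A,B),(C,D))": (("A", "B"), ("C", "D")),
    "((A,C),(B,D))": (("A", "C"), ("B", "D")),
    "((A,D),(B,C))": (("A", "D"), ("B", "C")),
}

scores = {name: parsimony_score(tree, alignment) for name, tree in trees.items()}
best = min(scores, key=scores.get)
print(scores)
print("most parsimonious:", best)
```

For this toy data the grouping of A with B and C with D needs 4 changes, while the other two trees need 7 each, so ((A,B),(C,D)) is the most parsimonious tree.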
The terminal sequences are the ones you start out with - the algorithm then tries to find the tree, and predict the internal node sequences, that could lead to these sequences with the smallest number of changes. Parsimony methods are usually slower than distance-matrix methods (especially neighbor-joining, which is very fast) but much, much faster than the Maximum Likelihood methods described below. Parsimony uses more of the information in an alignment, since it doesn't reduce the data to a distance matrix, but in my experience works best with relatively closely-related sequences.

Maximum-likelihood

This method turns the tree-construction process on its head - starting with a cluster analysis to generate a "starting" tree, from which the substitution model is derived, then goes back and calculates the tree that has the maximum likelihood of resulting in the observed sequences based on the model parameters. Sound complicated? It is, and ML tree construction is by far the most computationally intensive of the methods described. But it's generally also the best, in the sense that the trees are more consistent and robust. The limitation is that fewer sequences (and shorter sequences) can be analyzed by ML. A tree that might take a few seconds by neighbor-joining might take a few minutes by parsimony or Fitch, but a few hours, or a couple of days, by ML. This is serious - it means that you can't usually "play" with trees, testing various changes in the data or treeing parameters & seeing the result right away.

Bootstrapping

Bootstrapping is a method to evaluate the reliability of a tree. Unfortunately, it is mathematically impossible to extract confidence intervals for nodes in a tree using the standard treeing methods - in other words, you cannot determine how sure you should be of the placement of each node in a tree (branching-order). (BTW, this is not true of branch lengths, from which confidence intervals can be readily calculated.)
The standard method of evaluating tree branching-order is bootstrapping. In a bootstrap analysis, a sequence alignment is randomly sampled over and over again, typically 100 or 1000 times, and trees are generated from each random sampling. The reliability of a particular branching arrangement in a tree is judged by the frequency with which that branch appears in all of the resulting trees.

The random sampling starts with the input alignment. For a 'regular' tree, the similarity matrix is generated by checking sequences against each other pair-wise, tallying up similarity with each position in the alignment counted once and only once. In a bootstrap sampling, a similarity matrix would be generated from an alignment of 1000 positions by comparing 1000 randomly selected positions in the alignment, chosen with replacement, so that some positions will be compared more than once, and some will not be compared at all. For example:

position     1 2 3 4 5 6 7 8 9
sequence 1   g g u u c g c c u
sequence 2   g c u u u g g c u
sequence 3   g c u u u - g c u

would be randomly sampled, ending up with a 'temporary' alignment, from which a similarity matrix and tree would be generated:

position     9 2 5 8 3 9 2 1 9
sequence 1   u g c c u u g g u
sequence 2   u c u c u u c g u
sequence 3   u c u c u u c g u

Notice that some positions in the alignment are included multiple times (9 and 2) and some are not included (4, 6, 7). In realistically large alignments, such randomly-sampled alignments yield good trees if the branching arrangements are well-supported by the sequence data.

So in a bootstrap analysis, a large number of trees are generated from random samples of the alignment, and the number of these trees that agree with each branch of the reference tree (generated the usual way) is shown on the tree. Often, the same type of analysis is performed using more than one method of tree construction.
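The resampling step itself is simple enough to sketch in Python. This is a minimal illustration, not what any particular treeing package does internally: it draws alignment columns with replacement and rebuilds a 'temporary' alignment, using the three short sequences from the example above. The function names (`sample_columns`, `bootstrap_sample`, `similarity`) are hypothetical, and building a tree from each sample is left out.

```python
import random


def sample_columns(alignment, picks):
    """Build a temporary alignment from the given column indices (0-based)."""
    return ["".join(seq[i] for i in picks) for seq in alignment]


def bootstrap_sample(alignment, rng):
    """Pick as many columns as the alignment has, at random, with replacement."""
    length = len(alignment[0])
    picks = [rng.randrange(length) for _ in range(length)]
    return picks, sample_columns(alignment, picks)


def similarity(seq_a, seq_b):
    """Fraction of positions at which two aligned sequences match."""
    matches = sum(1 for x, y in zip(seq_a, seq_b) if x == y)
    return matches / len(seq_a)


alignment = ["gguucgccu", "gcuuuggcu", "gcuuu-gcu"]

# The sampling shown in the lecture example (positions 9 2 5 8 3 9 2 1 9),
# written as 0-based indices.
lecture_picks = [8, 1, 4, 7, 2, 8, 1, 0, 8]
lecture_sample = sample_columns(alignment, lecture_picks)
print(lecture_sample)

# A few random replicates; a real analysis would use 100 or 1000 and
# generate a tree from every one of them.
rng = random.Random(2009)  # fixed seed only so the run is repeatable
for _ in range(3):
    picks, sample = bootstrap_sample(alignment, rng)
    print([p + 1 for p in picks], round(similarity(sample[0], sample[1]), 2))
```

Each replicate has the same number of columns as the original alignment, which is why some positions must turn up more than once whenever others are left out.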
The evaluation of bootstrap scores is subjective, but generally branches that show up in at least 50% of the trees generated from bootstrapped datasets are considered to be reliable. Here is an example of how this is usually shown on a tree:
Example phylogenetic analysis: strain ES-2

In about 1990, an organism (designated ES-2) was isolated from a deep-sea hydrothermal vent sample. ES-2 grows heterotrophically at 65 °C. The lipid analysis of the isolate was unusual, showing a number of apparently branched lipids as well as fatty acids. Electron microscopy and standard microbiological tests were not helpful in identifying the organism.

LOCUS Eub.tmarin 893 bp RNA 12-FEB-1993
With the sequence in hand, the next step in the process is to align the sequence with the ssu-rRNA database and calculate phylogenetic trees. The alignment process is guided by the secondary structure generated for the signature analysis. Here is a portion of the sequence alignment:
The sequence was then 'tree-ed' with representative sequences from each of the bacterial phyla to see which was the closest relative:
The sequence seems most closely related to the representative of the Gram-positive bacteria (Bacillus subtilis). So, the next step is to generate another tree using the sequences available from the Gram-positive Bacteria. At the time this analysis was done, only a few such sequences were available - now more than a thousand Gram-positive sequences are in the databank!
The tree shows that ES-2 is a member of the Clostridium/Eubacterium group of the low G+C Gram-positive bacteria.

Conclusion

ES-2 is a member of the Clostridium/Eubacterium group of the low G+C Gram-positive Bacteria. In this group, organisms that produce spores are considered to be of the genus Clostridium; those that do not are Eubacterium. Because of the environment it was isolated from, the new species was named Eubacterium thermomarinus.

Questions for thought:
Last updated April 03, 2009 by James W Brown