Molecular phylogenetic analysis
Note: There is a problem set on sequence alignment for this lecture.
Molecular phylogenetic analysis is the use of macromolecular sequences to reconstruct the phylogenetic relationships between organisms. The extent of difference between homologous DNA, RNA, or protein sequences in different organisms is used as a measure of how much these organisms have diverged from one another evolutionarily.
Steps of a molecular phylogenetic analysis:
- Decide what organisms/sequence/gene/region to examine
- Determine the sequence(s) experimentally
- Identify homologous residues
- Phylogenetic analysis
Steps in a phylogenetic analysis
- Decide what gene & species to analyze - Choosing a molecular clock
- Determine the gene sequences
- Identify homologous residues
- Phylogenetic analysis
Deciding on a sequence for analysis - Molecular Clocks
Some sequences are better than others - the most important factors to consider in a sequence for phylogenetic analysis are:
- Clock-like behavior, i.e. sequence divergence in the gene between two organisms should be proportional to how long ago they diverged. Clock-like behavior depends mostly on functional constancy of the sequence, because a change in function leads to large, directed sequence change. It also depends on the sequence being long enough to provide statistically significant information, and being composed of many independently evolving domains so that random changes in one domain don't affect other domains.
- Phylogenetic range. The sequence must be present in all of the organisms to be analyzed and must have retained its structure & function in these organisms. Watch out for gene families, because each member of the family is probably specialized for a slightly different function. The gene must have enough variation in sequence to evaluate statistically but must be similar enough that homologous residues can be identified. Non-functional sequences (e.g. introns) usually change too fast for analysis, except among the very closest of relatives.
- No horizontal transfer. This means that the gene must be acquired only by inheritance, not by transfer from another organism. Genes encoding antibiotic resistance are an example of genes that are frequently transferred horizontally. You can still generate a tree with these sequences, but they do not reflect the genealogy of the organism as a whole.
- Must have a large existing data set to compare your sequences with!
In most cases, the gene encoding the RNA in the small subunit of the ribosome (ssu rRNA) is the best choice because:
- It is present in all cells
- It has exactly the same function in all cells
- It consists of 1500-2000 residues - large enough to be informative, but not so large that it is time-consuming to sequence.
- It is made up of ca. 50 independently evolving helices and ca. 500 base-pairs.
- It is conserved enough in sequence & structure to be readily & accurately aligned.
- It contains both rapidly & slowly evolving regions - the fast regions are useful for resolving relationships between closely related species, whereas the slow regions are useful for determining distant relationships
- Horizontal transfer of rRNA genes apparently does not occur (also true for other genes in the central information processing pathways of the cell).
- There is a large database (many, many thousands) of aligned sequences available
Here is the secondary structure of the ssu-rRNA of Escherichia coli:
Steps in a phylogenetic analysis
- Decide what gene & species to analyze
- Determine the gene sequences - PCR amplification and DNA sequencing
- Identify homologous residues
- Phylogenetic analysis
Note: This section is likely to be review for most students, and so we may go over it quickly or not at all in class.
Obtaining the sequence experimentally
The commonly used method these days, and the one we'll use in lab, is Polymerase Chain Reaction (PCR). PCR amplifies genes exponentially - a single molecule of a gene, embedded in the rest of the genomic DNA, is specifically amplified a million-fold or more in 30 cycles! (Perfect doubling would give 2^30, or about a billion, copies; real reactions fall short of this as primers and nucleotides are used up.) In a PCR reaction, 3 steps (denaturation, primer annealing, and DNA polymerization) are cycled over and over, each cycle doubling the amount of the specific DNA fragment.
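The arithmetic of doubling can be sketched in a few lines of Python. This is only an illustration: the function name and the `efficiency` parameter are assumptions made for the example, and an efficiency of 1.0 means every template is copied in every cycle, which real reactions only approach in the early cycles.

```python
def pcr_copies(start_copies: int, cycles: int, efficiency: float = 1.0) -> float:
    """Idealized copy number after a number of PCR cycles.

    efficiency = 1.0 is perfect doubling; lower values model cycles in
    which only a fraction of the templates are copied (an assumption
    made for illustration, not a measured quantity).
    """
    return start_copies * (1.0 + efficiency) ** cycles


# Starting from a single template molecule:
for n in (10, 20, 30):
    print(n, "cycles:", f"{pcr_copies(1, n):.3g}", "copies")
```

With perfect doubling, a single molecule passes a million copies at cycle 20 and reaches about a billion at cycle 30; lowering `efficiency` shows how quickly the yield drops when doubling is less than perfect.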
The PCR product is then sequenced - often using the same oligonucleotide primers that were used in the PCR reaction. Sequencing involves denaturing the DNA, annealing an oligonucleotide primer, and extending from this primer with DNA polymerase in the presence of dNTPs and small amounts of 'chain terminator' dideoxynucleotides (analogs of dNTPs that DNA polymerase cannot continue extending from):
1. Denature the DNA (separate the strands) with heat or high pH.
2. Anneal an oligonucleotide primer complementary to the DNA:
3. Add all 4 dNTPs, a small amount of each of the 4 dideoxynucleotides (ddNTPs), each with a different fluorescent 'tag', and DNA polymerase:
4. Run sample on a high-resolution gel (one that can separate DNAs that differ by only a single base):
A fluorometer at the bottom of the gel detects the termination dyes as they run past in each lane of the gel (there are usually about 50 lanes per gel!). The connected computer collects this data and 'reads' the sequence from the pattern of peaks. The output from the computer looks like this:
(notice that the colors used here don't match the example)
Each reaction gives 300-800 bases of sequence, so it is usually necessary to use several primers spaced along the length of the molecule to get the complete sequence of an rRNA gene.
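The logic of chain-termination sequencing can be mimicked with a toy simulation. This is a sketch under simplifying assumptions: the template is a made-up 12-base sequence, every possible terminated product is assumed to be present exactly once, and the function names are invented for the example.

```python
import random

COMPLEMENT = str.maketrans("ACGT", "TGCA")


def new_strand(template_3to5: str) -> str:
    """Strand the polymerase makes (5'->3') from a template written 3'->5'."""
    return template_3to5.translate(COMPLEMENT)


def terminated_fragments(strand: str):
    """One product per position: (fragment length, tagged ddNTP at its 3' end)."""
    return [(i + 1, base) for i, base in enumerate(strand)]


def read_gel(fragments) -> str:
    """Shortest fragments pass the detector first, so read the tags by length."""
    return "".join(base for _, base in sorted(fragments))


template = "TACGGCATTGCA"            # hypothetical template, written 3'->5'
strand = new_strand(template)
fragments = terminated_fragments(strand)
random.Random(0).shuffle(fragments)  # in the tube the products are all mixed together
print(read_gel(fragments))           # the gel sorts them again by length
```

The point of the sketch is that the gel does the sorting: once the fragments are ordered by length, the color of the terminating base on each one spells out the new strand.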
Steps in a phylogenetic analysis
- Decide what gene & species to analyze
- Determine the gene sequences
- Identify homologous residues - Sequence alignment
- Phylogenetic analysis
Identifying homologous residues - i.e. aligning the sequences.
This is the most important part of the analysis!
example:
position:  1  2  3  4  5  6  7  8  9 10
seq A      A  A  A  C  U  U  G  U  U  U
seq B      A  C  A  C  U  U  G  U  G  U
seq C      A  G  A  U  U  U  -  U  C  U
An alignment is a 2-dimensional matrix of multiple sequences. Each sequence is in a line (row) of the matrix. Each position (column) in an alignment contains homologous residues of each sequence. Gaps (shown as dashes) are added where needed to maintain the alignment. These gaps represent bases absent in that sequence that are present in some other sequence in the alignment.
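The matrix view of an alignment translates directly into code. The sketch below uses the three example sequences above; the function names are invented for the example, and the choice to ignore gaps when comparing residues is an assumption (other treatments of gaps are possible).

```python
alignment = {
    "seq A": "AAACUUGUUU",
    "seq B": "ACACUUGUGU",
    "seq C": "AGAUUU-UCU",
}


def columns(aln):
    """Turn the rows (sequences) into columns (homologous positions)."""
    return list(zip(*aln.values()))


def variable_positions(aln):
    """1-based positions with more than one kind of residue (gaps ignored)."""
    return [
        i + 1
        for i, col in enumerate(columns(aln))
        if len(set(col) - {"-"}) > 1
    ]


def p_distance(a: str, b: str) -> float:
    """Fraction of differing residues, over columns where neither row has a gap."""
    pairs = [(x, y) for x, y in zip(a, b) if x != "-" and y != "-"]
    return sum(x != y for x, y in pairs) / len(pairs)


print(variable_positions(alignment))
print(p_distance(alignment["seq A"], alignment["seq B"]))
```

For this alignment the variable positions are 2, 4, and 9, and seq A and seq B differ at 2 of 10 positions. Differences counted column by column like this are the raw material for the phylogenetic analysis step.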
Sequences must be fairly similar in sequence and length to be readily alignable 'by eye', or by computer alignment programs (e.g. Clustal). Thank goodness, most of the length of an ssu-rRNA is highly conserved and can (with experience) be manually aligned without much trouble.
Some of the tricks to aligning sequences are:
- Sequences are often aligned sequentially - start by aligning the two most similar sequences, then add sequences to the alignment one at a time, starting with the sequences most similar to those already aligned and finishing with the most distantly related sequences. Likewise, if you're adding a single sequence to an existing alignment, start by identifying the most similar sequence in the alignment & use that sequence as a guide.
- Alternatively, you can identify conserved blocks of sequence in all of the sequences, and align these. You have now broken the alignment problem into smaller, easier chunks. Add gaps as needed to align the space between pre-aligned chunks according to the criteria below.
- Start out by finding patches of very similar sequences and align these, then work out in both directions from these, adding gaps sparingly when needed. Everything after this is about rearranging (and potentially adding or removing) these gaps.
- Where there are sequence differences, slide the gaps around to keep purines (G, A) aligned with purines & pyrimidines (C, U) aligned with pyrimidines.
- Try also to keep differences together in variable sequence positions, and align gaps together in columns wherever possible. A single gap of two positions is a lot better than two separate gaps of one position each!
- Try to keep what look like conserved positions (columns) conserved and, all other things being equal, put differences into positions already known to be variable.
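Two of these rules of thumb (keep purines with purines and pyrimidines with pyrimidines, and prefer one long gap to several short ones) can be expressed as a scoring function for comparing candidate alignments of the same two sequences. This is a minimal sketch: the numerical weights are hypothetical values chosen for illustration, not the settings of Clustal or any other program.

```python
PURINES = set("GA")
PYRIMIDINES = set("CU")

# Hypothetical weights, chosen only to illustrate the rules of thumb.
MATCH, TRANSITION, TRANSVERSION = 2, -1, -2
GAP_OPEN, GAP_EXTEND = -4, -1


def score_pair(a: str, b: str) -> int:
    """Score two aligned rows; a higher score means a more plausible alignment."""
    if len(a) != len(b):
        raise ValueError("aligned rows must have the same length")
    score = 0
    gap_row = None  # which row (0 or 1) the current run of gaps is in
    for x, y in zip(a, b):
        if x == "-" and y == "-":
            continue  # column is empty in both rows; nothing to compare
        if x == "-" or y == "-":
            row = 0 if x == "-" else 1
            # Extending an existing gap is cheaper than opening a new one.
            score += GAP_EXTEND if gap_row == row else GAP_OPEN
            gap_row = row
            continue
        gap_row = None
        if x == y:
            score += MATCH
        elif {x, y} <= PURINES or {x, y} <= PYRIMIDINES:
            score += TRANSITION    # purine/purine or pyrimidine/pyrimidine
        else:
            score += TRANSVERSION  # purine aligned with pyrimidine
    return score


# Two ways of aligning the same pair of sequences:
print(score_pair("ACGUUACG", "ACG--ACG"))  # one gap of two positions
print(score_pair("ACGUUACG", "AC-G-ACG"))  # two separate gaps of one position
```

With these weights the single two-position gap scores 7 and the two separate gaps score 0, so the first alignment is preferred, in line with the advice above.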
In more extreme cases, you may need to use "structural superimposition". The ability to use well-defined secondary structures to identify homologous residues is one of the key advantages of RNA over protein for phylogenetic analysis. The idea is that structural conservation can be used to identify homologous bases (i.e. to align) even if the sequences are too different to align by sequence similarity. In other words, you can use the secondary structure of the RNA to identify homologous parts of the RNA, rather than relying only on sequence similarity.
This works because, in general, it doesn't matter (so much) to the RNA what the bases in the helices are; what matters is that they can form the correct secondary structure. As a result, the secondary structure of an RNA is much more conserved than its sequence, because co-evolution of bases that form base-pairs maintains the secondary structure as the sequence changes. Variation in the length of the RNA is usually in hairpin lengthening or shortening. So it's usually possible to keep track of homologous parts of an RNA structure even if the sequences are completely different!
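The idea that different sequences can share one secondary structure can be checked mechanically. The sketch below is an illustration only: it assumes the structure is written in dot-bracket notation (matching parentheses mark paired positions, dots mark unpaired ones), it accepts A-U, G-C, and G-U pairs, and the hairpin sequences are made up for the example.

```python
CAN_PAIR = {
    ("A", "U"), ("U", "A"),
    ("G", "C"), ("C", "G"),
    ("G", "U"), ("U", "G"),
}


def paired_columns(structure: str):
    """Return (i, j) index pairs for matching brackets in a dot-bracket string."""
    stack, pairs = [], []
    for i, ch in enumerate(structure):
        if ch == "(":
            stack.append(i)
        elif ch == ")":
            if not stack:
                raise ValueError("unbalanced structure")
            pairs.append((stack.pop(), i))
    if stack:
        raise ValueError("unbalanced structure")
    return sorted(pairs)


def fits_structure(seq: str, structure: str) -> bool:
    """True if every paired position in the structure can base-pair in seq."""
    if len(seq) != len(structure):
        raise ValueError("sequence and structure must have the same length")
    return all((seq[i], seq[j]) in CAN_PAIR for i, j in paired_columns(structure))


hairpin = "((((....))))"     # a 4-base-pair helix closed by a 4-base loop
seq_1 = "GGCAUUCGUGCC"       # hypothetical sequences
seq_2 = "CAUGGAAACAUG"       # differs from seq_1 at every position
seq_3 = "GGCAUUCGAGCC"       # one change that breaks a base-pair

for seq in (seq_1, seq_2, seq_3):
    print(seq, fits_structure(seq, hairpin))
```

Here `seq_1` and `seq_2` have no positions in common yet both fold into the same hairpin, so their helices and loops can still be lined up column by column; `seq_3` fails because the A at one helix position has lost its pairing partner.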
Questions for thought:
- What are some DNA sequences that would not be good for phylogenetic analysis? Why?
- What are some other sequences that would be good for phylogenetic analysis, and in what situations would these be good?
- How did people get large amounts of a specific DNA for sequencing before PCR was invented?
- In an episode of the X-files, Agent Scully sequences some alien DNA and finds 'missing bands' in the sequences that she interprets to correspond to bases that are unique to aliens (not found in earthling DNA). Why is this not technically feasible?
- Given the variation of sequences in the context of the same secondary structure, can you imagine how scientists solve these secondary structures by comparative sequence analysis?
- Mutations occur one at a time. Can you imagine, then, how the basepairs in a helix could change without disrupting the structure of the RNA? Does this explain (at least in part) why basepair changes that keep the purines (G or A) & pyrimidines (C or U) in the same positions are more common than those that switch them?
- Although RNA tertiary structures are rarely known, there are hundreds of protein 3D structures, determined by X-ray crystallography. Can you imagine a way to use these structures, analogous to the way RNA secondary structures are used, to align protein sequences more reliably?
Last updated April 03, 2009 by James W Brown