Term project
The term project is due April 11th, in class at 11:20pm and is worth 100 points (20% of the final grade). Students are encouraged to help each other on the project, but must turn in only their own work.
The term project is a molecular phylogenetic analysis of an unknown species isolated in lab. Each student has isolated at least a couple of organisms from their own samples, then attempted to amplify ssu-rDNA from two of these. Any of these that contained an appropriate-sized PCR product (1 kbp) were sent to MWG BioTech for sequencing with 515F, one of the primers used to obtain the DNA by amplification. These sequences are the starting point for a molecular phylogenetic analysis.
Part 1 : Your data
Microbiological data
For each of your isolates, put together a summary describing whatever you know about them microbiologically:
As always, the more details and information you can provide, the better. You will need all of this information at the end, to see if the phylotype of the organism(s) makes sense.
Where do these sequences come from?
Samples of each PCR reaction and some oligonucleotide primer (515FSHORT, a shorter version of the primer used in the PCR reaction) were sent to MWG BioTech in High Point, NC, for sequencing. A few days later they started sending back the sequence data by email. The sequences were downloaded and posted below for you.
Downloading your data
Key to the PCR reactions:
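| Rx | PCR | Seq | Rx | PCR | Seq | Name |
| 61 | + | + | 62 | + | + | Jess |
| 50 | + | + | 51 | - | x | Jo |
| 31 | + | + | 32 | - | x | Alex |
| 66 | + | - | 67 | + | + | Ellen |
| 85 | - | x | x | x | x | Sadaf |
| 35 | + | + | 36 | + | + | Kellie |
| 65 | + | + | 64 | + | + | Ocha |
| 93 | + | + | 77 | + | w | Michael |
| 27 | + | + | 28 | w | + | Sarah M |
| 78 | + | + | 94 | + | + | Josh |
| 87 | + | + | 88 | + | w | Nour |
| 9 | w | + | 10 | + | + | Jayme |
| 80 | + | + | 90 | + | + | Ginger |
| 84 | - | x | 86 | + | + | Daniel |
| 3 | w | + | 4 | + | + | Rebecca C |
| 74 | + | + | 75 | + | w | Tiffany |
| 79 | - | x | 63 | - | x | Sophia |
| 33 | - | x | 34 | - | x | Sarah K |
| 72 | + | + | 73 | + | + | Lisa |
| 99 | + | + | 100 | + | + | Chris |
| 43 | + | + | 44 | - | x | Donna |
| 97 | + | + | 98 | - | x | Robin |
| 95 | + | + | 96 | w | + | Mueez |
| 37 | + | + | 38 | + | + | Katie G |
| 83 | + | + | 82 | + | + | Ashley P |
| 13 | + | w | 14 | - | x | Erin |
| 55 | w | + | 56 | + | + | Katie F |
| 21 | w | - | 22 | + | + | Susan |
| 7 | + | + | 8 | + | + | Olu |
| 39 | + | Redo + | 40 | w | - | Farah |
| 5 | + | + | 6 | + | + | Jessica |
| 15 | + | + | 16 | + | - | Nicole |
| 45 | - | x | 46 | + | + | Rebecca B |
| 25 | w | Redo w | 26 | - | Redo w | Rob |
| 23 | + | + | 24 | - | x | Ryan J |
| 47 | + | + | 48 | + | + | Alyson |
| 1 | + | + | 2 | + | + | Carla |
| 19 | + | + | 20 | + | + | Sara |
| 53 | + | + | 54 | + | + | Adam |
| 17 | w | + | 18 | + | + | Ashley C |
| 29 | - | x | 30 | - | x | Katrina |
| 11 | + | + | 12 | w | + | Matt |
| 41 | - | x | 42 | + | + | Sam |
| 52 | + | + | 57 | + | + | Ryan T |
| 49 | - | x | x | x | x | Sarah K |
| 58 | + | Redo w | 59 | - | Redo w | Jen |
Note: Most of the "w" sequencing data is from mixed templates, and so they appear as two sequences superimposed. If one sequence is sufficiently stronger than the other, this data is still usable.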
Gel images:
Sequencing data files: Click here to go to a list of all of the sequencing data files, from where you can download your info.
Sequencing data is listed by your PCR reaction numbers. All start with "C_" (for "comfort read", a type of sequencing service) followed by the PCR reaction number. The filenames also include the primer (_515FSHORT) and file type suffix (.abi, .pdf or .fasta). All of our samples that contained a visible product of the right size were sent for sequencing, whether they seemed good enough to provide data or not.
If the sequencing of your sample was repeated to try to get better data, the repeat files have "-REDO" before the file type suffix; these should be used in place of the original sequence.
Download your data files and save them with their .pdf or .fasta suffix. Get both the .pdf and the .fasta file for each of your reactions, whether they're good, bad, or ugly.
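The naming scheme described above can be sketched in a few lines of Python. This is only an illustration: it assumes the pieces are joined exactly in the order described ("C_", reaction number, "_515FSHORT", optional "-REDO", suffix), and the function name `data_filenames` is made up for this example. The actual names on the download page may differ in detail, so check them against the file list.

```python
def data_filenames(reaction, redo=False, suffixes=(".pdf", ".fasta")):
    """Build the expected data file names for one PCR reaction number.

    Assumed pattern (from the description, not verified against the
    real file list): C_<reaction>_515FSHORT[-REDO]<suffix>
    """
    base = f"C_{reaction}_515FSHORT"
    if redo:
        # Repeat sequencing runs carry "-REDO" before the suffix.
        base += "-REDO"
    return [base + suffix for suffix in suffixes]


# Reaction 61, sequenced once.
print(data_filenames(61))
# Reaction 39, where the sequencing was repeated.
print(data_filenames(39, redo=True))
```

If a "-REDO" file exists for a reaction, that is the one to use for the rest of the project.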
NOTE: If you wish, you can also download the original .abi file that contains these tracings in raw form. These can be viewed and manipulated in any of several free programs: 4Peaks (Mac - this is what I use), Chromas (PC), BioEdit (PC - this is also a great alignment editor), FinchTV (Mac or PC), or TracerView (Mac, PC, or various Unix flavors).
Examining your data
You can view your sequencing data by opening the .pdf files you downloaded. Look carefully at your data. How does it look? Here is an example section from the beginning of a good sequence:

At the top is the sequence as the machine interprets it, from left to right, numbered just beneath. This example is from the start of the sequence - notice the sequence numbering "10", then "20" below the printed sequence. Below both the interpreted sequence and numbering is the raw data from the sequencing machine.
Some sequences don't start off this cleanly - the sequence only becomes clear after a few bases.
The sequence reads directly from the printout. Hopefully the first 500 bases of sequence (after perhaps a dozen or so if it has a rough start) should be reliable. Somewhere between 500 and 800, the sequence quality will degrade to the point of unreliability.
If your sequence comes from more than one template, i.e. your culture wasn't pure or the PCR reaction was contaminated, you will have sequences in which some peaks look good (if both sequences have the same base at that position) and some are two peaks in the same place (where the two sequences differ):

If one of the sequences is much stronger than the other, this is no problem; the extra peak will be small compared to the main peak, and the machine can correctly read the stronger sequence. If they are close to the same strength, the machine will not correctly read either sequence. If the two sequences are from very closely related organisms, these double peaks may be sporadic, and concentrated in the most variable regions of the rRNA. If they are from distantly related organisms, the double peaks will be more common: as soon as the two sequences have a difference in length (an insertion/deletion relative to each other), they will be out of sync and most of the peaks will be twinned.
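To see why a single insertion/deletion does so much more damage than a single base change, here is a small Python sketch. The sequence is made up for illustration (it is not real rRNA), and the function name `mixed_fraction` is hypothetical. The sketch simply lays two reads on top of each other, position by position, and counts how often they disagree, which is roughly where a mixed trace would show double peaks.

```python
def mixed_fraction(seq_a, seq_b):
    """Fraction of superimposed positions where the two reads differ.

    Each differing position stands in for a double peak in the trace.
    Only the overlapping length of the two reads is compared.
    """
    n = min(len(seq_a), len(seq_b))
    if n == 0:
        return 0.0
    doubled = sum(1 for a, b in zip(seq_a, seq_b) if a != b)
    return doubled / n


# Made-up 45-base template, for illustration only.
template = "ACGTTGCAAGCTTAGGCTAACGTAGCTAGGATCCGATTACAGGCT"

# "Close relative": one base changed (C -> T at index 20), same length.
substituted = template[:20] + "T" + template[21:]

# "Length difference": one extra base near the start shifts everything after it.
inserted = template[:5] + "G" + template[5:]

print("one substitution:", mixed_fraction(template, substituted))
print("one insertion:   ", mixed_fraction(template, inserted))
```

With the substitution, only one position disagrees. With the insertion, every downstream base is compared against its neighbor, so most positions after the insertion disagree.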
Print out a copy of your data (the .pdf file); you'll need this to turn in with your Term Project.
Now open the .fasta file in a text editor (Notepad, Word, TextEdit, whatever), and print it out. This is the part of your sequence that the computer program in the sequencing machine has filtered and thinks is reliable. This is the sequence you'll actually use for your analysis. Go back to the printout of the .pdf of your data, and highlight the region of this sequence that is in the .fasta file.
Be sure to open and look at (and print out) the data for both of your PCR reactions.
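If you would rather check the .fasta file with a script than by eye, the following Python sketch reads a FASTA file and reports the length of each sequence. It assumes the standard FASTA layout (a header line starting with ">" followed by one or more lines of sequence); the function names `read_fasta` and `summarize` are made up for this example, and counting "N" as the ambiguous-base symbol is an assumption about how the sequencing service writes uncertain calls.

```python
def read_fasta(path):
    """Parse a FASTA file into a list of (header, sequence) pairs."""
    records = []
    header, chunks = None, []
    with open(path) as handle:
        for line in handle:
            line = line.strip()
            if not line:
                continue  # skip blank lines
            if line.startswith(">"):
                # A new header closes the previous record, if any.
                if header is not None:
                    records.append((header, "".join(chunks)))
                header, chunks = line[1:], []
            elif header is not None:
                # Sequence lines are joined into one string per record.
                chunks.append(line)
    if header is not None:
        records.append((header, "".join(chunks)))
    return records


def summarize(records):
    """Return one summary line per record: name, length, count of N's."""
    lines = []
    for header, seq in records:
        n_count = seq.upper().count("N")
        lines.append(f"{header}: {len(seq)} bases, {n_count} N")
    return lines
```

For example, `print("\n".join(summarize(read_fasta("my_sequence.fasta"))))` would list each sequence in the file, where "my_sequence.fasta" is a placeholder for whatever name you saved your own file under. The length it reports should match the region you highlighted on the .pdf printout.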
Decision time
If either of your sequences is good, that's great. You may even have two good sequences - if so, use them both. If you have a sequence from a mixed template, use it only if it looks pretty good and if you don't have a clean sequence you can use.
No usable sequence data?

Some of you (only a few) did not get a PCR product from either reaction after purification. Others with PCR products will have failed to get good sequence data. If neither of your sequences yielded usable data, and if you have a friend in the class who has two good sequences, then your best bet is to ask them if you can use one of their sequences - this way they do one and you do the other. Otherwise, I'll poll the class & get someone to provide a sequence number and microbiological data for you to use. Please let me know either way as soon as possible.
Part 2 : Searching the Ribosomal Database Project (RDP) for related sequences
Your next task is to perform a search of the Ribosomal Database Project with your sequence. This will give you a good idea of what kind of organism your sequence might come from.
Logging in to the RDP web site
The URL of the RDP web site is: http://rdp.cme.msu.edu/
Click on the web address above to go there. This link will open the RDP web page in a new browser window, so you can go back-and-forth between the RDP site and these directions.
Loading your sequence into the RDP
Identify the sequences in RDP that are most similar to yours using "Sequence Match".
Estimate the taxonomy of your sequence using "Classifier".
Critical reminder! Remember that what you have identified is the closest relative of your isolate whose 16S rRNA sequence is available in the RDP. You have not identified your isolate unless it is a perfect match - and even then you can't be sure!
Part 3 : Constructing a phylogenetic tree
The next step is to generate an informative phylogenetic tree containing your sequence(s) using the RDP "Tree" function. This involves selecting a series of sequences to include, generating a tree, then looking at the result so that you can go back & select additional sequences to include. Once you've done this back and forth a few times, you'll end up with a nice tree that displays the relationship between your sequence(s) and other organisms.
Is your sequence aligned yet?
Constructing a Phylum-level tree
| Thermotogae | Deinococcus-Thermus | Chloroflexi | Cyanobacteria |
| Chlorobi | alpha-proteobacteria | beta-proteobacteria | gamma-proteobacteria |
| delta-proteobacteria | epsilon-proteobacteria | Firmicutes (low G+C Gram-positives) | Actinobacteria (high G+C Gram-positives) |
| Planctomycetes | Chlamydiae | Spirochaetes | Bacteroidetes |
Here is an example of what the tree might look like:
Look over your tree - does it make sense based on what the Sequence Match and Classifier results were? Before you turn in your project, highlight your sequence(s) in the tree, and label each of the phyla.
Constructing your final tree
Here is an example of what your tree might look like before you label it:
In the last part of the project, you need to organize and interpret your results, and draw some kind of conclusion from them, in a written report.
If you got significant help from another student on the computer-ology of this project, please include a note in your write-up telling me who helped, so I can give some trivial token of appreciation to that student, in the form of extra points on their Term Project.