
Term project


The term project is due April 11th, in class at 11:20pm and is worth 100 points (20% of the final grade). Students are encouraged to help each other on the project, but must turn in only their own work.

The term project is a molecular phylogenetic analysis of an unknown species isolated in lab. Each student isolated at least a couple of organisms from their own samples, then attempted to amplify ssu-rDNA from two of these. Any reactions that contained an appropriate-sized PCR product (1 kbp) were sent to MWG BioTech for sequencing with 515F, one of the primers used to obtain the DNA by amplification. These sequences are the starting point for a molecular phylogenetic analysis.


The term project you turn in must contain the following items - THIS IS A CHECKLIST!:
  1. Your data (10 points)
  2. A database search with your sequence (15 points)
  3. A phylogenetic analysis of your sequence (40 points)
  4. The Writeup (35 points)

Part 1 : Your data

Microbiological data

For each of your isolates, put together a summary describing whatever you know about them microbiologically:

As always, the more details and information you can provide, the better. You will need all of this information at the end, to see if the phylotype of the organism(s) makes sense.

Where do these sequences come from?

Samples of each PCR reaction and some oligonucleotide primer (515FSHORT, a shorter version of the primer used in the PCR reaction) were sent to MWG BioTech (in High Point, NC) for sequencing. A few days later they started sending back the sequence data by email. The sequences were downloaded and are posted below for you.

Downloading your data

Key to the PCR reactions:

Rx    PCR   Seq       Rx    PCR   Seq       Name
61    +     +         62    +     +         Jess
50    +     +         51    -     x         Jo
31    +     +         32    -     x         Alex
66    +     -         67    +     +         Ellen
85    -     x         x     x     x         Sadaf
35    +     +         36    +     +         Kellie
65    +     +         64    +     +         Ocha
93    +     +         77    +     w         Michael
27    +     +         28    w     +         Sarah M
78    +     +         94    +     +         Josh
87    +     +         88    +     w         Nour
9     w     +         10    +     +         Jayme
80    +     +         90    +     +         Ginger
84    -     x         86    +     +         Daniel
3     w     +         4     +     +         Rebecca C
74    +     +         75    +     w         Tiffany
79    -     x         63    -     x         Sophia
33    -     x         34    -     x         Sarah K
72    +     +         73    +     +         Lisa
99    +     +         100   +     +         Chris
43    +     +         44    -     x         Donna
97    +     +         98    -     x         Robin
95    +     +         96    w     +         Mueez
37    +     +         38    +     +         Katie G
83    +     +         82    +     +         Ashley P
13    +     w         14    -     x         Erin
55    w     +         56    +     +         Katie F
21    w     -         22    +     +         Susan
7     +     +         8     +     +         Olu
39    +     Redo +    40    w     -         Farah
5     +     +         6     +     +         Jessica
15    +     +         16    +     -         Nicole
45    -     x         46    +     +         Rebecca B
25    w     Redo w    26    -     Redo w    Rob
23    +     +         24    -     x         Ryan J
47    +     +         48    +     +         Alyson
1     +     +         2     +     +         Carla
19    +     +         20    +     +         Sara
53    +     +         54    +     +         Adam
17    w     +         18    +     +         Ashley C
29    -     x         30    -     x         Katrina
11    +     +         12    w     +         Matt
41    -     x         42    +     +         Sam
52    +     +         57    +     +         Ryan T
49    -     x         x     x     x         Sarah K
58    +     Redo w    59    -     Redo w    Jen

Note: Most of the "w" sequencing data is from mixed templates, and so appears as two sequences superimposed. If one sequence is sufficiently stronger than the other, the data is still usable.

Gel images:

WgelA
WgelB
RgelA
RgelB
FgelA
FgelB
leftovergel


Sequencing data files:

Click here to go to a list of all of the sequencing data files, from where you can download your info.

Sequencing data is listed by your PCR reaction numbers. All start with "C_" (for "comfort read", a type of sequencing service) followed by the PCR reaction number. The filenames also include the primer (_515FSHORT) and file type suffix (.abi, .pdf or .fasta). All of our samples that contained a visible product of the right size were sent for sequencing, whether they seemed good enough to provide data or not.

If the sequencing of your sample was repeated to try to get better data, the repeat files have "-REDO" before the file type suffix; these should be used in place of the original sequence.
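
If it helps to see the naming scheme spelled out, here is a minimal Python sketch that builds the file names you would look for. The exact layout (for example `C_61_515FSHORT.fasta`) is an assumption pieced together from the description above, and the function name `expected_filenames` is made up for illustration; check the results against the actual file list.

```python
def expected_filenames(reaction, redo=False):
    """Build the file names expected for one PCR reaction number.

    Assumed layout (verify against the real file list):
    C_<reaction>_515FSHORT[-REDO].<suffix>
    """
    stem = f"C_{reaction}_515FSHORT"
    if redo:
        # A repeated sequencing run is used in place of the original.
        stem += "-REDO"
    return [f"{stem}.{suffix}" for suffix in ("abi", "pdf", "fasta")]


print(expected_filenames(61))
print(expected_filenames(39, redo=True))
```

The reaction numbers 61 and 39 are taken from the key above; substitute your own.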

Download your data files and save them with their .pdf or .fasta suffix. Get both the .pdf and the .fasta file for each of your reactions, whether they're good, bad, or ugly.

NOTE: If you wish, you can also download the original .abi file that contains these tracings in raw form. These can be viewed and manipulated in any of several free programs: 4Peaks (Mac; this is what I use), Chromas (PC), BioEdit (PC; also a great alignment editor), FinchTV (Mac or PC), or TracerView (Mac, PC, or various Unix flavors).

Examining your data

You can view your sequencing data by opening the .pdf files you downloaded. Look carefully at your data. How does it look? Here is an example section from the beginning of a good sequence:

good sequence

At the top is the sequence as the machine interprets it, from left to right, numbered just beneath. This example is from the start of the sequence; notice the sequence numbering "10", then "20" below the printed sequence. Below both the interpreted sequence and numbering is the raw data from the sequencing machine.

Some sequences don't start off this cleanly - the sequence only becomes clear after a few bases.

The sequence reads directly from the printout. Hopefully the first 500 bases of sequence (after perhaps a dozen or so if it has a rough start) should be reliable. Somewhere between 500 and 800, the sequence quality will degrade to the point of unreliability.

If your sequence comes from more than one template, i.e. your culture wasn't pure or the PCR reaction was contaminated, you will have sequences in which some peaks look good (if both sequences have the same base at that position) and some are two peaks in the same place (where the two sequences differ):

mixed sequence data

If one of the sequences is much stronger than the other, this is no problem; the extra peak will be small compared to the main peak, and the machine can correctly read the stronger sequence. If they are close to the same strength, the machine will not correctly read either sequence. If the two sequences are from very closely related organisms, these double peaks may be sporadic and concentrated in the most variable regions of the rRNA. If they are from distantly related organisms, the double peaks will be more common: as soon as the two sequences differ in length (an insertion/deletion relative to each other), they will be out of sync and most of the peaks will be twinned.
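
To see why a single insertion or deletion is so much worse than a few substitutions, here is a small Python sketch. It compares a made-up 40-base sequence (not real data) position by position against two variants: one with two substitutions, and one with a single extra base. The function name `mismatch_fraction` is just for illustration, and real traces are of course messier than this.

```python
def mismatch_fraction(a, b):
    """Fraction of positions (over the shared length) where a and b differ."""
    n = min(len(a), len(b))
    return sum(1 for i in range(n) if a[i] != b[i]) / n


# Made-up 40-base sequence, for illustration only.
template = "ACGTTGCAAGCTAGGCTTACGATCCGTAAGCTTGCAGATC"

# Close relative: same length, substitutions at positions 10 and 25.
substituted = template[:10] + "T" + template[11:25] + "A" + template[26:]

# Relative with one extra base inserted at position 10; everything
# downstream is shifted by one and no longer lines up.
inserted = template[:10] + "G" + template[10:]

print("substitutions only:", mismatch_fraction(template, substituted))
print("one insertion:     ", mismatch_fraction(template, inserted))
```

With substitutions only, the two templates disagree at just the substituted positions. With the insertion, most positions downstream of it disagree, which is what shows up on the trace as double peaks nearly everywhere.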

Print out a copy of your data (the .pdf file); you'll need this to turn in with your Term Project.

Now open the .fasta file in a text editor (Notepad, Word, TextEdit, whatever), and print it out. This is the part of your sequence that the computer program in the sequencing machine has filtered and thinks is reliable. This is the sequence you'll actually use for your analysis. Go back to the printout of the .pdf of your data, and highlight the region of this sequence that is in the .fasta file.
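
If you'd rather check the .fasta file with a script than by eye, here is a minimal Python sketch of a FASTA reader. It assumes the usual FASTA layout (a header line starting with ">" followed by one or more lines of sequence); the example record is made up, and `read_fasta` is a hypothetical helper, not part of any sequencing software.

```python
def read_fasta(text):
    """Parse FASTA-formatted text into a list of (header, sequence) pairs."""
    records = []
    header, chunks = None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # A new header closes the previous record, if there was one.
            if header is not None:
                records.append((header, "".join(chunks)))
            header, chunks = line[1:], []
        elif header is not None:
            chunks.append(line.upper())
    if header is not None:
        records.append((header, "".join(chunks)))
    return records


# Made-up example record, not real data.
example = """>C_61_515FSHORT (made-up example)
ACGTTGCAAGCTAGGCTTAC
GATCCGTAAGNTTGCAGATC
"""

for name, seq in read_fasta(example):
    print(name, "-", len(seq), "bases,", seq.count("N"), "ambiguous (N)")
```

To use it on your own data, read the file's contents into a string first (for example with `open(path).read()`) and pass that to `read_fasta`.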

Be sure to open and look at (and print out) the data for both of your PCR reactions.

Decision time

If either of your sequences is good, that's great. You may even have two good sequences - if so, use them both. If you have a sequence from a mixed template, use it only if it looks pretty good and if you don't have a clean sequence you can use.

No usable sequence data?
bad gel

Some of you (only a few) did not get a PCR product from either reaction after purification. Others with PCR products will have failed to get good sequence data. If neither of your sequences yielded usable data, and you have a friend in the class who has two good sequences, then your best bet is to ask them if you can use one of their sequences; this way they do one and you do the other. Otherwise, I'll poll the class & get someone to provide a sequence number and microbiological data for you to use. Please let me know either way as soon as possible.


Part 2 : Searching the Ribosomal Database Project (RDP) for related sequences

Your next task is to perform a search of the Ribosomal Database Project with your sequence. This will give you a good idea of what kind of organism your sequence might come from.

Logging-in to the RDP web site

The URL of the RDP web site is: http://rdp.cme.msu.edu/

Click on the web address above to go there. This link will open the RDP web page in a new browser window, so you can go back-and-forth between the RDP site and these directions.

Loading your sequence into the RDP

  1. On the RDP web site, click on the link for "myRDP". This takes you to a myRDP login page. You don't need an account to use this, however; just click on the "Test Drive" button. Now you're on the myRDP overview page. This page lists all of the public user data.
  2. Click on the "Upload" button. On the upload page, use these settings:
  3. Now click "upload". If there's a problem with your sequence, it'll let you know & return you to the Upload page. If it looks OK, it'll tell you there's 1 sequence in the file & ask if you want to load it - click "Continue".
  4. If you have 2 good sequences, repeat this process with the other sequence as well.
  5. Your sequences should now appear at the top on the myRDP Overview page list. While you're doing other things, it will align your sequence(s) to the database; when it's done, the "1" will move from the "pending" column to the "A" (aligned) column.
  6. Click on the grey "+" box in front of your sequence listing(s). They should now be red "-" boxes. This adds your sequences to your working list.

Identify the sequences in RDP that are most similar to yours using "Sequence Match".

  1. Now, click on the link to "Sequence Match" in the menubar at the top of the page.
  2. Scroll to the bottom of the page, & use the following settings:
  3. Click on the "Do Seqmatch with Selected Sequence" button (not the "Submit" button!) and wait for the results - usually less than a minute.
  4. Look at the "Hierarchy View" - this gives you the taxonomy (lineage) of the sequence(s) as the RDP sees it; Domain, Phylum, &c, &c, down usually to the genus (depending on how closely related your sequence is to something in the database).
  5. Click "Show Printer Friendly Results" to see the details. There will be a list of the 20 best matches in the database, and the similarity of these sequences (S_ab) is shown in orange (the similarity score in purple will probably not be calculated). S_ab is a complex similarity score, but 2 identical sequences will have a score of 1.0, and the closer the score is to 1.0, the more similar the sequences are.
  6. Once you have an informative lineage, print out this page.
  7. Look through the resulting sequence list and find the best match (highest S_ab or similarity scores), and click on the number in front of it (it should look something like "S000463918") to pull up its sequence record. Print out this page. If you have ties, print them all out.
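
For the curious, here is a rough Python sketch of how a score like S_ab can be computed. It assumes the definition usually given for RDP's Sequence Match: the number of unique 7-base "words" shared by the two sequences, divided by the number of unique words in whichever sequence has fewer. Treat this as an illustration of the idea, not the RDP's actual code; the function names and the test sequences are made up.

```python
def unique_kmers(seq, k=7):
    """Return the set of distinct k-base words in seq."""
    seq = seq.upper()
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def s_ab(seq_a, seq_b, k=7):
    """Shared unique k-mers divided by the smaller unique k-mer count."""
    words_a = unique_kmers(seq_a, k)
    words_b = unique_kmers(seq_b, k)
    smaller = min(len(words_a), len(words_b))
    if smaller == 0:
        raise ValueError("both sequences must be at least k bases long")
    return len(words_a & words_b) / smaller


# Made-up sequences, for illustration only.
query = "ACGTTGCAAGCTAGGCTTACGATCCGTAAGCTTGCAGATC"
variant = query[:20] + "T" + query[21:]  # one substitution at position 20

print(s_ab(query, query))    # identical sequences score 1.0
print(s_ab(query, variant))  # one substitution knocks out several shared words
```

Notice that a single substitution changes every word that overlaps it, so the score drops faster than the percent identity would.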

Estimate the taxonomy of your sequence using "Classifier".

  1. Click the link to "Classifier" in the menubar at the top of the page.
  2. Click the "Do Classification with Selected Sequences" button (not the "Submit" button!) and wait for the results - it should just be a few seconds.
  3. Look at the "Hierarchy View" - this gives you the taxonomy (lineage) of the sequence according to this analysis. This should look a lot like the results for the "Sequence Match".
  4. Change the "Confidence Threshold" to its lowest level - 50% - and see if this changes the result (it usually doesn't).
  5. Print out this page.

Critical reminder! Remember that what you have identified is the closest relative of your isolate whose 16S rRNA sequence is available in the RDP. You have not identified your isolate unless it is a perfect match - and even then you can't be sure!


Part 3 : Constructing a phylogenetic tree

The next step is to generate an informative phylogenetic tree containing your sequence(s) using the RDP "Tree" function. This involves selecting a series of sequences to include, generating a tree, then looking at the result so that you can go back & select additional sequences to include. Once you've done this back and forth a few times, you'll end up with a nice tree that displays the relationship between your sequence(s) and other organisms.

Is your sequence aligned yet?

  1. Click on the myRDP link in the menubar at the top of the page to bring up the myRDP Overview page.
  2. Your sequences (they may no longer be at the top of the page) should still be "selected" (with a red "-" box in front of them). By now, the alignment process should be complete and there should be a "1" in the "A" column. If not, be patient; it can take a while if the server is busy. You can even come back tomorrow, but if you do you will have to re-select your sequences.

Constructing a Phylum-level tree

  1. Click on the "Browser" link in the menubar at the top of the page. This takes you to the Taxonomic Browser, from which you can select the sequences you want to include in your tree. At the top level, each of the major phylogenetic branches of Bacteria ("Phyla") is listed. Notice that only bacterial sequences are included in the RDP; no Eukarya, organelles, or Archaea. Notice the numbers after each name; these tell you the number of sequences in that group that you have selected so far, the total number of sequences in that group, and the number of search matches (we don't use this), respectively.
  2. Scroll to the bottom of the page, and if necessary change the settings to:
  3. Start with the Phylum Aquificae: this will be your outgroup. Click on the Aquificae name for a list of the sequences in this group in the database. Select a genus: Aquifex is a good choice, since we talked about it in class. In this case, there is only one sequence in this genus that matches your criteria: Aquifex pyrophilus. Check the box in front of the name to add it to your working list.
  4. Click on the "Bacteria" link on the Lineage at the top of the page to go back to the top of the hierarchy.
  5. Repeat this process with at least the major phyla we're talking about in class:

    Thermotogae, Deinococcus-Thermus, Chloroflexi, Cyanobacteria,
    Chlorobi, alpha-proteobacteria, beta-proteobacteria, gamma-proteobacteria,
    delta-proteobacteria, epsilon-proteobacteria,
    Firmicutes (low G+C Gram-positives),
    Actinobacteria (high G+C Gram-positives),
    Planctomycetes, Chlamydiae, Spirochaetes, Bacteroidetes

    Feel free to choose the representatives you like, but it'll be easier for you to keep track of and interpret your trees if you try to choose familiar organisms, e.g. Escherichia coli for the gamma-proteobacteria.
  6. When you're finished, click on the Tree link in the menubar at the top of the page. Change the outgroup to Aquifex pyrophilus (or whatever sequence you chose from this group), then click "CREATE TREE".
  7. If all goes well, the tree will be displayed after a few seconds. Use the commands shown above the tree to adjust the tree to your liking, and print out a copy.

Here is an example of what the tree might look like:

tree1

Look over your tree - does it make sense based on what the Sequence Match and Classifier results were? Before you turn in your project, highlight your sequence(s) in the tree, and label each of the phyla.

Constructing your final tree

  1. Go back to your Sequence Match results summary page, and click the "view selectable matches" link after your sequence name. The best matches are listed taxonomically - select representatives from this list to include in your tree. Make sure to get the single best match, some good matches in that same group, and representatives from the other groups listed. Keep in mind that the Tree program has a limit of 50 sequences, and it gets slower & slower as the number of sequences increases.
  2. Click "Save selection and return to summary" at the top of the page.
  3. Repeat this with your other sequence, if you have two.
  4. Click on the "Tree" link at the top of the page. Your old tree will still be here; click "Start over" to generate a new tree with your new working list. Be sure to use the right outgroup!
  5. If all goes well, the tree will be displayed after a few seconds. Now, have a close look at this tree; does it make sense? What could you do to make your tree more informative? For example, if the phylum-level branches are all very deep, and your close relatives all very shallow, might it not be nice to add some representatives into the gap between? What about adding a second representative to each phylum branch, as distant a relative as possible, to "flesh out" each of these branches? Are there branches on your tree that just don't need to be included? Have you picked the right outgroup?
  6. When you've got the tree 'populated' the way you want, use the commands above the tree to adjust it to your liking, and print out a copy. Highlight your sequence(s) in the tree, and label each phylogenetic group.

Here is an example of what your tree might look like before you label it:

tree2


Part 4 : The Writeup

In the last part of the project, you need to organize and interpret your results, and draw some kind of conclusion from them, in a written report.

If you got significant help from another student on the computer-ology of this project, please include a note in your write-up telling me who helped, so I can give some trivial token of appreciation to that student, in the form of extra points on their Term Project.


Last updated March 30, 2008 by James W Brown | Department of Microbiology | College of Ag and Life Sciences | NC State University