Understanding the Evolution of the Eukaryotic Cell: The Endosymbiotic Theory


Using the BLAST Tool to Find Homologous (Similar) Sequences
The M. thermoautotrophicum rpoK (RPOK_METTH) sequence should be listed on the Protein Tools homepage. We now want to determine whether this archaebacterial sequence is more closely related to eukaryotic or to "true" bacterial RNA polymerase sequences. To do this, you need to employ other tools available in the Workbench. To search for sequences similar or homologous to the M. thermoautotrophicum rpoK sequence, you are going to use a tool called BLAST.

1. In the Protein Tools homepage, check the box next to the RPOK_METTH sequence.

2. Scroll down the tool menu and select "BLASTP - Compare a PS to a PS DB" (BLASTP is an abbreviation for "Compare a Protein Sequence to a Protein Sequence Database") and click "Run".



On the next page, we must select the database(s) to be searched. You want to search both eukaryotic and prokaryotic (bacteria and archaebacteria) gene databases for sequences homologous to the M. thermoautotrophicum rpoK sequence.

3. Select the "SwissProt" database by clicking on it. All of the remaining options on the page allow you to fine-tune your search, but that won't be necessary, so simply scroll to the bottom of the page and click "Submit".





A screen with the results of your BLAST search will appear. Once again, you will see a lot of letters and numbers and have no idea what they mean. Also, notice that each search result is assigned an "E value" - what does this mean?
The E value, or "Expect" value, is the most convenient way to rank the results of a search. It estimates the statistical significance of a search result by giving the number of matches with a given score that would be expected to occur purely by chance in a search of a database of a particular size. For example, an Expect value of 2 indicates that two matches with that particular score would be expected to occur purely by chance. It follows that very small E values, written with negative exponents (for example 1e-20, meaning 10^-20), reflect a high degree of confidence that the observed sequence similarity is real, whereas search results with E values much higher than 0.1 are unlikely to reflect true sequence relatives. Essentially, the smaller the E value, the less likely it is that the similarity to the sequence you BLASTed arose by chance. An E value reported as zero indicates that essentially no matches would be expected by chance; this represents a perfect or near-perfect match.
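
To make the relationship between score and E value concrete, here is a minimal Python sketch. It assumes the standard simplified relation E = m * n * 2^(-S'), where m is the query length, n is the total number of residues in the database, and S' is the bit score of the match. The function name and the example numbers are illustrative assumptions; they are not taken from the Workbench, and the real BLAST program applies additional length corrections.

```python
def expect_value(bit_score: float, query_len: int, db_len: int) -> float:
    """Expected number of chance matches scoring at least `bit_score`.

    Simplified relation: E = m * n * 2**(-S'), with m the query length,
    n the database length in residues, and S' the bit score.
    """
    return query_len * db_len * 2.0 ** (-bit_score)


# Hypothetical sizes, chosen only to show the trend (not real database figures).
query_len = 100           # residues in the query sequence (assumed)
db_len = 50_000_000       # residues in the searched database (assumed)

for bits in (20, 40, 80):
    e = expect_value(bits, query_len, db_len)
    print(f"bit score {bits:>2}: E = {e:.2e}")
```

Each additional bit of score halves the E value, which is why strong matches end up with E values carrying large negative exponents. The same score also becomes less significant as the database grows, because n appears as a multiplier.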

4. To see more information about a sequence, select it and click on "Show Record(s)".



Scroll down the page and look at what it says next to "Organism Classification". You should notice that most, if not all, of the sequences that show high homology to the archaebacterial M. thermoautotrophicum rpoK sequence (in other words, sequences whose E values have negative exponents) are either from other archaebacteria (as would be expected) or from eukaryotes. (Note: Some viral sequences may also be pulled up, but because viruses cannot replicate on their own and are not considered living organisms, we will ignore them.) These data indicate that the M. thermoautotrophicum rpoK sequence is more closely related to eukaryotic RNA polymerase rpoK sequences than to bacterial rpoK sequences (indeed, at the time of writing this tutorial, no (eu)bacterial sequences were retrieved by the search engine).
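
If you wanted to do this tally by script rather than by eye, it could look something like the sketch below. It assumes that each "Organism Classification" entry begins with the domain name followed by a semicolon, as SwissProt classification lines do. The hit identifiers, E values and the 1e-3 cutoff are made-up placeholders for illustration, not actual results from the search.

```python
from collections import Counter

DOMAINS = ("Eukaryota", "Archaea", "Bacteria", "Viruses")


def domain_of(classification: str) -> str:
    """Return the top-level taxon of an 'Organism Classification' string."""
    first = classification.split(";")[0].strip()
    return first if first in DOMAINS else "Unknown"


def tally_domains(hits, max_e=1e-3):
    """Count hits per domain, keeping only hits with E value <= max_e."""
    return Counter(domain_of(oc) for _name, e, oc in hits if e <= max_e)


# Placeholder records (name, E value, classification): NOT real search output.
hits = [
    ("HIT_A", 2e-30, "Archaea; Euryarchaeota; Methanobacteria"),
    ("HIT_B", 5e-12, "Eukaryota; Metazoa; Arthropoda"),
    ("HIT_C", 3e-09, "Eukaryota; Fungi; Ascomycota"),
    ("HIT_D", 4.2, "Bacteria; Proteobacteria"),
]

for domain, count in tally_domains(hits).items():
    print(f"{domain}: {count}")
```

In this toy list the bacterial entry is dropped because its E value is far above the cutoff, which mirrors the way you ignore weak matches when reading the results page.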


Using the CLUSTALW Tool to Align Two Sequences for Comparison
We are now going to determine just how similar the M. thermoautotrophicum rpoK sequence is to a eukaryotic rpoK sequence using a tool called CLUSTALW. CLUSTALW aligns two or more sequences one on top of the other so that it is possible to see where and how they differ. The alignment is built by comparing the sequences and finding regions they have in common; the Biology Workbench then uses an algorithm to compute the most likely position in which the sequences line up. A color-coding system distinguishes highly conserved and semi-conserved regions (in royal blue and green, respectively) from non-conserved regions. On to the alignment.
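
To give a feel for what "computing the position in which two sequences line up" involves, here is a minimal Python sketch of a classic pairwise global alignment (the Needleman-Wunsch method) with a flat gap penalty. This is not the algorithm CLUSTALW itself runs, which is more elaborate and handles many sequences at once. The scoring values and the two short toy sequences are assumptions chosen for illustration.

```python
def global_align(a: str, b: str, match=1, mismatch=-1, gap=-2):
    """Needleman-Wunsch global alignment; returns (row_a, row_b, score)."""
    rows, cols = len(a) + 1, len(b) + 1
    score = [[0] * cols for _ in range(rows)]
    for i in range(1, rows):              # leading gaps in b
        score[i][0] = i * gap
    for j in range(1, cols):              # leading gaps in a
        score[0][j] = j * gap

    def sub(i, j):                        # score for pairing a[i-1] with b[j-1]
        return match if a[i - 1] == b[j - 1] else mismatch

    # Fill the table: each cell holds the best score for the two prefixes.
    for i in range(1, rows):
        for j in range(1, cols):
            score[i][j] = max(score[i - 1][j - 1] + sub(i, j),
                              score[i - 1][j] + gap,
                              score[i][j - 1] + gap)

    # Trace back from the bottom-right corner to recover one best alignment.
    top, bottom = [], []
    i, j = len(a), len(b)
    while i > 0 or j > 0:
        if i > 0 and j > 0 and score[i][j] == score[i - 1][j - 1] + sub(i, j):
            top.append(a[i - 1]); bottom.append(b[j - 1]); i -= 1; j -= 1
        elif i > 0 and score[i][j] == score[i - 1][j] + gap:
            top.append(a[i - 1]); bottom.append("-"); i -= 1
        else:
            top.append("-"); bottom.append(b[j - 1]); j -= 1
    return "".join(reversed(top)), "".join(reversed(bottom)), score[-1][-1]


# Toy sequences, not real protein data.
row_a, row_b, total = global_align("MKVLAT", "MKLAT")
print(row_a)
print(row_b)
print("score:", total)
```

The dashes in the output play the same role as the dashes you will see in the CLUSTALW alignment: they mark positions where one sequence has no counterpart in the other.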

1. Click on the "Back" button of your browser to return to the Search Results window. (If you are working in Netscape, the Records may appear in a separate window, in which case you simply need to close that window to return to the Search Results page). Scroll to the bottom of the Results page and click on "Show Sequence(s)".



2. In the Sequences window, click on the box next to the "RPB6_DROME" sequence (the RPB6_DROME sequence is from the eukaryote Drosophila melanogaster, the common fruit fly). Make sure that RPB6_DROME is the only sequence selected, then scroll down to the bottom of the page and click on the "Import Sequence(s)" button.





You should now be back at the Protein Tools homepage; if not, click on the "Return" button in the window that you are sent to. Both the M. thermoautotrophicum (RPOK_METTH) and the D. melanogaster rpoK (RPB6_DROME) sequences should be listed on the Protein Tools homepage.

3. Select both the RPOK_METTH and RPB6_DROME sequences by clicking on the boxes next to them and highlight "CLUSTALW - Multiple Sequence Alignment" in the tool menu. Click on "Run".



4. On the next page click on "Submit" (once again, it is not necessary to mess with the default settings on this page)



Now we have the alignment on the screen...



Note: The RPB6_DROME sequence we used in this exercise is a bit longer than the M. thermoautotrophicum rpoK sequence. Consequently, the alignment has dashes in the bottom sequence for the first part of the alignment. This is not a problem because, in the areas where there is overlap, there is very high homology, which we can tell from the colors shown (royal blue represents fully conserved or identical residues; green represents highly conserved residues, in other words, amino acids with similar chemical properties). As you can see, there is a high degree of homology between the D. melanogaster and M. thermoautotrophicum rpoK sequences.
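
One simple way to put a number on "high homology in the areas where there is overlap" is percent identity, counted only over columns where neither sequence has a dash. The Python sketch below shows the idea. The function name and the toy aligned rows are assumptions for illustration and are not the actual RPB6_DROME or RPOK_METTH alignment.

```python
def percent_identity(row_a: str, row_b: str) -> float:
    """Percent of identical residues over columns without a gap in either row."""
    if len(row_a) != len(row_b):
        raise ValueError("aligned rows must have the same length")
    # Keep only columns where both sequences have a residue (the overlap).
    overlap = [(x, y) for x, y in zip(row_a, row_b) if x != "-" and y != "-"]
    if not overlap:
        return 0.0
    identical = sum(1 for x, y in overlap if x == y)
    return 100.0 * identical / len(overlap)


# Toy aligned rows: the second sequence is shorter, so it starts with dashes.
longer = "AAAAMKVL"
shorter = "----MKIL"
print(f"{percent_identity(longer, shorter):.1f}% identical over the overlap")
```

Because the leading dashes are skipped, the extra length of one sequence does not lower the figure. Only the region where the two sequences actually overlap is compared.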


Using the BOXSHADE Tool to Align Two Sequences for Comparison
We are now going to use another tool in the workbench called BOXSHADE that is somewhat similar to the CLUSTALW tool. It is also an alignment tool, but does the color-coding in a different manner. Some people choose to use BOXSHADE because the colors are different and it is somewhat easier to visualize the degree of similarity between two sequences.

1. Scroll back up to the top of the CLUSTALW sequence alignment page and click on "Import Alignment(s)". You will be brought to the Alignment Tools homepage. The CLUSTALW alignment is listed below the tool menu.





2. Check the box next to the CLUSTALW alignment. Highlight the option called "BOXSHADE - Color-Coded Plots of Pre-Aligned Sequences" in the tool menu and click "Run". You will be brought to a screen that allows you to fine-tune your alignment -- you will use the default settings so just click the "Submit" button.



Now we can see the BOXSHADE alignment. It is very similar to the CLUSTALW alignment you just saw a minute ago, in the sense that the green and blue boxes show semi- and highly conserved sequences, respectively. The colors are slightly different and boxing the letters together makes the similarities and differences more obvious.



As you can see, there is extensive homology between the eukaryotic Drosophila rpoK sequence and that of the archaebacterium M. thermoautotrophicum.

3. When you are done viewing the alignment, click on the "Return" button and you will be returned to the Alignment Tools homepage.


Transcription Factor 2D (TFIID)
Using the rpoK subunit of RNA polymerase, we have obtained sequence data that supports the argument that archaebacteria are more closely related to eukaryotes than to the "true" bacteria. Can we obtain further evidence that supports the findings of Woese et al.? Eukaryotic RNA polymerases can accurately locate a promoter only if other proteins called transcription factors are also present at the promoter. One of these transcription factors is called TFIID, which stands for "transcription factor 2D". Now we are going to use TFIID to see if we obtain the same results as we did with rpoK.

1. Click on the "Protein Tools" button at the top of the page. This will return you to the Protein Tools homepage. Next, you need to import some sequences that code for the TFIID protein. Let's do a database search to discover a good sequence to use.

2. Select "Ndjinn - Multiple Database Search". Make sure that none of the protein sequences listed on this page are selected (that is, make sure that the boxes are unchecked) - if a sequence is selected, you will receive a message telling you to deselect it as soon as you hit "Run". Click "Run".



3. In the input box, type in "TFIID" (note: The "2" is typed in as two letter I's) and select "All" in the "Hits per page" pull-down menu.

4. Scroll down and select the SWISSPROT database. Scroll back up to the top of the page and click "Search".



The results of your search will appear in a new window. At the time of writing this tutorial, 86 sequences were pulled up using the TFIID search string -- you may get more as new sequences are constantly being added to the DNA and protein databases. This time there are actually some results that can be understood without "Showing records".



5. Highlight the sequence labeled "tf2d_mouse" and click on the "Import Sequence(s)" button. The mouse sequence for transcription factor TFIID should now be visible in your Protein Tools homepage. Now we are going to do a BLAST search to see what the eukaryotic mouse sequence retrieves from the databases. Remember, the BLAST tool searches for sequences that are very similar or homologous to the sequence being "blasted".

6. Check the box next to the TF2D_MOUSE sequence, highlight "BLASTP - Compare a PS to a PS DB" and click "Run".



7. On the next page select the SWISSPROT database and click "Submit". At the time of writing this tutorial, 83 sequences were found in this search.





8. Select the first 20 sequences and click the "Show Record(s)" button. Scroll down the Records page and look next to where it says "Organism Classification". There it will tell you which Domain the organism belongs to.





You should see that most, if not all, of the sequences determined by the search engine to be highly homologous to the eukaryotic mouse TFIID sequence are either eukaryotic or archaebacterial in origin. See if you can find a (eu)bacterial sequence in the top 50 matches.

