Bioinformatics Exercises
Paul Craig, Department of Chemistry, Rochester Institute of Technology

Fundamentals of Biochemistry
Copyright © 2008 John Wiley & Sons, Inc. All rights reserved.

Chapter 3: Databases for the storage and "mining" of genome sequences

Chapter 3 is an introduction to nucleotides, nucleic acids (DNA and RNA), and the processes of transcription and translation.

  1. One of the major bioinformatics tools is the biological database. Biological databases are an important resource for the study of biochemistry at all levels. They contain huge amounts of information about the sequences and structures of nucleic acids (DNA and RNA) and proteins, along with software tools that can be used to analyze the data. Some of this software can be used directly from a web browser; these tools are called web applications. Other software must be downloaded and installed on your local computer; these are called freestanding applications. We'll start with finding databases.
    1. What major databases are available online that contain DNA and protein sequences?
    2. Which databases contain entire genomes?
    3. Using your textbook and online resources (http://www.google.com), make sure you understand the meaning of the following terms.
      1. BLAST
      2. Taxonomy
      3. gene ontology
      4. phylogenetic trees
      5. multiple sequence alignment
    4. Once you have defined these terms, find resources on the Internet which enable you to study them.




  2. TIGR (The Institute for Genomic Research) Exercise. TIGR is now part of a larger organization called the J. Craig Venter Institute (JCVI). Before you begin exploring the resources at TIGR, look up J. Craig Venter and identify his role in genomics.
    1. Open the JCVI site (http://www.tigr.org). Find the Comprehensive Microbial Resource under their Databases menu. Please note: web interfaces change constantly, so the hints on finding resources here may be out of date before they are published. Many sites contain a search engine that can also be helpful in locating resources. Use PubMed to find the 2001 publication that describes the Comprehensive Microbial Resource at TIGR.
    2. Living organisms can be described in many ways. Prokaryotes can be divided into bacteria and archaea.
      1. Please define each of these terms, then find how many completed genomes are present for each in the Comprehensive Microbial Resource.
      2. How many completed genomes from Pseudomonas species have been deposited at TIGR? Which Pseudomonas species?
      3. Identify the primary reference for Pseudomonas putida KT2440.
    3. Find the link on the Comprehensive Microbial Resource home page for restriction digests under the Genome Tools tab (http://cmr.tigr.org/tigr-scripts/CMR/shared/MakeFrontPages.cgi?page=restriction_digest&crumbs=genomes). Perform a computer-generated restriction digest on Pseudomonas putida KT2440 with BamHI. How many fragments form and what is the average fragment size?

      What other options are available for analysis on this page?

      In addition to microbial genomes, TIGR also contains the genomes of many higher organisms. Return to the JCVI main page and search the site to identify five eukaryotic genomes that are available at TIGR. Include details of the links you followed to find your data.
    4. Identify one opportunity for online training at JCVI.

      Where would you go to apply for a job there?




  3. Through the use of high throughput methods, scientists are now able to sequence entire genomes in a very short period of time. Sequencing a genome is quite an accomplishment in itself, but it is really only the beginning of the study of an organism. Further study can be done both at the wet lab bench and on the computer. In this problem, you will use a computer to help you identify an open reading frame, determine the protein that it will express and find the bacterial source for that protein. Here is the DNA sequence.
    TACGCAATGCGTATCATTCTGCTGGGCGCTCCGGGCGCAGGTAAAGGTACTCAGGCTCAATT
    CATCATGGAGAAATACGGCATTCCGCAAATCTCTACTGGTGACATGTTGCGCGCCGCTGTAA
    AAGCAGGTTCTGAGTTAGGTCTGAAAGCAAAAGAAATTATGGATGCGGGCAAGTTGGTGACT
    GATGAGTTAGTTATCGCATTAGTCAAAGAACGTATCACACAGGAAGATTGCCGCGATGGTTT
    TCTGTTAGACGGGTTCCCGCGTACCATTCCTCAGGCAGATGCCATGAAAGAAGCCGGTATCA
    AAGTTGATTATGTGCTGGAGTTTGATGTTCCAGACGAGCTGATTGTTGAGCGCATTGTCGGC
    CGTCGGGTACATGCTGCTTCAGGCCGTGTTTATCACGTTAAATTCAACCCACCTAAAGTTGA
    AGATAAAGATGATGTTACCGGTGAAGAGCTGACTATTCGTAAAGATGATCAGGAAGCGACTG
    TCCGTAAGCGTCTTATCGAATATCATCAACAAACTGCACCATTGGTTTCTTACTATCATAAA
    GAAGCGGATGCAGGTAATACGCAATATTTTAAACTGGACGGAACCCGTAATGTAGCAGAAGT
    CAGTGCTGAACTGGCGACTATTCTCGGTTAATTCTGGATGGCCTTATAGCTAAGGCGGTTTA
    AGGCCGCCTTAGCTATTTCAAGTAAGAAGGGCGTAGTACCTACAAAAGGAGATTTGGCATGA
    TGCAAAGCAAACCCGGCGTATTAATGGTTAATTTGGGGACACCAGATGCTCCAACGTCGAAA
    GCTATCAAGCGTTATTTAGCTGAGTTTTTGAGTGACCGCCGGGTAGTTGATACTTCCCCATT
    GCTATGGTGGCCATTGCTGCATGGTGTTATTTTACCGCTTCGGTCACCACGTGTAGCAAAAC
    TTTATCAATCCGTTTGGATGGAAGAGGGCTCTCCTTTATTGGTTTATAGCCGCCGCCAGCAG
    AAAGCACTGGCAGCAAGAATGCCTGATATTCCTGTAGAATTAGGCATGAGCTATGGTTCAC
    
    1. First we will find an open reading frame in this segment of DNA. What is an open reading frame (ORF)? You can probably find the answer in your textbook or online with a simple Internet search (http://www.google.com). You may also wish to try the bookshelf at PubMed (http://www.ncbi.nlm.nih.gov/entrez/query.fcgi?db=Books). In bacteria, an open reading frame on a piece of mRNA almost always begins with AUG, which corresponds to ATG in the DNA segment that coded for the mRNA. According to the standard genetic code (see your textbook), there are three stop codons on mRNA: UAA, UAG, and UGA, which correspond to TAA, TAG, and TGA in the parent DNA segment. Here are the rules for finding an open reading frame in this piece of bacterial DNA:
      1. It must start with ATG. To simplify the exercise, the first ATG is the start codon. Normally you will not have this information in gene finding.
      2. It must end with TAA, TAG, or TGA.
      3. It must be at least 300 nucleotides long (coding for 100 amino acids).
      4. The ATG start codon and the stop codon must be in frame. This means that if you count the total number of bases in the sequence from the start to the stop codon, it must be evenly divisible by 3.

        Hints: I suggest you do this search by pasting the DNA sequence into a word processor, then searching for the start and stop codons. Once you have found a pair, you need to count the number of bases between them (including the 3 bases each for the start and stop codon). If you use Microsoft Word, you can highlight the text of your proposed ORF, then use the Tools..Word Count option from the drop-down menu and check the number for "Characters." This number must be divisible by 3.
    2. Admittedly, the manual search in step 1 is a tedious approach. Here is an easier one. Highlight the DNA sequence again and copy it. Then go to the Translate tool on the Expasy server (http://www.expasy.org/tools/dna.html). Paste the sequence into the box entitled, "Please enter a DNA or RNA sequence in the box below (numbers and blanks are ignored)." Then select "Verbose ("Met", "Stop", spaces between residues)" as your Output format and click on Translate sequence. The "Results of Translation" page which appears will contain 6 different reading frames.
      1. What is a reading frame and why are there 6? Once again, you can refer to your textbook, the Internet or the PubMed bookshelf for the answer.
      2. Select the reading frame that contains a protein (more than 100 continuous amino acids with no interruptions by a stop codon). Which reading frame is that? Write that down.
      3. Now go back to the Translate tool page, leave the DNA sequence in the sequence box, but select "Compact ("M", "-", no spaces)" as your Output format. Go to the same reading frame as before and copy the protein sequence (by one letter abbreviations) starting with "M" for methionine and ending in "-" for stop codon. Save this sequence to a separate text file.
    3. Now we're going to identify the protein and the bacterial source. Go to the NCBI BLAST page (http://www.ncbi.nlm.nih.gov/BLAST/).
      1. What does BLAST stand for? We're going to do a simple BLAST search using our protein sequence, but you can do much more with BLAST. You are encouraged to work the Tutorials on the BLAST home page to learn more.
      2. On the BLAST page, select protein BLAST. Enter your search sequence in the "Search" box. Use the default values for the rest of the page and click on the BLAST button. The BLAST server will go through several pages until it gives you the final result, which may take a while. What is the protein and what is the source? It should be the first one that is listed in the BLAST output.
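
The four ORF rules in step 1 of this problem can also be checked with a short script. The following is a minimal Python sketch, not part of the original exercise: the function name find_orf and its min_length parameter are illustrative assumptions, and it applies the exercise's simplification that the first ATG is the start codon.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}


def find_orf(dna, min_length=300):
    """Return (start, end, orf) for the ORF that begins at the first ATG.

    Positions are 0-based and `end` is exclusive. Returns None when no ORF
    satisfies the rules. `find_orf` is a hypothetical helper name.
    """
    seq = "".join(dna.split()).upper()    # drop spaces and line breaks
    start = seq.find("ATG")               # rule 1: the first ATG is the start
    if start == -1:
        return None
    # Step three bases at a time so any stop codon found is in frame (rule 4).
    for i in range(start, len(seq) - 2, 3):
        if seq[i:i + 3] in STOP_CODONS:   # rule 2: TAA, TAG or TGA
            orf = seq[start:i + 3]        # includes the start and stop codons
            if len(orf) >= min_length:    # rule 3: at least 300 nucleotides
                return start, i + 3, orf
            return None
    return None                           # ran off the end without a stop
```

To use it, paste the DNA sequence from this problem into a Python string and call find_orf on it. Because the scan moves codon by codon from the start, the length of any ORF it returns is automatically divisible by 3.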

    For instructors: You can do this exercise with any DNA sequence you choose. It is probably best to choose a DNA segment that encodes only one protein.
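
The six reading frames reported by the Translate tool in step 2 can be reproduced with a minimal Python sketch like the one below. It is an illustration only, not the Expasy implementation: the function names and the frame labels ("+1" to "-3") are assumptions, the standard genetic code is hard-coded, and stop codons are written as "-" to match the Compact output format.

```python
from itertools import product

# Standard genetic code, with codons ordered T, C, A, G at each position.
BASES = "TCAG"
AMINO = "FFLLSSSSYY--CC-WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): a for c, a in zip(product(BASES, repeat=3), AMINO)}


def reverse_complement(seq):
    """Return the reverse complement of an uppercase DNA string."""
    return seq.translate(str.maketrans("ACGT", "TGCA"))[::-1]


def translate(seq):
    """Translate complete codons only; unknown codons become 'X'."""
    return "".join(
        CODON_TABLE.get(seq[i:i + 3], "X") for i in range(0, len(seq) - 2, 3)
    )


def six_frames(dna):
    """Return a dict mapping frame labels to one-letter protein strings."""
    seq = "".join(dna.split()).upper()
    frames = {}
    # Three offsets on the given strand, three on its reverse complement.
    for strand, s in (("+", seq), ("-", reverse_complement(seq))):
        for offset in range(3):
            frames[f"{strand}{offset + 1}"] = translate(s[offset:])
    return frames
```

Splitting each frame on "-" and looking for a stretch longer than 100 residues that begins with "M" is one way to pick out the frame that contains the protein.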




  4. Sequence homology. We are going to use the protein that you identified in problem 3 to look at sequence homology with BLAST.
    1. First - some definitions: what do the terms "homolog", "ortholog", and "paralog" mean? (see textbook)
    2. Go to the NCBI BLAST page (http://www.ncbi.nlm.nih.gov/BLAST/). Find the text file of your protein sequence and paste the sequence into the BLAST "Search" box. Before you click on the "Search" button, we're going to narrow the search by kingdom. As you look down the BLAST page, you'll see some Options.
      Database: Non-redundant proteins (nr)
      Organism: Leave blank
      Entrez Query: Eukaryota
      Algorithm: blastp
      Now click on the "BLAST!" button and wait for your results.
      1. Can you find a homologous sequence from yeast (hint: use the Find tool in your browser with the term Saccharomyces)? Follow the links to find the GenBank entry for the yeast protein. Select Display..FASTA on the GenBank page. Copy and paste the sequence in FASTA format to the same file where you saved your earlier protein sequence.
      2. Can you find a homologous sequence from humans (use the Find tool in your browser with the term Homo)? Save the sequence in FASTA format (as you did above) to the same file. Most biochemists consider 25% the cutoff for sequence homology, meaning that if two proteins have less than 25% identical sequence, more evidence is needed to determine if they are homologs. Can you find any sequences that are above the 25% identity mark in your BLAST search?
    3. Use the BLAST Help page (http://www.ncbi.nlm.nih.gov/blast/Blast.cgi?CMD=Web&PAGE_TYPE=BlastDocs) to discover the meaning of the Score and E Value for each sequence that is reported. Also, what is the difference between an identity and a conservative substitution? Provide an example of each from the comparison of your sequence and a search sequence obtained from BLAST.
    4. BLAST uses a substitution matrix to assign values in the alignment process, based on the analysis of amino acid substitutions in a wide variety of protein sequences. What is the default substitution matrix on the BLAST page? What other matrices are available? What is the source of the names for these substitution matrices? Go to the original BLAST page (http://www.ncbi.nlm.nih.gov/blast/Blast.cgi?PAGE=Proteins&PROGRAM=blastp&BLAST_PROGRAMS=blastp&PAGE_TYPE=BlastSearch&SHOW_DEFAULTS=on) and repeat the BLAST search in step 3.b. using a different substitution matrix. To do this you will need to click on the Algorithm parameters link on the bottom of the page. Do you find different answers?
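
The 25% identity cutoff mentioned in step 2 is easier to apply once you see how a percent identity is computed from a pairwise alignment. The sketch below is a minimal Python illustration under stated assumptions: the two strings are already aligned (equal length, "-" marking gaps), the function name percent_identity is hypothetical, and the denominator is the full alignment length including gap columns. Other conventions for the denominator exist and give different percentages.

```python
def percent_identity(aligned_a, aligned_b, gap="-"):
    """Percent of alignment columns in which both sequences share a residue.

    Assumes the inputs are two rows of one pairwise alignment. Columns that
    contain a gap count toward the length but never count as identities.
    """
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must have the same length")
    if not aligned_a:
        raise ValueError("alignment is empty")
    identities = sum(
        1 for a, b in zip(aligned_a, aligned_b) if a == b and a != gap
    )
    return 100.0 * identities / len(aligned_a)


def above_cutoff(aligned_a, aligned_b, cutoff=25.0):
    """True when the pair meets the identity cutoff used in this exercise."""
    return percent_identity(aligned_a, aligned_b) >= cutoff
```

For example, percent_identity("MKVL", "MRVL") gives 75.0, because three of the four aligned positions are identical. The K/R mismatch in that example is the kind of position you would then inspect to decide whether it is a conservative substitution.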




  5. Plasmids and Cloning
    1. REBASE is the Restriction Enzyme Database (http://rebase.neb.com/rebase/rebase.html) which is supported by a number of commercial restriction enzyme suppliers. Go to the REBASE Enzymes page (http://rebase.neb.com/rebase/rebase.enz.html) and find a restriction enzyme from Rhodothermus marinus (it will start with the letters Rma). What is the abbreviation for this enzyme?

      What is the recognition sequence for this enzyme?

      What are the expected and actual frequencies of restriction enzyme recognition sites for this enzyme in Bacillus halodurans C-125?
    2. What is a plasmid? pBR322 was one of the first plasmids to be developed for experimental work. Go to the Entrez site (http://www.ncbi.nlm.nih.gov/Entrez) and find the sequence of pBR322 using the terms, "cloning vector pBR322."

      Save the sequence of pBR322 to a file in FASTA format (here are two sites that describe the FASTA format, but there are many more: http://ngfnblast.gbf.de/docs/fasta.html; http://bioinformatics.ubc.ca/resources/faq/?faq_id=1). To get Entrez to display your sequence in FASTA format, go to the pBR322 sequence page, select "FASTA" as the Display option. Save the pBR322 sequence in FASTA format to a file (suggested name, pBR322.fasta).

      Plasmids normally contain genes encoding enzymes that make the host bacterium resistant to antibiotics. Look through the Entrez description of pBR322 (http://www.ncbi.nlm.nih.gov/entrez/viewer.fcgi?db=nuccore&id=208958) and identify one enzyme encoded by pBR322 and name the antibiotic that it targets.
    3. Go to PubMedCentral (http://www.pubmedcentral.gov/) and search for a 1978 article in Nucleic Acids Research about restriction mapping of pBR322. Download the article in pdf format (use Adobe Acrobat Reader to view it - you can get this at http://www.adobe.com). What is the total size of the pBR322 plasmid in bases?

      How many cut sites are there for the restriction enzyme HaeII on pBR322?
    4. When using restriction enzymes, some digests result in "blunt ends" and some in "sticky ends." Explain the meaning of those terms and provide an example of each.
    5. Go to the RESTRICT site at the Pasteur Institute (http://bioweb.pasteur.fr/seqanal/interfaces/restrict.html). Input your pBR322 sequence file. Make sure you enter your email address properly. Scroll down to the "Required Section" and notice that you have a Minimum recognition site length of 4 nucleotides and you have selected all the enzymes available in REBASE to digest pBR322 at the same time. Click on the "Run Restrict" button.
    6. After you hit the "Run Restrict" button, you'll see an output screen. Click on the link "output.out." This will take you to a simple text page that lists all the cuts that were made in the pBR322 plasmid. Select all the text on this page, copy it and save it to a simple text file on your computer (suggested file name: pBR322_restrict_4.txt). How many fragments did the pBR322 get cut into by "all" enzymes? Look for the "HitCount" number on the output.out page.
    7. Now let's play with the web site a bit. What happens to the number of fragments you get if you change the Minimum recognition site length to 6 nucleotides? Save the output.out data to a file (pBR322_restrict_6.txt). Why did the number change?
    8. Now change the enzyme name from "all" to "Bam HI" in the enzymes box under the Required section on this page. How many fragments do you find?

      Now try AvaI. Again, how many fragments do you get from AvaI?

      Now I'll give you one more example to try: Eco47III. How many fragments do you get from Eco47III?

      What is the size of the restriction site for Eco47III?
    9. What happens if you combine the three different enzymes (separate the enzyme names by commas)? Study the output.out file to find the following information:
      1. How many fragments are generated?
      2. Where are the restriction sites on pBR322?
      3. What are the sizes of the fragments that are generated?
    10. Design a set of experiments using mixtures of restriction enzymes aimed at drawing a restriction map of pBR322 similar to the map of pUC18 in Fig. 3-24, showing the relative positions of the restriction enzymes BamHI, AvaI and PstI. Carry out the experiment and draw the restriction map. Note: The Invitrogen web site includes an interesting new tool called VectorDesigner (https://vectordesigner.invitrogen.com/browser.cfm) that could also be used for this exercise.
    11. For the adventurous: Search REBASE to find an enzyme or combination of enzymes that will produce 10 fragments from pBR322.
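
The fragment counts and sizes asked for in steps 8 to 10 follow from the positions of the recognition sites on a circular molecule. The following is a minimal Python sketch of that bookkeeping, not the RESTRICT program itself. The function names are assumptions, and it makes two simplifications: only exact (non-degenerate) recognition sequences are matched, and each cut is placed at the first base of the site, which gives the right number of fragments but can shift sizes by a few bases when enzymes that cut at different offsets are combined.

```python
def find_sites(seq, site):
    """Return 0-based start positions of `site` in a circular sequence."""
    seq, site = seq.upper(), site.upper()
    # Append the first bases again so a site spanning the origin is found.
    extended = seq + seq[:len(site) - 1]
    return [i for i in range(len(seq)) if extended.startswith(site, i)]


def circular_fragments(length, cut_positions):
    """Return fragment sizes for a circular molecule cut at the positions."""
    cuts = sorted(set(cut_positions))
    if not cuts:
        return [length]   # an uncut plasmid stays one circular molecule
    sizes = []
    for i, cut in enumerate(cuts):
        next_cut = cuts[(i + 1) % len(cuts)]          # wraps past the origin
        sizes.append((next_cut - cut) % length or length)
    return sizes


def digest(seq, sites):
    """Digest a circular sequence with every recognition site in `sites`."""
    cuts = [pos for site in sites for pos in find_sites(seq, site)]
    return circular_fragments(len(seq), cuts)
```

As a usage example, digest(plasmid, ["GGATCC", "CTGCAG"]) would model a double digest with BamHI (GGATCC) and PstI (CTGCAG) on a sequence held in the string plasmid. Note that a single cut in a circle yields one linear fragment of full length, so n cuts give n fragments rather than n + 1.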




  6. New Developments in RNA. Scientists are finding that RNA has many functions that we were not aware of until recently. For many years it was thought that proteins can only function properly if they have the correct three dimensional (3D) structure (still thought to be true), but that nucleic acids (DNA and RNA) were information carriers that could assume almost any shape and still retain their function. We are finding that structure is very important to the function of many RNA molecules.
    1. Traditionally RNA molecules have been subdivided into three groups: ribosomal RNA (rRNA), messenger RNA (mRNA) and transfer RNA (tRNA). In recent years a number of other types of RNA have been discovered and are assuming greater importance in our understanding of living systems. Explore one of the following links to find 3 more types of RNA and prepare a brief description of each type:
      1. The RNA World Website (http://www.imb-jena.de/RNA.html)
      2. RNA database listing at Stanford (http://cmgm.stanford.edu/WWW/www_databases.html#rna)
      3. RNA database listing at BioExplorer.net (http://www.bioexplorer.net/Databases/RNA_Databases/)
      4. RNAbase (http://www.rnabase.org/)
      5. Anything you can find on Google.
    2. The first 3D structures of nucleic acids were tRNA molecules. Go to the Protein Data Bank (http://www.rcsb.org/pdb), search for tRNA and use the Jmol viewer to explore the structure. What features of tRNA can you see in this viewer?
    3. Scientists are finding that secondary structure prediction for RNA molecules is now a very important contributor to our understanding of the structure and function of RNA. For this exercise, you will need to identify a web resource about the secondary structure of RNA. Report back on the following:
      1. The purpose of the web resource
      2. For databases: What data resources are found there? How can they be accessed? Are they linked to resources containing medical information, genomic information and gene ontology data?
      3. If it is a web application, what questions is it designed to answer? What type of input does it take and what outputs are generated?