Nucleotides

From CompSciWiki
Revision as of 19:59, 4 April 2011 by AndreiP (Talk | contribs)



Problem

DNA can be considered the source code of life. It contains the instructions for (most of the) life on Earth. Unlike digital computers that use binary to encode information and instructions in 1s and 0s, DNA has four possible nucleotides: Adenine, Cytosine, Guanine, and Thymine, represented by A, C, G and T, respectively.

Bioinformatics is a field of study that applies concepts from computer science and statistics to the study of molecular biology, using computer software and hardware.

Biologists and bioinformaticians typically store genetic databases of DNA in a text‐based file format called FASTA. The FASTA file format is relatively simple:

  • Each FASTA‐formatted file starts with a header line. The first character on a header line is always the greater‐than character (‘>’)
  • The header line is followed by lines of sequence data (As, Cs, Gs, and Ts), with 80 (or fewer) characters per line

Example:

>gi|21071042|ref|NM_000193.2| Homo sapiens sonic hedgehog (SHH), mRNA
GCGAGGCAGCCAGCGAGGGAGAGAGCGAGCGGGCGAGCCGGAGCGAGGAAGGGAAAGCGCAAGAGAGAGC
GCACACGCACACACCCGCCGCGCGCACTCGCGCACGGACCCGCACGGGGACAGCTCGGAAGTCATCAGTT
CCATGGGCGAGATGCTGCTGCTGGCGAGATGTCTGCTGCTAGTCCTCGTCTCCTCGCTGCTGGTATGCTC
GGGACTGGCGTGCGGACCGGGCAGGGGGTTCGGGAAGAGGAGGCACCCCAAAAAGCTGACCCCTTTAGCC
TACAAGCAGTTTATCCCCAATGTGGCCGAGAAGACCCTAGGCGCCAGCGGAAGGTATGAAGGGAAGATCT
CCAGAAACTCCGAGCGATTTAAGGAACTCACCCCCAATTACAACCCCGACATCATATTTAAG

Write a program that will:

  • read a file containing one FASTA formatted sequence record
  • count the number of each of the four nucleotides (A,C,G and T) per sequence in the record
  • print a histogram displaying the genetic make‐up of the sequence record


Input:

Use JFileChooser to prompt the user to select a FASTA formatted file, and use Scanner to read the file selected by the user. Assume that the file chosen by the user will meet the standard FASTA file format described above, and that the sequence data will be all upper case characters.

So far in the course you’ve used Scanner to get input from the user with the keyboard. In some situations, getting input from the keyboard is cumbersome (entering 1000 bases, for example). To use JFileChooser, your program must add the following import statements to its regular import statements:

import java.io.File;
import java.io.FileNotFoundException;
import javax.swing.JFileChooser;

You’ll also need to insert and use the following static method:

/**
* Prompt the user to select a file using JFileChooser and return
* the File object that the user selected.
*
* @return the file selected by the user
*/
public static File getFile() {
	JFileChooser chooser = new JFileChooser();
	File fastaFile = null;
	int returnValue;
	
	// show the open file dialog
	returnValue = chooser.showOpenDialog(null);

	// check to see that the user clicked the OK button
	if (returnValue == JFileChooser.APPROVE_OPTION) {

		// get the selected file
		fastaFile = chooser.getSelectedFile();
	}

	return fastaFile;
}

So far we’ve been using Scanner by passing System.in to get input from the keyboard:

	Scanner keyboard = new Scanner(System.in);

Scanner can also be used to read the contents of a file. Simply replace System.in with an instance of a File variable (like the one returned by getFile()):

	Scanner fileReader = new Scanner(aFile);

Finally, you must surround any code that uses a fileReader instance of Scanner with a try/catch block. Try/catch blocks are used in Java to deal with exceptions that may occur while the program runs. When reading files with Scanner, we have to deal with a possible FileNotFoundException, an exception that may occur if the user types in the name of a file that does not exist.

Example:

try {
	Scanner fileReader = new Scanner(aFile);
	// your code working with fileReader goes here

} catch (FileNotFoundException e) {
	e.printStackTrace();
}
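As a rough illustration of the pattern above, the following sketch reads a file line by line inside a try/catch block. The class name ScannerFileSketch and the helper countLines are hypothetical and not part of the assignment, and the sketch writes its own small temporary file so it can run without a file dialog.

```java
import java.io.File;
import java.io.FileNotFoundException;
import java.io.IOException;
import java.io.PrintWriter;
import java.util.Scanner;

class ScannerFileSketch {

    // Hypothetical helper (not part of the assignment): counts the lines in a file.
    static int countLines(File aFile) {
        int lines = 0;
        try {
            Scanner fileReader = new Scanner(aFile);
            // keep reading until the Scanner has no more lines left
            while (fileReader.hasNextLine()) {
                fileReader.nextLine();
                lines++;
            }
            fileReader.close();
        } catch (FileNotFoundException e) {
            // reached only if the file does not exist or cannot be opened
            e.printStackTrace();
        }
        return lines;
    }

    public static void main(String[] args) throws IOException {
        // Write a tiny two-line file so the sketch can run on its own.
        File temp = File.createTempFile("sketch", ".fasta");
        temp.deleteOnExit();
        PrintWriter out = new PrintWriter(temp);
        out.println(">example header");
        out.println("ACGT");
        out.close();

        System.out.println(countLines(temp)); // prints 2
    }
}
```

In your own program, the File passed to Scanner would come from getFile() rather than from a temporary file.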


Methods:

In addition to the getFile() method, your program must implement and use the following static methods:

	public static void main(String[] args)

Your main method should prompt the user to select a FASTA file using the getFile() method. Then, the main method should pass the file to getHeader() and getSequence() to get the header and sequence from the FASTA file. Next, the main method should call countBase() for each of the four different types of bases and determine the percentage of the entire sequence that each base represents. The percentage should be rounded to the nearest whole number. Finally, the main method should call printDNAHistogram() to print out the histogram representing the statistics of the file.
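One detail that is easy to get wrong is the rounding. The sketch below shows one possible way to turn a count into a whole-number percentage; the class name PercentSketch and the helper toPercent are hypothetical, and the sketch assumes the total is greater than zero.

```java
class PercentSketch {

    // Hypothetical helper: the percentage that count is of total,
    // rounded to the nearest whole number. Assumes total > 0.
    static int toPercent(int count, int total) {
        // Multiplying by 100.0 forces floating-point arithmetic;
        // count / total with two ints would truncate to 0 whenever count < total.
        return (int) Math.round(count * 100.0 / total);
    }

    public static void main(String[] args) {
        System.out.println(toPercent(1, 3)); // prints 33
        System.out.println(toPercent(2, 3)); // prints 67
        System.out.println(toPercent(1, 4)); // prints 25
    }
}
```

Because each percentage is rounded separately, the four values for a sequence will not always add up to exactly 100.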

	public static String getHeader(File fastaFile)

Using Scanner, the getHeader method should read the first line contained in fastaFile (which is the header line of the record). This method should remove the leading ‘>’ character before returning the header line.

	public static String getSequence(File fastaFile)

Using Scanner, the getSequence method should skip the first line contained in fastaFile (which is the header line of the record) and store any subsequent lines (the sequence data) until the Scanner has no more lines left. The getSequence method should use the hasNextLine() and nextLine() methods from the Scanner class to read the sequence from the file. Once the Scanner has no more lines remaining, the getSequence method should return all of the data that it stored.
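The two descriptions above could be sketched as follows. This is only one possible approach: the class name FastaReadSketch is hypothetical, returning an empty String when the file cannot be opened is an assumption (the assignment does not say what to do in that case), and the temporary file in main exists only so the sketch can run without a file dialog.

```java
import java.io.File;
import java.io.FileNotFoundException;
import java.io.IOException;
import java.io.PrintWriter;
import java.util.Scanner;

class FastaReadSketch {

    public static String getHeader(File fastaFile) {
        String header = ""; // assumption: empty header if the file cannot be read
        try {
            Scanner fileReader = new Scanner(fastaFile);
            if (fileReader.hasNextLine()) {
                // substring(1) drops the leading '>' character
                header = fileReader.nextLine().substring(1);
            }
            fileReader.close();
        } catch (FileNotFoundException e) {
            e.printStackTrace();
        }
        return header;
    }

    public static String getSequence(File fastaFile) {
        String sequence = ""; // assumption: empty sequence if the file cannot be read
        try {
            Scanner fileReader = new Scanner(fastaFile);
            if (fileReader.hasNextLine()) {
                fileReader.nextLine(); // skip the header line
            }
            while (fileReader.hasNextLine()) {
                sequence += fileReader.nextLine(); // join the sequence lines together
            }
            fileReader.close();
        } catch (FileNotFoundException e) {
            e.printStackTrace();
        }
        return sequence;
    }

    public static void main(String[] args) throws IOException {
        // Small made-up record, used only to exercise the two methods.
        File temp = File.createTempFile("sketch", ".fasta");
        temp.deleteOnExit();
        PrintWriter out = new PrintWriter(temp);
        out.println(">example record");
        out.println("GATT");
        out.println("ACA");
        out.close();

        System.out.println(getHeader(temp));   // prints example record
        System.out.println(getSequence(temp)); // prints GATTACA
    }
}
```

Each method opens its own Scanner on the file, so getSequence always starts again from the first line and has to skip the header itself.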

	public static int countBase(String sequence, char base)

The countBase method accepts sequence data (a String containing only As, Cs, Gs, and Ts), and a specific base character to count. The countBase method should return the number of times that base occurred in sequence.

	public static void printDNAHistogram(String header, int aPct, int cPct, int gPct, int tPct)

The printDNAHistogram method should print out the histogram for the result calculated by other methods. The printDNAHistogram method should first print out header, and then it should call the printBar method (described below) for each of aPct, cPct, gPct, and tPct.

	public static void printBar(String label, int percent, char barSymbol)

The printBar method should print out a single bar for a histogram. The printBar method should first print out label. Then, the method should print out percent copies of the symbol barSymbol. Finally, the method should print out the actual percentage that the bar displays. All of this information should be on the same line. For example, the call

	printBar("A",25,'=');

would print out the following line:

A: ========================= (25%)


Output:

Use System.out for all output. For the sequence record in the file, print:

  • The header line (without the leading ‘>’)
  • A histogram showing the percentage of As, the percentage of Cs, the percentage of Gs, and the percentage of Ts
 

(Image: Gene casestudy.jpg)

Solution

I will be using a step-by-step process to guide you to the solution to the problem. Please keep in mind that there may be many other possible solutions.

Step 0: Before you start coding... PLAN!

A common problem for students who are new to programming is a lack of planning before the actual coding. Many students jump right in and start coding without any real plan to follow. While that may work for simple assignments, it can become frustrating in more complicated assignments, where your code has to do many different things. Planning gives you a good understanding of the problem and a baseline for how you will code the solution.

Figure out exactly what the problem is asking you to do

The first step is to figure out what the problem is asking you to do, since you can't really code without knowing what you're coding for. To get a big-picture overview of how the code should look, be sure to read the assignment. It helps to read it multiple times and use a highlighter to make the important points stand out.

In this problem, the requirements are clearly stated within the outline:

  • read a file containing one FASTA formatted sequence record
  • count the number of each of the four nucleotides (A,C,G and T) per sequence in the record
  • print a histogram displaying the genetic make‐up of the sequence record

It also asks you to output the following:

  • The header line (without the leading ‘>’)
  • A histogram showing the percentage of As, the percentage of Cs, the percentage of Gs, and the percentage of Ts

Once you have an understanding of what you need to solve, the next step is to create pseudocode.

Pseudocode

Pseudocode can be thought of as an outline or structure of your code without real coding. It is meant to be readable and can serve as the basis for your real code. By pseudocoding, we can take the required methods in the assignment and figure out the logic or algorithm we need to use in the real code. Also, this is an easy way to allow us to see how methods interact with one another. The key thing is to read carefully what is needed for each method and apply it to your pseudocode. Please refer back to the Problem section for the specific details for each method. Since the exact code has already been provided for the getFile() method, it does not need to be pseudocoded.

	public static void main(String [] args)
	{
		file = getFile()	//prompts user for and reads in the text file.  the text file is then tied to a variable
		header = getHeader(file)	//grabs the first line in the text file and saves it to a string
		sequence = getSequence(file) //stores the rest of the lines in one string
		
		//4 bases: A, C, G, T
		//count the bases in the sequence and save them to their respective variables
		a = countBase(sequence, 'A')
		c = countBase(sequence, 'C')
		g = countBase(sequence, 'G')
		t = countBase(sequence, 'T')
		
		//determine the percentages of each base (rounded to the nearest whole number)
		apct = # of a's/total bases * 100
		cpct = # of c's/total bases * 100
		gpct = # of g's/total bases * 100
		tpct = # of t's/total bases * 100
		
		printDNAHistogram(header, apct, cpct, gpct, tpct) //print histogram of results
	}
	public static String getHeader(File fastaFile)
	{
		//grab the first line of text from fastaFile (the header)
		//remove the leading '>'
		//save the header (minus the '>' character) into a string variable
		
		//return the  string variable
	}
	public static int countBase(String sequence, char base)
	{
		//create an integer variable to store the number of a specific base in the sequence
		
		//loop until we have gone through each character in the sequence
			//if the current character is equal to our base
				//increment the integer variable
		//once we have gone through each character in the sequence...
		
		//return the integer variable
	}
	public static void printDNAHistogram(String header, int aPct, int cPct, int gPct, int tPct)
	{
		//print out the header
		
		//call print bar for each base (A, C, G, T)
		//So for each of the following you're sending... a string corresponding to the base, the integer percentage of the sequence
		//that the base represents, and the character to be displayed as the "bars" of the histogram
		printBar("A", aPct, '=')
		printBar("C", cPct, '=')
		printBar("G", gPct, '=')
		printBar("T", tPct, '=')
	}
	public static void printBar(String label, int percent, char barSymbol)
	{
		//remember all of these should be printed on the same line
		
		//print out the label
		
		//loop for an x # of times depending on percent's value
			//print out the bar symbol (in this case: '=')
		
		//print out the percentage using percent
	}
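To show how pseudocode like this can turn into real code, here is a minimal Java sketch of the countBase and printBar pseudocode. It is one possible translation, not the only one; the class name HistogramSketch and the sample sequence in main are made up for illustration.

```java
class HistogramSketch {

    public static int countBase(String sequence, char base) {
        int count = 0; // number of times base has been seen so far

        // go through each character in the sequence
        for (int i = 0; i < sequence.length(); i++) {
            if (sequence.charAt(i) == base) {
                count++;
            }
        }
        return count;
    }

    public static void printBar(String label, int percent, char barSymbol) {
        // print (not println) keeps everything on the same line
        System.out.print(label + ": ");

        // one copy of barSymbol per percentage point
        for (int i = 0; i < percent; i++) {
            System.out.print(barSymbol);
        }

        System.out.println(" (" + percent + "%)");
    }

    public static void main(String[] args) {
        String sequence = "GATTACA"; // made-up sample sequence
        System.out.println(countBase(sequence, 'A')); // prints 3
        printBar("A", 25, '='); // prints the example line from the Problem section
    }
}
```

Notice that every comment in the pseudocode maps to one or two lines of Java, which is what makes careful pseudocode worth the effort.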

Diagram

In addition to the pseudocode, a visual aid or diagram can further help you understand how the methods of the code fit together. While there are many ways to create your diagram, the key thing is to make it as simple as possible. Try to avoid any unnecessary details. In the following example, I have created a diagram showing when methods are used in the program, and which methods they call.

(Diagram to be inserted once uploaded.)

Step 1: Start with the main method

Since the main method is the first thing run by the program, it is a good place to start coding. However, if you look at the pseudocode for the main method and the diagram that we have created, you can see that a lot of the work is done by the other methods. So, using the pseudocode for main as a basis, we can start coding the other methods. I will be coding in a top-down fashion. The first piece of code that appears in our main method pseudocode is a method call to getFile().

Step 2: Understanding the getFile() method

Since the code for the getFile() method was already provided for us, all we really need to do is use it and understand how it works.

Let's start with the variable declarations:

	JFileChooser chooser = new JFileChooser();
	File fastaFile = null;
	int returnValue;

JFileChooser is a class included in Java. It provides a simple mechanism for the user to choose a file to be loaded into the program. The File variable fastaFile is where we store the file that was chosen using JFileChooser. This variable is what we later pass to Scanner to read the file's contents. The int variable returnValue will hold the code that JFileChooser returns to indicate whether the user approved or cancelled the selection.

The next step is to open the file dialog so that the user can choose their FASTA input file from their local directories.

		// show the open file dialog
		returnValue = chooser.showOpenDialog(null);

The showOpenDialog method returns a code indicating which button the user pressed in the dialog, and we store that code in the returnValue variable created earlier.

Next, we check whether the OK button was pressed. If it was, we save the selected file to the File variable fastaFile.

		// check to see that the user clicked the OK button
		if (returnValue == JFileChooser.APPROVE_OPTION) {
			// get the selected file
			fastaFile = chooser.getSelectedFile();
		}

Once the work in the method is done, we return fastaFile to the method that called it, which in this case is the main method.
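One consequence of this code is that getFile() returns null when the user closes or cancels the dialog, because fastaFile is only assigned inside the if statement. The sketch below shows one way the caller might guard against that before handing the file to Scanner. The class name GetFileResultSketch, the helper describeSelection, and the file name shh.fasta are all hypothetical.

```java
import java.io.File;

class GetFileResultSketch {

    // Hypothetical helper: shows the check a caller could make on getFile()'s result.
    static String describeSelection(File fastaFile) {
        if (fastaFile == null) {
            // getFile() leaves fastaFile as null unless the user approved a selection
            return "No file selected.";
        }
        return "Selected: " + fastaFile.getName();
    }

    public static void main(String[] args) {
        System.out.println(describeSelection(null));                  // prints No file selected.
        System.out.println(describeSelection(new File("shh.fasta"))); // prints Selected: shh.fasta
    }
}
```

In the real program, the same null check would wrap the calls to getHeader() and getSequence() in main.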



Closing Remarks

Remember that this is not the only "correct" progress report. There can be many different progress reports based on how the vital points of the program were interpreted. In this case study, however, the calculations are especially important to the correct execution of the program.

Code

Solution Code
