Nucleotides

From CompSciWiki

Problem

DNA can be considered the source code of life. It contains the instructions for (most of the) life on Earth. Unlike digital computers that use binary to encode information and instructions in 1s and 0s, DNA has four possible nucleotides: Adenine, Cytosine, Guanine, and Thymine, represented by A, C, G and T, respectively.

Bioinformatics is a field of study that applies concepts from computer science and statistics to study molecular biology using computer software and hardware.

Biologists and bioinformaticians typically store genetic databases of DNA in a text‐based file format called FASTA. The FASTA file format is relatively simple:

  • Each FASTA‐formatted file starts with a header line. The first character on a header line is always the greater‐than character (‘>’)
  • The header line is followed by lines of sequence data (As, Cs, Gs, and Ts), with 80 (or fewer) characters per line

Example:

>gi|21071042|ref|NM_000193.2| Homo sapiens sonic hedgehog (SHH), mRNA
GCGAGGCAGCCAGCGAGGGAGAGAGCGAGCGGGCGAGCCGGAGCGAGGAAGGGAAAGCGCAAGAGAGAGC
GCACACGCACACACCCGCCGCGCGCACTCGCGCACGGACCCGCACGGGGACAGCTCGGAAGTCATCAGTT
CCATGGGCGAGATGCTGCTGCTGGCGAGATGTCTGCTGCTAGTCCTCGTCTCCTCGCTGCTGGTATGCTC
GGGACTGGCGTGCGGACCGGGCAGGGGGTTCGGGAAGAGGAGGCACCCCAAAAAGCTGACCCCTTTAGCC
TACAAGCAGTTTATCCCCAATGTGGCCGAGAAGACCCTAGGCGCCAGCGGAAGGTATGAAGGGAAGATCT
CCAGAAACTCCGAGCGATTTAAGGAACTCACCCCCAATTACAACCCCGACATCATATTTAAG

Write a program that will:

  • read a file containing one FASTA formatted sequence record
  • count the number of each of the four nucleotides (A,C,G and T) per sequence in the record
  • print a histogram displaying the genetic make‐up of the sequence record


Input:

Use JFileChooser to prompt the user to select a FASTA formatted file, and use Scanner to read the file selected by the user. Assume that the file chosen by the user will meet the standard FASTA file format described above, and that the sequence data will be all upper case characters.

So far in the course you’ve used Scanner to get input from the user with the keyboard. In some situations, getting input from the keyboard is cumbersome (entering 1000 bases, for example). To use JFileChooser your program must add a couple of import statements to the regular import statements:

import java.io.File;
import java.io.FileNotFoundException;
import javax.swing.JFileChooser;

You’ll also need to insert and use the following static method:

/**
* Prompt the user to select a file using JFileChooser and return
* the File object that the user selected.
*
* @return the file selected by the user
*/
public static File getFile() {
	JFileChooser chooser = new JFileChooser();
	File fastaFile = null;
	int returnValue;
	
	// show the open file dialog
	returnValue = chooser.showOpenDialog(null);

	// check to see that the user clicked the OK button
	if (returnValue == JFileChooser.APPROVE_OPTION) {

		// get the selected file
		fastaFile = chooser.getSelectedFile();
	}

	return fastaFile;
}

So far we’ve been using Scanner by passing System.in to get input from the keyboard:

	Scanner keyboard = new Scanner(System.in);

Scanner can also be used to read the contents of a file. Simply replace System.in with an instance of a File variable (like the one returned by getFile()):

	Scanner fileReader = new Scanner(aFile);

Finally, you must surround any code that creates a Scanner from a File with a try/catch block. Try/catch blocks are used in Java to deal with exceptions that may occur while the program is running. When reading files using Scanner, we have to deal with a possible FileNotFoundException, an exception that occurs if the selected file does not exist or cannot be opened.

Example:

try {
	Scanner fileReader = new Scanner(aFile);
	// your code working with fileReader goes here

} catch (FileNotFoundException e) {
	e.printStackTrace();
}
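As a hedged illustration of the pattern above, the following sketch reads every line of a file with a Scanner inside a try/catch block. The class name ScannerFileSketch and the helpers countLines and writeTempFile are hypothetical names introduced only for this example, and the temporary file merely stands in for a file chosen through getFile().

```java
import java.io.File;
import java.io.FileNotFoundException;
import java.io.IOException;
import java.io.PrintWriter;
import java.io.UncheckedIOException;
import java.util.Scanner;

// Hypothetical sketch: not one of the assignment's required methods.
class ScannerFileSketch {

	// Count the lines in aFile using the try/catch pattern shown above.
	static int countLines(File aFile) {
		int lines = 0;
		try {
			Scanner fileReader = new Scanner(aFile);
			while (fileReader.hasNextLine()) {
				fileReader.nextLine(); // read (and here discard) one line
				lines++;
			}
			fileReader.close();
		} catch (FileNotFoundException e) {
			// reached when the file does not exist or cannot be opened
			e.printStackTrace();
		}
		return lines;
	}

	// Assumption for the demo only: write some text to a temporary file.
	static File writeTempFile(String contents) {
		try {
			File temp = File.createTempFile("sketch", ".fasta");
			temp.deleteOnExit();
			try (PrintWriter out = new PrintWriter(temp)) {
				out.print(contents);
			}
			return temp;
		} catch (IOException e) {
			throw new UncheckedIOException(e);
		}
	}

	public static void main(String[] args) {
		File demo = writeTempFile(">demo header\nACGT\nTTGA\n");
		// one header line plus two sequence lines
		System.out.println(countLines(demo));
	}
}
```

If the file is missing, this sketch prints the stack trace and returns 0, which mirrors the e.printStackTrace() handling used in the example above.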


Methods:

In addition to the getFile() method, your program must implement and use the following static methods:

	public static void main(String[] args)

Your main method should prompt the user to select a FASTA file using the getFile() method. Then, the main method should pass the file to getHeader() and getSequence() to get the header and sequence from the FASTA file. Next, the main method should call countBase() for each of the four different types of bases and determine the percentage of the entire sequence that each base represents. The percentage should be rounded to the nearest whole number. Finally, the main method should call printDNAHistogram() to print out the histogram representing the statistics of the file.

	public static String getHeader(File fastaFile)

Using Scanner, the getHeader method should read the first line contained in fastaFile (which is the header line of the record). This method should remove the leading ‘>’ character before returning the header line.

	public static String getSequence(File fastaFile)

Using Scanner, the getSequence method should skip the first line contained in fastaFile (which is the header line of the record) and store any subsequent lines (the sequence data) until the Scanner has no more lines left. The getSequence method should use the hasNextLine() and nextLine() methods from the Scanner class to read the sequence from the file. Once the Scanner has no more lines remaining, the getSequence method should return all of the data that it stored.

	public static int countBase(String sequence, char base)

The countBase method accepts sequence data (a String containing only As, Cs, Gs, and Ts), and a specific base character to count. The countBase method should return the number of times that base occurred in sequence.

	public static void printDNAHistogram(String header, int aPct, int cPct, int gPct, int tPct)

The printDNAHistogram method should print out the histogram for the result calculated by other methods. The printDNAHistogram method should first print out header, and then it should call the printBar method (described below) for each of aPct, cPct, gPct, and tPct.

	public static void printBar(String label, int percent, char barSymbol)

The printBar method should print out a single bar for a histogram. The printBar method should first print out label. Then, the method should print out percent copies of the symbol barSymbol. Finally, the method should print out the actual percentage that the bar displays. All of this information should be on the same line. For example, the call

	printBar("A",25,'=');

would print out the following line:

A: ========================= (25%)


Output:

Use System.out for all output. For the sequence record in the file, print:

  • The header line (without the leading ‘>’)
  • A histogram showing the percentage of As, the percentage of Cs, the percentage of Gs, and the percentage of Ts
 

[Figure: Gene casestudy.jpg]

Solution

I will be using a step-by-step process to guide you to the problem solution. The first step involves planning your work and the subsequent steps involve creating each method separately.

Step 0: Before you start coding... PLAN!

A common problem that many students have when programming is a lack of planning before the actual coding begins. Many students jump right in and start coding away without any real plan to follow. While this method may work for simple assignments, it can become a frustrating experience as assignment difficulty and complexity increase. Planning allows you to get a good grasp of the problem and gives you a baseline to follow when coding.


Figure out exactly what the problem is asking you to do

The first step to coding is to figure out what the problem is asking you to do. It helps to read the assignment multiple times and mark down important points.


In this problem, you are asked to:

  • read a file containing one FASTA formatted sequence record
  • count the number of each of the four nucleotides (A,C,G and T) per sequence in the record
  • print a histogram displaying the genetic make‐up of the sequence record


It also asks you to output the following:

  • The header line (without the leading ‘>’)
  • A histogram showing the percentage of As, the percentage of Cs, the percentage of Gs, and the percentage of Ts


Once you have a solid understanding of what you need to do, the next step is to create pseudocode.

Pseudocode

Pseudocode can be thought of as an outline or structure of your code. It is meant to be readable and easy to follow. By pseudocoding, we can take the required methods in the assignment and figure out the logic we need to use in the actual code. This is also an easy way to see how methods interact with one another. The key thing is to understand what is needed for each method and apply it to your pseudocode. Please refer back to the Problem section for the specific details for each method. Since the exact code has already been provided for the getFile() method, it does not need to be pseudocoded.

	public static void main(String [] args)
	{
		file = getFile()	//prompts user for and reads in the text file.  the text file is then tied to a variable
		header = getHeader(file)	//grabs the first line in the text file and saves it to a string
		sequence = getSequence(file) //stores the rest of the lines in one string
		
		//4 bases: A, C, G, T
		//count the bases in the sequence and save them to their respective variables
		a = countBase(sequence, 'A')
		c = countBase(sequence, 'C')
		g = countBase(sequence, 'G')
		t = countBase(sequence, 'T')
		
		//determine the percentages of each base
		aPct = # of a's/total bases * 100
		cPct = # of c's/total bases * 100
		gPct = # of g's/total bases * 100
		tPct = # of t's/total bases * 100
		
		printDNAHistogram(header, aPct, cPct, gPct, tPct) //print histogram of results
	}
	public static String getHeader(File fastaFile)
	{
		//grab the first line of text from fastaFile (the header)
		//remove the leading '>'
		//save the header (minus the '>' character) into a string variable
		
		//return the  string variable
	}
	public static int countBase(String sequence, char base)
	{
		//create an integer variable to store the number of a specific base in the sequence
		
		//loop until we have gone through each character in the sequence
			//if the current character is equal to our base
				//increment the integer variable
		//once we have gone through each character in the sequence...
		
		//return the integer variable
	}
	public static void printDNAHistogram(String header, int aPct, int cPct, int gPct, int tPct)
	{
		//print out the header
		
		//call print bar for each base (A, C, G, T)
		//So for each of the following you're sending... a string corresponding to the base, the integer percentage of the sequence
		//that the base represents, and the character to be displayed as the "bars" of the histogram
		printBar("A", aPct, '=')	
		printBar("C", cPct, '=')
		printBar("G", gPct, '=')
		printBar("T", tPct, '=')
	}
	public static void printBar(String label, int percent, char barSymbol)
	{
		//remember all of these should be printed on the same line
		
		//print out the label
		
		//loop for an x # of times depending on percent's value
			//print out the bar symbol (in this case: '=')
		
		//print out the percentage using percent
	}


Diagram

In addition to the pseudocode, using a diagram or some sort of visual aid can help you further understand your code. While there are many ways to create a diagram, the key thing is to make it as simple as possible; avoid any unnecessary details. In the following example, I created a diagram of how the methods in the program are connected.

[Figure: Diagram.jpg, a diagram of how the methods in the program are connected]

Once this is finished, we can now start coding.

Step 1: Start with the main() method

Since the main() method is the first method run by the program, it is a good place to start your coding. However, if you look at the pseudocode for the main method, you can see that most of the work is done by other methods called by main(). So, using the pseudocode for main() as a basis, we can start coding the other methods. The first piece of code that appears in the main() method is a call to getFile().

Step 2: Understanding the getFile() method

Since the code for the getFile() method is already provided for us, all we really need to do for this method is to understand how it works and then use it.

Let's start with the variable declarations:

	JFileChooser chooser = new JFileChooser();
	File fastaFile = null;
	int returnValue;

JFileChooser is a class included in Java. It provides a simple mechanism for the user to choose a file so that the program can load it. The File variable is where we store the file that the user selected using JFileChooser. This variable allows us to open the file and read its contents. int returnValue will be used to hold the "success" or "fail" code returned by JFileChooser.

The next thing done is to open the file dialog, allowing the user to choose their FASTA input file.

		// show the open file dialog
		returnValue = chooser.showOpenDialog(null);

When the dialog closes, showOpenDialog() returns an int code describing how it was closed, and we store that code in returnValue. If the user pressed the OK button, the code is JFileChooser.APPROVE_OPTION.

Next, we check if the OK button was pressed. If it was, the selected file is saved into the File fastaFile variable; otherwise fastaFile stays null.

		// check to see that the user clicked the OK button
		if (returnValue == JFileChooser.APPROVE_OPTION) {
			// get the selected file
			fastaFile = chooser.getSelectedFile();
		}

Once the work is done in the method, we return the File fastaFile variable to the main() method.

Step 3: Creating the getHeader() method

Looking back again at the main() method pseudocode, the next line that appears is a call to getHeader(). The method signature is given to us in the assignment:

public static String getHeader(File fastaFile)

This method will return a String variable and take in a File variable for processing. Taking a look at the pseudocode for the getHeader() method, the first thing that needs to be done is to grab the first line of text (the header) from fastaFile.

Start off by creating a String variable to store the header.

		String header = "";

As was mentioned in the Problem section, try/catch is used to deal with exceptions that may occur while the program runs. It is required here because creating a Scanner from a File can throw a FileNotFoundException. The next set of code will appear inside the try block as follows:

		try {
			//insert code in the try block
		} catch (FileNotFoundException e) {
			e.printStackTrace();
		}

Next, we want to tie the File fastaFile variable to Scanner, allowing us to view and grab its contents.

			Scanner scanner = new Scanner(fastaFile);

After that, using Scanner, we need to grab the header, which appears on the first line. The header needs to be saved into the String header variable. To do this, we use Scanner's nextLine() method. Each call returns the current line of text and advances to the next line. For example, if I use nextLine() for the first time, it will grab the first line. If I use it again, it will grab the second line. This keeps going until there are no lines left. This code will also appear inside the try block.

			header = scanner.nextLine();

We now have the header saved to a String variable. Looking at the pseudocode again, the next step is to remove the leading '>' character from the string. To do this we use the built-in Java String method replaceFirst(). This method replaces the first match of its first argument (which is treated as a regular expression) with the string given in the second argument. Note that this code also appears within the try block.

			header = header.replaceFirst(">", "");

More specifically, this code replaces the first appearance of ">" in the header with "" (an empty string).
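Since replaceFirst() removes the first ">" wherever it occurs in the line, a possible alternative is to check for the marker with startsWith() and drop it with substring(1). The sketch below compares the two approaches; HeaderMarkerSketch, withReplaceFirst and stripMarker are hypothetical names used only for illustration.

```java
// Hypothetical sketch comparing two ways to drop the leading '>'.
class HeaderMarkerSketch {

	// The approach used in this step: remove the first ">" found in the line.
	static String withReplaceFirst(String headerLine) {
		return headerLine.replaceFirst(">", "");
	}

	// Alternative (an assumption, not the case study's code): remove the
	// first character only when it really is '>'.
	static String stripMarker(String headerLine) {
		if (headerLine.startsWith(">")) {
			return headerLine.substring(1);
		}
		return headerLine;
	}

	public static void main(String[] args) {
		String line = ">gi|21071042|ref|NM_000193.2| Homo sapiens sonic hedgehog (SHH), mRNA";
		// For a well-formed header line both methods return the same text.
		System.out.println(withReplaceFirst(line));
		System.out.println(stripMarker(line));
	}
}
```

For the well-formed FASTA files assumed by this assignment the two behave identically; they differ only when the line does not begin with ">".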

Now that we have the header saved to a string and removed the leading ">" character, we can return the String header to the caller.

		return header;
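Putting the pieces of this step together, getHeader() might be assembled as in the following sketch. The wrapper class GetHeaderSketch, the writeTempFile helper and the call to scanner.close() are additions made for this example; the sketch also assumes, as the assignment does, that the file contains at least a header line.

```java
import java.io.File;
import java.io.FileNotFoundException;
import java.io.IOException;
import java.io.PrintWriter;
import java.io.UncheckedIOException;
import java.util.Scanner;

// Hypothetical wrapper class for the assembled method.
class GetHeaderSketch {

	public static String getHeader(File fastaFile) {
		String header = "";
		try {
			Scanner scanner = new Scanner(fastaFile);
			header = scanner.nextLine();           // the first line is the header
			header = header.replaceFirst(">", ""); // drop the leading '>'
			scanner.close();                       // addition: release the file
		} catch (FileNotFoundException e) {
			e.printStackTrace();
		}
		return header;
	}

	// Assumption for the demo only: write some text to a temporary file.
	static File writeTempFile(String contents) {
		try {
			File temp = File.createTempFile("header", ".fasta");
			temp.deleteOnExit();
			try (PrintWriter out = new PrintWriter(temp)) {
				out.print(contents);
			}
			return temp;
		} catch (IOException e) {
			throw new UncheckedIOException(e);
		}
	}

	public static void main(String[] args) {
		File demo = writeTempFile(">demo header\nACGT\n");
		System.out.println(getHeader(demo)); // prints the header without '>'
	}
}
```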

Step 4: Creating the getSequence() method

After getHeader() is called in the main method, getSequence() is called. getSequence() should get the sequence data of the FASTA record, using a Scanner object to grab all the lines except the first one.

The method signature is as follows:

	public static String getSequence(File fastaFile)

This method takes in a File variable and returns back a String.

Start off by creating a String to store the FASTA sequence.

		String sequence = "";

Since we will be using Scanner just like the previous example, a try/catch block will need to be used. The next set of code will again appear inside the try block.

		try {
			//insert code in the try block
		} catch (FileNotFoundException e) {
			e.printStackTrace();
		}

We want to tie the File fastaFile variable to Scanner in order to allow us to view and grab its contents. We also need to create a String variable to temporarily store the text that is read in.

			Scanner scanner = new Scanner(fastaFile);
			String line;

Next, we want to iterate through the text file and save each line to String sequence as one long string.

			while (scanner.hasNextLine()) { //loop until all the lines have been hit. ie: if there is still a next line, loop.
				line = scanner.nextLine(); //grab the next line and store it in a temporary string variable
				
				//filter out/skip the first line (the header)
				//if the current line is not the header, continue...
				if (!line.startsWith(">")) { //Can be read as... if the line does not start with the character ">", continue
					sequence += line; //Add the current line to the end of the sequence string
				}
			}	

Once we have the whole sequence saved into one string, return String sequence to the caller.

		return sequence;
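Assembled from the pieces above, getSequence() might look like the following sketch. The wrapper class GetSequenceSketch, the writeTempFile helper and the call to scanner.close() are additions for this example and are not part of the assignment.

```java
import java.io.File;
import java.io.FileNotFoundException;
import java.io.IOException;
import java.io.PrintWriter;
import java.io.UncheckedIOException;
import java.util.Scanner;

// Hypothetical wrapper class for the assembled method.
class GetSequenceSketch {

	public static String getSequence(File fastaFile) {
		String sequence = "";
		try {
			Scanner scanner = new Scanner(fastaFile);
			String line;
			while (scanner.hasNextLine()) {
				line = scanner.nextLine();
				// skip the header line; keep every sequence line
				if (!line.startsWith(">")) {
					sequence += line;
				}
			}
			scanner.close(); // addition: release the file
		} catch (FileNotFoundException e) {
			e.printStackTrace();
		}
		return sequence;
	}

	// Assumption for the demo only: write some text to a temporary file.
	static File writeTempFile(String contents) {
		try {
			File temp = File.createTempFile("sequence", ".fasta");
			temp.deleteOnExit();
			try (PrintWriter out = new PrintWriter(temp)) {
				out.print(contents);
			}
			return temp;
		} catch (IOException e) {
			throw new UncheckedIOException(e);
		}
	}

	public static void main(String[] args) {
		File demo = writeTempFile(">demo header\nACGT\nTTGA\n");
		System.out.println(getSequence(demo)); // the two sequence lines joined together
	}
}
```

For very long sequences, a StringBuilder could replace the repeated += on a String; that is an optional change and not something the assignment requires.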

Step 5: Creating the printBar() method

If you look at the pseudocode, you can see that the last method called in main() is printDNAHistogram(). This method then calls printBar() four times. For an easier picture, take a look at the diagram. Since printBar() does not use any of the other required methods, it is a good next step for your code. printBar() prints out a single bar for a histogram.

The method signature is:

	public static void printBar(String label, int percent, char barSymbol)

The first thing we need to do in this method is to print out the label using String label.

		System.out.print(label + ": ");

Next, we need a loop that runs percent times. On each pass through the loop we print out the symbol stored in char barSymbol.

		for (int i = 0; i < percent; i++) { //loop from 0 until we hit percent's value
			System.out.print(barSymbol); //print out barSymbol
		}

Afterwards, we need to print out the actual percentage as a value. Since the percentage calculation was already done in the main() method, we simply add a "%" character and print.

		System.out.println(" (" + percent + "%)");

As an example, the output would look something like this:

A: ========================= (25%)

int percent for this example is equal to 25. This is the same as the number of bar symbols and the percentage that appear.
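One way to check a bar without reading console output is to build the line as a String first and print it afterwards. The sketch below is a variation on the code above, not the required implementation; PrintBarSketch and formatBar are hypothetical names.

```java
// Hypothetical variation: build the bar as a String, then print it.
class PrintBarSketch {

	// Returns the text of one histogram bar, e.g. label, symbols, percentage.
	static String formatBar(String label, int percent, char barSymbol) {
		StringBuilder line = new StringBuilder(label + ": ");
		for (int i = 0; i < percent; i++) {
			line.append(barSymbol); // one symbol per percentage point
		}
		line.append(" (").append(percent).append("%)");
		return line.toString();
	}

	// Same signature as the required method; it only prints the built line.
	public static void printBar(String label, int percent, char barSymbol) {
		System.out.println(formatBar(label, percent, barSymbol));
	}

	public static void main(String[] args) {
		printBar("A", 25, '=');
	}
}
```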

Step 6: Creating the printDNAHistogram() method

Now that we have the printBar() method created, we can now use it in the printDNAHistogram() method.

The signature for this method is as follows:

	public static void printDNAHistogram(String header, int aPct, int cPct, int gPct, int tPct) 

int aPct corresponds to the percent for base A, cPct for base C, etc.

If you take a look at the pseudocode for this method, the header will need to be printed out first.

		System.out.println(header);

Looking again at the pseudocode, you will notice that printBar is called four times. The input variables for these method calls also correspond to each base (A, G, C, T).

The code will look something like:

		//the arguments, in order, are: the base, the base percentage, the character to use for printing the bar
		printBar("A", aPct, '=');
		printBar("G", gPct, '=');
		printBar("C", cPct, '=');
		printBar("T", tPct, '=');

Step 7: Code the main method

Now that the rest of the required methods have been created, we can now easily code the main() method.

If you look at the pseudocode, most of the work has already been done for you!

We start off by using the getFile(), getHeader() and getSequence() methods.

		File file = getFile(); //prompt the user for the text/FASTA file and load it into the program
		String header = getHeader(file); //grab the header and save it to a string
		String sequence = getSequence(file); //grab the FASTA sequence and save it to a string

After those are put into the program, we need to figure out the total number of characters/bases in the sequence for later calculations.

		int totalBases = sequence.length(); //counts the number of characters in the string and saves it to an integer

Next, we use the countBase() method for each specific base (A, C, G, T) to count the number of times it appears in the sequence.

		int a = countBase(sequence, 'A'); //send the sequence String and A base
		int c = countBase(sequence, 'C'); //send the sequence String and C base
		int g = countBase(sequence, 'G'); //send the sequence String and G base
		int t = countBase(sequence, 'T'); //send the sequence String and T base
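The steps in this walkthrough do not include one for countBase() itself, so here is a minimal sketch that follows the pseudocode from Step 0. The wrapper class CountBaseSketch and the short test sequence are assumptions made for illustration; the method signature is the one given in the assignment.

```java
// Hypothetical wrapper class for a countBase() sketch.
class CountBaseSketch {

	// Count how many times base occurs in sequence.
	public static int countBase(String sequence, char base) {
		int count = 0;
		// look at each character in the sequence once
		for (int i = 0; i < sequence.length(); i++) {
			if (sequence.charAt(i) == base) {
				count++; // found one more occurrence of the base
			}
		}
		return count;
	}

	public static void main(String[] args) {
		String sequence = "GCGAGGCAGC"; // first ten bases of the example record
		System.out.println("A: " + countBase(sequence, 'A'));
		System.out.println("C: " + countBase(sequence, 'C'));
		System.out.println("G: " + countBase(sequence, 'G'));
		System.out.println("T: " + countBase(sequence, 'T'));
	}
}
```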

After, we need to determine the percentages for each base.

		int aPct = (int)(Math.round(((double) a) / totalBases * 100));
		int cPct = (int)(Math.round(((double) c) / totalBases * 100));
		int gPct = (int)(Math.round(((double) g) / totalBases * 100));
		int tPct = (int)(Math.round(((double) t) / totalBases * 100));

Using the first line as an example, I will explain how the calculation works. The count a is cast to a double so that the division keeps its decimal places; dividing two ints would discard them. Next, we divide by the total number of bases and multiply by 100 to get the percentage. Math.round() then rounds that number to the nearest whole number, and the (int) cast converts the result, which Math.round() returns as a long, back to an int.
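To see the rounding in action, the calculation can be wrapped in a small helper and tried on a few values. This is only a sketch: PercentSketch and toPercent are hypothetical names, and the guard against an empty sequence is an addition that the assignment does not ask for.

```java
// Hypothetical helper wrapping the percentage calculation.
class PercentSketch {

	// Percentage of totalBases that count represents, rounded to the nearest int.
	static int toPercent(int count, int totalBases) {
		if (totalBases == 0) {
			return 0; // addition: avoid dividing by zero for an empty sequence
		}
		return (int) Math.round(((double) count) / totalBases * 100);
	}

	public static void main(String[] args) {
		System.out.println(toPercent(1, 3)); // 33.33... rounds down to 33
		System.out.println(toPercent(2, 3)); // 66.66... rounds up to 67
		System.out.println(toPercent(1, 8)); // 12.5 rounds to 13
	}
}
```

Because each percentage is rounded separately, the four values do not always add up to exactly 100. The sample output in Step 8, for instance, totals 101.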

Once we have all the percentage calculations for each base, we can now call printDNAHistogram() to print the output.

		printDNAHistogram(header, aPct, cPct, gPct, tPct);

Step 8: Test the program

While it is not covered in this case study, it's always a good idea to test your code frequently. This allows you to fix small problems before they become huge ones. Waiting until the end to test will leave you with a lot of errors and frustration.

Run your program and test out both output files and ensure you do not get any errors. Next, take a look at the output and make sure everything looks up to par.

Your output should look something like this:

gi|21071042|ref|NM_000193.2| Homo sapiens sonic hedgehog (SHH), mRNA
A: ========================= (25%)
G: ================================= (33%)
C: ============================= (29%)
T: ============== (14%)

Congratulations, you have finished the program!

Closing Remarks

Planning before you start coding is a really good habit to pick up. It helps you understand exactly how you want to go about your coding; from structure to specific lines of code. This extra step will end up saving you many hours of hair-pulling and frustration. You should also be testing your code bit by bit - do not wait until the end to test. Not only will you catch small errors and prevent them from becoming big ones, you will also save yourself many hours of problem solving. By doing these things, you are well on your way to becoming a master coder.

Code

Solution Code
