Difference between revisions of "Nucleotides"

From CompSciWiki
 
(7 intermediate revisions by the same user not shown)
Line 195: Line 195:
 
|Solution=
 
|Solution=
  
I will be using a step-by-step process to guide you to the problem solution.  The first step involves planning your work and the subsequent steps involve creating each method separately.  While the solution I provide will work, keep in mind that there may be other alternative solutions to the problem as well.
+
I will be using a step-by-step process to guide you to the problem solution.  The first step involves planning your work and the subsequent steps involve creating each method separately.
  
 
==Step 0: Before you start coding... PLAN!==
 
==Step 0: Before you start coding... PLAN!==
<p>A common problem that many students have when programming is the lack of planning before the actual coding.  Many students jump right in and start coding away without any real plan to follow.  While that method may work for simple assignments, it can become a frustrating experience as assignment difficulty and complexity increases.  Planning allows you to get a good understanding of the problem and acts as a baseline of how you would code the solution.</p>
+
<p>A common problem that many students have when programming is the lack of planning before the actual coding begins.  Many students jump right in and start coding away without any real plan to follow.  While this method may work for simple assignments, it can become a frustrating experience as assignment difficulty and complexity increases.  Planning allows you to get a good grasp of the problem and gives you a baseline to follow when coding.</p>
 
<br/>
 
<br/>
 
===Figure out exactly what the problem is asking you to do===
 
===Figure out exactly what the problem is asking you to do===
<p>The first step to understanding how to start coding something is to figure out what the problem is asking you to do.  Since you can't really code without knowing what you have to do.
+
<p>The first step to coding is to figure out what the problem is asking you to do.  It helps to read the assignment multiple times and mark down important points.</p>
It helps to read the assignment multiple times and mark down important points.</p>
+
 
<br/>
 
<br/>
 
In this problem, you are asked to:
 
In this problem, you are asked to:
Line 221: Line 220:
 
===Pseudocode===
 
===Pseudocode===
 
<p>
 
<p>
Pseudocode can be thought of as an outline or structure of your code without real coding.  It is meant to be readable and can serve as the basis for the actual code.
+
Pseudocode can be thought of as an outline or structure of your code.  It is meant to be readable and easy to follow.
By pseudocoding, we can take the required methods in the assignment and figure out the logic or algorithm we need to use in the real code.  This is also an easy way to to see how methods interact with one another.
+
By pseudocoding, we can take the required methods in the assignment and figure out the logic we need to use in the actual code.  This is also an easy way to see how methods interact with one another.
  
The key thing is to carefully what is needed for each method and apply it to your pseudocode.  Please refer back to the Problem section for the specific details for each method.
+
The key thing is to understand what is needed for each method and apply it to your pseudocode.  Please refer back to the Problem section for the specific details for each method.
 
Since the exact code has already been provided for the <b>getFile()</b> method, it does not need to be pseudocoded.
 
Since the exact code has already been provided for the <b>getFile()</b> method, it does not need to be pseudocoded.
  
Line 302: Line 301:
 
<br/>
 
<br/>
 
===Diagram===
 
===Diagram===
<p>In addition to the pseudocode, using a diagram or some sort of visual aid can help you further understand how your code fits together.
+
<p>In addition to the pseudocode, using a diagram or some sort of visual aid can help you further understand your code.
While there are many ways to create your diagram, the key thing is to make it as simple as possible;  avoid any unecessary details.
+
While there are many ways to create a diagram, the key thing is to make it as simple as possible;  avoid any unnecessary details.
<br/>
+
 
In the following example, I have created a diagram of when methods are used in the program, and also the methods that they call.</p>
+
In the following example, I created a diagram of how the methods in the program are connected.</p>
<br/>
+
 
<p class="center">[[Image:Diagram.jpg]]</p>
 
<p class="center">[[Image:Diagram.jpg]]</p>
<br/>
+
Once this is finished, we can now start coding.
==Step 1: Start with the main method==
+
==Step 1: Start with the main() method==
<p>Since the main method is the first thing that is run by the program, this is a good place to start your coding.  However, if you look at the pseudocode for the main method and diagram that we have created, you can see that a lot of the work is done by the other methods.  So using the pseudocode for main as a basis, we can start coding the other methods.  I will be coding in a top-down fashion.  The first piece of code that appears in our main method pseudocode is a method call to getFile()</p>
+
<p>Since the <b>main()</b> method is the first method run by the program, it is a good place to start your coding.   
 +
However, if you look at the pseudocode for the main method, you can see that most of the work is done by other methods called by <b>main()</b>.   
 +
So using the pseudocode for <b>main()</b> as a basis, we can start coding the other methods.   
 +
The first piece of code that appears in the <b>main()</b> method is a call to <b>getFile()</b></p>
  
 
==Step 2: Understanding the getFile() method==
 
==Step 2: Understanding the getFile() method==
<p>Since the code for the getFile() method was already provided for us, all we really need to do for this method is to use and understand how it works.</p>
+
<p>Since the code for the <b>getFile()</b> method is already provided for us, all we really need to do for this method is to understand how it works and then use it.</p>
  
 
Let's start with the variable declarations:
 
Let's start with the variable declarations:
Line 320: Line 321:
 
int returnValue;
 
int returnValue;
 
</pre>
 
</pre>
The JFileChooser is a class included in Java. It provides a simple mechanism for the user to choose a file and allow it to be loaded in to the program.  
+
<b>JFileChooser</b> is a class included in Java. It provides a simple mechanism for the user to choose a file and allow it to be loaded in to the program.  
The File variable is where we store the file that was read in using JFileChooser.  This variable allows us to grab and manipulate the contents.
+
The <b>File</b> variable is where we store the file that was read in using JFileChooser.  This variable allows us to grab and manipulate the contents.
int return value will be used to grab the "success" or "fail" code when using JFileChooser.
+
<b>int returnValue</b> will be used to grab the "success" or "fail" code when using <b>JFileChooser</b>.
  
The next thing done is to open the file dialog to allow the user to choose their fasta or input file from their local directories.
+
The next thing done is to open the file dialog, allowing the user to choose their FASTA input file.
 
<pre>
 
<pre>
 
// show the open file dialog
 
// show the open file dialog
 
returnValue = chooser.showOpenDialog(null);
 
returnValue = chooser.showOpenDialog(null);
 
</pre>
 
</pre>
Using the returnValue variable created earlier, if the user presses the ok button in the prompt that appears, it will grab that value and store it in returnValue.
+
If the user presses the OK button in the dialog that appears, <b>showOpenDialog()</b> returns the value associated with that button (<b>JFileChooser.APPROVE_OPTION</b>), which is stored in <b>returnValue</b>.
  
Next, we check if the ok button was pressed.  If it was pressed then we save the file that was read in to the File fastaFile variable.
+
Next, we check if the OK button was pressed.  If it was, the selected file is saved into the <b>File fastaFile</b> variable.
 
<pre>
 
<pre>
 
// check to see that the user clicked the OK button
 
// check to see that the user clicked the OK button
Line 340: Line 341:
 
</pre>
 
</pre>
  
Once the work is done in the method, we return the File fastaFile variable to the method that called it.  In this case, the main method.
+
Once the work is done in the method, we return the <b>File fastaFile</b> variable to the <b>main()</b> method.
  
 
==Step 3: Creating the getHeader() method==
 
==Step 3: Creating the getHeader() method==
<p>Looking back again at the main method pseudocode, you will see that the next thing being done is a method call of getHeader().
+
<p>Looking back again at the <b>main()</b> method pseudocode, the next line that appears is a call to <b>getHeader()</b>.
  
 
The method signature is given to us in the assignment:</p>
 
The method signature is given to us in the assignment:</p>
Line 349: Line 350:
 
public static String getHeader(File fastaFile)
 
public static String getHeader(File fastaFile)
 
</pre>
 
</pre>
So what this means is, this method will return a String variable to the caller and will take in a File variable for processing.
+
This method will return a <b>String</b> variable and take in a <b>File</b> variable for processing.
Taking a look at the pseudocode for the getHeader() method, the first thing that needs to be done is to grab the first line of text (the header) from the fastaFile.
+
Taking a look at the pseudocode for the <b>getHeader()</b> method, the first thing that needs to be done is to grab the first line of text (the header) from <b>fastaFile</b>.
  
With that said, start off by creating a String variable to store the header.
+
Start off by creating a <b>String</b> variable to store the header.
 
<pre>
 
<pre>
 
String header = "";
 
String header = "";
 
</pre>
 
</pre>
  
As was mentioned in the Problem section, the try/catch is used to deal with possible runtime exceptions.  This is a requirement when using anything with the Scanner class.
+
As was mentioned in the Problem section, the try/catch is used to deal with possible runtime exceptions.  It is required whenever a <b>Scanner</b> is created from a <b>File</b>, because that constructor can throw a FileNotFoundException.
The next set of code will then appear inside the try block as follows:
+
The next set of code will appear inside the try block as follows:
 
<pre>
 
<pre>
 
try {
 
try {
Line 367: Line 368:
 
</pre>
 
</pre>
  
Next, we want to tie the File fastaFile variable to Scanner in order to allow us to view and grab it's contents.
+
Next, we want to tie the <b>File fastaFile</b> variable to <b>Scanner</b>, allowing us to view and grab its contents.
Just a reminder that this code should be inside the try block.
+
 
<pre>
 
<pre>
 
Scanner scanner = new Scanner(fastaFile);
 
Scanner scanner = new Scanner(fastaFile);
 
</pre>
 
</pre>
  
After that, using scanner, we need to grab the header, which also happens to be on the first line.  The header needs to be saved into the header String variable.  To do this, we use Scanner's nextLine() method.
+
After that, using <b>Scanner</b>, we need to grab the header, which appears on the first line.  The header needs to be saved into the <b>String header</b> variable.  To do this, we use Scanner's <b>nextLine()</b> method.
What this does is it grabs the line of text, and iterates to the next one every time it is used.  For example, if I use nextLine() for the first time, it will grab the first line.
+
Each call returns the current line of text and advances to the next line.  For example, if I use <b>nextLine()</b> for the first time, it will grab the first line.
If I use it again, it will grab the second line. Again, this code will also appear inside the try block.
+
If I use it again, it will grab the second line. This keeps going until there are no lines left. This code will also appear inside the try block.
  
 
<pre>
 
<pre>
Line 381: Line 381:
 
</pre>
 
</pre>
  
Now we have the header saved to string, looking at the pseudocode again, that next step is to remove te '<' character from the string variable.
+
We now have the header saved to a String variable. Looking at the pseudocode again, the next step is to remove the '>' character from the String variable.
To do this we use the built-in Java String function called replaceFirst().  This function replaces the first appearance fo the first argument, and replaces it with what was specified in the second argument.
+
To do this we use the built-in Java String method called <b>replaceFirst()</b>.  This method finds the first match of its first argument (which is treated as a regular expression) and replaces it with what was specified in the second argument.
 
Note that this code also appears within the try block.
 
Note that this code also appears within the try block.
 
<pre>
 
<pre>
 
header = header.replaceFirst(">", "");
 
header = header.replaceFirst(">", "");
 
</pre>
 
</pre>
What this code will do is replace the first appearance of the character ">", and replace it with "".
+
More specifically, this code replaces the first appearance of the character ">" with "" (an empty string).
  
Now that we have the header saved to a string and removed the leading ">" character we can return the String header to the caller.
+
Now that we have the header saved to a string and removed the leading ">" character, we can return the <b>String header</b> to the caller.
 
<pre>
 
<pre>
 
return header;
 
return header;
 
</pre>
 
</pre>
 
  
 
==Step 4: Creating the getSequence() method==
 
==Step 4: Creating the getSequence() method==
After the getHeader() method is called in the main method, the getSequence() is called right after.  What getSequence() should do is get the sequence data of the FASTA record using a Scanner object to get all lines except the first line.
+
After the <b>getHeader()</b> method is called in the main method, <b>getSequence()</b> is called right after.  What <b>getSequence()</b> should do is get the sequence data of the FASTA record, using a <b>Scanner</b> object to grab all the lines except the first one.
  
 
The method signature is as follows:
 
The method signature is as follows:
Line 402: Line 401:
 
public static String getSequence(File fastaFile)
 
public static String getSequence(File fastaFile)
 
</pre>
 
</pre>
So this method takes in a File variable and returns back a String.
+
This method takes in a '''File''' variable and returns back a '''String'''.
  
Start off by creating a String to store the FASTA sequence.
+
Start off by creating a '''String''' to store the FASTA sequence.
 
<pre>
 
<pre>
 
String sequence = "";
 
String sequence = "";
 
</pre>
 
</pre>
  
Since we will be using Scanner, just like the previous example, a try/catch block will need to be used.  The next set of code will again appear inside the try block.
+
Since we will be using '''Scanner''' just like the previous example, a try/catch block will need to be used.  The next set of code will again appear inside the try block.
 
<pre>
 
<pre>
 
try {
 
try {
Line 420: Line 419:
 
</pre>
 
</pre>
  
We want to tie the File fastaFile variable to Scanner in order to allow us to view and grab it's contents.  We also need to create a String variable to temporarily store the text that is read in.
+
We want to tie the '''File fastaFile''' variable to '''Scanner''' in order to allow us to view and grab its contents.  We also need to create a '''String''' variable to temporarily store the text that is read in.
 
<pre>
 
<pre>
 
Scanner scanner = new Scanner(fastaFile);
 
Scanner scanner = new Scanner(fastaFile);
Line 426: Line 425:
 
</pre>
 
</pre>
  
Next, we want to iterate through the text file and save each line to String sequence as one long string.
+
Next, we want to iterate through the text file and save each line to '''String sequence''' as one long string.
 
<pre>
 
<pre>
 
while (scanner.hasNextLine()) { //loop until all the lines have been hit. ie: if there is still a next line, loop.
 
while (scanner.hasNextLine()) { //loop until all the lines have been hit. ie: if there is still a next line, loop.
Line 439: Line 438:
 
</pre>
 
</pre>
  
Once we have the whole sequence saved into one string, return String sequence to the caller.
+
Once we have the whole sequence saved into one string, return '''String sequence''' to the caller.
 
<pre>
 
<pre>
 
return sequence;
 
return sequence;
Line 445: Line 444:
  
 
==Step 5: Creating the printBar() method==
 
==Step 5: Creating the printBar() method==
If you look at the pseudocode, you can see that the last method called in the main is <b>printDNAHistogram()</b>.  This in turn calls on <b>printBar()</b> four times.  For an easier picture, take a look at the diagram. Since printBar() is not using any other required methods, this is a good next step to for your code.  printBar() prints out a single bar for a histogram.
+
If you look at the pseudocode, you can see that the last method called in <b>main()</b> is <b>printDNAHistogram()</b>.  This method then calls <b>printBar()</b> four times.  For an easier picture, take a look at the diagram. Since '''printBar()''' does not use any other required methods, it is a good next step for your code.  '''printBar()''' prints out a single bar for a histogram.
  
 
The method signature is:
 
The method signature is:
Line 452: Line 451:
 
</pre>
 
</pre>
  
The first thing we need to do in this method is to print out the label using String label.
+
The first thing we need to do in this method is to print out the label using '''String label'''.
 
<pre>
 
<pre>
 
System.out.print(label + ": ");
 
System.out.print(label + ": ");
 
</pre>
 
</pre>
  
Next, we need to iterate in a loop based on the number tied to int percent.  In this loop we print out the symbol which is indicated by char barSymbol.
+
Next, we need to iterate a loop based on the number tied to '''int percent'''.  In this loop we print out the symbol which is indicated by '''char barSymbol'''.
 
<pre>
 
<pre>
 
for (int i = 0; i < percent; i++) { //loop from 0 until we hit percent's value
 
for (int i = 0; i < percent; i++) { //loop from 0 until we hit percent's value
Line 464: Line 463:
 
</pre>
 
</pre>
  
Afterwards we need to print out the actual percentage as a value.  Since the calculation to get the percentage was already done in the main method, we simply add a "%" character and print.
+
Afterwards, we need to print out the actual percentage as a value.  Since the percentage calculation was already done in the '''main()''' method, we simply add a "%" character and print.
 
<pre>
 
<pre>
 
System.out.println(" (" + percent + "%)");
 
System.out.println(" (" + percent + "%)");
 
</pre>
 
</pre>
  
As an example, the actual output would look something like this:
+
As an example, the output would look something like this:
 
<pre>
 
<pre>
 
A: ========================= (25%)
 
A: ========================= (25%)
 
</pre>
 
</pre>
So int percentage for this example was equal to 25. This is equal to the number of bars and percentage that appear.
+
'''int percent''' for this example is equal to 25. This is the same as the number of bar symbols printed and the percentage shown at the end of the line.
  
 
==Step 6: Creating the printDNAHistogram() method==
 
==Step 6: Creating the printDNAHistogram() method==
Now that we have the printBar() method created, we can now use it in the printDNAHistogram() method.
+
Now that we have the '''printBar()''' method created, we can now use it in the '''printDNAHistogram()''' method.
  
 
The signature for this method is as follows:
 
The signature for this method is as follows:
 
<pre> public static void printDNAHistogram(String header, int aPct, int cPct, int gPct, int tPct) </pre>
 
<pre> public static void printDNAHistogram(String header, int aPct, int cPct, int gPct, int tPct) </pre>
int aPct corresponds to the percent for base A, bPct for base B, etc.
+
'''int aPct''' corresponds to the percent for base A, '''cPct''' for base C, etc.
  
 
If you take a look at the pseudocode for this method, the header will need to be printed out first.
 
If you take a look at the pseudocode for this method, the header will need to be printed out first.
Line 490: Line 489:
 
The input variables for these method calls also correspond to each base (A, G, C, T).   
 
The input variables for these method calls also correspond to each base (A, G, C, T).   
  
So the code will look something like:
+
The code will look something like:
 
<pre>
 
<pre>
 
//the first input variables in order are: the base, the base percentage, character bar to use for printing
 
//the first input variables in order are: the base, the base percentage, character bar to use for printing
Line 500: Line 499:
  
 
==Step 7: Code the main method==
 
==Step 7: Code the main method==
Now that the rest of the required methods have been created, we can now easily code the main method.
+
Now that the rest of the required methods have been created, we can now easily code the '''main()''' method.
  
If you look at the pseudocode, most of the work has already been done for you! I hope that was enough to convince you that planning before coding is useful.  :)
+
If you look at the pseudocode, most of the work has already been done for you!
  
We start off by using the getFile(), getheader() and getSequence() methods.
+
We start off by using the '''getFile()''', '''getHeader()''' and '''getSequence()''' methods.
 
<pre>
 
<pre>
 
File file = getFile(); //prompt the user for the text/FASTA file and load it into the program
 
File file = getFile(); //prompt the user for the text/FASTA file and load it into the program
Line 516: Line 515:
 
</pre>
 
</pre>
  
Next, we use the countBase() method for each specific Base (A, C, G, T) to count the number of the specific base in the sequence.
+
Next, we use the '''countBase()''' method for each specific Base (A, C, G, T) to count the number of the bases in the sequence.
 
<pre>
 
<pre>
 
int a = countBase(sequence, 'A'); //send the sequence String and A base
 
int a = countBase(sequence, 'A'); //send the sequence String and A base
Line 531: Line 530:
 
int tPct = (int)(Math.round(((double) t) / totalBases * 100));
 
int tPct = (int)(Math.round(((double) t) / totalBases * 100));
 
</pre>
 
</pre>
I will be using the first line as an example to explain how the calculation works.
+
Using the first line as an example, I will explain how the calculation works.
(double) a must be casted as a double in order to allow for dealing with decimals.
+
a must be cast to a double, as in (double) a, to allow for decimal calculations; otherwise Java would perform integer division and discard the decimal part.
Next, we divide that by the total number of bases and multiply that by 100 to get the percentage.
+
Next, we divide that by the total number of bases and multiply by 100 to get the percentage.
 
Math.round is then used to round that number to the nearest whole number.  (int)(current value) will cast this value back to an int.
 
Math.round is then used to round that number to the nearest whole number.  (int)(current value) will cast this value back to an int.
  
 
Once we have all the percentage calculations for each base, we can now call printDNAHistogram() to
 
Once we have all the percentage calculations for each base, we can now call printDNAHistogram() to
print the output needed. ie) the print bars
+
print the output.
 
<pre>
 
<pre>
 
printDNAHistogram(header, aPct, cPct, gPct, tPct);
 
printDNAHistogram(header, aPct, cPct, gPct, tPct);
Line 543: Line 542:
  
 
==Step 8: Test the program==
 
==Step 8: Test the program==
While it was not covered in this case study, its always a good idea to test your code while you code.  This allows you to fix small problems before they become huge ones.  Waiting until the end to test will
+
While it is not covered in this case study, it's always a good idea to test your code frequently.  This allows you to fix small problems before they become huge ones.  Waiting until the end to test will
 
leave you with a lot of errors and frustration.
 
leave you with a lot of errors and frustration.
  
Line 561: Line 560:
  
 
==Closing Remarks==
 
==Closing Remarks==
Remember that by planning your attack before coding is a really good habit to pick up.  It helps you understand exactly how you want to go about your coding; from structure to specific lines of code.  This extra step will end up saving you many hours of hair-pulling and frustration.  You should also be testing your code a bit at a time, instead of waiting until the end to test.  Not only will you catch small errors and prevent them from becoming big ones, you will also save yourself many hours of hair-pulling and frustration just like the planning.  By doing these things, you are well on your way to becoming a master coder.
+
Planning before you start coding is a really good habit to pick up.  It helps you understand exactly how you want to go about your coding; from structure to specific lines of code.  This extra step will end up saving you many hours of hair-pulling and frustration.  You should also be testing your code bit by bit - do not wait until the end to test.  Not only will you catch small errors and prevent them from becoming big ones, you will also save yourself many hours of problem solving.  By doing these things, you are well on your way to becoming a master coder.
  
 
|SolutionCode=
 
|SolutionCode=

Latest revision as of 03:31, 7 April 2011


Problem

DNA can be considered the source code of life. It contains the instructions for (most of the) life on Earth. Unlike digital computers that use binary to encode information and instructions in 1s and 0s, DNA has four possible nucleotides: Adenine, Cytosine, Guanine, and Thymine, represented by A, C, G and T, respectively.

Bioinformatics is a field of study that applies concepts from computer science and statistics to study molecular biology using computer software and hardware.

Biologists and bioinformaticians typically store genetic databases of DNA in a text‐based file format called FASTA. The FASTA file format is relatively simple:

  • Each FASTA‐formatted file starts with a header line. The first character on a header line is always the greater‐than character (‘>’)
  • The header line is followed by lines of sequence data (As, Cs, Gs, and Ts), with 80 (or fewer) characters per line

Example:

>gi|21071042|ref|NM_000193.2| Homo sapiens sonic hedgehog (SHH), mRNA
GCGAGGCAGCCAGCGAGGGAGAGAGCGAGCGGGCGAGCCGGAGCGAGGAAGGGAAAGCGCAAGAGAGAGC
GCACACGCACACACCCGCCGCGCGCACTCGCGCACGGACCCGCACGGGGACAGCTCGGAAGTCATCAGTT
CCATGGGCGAGATGCTGCTGCTGGCGAGATGTCTGCTGCTAGTCCTCGTCTCCTCGCTGCTGGTATGCTC
GGGACTGGCGTGCGGACCGGGCAGGGGGTTCGGGAAGAGGAGGCACCCCAAAAAGCTGACCCCTTTAGCC
TACAAGCAGTTTATCCCCAATGTGGCCGAGAAGACCCTAGGCGCCAGCGGAAGGTATGAAGGGAAGATCT
CCAGAAACTCCGAGCGATTTAAGGAACTCACCCCCAATTACAACCCCGACATCATATTTAAG

Write a program that will:

  • read a file containing one FASTA formatted sequence record
  • count the number of each of the four nucleotides (A,C,G and T) per sequence in the record
  • print a histogram displaying the genetic make‐up of the sequence record


Input:

Use JFileChooser to prompt the user to select a FASTA formatted file, and use Scanner to read the file selected by the user. Assume that the file chosen by the user will meet the standard FASTA file format described above, and that the sequence data will be all upper case characters.

So far in the course you’ve used Scanner to get input from the user with the keyboard. In some situations, getting input from the keyboard is cumbersome (entering 1000 bases, for example). To use JFileChooser, your program must add the following import statements to the regular import statements:

import java.io.File;
import java.io.FileNotFoundException;
import javax.swing.JFileChooser;

You’ll also need to insert and use the following static method:

/**
* Prompt the user to select a file using JFileChooser and return
* the File object that the user selected.
*
* @return the file selected by the user
*/
public static File getFile() {
	JFileChooser chooser = new JFileChooser();
	File fastaFile = null;
	int returnValue;
	
	// show the open file dialog
	returnValue = chooser.showOpenDialog(null);

	// check to see that the user clicked the OK button
	if (returnValue == JFileChooser.APPROVE_OPTION) {

		// get the selected file
		fastaFile = chooser.getSelectedFile();
	}

	return fastaFile;
}

So far we’ve been using Scanner by passing System.in to get input from the keyboard:

	Scanner keyboard = new Scanner(System.in);

Scanner can also be used to read the contents of a file. Simply replace System.in with an instance of a File variable (like the one returned by getFile()):

	Scanner fileReader = new Scanner(aFile);

Finally, you must surround any code using a fileReader instance of Scanner using a try/catch block. Try/catch blocks are used in Java to deal with possible runtime exceptions. In the case of reading files using Scanner, we have to deal with a possible FileNotFoundException – an exception that may occur if the user typed in a file name that did not exist.

Example:

try {
	Scanner fileReader = new Scanner(aFile);
	// your code working with fileReader goes here

} catch (FileNotFoundException e) {
	e.printStackTrace();
}


Methods:

In addition to the getFile() method, your program must implement and use the following static methods:

	public static void main(String[] args)

Your main method should prompt the user to select a FASTA file using the getFile() method. Then, the main method should pass the file to getHeader() and getSequence() to get the header and sequence from the FASTA file. Next, the main method should call countBase() for each of the four different types of bases and determine the percentage of the entire sequence that each base represents. The percentage should be rounded to the nearest whole number. Finally, the main method should call printDNAHistogram() to print out the histogram representing the statistics of the file.
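The rounding step described above can be sketched as follows. This is only an illustration, not the assignment's required code: the class name PercentSketch and the helper toPercent() are hypothetical, and returning 0 for an empty sequence is an assumption the assignment does not specify.

```java
class PercentSketch {

    // Hypothetical helper (not part of the assignment): converts a base
    // count into a whole-number percentage of the total number of bases.
    static int toPercent(int count, int total) {
        if (total == 0) {
            return 0; // assumption: an empty sequence is reported as 0%
        }
        // Cast to double first so the division keeps its decimal part,
        // then round to the nearest whole number and cast back to int.
        return (int) Math.round((double) count / total * 100);
    }

    public static void main(String[] args) {
        System.out.println(toPercent(1, 3)); // prints 33
        System.out.println(toPercent(2, 3)); // prints 67
    }
}
```

Because each percentage is rounded separately, the four values may not add up to exactly 100.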

	public static String getHeader(File fastaFile)

Using Scanner, the getHeader method should read the first line contained in fastaFile (which is the header line of the record). This method should remove the leading ‘>’ character before returning the header line.
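A minimal sketch of one way getHeader() could look is shown here. The class name HeaderSketch and the helper readHeader() are hypothetical additions made so the header logic can be tried without a file on disk; returning an empty string when the file is missing or empty is also an assumption.

```java
import java.io.File;
import java.io.FileNotFoundException;
import java.util.Scanner;

class HeaderSketch {

    // Hypothetical helper: reads the first line from an already-open
    // Scanner and strips the '>' character from it.
    static String readHeader(Scanner scanner) {
        String header = "";
        if (scanner.hasNextLine()) {
            header = scanner.nextLine();
            // replaceFirst treats its first argument as a regular expression;
            // ">" has no special meaning there, so it matches a literal '>'.
            header = header.replaceFirst(">", "");
        }
        return header;
    }

    public static String getHeader(File fastaFile) {
        String header = ""; // assumption: empty string if the file cannot be read
        try {
            Scanner scanner = new Scanner(fastaFile);
            header = readHeader(scanner);
            scanner.close();
        } catch (FileNotFoundException e) {
            e.printStackTrace();
        }
        return header;
    }
}
```

Splitting the work this way is a design choice for the sketch only; the assignment requires just the getHeader(File) signature.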

	public static String getSequence(File fastaFile)

Using Scanner, the getSequence method should skip the first line contained in fastaFile (which is the header line of the record) and store any subsequent lines (the sequence data) until the Scanner has no more lines left. The getSequence method should use the hasNextLine() and nextLine() methods from the Scanner class to read the sequence from the file. Once the Scanner has no more lines remaining, the getSequence method should return all of the data that it stored.
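The description above might translate into something like the following sketch. The class name SequenceSketch and the helper readSequence() are hypothetical, and the sketch uses a StringBuilder to join the lines, which is a substitution for plain String concatenation rather than anything the assignment asks for.

```java
import java.io.File;
import java.io.FileNotFoundException;
import java.util.Scanner;

class SequenceSketch {

    // Hypothetical helper: skips the header line, then joins every
    // remaining line into one long string.
    static String readSequence(Scanner scanner) {
        StringBuilder sequence = new StringBuilder();
        if (scanner.hasNextLine()) {
            scanner.nextLine(); // skip the header line
        }
        while (scanner.hasNextLine()) {
            sequence.append(scanner.nextLine());
        }
        return sequence.toString();
    }

    public static String getSequence(File fastaFile) {
        String sequence = ""; // assumption: empty string if the file cannot be read
        try {
            Scanner scanner = new Scanner(fastaFile);
            sequence = readSequence(scanner);
            scanner.close();
        } catch (FileNotFoundException e) {
            e.printStackTrace();
        }
        return sequence;
    }
}
```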

	public static int countBase(String sequence, char base)

The countBase method accepts sequence data (a String containing only As, Cs, Gs, and Ts), and a specific base character to count. The countBase method should return the number of times that base occurred in sequence.
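As an illustration, countBase() could be written with a simple loop over the characters of the sequence. This is a sketch under the assignment's stated assumption that the sequence is all upper case; the class name CountBaseSketch is hypothetical.

```java
class CountBaseSketch {

    // Counts how many times base occurs in sequence by checking
    // each character in turn.
    public static int countBase(String sequence, char base) {
        int count = 0;
        for (int i = 0; i < sequence.length(); i++) {
            if (sequence.charAt(i) == base) {
                count++;
            }
        }
        return count;
    }

    public static void main(String[] args) {
        System.out.println(countBase("GATTACA", 'A')); // prints 3
    }
}
```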

	public static void printDNAHistogram(String header, int aPct, int cPct, int gPct, int tPct)

The printDNAHistogram method should print out the histogram for the result calculated by other methods. The printDNAHistogram method should first print out header, and then it should call the printBar method (described below) for each of aPct, cPct, gPct, and tPct.

	public static void printBar(String label, int percent, char barSymbol)

The printBar method should print out a single bar for a histogram. The printBar method should first print out label. Then, the method should print out percent copies of the symbol barSymbol. Finally, the method should print out the actual percentage that the bar displays. All of this information should be on the same line. For example, the call

	printBar("A",25,'=');

would print out the following line:

A: ========================= (25%)


Output:

Use System.out for all output. For the sequence record in the file, print:

  • The header line (without the leading ‘>’)
  • A histogram showing the percentage of As, the percentage of Cs, the percentage of Gs, and the percentage of Ts
 

Gene casestudy.jpg

Solution

I will be using a step-by-step process to guide you to the problem solution. The first step involves planning your work and the subsequent steps involve creating each method separately.

Step 0: Before you start coding... PLAN!

A common problem that many students have when programming is the lack of planning before the actual coding begins. Many students jump right in and start coding away without any real plan to follow. While this method may work for simple assignments, it can become a frustrating experience as assignment difficulty and complexity increases. Planning allows you to get a good grasp of the problem and gives you a baseline to follow when coding.


Figure out exactly what the problem is asking you to do

The first step in coding is to figure out what the problem is asking you to do. It helps to read the assignment multiple times and mark down important points.


In this problem, you are asked to:

  • read a file containing one FASTA formatted sequence record
  • count the number of each of the four nucleotides (A,C,G and T) per sequence in the record
  • print a histogram displaying the genetic make‐up of the sequence record


It also asks you to output the following:

  • The header line (without the leading ‘>’)
  • A histogram showing the percentage of As, the percentage of Cs, the percentage of Gs, and the percentage of Ts


Once you have a solid understanding of what you need to do, the next step is to create pseudocode.

Pseudocode

Pseudocode can be thought of as an outline or structure of your code. It is meant to be readable and easy to follow. By pseudocoding, we can take the required methods in the assignment and figure out the logic we need to use in the actual code. This is also an easy way to see how methods interact with one another. The key thing is to understand what is needed for each method and apply it to your pseudocode. Please refer back to the Problem section for the specific details for each method. Since the exact code has already been provided for the getFile() method, it does not need to be pseudocoded.

	public static void main(String [] args)
	{
		file = getFile()	//prompts user for and reads in the text file.  the text file is then tied to a variable
		header = getHeader(file)	//grabs the first line in the text file and saves it to a string
		sequence = getSequence(file) //stores the rest of the lines in one string
		
		//4 bases: A, C, G, T
		//count the bases in the sequence and save them to their respective variables
		a = countBase(sequence, 'A')
		c = countBase(sequence, 'C')
		g = countBase(sequence, 'G')
		t = countBase(sequence, 'T')
		
		//determine the percentages of each base
		aPct = # of a's/total bases * 100
		cPct = # of c's/total bases * 100
		gPct = # of g's/total bases * 100
		tPct = # of t's/total bases * 100
		
		printDNAHistogram(header, aPct, cPct, gPct, tPct) //print histogram of results
	}
	public static String getHeader(File fastaFile)
	{
		//grab the first line of text from fastaFile (the header)
		//remove the leading '>'
		//save the header (minus the '>' character) into a string variable
		
		//return the  string variable
	}
	public static String getSequence(File fastaFile)
	{
		//skip the first line of text from fastaFile (the header)
		//loop while there are still lines left in fastaFile
			//add the current line to the end of a string variable
		
		//return the string variable
	}
	public static int countBase(String sequence, char base)
	{
		//create an integer variable to store the number of a specific base in the sequence
		
		//loop until we have gone through each character in the sequence
			//if the current character is equal to our base
				//increment the integer variable
		//once we have gone through each character in the sequence...
		
		//return the integer variable
	}
	public static void printDNAHistogram(String header, int aPct, int cPct, int gPct, int tPct)
	{
		//print out the header
		
		//call print bar for each base (A, C, G, T)
		//So for each of the following you're sending... a string corresponding to the base, the integer percentage of the sequence made up by that base,
		//and the character to be displayed as the "bars" of the histogram
		printBar("A", aPct, '=')
		printBar("C", cPct, '=')
		printBar("G", gPct, '=')
		printBar("T", tPct, '=')
	}
	public static void printBar(String label, int percent, char barSymbol)
	{
		//remember all of these should be printed on the same line
		
		//print out the label
		
		//loop for an x # of times depending on percent's value
			//print out the bar symbol (in this case: '=')
		
		//print out the percentage using percent
	}


Diagram

In addition to the pseudocode, using a diagram or some sort of visual aid can help you further understand your code. While there are many ways to create a diagram, the key thing is to make it as simple as possible; avoid any unnecessary details. In the following example, I created a diagram of how the methods in the program are connected.

Diagram.jpg

Once this is finished, we can now start coding.

Step 1: Start with the main() method

Since the main() method is the first method run by the program, it is a good place to start your coding. However, if you look at the pseudocode for the main method, you can see that most of the work is done by other methods called by main(). So using the pseudocode for main() as a basis, we can start coding the other methods. The first piece of code that appears in the main() method is a call to getFile().

Step 2: Understanding the getFile() method

Since the code for the getFile() method is already provided for us, all we really need to do for this method is to understand how it works and then use it.

Let's start with the variable declarations:

	JFileChooser chooser = new JFileChooser();
	File fastaFile = null;
	int returnValue;

JFileChooser is a class included in Java. It provides a simple mechanism for the user to choose a file and allow it to be loaded in to the program. The File variable is where we store the file that was read in using JFileChooser. This variable allows us to grab and manipulate the contents. int returnValue will be used to grab the "success" or "fail" code when using JFileChooser.

The next thing done is to open the file dialog, allowing the user to choose their FASTA input file.

		// show the open file dialog
		returnValue = chooser.showOpenDialog(null);

showOpenDialog() returns an integer code that indicates how the user closed the dialog (for example, by pressing the OK button or by cancelling). That code is stored in returnValue.

Next, we check if the ok button was pressed. If it was pressed then the file is saved into the File fastaFile variable.

		// check to see that the user clicked the OK button
		if (returnValue == JFileChooser.APPROVE_OPTION) {
			// get the selected file
			fastaFile = chooser.getSelectedFile();
		}

Once the work is done in the method, we return the File fastaFile variable to the main() method.

Step 3: Creating the getHeader() method

Looking back again at the main() method pseudocode, the next line that appears is a call to getHeader(). The method signature is given to us in the assignment:

public static String getHeader(File fastaFile)

This method will return a String variable and take in a File variable for processing. Taking a look at the pseudocode for the getHeader() method, the first thing that needs to be done is to grab the first line of text (the header) from fastaFile.

Start off by creating a String variable to store the header.

		String header = "";

As was mentioned in the Problem section, the try/catch is used to deal with possible exceptions. It is required here because creating a Scanner from a File can throw a FileNotFoundException, which Java requires you to handle. The next set of code will appear inside the try block as follows:

		try {
			//insert code in the try block
		} catch (FileNotFoundException e) {
			e.printStackTrace();
		}

Next, we want to tie the File fastaFile variable to Scanner, allowing us to view and grab its contents.

			Scanner scanner = new Scanner(fastaFile);

After that, using Scanner, we need to grab the header, which appears on the first line. The header needs to be saved into the String header variable. To do this, we use Scanner's nextLine() method, which returns the current line of text and advances to the next line each time it is called. For example, the first call to nextLine() returns the first line, and a second call returns the second line. This continues until there are no lines left; calling nextLine() after that throws a NoSuchElementException. This code will also appear inside the try block.

			header = scanner.nextLine();
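The way nextLine() advances can be tried out in a small standalone sketch. It reads from an in-memory String instead of a file (Scanner accepts either), and the class name NextLineDemo, the readLines helper, and the sample text are all made up for illustration.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Scanner;

class NextLineDemo {
	// Collects every line of the given text, using one nextLine() call per line.
	static List<String> readLines(String text) {
		List<String> lines = new ArrayList<>();
		Scanner scanner = new Scanner(text);
		while (scanner.hasNextLine()) {
			// each call returns the current line and moves on to the next one
			lines.add(scanner.nextLine());
		}
		scanner.close();
		return lines;
	}

	public static void main(String[] args) {
		// made-up record: one header line followed by two lines of sequence data
		List<String> lines = readLines(">sample header\nACGT\nTTGA");
		System.out.println(lines.get(0)); // result of the first nextLine() call
		System.out.println(lines.get(1)); // result of the second nextLine() call
		System.out.println(lines.size()); // total number of lines read
	}
}
```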

We now have the header saved to a String variable. Looking at the pseudocode again, the next step is to remove the '>' character from the string variable. To do this we use the built-in Java String method replaceFirst(). This method replaces the first match of its first argument (which is treated as a regular expression) with the string given in the second argument. Note that this code also appears within the try block.

			header = header.replaceFirst(">", "");

More specifically, this code replaces the first appearance of ">" with "" (an empty string), which removes it from the header.
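One detail worth knowing is that replaceFirst() removes the first ">" wherever it occurs, not only at the start of the line. The sketch below compares it with a hypothetical alternative based on startsWith() and substring(); the class name, method names, and sample strings are assumptions for illustration and are not part of the assignment.

```java
class StripMarkerDemo {
	// The approach used in this step: remove the first ">" found anywhere in the line.
	static String stripWithReplaceFirst(String line) {
		return line.replaceFirst(">", "");
	}

	// Hypothetical alternative: drop the '>' only when it is the first character.
	static String stripLeadingMarker(String line) {
		return line.startsWith(">") ? line.substring(1) : line;
	}

	public static void main(String[] args) {
		String header = ">seq1 length>100"; // made-up header line
		// For a proper FASTA header both approaches give the same result.
		System.out.println(stripWithReplaceFirst(header));
		System.out.println(stripLeadingMarker(header));

		String notAHeader = "seq1 length>100"; // no leading '>'
		// Here the two differ: replaceFirst still removes the '>' in the middle.
		System.out.println(stripWithReplaceFirst(notAHeader));
		System.out.println(stripLeadingMarker(notAHeader));
	}
}
```

Since the first line of a FASTA record always starts with ">", either approach works for this assignment.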

Now that we have the header saved to a string and removed the leading ">" character, we can return the String header to the caller.

		return header;
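Putting the pieces of this step together, the whole method might look like the sketch below. The body of getHeader() is assembled from the snippets above; the class name GetHeaderSketch, the writeTempFasta helper, and the sample record are assumptions added only so the method can be tried without a file chooser.

```java
import java.io.File;
import java.io.FileNotFoundException;
import java.io.IOException;
import java.io.PrintWriter;
import java.io.UncheckedIOException;
import java.util.Scanner;

class GetHeaderSketch {
	public static String getHeader(File fastaFile) {
		String header = "";
		try {
			Scanner scanner = new Scanner(fastaFile);
			header = scanner.nextLine();           // the first line is the header
			header = header.replaceFirst(">", ""); // remove the leading '>'
			scanner.close();                       // not shown in the steps above; releases the file
		} catch (FileNotFoundException e) {
			e.printStackTrace();
		}
		return header;
	}

	// Hypothetical helper: writes the given text to a temporary file for testing.
	static File writeTempFasta(String contents) {
		try {
			File file = File.createTempFile("sample", ".fasta");
			file.deleteOnExit();
			try (PrintWriter out = new PrintWriter(file)) {
				out.print(contents);
			}
			return file;
		} catch (IOException e) {
			throw new UncheckedIOException(e);
		}
	}

	public static void main(String[] args) {
		File file = writeTempFasta(">sample record\nACGT\nTTGA\n"); // made-up record
		System.out.println(getHeader(file));
	}
}
```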

Step 4: Creating the getSequence() method

After getHeader() is called in the main method, getSequence() is called next. What getSequence() should do is get the sequence data of the FASTA record, using a Scanner object to grab all the lines except the first one.

The method signature is as follows:

	public static String getSequence(File fastaFile)

This method takes in a File variable and returns a String.

Start off by creating a String to store the FASTA sequence.

		String sequence = "";

Since we will be using Scanner just like the previous example, a try/catch block will need to be used. The next set of code will again appear inside the try block.

		try {
			//insert code in the try block
		} catch (FileNotFoundException e) {
			e.printStackTrace();
		}

We want to tie the File fastaFile variable to Scanner in order to allow us to view and grab its contents. We also need to create a String variable to temporarily store the text that is read in.

			Scanner scanner = new Scanner(fastaFile);
			String line;

Next, we want to iterate through the text file and save each line to String sequence as one long string.

			while (scanner.hasNextLine()) { //loop until all the lines have been hit. ie: if there is still a next line, loop.
				line = scanner.nextLine(); //grab the next line and store it in a temporary string variable
				
				//filter out/skip the first line (the header)
				//if the current line is not the header, continue...
				if (!line.startsWith(">")) { //Can be read as... if the line does not start with the character ">", continue
					sequence += line; //Add the current line to the end of the sequence string
				}
			}	

Once we have the whole sequence saved into one string, return String sequence to the caller.

		return sequence;
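Assembled from the snippets in this step, the complete method might look like the following sketch. The class name GetSequenceSketch, the writeTempFasta helper, and the sample record are assumptions added for illustration; only getSequence() itself comes from the walkthrough.

```java
import java.io.File;
import java.io.FileNotFoundException;
import java.io.IOException;
import java.io.PrintWriter;
import java.io.UncheckedIOException;
import java.util.Scanner;

class GetSequenceSketch {
	public static String getSequence(File fastaFile) {
		String sequence = "";
		try {
			Scanner scanner = new Scanner(fastaFile);
			String line;
			while (scanner.hasNextLine()) {
				line = scanner.nextLine();
				// skip the header; every other line is sequence data
				if (!line.startsWith(">")) {
					sequence += line;
				}
			}
			scanner.close(); // not shown in the steps above; releases the file
		} catch (FileNotFoundException e) {
			e.printStackTrace();
		}
		return sequence;
	}

	// Hypothetical helper: writes the given text to a temporary file for testing.
	static File writeTempFasta(String contents) {
		try {
			File file = File.createTempFile("sample", ".fasta");
			file.deleteOnExit();
			try (PrintWriter out = new PrintWriter(file)) {
				out.print(contents);
			}
			return file;
		} catch (IOException e) {
			throw new UncheckedIOException(e);
		}
	}

	public static void main(String[] args) {
		File file = writeTempFasta(">sample record\nACGT\nTTGA\n"); // made-up record
		System.out.println(getSequence(file)); // the two sequence lines joined into one string
	}
}
```

For very long sequences, a StringBuilder would avoid copying the string on every +=; the += form is kept here to match the steps above.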

Step 5: Creating the printBar() method

If you look at the pseudocode, you can see that the last method called in main() is printDNAHistogram(). This method then calls printBar() four times. For an easier picture, take a look at the diagram. Since printBar() does not use any of the other required methods, it is a good next step for your code. printBar() prints out a single bar for a histogram.

The method signature is:

	public static void printBar(String label, int percent, char barSymbol)

The first thing we need to do in this method is to print out the label using String label.

		System.out.print(label + ": ");

Next, we need a loop that runs percent times. On each pass through the loop we print out the symbol stored in char barSymbol.

		for (int i = 0; i < percent; i++) { //loop from 0 until we hit percent's value
			System.out.print(barSymbol); //print out barSymbol
		}

Afterwards, we need to print out the actual percentage as a value. Since the percentage calculation was already done in the main() method, we simply add a "%" character and print.

		System.out.println(" (" + percent + "%)");

As an example, the output would look something like this:

A: ========================= (25%)

int percent for this example is equal to 25. This is the same as the number of bar symbols printed and the percentage shown at the end of the line.
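The three snippets in this step combine into the method sketched below. The wrapper class name PrintBarSketch and the calls in main() are assumptions for illustration.

```java
class PrintBarSketch {
	public static void printBar(String label, int percent, char barSymbol) {
		// label first, on the same line as everything else
		System.out.print(label + ": ");

		// one bar symbol per percentage point
		for (int i = 0; i < percent; i++) {
			System.out.print(barSymbol);
		}

		// the numeric percentage ends the line
		System.out.println(" (" + percent + "%)");
	}

	public static void main(String[] args) {
		printBar("A", 25, '='); // the example call from the assignment
		printBar("T", 0, '=');  // a percentage of 0 prints no bar symbols at all
	}
}
```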

Step 6: Creating the printDNAHistogram() method

Now that we have the printBar() method created, we can now use it in the printDNAHistogram() method.

The signature for this method is as follows:

	public static void printDNAHistogram(String header, int aPct, int cPct, int gPct, int tPct) 

int aPct corresponds to the percent for base A, cPct for base C, etc.

If you take a look at the pseudocode for this method, the header will need to be printed out first.

		System.out.println(header);

Looking again at the pseudocode, you will notice that printBar is called four times. The input variables for these method calls also correspond to each base (A, G, C, T).

The code will look something like:

		//the first input variables in order are: the base, the base percentage, character bar to use for printing
		printBar("A", aPct, '=');
		printBar("G", gPct, '=');
		printBar("C", cPct, '=');
		printBar("T", tPct, '=');

Step 7: Code the main method

Now that the rest of the required methods have been created, we can now easily code the main() method.

If you look at the pseudocode, most of the work has already been done for you!

We start off by using the getFile(), getHeader() and getSequence() methods.

		File file = getFile(); //prompt the user for the text/FASTA file and load it into the program
		String header = getHeader(file); //grab the header and save it to a string
		String sequence = getSequence(file); //grab the FASTA sequence and save it to a string

After those are put into the program, we need to figure out the total number of characters/bases in the sequence for later calculations.

		int totalBases = sequence.length(); //counts the number of characters in the string and saves it to an integer

Next, we use the countBase() method for each specific base (A, C, G, T) to count how many times that base occurs in the sequence.

		int a = countBase(sequence, 'A'); //send the sequence String and A base
		int c = countBase(sequence, 'C'); //send the sequence String and C base
		int g = countBase(sequence, 'G'); //send the sequence String and G base
		int t = countBase(sequence, 'T'); //send the sequence String and T base
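The steps in this walkthrough do not show countBase() itself, so here is a minimal sketch that follows the pseudocode from Step 0. The class name CountBaseSketch and the sample sequence are assumptions; the method signature is the one given in the assignment.

```java
class CountBaseSketch {
	public static int countBase(String sequence, char base) {
		int count = 0; // number of times base has been seen so far

		// look at each character in the sequence in turn
		for (int i = 0; i < sequence.length(); i++) {
			if (sequence.charAt(i) == base) {
				count++;
			}
		}
		return count;
	}

	public static void main(String[] args) {
		String sequence = "AACGTTTA"; // made-up sequence
		System.out.println(countBase(sequence, 'A'));
		System.out.println(countBase(sequence, 'T'));
	}
}
```

Note that the comparison is case-sensitive, so this sketch assumes the sequence data uses uppercase letters, as the assignment describes.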

After that, we need to determine the percentages for each base.

		int aPct = (int)(Math.round(((double) a) / totalBases * 100));
		int cPct = (int)(Math.round(((double) c) / totalBases * 100));
		int gPct = (int)(Math.round(((double) g) / totalBases * 100));
		int tPct = (int)(Math.round(((double) t) / totalBases * 100));

Using the first line as an example, I will explain how the calculation works. a must be cast to a double so that the division keeps its decimal part; with two int values, Java would perform integer division and discard it. Next, we divide by the total number of bases and multiply by 100 to get the percentage. Math.round then rounds that number to the nearest whole number. Since Math.round returns a long when given a double, the (int) cast converts the result back to an int.
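The effect of the cast and of Math.round can be checked with a few small numbers. In the sketch below, toPercent wraps the same expression used above, while withoutCast is a deliberately incorrect variant for comparison; both method names and the class name PercentDemo are made up for illustration.

```java
class PercentDemo {
	// Same expression as in main(): cast, divide, scale, round to nearest, cast back.
	static int toPercent(int count, int totalBases) {
		return (int) (Math.round(((double) count) / totalBases * 100));
	}

	// Deliberately incorrect variant: integer division drops the decimal part first.
	static int withoutCast(int count, int totalBases) {
		return count / totalBases * 100;
	}

	public static void main(String[] args) {
		System.out.println(toPercent(1, 3));   // 33.33... rounds down to 33
		System.out.println(toPercent(2, 3));   // 66.66... rounds up to 67
		System.out.println(toPercent(1, 8));   // 12.5 rounds up to 13
		System.out.println(withoutCast(1, 3)); // 1 / 3 is 0 in integer division, so 0
	}
}
```

Because each percentage is rounded separately, the four values do not always add up to exactly 100.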

Once we have all the percentage calculations for each base, we can now call printDNAHistogram() to print the output.

		printDNAHistogram(header, aPct, cPct, gPct, tPct);

Step 8: Test the program

While it is not covered in this case study, it's always a good idea to test your code frequently. This allows you to fix small problems before they become huge ones. Waiting until the end to test will leave you with a lot of errors and frustration.

Run your program with both input files and ensure you do not get any errors. Next, take a look at the output and make sure everything looks as expected.

Your output should look something like this:

gi|21071042|ref|NM_000193.2| Homo sapiens sonic hedgehog (SHH), mRNA
A: ========================= (25%)
G: ================================= (33%)
C: ============================= (29%)
T: ============== (14%)

Congratulations, you have finished the program!

Closing Remarks

Planning before you start coding is a really good habit to pick up. It helps you understand exactly how you want to go about your coding, from structure to specific lines of code. This extra step will end up saving you many hours of hair-pulling and frustration. You should also be testing your code bit by bit; do not wait until the end to test. Not only will you catch small errors and prevent them from becoming big ones, you will also save yourself many hours of problem solving. By doing these things, you are well on your way to becoming a master coder.

Code

Solution Code
