Accessing the web in Java


Extra Lab: URL

In this lab, we'll learn how to use the URL class to access information on the web.


Step 1: Imports

The URL class is part of the Java API, so you don't need to download anything. Simply import java.net.* to gain access to the URL class. We'll also need classes from other packages to handle the contents of websites, so your typical import statements should look like:

import java.net.*;
import java.io.*;
import java.util.Scanner;

Step 2: Basics

Once you've imported the URL class, you can begin accessing websites. The steps are

  1. Declare a variable of type URL, and point it to the website you want to read.
  2. Use I/O tools (probably Scanner) to access the contents of the website.
  3. Read in the contents of the website, doing whatever you want to them in the process -- it's just text now.

Here's a sample that does this:

import java.net.*;
import java.io.*;
import java.util.Scanner;

public class Url { 

	public static void main (String[] args) { 

		try {
			// Point a URL variable at the site we want to read
			URL website = new URL("http://www.cbc.ca/news");
			// Wrap the site's input stream in a Scanner
			Scanner in = new Scanner(new InputStreamReader(website.openStream()));
			// Print the contents token by token
			while (in.hasNext()) {
				System.out.println(in.next());
			}
		} catch (Exception e) {
			System.out.println(e.getCause());
			System.out.println(e.getMessage());
		}

	}

}

Notice a few things about this code:

  1. All this code does is open the website http://www.cbc.ca/news and then print out the contents of the page using a Scanner.
  2. We've put a try-catch block around the code to catch any exceptions it raises. In the case of URL, this means MalformedURLException (thrown by the URL constructor) and IOException (thrown by openStream).
  3. Reading from a URL is a two-step process. First you need to declare the variable of type URL with the line
URL myURLVariable = new URL("http://....");

where myURLVariable is whatever variable name you want to use and "http://..." is the address of the site you want to open.

  4. The second step is to stream the contents of the website into a Scanner variable. To do that, use
Scanner myScanner = new Scanner(new InputStreamReader(myURLVariable.openStream()));

Notice that you must use the same variable name as in your URL variable declaration.

You can then use Scanner in the same way as if you were reading from System.in. Consult the wiki page on the Scanner Class for more details.
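For example, here's a small sketch that counts the words on a page, using the same Scanner methods you'd use with System.in. (The word-counting task is just for illustration; note also that it catches the two exception types separately, with the more specific MalformedURLException listed first.)

import java.net.*;
import java.io.*;
import java.util.Scanner;

public class WordCount { 

	public static void main (String[] args) { 

		try {
			URL website = new URL("http://www.cbc.ca/news");
			Scanner in = new Scanner(new InputStreamReader(website.openStream()));
			// Count whitespace-separated tokens, exactly as if reading from System.in
			int words = 0;
			while (in.hasNext()) {
				in.next();
				words++;
			}
			System.out.println("The page contains " + words + " words.");
		} catch (MalformedURLException e) {
			// Thrown by the URL constructor if the address is malformed
			System.out.println("Bad address: " + e.getMessage());
		} catch (IOException e) {
			// Thrown by openStream() if the site can't be reached
			System.out.println("Couldn't read the site: " + e.getMessage());
		}

	}

}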

Step 3: Reading smartly

When you run the previous program, you'll notice that it prints out everything contained in that website, including all the markup for the page. This is good and bad: good in the sense that you have everything from the website, but bad in that you have to sort through it all to get what you want into a form that's readable for your user. How can we do that? One idea is to search through the results for what you're looking for, strip off the markup, and use what's left.

For instance, look at this website: http://en.wikipedia.org/w/index.php?namespace=0&tagfilter=&limit=500&hideminor=1&title=Special%3ARecentChanges. It shows all the substantial edits to Wikipedia that have been made recently.

If you view the source, you'll notice that all the actual edits look like this:
<li class="mw-line-even">[[INFORMATION ON CHANGED PAGE]]

or

<li class="mw-line-odd">[[INFORMATION ON CHANGED PAGE]]

So to find the updates in that mess, you could just look for lines that say mw-line and then use only those lines:

URL website = new URL("http://en.wikipedia.org/w/index.php?namespace=0&tagfilter=&limit=500&hideminor=1&title=Special%3ARecentChanges");
Scanner in = new Scanner(new InputStreamReader(website.openStream()));
while (in.hasNextLine()) {
	// Keep only the lines that describe an edit
	String next = in.nextLine();
	if (next.contains("mw-line")) {
		System.out.println(next);
	}
}

This code uses the contains method, which searches for a specific block of text (in this case mw-line) and then returns true if it finds it.
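Here's a quick, standalone illustration of contains (the strings are made up for the example):

String line = "<li class=\"mw-line-even\">[[Some page]]</li>";
System.out.println(line.contains("mw-line")); // prints true
System.out.println(line.contains("mw-row"));  // prints false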

When you run the Wikipedia example, you'll find that the println statement prints each matching line in full, including the HTML tags you used to identify it. To strip those tags out, use replaceAll:

if (next.contains("mw-line")) {
	// Delete every HTML tag, leaving just the text of the update
	String filter = next.replaceAll("<[^>]*>","");
	System.out.println(filter);
}


This removes all of the tags (the li, span, and a tags) and leaves only the text of the updates. It uses the replaceAll method for strings, which replaces every match of a pattern with another string. Here's a simple example:

String x = "aabaabc";
String y = x.replaceAll("aa","rr");

After this code, y is the string "rrbrrbc". The example we used

filter = next.replaceAll("<[^>]*>","");

is more complicated, but can be read as follows: replace all tags (defined by a "regular expression", <[^>]*>) with the string "". Since the replacement is the empty string, this has the effect of deleting all tags. You can also try to replace all occurrences of, e.g., &amp; by & or &nbsp; by a single space.
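Here's a small sketch of that kind of entity clean-up (the input string is invented for the example):

String raw = "Fish &amp; Chips&nbsp;&nbsp;tonight";
String clean = raw.replaceAll("&amp;", "&").replaceAll("(&nbsp;)+", " ");
System.out.println(clean); // prints: Fish & Chips tonight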

Whatever website you are reading, you will generally follow this type of procedure:

  1. Find the lines in the HTML file that you are interested in and save them in a String.
  2. Use replaceAll to remove useless HTML code (if desired) to retain only the text.

This process is somewhat of an art. It also uses the idea of a regular expression, which can be complicated and hard to read or write. You can learn a bit about regular expressions (patterns) in the Java API.
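If you want to experiment with a pattern on its own, here's a tiny sketch using the java.util.regex classes from the Java API (the sample string is made up):

import java.util.regex.*;

public class TagDemo { 

	public static void main (String[] args) { 

		// <[^>]*> means: a '<', then any run of characters that aren't '>', then a '>'
		Pattern tag = Pattern.compile("<[^>]*>");
		Matcher m = tag.matcher("<li class=\"mw-line-odd\">[[Some page]]</li>");
		while (m.find()) {
			System.out.println("Found a tag: " + m.group());
		}

	}

}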

Here's the full code for the Wikipedia example:

import java.net.*;
import java.io.*;
import java.util.Scanner;

public class Url { 

	public static void main (String[] args) { 

		try {
			URL website = new URL("http://en.wikipedia.org/w/index.php?namespace=0&tagfilter=&limit=500&hideminor=1&title=Special%3ARecentChanges");
			Scanner in = new Scanner(new InputStreamReader(website.openStream()));
			while (in.hasNextLine()) {
				String next = in.nextLine();
				// Keep only the lines describing edits, then strip the tags
				if (next.contains("mw-line")) {
					String filter = next.replaceAll("<[^>]*>","");
					System.out.println(filter);
				}
			}
		} catch (Exception e) {
			System.out.println(e.getCause());
			System.out.println(e.getMessage());
		}

	}

}



Step 4: Don't ruin things for everyone

As with the jTwitter extra lab, please don't wreck things for everyone. Writing a program that repeatedly and rapidly (i.e., in a loop with no delay) requests the content of a website is a great way to anger the owner of that website. Please don't do it.
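If your program genuinely needs to check a site more than once, pause between requests. Here's a minimal sketch (the five-check loop, the 30-second interval, and the example.com URL are all just placeholders; check what the site you're polling actually allows):

import java.net.*;
import java.io.*;
import java.util.Scanner;

public class PolitePoller { 

	public static void main (String[] args) throws Exception { 

		for (int check = 0; check < 5; check++) {
			URL website = new URL("http://www.example.com/");
			Scanner in = new Scanner(new InputStreamReader(website.openStream()));
			int lines = 0;
			while (in.hasNextLine()) {
				in.nextLine();
				lines++;
			}
			System.out.println("The page currently has " + lines + " lines.");
			// Wait 30 seconds before asking again -- don't hammer the server
			Thread.sleep(30000);
		}

	}

}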

Step 5 (Bonus Material): When not to use URL

As we have seen, running through the HTML code and extracting text can be really annoying. Fortunately, most websites that want you to have their content give you better ways to access it. For example, the social news site reddit provides its front page in two formats: JSON and XML. The purpose of these feeds is to give programs an easier way to handle the information on reddit's front page (without having to sort through the HTML as we did). Reddit has published information on how to access the JSON and XML feeds.

JSON (JavaScript Object Notation) is one format for storing data. You can read about it here. A library of code for dealing with JSON in Java is here. With the library, we can get the information from reddit really easily: here is a very-small-but-working example that gets the authors and titles of the stories on the front page of reddit:

import java.net.*;
import java.io.*;
import org.json.*;

public class RedditReader {

	public static void main (String[] args) {
		
		try {
			// Read the feed and parse it into a JSON object
			JSONObject redditJSON = new JSONObject(new JSONTokener(new InputStreamReader(new URL("http://www.reddit.com/.json").openStream())));
			
			// The stories live in the "children" array inside the "data" field
			JSONArray stories = redditJSON.getJSONObject("data").getJSONArray("children");
			
			// Print the author and title of each story
			for (int i = 0; i < stories.length(); i++) { 
				JSONObject story = stories.getJSONObject(i).getJSONObject("data");
				System.out.println(story.get("author") + " submitted \"" + story.get("title") + "\"");
			}
		
		} catch (Exception e) { 
			System.out.println(e.getMessage());
		}
		
	}
}

The first line of this example gets the reddit feed (available at www.reddit.com/.json) and turns it into a JSONObject. The next line gets the array of stories on the front page (by accessing the children field). Then the loop simply runs through the stories, printing the author and title of each one (this time using the author and title fields). To run this code, you need to download and install the JSON library linked above.
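If you've saved the library as, say, json.jar in the same directory as your program (the jar's actual name will depend on the version you download), you can compile and run with something like javac -cp .:json.jar RedditReader.java followed by java -cp .:json.jar RedditReader (on Windows, use a semicolon in place of the colon).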

Here's this code's "please don't wreck this": reddit asks that you request this feed at most once every 30 seconds.