Programming Assignment 11: Crawler

Estimated reading time: 10 minutes
Estimated time to complete: two to three hours (plus debugging time)
Prerequisites: Assignment 10
Starter code: crawler-student.zip
Collaboration: not permitted

Overview

Remember Assignment 09, where you built part of a search engine? Search engines index documents and let you retrieve documents of interest. But how do they get their documents in the first place? In some cases, they’re part of a known corpus, but in other cases, the author of the search engine has to go and get them.

The canonical example of “going to get them” is a web crawler, software that visits a page on the web, retrieves its contents, and parses the text and links. It then marks the page as visited, puts those links into a list of pages to be visited, and visits another page, and so on. Every major web search engine has many crawlers working in concert to keep its document index up-to-date.

Since this is COMPSCI 190D, we’ll start a little more modestly. In this assignment, you’ll write a crawler that performs most of the functions of a true web crawler, but only on your local filesystem. We’ll leverage an existing library, jsoup, for the parsing of pages and links, and use our knowledge of abstract data types to manage the visited and to-be-visited collections appropriately.

We’ve provided a set of unit tests to help with automated testing. The Gradescope autograder includes a parallel but different set of tests.

Goals

  • Build a simple crawler, following links, visiting pages, and storing their contents appropriately.
  • Practice writing unit tests.
  • Test code using unit tests.

Downloading and importing the starter code

As in previous assignments, download and save (but do not decompress) the provided archive file containing the starter code. Then import it into Eclipse in the same way; you should end up with a crawler-student project in the “Project Explorer”.

Examining the code

You will be doing your work in the UriCrawler class, so open it up and read through it. You should also open and read at least a few of the UriCrawlerTest cases, to get a sense of how a UriCrawler works.

Side note: What’s a URI? URIs are a superset of URLs. For our purposes, they’re interchangeable, but in practice not all URIs are URLs. You can think of a URI as a link (http://www.umass.edu/), though we’re going to work only on links that are local (file:) rather than retrieved using HTTP (http:).
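In Java, both kinds can be represented with the built-in java.net.URI class. A tiny illustration (the file path below is a made-up placeholder):

import java.net.URI;

public class UriSchemeDemo {
    public static void main(String[] args) {
        // A web URI (also a URL) and a local-file URI; the path is a placeholder.
        URI web = URI.create("http://www.umass.edu/");
        URI local = URI.create("file:///home/student/pages/index.html");

        System.out.println(web.getScheme());   // prints "http"
        System.out.println(local.getScheme()); // prints "file"
    }
}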

Side note 2: URIs can refer to any type of file or resource, but we’re going to limit our attention to HTML files. HTML files are text with markup – “tags” that are identified by angle brackets and look like, say, <title>The Title</title> to indicate logical parts of documents. HTML files can link to other documents (yes, yes, I know you know this) by using an anchor tag with an href attribute. For example, a link to Google might be written:

<a href="http://www.google.com/">here is a link to google</a>

which would show up as

here is a link to google

The support package contains various classes to make writing your UriCrawler easier. You’ll store each parsed, retrieved document in a RetrievedDocument instance. You’ll use the utility methods in CrawlerUtils to parse documents and extract the links they contain. And you’ll throw the provided exceptions when appropriate.

What to do

The UriCrawler will require some state, that is, instance variables (for example, you’ll need to store the number of visits attempted so far, the URIs visited, and the RetrievedDocuments your crawler creates). You’ll want to declare and instantiate these variables as you figure out what you’ll need to store.
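For instance, here is a minimal sketch of what those fields might look like, assuming java.util collections and java.net.URI; the names are placeholders, not something the starter code requires, and the quota may already be handled for you:

// Assumes: import java.net.URI; import java.util.*;
// Field names are placeholders; choose whatever fits your design.
private final Set<URI> attemptedUris = new HashSet<>();            // URIs we have tried to visit
private final Set<URI> unattemptedUris = new HashSet<>();          // URIs waiting to be visited
private final Set<RetrievedDocument> documents = new HashSet<>();  // successfully parsed pages
private int attempts = 0;                                          // number of visits attempted so far
private int visitQuota;                                            // maximum visits, presumably set in the constructor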

The two get methods should be straightforward, as should the addUri method.
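As an example, a possible sketch of addUri using the placeholder fields above (check the method’s comments for its exact contract, such as how to treat URIs that have already been attempted):

// Sketch only; the real contract is in the starter code's comments.
public void addUri(URI uri) {
    if (!attemptedUris.contains(uri)) {   // don't re-queue a URI we've already tried
        unattemptedUris.add(uri);
    }
}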

The meat of this assignment is in the visitOne method. Read the documentation carefully to discern what it should do. In short, it should:

  • get an unattempted URI from the UriCrawler’s collection of such URIs (and remove it from the collection, of course)
  • add it to the collection of attempted URIs
  • parse it into a Document using CrawlerUtils.parse
  • create a new RetrievedDocument consisting of the URI, the document’s text (which you can get using the Document.text method), and the List of URIs from that document (which you can get using CrawlerUtils.getFileUriLinks)
  • add the as-yet unattempted links into the collection of URIs to be attempted

Of course, there are some edge cases you’ll need to consider, like what to do if the parse fails, or if the maximum number of attempts has already been reached, and so on. The expected behavior in each of these cases is documented in the comments of UriCrawler.
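Putting the steps and edge cases together, visitOne might look roughly like the sketch below. The exception names, the method signature, the RetrievedDocument constructor order, and the exact edge-case handling (for example, whether a failed parse still counts as an attempt) are assumptions here; defer to the starter code and its comments. CrawlerUtils.parse, CrawlerUtils.getFileUriLinks, and Document.text are the methods named above.

// Sketch only: exception names, the signature, and edge-case behavior are assumptions;
// the authoritative description is in the UriCrawler comments.
public void visitOne() throws MaximumVisitsExceededException, NoUnvisitedUrisException {
    if (attempts >= visitQuota) {
        throw new MaximumVisitsExceededException();
    }
    if (unattemptedUris.isEmpty()) {
        throw new NoUnvisitedUrisException();
    }

    // Take one unattempted URI and mark it as attempted.
    URI uri = unattemptedUris.iterator().next();
    unattemptedUris.remove(uri);
    attemptedUris.add(uri);
    attempts++;

    try {
        Document document = CrawlerUtils.parse(uri);
        List<URI> links = CrawlerUtils.getFileUriLinks(document);
        documents.add(new RetrievedDocument(uri, document.text(), links));

        // Queue any links we haven't already attempted.
        for (URI link : links) {
            if (!attemptedUris.contains(link)) {
                unattemptedUris.add(link);
            }
        }
    } catch (Exception e) {   // replace with the specific exception CrawlerUtils.parse declares
        // A failed parse still counts as an attempt in this sketch; see the
        // documentation for what else (if anything) should happen here.
    }
}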

visitAll should be straightforward once visitOne is working.
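One plausible shape, assuming visitOne throws the placeholder exceptions from the sketch above:

// Sketch: keep visiting until nothing is left or the quota is reached.
public void visitAll() {
    try {
        while (!unattemptedUris.isEmpty()) {
            visitOne();
        }
    } catch (MaximumVisitsExceededException | NoUnvisitedUrisException e) {
        // Quota reached or nothing left to visit; stop crawling.
    }
}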

Other notes

Why not crawl actual web pages? Certainly you could, and if you read the jsoup cookbook you’ll see that it would be easy. But I’d rather you not, for two reasons.

  • First, testing remote web pages requires you have an Internet connection and introduces an additional source of errors that I’d like to avoid.
  • Second, testing remote web pages with an automatic web crawler can lead to … unintended consequences. Imagine if your crawler wasn’t working quite right, and just kept hammering the server with requests for the same page. Imagine if several of your classmates did the same. Many bad consequences could flow from this: the server might go down; IT might shut off your network access if they thought you were attacking a CS page; other students’ tests might fail through no fault of their own, and so on.

That said, I’m sure at least a few of you are going to toy with making this into a real web crawler. If you must, then you’ll have to modify both methods in CrawlerUtils: the first to work on URLs instead of Files, and the second to filter on http/https links rather than file: links (URIs that point to local files rather than HTTP-accessible resources).
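For reference, the difference in jsoup between parsing a local file and fetching a page over HTTP looks roughly like this; the path and URL are placeholders, and again, please don’t point your crawler at the live web:

import java.io.File;
import java.io.IOException;
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

public class JsoupFetchDemo {
    public static void main(String[] args) throws IOException {
        // Local file, as in this assignment; the path is a placeholder.
        Document local = Jsoup.parse(new File("pages/index.html"), "UTF-8");
        System.out.println(local.title());

        // Remote page over HTTP, as a real crawler would do.
        Document remote = Jsoup.connect("http://example.com/").get();

        // Either way, links are <a href="..."> elements.
        for (Element link : remote.select("a[href]")) {
            System.out.println(link.attr("abs:href")); // absolute form of the link
        }
    }
}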

Why is there a visitQuota? Paralleling the reasons above, the visitQuota will help stop otherwise hard-to-diagnose infinite loops from occurring, and will help keep those of you who point this thing at the Internet at large from getting into trouble.

Submitting the assignment

When you have completed the changes to your code, you should export an archive file containing the src/ directory from your Java project. To do this, follow the same steps as from Assignment 01 to produce a .zip file, and upload it to Gradescope.

Remember, you can resubmit the assignment as many times as you want, until the deadline. If it turns out you missed something and your code doesn’t pass 100% of the tests, you can keep working until it does.