Programming Assignment 11: Crawler
Estimated reading time: 10 minutes
Estimated time to complete: two to three hours (plus debugging time)
Prerequisites: Assignment 10
Starter code: crawler-student.zip
Collaboration: not permitted
Overview
Remember Assignment 09, where you built part of a search engine? Search engines index documents and let you retrieve documents of interest. But how do they get their documents in the first place? In some cases, they’re part of a known corpus, but in other cases, the author of the search engine has to go and get them.
The canonical example of “going to get them” is a web crawler, software that visits a page on the web, retrieves its contents, and parses the text and links. It then marks the page as visited, puts those links into a list of pages to be visited, and visits another page, and so on. Every major web search engine has many crawlers working in concert to keep their document indices up-to-date.
Since this is COMPSCI 186, we’ll start a little more modestly. In this assignment, you’ll write a crawler that performs most of the functions of a true web crawler, but only on your local filesystem. We’ll leverage an existing library, jsoup, for the parsing of pages and links, and use our knowledge of abstract data types to manage the visited and to-be-visited collections appropriately.
We’ve provided a set of unit tests to help with automated testing. The Gradescope autograder includes a parallel but different set of tests.
Goals
- Build a simple crawler, following links and visiting pages, storing their contents appropriately.
- Practice writing unit tests.
- Test code using unit tests.
Downloading and importing the starter code
As in previous assignments, download and save (but do not decompress) the provided archive file containing the starter code. Then import it into Eclipse in the same way; you should end up with a crawler-student project in the "Project Explorer".
Examining the code
You will be doing your work in the UriCrawler class, so open it up and read through it. You should also open and read at least a few of the UriCrawlerTest cases, to get a sense of how a UriCrawler works.
Side note: What's a URI? URIs are a superset of URLs. For our purposes, they're interchangeable, but in practice not all URIs are URLs. You can think of a URI as a link (http://www.umass.edu/), though we're going to work only on links that are local (file:) rather than retrieved using HTTP (http:).
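For instance, here is a tiny Java illustration of the difference (the file path is made up for the example): File.toURI() produces a local file: URI, while a web address uses the http: scheme.

import java.io.File;
import java.net.URI;

public class UriExample {
    public static void main(String[] args) {
        // Hypothetical local file; File.toURI() yields an absolute file: URI,
        // e.g. file:/home/someone/pages/index.html
        URI localUri = new File("pages/index.html").toURI();
        System.out.println(localUri.getScheme()); // "file"

        // A remote address, by contrast, uses the http: scheme.
        URI remoteUri = URI.create("http://www.umass.edu/");
        System.out.println(remoteUri.getScheme()); // "http"
    }
}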
Side note 2: URIs can refer to any type of file or resource, but we're going to limit our attention to HTML files. HTML files are text with markup – "tags" that are identified by angle brackets and look like, say, <title>The Title</title> to indicate logical parts of documents. HTML files can link to other documents (yes, yes, I know you know this) by using an anchor tag with an href attribute. For example, a link to Google might be written:

<a href="http://www.google.com/">here is a link to google</a>

which would show up in a browser as a clickable link reading "here is a link to google".
The support package contains various classes to make writing your UriCrawler easier. You'll store each parsed, retrieved document in a RetrievedDocument instance. You'll use the utility methods in CrawlerUtils to parse documents and extract the links they contain. And you'll throw the exceptions it provides when appropriate.
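If you're curious what CrawlerUtils is doing for you, here is a minimal sketch of parsing a local HTML file and extracting its links with jsoup directly. The file name is made up, and in the assignment itself you should call CrawlerUtils rather than jsoup.

import java.io.File;
import java.io.IOException;
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

public class JsoupSketch {
    public static void main(String[] args) throws IOException {
        // Hypothetical local HTML file.
        File file = new File("pages/index.html");

        // Parse the file into a jsoup Document.
        Document doc = Jsoup.parse(file, "UTF-8");

        // The document's visible text, roughly what Document.text() gives you.
        System.out.println(doc.text());

        // Every anchor tag with an href attribute; "abs:href" resolves
        // relative links against the document's location.
        for (Element link : doc.select("a[href]")) {
            System.out.println(link.attr("abs:href"));
        }
    }
}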
What to do
The UriCrawler will require some state, that is, instance variables (for example, you'll need to store the number of visits attempted so far, the URIs visited, and the RetrievedDocuments your crawler creates). You'll want to declare and instantiate these variables as you figure out what you'll need to store.

The two get methods should be straightforward, as should the addUri method.
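As a rough sketch of the kind of state and simple methods involved – every name, signature, and the RetrievedDocumentSketch stand-in below are assumptions; the starter code's own declarations and javadoc are what count:

import java.net.URI;
import java.util.Collection;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Sketch only: the real UriCrawler and RetrievedDocument in the starter code
// define their own constructors, names, and documentation, which take precedence.
public class UriCrawlerSketch {
    private final int visitQuota;                             // maximum number of visit attempts
    private int attemptedCount = 0;                           // visits attempted so far
    private final Set<URI> attemptedUris = new HashSet<>();   // URIs already attempted
    private final Set<URI> unattemptedUris = new HashSet<>(); // URIs waiting to be attempted
    private final Set<RetrievedDocumentSketch> documents = new HashSet<>(); // results so far

    public UriCrawlerSketch(int visitQuota) {
        this.visitQuota = visitQuota;
    }

    // The "get" methods usually just hand back (copies of) the collections.
    public Collection<URI> getAttemptedUris() {
        return new HashSet<>(attemptedUris);
    }

    public Collection<RetrievedDocumentSketch> getRetrievedDocuments() {
        return new HashSet<>(documents);
    }

    // addUri records a URI to attempt later, unless it has already been attempted.
    public void addUri(URI uri) {
        if (!attemptedUris.contains(uri)) {
            unattemptedUris.add(uri);
        }
    }
}

// Stand-in for the starter code's RetrievedDocument: a URI, its text, and its links.
class RetrievedDocumentSketch {
    RetrievedDocumentSketch(URI uri, String text, List<URI> links) { }
}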
The meat of this assignment is in the visitOne method. Read the documentation carefully to discern what it should do. In short, it should:
- get an unattempted URI from the UriCrawler's collection of such URIs (and remove it from the collection, of course)
- add it to the collection of attempted URIs
- parse it into a Document using CrawlerUtils.parse
- create a new RetrievedDocument consisting of the URI, the document's text (which you can get using the Document.text method), and the List of URIs from that document (which you can get using CrawlerUtils.getFileUriLinks)
- add the as-yet unattempted links into the collection of URIs to be attempted
Of course, there are some edge cases you'll need to consider, like what to do if the parse fails, or if the maximum number of attempts has already been reached, and so on. Their expected behavior is documented in the comments of UriCrawler.
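Putting the steps and edge cases together, here is a hedged sketch of visitOne's overall shape, continuing the UriCrawlerSketch above. The parseSketch and getFileUriLinksSketch helpers are hypothetical stand-ins for CrawlerUtils.parse and CrawlerUtils.getFileUriLinks, and the early returns are placeholders for whatever the starter code's comments actually require (most likely throwing specific exceptions).

// Continuing the UriCrawlerSketch above.
public void visitOne() {
    // Edge cases first: quota exhausted, or nothing left to visit.
    if (attemptedCount >= visitQuota || unattemptedUris.isEmpty()) {
        return; // placeholder: the real UriCrawler documents what to do here
    }

    // Take (and remove) one unattempted URI, and mark it attempted.
    URI uri = unattemptedUris.iterator().next();
    unattemptedUris.remove(uri);
    attemptedUris.add(uri);
    attemptedCount++;

    // Parse the page; on a parse failure, give up on this URI without crashing.
    org.jsoup.nodes.Document document = parseSketch(uri);
    if (document == null) {
        return; // placeholder: see the starter code for the expected behavior
    }

    // Record the retrieved document: its URI, its text, and its outgoing file: links.
    List<URI> links = getFileUriLinksSketch(document);
    documents.add(new RetrievedDocumentSketch(uri, document.text(), links));

    // Queue any links that haven't already been attempted.
    for (URI link : links) {
        if (!attemptedUris.contains(link)) {
            unattemptedUris.add(link);
        }
    }
}

// Hypothetical stubs so the sketch hangs together; the assignment uses CrawlerUtils instead.
private org.jsoup.nodes.Document parseSketch(URI uri) { return null; }
private List<URI> getFileUriLinksSketch(org.jsoup.nodes.Document doc) {
    return java.util.Collections.emptyList();
}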
visitAll should be straightforward once visitOne is working.
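In the same sketch, that might look like:

// Still in the UriCrawlerSketch: visitAll just repeats visitOne until nothing is
// left to attempt or the quota is hit (the starter code states the exact condition).
public void visitAll() {
    while (!unattemptedUris.isEmpty() && attemptedCount < visitQuota) {
        visitOne();
    }
}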
Other notes
Why not crawl actual web pages? Certainly you could, and if you read the jsoup cookbook you'll see that it would be easy. But I'd rather you not, for two reasons.
- First, testing remote web pages requires you have an Internet connection and introduces an additional source of errors that I’d like to avoid.
- Second, testing remote web pages with an automatic web crawler can lead to … unintended consequences. Imagine if your crawler wasn’t working quite right, and just kept hammering the server with requests for the same page. Imagine if several of your classmates did the same. Many bad consequences could flow from this: the server might go down; IT might shut off your network access if they thought you were attacking a CS page; other students’ tests might fail through no fault of their own, and so on.
That said, I'm sure at least a few of you are going to toy with making this into a real web crawler. If you must, then you'll have to modify both methods in CrawlerUtils: the first to work on URLs instead of Files, and the second to filter on http:/https: links rather than file: links (which are URIs that point to local files rather than HTTP-accessible resources).
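If you do go that route, the scheme filtering might look roughly like this (a sketch only; the actual CrawlerUtils method names and signatures are in the starter code):

import java.net.URI;

public class SchemeFilterSketch {
    // Returns true for links retrieved over HTTP/HTTPS rather than local file: links.
    public static boolean isHttpLink(URI uri) {
        String scheme = uri.getScheme();
        return "http".equalsIgnoreCase(scheme) || "https".equalsIgnoreCase(scheme);
    }

    public static void main(String[] args) {
        System.out.println(isHttpLink(URI.create("http://www.google.com/"))); // true
        System.out.println(isHttpLink(URI.create("file:/tmp/index.html")));   // false
    }
}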
Why is there a visitQuota? Paralleling the reasons above, the visitQuota will help stop otherwise hard-to-diagnose infinite loops from occurring, and will help keep those of you who point this thing at the Internet at large from getting into trouble.
Submitting the assignment
When you have completed the changes to your code, you should export an archive file containing the src/ directory from your Java project. To do this, follow the same steps as from Assignment 01 to produce a .zip file, and upload it to Gradescope.
Remember, you can resubmit the assignment as many times as you want, until the deadline. If it turns out you missed something and your code doesn’t pass 100% of the tests, you can keep working until it does.