CS691: Big Data Systems

Fall 2012

The class dataset consists of articles from online news sources, blogs, and social media, arriving at a rate of about 40M documents/day. The total volume is about 35GB/day, or roughly a terabyte a month.

Accessing the data: There are two ways to access the data. The first is to query a mysql database server running on plum.cs.umass.edu. The second is to download mysqldump files from pear.cs.umass.edu. Both methods are explained below.

(1) Accessing the database server: You can connect to the mysql server running on plum from any cs.umass.edu machine as

mysql -uUsername -pPassword Database

where Database is 'sybil', Username is 'cs691bd', and Password is as announced in class. The data is stored in a table called 'documents_CURRENT' and consists of the following fields (and a few others that you can ignore); a sample query over these fields appears after the list:

  • ID: An auto-increment bigint counter that is unique for each row and also uniquely identifies the document.
  • DOCUMENT_ID: Another bigint id that uniquely identifies the document and also encodes the DOCUMENT_DATE.
  • SOURCE_TYPE: A string specifying whether the document has the type NEWS, BLOG, or TWITTER.
  • DOCUMENT_DATE: The timestamp of the document in mysql datetime format.
  • URL: The URL of the document.
  • TITLE: The title of the document.
  • CONTENT: The body text of the document. To some extent, this is stripped of unrelated sidebar content (e.g., advertisements or links to other articles), but you should still expect to find some.
  • LANG_CODE: A string specifying the language code. 'en' is English, and 'U' (for unknown) is mostly, but not always, English.
  • AUTHOR_NAME: The name of an author if available.
  • AUTHOR_URL: URL of the author if available.
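
As a quick sanity check, you can pull a few rows to see what these fields look like. This is only a sketch; the column subset and the LIMIT are arbitrary:

select ID, SOURCE_TYPE, DOCUMENT_DATE, LANG_CODE, TITLE from documents_CURRENT limit 5;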

You can inspect the structure of the table by running

show create table documents_CURRENT;

at the mysql command prompt (or from any GUI client connected to the database). You can see the indexes created on the table by running

show indexes in documents_CURRENT;

though it may take many seconds to return. Take the index information into account when constructing queries. Queries restricted to ranges of DOCUMENT_DATE, ID, or DOCUMENT_ID should be relatively quick, as these fields are indexed. As the table is gigantic, carelessly composed queries can take days to finish and are likely to slow down other users as well.
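
For instance, a query restricted to a narrow DOCUMENT_DATE range should use the index and return quickly; the range and LIMIT below are purely illustrative:

select ID, TITLE from documents_CURRENT where DOCUMENT_DATE >= '2012-09-03 00:00:00' and DOCUMENT_DATE < '2012-09-03 01:00:00' limit 100;

In contrast, a filter that the indexes cannot help with (for example, a LIKE pattern over CONTENT with no restriction on DOCUMENT_DATE or ID) forces a scan of the whole table and should be avoided on the shared server.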

(2) Downloading mysqldump files: You can download the following mysqldump files from any cs.umass.edu machine:

http://pear.cs.umass.edu/691/april.sql.tgz
http://pear.cs.umass.edu/691/may.sql.tgz

The links above point to compressed (.tgz) versions and will be downloadable once the compression completes. The former contains about two weeks of traces from April 2012, and the latter contains all of May's traces. You can import these dump files into a mysql database running on your own (or any other) machine, provided you first create a database called 'sybil' and a table called 'documents_ARCHIVE' in the destination database. You should then be able to import the dump files by running

mysql -uUsername -pPassword sybil < dumpfile.sql
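
Put together, a typical workflow on your own machine might look like the sketch below. The Username and Password here are for your local mysql server (not the class credentials), and create_documents_ARCHIVE.sql is a hypothetical file you would prepare yourself, containing a CREATE TABLE statement adapted from the output of 'show create table documents_CURRENT' on plum with the table renamed to documents_ARCHIVE:

wget http://pear.cs.umass.edu/691/april.sql.tgz
tar xzf april.sql.tgz
mysql -uUsername -pPassword -e "create database if not exists sybil"
mysql -uUsername -pPassword sybil < create_documents_ARCHIVE.sql
mysql -uUsername -pPassword sybil < april.sql

The name of the .sql file extracted from the tarball may differ; adjust the last command accordingly.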

Finally, there is also a single day's dump that will automatically create the table documents_ARCHIVE upon import:

http://pear.cs.umass.edu/691/september3.sql.tgz

**Note**: Importing the above dump will drop any existing table of the same name before creating the table. So you may want to import this one-day dump first and then import the longer dumps using the command above. You can also create dump files in whatever form you find convenient by running mysqldump on the plum database server.
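
As one example of the last point, you could dump a one-day slice of the live table and import it locally. This is only a sketch: the date range and output filename are arbitrary, and depending on your setup you may need to point mysqldump at the server explicitly with -h plum.cs.umass.edu:

mysqldump -uUsername -pPassword --where="DOCUMENT_DATE >= '2012-09-03' and DOCUMENT_DATE < '2012-09-04'" sybil documents_CURRENT > myslice.sql

Keep the range small; dumping a large slice places the same load on the shared server as a large query would.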