CS691: Big Data Systems
Fall 2012
This dataset consists of articles from online news sources, blogs, and social media. It comprises about 40M documents/day, for a total volume of about 35GB/day, or roughly a terabyte a month.

Accessing the data: There are two ways to access the data. The first is through a mysql database server on the host plum.cs.umass.edu. The second is to download mysqldump files from the server pear.cs.umass.edu. The two methods are explained below.

(1) Accessing the database server: You can connect to the mysql server running on plum from any cs.umass.edu machine as follows,
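The connection command itself did not survive in this copy of the handout; a minimal sketch, assuming the standard mysql command-line client and the host and credentials named in the surrounding text:

```shell
# Connect to the class database server; the host, database, and
# username are those given in the text, and mysql will prompt
# for the password announced in class.
mysql -h plum.cs.umass.edu -u cs691bd -p sybil
```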
where 'Database' is 'sybil', Username is 'cs691bd', and Password is as announced in class. The data is stored in a table called 'documents_CURRENT' and consists of the following fields (and a few others that you can ignore):
You can inspect the structure of the table by running
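The statement was elided here; presumably the standard mysql DESCRIBE statement, sketched as:

```sql
-- Show the column names and types of the documents table
DESCRIBE documents_CURRENT;
```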
on the database command prompt (or from other GUI clients that connect to mysql databases). You can see the indexes created on the table by running
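The statement is missing from this copy; the standard mysql SHOW INDEX statement fits here, and an index-friendly range query of the kind the text recommends might look like the following (the date range is illustrative):

```sql
-- List the indexes defined on the table (this can be slow)
SHOW INDEX FROM documents_CURRENT;

-- Example of an index-friendly query: restrict by the indexed
-- document_date column rather than scanning the whole table
SELECT ID, DOCUMENT_ID, document_date
FROM documents_CURRENT
WHERE document_date >= '2012-04-01' AND document_date < '2012-04-03'
LIMIT 100;
```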
though it may take many seconds to return. Take the index information into account when constructing queries. Queries that search over ranges of document_date, ID, or DOCUMENT_ID should be relatively quick, as these fields are indexed. As the table is gigantic, carelessly composed queries can take days to finish and are likely to slow down other users as well.

(2) Downloading mysqldump files: You can download the following mysqldump files from any cs.umass.edu machine:
You should be able to download compressed versions, with .tgz appended to the links above, when the compression completes. The former is about two weeks of traces from April 2012 and the latter is all of May's traces. You can import these dump files into a mysql database running on your own (or any other) machine, provided you first create a database called 'sybil' and a table called 'documents_ARCHIVE' in the destination database. If so, you should be able to import the dump files by running
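The import command itself is missing here; a sketch of the setup and import on your own machine, assuming a local mysql server with a 'root' account and using a placeholder dump filename (the real filenames are in the elided links above):

```shell
# Create the destination database first; the documents_ARCHIVE table
# must also exist, with the same schema as the table on plum
mysql -u root -p -e "CREATE DATABASE IF NOT EXISTS sybil;"

# Import a downloaded dump file (dumpfile.sql is a placeholder name)
mysql -u root -p sybil < dumpfile.sql
```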
Finally, there is also a single day's dump that will automatically create the table documents_ARCHIVE upon import:
**Note**: Importing the above dump will drop any existing table of the same name while creating the table. So you may want to first import this one-day table and then the longer traces using the import command above. You can also create dump files of whatever form you find convenient by running mysqldump on the plum database server.
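As a sketch of that last point, a mysqldump invocation against plum that extracts a single day into a custom dump file might look like the following; the --where filter, the date, and the output filename are all illustrative, not from the handout:

```shell
# Dump one (illustrative) day of documents_CURRENT to a local file,
# using the class server credentials given above
mysqldump -h plum.cs.umass.edu -u cs691bd -p \
  --where="document_date='2012-05-01'" \
  sybil documents_CURRENT > one_day.sql
```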