CMPSCI 691 Big Data Systems, Fall 2012

Datasets

News, blog, and social media data: A multi-terabyte dataset of news, blogs, and social media data will be made available (details to be posted). The data arrives at roughly 35 GB/day, or about a terabyte a month, and consists of about 40M "articles" per day.
Stanford SNAP datasets: A variety of network datasets, including social, communication, citation, collaboration, web, and road networks.
Wikipedia access logs.
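
As a quick sanity check on the figures quoted for the news, blog, and social media dataset, the arithmetic can be sketched as below. This is only a back-of-envelope estimate: the function names are made up for illustration, decimal units (1 GB = 10^9 bytes) and a 30-day month are assumptions, and the page does not say whether the 35 GB/day figure refers to compressed or uncompressed data.

```python
# Back-of-envelope sizing for the news/blog/social media dataset.
# The inputs are the approximate figures quoted on this page.
# Assumptions: decimal units and a 30-day month.

GB = 10**9   # bytes per gigabyte (decimal, assumed)
TB = 10**12  # bytes per terabyte (decimal, assumed)


def monthly_volume_tb(gb_per_day: float, days: int = 30) -> float:
    """Total volume in terabytes accumulated over `days` days."""
    return gb_per_day * GB * days / TB


def avg_article_bytes(gb_per_day: float, articles_per_day: int) -> float:
    """Average size of one "article" in bytes."""
    return gb_per_day * GB / articles_per_day


if __name__ == "__main__":
    print("TB per month:", monthly_volume_tb(35))
    print("bytes per article:", avg_article_bytes(35, 40_000_000))
```

Under these assumptions, 35 GB/day comes to a little over one terabyte in a 30-day month, which matches the "terabyte a month" figure, and the average "article" works out to well under a kilobyte, which suggests that many items are short posts rather than full news stories.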

Some tools you may find useful.

Hadoop: Open-source MapReduce. Also check out related projects such as the data stores Cassandra and HBase, the learning/mining library Mahout, and the ZooKeeper distributed coordination system.
Solr or ElasticSearch: Search systems with a web server front-end that are easy to set up and use.
Amazon Web Services: A whole suite of integrable tools for running services in the cloud, e.g., EC2 for computing, S3 for distributed storage, DynamoDB for a NoSQL data store, CloudFront for a content distribution service, and many others. Most of these services are free for low usage levels, so it is easy to get started.
Swarm: A local departmental cluster to run Hadoop jobs.
HighCharts: A JavaScript-based charting library, useful if you want to visualize mined trends on a web page.
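
For readers new to the MapReduce model that Hadoop implements, the following is a minimal in-memory sketch of the map, shuffle, and reduce phases, using word count as the example. It is plain Python and does not use any Hadoop API; the names `map_words`, `reduce_counts`, and `run_mapreduce` are hypothetical and chosen only for illustration. A real Hadoop job would distribute these phases across the cluster and read its input from a distributed file system.

```python
from collections import defaultdict
from typing import Callable, Dict, Iterable, Iterator, List, Tuple


def map_words(doc_id: str, text: str) -> Iterator[Tuple[str, int]]:
    """Map phase: emit (word, 1) for every word in one document."""
    for word in text.split():
        yield word.lower(), 1


def reduce_counts(word: str, counts: List[int]) -> Tuple[str, int]:
    """Reduce phase: sum all the counts emitted for one word."""
    return word, sum(counts)


def run_mapreduce(
    records: Iterable[Tuple[str, str]],
    mapper: Callable[[str, str], Iterator[Tuple[str, int]]],
    reducer: Callable[[str, List[int]], Tuple[str, int]],
) -> Dict[str, int]:
    """Run mapper and reducer on one machine (an illustration only)."""
    # Shuffle phase: group every emitted value under its key.
    groups: Dict[str, List[int]] = defaultdict(list)
    for key, value in records:
        for out_key, out_value in mapper(key, value):
            groups[out_key].append(out_value)
    # Reduce each key's group, visiting keys in sorted order.
    return dict(reducer(k, vs) for k, vs in sorted(groups.items()))


if __name__ == "__main__":
    docs = [("a", "big data big"), ("b", "Data systems")]
    print(run_mapreduce(docs, map_words, reduce_counts))
```

The point of the sketch is the division of labor: the mapper and reducer see only one record or one key at a time, which is what lets a framework such as Hadoop run many copies of them in parallel.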