CS691: Big Data Systems

 

Fall 2012

     

Resources

Review template

Mining of Massive Datasets (online book)

Datasets

  • News, blog, and social media data: A multi-terabyte dataset of news, blogs, and social media data will be made available (details to be posted). The data arrives at roughly 35 GB/day, or about a terabyte a month, and consists of about 40M "articles" per day.
  • Stanford SNAP datasets, a variety of social, communication, citation, collaboration, web, road, etc. network datasets.
  • Wikipedia access logs.
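
For planning storage and processing budgets, the figures quoted for the news, blog, and social media dataset can be checked with a quick back-of-envelope calculation. This is only a rough sketch: it assumes decimal gigabytes and a 30-day month, neither of which is stated above, and the average article size it derives is an estimate, not a published figure.

```python
# Back-of-envelope sizing for the news/blog/social media dataset.
GB = 10**9                     # assumption: decimal gigabytes
TB = 10**12                    # assumption: decimal terabytes

daily_bytes = 35 * GB          # ~35 GB/day, as quoted in the dataset description
daily_articles = 40_000_000    # ~40M "articles" per day, as quoted
days_per_month = 30            # assumption: 30-day month

monthly_tb = daily_bytes * days_per_month / TB
avg_article_bytes = daily_bytes / daily_articles

print(f"monthly volume: {monthly_tb:.2f} TB")
print(f"average article size: {avg_article_bytes:.0f} bytes")
```

Under these assumptions the monthly volume comes to about 1.05 TB, consistent with "a terabyte a month", and the average "article" is under a kilobyte, which suggests that most items are short posts rather than full-length news stories.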

Project ideas and requirements

Some tools you may find useful.

  • Hadoop: Open-source MapReduce. Also check out related projects such as the data stores Cassandra and HBase, the learning/mining library Mahout, the ZooKeeper distributed coordination system, etc.
  • Solr or ElasticSearch for search systems with a web server front-end that is easy to set up and use.
  • Amazon Web Services: A suite of tools that work together for running services in the cloud, e.g., EC2 for computing, S3 for distributed storage, DynamoDB for a NoSQL data store, CloudFront for content distribution, and many others. Most of these services are free at low usage levels, so it is easy to get started.
  • Swarm: A local departmental cluster to run Hadoop jobs.
  • HighCharts: A JavaScript charting library that produces attractive charts, useful if you want to visualize mined trends on a webpage.