Resources
Review template
Mining of Massive Datasets (online book)
Datasets
- News, blog, and social media data: A multi-terabyte dataset of news, blogs, and social media data will be made available (details to be posted). The data arrives at roughly 35 GB/day, or about a terabyte a month, and consists of about 40M "articles" per day.
- Stanford SNAP datasets: A variety of network datasets (social, communication, citation, collaboration, web, road, and others).
- Wikipedia access logs.
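As a rough sizing aid for the news/blog/social media dataset, the quoted figures (35 GB/day, 40M articles/day) can be turned into a monthly volume and an average article size. This is only a back-of-the-envelope sketch: the 30-day month and decimal units (1 GB = 10^9 bytes) are assumptions, and the function names are illustrative.

```python
# Figures quoted in the dataset description.
GB_PER_DAY = 35
ARTICLES_PER_DAY = 40_000_000

# Assumptions (not from the dataset description): 30-day month, decimal units.
DAYS_PER_MONTH = 30
BYTES_PER_GB = 1e9


def monthly_volume_tb(gb_per_day=GB_PER_DAY, days=DAYS_PER_MONTH):
    """Approximate monthly data volume in terabytes."""
    return gb_per_day * days / 1000


def avg_article_bytes(gb_per_day=GB_PER_DAY, articles_per_day=ARTICLES_PER_DAY):
    """Approximate average size of one "article" in bytes."""
    return gb_per_day * BYTES_PER_GB / articles_per_day


if __name__ == "__main__":
    print(f"Monthly volume: ~{monthly_volume_tb():.2f} TB")
    print(f"Average article size: ~{avg_article_bytes():.0f} bytes")
```

Under these assumptions the average item is under a kilobyte, which suggests that many "articles" are short posts rather than full news stories; plan storage and parsing accordingly.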
Project ideas and requirements
Some tools you may find useful.
- Hadoop: Open-source MapReduce. Also check out related projects such as the data stores Cassandra and HBase, the learning/mining library Mahout, the ZooKeeper distributed coordination system, etc.
- Solr or ElasticSearch for search systems with a web server front-end that is easy to set up and use.
- Amazon Web Services: A whole suite of integrated tools for running services in the cloud, e.g., EC2 for computing, S3 for distributed storage, DynamoDB for a NoSQL data store, CloudFront for content distribution, and many others. Most of these services are free at low usage levels, so it is easy to get started.
- Swarm: A local departmental cluster to run Hadoop jobs.
- Highcharts: A JavaScript-based library for drawing good-looking charts, useful if you want to visualize mined trends on a web page.
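If you have not used MapReduce before, the programming model behind Hadoop can be tried locally before touching a cluster. The sketch below simulates the map, shuffle, and reduce phases of a word count in plain Python. It does not use the Hadoop API; all function names are hypothetical and chosen only for illustration.

```python
from collections import defaultdict


def map_phase(doc):
    """Map: emit a (word, 1) pair for every word in one document."""
    for word in doc.lower().split():
        yield word, 1


def shuffle(pairs):
    """Shuffle: group all emitted values by key, as the framework would."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key, values):
    """Reduce: sum the counts collected for one word."""
    return key, sum(values)


def word_count(docs):
    """Run the three phases in-process over an iterable of documents."""
    pairs = (pair for doc in docs for pair in map_phase(doc))
    return dict(reduce_phase(k, vs) for k, vs in shuffle(pairs).items())


if __name__ == "__main__":
    print(word_count(["big data big", "data mining"]))
```

On a real cluster the framework performs the shuffle and distributes the map and reduce calls across machines; only the map and reduce logic is written by the user.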