CMPSCI 691 Big Data Systems, Fall 2012

CS691: Big Data Systems

Fall 2012

Projects

The projects are meant to be open-ended and designed by you. A reasonable project must meet three requirements: (1) the task must be accomplished using at least 4 machines, either in a local cluster or distributed across a wide-area network; (2) the task must be interesting and well motivated by applications in use today or in the near future; and (3) the end result should be a simple, cool demo (or a short paper) that can be easily understood and appreciated by your friends NOT in computer science.

Deliverables: (1) A project proposal, 1-3 pages long, due by ~~Oct 8~~ Oct 11, that outlines the goals, the motivation, the architecture of a multi-node system to accomplish your project, a timeline of at most 8 weeks with weekly or bi-weekly milestones, and an assessment of the risks that could keep you from meeting the proposed goals in the proposed time, along with how you plan to trim the goals accordingly. (2) A project report expanding on the proposal with quantitative data on the insights gathered through the system you built. These could be either interesting data insights verifying or rejecting a hypothesis (see examples below) or performance numbers showing why your clever system design is superior to a naive way of accomplishing the same task.

Project ideas:

News, blogs, and social media: Because of its size and coverage, this data lends itself to a number of interesting forms of analysis, some of which are listed below.

Meme tracking: What words or phrases are currently "hot"? While words are easy, it gets harder to automatically infer memes based on more complex phrases, especially those that can appear in many different forms. It may be easier to restrict the scope of the tracking to a specific domain where you have more domain knowledge about how memes evolve. Check out this web site about meme tracking for the previous US elections.
Sentiment mining: Several online services attempt to monitor the social sentiment around companies (e.g., Radian6), individuals (e.g., Klout), popular opinion on a hot topic, and so on. These sentiments are inferred by scanning articles for the presence of "positive" or "negative" words, e.g., "rising star" likely means the article is positive and "rising antipathy" likely means otherwise. Inferring the sentiment of an article is itself a nontrivial problem, and you should, to the extent possible, reuse available open-source tools instead of re-inventing the wheel. Current examples of such sentiment mining services include election.twitter.com.
Predictive analytics: Being able to predict trends is significantly more powerful than identifying historic trends. Some examples are below.
- Can you analyze correlations between online data sources and real-life events? For example, can you predict poll numbers for the US 2012 Elections based on tweets, blogs, or news articles? Treating polling data (obtained by actually asking people who they will vote for), from listings such as here, as "ground truth", can you analyze how well the polls are predicted by articles mentioning the running candidates? Which of the sources (news, blogs, tweets) are more predictive? Note that you may also find that the online articles have little predictive value and are simply an effect of popular sentiment reflected in the polls.
- Some have even suggested that Twitter sentiment can predict the Dow Jones Index, e.g., this paper. Can you verify this claim using the data above by mining the sentiment of articles referring to names or stock symbols of companies? Note: you can download historic stock quotes data from online sources such as Yahoo Finance.
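
As a rough illustration of the sentiment and prediction ideas above, here is a minimal single-machine sketch in Python. It scores text against a tiny word lexicon and then measures how strongly a daily sentiment series correlates with another daily series (such as poll numbers or stock closes), optionally shifted by a lag. The lexicon contents and the function names `sentiment_score`, `pearson`, and `lagged_pearson` are assumptions made for illustration only; a real project would reuse an open-source sentiment tool and distribute the scoring across machines.

```python
import math
import re

# Hypothetical toy lexicons for illustration only; not a real sentiment resource.
POSITIVE = {"star", "win", "strong", "great"}
NEGATIVE = {"antipathy", "lose", "weak", "scandal"}


def sentiment_score(text):
    """Return (positive hits - negative hits) / token count, or 0.0 if no tokens."""
    tokens = re.findall(r"[a-z']+", text.lower())
    if not tokens:
        return 0.0
    pos = sum(1 for t in tokens if t in POSITIVE)
    neg = sum(1 for t in tokens if t in NEGATIVE)
    return (pos - neg) / len(tokens)


def pearson(xs, ys):
    """Pearson correlation of two equal-length numeric sequences."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return 0.0  # a constant series carries no correlation signal
    return cov / math.sqrt(var_x * var_y)


def lagged_pearson(xs, ys, lag):
    """Correlate xs[t] with ys[t + lag]; a positive lag asks whether xs leads ys."""
    if lag >= 0:
        return pearson(xs[:len(xs) - lag], ys[lag:])
    return pearson(xs[-lag:], ys[:len(ys) + lag])
```

Comparing `lagged_pearson(sentiment, polls, 1)` with `lagged_pearson(sentiment, polls, -1)` is one simple way to probe whether online sentiment tends to lead the polls or merely follow them, although correlation alone does not establish either direction.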

Wikipedia access logs: You can analyze several questions of a flavor similar to those above by looking at Wikipedia access logs. Which pages are the most frequently updated? Can you identify correlations between Wikipedia update frequency and real-life events?
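
A first pass over such logs might simply total requests per page. The sketch below assumes each log line has four whitespace-separated fields, `project page_title request_count bytes`; that layout, and the function name `count_page_requests`, are assumptions for illustration, so check the format of the files you actually download. Note that it counts page requests; measuring update frequency would likely require edit history rather than access logs.

```python
from collections import Counter


def count_page_requests(lines, project="en"):
    """Sum request counts per page title for one project, skipping malformed lines.

    Assumed line format (hypothetical): "<project> <page_title> <count> <bytes>".
    """
    totals = Counter()
    for line in lines:
        parts = line.split()
        if len(parts) != 4:
            continue  # not in the assumed four-field layout
        proj, title, count, _size = parts
        if proj != project:
            continue
        try:
            totals[title] += int(count)
        except ValueError:
            continue  # non-numeric count field
    return totals


# Usage: feed lines from any number of log files, then take the top pages.
sample = [
    "en Main_Page 10 1000",
    "en Big_data 3 500",
    "de Main_Page 7 700",
    "en Main_Page 5 900",
    "not a valid line at all",
]
top_pages = count_page_requests(sample).most_common(2)
```

Because the per-page totals are plain sums, each machine can process its own subset of files and the resulting counters can be merged by addition, which fits the multi-machine requirement.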

SNAP datasets: The datasets and related papers should be a natural source of ideas for a variety of analyses relating to how networks evolve, how users perform correlated actions (like purchases on Amazon), location-based social networking, etc.