CS 585, Fall 2017, UMass Amherst: Introduction to Natural Language Processing

Final Project

Poster session and list of projects

There will be two poster sessions on Tuesday, 12/12, in CS room 150/151.

Session 1: 3:30-5:00

Theodore Proulx, Bryce Bodley-Gomes: Song Genre Classification
Katherine Mayo: Predicting Sentiment in StockTwits and Financial News Headlines Using an RNN
Quan Hoang: Predicting Movie Genres Based on Plot Summaries
Dani Mednikoff, Jen Zhu: Emotion Classification of Song Lyrics
Janani Krishna, Bhuvana Surapaneni: Sentiment Analysis and Summarization of Reviews
Jay Shah, Twinkle Tanna: Wordify: A Reverse Dictionary for Everyone
Karthik Anantharamu: Natural Language Interface for Database Systems
Emily Yu, Stanley Lok: Sarcasm Detection in Social Media
Pritish Yuvraj, Neelesh Kumar Boddu: Conversational Agent Models (Chatbots)
Kriti Shrivastava, Yicheng Shao, Tapojit Debnath Tapu: Predicting Intensity of Emotion in Tweets
Dhanya Bhaskar Bhat, Ramya Sarma: Sentiment Analysis of Financial News
Aarsh Patel, Lynn Samson: Modeling Affect Intensity in Tweets
Brennan Waters, Roy Jackman, Peter DelMastro: Temporal Resolution of Texts through the X-Bar Theory of Phrase Structure
Harihara Subramanian, Alexander Lamson: Predicting subreddits from reddit posts
Jie Song, Ruisi Zhang, Chen Pan: Sentiment Analysis and Opinion Mining on YELP Restaurant Reviews
Zhou Xu, Buqin Wang, Zhejun Shen: Similarity of Questions Pairs
Nishit Parekh: Aspect Extraction using Dependency Parsing and Semantic Clustering
Varun Sharma, Hisham Alhussain: An Approach for Emoji Prediction for Tweets Based on Sentiment Analysis
Ashish Ranjan, Mili Shah, Rheeya Uppaal: Character Identification on Multiparty Dialogues
Gahyun Kim, Richard Cui, Tri Nguyen: Birds of a Feather: Friend-Finding on Twitter
Liu Hang, Wenbo Xie, Wei Xie: Sentiment Analysis of Amazon Reviews
Ivan Bunin: Classifying Anonymous Texts
Craig Fan, Alan Van Dijk, Nick Laurin: Song Lyric Generation
Daanial Ahmed: An unsupervised learning technique for Automatic Summarization using TextRank
Jonathan Bailey, Francis DeBlasio, Mary Buckler: Sentiment Analysis of Twitter
Matthew Robinson, Rachit Nigam, Sebastian Lacki: Emoji Classification for Tweets
Terry Breen: Hypernym Discovery in the Domain of Music
Pengshan Cai: A joint neural model for entity and relation extraction in medical literatures
Zhichao Yang, Dongxu Zhang: Take Other Candidate Answers into Consideration: A New Scoring Approach for Answer Selection
Chen Zhou, Huan Wang, Rui Pan: Irony detection in tweets
Craig Norton, Hassaan Khan, Mikey Shlisleberg: Vulgar Language Processing
Nikolai Narma, Joseph Isble, Dylan Bowers: Predicting Audience through a Collective Iterative Learning System
Amanda Pellerite: Twitter Sentiment Classification and Analysis
Michael Au: Determining toxicity in online discussions
Ninad Khargonkar, Shreesh Ladha: Cross Lingual Transfer Learning for Hindi POS tagging
Mohit Uniyal, Shruthi Yashavanth: Emoji Prediction of Twitter data
David Ter-Ovanesyan, Harry Koumjian: An approach for modeling emotional intensity using Twitter
Trevor Kearns, Ian Torres: Affect in the Poetry of Emily Dickinson
Mike Sadler, Siddarth Patel: Author Identification Using Text Classification

Session 2: 5:00-6:30

Roy Chan: Wikipedia Question Answering
Zitao Wang, Shiyan Yin, Ruifeng Wang: Financial News and Stock Markets: An Experiment on Finding Influential News Company
Krishna Prasad Sankaranarayanan, Sree Harsha Ramesh: Evaluating Deep Learning Approaches for Character Identification in Multiparty Dialogues
Kristoffer Johansen, Corey Clemente, Swapnil Debarshi: Text Analysis and Modeling of Job Advertisements
Zoey Sun: Application of Relation Extraction on Financial Statements Logical Relationship Recognition
Daksh Jotwani, Avaneesh Reddy Gavva: Defeating Plagiarism Checkers
Gota Gando: Generating Natural Language Inference Sentences
Albert James, Shankar Venkitachalam: Machine Comprehension on SQuAD
Sean Chickosky, Ozias Gonet: Detecting Directed Insults on Social Media
Vedant Puri, Shubham Mehta, Ronit Arora: What's your emoji
Shikha Agarwal: Irony detection in tweets
Deeksha Razdan, Nikhil Adhe: Review's take on revenue
Lopamudra Pal, Ly (Harriet) Bui, Rishi Mody: Impact of Sentiment Analysis of movie reviews on Revenue Prediction
Alex Karle, Makenzie Schwartz, William Warner: Predicting Emojis From Tweet Text Through Sentiment Analysis
David Boslee: Predicting stock market trends with twitter sentiment analysis
Yonatan Rubenstein, Zachary Tousignant: Character-to-Character Sentiment Analysis in Plays
Kevin Joseph, Sreekar Reddy, Srijan Mishra: Authorship Attribution
Ravi Agrawal, Abhay Doke: Reading Comprehension using Bidirectional Attention Flow model
Nick Merlino, Ben Kaufman, Sagar Thapar: Predicting Movie Revenue Given Plot Summary
Kishalve Pethia, Srikanth Prabala: Annotating articles related to malwares for information retrieval
Sanjay Reddy Satti, Aditya Agrawal: An approach for author profiling on data from heterogeneous sources
Ajinkya Zadbuke, Arhum Savera: Stance Detection to Identify Fake News
Zhuohan Zeng, Dewei Li: Feature Identification of Dishes Based on Yelp Reviews
Xiaoyi Duan, Zixin Kong: Sarcasm Detection in Social Media using Deep Learning
Nikhil Garg, Shivangi Singh: Predicting the quality of different aspects of a restaurant based on its 3-star Yelp reviews
Misha Kanai: Price Prediction of Alternative Cryptocurrencies using Telegram Group Chats
Justin Martinelli, Chenhao Huang: Guess The Emoji Using Twitter Dataset with Natural Language Processing
Xin Liu, Jucong He, Che-Ting Lin: Predicting Yelp Restaurant Review Ratings
Ajinkya Indulkar, Divyendra Mikkilineni: Yes/No Question Answering
Hao Liu, Jian Yang: Sentiment Analysis with Twitter
Logan Rennick: Satire Detection using Machine Learning
Nicholas Bertrand, Gregory Herman, Ilan Shenar: Predicting Release Year of Song Based on Changes in Lyric Linguistics
Monark Modi: Predicting Stock Market trend using Sentiment Analysis of Financial Data
Trevor Brown: Twitter Analysis: Discovering pictures of pets
YangJunqing Qiao: Emoji Classification Predictions in Tweets
Joseph Svrcek: Geographic Representation of Language
Jin Huang: Information Retrieval and Named Entities Recognition for Gun Violence based on Neural Network
Corwin Burdick: Creating a Codenames AI
Chih-Yu Hsu: Unsupervised hypernym detection in material science papers using distributional inclusion vector embedding

Introduction

The final project is to either build a natural language processing system or apply one to some task. The project must use or develop a dataset, and report empirical results or analyses with the dataset. It may use machine learning or rule-based approaches. It may use any type of open-source or widely available software.

You can choose to emphasize:

Implementing and developing algorithms and features.
Defining a new linguistic / text analysis task, and tackling it with off-the-shelf NLP software.
Collecting and exploring a new textual dataset to address research hypotheses about it.

Different projects will have different balances of these three things.

The key requirement is to investigate, analyze, and come to research findings about new methods, or insights about previously existing methods.

This course does not have a final exam. The final project is the focus for the final part of the course.

The project will be completed in groups of 1-3. We encourage size 2, which often works well.

The project has four components over the second half of the semester: Proposal, Progress Report, Presentation, and Final report.

(Requirements for the items after the proposal are subject to revision as we get closer to them.)

Proposal (due 10/17)

A 2-4 page document outlining the problem, your approach, and possible dataset(s) and/or software systems to use. The proposal:

Describes the scope of the proposed work, which we will use to help give feedback and define what is necessary to complete for the project.
Cites and briefly describes at least two pieces of relevant prior work (typically research papers).
Proposes at least one dataset to use or try using. You must learn a bit about it and convince us that it is available to you, that you can easily get it, and that it is appropriate for the task and research questions you care about.
Proposes what pre-existing software will be used to accomplish the analysis task.
Says whether human annotation will be required, and if so, how much.
Proposes a preliminary experiment to run on the data (this will be reported on in the progress report), as well as the scope of the final total project.

In general, you should illustrate that you have learned about and thought through some of the problem space and possible avenues of analysis and approaches to the problem.

Ideally, try to answer the following questions as well.

Will your project require human annotation, or will you use a ready-made (or “pre-baked”) dataset? We think it is absolutely great to make your own annotations or labels, since this lets you be more creative in defining the task you care about. However, make sure to build in time for this. We suggest you use yourselves (the project group members) as the human annotators. If you can get others to do it, that’s great, but be aware that it can be difficult to manage, depending on the situation.
How will the train/test split be done? Some datasets have one built in. Some do not. Will you split examples by document? Author? Location? Book? Time?
What baseline algorithm will you use? A baseline algorithm is one that is very simple and trivial to implement. For example, “predict the most common class,” or “tag all capitalized words as names,” or “select the first sentence in the document”. Sometimes it can be difficult to get a fancy algorithm to beat a baseline. “Always ask yourself, ‘What’s the simplest experiment I could do to (in)validate my hypothesis?’ Talented researchers have a knack for coming up with simple baselines.”
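As an illustration of the last two questions, here is a minimal Python sketch of a train/test split made by author, followed by a "predict the most common class" baseline. Everything in it is an assumption made for illustration: the toy examples, the field names ("author", "text", "label"), the helper names, and the 20% test fraction are hypothetical and not part of the assignment. Your own project may call for a different split or baseline.

```python
import random
from collections import Counter


def split_by_group(examples, group_key, test_fraction=0.2, seed=0):
    """Split so that all examples sharing a group (e.g. an author) land on one side."""
    groups = sorted({ex[group_key] for ex in examples})
    rng = random.Random(seed)  # fixed seed so the split is reproducible
    rng.shuffle(groups)
    n_test = max(1, round(len(groups) * test_fraction))
    test_groups = set(groups[:n_test])
    train = [ex for ex in examples if ex[group_key] not in test_groups]
    test = [ex for ex in examples if ex[group_key] in test_groups]
    return train, test


def majority_class_baseline(train_labels):
    """Return a predictor that always outputs the most common training label."""
    most_common_label, _count = Counter(train_labels).most_common(1)[0]
    return lambda _text: most_common_label


def accuracy(gold, predicted):
    """Fraction of positions where the prediction matches the gold label."""
    return sum(g == p for g, p in zip(gold, predicted)) / len(gold)


# Hypothetical toy dataset: five authors, two labeled texts each.
examples = [
    {"author": "a1", "text": "great movie", "label": "pos"},
    {"author": "a1", "text": "loved it", "label": "pos"},
    {"author": "a2", "text": "terrible plot", "label": "neg"},
    {"author": "a2", "text": "fine acting", "label": "pos"},
    {"author": "a3", "text": "boring", "label": "neg"},
    {"author": "a3", "text": "wonderful score", "label": "pos"},
    {"author": "a4", "text": "a delight", "label": "pos"},
    {"author": "a4", "text": "too long", "label": "neg"},
    {"author": "a5", "text": "charming", "label": "pos"},
    {"author": "a5", "text": "fun", "label": "pos"},
]

train, test = split_by_group(examples, group_key="author")
predict = majority_class_baseline([ex["label"] for ex in train])
baseline_accuracy = accuracy(
    [ex["label"] for ex in test],
    [predict(ex["text"]) for ex in test],
)
print(f"train={len(train)} test={len(test)} baseline accuracy={baseline_accuracy:.2f}")
```

Splitting by author rather than by individual example keeps one person's writing from appearing on both sides of the split, which can otherwise make test accuracy look better than it is. The same function works for other grouping keys such as document, book, or location.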

Formatting: please use a 10 to 12 point font with single spacing.

Submit via the HotCRP system on cs585projects.cs.umass.edu. It should allow you to do a group submission.

In special cases, some groups may want to change after this point. That's OK, but please be very clear when doing later turn-ins.

Peer feedback

After submitting your project proposal, you will be assigned other proposals to give feedback on.

Progress Report (due November 20, not November 17)

You’ve had a few weeks to work on the project! You have now clarified and revised your proposed idea. You have started working on it and have some preliminary results to report.

The progress report is a 5-10 page document that describes your preliminary work and results. You should do, and report on, work including the following:

Acquire your dataset. Report its source, its basic statistics (size, number of words/sentences/documents) and other important properties.
If your project involves annotation, you’ve started a pilot annotation experiment, annotating a few dozen or few hundred examples. What major issues have come up? Do you and your project partner agree or disagree on examples? (At this stage, qualitative findings about these questions are fine.)
Run some sort of NLP algorithm — classifier, parser, etc. — on the data, and report its result. If you are using a ready-made dataset, you should define a train/test split, and you should have at least one accuracy number to report at this point.
The report must contain at least one table or graph that conveys numerical information – for example, statistics about your data or annotations, accuracy or other results of running an algorithm on the data, or something else.
You now have a better idea of how much you can accomplish in the rest of the semester. Lay out the major items you want to accomplish. Provide a timeline to finish them by.
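For the dataset statistics and pilot annotation items above, a short script is often enough to produce the numbers for a table. The following is a minimal Python sketch under stated assumptions: documents are plain strings, tokens are found by whitespace splitting (a simplification), and the toy documents and annotator labels are hypothetical. Cohen's kappa is shown only as one possible way to quantify agreement between two annotators; the assignment says qualitative findings are fine at this stage.

```python
from collections import Counter


def dataset_statistics(documents):
    """Basic corpus statistics, using whitespace tokenization (a simplification)."""
    token_counts = [len(doc.split()) for doc in documents]
    vocabulary = {tok.lower() for doc in documents for tok in doc.split()}
    return {
        "documents": len(documents),
        "tokens": sum(token_counts),
        "vocabulary": len(vocabulary),
        "mean_tokens_per_document": sum(token_counts) / len(documents),
    }


def cohen_kappa(labels_a, labels_b):
    """Agreement between two annotators, corrected for chance agreement."""
    n = len(labels_a)
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    freq_a = Counter(labels_a)
    freq_b = Counter(labels_b)
    # Chance agreement: probability both annotators pick the same label independently.
    expected = sum(freq_a[label] * freq_b[label] for label in freq_a) / (n * n)
    if expected == 1.0:  # both annotators used a single identical label throughout
        return 1.0
    return (observed - expected) / (1 - expected)


# Hypothetical toy corpus and pilot annotations from two group members.
documents = ["the cat sat", "the dog barked loudly", "a bird"]
annotator_1 = ["pos", "pos", "neg", "neg"]
annotator_2 = ["pos", "neg", "neg", "neg"]

stats = dataset_statistics(documents)
kappa = cohen_kappa(annotator_1, annotator_2)
for name, value in stats.items():
    print(f"{name}: {value}")
print(f"kappa: {kappa:.2f}")
```

For a real project you would likely replace the whitespace split with a proper tokenizer and report sentence counts as well. The point is only that these numbers are cheap to compute and fit naturally in the required table.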

Poster session presentations (near end of classes)

We will have a poster session where all groups will present their work. It will be open to the community, in conjunction with the Data Science Tea. It should be fun! Logistical details forthcoming.

Final Report (due 12/22 at end of semester)

The final report is a 12 to 20 page document that describes your project and final results. Unlike the proposal (which was only about a possible project and related work), or the progress report (which was only about results), the final report must be a complete, standalone document. Conceptually, it should include the content of both the proposal and progress report, though they will be changed. The final report describes and motivates the problem, places it in context of related work, describes the dataset and your approach, and reports results with discussion and thoughts for future work.

Submit your PDF on Gradescope, and implementation on Moodle. (Moodle limits the size of the zip file, so don’t include large data files, but feel free to provide us a URL to them.)

Here is a sample outline for your final report. There are different possible ways to structure it (for example, if you can, you can weave related work into the other sections), but we suggest you follow this outline unless you have substantial prior experience writing technical reports and research papers.

Abstract: summarize the main components of your work in one paragraph (no more than 5 sentences). What problem are you solving? What is the key to your approach? What results did you achieve? Your abstract should draw the reader in and interest them in reading the rest of your paper to understand the details of your work.
Introduction: explain the problem, motivate it (why is it important?), and briefly describe your approach. State a research question that your project seeks to answer: what are you trying to learn from this research project? You may also report some of your results without discussing the details of your method.
Related work: explain what other approaches have been taken to the problem. Cite specific instances of previous work. (Note that 585-02 students have an extra requirement here: see below.)
Data: Describe the dataset that you are using.
Method: Describe your approach to handling the problem. This should include any models you used and any modeling assumptions you made. If you’ve developed new models for this project, you may even want to split a description/analysis of your models into its own section.
Results: Describe the experiments you ran and identify your baseline method(s). Include the results you achieved with the various methods you are comparing. This section will probably also include some figures that succinctly summarize your results. Analyze your results (including your models). If you did exploratory analysis or a significant amount of feature engineering, your analysis may merit its own section. After reading this section (and your dataset and methods), an interested reader should be able to duplicate your experiments and results.
Discussion and Future Work: discuss any implications of your analysis for the problem as a whole, and what the next steps for future work are. Any other concluding remarks should go here.

Some things to remember:

WRITE CLEARLY. A good paper is direct, unambiguous and describes all stages of the work in complete—but not superfluous—detail. You must provide this, while not causing confusion to the reader. Quality of writing will affect your grade.
All plots, figures, and tables must have a title, labeled axes and a caption. There should be no confusion about what you’re trying to express, or why the plot was included.
Do not include multiple figures that all convey the same information. Be succinct.
Avoid redundancy. Do not describe the same component of your work twice.
Do not include much code in your report. If your approach contains a new algorithm you developed, it may be appropriate for you to include its pseudocode, but do not copy and paste code from your editor. Do not include well-known algorithms like the perceptron training algorithm. (If you feel like you need to, instead include a citation to the literature that describes the algorithm, and provide a brief English description of what it does.)
You may include back-references to things you’ve already discussed. Avoid forward references.
Provide a title for your report.

Writing a paper is like composing a piece of art. Be deliberate in your choice of what to include, when to include it and how to express it. For example, don’t include every plot you generated; pick the ones that best demonstrate your results. Work on the clarity of your writing. At the end of the day, your report is all the reader has to generate an opinion about your work. You don’t want good work to be obscured by poor writing.

Extra requirement for 585-02 students (graduate students): You must cite at least 10 relevant research papers, and describe them and how they relate to your work. It may be convenient to structure this as a related work or literature review section.

Formatting: please use a 10 to 12 point font with single or 1.5 spacing. Please divide your report clearly into sections/subsections. We suggest the ACL stylesheets, though they're not necessary to use.