CS 685, Spring 2020, UMass Amherst
Final Project
Introduction
The final project is to build and apply a natural language processing system.
The project must use or develop a dataset, and report empirical results or analyses with the dataset. It may use machine learning or rule-based approaches. It may use any type of open-source or widely available software.
You can choose to emphasize:
- Implementing and developing algorithms and features.
- Defining a new linguistic / text analysis task.
- Collecting and exploring a new textual dataset to address research hypotheses about it.
Different projects will have different balances of these three things.
The project will be completed in groups of 1-3.
For larger groups, we expect a commensurately larger amount of work, such as more experiments and more analysis.
Proposal (due the Monday after spring break: 3/23)
The proposal is a 1-3 page document outlining the problem, your approach, and possible dataset(s) and/or software systems to use. The proposal:
- Provides a title for your project and lists all project members.
- Describes the scope of the proposed work, which we will use to help give feedback and define what is necessary to complete for the project.
- Proposes at least one dataset to use or try using. You must learn a bit about it and convince us that it is available to you, that you can easily obtain it, and that it is appropriate for the task and research questions you care about.
- Proposes what pre-existing software will be used to accomplish the analysis task.
- Says whether human annotation will be required, and if so, how much. We encourage you to make this only a small component of the overall project, since it can potentially take a lot of time.
- Proposes a preliminary experiment to run on the data (this will be reported on in the progress report), as well as the scope of the final total project.
In general, you should illustrate that you have learned about and thought through some of the problem space and possible avenues of analysis and approaches to the problem.
Ideally, try to answer the following questions as well.
- Will your project require human annotation, or will you use a ready-made dataset? We think it is absolutely great to make your own annotations or labels — this lets you be more creative with defining the task you care about.
- How will the train/test split be done? Some datasets have one built in. Some do not. Will you split examples by document? Author? Location? Book? Time?
- What baseline algorithm will you use? This includes very simple/trivial baselines like “predict the most common class,” “tag all capitalized words as names,” or “select the first sentence in the document,” as well as simple machine learning models like logistic regression with bag-of-words features. It may also include a more sophisticated model from previous work. Check out: “Always ask yourself, ‘What’s the simplest experiment I could do to (in)validate my hypothesis?’ Talented researchers have a knack for coming up with simple baselines.”
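As an illustration of the last two questions, here is a minimal Python sketch of a split by author followed by a "predict the most common class" baseline. It is only one possible way to do this, not a required approach. The field names ("author", "text", "label"), the function names, and the toy examples are all hypothetical and stand in for whatever your dataset provides.

```python
import random
from collections import Counter, defaultdict


def group_split(examples, group_key, test_fraction=0.2, seed=0):
    """Split examples so that no group (e.g. an author) is in both train and test."""
    groups = defaultdict(list)
    for ex in examples:
        groups[ex[group_key]].append(ex)
    group_ids = sorted(groups)
    # Shuffle with a fixed seed so the split is reproducible.
    random.Random(seed).shuffle(group_ids)
    n_test = max(1, round(test_fraction * len(group_ids)))
    test_ids = set(group_ids[:n_test])
    train = [ex for g in group_ids if g not in test_ids for ex in groups[g]]
    test = [ex for g in group_ids if g in test_ids for ex in groups[g]]
    return train, test


def majority_baseline(train, test, label_key="label"):
    """Predict the most common training label everywhere; return (label, accuracy)."""
    majority = Counter(ex[label_key] for ex in train).most_common(1)[0][0]
    correct = sum(ex[label_key] == majority for ex in test)
    return majority, correct / len(test)


# Toy data for illustration only (hypothetical field names and values).
data = [
    {"author": "a1", "text": "great movie", "label": "pos"},
    {"author": "a1", "text": "loved it", "label": "pos"},
    {"author": "a2", "text": "terrible plot", "label": "neg"},
    {"author": "a3", "text": "really enjoyable", "label": "pos"},
    {"author": "a4", "text": "not good", "label": "neg"},
    {"author": "a4", "text": "fine, I guess", "label": "pos"},
]

train, test = group_split(data, "author", test_fraction=0.25)
majority, accuracy = majority_baseline(train, test)
print(f"majority label: {majority}, test accuracy: {accuracy:.2f}")
```

Splitting by group rather than by individual example keeps text from the same author (or document, book, time period) out of both sides of the split, which otherwise tends to inflate test accuracy. Any more sophisticated model should be compared against the accuracy this baseline reports.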
Progress Report (see Piazza for due dates)
You’ve had a few weeks to work on the project! You have now clarified and revised your proposed idea. You have started working on it and have some preliminary results to report.
The progress report is a 3-6 page document that describes your preliminary work and results. You should complete and report on the following work:
- Acquire your dataset. Report its source, its basic statistics (size, number of words/sentences/documents) and other important properties.
- If your project involves annotation, start a pilot annotation experiment, annotating a few dozen or a few hundred examples. What major issues have come up? Do you and your project partner agree or disagree on examples? (At this stage, qualitative findings about these questions are fine.)
- Run some sort of NLP algorithm — classifier, parser, etc. — on the data, and report its result. If you are using a ready-made dataset, you should define a train/test split, and you should have at least one accuracy number to report at this point.
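The dataset statistics and annotator agreement mentioned above can be computed with a few lines of Python. The sketch below is a rough illustration under simplifying assumptions: tokens are whitespace-separated, sentences are split naively on ., ! and ?, and agreement is measured with Cohen's kappa, which is one common choice and is not required here (qualitative findings are fine at this stage). The function names and toy inputs are hypothetical.

```python
import re
from collections import Counter


def corpus_stats(documents):
    """Rough corpus counts using whitespace tokens and a naive sentence splitter."""
    n_sentences = 0
    n_tokens = 0
    vocabulary = set()
    for doc in documents:
        # Split after sentence-final punctuation followed by whitespace.
        sentences = [s for s in re.split(r"(?<=[.!?])\s+", doc.strip()) if s]
        tokens = doc.split()
        n_sentences += len(sentences)
        n_tokens += len(tokens)
        vocabulary.update(t.lower() for t in tokens)
    return {
        "documents": len(documents),
        "sentences": n_sentences,
        "tokens": n_tokens,
        "types": len(vocabulary),
    }


def cohen_kappa(labels_a, labels_b):
    """Cohen's kappa for two annotators who labeled the same examples."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("need two non-empty label lists of equal length")
    n = len(labels_a)
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    counts_a = Counter(labels_a)
    counts_b = Counter(labels_b)
    # Agreement expected by chance, from each annotator's label distribution.
    expected = sum(counts_a[label] * counts_b[label] for label in counts_a) / (n * n)
    if expected == 1.0:
        return 1.0  # both annotators used a single identical label throughout
    return (observed - expected) / (1 - expected)


# Toy inputs for illustration only.
docs = ["I like NLP. It is fun!", "Parsing is hard."]
print(corpus_stats(docs))

annotator_1 = ["pos", "pos", "neg", "neg"]
annotator_2 = ["pos", "neg", "neg", "neg"]
print(cohen_kappa(annotator_1, annotator_2))
```

For a real project, a proper tokenizer and sentence splitter from an NLP library will give more reliable counts than these naive rules, so state in the report which tools produced your numbers.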
Presentations
See Piazza for current status.
Final Report (due at end of semester - see Piazza for specifics)
The final report is an 8-15 page document (ACL format, please) that describes your project and final results. Unlike the proposal (which was only about a possible project and related work), or the progress report (which was only about results), the final report must be a complete, standalone document. Conceptually, it should include the content of both the proposal and progress report, though that content will likely have been revised. The final report describes and motivates the problem, places it in the context of related work, describes the dataset and your approach, and reports results with discussion and thoughts for future work.
Submit your PDF on Gradescope. Also submit a zip file with your implementation code. (We'll probably take this on Moodle. Moodle limits the size of the zip file, so don’t include large data files, but feel free to provide us a URL to them.)
Here is a sample outline for your final report. There are different possible ways to structure it (for example, you could weave related work into the other sections), but we suggest you follow this outline unless you have substantial prior experience writing technical reports and research papers.
- Abstract: summarize the main components of your work in one paragraph (no more than 5 sentences). What problem are you solving? What is the key to your approach? What results did you achieve? Your abstract should draw the reader in and interest them in reading the rest of your paper to understand the details of your work.
- Introduction: explain the problem, motivate it (why is it important?), and briefly describe your approach. State a research question that your project seeks to answer: what are you trying to learn from this research project? You may also report some of your results without discussing the details of your method.
- Data: Describe the dataset that you are using.
- Method: Describe your approach to handling the problem. This should include any models you used and any modeling assumptions you made. If you’ve developed new models for this project, you may even want to split a description/analysis of your models into its own section.
- Results: Describe the experiments you ran and identify your baseline method(s). Include the results you achieved with the various methods you are comparing. This section will probably also include some figures that succinctly summarize your results. Analyze your results (including your models). If you did exploratory analysis or a significant amount of feature engineering, your analysis may merit its own section. After reading this section (and your data and method sections), an interested reader should be able to reproduce your experiments and results.
- Discussion and Future Work: discuss any implications of your analysis for the problem as a whole, and describe the next steps for future work. Any other concluding remarks should go here.
Some things to remember:
- Write clearly. A good paper is direct, unambiguous and describes all stages of the work in complete, but not superfluous, detail. Quality of writing will affect your grade.
- All plots, figures, and tables must have a title, labeled axes and a caption. There should be no confusion about what you’re trying to express, or why the plot was included.
- Do not include multiple figures that all convey the same information. Be succinct.
- Avoid redundancy. Do not describe the same component of your work twice.
- Do not include much code in your report. If your approach contains a new algorithm you developed, it may be appropriate for you to include its pseudocode, but do not copy and paste code from your editor. Do not include well-known algorithms. Cite when appropriate.
- You may include back-references to things you’ve already discussed. Avoid forward references.
- Provide a title for your report.
Writing a paper is like composing a piece of art. Be deliberate in your choice of what to include, when to include it and how to express it. For example, don’t include every plot you generated; pick the ones that best demonstrate your results. Work on the clarity of your writing. At the end of the day, your report is all the reader has to generate an opinion about your work. You don’t want good work to be obscured by poor writing.
Formatting: write using the ACL stylesheet.
Please divide your report clearly into sections/subsections.