Project Report

Due Monday 12/14, 10pm (70 points + 20 for CS294)

Your project report is the formal description of your project. The contents are similar to the presentation, but we want you to fully elaborate on what you did. The report should be 6-10 pages long. The structure is similar, with sections on:

Problem Statement and Background

Give a clear and complete statement of the problem. Don't describe methods or tools yet. Where does the data come from, what are its characteristics? (4 points)

Include the informal success measures (e.g. accuracy on cross-validated data, without specifying ROC or precision/recall etc.) that you planned to use. (4 points)
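As a concrete (and entirely optional) illustration, cross-validated accuracy can be estimated with a simple k-fold split. The sketch below is a minimal, pure-Python version; the threshold classifier and the toy data are placeholders, not part of the assignment, and in practice you would substitute your own model and dataset (or a library routine):

```python
# Minimal k-fold cross-validation sketch. The classifier and data are
# illustrative placeholders; substitute your own model and dataset.
def k_fold_accuracy(xs, ys, predict_factory, k=5):
    """Average accuracy of a classifier over k train/test splits."""
    n = len(xs)
    fold = n // k
    scores = []
    for i in range(k):
        test_idx = set(range(i * fold, (i + 1) * fold))
        train = [(x, y) for j, (x, y) in enumerate(zip(xs, ys)) if j not in test_idx]
        test = [(x, y) for j, (x, y) in enumerate(zip(xs, ys)) if j in test_idx]
        predict = predict_factory(train)
        correct = sum(1 for x, y in test if predict(x) == y)
        scores.append(correct / len(test))
    return sum(scores) / k

# Toy example: threshold classifier on 1-D points.
def make_threshold_classifier(train):
    # Predict 1 if x exceeds the mean of the training points.
    mean = sum(x for x, _ in train) / len(train)
    return lambda x: 1 if x > mean else 0

xs = [0.1, 0.2, 0.3, 0.4, 0.6, 0.7, 0.8, 0.9, 0.15, 0.85]
ys = [0, 0, 0, 0, 1, 1, 1, 1, 0, 1]
acc = k_fold_accuracy(xs, ys, make_threshold_classifier, k=5)
```

For the report at this stage, the headline number (e.g. cross-validated accuracy) is enough; the exact measures are specified later, in Results.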

Include background material as appropriate: who cares about this problem, what impact does it have, and what implications might better solutions have? Include a brief summary of any related work you know about. (2 points)

Methods

Describe the methods you explored (usually algorithms, or data cleaning or wrangling approaches). Justify your methods in terms of the problem statement.

It will be easiest to describe methods along your data pipeline: start with data collection, then cleaning and repair, then transformation, then analysis, and finally any visualizations you made. (10 points)

Parameter choices can have a big effect on performance. Make sure you know what parameters (especially defaults) you used and that they worked reasonably well on your data. (5 points)
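One lightweight way to check a parameter rather than silently accepting a default is to record a score across a few candidate values. The sketch below is purely illustrative (the parameter name and toy scorer are made up); the point is to keep a record of what you tried:

```python
# Sketch: record the effect of one parameter on a score instead of
# silently relying on a default. The scorer below is a toy stand-in.
def sweep(param_values, score_fn):
    """Map each candidate parameter value to its score."""
    return {v: score_fn(v) for v in param_values}

# Toy scorer that happens to peak at threshold 0.5.
results = sweep([0.25, 0.5, 0.75], lambda t: 1 - abs(t - 0.5))
best = max(results, key=results.get)
```

A table like `results` (parameter value vs. score) also drops straight into the Results section as evidence for your choices.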

If your results seem inconsistent with prior work or other groups', try to figure out why, and explain in your report. But don't give a manufactured explanation. There are many plausible-sounding explanations of a data anomaly, almost all of which are wrong. Make sure the evidence really supports your explanation and not others.

Be sure to include every method you tried, even if it didn't "work" or perform as well as your final approach. When describing methods that didn't work, make clear how they failed and any evaluation metrics you used to decide so. (5 points)

Results

Give a detailed summary of the results of your work. Here is where you specify the exact performance measures you used. Be sure to justify your measure(s) in terms of the goals of your project. Usually there will be some kind of accuracy or quality measure. There may also be a performance (runtime or throughput) measure. (10 points)
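For classification-style projects, the common measures can be computed directly from true and predicted labels. The sketch below is one hedged example (binary labels; the label lists are illustrative only), not a prescription of which measures your project should report:

```python
# Sketch: common classification measures from true vs. predicted labels
# (binary case; the example labels below are illustrative only).
def classification_measures(y_true, y_pred, positive=1):
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == positive and p == positive)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t != positive and p == positive)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == positive and p != positive)
    correct = sum(1 for t, p in zip(y_true, y_pred) if t == p)
    accuracy = correct / len(y_true)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return {"accuracy": accuracy, "precision": precision, "recall": recall}

m = classification_measures([1, 1, 0, 0, 1, 0], [1, 0, 0, 1, 1, 0])
```

Whichever measures you pick, tie them back to the problem statement: e.g. precision matters more when false positives are costly, recall when misses are.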

Ideally you should give results across some variations of your solution like different model types or different parameter choices. (5 points)

Please use visualizations and graphs whenever possible. Include links to interactive visualizations if you built them. (5 points)

It would be reasonable to submit your report as a notebook, but please make sure it runs on one of the two standard environments (the laptop VM or the EC2 instance) and that you include any required files. You can also submit a separate notebook as an appendix to your report if that makes the visualization/interaction task easier.

Similarly, it's fine to link to (or submit, though the file size may be problematic for bCourses) a video showing your system running.

Tools

Describe the tools that you used and the reasons for their choice. Justify them in terms of the problem itself and the methods you want to use. (5 points)

Tools will probably include machine learning, and possibly data wrangling and visualization. Please discuss all of them.

How did you employ them? What features worked well and what didn't? (5 points)

Describe any tools that you tried and ended up not using. What was the problem?

Lessons Learned

In this section, give a high-level summary of your results. If the reader reads only one section of the report, it should be this one, and it should be self-contained. You can refer back to the "Results" section for elaboration. This section should be less than a page. In particular, emphasize any results that were surprising and what your exploration of them yielded. (10 points)


CS294-16 Students

If your team comprises students taking the mezzanine version of the class (CS294-16), you should evaluate a primary model and, in addition, a "baseline" model. The baseline is typically the simplest model that's applicable to the data problem, e.g. Naive Bayes for classification, or K-means on raw feature data for clustering. (10 points)
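To make the idea concrete: an even simpler baseline than Naive Bayes, usable as a sanity floor for almost any classification task, is a majority-class predictor. This sketch is illustrative only (the labels and evaluation points are made up), and is not a substitute for the baseline the assignment asks for:

```python
# Sketch of a trivial baseline: always predict the majority class of the
# training labels. The labels below are illustrative placeholders.
from collections import Counter

def majority_baseline(train_labels):
    majority, _ = Counter(train_labels).most_common(1)[0]
    return lambda x: majority  # ignores the input entirely

predict = majority_baseline([0, 0, 0, 1, 0, 1])
test_set = [(10, 0), (20, 0), (30, 1)]  # (instance, true label) pairs
baseline_acc = sum(1 for x, y in test_set if predict(x) == y) / len(test_set)
```

A primary model that cannot beat this kind of floor is a strong signal that something in the pipeline needs investigation.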

If there isn't a plausible automatic baseline model, you can, for example, compare with human performance by having someone hand-solve your problem on a small subset of data. You won't expect to achieve this level of performance, but it establishes a scale by which to measure your project's performance. Try to use labor efficiently: if the data is mostly negative instances, use your system to predict labels and then give the human a more balanced selection of instances (or one more likely to be balanced, based on your model's predictions).
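One simple way to pick such a selection is to take half of the labeling budget from the instances your model scores highest (likely positives) and half from those it scores lowest (likely negatives). The function and data below are illustrative placeholders, sketching the idea rather than prescribing it:

```python
# Sketch: use model scores to pick a roughly balanced subset for hand
# labeling when positives are rare. Scores and budget are illustrative.
def balanced_sample(items, scores, budget):
    """Half the budget from the highest-scoring items (likely positives),
    half from the lowest-scoring (likely negatives)."""
    ranked = sorted(zip(scores, items), reverse=True)
    half = budget // 2
    likely_pos = [item for _, item in ranked[:half]]
    likely_neg = [item for _, item in ranked[-(budget - half):]]
    return likely_pos + likely_neg

items = list(range(10))  # stand-ins for data instances
scores = [0.9, 0.1, 0.8, 0.2, 0.7, 0.3, 0.6, 0.4, 0.95, 0.05]
subset = balanced_sample(items, scores, budget=4)
```

Note that the resulting accuracy estimate is for the selected subset, not the raw data distribution, so report it as such.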

Compare the performance of your baseline model and primary model and discuss/explain the differences. (10 points)


Team Contributions

Please give a percentage breakdown of the effort from each team member, and what they worked on. Please discuss this within your team to make sure every member agrees with the breakdown.

Submission

Please submit your project report by Monday 12/14 at 10pm using this link.