Introduction to Data Science Fall 2015

CS 194-16 Introduction to Data Science Fall 2015

Organizations use their data for decision support and to build data-intensive products and services. The collection of skills required by organizations to support these functions has been grouped under the term Data Science. This course will attempt to articulate the expected output of Data Scientists and then equip the students with the ability to deliver against these expectations. The assignments will involve programming, statistics, and the ability to manipulate data sets with code.

Notes on CS 294-16: The graduate version of the class, CS 294-16, is a "mezzanine" class that is part of the Master of Engineering curriculum. It is not normally available to other graduate students except under special circumstances.

New: Final Projects

Logistics

  • Course Number: CS 194-16, CS 294-16 Fall 2015, UC Berkeley
  • Instructor: John Canny
  • Time: MW 5pm - 6:30pm
  • Location: 155 Donner Lab through 9/15/2015, then 310 Jacobs Hall
  • Teaching Assistants: Haoyu Chen
  • Discussion: Join Piazza for announcements and to ask questions about the course
  • Office hours:
    • John Canny - M 3-4, W 2-3 at 637 Soda
    • GSI - Tue 3-4, Fri 3-4 at 283H Soda

Pre-requisites

Pre-requisites for this course include 61A and 61B and basic programming skills. Knowledge of Python will be useful for the assignments, and several will also use the Scala language. Students will also be expected to run VirtualBox on their laptops for the assignments.

Please take the class survey here.

Please set up your machine according to these instructions. There are many issues with Windows 10 at this time. Don't upgrade if you can avoid it.

Make sure you set up and test your VM before the first lab on 9/2.

Texts

There is no single textbook, and readings will be posted lecture by lecture. However, there are a couple of books that are particularly useful and we will reference them repeatedly:

Grading

Midterm Info

The midterm is next Monday 11/23 from 5-6:30pm. It's closed-book, but you can bring a single 8.5x11 sheet with notes on both sides.

Solutions

Homework Solutions: Homework Solutions

Lab Solutions: Lab Solutions

Midterm Fall 2014 and Solutions

Nano-Quizzes for each lecture

Schedule

Mon Lecture, Lecture Topic, Weds Lab, Reading, Assignments (Thursday)

W 8/26
Bunny 1
due 8/28

L1: Introduction/Data Science Process Download [pptx] Download [pdf]

No Lab
First lecture instead

Chapter 1 of Data Science from Scratch

Enterprise Data Analysis and Visualization: An Interview Study

none

M 8/31
Bunny 2
due 5pm 8/31

L2: Data Collection and Exploration Download [pptx] Download [pdf] Lab 1 Unix

CH 9 of Data Science from Scratch

Sections 7.1-7.2 and 12 of Computational Biology 2nd ed. A Practical Intro... by Röbbe Wünschiers

9/3 Homework 1 out. Due by 9/10
M 9/7
no lecture
Bunny 3
due 9/10
Labor Day Holiday Lab 2 Exploratory Data Analysis

CH 3 and CH 10 of Data Science from Scratch

9/10 Homework 1 due by 10pm!

Homework 2 out. Due by 9/17

M 9/14
155 Donner
Bunny 4
due 9/14

L3: Tabular Data Processing Download [pptx] Download [pdf]

Lab 3 Pandas
in 110&120 Jacobs Hall

CH 5 and CH 7 of Python for Data Analysis

9/17 Homework 2 due by 10pm

Project Proposal Out, due 9/25

M 9/21
155 Donner
Bunny 5
due 9/21

L4: Featurization and statistical tests Download [pptx] Download [pdf] Lab 4 Project Planning
in 110&120 Jacobs Hall
CH 5 and CH 7 of Data Science from Scratch. Please review CH 6 on probability theory if needed.

9/25 Project Proposal due Midnight

Homework 3 out, due 10/1

M 9/28
310 Jacobs!
Bunny 6
due 9/28

L5: Natural Language Processing Download [pptx] Download [pdf]

Lab 5 Stats and NLP tools
in 310 Jacobs

Required: Analyzing Sentence Structure, CH 8 of the NLTK book (skip section 8.4).

Background: Stanford Dependencies, Stanford Parser online docs.

10/1 Homework 3 Due by 10pm!

Project Data Exploration out, due 10/8

M 10/5
Bunny 7
due 10/5

L6: Supervised Learning: kNN, Naive Bayes Download [pptx] Download [pdf] Lab 6 Supervised Learning CH 12 and CH 13 of Data Science from Scratch

10/8 Project Data Exploration Due!

Homework 4 out, due 10/15

M 10/12
Bunny 8
due 10/12
L7: Supervised Learning: Linear and Logistic Regression, Trees and Forests Download pptx and Download pdf Lab 7 Supervised Learning

CH 14, CH 16 and CH 17 of Data Science from Scratch

10/15 Homework 4 due

Project Preliminary Data Analysis out, due 10/23

M 10/19
Bunny 9
due 10/19
L8: Unsupervised Learning: k-Means, DBSCAN, matrix factorization Download [pptx] and Download [pdf] Lab 8 Unsupervised Learning CH 19 of Data Science from Scratch, Wikipedia entry on DBSCAN, and Tutorial (in Python) on matrix factorization.

10/23 Project Preliminary Data Analysis due

Homework 5 out, due 10/30

M 10/26
Bunny 10
due 10/26
L9: Deep Learning for images and text, RNNs. [pdf slides]

Lab 9 CaffeNet and LSTMs

Introduction to Neural Nets, chapter 1 of Neural Networks and Deep Learning by Michael Nielsen; Deep Learning and Caffe by Evan Shelhamer; LSTM Tutorial

10/30 Homework 5 due

Homework 6 out, due 11/5

M 11/2
Bunny 11 due 11/2

L10: Scaling Up Analytics Download [pptx]

Download [pdf]

Lab 10 Spark/EC2 "MapReduce," "Word Frequency Problem", and "Other Examples of MapReduce" sections from O'Reilly "Doing Data Science" book (available online or from the library) and Spark Short paper

11/5 Homework 6 due

Homework 7 out, due 11/13

M 11/9
Bunny 12 due 11/9
L11: Interactive Visualization Download [pptx] and Download [pdf] No Lab (Veterans Day) Chapter 9 on Data Visualization from "Doing Data Science" available online or from the library.
D3: Data Driven Documents by Bostock et al.

11/13 Homework 7 due

M 11/16
Bunny 13 due 11/16

L12: Graph Processing Download [pdf]

Lab 11 Visualization Chapter 2 from "Networks, Crowds, and Markets: Reasoning About a Highly Connected World"
M 11/23

Midterm - 5:00 to 6:30 pm
EC2 usage estimate

Non-instructional day - no lab
M 11/30 Project Presentations Weds: Project Presentations  
Weds 12/9 12:30-2:30 Project Posters in 310 Jacobs (some poster examples)      
Mon 12/14 Final Project Reports due 10pm (see also the project comments) Archive Project Report?