Introduction to Data Science

CS 194-16 Introduction to Data Science Fall 2014

Organizations use their data for decision support and to build data-intensive products and services. The collection of skills required by organizations to support these functions has been grouped under the term Data Science. This course will attempt to articulate the expected output of Data Scientists and then equip the students with the ability to deliver against these expectations. The assignments will involve web programming, statistics, and the ability to manipulate data sets with code.

Note: Final project report deadline extended to 12/15

Midterm solutions are available here.

The midterm histogram is available here.

Logistics

Pre-requisites

Pre-requisites for this course include 61A, 61B, 61C and basic programming skills. Knowledge of Python will be useful for the assignments, and several will also use the Scala language. Students will also be expected to run VirtualBox on their laptops for the assignments.

Please take the class survey here.

Please set up your machine according to these instructions.

Texts

There is no single textbook, and readings will be posted lecture by lecture. However, there are a couple of books that are particularly useful and we will reference them repeatedly:

Grading

  • Class Participation and in-class labs: 20%
  • Midterm: 20%
  • Final Project (in groups): 25%. Final project information is here. Submit team member lists here.
  • Homework: 30% (5 @ 6% each)
  • “Bunnies” : 5%

Schedule

Lecture Date | Lecture Material | Weds Lab | Reading | Assignments
W 9/3 | L1: Introduction/Data Science Process [PPTX] [PDF] | No Lab | Enterprise Data Analysis and Visualization: An Interview Study | Bunny 1 by 5pm on 9/5
M 9/8 | L2: Data Preparation [PPTX] [PDF] | Lab 1 Unix | Sections 7.1-7.2 and 12 of Computational Biology: A Practical Introduction to BioData Processing and Analysis with Linux, MySQL, and R, 2nd ed., Röbbe Wünschiers, 2013 | Bunny 2 by 5pm 9/8
Th 9/11 | | | | Homework 1 out. Due by 9/25
M 9/15 | L3: Tabular Data [PPTX] [PDF] | Lab 2 Pandas | From Databases to Dataspaces: A New Abstraction for Information Management; Schemaless SQL; and Schema on Write vs. Schema on Read | Bunny 3 by 5pm on 9/15
M 9/22 | L4: Data Cleaning and Integration [PPTX] [PDF] | Lab 3 Pandas | WebTables: Exploring the Power of Tables on the Web (Sections 1, 2 and 4; others optional) | Bunny 4 by 5pm 9/22
Th 9/25 | | | | Homework 1 due by 10pm! Submit using glookup. Homework 2 out, due 10/9
F 9/26 | | | | Final Project Group Lists due midnight
M 9/29 | L5: Natural Language Processing [PPTX] [PDF] | Lab 4 Stanford NLP tools | Analyzing Sentence Structure, CH 8 of the NLTK book (skip section 8.4). Stanford Dependencies, Stanford Parser online docs. | Bunny 5 by 5pm 9/29
W 10/1 | | | | Project preferences due
F 10/3 | | | | Project assignments
M 10/6 | L6: Exploratory Data Analysis [PPTX] [PDF] | Lab 5 R | Statistical Thinking in the Age of Big Data. Exploratory Data Analysis, from the O'Reilly book "Doing Data Science" (available on campus or via the library VPN). Introduction to Hypothesis Testing. | Bunny 6 by 5pm 10/6
Th 10/9 | | | | Homework 2 due! Homework 3 out, due 10/23
F 10/10 | | | | Project Proposals due
M 10/13 | L7: kNN, Linear Regression, k-Means [PPTX] [PDF] | Lab 6 Scala/BIDMach | Three Basic Algorithms, from the O'Reilly book "Doing Data Science" (available on campus or via the library VPN) | Bunny 7 by 5pm 10/13
M 10/20 | L8: Naive Bayes, Logistic Regression, Trees and Forests [PPTX] [PDF] | Lab 7 Naive Bayes and Logistic Regression | Spam Filters, Naive Bayes, and Wrangling, up to Laplace Smoothing. Logistic Regression, up to Newton's method. Extracting Meaning from Data, Feature Selection section. | Bunny 8 by 5pm 10/20
Th 10/23 | | | | Homework 3 due (10PM). Homework 4 out
M 10/27 | L9: Deep Learning, guest lecture by Evan Shelhammer [PPTX] [PDF] | Lab 8 Caffe | Introduction to Neural Nets, chapter 1 from Neural Networks and Deep Learning, Michael Nielsen. Deep Learning and Caffe, Evan Shelhammer. | No Bunny 10/27
M 11/3 | L10: Scaling Up Analytics, Charles [Google Slides] [PDF] | Lab 9 Spark/EC2 | "MapReduce," "Word Frequency Problem", and "Other Examples of MapReduce" sections from O'Reilly "Doing Data Science" book (available online or from the library) and Spark Short paper | Bunny 10 by 5pm 11/3
Th 11/6 | | | | Homework 4 due. Homework 5 out
M 11/10 | L11: Visualization [PPTX] [PDF] | Lab 10 Visualization Lab Slides | Chapter 9 on Data Visualization from "Doing Data Science" (available online or from the library). D3: Data-Driven Documents by Bostock et al. Optional: reading about how the Challenger disaster may have been prevented with data visualization, by Edward Tufte. | Bunny 11 by 5pm, 11/10
Th 11/13 | | | | Homework 5 due @ 10PM
M 11/17 | L12: Graph Processing; Joseph Gonzalez [PPTX] [PDF] | Lab 11 GraphX | Chapter 2 from "Networks, Crowds, and Markets: Reasoning About a Highly Connected World" | Bunny 12 by 5pm, 11/17
Fri 11/21 | Midterm Review 5.00 pm, 3106 Etcheverry | | | Slides
M 11/24 | Midterm - 5.00 to 6.30 pm | In-lab project work (no submission due) | |
M 12/1 | Project Presentations | Weds: Project Presentations | | Bunny 13 by 10pm Weds 12/3
Th 12/11 | Project Posters 3:30-5pm BIDS | | |
Mon 12/15 | Final Project Reports due | | |