CS 194-16 Introduction to Data Science Fall 2014
Organizations use their data for decision support and to build data-intensive products and services. The collection of skills required by organizations to support these functions has been grouped under the term Data Science. This course will attempt to articulate the expected output of Data Scientists and then equip the students with the ability to deliver against these expectations. The assignments will involve web programming, statistics, and the ability to manipulate data sets with code.
Note: Final project report deadline extended to 12/15
Midterm solutions are available here.
Midterm histogram is here.
Logistics
Pre-requisites
Pre-requisites for this course include 61A, 61B, 61C and basic programming skills. Knowledge of Python will be useful for the assignments, and several will also use the Scala language. Students will also be expected to run VirtualBox on their laptops for the assignments.
Please take the class survey here.
Please set up your machine according to these instructions.
Texts
There is no single textbook, and readings will be posted lecture by lecture. However, there are a couple of books that are particularly useful and we will reference them repeatedly:
Grading
Class Participation and in-class labs: 20%
Midterm: 20%
Final Project (in groups): 25%. Final project information is here. Submit team member lists here.
Homework: 30% (5 @ 6% each)
“Bunnies”: 5%
Schedule
Lecture Date
Lecture Material
Weds Lab
Reading
Assignments
W 9/3
L1: Introduction/Data Science Process [PPTX] [PDF]
No Lab
Enterprise Data Analysis and Visualization: An Interview Study
Bunny 1 by 5pm on 9/5
M 9/8
L2: Data Preparation [PPTX] [PDF]
Lab 1 Unix
Sections 7.1-7.2 and 12 of Computational Biology: A Practical Introduction to BioData Processing and Analysis with Linux, MySQL, and R, 2nd ed., Röbbe Wünschiers, 2013
Bunny 2 by 5pm 9/8
Th 9/11
Homework 1 out. Due by 9/25
M 9/15
L3: Tabular Data [PPTX] [PDF]
Lab 2 Pandas
From Databases to Dataspaces: A New Abstraction for Information Management, Schemaless SQL, and Schema on Write vs. Schema on Read
Bunny 3 by 5pm on 9/15
M 9/22
L4: Data Cleaning and Integration [PPTX] [PDF]
Lab 3 Pandas
WebTables: Exploring the Power of Tables on the Web (Sections 1, 2 and 4; others optional)
Bunny 4 by 5pm 9/22
Th 9/25
Homework 1 due by 10pm! Submit using glookup. Homework 2 out, due 10/9
F 9/26
Final Project Group Lists Due Midnight
M 9/29
L5: Natural Language Processing [PPTX] [PDF]
Lab 4 Stanford NLP tools
Analyzing Sentence Structure, Ch. 8 of the NLTK book (skip section 8.4).
Stanford Dependencies, Stanford Parser online docs.
Bunny 5 by 5pm 9/29
W 10/1
Project preferences due
F 10/3
Project assignments
M 10/6
L6: Exploratory Data Analysis [PPTX] [PDF]
Lab 5 R
Statistical Thinking in the Age of Big Data; Exploratory Data Analysis, from the O'Reilly book "Doing Data Science" (available on campus or via the library VPN); Introduction to Hypothesis Testing
Bunny 6 by 5pm 10/6
Th 10/9
Homework 2 due! Homework 3 out, due 10/23
F 10/10
Project Proposals due
M 10/13
L7: kNN, Linear Regression, k-Means [PPTX] [PDF]
Lab 6 Scala/BIDMach
Three Basic Algorithms, from the O'Reilly book "Doing Data Science" (available on campus or via the library VPN).
Bunny 7 by 5pm 10/13
M 10/20
L8: Naive Bayes, Logistic Regression, Trees and Forests [PPTX] [PDF]
Lab 7 Naive Bayes and Logistic Regression
Spam Filters, Naive Bayes, and Wrangling, up to Laplace Smoothing.
Logistic Regression, up to Newton's method.
Extracting Meaning from Data, Feature Selection section.
Bunny 8 by 5pm 10/20
Th 10/23
Homework 3 due (10PM). Homework 4 out
M 10/27
L9: Deep Learning, guest lecture by Evan Shelhammer [PPTX] [PDF]
Lab 8 Caffe
Introduction to Neural Nets, chapter 1 from Neural Networks and Deep Learning, Michael Nielsen.
Deep Learning and Caffe, Evan Shelhammer.
No Bunny 10/27
M 11/3
L10: Scaling Up Analytics, Charles [Google Slides] [PDF]
Lab 9 Spark/EC2
"MapReduce", "Word Frequency Problem", and "Other Examples of MapReduce" sections from the O'Reilly "Doing Data Science" book (available online or from the library) and the Spark short paper
Bunny 10 by 5pm 11/3
Th 11/6
Homework 4 due. Homework 5 out
M 11/10
L11: Visualization [PPTX] [PDF]
Lab 10 Visualization Lab Slides
Chapter 9 on Data Visualization from "Doing Data Science", available online or from the library.
D3: Data Driven Documents by Bostock et al.
Optional: reading about how the Challenger disaster may have been prevented with data visualization, by Edward Tufte
Bunny 11 by 5pm, 11/10
Th 11/13
Homework 5 due @ 10PM
M 11/17
L12: Graph Processing; Joseph Gonzalez [PPTX] [PDF]
Lab 11 GraphX
Chapter 2 from "Networks, Crowds, and Markets: Reasoning About a Highly Connected World"
Bunny 12 by 5pm, 11/17
Fri 11/21
Midterm Review 5:00 pm, 3106 Etcheverry
Slides
M 11/24
Midterm - 5:00 to 6:30 pm
In-lab project work (no submission due)
M 12/1
Project Presentations
Weds: Project Presentations
Bunny 13 by 10pm Weds 12/3
Th 12/11
Project Posters 3:30-5pm BIDS
Mon 12/15
Final Project Reports due