6/28/2018

Data Science at Carroll College

What is Data Science?

Data Science Diagram
  • Computer Science is more than just programming
  • Statistics is more than just estimators and tests
  • Mathematics is more than just proofs and abstraction
  • Data Science is the interplay between computer science, statistics, and mathematics along with domain-specific knowledge to extract meaning and make predictions from data.

Data Science is Open Ended and Exploratory

Data Science Cycle

To support this, we use

  • Active and collaborative explorations
  • Real data
  • Open ended projects

The Active Learning Environment

Sandbox-Style Classrooms

What Makes an "Active" Data Science Course?

  • Students are actively coding during class
  • Students develop the algorithms organically
  • Students work collaboratively to build ideas, find bugs, and ask questions
  • Students are producing their own math/stats/CS

Active Learning: Clustering Exercise

Task: Build an algorithm to separate the three species of iris shown in this image. (Audience Participation)

Clustering Student Algorithm (slide 1):

Step 1: randomly assign a species to every point

Clustering Student Algorithm (slide 2):

Step 2: find the centroids of the data

Clustering Student Algorithm (slide 3):

Step 3: reassign all species based on distance to centroids

Clustering Student Algorithm (slide 4):

Step 4: find centroids again and iterate

Clustering Student Algorithm (slide 5):

Step N: Clusters are revealed after several iterations.
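
The steps on the preceding slides can be sketched in a few lines of Python. This is only an illustration, not the students' actual code: the function name `student_clustering`, the synthetic three-blob data (standing in for the iris measurements), and the rule for re-seeding an empty group are all assumptions made here. The procedure is essentially k-means started from a random assignment, so it can settle on a poor grouping for some starting assignments.

```python
import numpy as np

def student_clustering(points, k=3, max_iter=100, seed=0):
    """Cluster an (n x d) array by following Steps 1-4 from the slides."""
    rng = np.random.default_rng(seed)
    n = len(points)
    # Step 1: randomly assign a group label to every point
    labels = rng.integers(0, k, size=n)
    for _ in range(max_iter):
        # Step 2: find the centroid of each group
        # (assumption: an empty group is re-seeded with a random data point)
        centroids = np.array([
            points[labels == j].mean(axis=0) if np.any(labels == j)
            else points[rng.integers(0, n)]
            for j in range(k)
        ])
        # Step 3: reassign every point to its nearest centroid
        dists = np.linalg.norm(points[:, None, :] - centroids[None, :, :], axis=2)
        new_labels = dists.argmin(axis=1)
        # Step 4: repeat until the labels stop changing
        if np.array_equal(new_labels, labels):
            break
        labels = new_labels
    return labels, centroids

# Hypothetical stand-in data: three well-separated 2-D blobs of 30 points each
data_rng = np.random.default_rng(1)
centers = np.array([[0.0, 0.0], [5.0, 5.0], [0.0, 5.0]])
points = np.vstack([c + 0.3 * data_rng.standard_normal((30, 2)) for c in centers])

labels, centroids = student_clustering(points)
print(centroids.round(2))
```

Running the function with several different `seed` values is a quick way to let students see that the final clusters depend on the random start.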

Other Active Data Science Tasks:

  • Develop a criterion for selecting variables in multiple regression
  • Write code to demonstrate that a neural network can approximate any function
  • Develop a way to classify univariate data like this:

Real Data

Why Use Real Data?

Small/simulated data can be useful:

  • Simple enough calculations to see what is happening
  • Develop first intuition about material
  • Creating simulated data with specific features is challenging

Why Use Real Data?

Small/simulated data is simultaneously ridiculous:

  • Not realistic
  • Can’t demonstrate the power of techniques
  • Can’t demonstrate the challenge of real problems
  • Real data often requires cleaning and visualization before anything else should be done
  • Real data often contains surprises

Examples of Real Data

Pulse of the Nation Data: thepulseofthenation.com/

  • Blunt random telephone surveys (be careful)
  • Good for exploratory data analysis

Examples of Real Data

Wine Quality Data: archive.ics.uci.edu/ml/datasets/wine+quality

  • Chemical compositions of wines and a taster's quality measurement
  • Good for classification, variable selection, and PCA

fixed.acidity  volatile.acidity  residual.sugar  sulphates  alcohol  quality
          7.4              0.70             1.9       0.56      9.4        5
          7.8              0.88             2.6       0.68      9.8        5
          7.8              0.76             2.3       0.65      9.8        5
         11.2              0.28             1.9       0.58      9.8        6
          7.4              0.70             1.9       0.56      9.4        5
          7.4              0.66             1.8       0.56      9.4        5
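
As a rough illustration of the PCA use mentioned above, the sketch below runs a principal component analysis on just the six rows shown, using NumPy's SVD. Six rows are far too few for a meaningful analysis; the full data set at the URL above would be used in practice. Treating `quality` as the response and leaving it out of the PCA is an assumption made here, as are the variable names.

```python
import numpy as np

columns = ["fixed.acidity", "volatile.acidity", "residual.sugar",
           "sulphates", "alcohol", "quality"]
rows = np.array([
    [7.4, 0.70, 1.9, 0.56, 9.4, 5],
    [7.8, 0.88, 2.6, 0.68, 9.8, 5],
    [7.8, 0.76, 2.3, 0.65, 9.8, 5],
    [11.2, 0.28, 1.9, 0.58, 9.8, 6],
    [7.4, 0.70, 1.9, 0.56, 9.4, 5],
    [7.4, 0.66, 1.8, 0.56, 9.4, 5],
])

X = rows[:, :-1]        # chemical measurements (assumed predictors)
quality = rows[:, -1]   # taster's score, kept out of the PCA

# Standardize each column so no single unit of measurement dominates
Z = (X - X.mean(axis=0)) / X.std(axis=0)

# PCA via the singular value decomposition of the standardized data
U, s, Vt = np.linalg.svd(Z, full_matrices=False)
explained = s**2 / np.sum(s**2)   # share of variance per component
scores = Z @ Vt.T                 # coordinates of each wine on the components

for name, weight in zip(columns[:-1], Vt[0]):
    print(f"{name:>17}: {weight:+.2f}")   # loadings of the first component
print("explained variance ratios:", explained.round(3))
```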

Sources for Real Data

Web Hosted Data Sets:

Textbooks:

Projects

Projects in Data Science

  • Mock consulting scenarios
    • Technical report or presentation
    • Non-technical summary for a "client"
  • Open ended
    • If there is a single answer key, it isn't a good project
    • Every student / group should come to different conclusions
  • Students follow the entire data science cycle

Data Science Cycle

Sample Projects

  • Build a classification scheme from the Wisconsin Breast Cancer data set and present your findings to a (fictitious) group of doctors.
  • Build an interactive app that allows users to explore a data set related to Parkinson's disease. Your app needs to allow for EDA, classification, and clustering.
  • Choose-your-own-adventure final exam: Find a non-trivial data set and demonstrate what you've learned.

Thank You