Course overview
STATS 413: Applied Regression Analysis
University of Michigan
Ex: Income prediction
- Data: income survey data for men from the US mid-Atlantic region
- Goal: predict income from demographic variables
Ex: Breast cancer subtype classification
- Data: expression levels of 534 genes in 122 breast tissue samples (7 non-malignant samples)
- Goal: predict breast cancer subtype from gene expression levels
Ex: Handwritten digits classification
- Data: 60k images of handwritten digits
- Goal: recognize handwritten digits from images
Ex: Self-driving car
Supervised learning setup
- Y: dependent variable, label, outcome, response, target etc.
- Y continuous: regression problem
- Y discrete: classification problem
- X: (vector of) covariates, features, independent variables, inputs, regressors etc.
- $\{(X_i, Y_i)\}_{i=1}^n$: training data consisting of $n$ repeated observations of $(X, Y)$ pairs
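As a rough illustration of this setup, training data can be stored as a list of (X, Y) pairs. The sketch below is not part of the course material: the numbers are made up, and `problem_type` is a hypothetical helper that uses a crude rule (a float Y means regression, anything else means classification).

```python
# Hypothetical toy training sets; all values are made up for illustration.
# Regression: X = (age, years of education), Y = income in $1000s (continuous).
regression_data = [((25, 12), 38.0), ((40, 16), 72.5), ((58, 14), 61.2)]

# Classification: X = two gene expression levels, Y = subtype label (discrete).
classification_data = [((0.2, 1.3), "A"), ((1.1, 0.4), "B"), ((0.3, 1.0), "A")]


def problem_type(data):
    """Crude heuristic: 'regression' if every Y is a float, else 'classification'."""
    ys = [y for _, y in data]
    if all(isinstance(y, float) for y in ys):
        return "regression"
    return "classification"


print(problem_type(regression_data))
print(problem_type(classification_data))
```

In practice the distinction depends on what Y means, not on its storage type: a class label coded as 0/1 is still a classification target.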
Supervised learning goals
- prediction: predict unseen test cases;
- inference: understand how certain inputs affect the outcome;
- quantify the uncertainty in our predictions and inferences.
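The three goals can be sketched on a one-covariate least squares fit. This is a minimal sketch under assumptions, not the course's own code: the data are made up, `fit_simple_ols` is a hypothetical helper, and the standard error formula assumes the usual linear model with constant error variance.

```python
import math


def fit_simple_ols(xs, ys):
    """Least squares fit of y = intercept + slope * x.

    Returns (intercept, slope, se_slope), where se_slope is the usual
    standard error of the slope under constant error variance.
    """
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    sxx = sum((x - x_bar) ** 2 for x in xs)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = y_bar - slope * x_bar
    residuals = [y - (intercept + slope * x) for x, y in zip(xs, ys)]
    sigma2_hat = sum(r * r for r in residuals) / (n - 2)  # residual variance
    se_slope = math.sqrt(sigma2_hat / sxx)
    return intercept, slope, se_slope


# Made-up training data.
xs = [1.0, 2.0, 3.0, 4.0, 5.0]
ys = [2.1, 3.9, 6.2, 7.8, 10.1]

intercept, slope, se_slope = fit_simple_ols(xs, ys)

prediction = intercept + slope * 6.0  # prediction: an unseen test case x = 6
print("predicted y at x = 6:", prediction)
print("estimated slope:", slope)      # inference: effect of x on y
print("standard error:", se_slope)    # uncertainty in that estimate
```

The same fitted line serves all three goals: it is evaluated at a new input for prediction, its slope is read off for inference, and the standard error quantifies how much that slope could vary across samples.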
Unsupervised learning
- no special outcome variable;
- goals are less well-defined:
  - discover groups of similar samples,
  - discover groups of similar features,
  - find combinations of features with the most variation,
  - ...
- useful pre-processing step for supervised learning
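To make "discover groups of similar samples" concrete, here is a minimal sketch of 2-means clustering (Lloyd's algorithm) on one-dimensional data. The technique, the function name `two_means_1d`, and the data are choices made for illustration, not something the slides prescribe. Note that no outcome variable Y appears anywhere.

```python
def two_means_1d(points, n_iter=20):
    """Split 1-D points into two groups by alternating assignment and averaging."""
    # Deterministic start: the smallest and largest points.
    centers = [min(points), max(points)]
    groups = [[], []]
    for _ in range(n_iter):
        # Assignment step: each point joins its nearest center.
        groups = [[], []]
        for p in points:
            nearest = 0 if abs(p - centers[0]) <= abs(p - centers[1]) else 1
            groups[nearest].append(p)
        # Update step: each center moves to the mean of its group
        # (an empty group keeps its previous center).
        centers = [sum(g) / len(g) if g else c for g, c in zip(groups, centers)]
    return centers, groups


# Made-up measurements with two visible clumps.
points = [1.0, 1.2, 0.8, 5.0, 5.3, 4.9]
centers, groups = two_means_1d(points)
print(centers)
print(groups)
```

The resulting group labels could then serve as an extra feature in a supervised model, which is one way clustering acts as a pre-processing step.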
Netflix prize
- Data: 100M ratings for 18k movies by 400k users (98% missing)
- Goal: predict 1.4M ratings that were held-out from the training data
- Netflix's in-house Cinematch algorithm has 0.9525 RMSE
- first team to design an algorithm with < 0.8572 RMSE gets $1M
- Q: Is this a supervised or unsupervised learning problem?
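For reference, the root mean squared error of predictions $\hat{y}_1, \dots, \hat{y}_m$ against held-out ratings $y_1, \dots, y_m$ is $\sqrt{\frac{1}{m}\sum_{j=1}^m (\hat{y}_j - y_j)^2}$. The sketch below computes it on made-up ratings and compares it with the 0.8572 prize threshold; the function name `rmse` and the data are illustrative assumptions.

```python
import math


def rmse(predicted, actual):
    """Root mean squared error between two equal-length sequences."""
    if len(predicted) != len(actual):
        raise ValueError("predicted and actual must have the same length")
    squared_errors = [(p - a) ** 2 for p, a in zip(predicted, actual)]
    return math.sqrt(sum(squared_errors) / len(squared_errors))


# Made-up held-out ratings (1 to 5 stars) and hypothetical predictions.
actual = [4, 3, 5, 1]
predicted = [3.5, 3.0, 4.0, 2.0]

score = rmse(predicted, actual)
PRIZE_THRESHOLD = 0.8572  # threshold quoted on the slide
print(score, score < PRIZE_THRESHOLD)
```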