Course overview

STATS 413: Applied Regression Analysis
University of Michigan

Ex: Income prediction

Data: income survey data for men from the US mid-Atlantic region
Goal: predict income from demographic variables

Plots of income vs age, education level, and year

Ex: Breast cancer subtype classification

Data: expression levels of 534 genes in 122 breast tissue samples (including 7 non-malignant samples)
Goal: predict breast cancer subtype from gene expression levels

Ex: Handwritten digits classification

Data: 60k images of handwritten digits
Goal: recognize handwritten digits from images

Images of handwritten digits from the MNIST dataset

Ex: Self-driving car

Supervised learning setup

$Y$: dependent variable, label, outcome, response, target, etc.
- $Y$ continuous: regression problem
- $Y$ discrete: classification problem
$X$: (vector of) covariates, features, independent variables, inputs, regressors, etc.
$\{(X_i,Y_i)\}_{i=1}^n$: training data consisting of repeated observations of $(X,Y)$ pairs
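
As a rough illustration of the notation above, the following sketch builds a tiny synthetic training set. All variable names and values (ages, years of education, incomes, digit labels) are made up for illustration, and the `problem_type` helper is a hypothetical rule of thumb, not a formal definition.

```python
import random

random.seed(0)  # fixed seed so the sketch is reproducible

n = 5  # number of training observations (arbitrary)

# X_i: one covariate vector per observation; here (age, years of education), both made up.
X = [(random.randint(20, 65), random.randint(8, 20)) for _ in range(n)]

# Continuous response -> regression problem (e.g., income in thousands of dollars).
Y_continuous = [round(random.uniform(20.0, 150.0), 1) for _ in range(n)]

# Discrete response -> classification problem (e.g., a digit label 0-9).
Y_discrete = [random.randint(0, 9) for _ in range(n)]

# Training data: the list of (X_i, Y_i) pairs for i = 1, ..., n.
training_regression = list(zip(X, Y_continuous))
training_classification = list(zip(X, Y_discrete))


def problem_type(y_values):
    """Crude rule of thumb: integer labels -> classification, otherwise regression."""
    if all(isinstance(y, int) for y in y_values):
        return "classification"
    return "regression"


print(problem_type(Y_continuous))  # regression
print(problem_type(Y_discrete))    # classification
```

In practice the distinction depends on what the response means, not on its storage type: an integer-valued count can still be treated as a regression target.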

Supervised learning goals

prediction: predict the outcome for unseen test cases;
inference: understand how certain inputs affect the outcome;
quantify the uncertainty in our predictions and inferences.
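
The three goals can be sketched with simple linear regression on one covariate. This is a minimal example under stated assumptions: the data are hypothetical (years of education vs. income in thousands of dollars), the model is $Y = b_0 + b_1 X + \text{error}$, and the standard error formula assumes independent errors with constant variance.

```python
import math

# Hypothetical training data: x = years of education, y = income (thousands of dollars).
x = [10, 12, 12, 14, 16, 16, 18, 20]
y = [28.0, 35.0, 33.0, 41.0, 52.0, 49.0, 60.0, 66.0]
n = len(x)

x_bar = sum(x) / n
y_bar = sum(y) / n
s_xx = sum((xi - x_bar) ** 2 for xi in x)
s_xy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))

# Least-squares estimates for the model y = b0 + b1 * x + error.
b1 = s_xy / s_xx
b0 = y_bar - b1 * x_bar

# Prediction: estimate the outcome for an unseen test case.
x_new = 15
y_pred = b0 + b1 * x_new

# Inference: b1 estimates the change in mean income per extra year of education.

# Uncertainty: residual variance estimate and the standard error of the slope.
residuals = [yi - (b0 + b1 * xi) for xi, yi in zip(x, y)]
sigma2_hat = sum(r ** 2 for r in residuals) / (n - 2)
se_b1 = math.sqrt(sigma2_hat / s_xx)

print(f"predicted income at x = {x_new}: {y_pred:.1f}")
print(f"slope estimate: {b1:.2f} (standard error {se_b1:.2f})")
```

The same fitted line serves all three goals: `y_pred` is the prediction, `b1` is the quantity of inferential interest, and `se_b1` measures how uncertain that estimate is.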

Unsupervised learning

no special outcome variable;
goals are less well-defined:
1. discover groups of similar samples,
2. discover groups of similar features,
3. find combinations of features with the most variation,
4. ...
useful pre-processing step for supervised learning
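
To make the first goal (discovering groups of similar samples) concrete, here is a minimal sketch of k-means clustering via Lloyd's algorithm. The points, the number of clusters, and the initialization are all assumptions chosen for illustration; k-means is one clustering method among many, and practical implementations typically use random restarts.

```python
import math

# Hypothetical 2-D samples forming two visibly separated groups (made-up numbers).
points = [(1.0, 1.2), (0.8, 1.0), (1.3, 0.9), (5.0, 5.2), (5.3, 4.9), (4.8, 5.1)]


def closest(point, centers):
    """Index of the center nearest to `point` in Euclidean distance."""
    return min(range(len(centers)), key=lambda k: math.dist(point, centers[k]))


def mean(group):
    """Coordinate-wise mean of a non-empty list of points."""
    return tuple(sum(coords) / len(group) for coords in zip(*group))


def k_means(points, centers, n_iter=10):
    """Lloyd's algorithm: alternate assignment and center-update steps."""
    for _ in range(n_iter):
        labels = [closest(p, centers) for p in points]
        new_centers = []
        for k in range(len(centers)):
            members = [p for p, lab in zip(points, labels) if lab == k]
            # Keep the old center if a cluster ends up empty.
            new_centers.append(mean(members) if members else centers[k])
        centers = new_centers
    labels = [closest(p, centers) for p in points]
    return labels, centers


# Initialize with the first and last points (a simple, non-random choice).
labels, centers = k_means(points, centers=[points[0], points[-1]])
print(labels)  # [0, 0, 0, 1, 1, 1]
```

No outcome variable is used anywhere: the grouping comes only from distances between the samples themselves.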

Netflix prize

Data: 100M ratings for 18k movies by 400k users (98% missing)
Goal: predict 1.4M ratings that were held-out from the training data
Netflix's in-house Cinematch algorithm has an RMSE of 0.9525
the first team to design an algorithm with RMSE < 0.8572 (a 10% improvement) wins $1M
Q: Is this a supervised or unsupervised learning problem?
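
For reference, RMSE (root mean squared error) is the square root of the average squared difference between predicted and actual ratings. The sketch below computes it on a handful of hypothetical ratings on a 1 to 5 star scale; both the held-out ratings and the two sets of predictions are made up for illustration.

```python
import math


def rmse(predicted, actual):
    """Root mean squared error between two equal-length sequences of ratings."""
    if len(predicted) != len(actual):
        raise ValueError("predicted and actual must have the same length")
    squared_errors = [(p - a) ** 2 for p, a in zip(predicted, actual)]
    return math.sqrt(sum(squared_errors) / len(squared_errors))


# Hypothetical held-out ratings and two sets of predictions.
actual = [4, 3, 5, 1, 2]
baseline = [3.0] * len(actual)       # predict 3 stars for everything
model = [3.5, 3.0, 4.5, 2.0, 2.5]    # made-up model output

print(f"baseline RMSE: {rmse(baseline, actual):.4f}")
print(f"model RMSE:    {rmse(model, actual):.4f}")
```

A lower RMSE means the predictions are closer to the held-out ratings on average, with large errors penalized more heavily than small ones.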
