# Course overview

STATS 413: Applied Regression Analysis

University of Michigan
## Ex: Income prediction

* **Data:** income survey data for men from the US mid-Atlantic region
* **Goal:** predict income from demographic variables
## Ex: Breast cancer subtype classification

* **Data:** expression levels of 534 genes in 122 breast tissue samples (7 non-malignant samples)
* **Goal:** predict breast cancer subtype from gene expression levels
## Ex: Handwritten digits classification

* **Data:** 60k images of handwritten digits
* **Goal:** recognize handwritten digits from images
## Ex: Self-driving car
## Supervised learning setup

* $Y$: dependent variable, label, outcome, response, target etc.
  * $Y$ continuous: regression problem
  * $Y$ discrete: classification problem
* $X$: (vector of) covariates, features, independent variables, inputs, regressors etc.
* $\\{(X_i,Y_i)\\}_{i=1}^n$: training data consisting of *repeated* observations of $(X,Y)$-pairs
## Supervised learning goals

1. **prediction:** predict *unseen* test cases;
2. **inference:** understand how certain inputs affect the outcome;
3. quantify the uncertainty in our predictions and inferences.
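The three goals can be illustrated with a minimal least-squares sketch. Everything here is an assumption made for illustration: the data are synthetic, the coefficients in `beta_true` are made up, and the standard errors use the textbook formula for a linear model with constant-variance noise.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic (X_i, Y_i) pairs: Y = 1 + 2*x1 - 0.5*x2 + noise (made-up values).
n = 200
beta_true = np.array([1.0, 2.0, -0.5])
X = np.column_stack([np.ones(n), rng.normal(size=(n, 2))])  # intercept + 2 covariates
y = X @ beta_true + rng.normal(scale=0.5, size=n)

# Hold out the last 50 observations as "unseen" test cases.
X_train, y_train = X[:150], y[:150]
X_test, y_test = X[150:], y[150:]

# Fit by least squares using the training data only.
beta_hat, *_ = np.linalg.lstsq(X_train, y_train, rcond=None)

# 1. Prediction: mean squared error on the held-out test cases.
test_mse = np.mean((y_test - X_test @ beta_hat) ** 2)

# 2. Inference: each estimated coefficient describes how an input affects the outcome.
# 3. Uncertainty: standard errors from sigma^2 (X^T X)^{-1}, assuming constant noise variance.
resid = y_train - X_train @ beta_hat
dof = X_train.shape[0] - X_train.shape[1]
sigma2_hat = resid @ resid / dof
std_err = np.sqrt(np.diag(sigma2_hat * np.linalg.inv(X_train.T @ X_train)))

print("coefficients:", beta_hat)
print("standard errors:", std_err)
print("test MSE:", test_mse)
```

Evaluating on observations that were not used for fitting is what makes `test_mse` an estimate of performance on unseen cases.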
## Unsupervised learning

* no special outcome variable;
* goals are less well-defined:
  1. discover groups of similar samples,
  2. discover groups of similar features,
  3. find combinations of features with the most variation,
  4. ...
* useful pre-processing step for supervised learning
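As one illustration of finding combinations of features with the most variation, here is a minimal sketch of principal component analysis computed from the singular value decomposition. The data are synthetic and `direction` is a made-up direction of dominant variation; note that no outcome variable appears anywhere.

```python
import numpy as np

rng = np.random.default_rng(0)

# Made-up data: 300 samples, 3 features, most variation along (1, 1, 0) / sqrt(2).
n = 300
direction = np.array([1.0, 1.0, 0.0]) / np.sqrt(2)
latent = rng.normal(scale=3.0, size=n)
X = np.outer(latent, direction) + rng.normal(scale=0.3, size=(n, 3))

# Center the features, then take the SVD of the centered data matrix.
Xc = X - X.mean(axis=0)
U, s, Vt = np.linalg.svd(Xc, full_matrices=False)

# Rows of Vt are the principal directions, ordered by variance explained.
first_pc = Vt[0]
explained = s**2 / np.sum(s**2)

print("first principal direction:", first_pc)
print("share of variance explained:", explained)
```

The sign of `first_pc` is arbitrary, so it should match `direction` only up to a factor of -1.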
## Netflix prize

* **Data:** 100M ratings for 18k movies by 400k users (98% missing)
* **Goal:** predict 1.4M ratings that were held-out from the training data
* Netflix's in-house *Cinematch* algorithm has 0.9535 RMSE
* first team to design an algorithm with < 0.8572 RMSE gets $1M
* **Q:** Is this a supervised or unsupervised learning problem?
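For reference, RMSE (root mean squared error) is the square root of the average squared difference between held-out and predicted ratings. The sketch below computes it on a few made-up ratings and then works out the relative improvement implied by the two RMSE values quoted above; the helper name `rmse` and the toy ratings are illustrative assumptions.

```python
import numpy as np

def rmse(actual, predicted):
    """Root mean squared error between two equal-length sequences."""
    actual = np.asarray(actual, dtype=float)
    predicted = np.asarray(predicted, dtype=float)
    return float(np.sqrt(np.mean((actual - predicted) ** 2)))

# Toy held-out ratings and predictions (made up for illustration).
held_out = [4, 3, 5, 1]
predicted = [3.5, 3.0, 4.0, 2.0]
toy_rmse = rmse(held_out, predicted)  # 0.75

# Relative improvement implied by the two RMSE values on the slide.
cinematch_rmse, target_rmse = 0.9535, 0.8572
relative_improvement = 1 - target_rmse / cinematch_rmse

print(f"toy RMSE: {toy_rmse:.2f}")
print(f"required relative improvement: {relative_improvement:.1%}")
```

With these numbers the prize threshold corresponds to roughly a 10% reduction in RMSE relative to Cinematch.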