26 Sep 2020
Define locals holding esttab commands for formatted output; expanding a local such as `gelman_output' later runs the stored command.
local gelman_output esttab, b(1) se wide mtitle("Coef.") scalars("rmse sigma") ///
    coef(_cons "Intercept" rmse "sigma") nonum noobs nostar var(15)
local gelman_output2 esttab, b(2) se wide mtitle("Coef.") scalars("rmse sigma") ///
    coef(_cons "Intercept" rmse "sigma") nonum noobs nostar var(15)
* Simulate fake data from y = a + b*x + N(0, sigma); with this seed,
* observation 17 turns out to be an outlier, so we label it
clear
qui set obs 20
gen x = _n
local a = .2
local b = .3
local sigma = 1
set seed 2141
gen y = `a' + `b'*x + rnormal(0,`sigma')
gen outlier = " <-- outlier" if _n==17
* Plot the data and compare the fits with and without the outlier
twoway lfit y x || scatter y x, mlabel(outlier) mlabc(blue) || lfit y x if _n!=17 ///
    , legend(off) ytitle(y) ylab(0(2.5)7.5)
* Regression using all observations; save the fitted value and its
* standard error at observation 17
regress y x
predict yhat
predict yresid, resid
predict yhat_std, stdp
local yhat = yhat[17]
local yhat_std = yhat_std[17]
* Same regression leaving observation 17 out
regress y x if _n!=17
predict yhat2
predict yresid2, resid
predict yhat2_std, stdp
local yhat2 = yhat2[17]
local yhat2_std = yhat2_std[17]
twoway (function y=normalden(x,`yhat',`yhat_std'), range(0 8)) ///
    (function y=normalden(x,`yhat2',`yhat2_std'), range(0 8) lp(dash)), yscale(off) ///
    legend(off)
Prediction for outlier point from regression including outlier (solid line) and regression excluding outlier (dashed line)
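To put a number on the visual difference, a minimal sketch (reusing the locals saved above; this log-density comparison is an illustration, not part of the original example) evaluates each curve at the observed value y[17]:
* Sketch: log density of the observed point 17 under each fitted curve
display "log density, fit including point 17: " ln(normalden(y[17], `yhat', `yhat_std'))
display "log density, fit excluding point 17: " ln(normalden(y[17], `yhat2', `yhat2_std'))
The fit that never saw point 17 should place the observed value farther out in its tail; this predictive density at held-out points is what leave-one-out cross-validation targets.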
There is a user-written program, cv_regress, that performs leave-one-out (LOO) cross-validation for linear regression.
If it is not installed, you can install it (only once) from the SSC archive:
ssc install cv_regress
regress y x
predict resid, resid
cv_regress, generr(resid_loo)   // generr() saves the LOO residuals in a new variable
twoway scatter resid x || scatter resid_loo x, yline(0, lp(dash)) ///
    ylab(-1(1)2) xlab(5(5)20) legend(label(2 "LOO residuals"))
      Source │       SS           df       MS      Number of obs   =        20
─────────────┼──────────────────────────────────   F(1, 18)        =     48.10
       Model │  59.2923422         1  59.2923422   Prob > F        =    0.0000
    Residual │  22.1874673        18  1.23263707   R-squared       =    0.7277
─────────────┼──────────────────────────────────   Adj R-squared   =    0.7126
       Total │  81.4798095        19  4.28841102   Root MSE        =    1.1102

─────────────┬────────────────────────────────────────────────────────────────
           y │      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
─────────────┼────────────────────────────────────────────────────────────────
           x │   .2985991   .0430533     6.94   0.000     .2081474    .3890508
       _cons │   .2540735   .5157423     0.49   0.628    -.8294608    1.337608
─────────────┴────────────────────────────────────────────────────────────────

Leave-One-Out Cross-Validation Results
─────────────────────────┬───────────────
          Method         │     Value
─────────────────────────┼───────────────
 Root Mean Squared Errors│     1.1601
 Log Mean Squared Errors │     0.2971
 Mean Absolute Errors    │     0.9445
 Pseudo-R2               │     0.67117
─────────────────────────┴───────────────

A new variable -resid_loo- was created with (y - E(y_-i|X))
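For plain OLS the LOO residuals need not be computed by refitting n regressions; a minimal sketch (the variable names h and resid_loo_check are illustrative) uses the standard leverage identity e_loo = e/(1 - h):
* Sketch: analytic LOO residuals via leverage, e_loo_i = e_i/(1 - h_i)
* (re-fit first in case cv_regress replaced the estimates in memory)
qui regress y x
predict h, leverage
gen resid_loo_check = resid/(1 - h)
* should match cv_regress's output; compare -resid_loo if its sign convention differs
assert reldif(resid_loo, resid_loo_check) < 1e-6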
Using the children’s test score data
local gelman_output3 esttab, b(1) se wide mtitle("Coef.") scalars("rmse sigma") ///
    stats(r2,fmt(2)) coef(_cons "Intercept" rmse "sigma") nonum noobs nostar var(15)
import delimited https://raw.githubusercontent.com/avehtari/ROS-Examples/master/KidIQ/data/kidiq.csv, clear
qui regress kid_score i.mom_hs c.mom_iq
`gelman_output3'
cv_regress
* With noise predictors
set seed 29922
drawnorm noise1 noise2 noise3 noise4 noise5
qui regress kid_score i.mom_hs c.mom_iq noise1-noise5
`gelman_output3'
─────────────────────────────────────────
                         Coef.
─────────────────────────────────────────
0.mom_hs                  0.0          (.)
1.mom_hs                  6.0        (2.2)
mom_iq                    0.6        (0.1)
Intercept                25.7        (5.9)
─────────────────────────────────────────
r2                       0.21
─────────────────────────────────────────
Standard errors in parentheses

Leave-One-Out Cross-Validation Results
─────────────────────────┬───────────────
          Method         │     Value
─────────────────────────┼───────────────
 Root Mean Squared Errors│    18.2024
 Log Mean Squared Errors │     5.8031
 Mean Absolute Errors    │    14.4870
 Pseudo-R2               │     0.20299
─────────────────────────┴───────────────
Now the second regression, which adds the five pure-noise predictors (code above):
─────────────────────────────────────────
                         Coef.
─────────────────────────────────────────
0.mom_hs                  0.0          (.)
1.mom_hs                  5.9        (2.2)
mom_iq                    0.6        (0.1)
noise1                   -0.7        (0.9)
noise2                    0.6        (0.9)
noise3                    0.7        (0.9)
noise4                    0.6        (0.9)
noise5                    0.9        (0.9)
Intercept                26.1        (5.9)
─────────────────────────────────────────
r2                       0.22
─────────────────────────────────────────
Standard errors in parentheses
Recognizing overfitting
R-squared appears to go up slightly, from 0.21 to 0.22, but the regression also gives an adjusted R-squared of 0.21, which adds a penalty for the extra noise predictors that do nothing.
The cross-validation tells the same story: the LOO root mean squared error has actually increased (18.20 to 18.34) and consequently the pseudo-R2 has decreased (0.203 to 0.192).
. cv_regress

Leave-One-Out Cross-Validation Results
─────────────────────────┬───────────────
          Method         │     Value
─────────────────────────┼───────────────
 Root Mean Squared Errors│    18.3409
 Log Mean Squared Errors │     5.8183
 Mean Absolute Errors    │    14.5802
 Pseudo-R2               │     0.19165
─────────────────────────┴───────────────
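The adjusted R-squared penalty can also be reproduced by hand. A minimal sketch, using only the e() results that regress already stores (k is taken from e(df_m), here 7), applies adj R2 = 1 - (1 - R2)(n - 1)/(n - k - 1) and checks it against Stata's own e(r2_a):
* Sketch: adjusted R2 from R2, n = e(N), k = e(df_m)
qui regress kid_score i.mom_hs c.mom_iq noise1-noise5
display "R2 = " %5.3f e(r2) ",  adj R2 by hand = " ///
    %5.3f (1 - (1 - e(r2))*(e(N) - 1)/(e(N) - e(df_m) - 1)) ///
    ",  e(r2_a) = " %5.3f e(r2_a)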