Gelman Chapter 1-2 examples

Tim Hofer

5 Aug 2021

For printing output in the style that Gelman uses, first define the following macro

local gelman_output esttab,b(1) se wide mtitle("Coef.") scalars("rmse sigma") ///
    coef(_cons "Intercept" rmse "sigma") nonum noobs nostar var(15)

Chapter 1

Section 1.2 - Why learn regression

Load the data and make a scatterplot

import delimited https://raw.githubusercontent.com/avehtari/ROS-Examples/master/ElectionsEconomy/data/hibbs.dat,clear delim(" ")
li in 1/5
     ┌─────────────────────────────────────────────────┐
     │ year   growth    vote   inc_part~e   other_ca~e │
     ├─────────────────────────────────────────────────┤
  1. │ 1952      2.4    44.6    Stevenson   Eisenhower │
  2. │ 1956     2.89   57.76   Eisenhower    Stevenson │
  3. │ 1960      .85   49.91        Nixon      Kennedy │
  4. │ 1964     4.21   61.34      Johnson    Goldwater │
  5. │ 1968     3.02    49.6     Humphrey        Nixon │
     └─────────────────────────────────────────────────┘

Graph a scatterplot

scatter vote growth, xtitle("Economic growth") 

Now estimate the regression y = a + bx + error

. qui regress vote growth

Add a fitted line to the prior graph by drawing a function representing a straight line using the coefficients (constant and slope) extracted from the regression.

You can place more than one graph (of the twoway types) on a single set of axes by putting two (or more) graph statements in separate parentheses, as shown below

twoway (scatter vote growth, xtitle("Average recent income growth")) ///
    (function y = _b[_cons]+_b[growth]*x,range(-.5 4.5) lp(solid) )

After a regression you can refer to the coefficients by name; to see the names, replay the results with the coeflegend option (abbreviated coefleg):

. regress,coefleg

      Source │       SS           df       MS      Number of obs   =        16
─────────────┼──────────────────────────────────   F(1, 14)        =     19.32
       Model │  273.632269         1  273.632269   Prob > F        =    0.0006
    Residual │  198.272733        14  14.1623381   R-squared       =    0.5798
─────────────┼──────────────────────────────────   Adj R-squared   =    0.5498
       Total │  471.905002        15  31.4603335   Root MSE        =    3.7633

─────────────┬────────────────────────────────────────────────────────────────
        vote │ Coefficient  Legend
─────────────┼────────────────────────────────────────────────────────────────
      growth │   3.060528  _b[growth]
       _cons │   46.24765  _b[_cons]
─────────────┴────────────────────────────────────────────────────────────────

You can add the following lines to the code for the graph above to get the formatted graph in the book (remember to add a line continuation symbol /// at the end of the prior line)

  ytitle("Incumbent party's vote share") legend(off) ///
  ylab(45 "45%" 50 "50%" 55 "55%" 60 "60%")  ///
  xlab(0 "0%" 1 "1%" 2 "2%" 3 "3%" 4 "4%")  ///
  ysize(5) xsize(5)  ysc(fextend) xsc(fextend) ///
  subtitle(Data and linear fit) yline(50,lw(vthin)) ///
  text(53 2.5 "y= 46.3 + 3.0 x", place(e))

Now print the output in Gelman's style using the macro we defined above.

`gelman_output'
─────────────────────────────────────────
                       Coef.             
─────────────────────────────────────────
growth                   3.1        (0.7)
Intercept               46.2        (1.6)
─────────────────────────────────────────
sigma                    3.8             
─────────────────────────────────────────
Standard errors in parentheses

You could also just issue the regress command, which will replay the last results. (Note where the estimates featured in Gelman's simplified table come from.)

. regress

      Source │       SS           df       MS      Number of obs   =        16
─────────────┼──────────────────────────────────   F(1, 14)        =     19.32
       Model │  273.632269         1  273.632269   Prob > F        =    0.0006
    Residual │  198.272733        14  14.1623381   R-squared       =    0.5798
─────────────┼──────────────────────────────────   Adj R-squared   =    0.5498
       Total │  471.905002        15  31.4603335   Root MSE        =    3.7633

─────────────┬────────────────────────────────────────────────────────────────
        vote │ Coefficient  Std. err.      t    P>|t|     [95% conf. interval]
─────────────┼────────────────────────────────────────────────────────────────
      growth │   3.060528   .6962739     4.40   0.001     1.567169    4.553887
       _cons │   46.24765   1.621932    28.51   0.000     42.76895    49.72635
─────────────┴────────────────────────────────────────────────────────────────
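Because the stored coefficients are accessible by these names, you can also use them directly in calculations. For example, a quick illustrative check (my own, not from the book) of the fitted vote share at 2% economic growth:

* predicted incumbent vote share at 2% growth: about 46.2 + 3.1*2 = 52.4
di _b[_cons] + _b[growth]*2

The standard errors are available the same way, as _se[growth] and _se[_cons].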

Another way to add the straight line to the scatterplot is with the twoway lfit (or lfitci) plot type. lfitci lets you put confidence intervals around the line: either for the regression line itself (which represents the predicted conditional means you would observe if you drew another sample), or, as requested here with the stdf (forecast) option, for the individual points. The latter is the CI for where the observed points might lie if you were to draw another sample from the same population.

twoway (lfitci vote growth, lp(solid) stdf ) ///
    (scatter vote growth,m(O) xtitle("Average recent income growth"))

Section 1.6 Computing least squares and Bayesian regression

(page 16)

bayes: regress vote growth 

Burn-in ...
Simulation ...

Model summary
------------------------------------------------------------------------------
Likelihood: 
  vote ~ regress(xb_vote,{sigma2})

Priors: 
  {vote:growth _cons} ~ normal(0,10000)                                    (1)
             {sigma2} ~ igamma(.01,.01)
------------------------------------------------------------------------------
(1) Parameters are elements of the linear form xb_vote.

Bayesian linear regression                       MCMC iterations  =     12,500
Random-walk Metropolis–Hastings sampling         Burn-in          =      2,500
                                                 MCMC sample size =     10,000
                                                 Number of obs    =         16
                                                 Acceptance rate  =      .3159
                                                 Efficiency:  min =     .08521
                                                              avg =      .0914
Log marginal-likelihood = -56.640716                          max =     .09876
 
------------------------------------------------------------------------------
             |                                                Equal-tailed
             |      Mean   Std. dev.     MCSE     Median  [95% cred. interval]
-------------+----------------------------------------------------------------
vote         |
      growth |   3.06295   .7641382   .024315   3.074891   1.564028    4.54869
       _cons |   46.2363   1.805143   .060097   46.26667    42.5979   49.78299
-------------+----------------------------------------------------------------
      sigma2 |  16.43696   7.220643   .247357   14.84454   7.525531   35.19754
------------------------------------------------------------------------------
Note: Default priors are used for model parameters.
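Note that Stata's default priors, normal(0,10000) for the coefficients and igamma(.01,.01) for the variance, differ from the weakly informative defaults that stan_glm uses in the book, so the posterior summaries match the book's only approximately. If you want to set priors yourself, you can pass prior() options to the bayes prefix; the values below are purely illustrative, not the book's:

bayes, prior({vote:growth}, normal(0,100)) ///
    prior({vote:_cons}, normal(50,100)): regress vote growth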

Chapter 2

Section 2.3 All graphs are comparisons

(page 25)

Import the data and then create a binary flag variable (called exclude) so that we can leave some of the countries out of the graph for legibility

import delimited https://raw.githubusercontent.com/avehtari/ROS-Examples/master/HealthExpenditure/data/healthdata.txt,clear delim(" ")
gen exclude=inlist(country,"Netherlands", "Belgium", "Germany", ///
  "Ireland", "Iceland", "Greece", "Italy", "Sweden", "UK")

Then generate the graph in the book. The tilde (logical NOT) before exclude tells scatter to include only observations where exclude is 0, i.e. countries not flagged for exclusion

twoway scatter lifespan spending if ~exclude, mlabel(country) m(i)
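Because m(i) makes the markers invisible, only the country labels are drawn; by default they sit slightly offset from the (invisible) points. To center each label on its data point and add axis titles, you can extend the command as below (the titles are my own wording, not copied from the book):

twoway scatter lifespan spending if ~exclude, mlabel(country) m(i) mlabpos(0) ///
    xtitle("Health care spending (PPP US$)") ytitle("Life expectancy (years)")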