UP504 • Multiple Regression
MORE COMMENTS
last updated: Wednesday, January 16, 2008
Assignment One
see also regression notes
QUESTION: How do I start the assignment -- just jump
into running regression on SPSS?
ANSWER: I strongly recommend that you first explore the range and variation
of housing values across the county (see the thematic maps linked from
the assignment page). Then sit down and think conceptually about which variables
would most likely be useful in explaining this variation in housing value. Some
(but certainly not all) of these variables will be in the data set. (And remember:
the unit of analysis is the census tract, not the individual household, so be wary
of committing an ecological fallacy, i.e., of drawing conclusions about individual
households from tract-level patterns.)
QUESTION: how many variables in my model?
ANSWER: You can use as many variables as you want -- or at least as many as make
up a good model. Too few variables may lead to an under-specified model. Remember:
there is only one dependent variable (median housing value), but two or more
independent variables.
QUESTION: What is more important: coefficients that
are significant at the .000 level or a higher F score?
ANSWER: That is ostensibly a trade-off, but the key is to make sure that all
the variables are significant at the .05 level. (A reported significance of .000 is
even more statistically significant, but .05 is certainly sufficient.) As long as the
F is significant, I would not hesitate to add an additional variable to your model
(even if it reduces F somewhat), provided that all the independent variables remain
significant and that you note an increase in your R² (preferably the adjusted R²,
since the plain R² never decreases when a variable is added). (There are trade-offs
between a parsimonious model and one with lots of independent variables.)
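The trade-off described above can be made concrete with the standard formulas for the overall F statistic and the adjusted R². What follows is only a minimal sketch in Python (not SPSS output): the function names and the sample numbers (100 tracts, R² rising from 0.60 to 0.61 when a fourth variable is added) are hypothetical and chosen purely for illustration.

```python
def f_statistic(r2, n, k):
    """Overall F for a regression with k independent variables and n cases."""
    return (r2 / k) / ((1 - r2) / (n - k - 1))


def adjusted_r2(r2, n, k):
    """R-squared penalized for the number of independent variables k."""
    return 1 - (1 - r2) * (n - 1) / (n - k - 1)


# Hypothetical numbers, for illustration only (not from the assignment data).
n = 100
models = {
    "3 variables": {"k": 3, "r2": 0.60},
    "4 variables": {"k": 4, "r2": 0.61},
}

for label, m in models.items():
    f = f_statistic(m["r2"], n, m["k"])
    adj = adjusted_r2(m["r2"], n, m["k"])
    print(f"{label}: F = {f:.1f}, adjusted R2 = {adj:.3f}")
```

With these made-up numbers, F falls from roughly 48 to roughly 37 when the fourth variable is added, while the adjusted R² still edges upward. That is the situation in which adding the variable is defensible, provided F and all coefficients remain significant.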
QUESTION: How do I remove an outlier case (such as
"Census Tract 4229" in Washtenaw County) -- that is, how do I "Run if census tract
not equal to 4229"? We could not figure out how to do this, so we simply deleted
case "4229" from the data sheet.
ANSWER: You can either simply delete the case (which is the easiest but
essentially a permanent solution), or you can -- within the regression command
box -- click on "IF" and instruct SPSS to include only those cases
where census tract is not equal to 4229 (or use some other criterion).
To check which cases are used in the analysis, select (also in the regression
command box), under residuals, a diagnostic of ALL CASES (and you can include
census tract as a case label). You can then see which cases are included, and
also what the residual value is for each case. (Here the residual means the
gap or difference between the actual value and the predicted value for each
case.)
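The logic of the non-destructive option (filter rather than delete) and of the residual can be sketched outside SPSS. This is only an illustration in Python: the records, including tract 4001 and every dollar value, are made up, and in a real analysis the predicted values would come from the model fitted to the selected cases.

```python
# Hypothetical cases: tract numbers and dollar values are invented for illustration.
cases = [
    {"tract": 4001, "actual": 150000, "predicted": 142000},
    {"tract": 4229, "actual": 900000, "predicted": 310000},  # the outlier
    {"tract": 4102, "actual": 95000, "predicted": 101000},
]

EXCLUDED_TRACTS = {4229}

# Keep only cases where census tract is not 4229.
# The original list is left untouched, so the exclusion is easy to undo.
selected = [c for c in cases if c["tract"] not in EXCLUDED_TRACTS]

# Residual = actual value minus predicted value, one per included case.
residuals = {c["tract"]: c["actual"] - c["predicted"] for c in selected}

for tract, residual in residuals.items():
    print(tract, residual)
```

Listing the tract alongside each residual plays the same role as using census tract as a case label: it confirms at a glance that 4229 is absent from the cases actually analyzed.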
QUESTION: How do I create a dummy variable to identify a specific geographic area
(e.g., a set of census tracts)? For example, could we create a dummy variable that
identifies the census tracts that lie in the Ypsilanti area?
ANSWER: For Ypsi, see the census tract map to identify the tracts in that city
(I think they are 4102-4103 and 4106-4112, but confirm). You can then create
a new variable that has the value of "1" for these cases and "0"
for the rest. (You can do this either through RECODE into new variable, or else
simply do it manually by creating a new variable and then typing in zeros and
ones.)
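The recoding rule amounts to a simple membership test. Here is a minimal Python sketch of that logic, assuming tract numbers are stored as plain integers and using the tentative Ypsilanti ranges given above (which still need to be confirmed against the map); the name `ypsi_dummy` and the sample tract list are hypothetical.

```python
# Tentative Ypsilanti tracts from the answer above: 4102-4103 and 4106-4112.
# Confirm these against the census tract map before relying on them.
YPSI_TRACTS = set(range(4102, 4104)) | set(range(4106, 4113))


def ypsi_dummy(tract):
    """Return 1 if the tract is in the Ypsilanti set, otherwise 0."""
    return 1 if tract in YPSI_TRACTS else 0


# Hypothetical sample of tract numbers to recode.
tracts = [4001, 4102, 4105, 4112]
dummies = [ypsi_dummy(t) for t in tracts]
print(list(zip(tracts, dummies)))
```

Note that 4104 and 4105 fall in the gap between the two ranges, so they are coded 0.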
QUESTION: How to define the threshold for converting
an interval scale variable into a dummy variable (e.g., "high percentage
of seniors")?
ANSWER: If you are converting an interval scale variable to a dummy variable,
you can use whatever threshold value (to separate the "0" and "1"
values) you wish. You might look for a natural break in the data, or use the
median value, or some other value that makes sense. You can use trial and error
here. (It is NOT necessary to have the 0 and 1 values evenly distributed.) That
is, you could try 10% or 15% or whatever seems to work and/or theoretically
makes sense. Sometimes a scatterplot helps you see a logical break point.
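The trial-and-error approach can be sketched as follows: apply several candidate thresholds and see how many cases fall on each side. This is only an illustration in Python; the percentages are made up, the helper `to_dummy` is hypothetical, and treating a value equal to the threshold as "1" is an arbitrary choice.

```python
from statistics import median

# Hypothetical "percent seniors" values for eight tracts (invented for illustration).
pct_seniors = [4.2, 7.5, 9.8, 11.0, 12.4, 15.1, 18.3, 22.6]


def to_dummy(values, threshold):
    """Code each value 1 if it is at or above the threshold, otherwise 0."""
    return [1 if v >= threshold else 0 for v in values]


# Try a few candidate cut points: 10%, 15%, and the median.
for threshold in (10.0, 15.0, median(pct_seniors)):
    dummy = to_dummy(pct_seniors, threshold)
    ones = sum(dummy)
    print(f"threshold {threshold:.1f}: {ones} ones, {len(dummy) - ones} zeros")
```

Only the median splits the cases evenly; the other thresholds give uneven groups, which is acceptable as long as the cut point makes sense theoretically.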