USING CATEGORICAL VARIABLES IN REGRESSION

David P. Nichols

From SPSS Keywords, Number 56, 1995

When we polled Keywords readers to find out what kinds of topics they most wanted to see covered in future Statistically Speaking articles, we found that many SPSS users are concerned about the proper use of categorical predictor variables in regression models. Since the interpretation of the estimated coefficients is a major part of the analysis of a regression model, and since this interpretation depends upon how the predictors have been coded (or in technical terms, how the model has been parameterized), this is indeed an important topic.

To begin with, we will assume that the model under consideration involves only first order or main effects of predictor variables. That is, no higher order polynomial terms such as squares or cubes are used, and no interactions between predictors are involved. Such higher order or product terms introduce complexities beyond those introduced by the presence of main effects involving categorical variables. We will avoid these complexities for the time being. We will further assume that we have complete data; that is, no missing values on any predictor or dependent variables. We begin with a brief review of the interpretation of estimated regression coefficients.

As you may remember, in a linear regression model the estimated raw or unstandardized regression coefficient for a predictor variable (referred to as B on the SPSS REGRESSION output) is interpreted as the change in the predicted value of the dependent variable for a one unit increase in the predictor variable. Thus a B coefficient of 1.0 would indicate that for every unit increase in the predictor, the predicted value of the dependent variable also increases by one unit. In the common case where there are two or more correlated predictors in the model, the B coefficient is known as a partial regression coefficient, and it represents the predicted change in the dependent variable when that predictor is increased by one unit while holding all other predictors constant. The intercept or constant term gives the predicted value of the dependent variable when all predictors are set to 0.
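To make this interpretation concrete, here is a small numerical sketch. It is written in Python with NumPy rather than in SPSS, and it uses made-up data; the point is only that raising one predictor by one unit while holding the other fixed changes the predicted value by exactly that predictor's B coefficient.

import numpy as np

rng = np.random.default_rng(0)
n = 100
x1 = rng.normal(size=n)
x2 = 0.5 * x1 + rng.normal(size=n)            # two correlated predictors
y = 2.0 + 1.5 * x1 - 0.7 * x2 + rng.normal(size=n)

X = np.column_stack([np.ones(n), x1, x2])     # constant column plus the predictors
b = np.linalg.lstsq(X, y, rcond=None)[0]      # b[0] = constant, b[1] and b[2] = B coefficients

predict = lambda a1, a2: b[0] + b[1] * a1 + b[2] * a2
print(predict(1.0, 0.0) - predict(0.0, 0.0))  # change for a one unit increase in x1
print(b[1])                                   # ... equals the B coefficient for x1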

For our purposes the important distinction between types of predictor variables is between those measured on at least an interval scale, where a change of one unit in the predictor has a constant meaning across the entire scale, and those for which such consistency of unit differences is not assumed. Though these are theoretically distinct properties, in practice the terms interval and subinterval are often replaced by continuous and categorical. The interpretation of estimated regression coefficients given above applies in a fairly straightforward manner to interval predictors, continuous or not, and their use in procedures like REGRESSION is quite simple as a practical matter: just name them as independent variables and specify when you want them used. For subinterval variables, which is how SPSS treats categorical variables, things are more complicated. Although equating continuous with interval and categorical with subinterval is an abuse of language, we will proceed to do just that, to avoid confusion related to the use of SPSS procedures.

One reason that the handling of categorical predictors is so important is that by the time one gets to the actual computation of the regression equation, no distinction is made between subinterval and interval variables. To put it another way, a matrix algebra routine knows nothing about different types of numbers; they're all just numbers. Some SPSS procedures used to analyze linear and generalized linear regression models are designed to handle the translation from categorical to interval representations with only minimal guidance from the user. These include the T-TEST procedure, the analysis of variance procedures ONEWAY, ANOVA and MANOVA, and the newer nonlinear regression procedures LOGISTIC REGRESSION and COX REGRESSION. However, even when such automatic handling of categorical predictors is available, it is still incumbent upon the user to make sure that he or she understands categorical variable representations well enough to produce useful results and to be able to interpret these results.

The simplest possible regression involving categorical predictors is one with a single dichotomous (two level) independent variable. An example of such a regression model would be the prediction of 1990 murder rates in each of the 50 states in the U.S.A. based upon whether or not each state had a death penalty statute in force just prior to and during that time. The data are compiled from almanac sources; murder rates are measured in number per 100,000 population. The variable of interest, denoted MURDER90, has a mean value of about 4.97 for the fourteen states without a death penalty statute, and about 7.86 for the 36 states with the death penalty.

Figure 1 presents the results of a dummy variable regression of MURDER90 on DEATHPEN, a categorical variable taking on a value of 0 for the no death penalty states and 1 for the death penalty states. Coding with 0 and 1, known as dummy or indicator coding, is quite popular because it often lends itself to the simplest possible interpretation.

Figure 1
---------------------------------------------------------------------------
Multiple R           .33556
R Square             .11260
Adjusted R Square    .09411
Standard Error      3.72103

Analysis of Variance
                     DF      Sum of Squares      Mean Square
Regression            1            84.33257         84.33257
Residual             48           664.61163         13.84608

F =       6.09072       Signif F =  .0172

------------------ Variables in the Equation ------------------

Variable              B        SE B       Beta         T  Sig T

DEATHPEN       2.892460    1.172015    .335562     2.468  .0172
(Constant)     4.971429     .994488                4.999  .0000
---------------------------------------------------------------------------

Here we have two coefficients, a constant or intercept term, and a "slope" coefficient for the DEATHPEN variable. Recall that the interpretation is that the constant is the predicted value when all predictors are set to 0, which here simply represents those states with no death penalty. Thus the constant coefficient is equal to the mean murder rate for this group. The DEATHPEN coefficient is the predicted increase in murder rate for a unit increase in the DEATHPEN variable. Since those states with a DEATHPEN value of 1 are those states with a death penalty statute, this coefficient represents the change in estimated or predicted murder rate for these states relative to those without the death penalty. The 2.89 value is exactly the difference between the two means, so that adding it to the constant produces the mean for the death penalty states. Since we are considering the entire population of states, the significance level is not necessarily of particular interest. However, if we were to conceptualize the current situation as the result of sampling from hypothetical populations, the p-value of .0172 would indicate that so large a coefficient would be unlikely to arise by chance if random samples of this size were drawn from populations with equal means.
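The equality of the constant with one group mean and of the B coefficient with the difference between the group means is a general property of 0-1 dummy coding, not a peculiarity of these data. The following sketch (Python with NumPy, using randomly generated data rather than the actual state figures) verifies it:

import numpy as np

rng = np.random.default_rng(1)
group = np.repeat([0, 1], [14, 36])                 # 14 cases coded 0, 36 coded 1
y = rng.normal(loc=5.0 + 3.0 * group, scale=3.0)    # made-up dependent variable

X = np.column_stack([np.ones(group.size), group])   # constant plus the dummy predictor
const, b = np.linalg.lstsq(X, y, rcond=None)[0]

m0, m1 = y[group == 0].mean(), y[group == 1].mean()
print(np.allclose(const, m0))      # True: constant = mean of the group coded 0
print(np.allclose(b, m1 - m0))     # True: B = difference between the group means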

Other results of note are that the p-value for the t-test for the DEATHPEN coefficient is the same as that for the overall regression F-test. This is because the t-test tests the null hypothesis that this coefficient is 0 in the population, while the F-test tests the null hypothesis that all coefficients other than the intercept are 0 in the population; with only one predictor, these hypotheses are the same. The F-value is precisely the square of the t-value. This holds only for a simple regression involving one predictor. Also of note is the fact that the Multiple R, which reduces to the absolute value of the correlation between the predictor and the dependent variable in a simple regression, matches the standardized regression coefficient (Beta) in absolute value. In a simple regression, the standardized coefficient is the correlation between the predictor and dependent variables, and is thus constrained to be between -1 and +1. Note that this holds true only for a simple regression; with correlated predictor variables, the standardized coefficients may be larger than 1 in absolute value.
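As a check on the numbers in Figure 1, 2.468 squared is about 6.09, the reported F. Readers who want to verify these relationships more generally can run a sketch such as the following (Python with NumPy, arbitrary data); in any simple regression, F equals t squared and Multiple R equals the absolute value of both the Pearson correlation and the standardized coefficient.

import numpy as np

rng = np.random.default_rng(2)
x = rng.normal(size=50)
y = 1.0 + 0.8 * x + rng.normal(size=50)
n = x.size

b1 = np.cov(x, y, ddof=1)[0, 1] / np.var(x, ddof=1)   # slope
b0 = y.mean() - b1 * x.mean()                         # intercept
resid = y - (b0 + b1 * x)
mse = resid @ resid / (n - 2)                         # residual mean square
se_b1 = np.sqrt(mse / np.sum((x - x.mean()) ** 2))    # standard error of the slope

t = b1 / se_b1
ssr = np.sum((b0 + b1 * x - y.mean()) ** 2)           # regression sum of squares (1 df)
F = ssr / mse
r = np.corrcoef(x, y)[0, 1]
beta = b1 * x.std(ddof=1) / y.std(ddof=1)             # standardized coefficient

print(np.allclose(F, t ** 2))                         # True
print(np.allclose(abs(r), abs(beta)))                 # True; Multiple R is this value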

This correlation between a dichotomous variable and a continuous variable is sometimes known as a point-biserial correlation. No special formula is required; the shortcut computational formulas found in texts are simply the general Pearson product moment correlation coefficient formula applied to this combination of variable types. If both variables are dichotomous, the standard formula reduces further to that for a phi coefficient.
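A short sketch (Python with NumPy, made-up data) shows that the shortcut formula and the general Pearson formula do indeed agree:

import numpy as np

rng = np.random.default_rng(3)
d = rng.integers(0, 2, size=60)                    # dichotomous 0/1 variable
y = rng.normal(loc=4.0 + 2.0 * d, scale=2.0)       # continuous variable

r_pearson = np.corrcoef(d, y)[0, 1]                # ordinary Pearson correlation

p = d.mean()                                       # proportion of cases coded 1
m1, m0 = y[d == 1].mean(), y[d == 0].mean()
r_pb = (m1 - m0) / y.std() * np.sqrt(p * (1 - p))  # point-biserial shortcut (population SD)

print(np.allclose(r_pearson, r_pb))                # True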

Finally, note that there are a number of ways in SPSS to achieve the same results we obtained from REGRESSION, if our purpose were to test the null hypothesis of equality of means between the two groups of states drawn from our hypothetical populations. Precisely the same t-statistic (or the negative of the value from REGRESSION, which means the same thing, given the variable codings) could be obtained from T-TEST, the CONTRAST option in ONEWAY or parameter estimate output in MANOVA, and the F-statistic could be duplicated in ONEWAY, ANOVA or MANOVA. In ONEWAY or ANOVA we would have to use the dummy variable for DEATHPEN as a two level factor, while in MANOVA we could either specify it as a factor or as a covariate. The results in any case would be the same in terms of test statistics and p-values. One example is given in Figure 2, using default DEVIATION contrasts in MANOVA:
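The same equivalence can be illustrated outside of SPSS. Assuming SciPy is available, the sketch below (made-up data, not the state figures) shows that a pooled-variance two sample t-test and a one-way ANOVA F-test on the same two groups produce F = t squared and identical p-values:

import numpy as np
from scipy import stats

rng = np.random.default_rng(4)
g0 = rng.normal(loc=5.0, scale=3.0, size=14)       # illustrative group data
g1 = rng.normal(loc=8.0, scale=3.0, size=36)

t_res = stats.ttest_ind(g0, g1, equal_var=True)    # pooled-variance t-test
f_res = stats.f_oneway(g0, g1)                     # one-way ANOVA F-test

print(np.allclose(t_res.statistic ** 2, f_res.statistic))   # True
print(np.allclose(t_res.pvalue, f_res.pvalue))               # True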

Figure 2
---------------------------------------------------------------------------
Tests of Significance for MURDER90 using UNIQUE sums of squares
Source of Variation          SS      DF        MS         F  Sig of F
WITHIN+RESIDUAL          664.61      48     13.85
DEATHPEN                  84.33       1     84.33      6.09      .017
- - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
Estimates for MURDER90
--- Individual univariate .9500 confidence intervals
CONSTANT
Parameter       Coeff.   Std. Err.     t-Value      Sig. t  Lower -95%  CL- Upper
      1     6.41765873      .58601    10.95150      .00000     5.23941    7.59591
DEATHPEN
Parameter       Coeff.   Std. Err.     t-Value      Sig. t  Lower -95%  CL- Upper
      2     -1.4462302      .58601    -2.46794      .01721    -2.62448    -.26798
---------------------------------------------------------------------------

We see that the significance level for the t-test for the DEATHPEN parameter estimate is the same as we obtained in REGRESSION. However, the coefficient is opposite in sign to our earlier coefficient, and only half the size. Our constant has also changed; note that its value is now halfway between the means of the two groups. The differences we see here are due to a different set of predictor codings being used internally by MANOVA. That is, MANOVA has parameterized the model somewhat differently than we did earlier. The default DEVIATION contrasts in MANOVA are designed to compare each level of a factor to the mean of all levels. In this case the DEATHPEN coefficient compares the no death penalty mean to the simple average of the two group means. The CONSTANT coefficient is this simple mean of group means. The F-statistic remains the same, the square of the t-value, as was the case in REGRESSION.

These results point to three important features of the regression model. One is that the interpretation of the estimated model coefficients depends upon the parameterization of the model; in order to know how to interpret the coefficients, we must be aware of how the predictor values have been coded. Second, despite having used two different parameterizations, we obtained the same results in terms of the test statistics for DEATHPEN. This result would occur regardless of the two numerical values used to represent the groups; all we could do by changing these values would be to flip the sign of the coefficient and inflate or deflate the coefficient and its standard error by an equal factor, so that the absolute value of the ratio remains the same. This is true because with only two groups, any numerical representation of a comparison between them can differ only in sign and scale, so any differences in practical results are due to scaling considerations. Another way of saying this is to note that since we have only two groups, there can be only one degree of freedom in any test used to compare them, and the results must therefore always be the same. Finally, though the identical error sums of squares only intimate this and do not necessarily prove it, it is the case that the predicted values produced by the two approaches are identical. In other words, we have really fitted the same overall model in two slightly different ways.

We have yet to identify the codings given to the two levels of DEATHPEN that resulted in the MANOVA parameter estimates. In MANOVA we specified DEATHPEN as a categorical factor variable with codes of 0 and 1, and had the procedure internally create the design or basis matrix required for the model fitting. In REGRESSION, only the constant or intercept column of 1's is provided automatically by the procedure; the other columns are provided by the user in the form of the predictor variables specified. In MANOVA, the procedure automatically creates a set of predictor variables to represent a factor instead of requiring the user to do so. In the case of a dichotomous factor, MANOVA creates only one predictor in addition to the constant term, and by default it gives this variable values of 1 and -1, respectively. In our example, the states without the death penalty are the first group (having factor variable value 0), and are coded 1, while states with the death penalty receive a value of -1.

If we recall the interpretation of the regression coefficient as the increase in the predicted value of the dependent variable for a unit increase in the predictor, we can see why the DEATHPEN coefficient in MANOVA is -1/2 times the one in REGRESSION. First, the directionality has been changed: an increase in the predictor means moving from the death penalty group toward the no death penalty group, hence the change in sign. Second, in order to compare the two groups in this parameterization, we must move two units, from -1 to 1, rather than from 0 to 1. Thus the two parameterizations are really telling us exactly the same thing. This is further illustrated by using the MANOVA results to predict the murder rates of the two groups. For the states with no death penalty, we add the CONSTANT and DEATHPEN coefficients, giving us a predicted value of about 4.97. For the death penalty group, we subtract the DEATHPEN coefficient from the CONSTANT, and obtain a predicted value of about 7.86. These are of course the same values obtained using REGRESSION.
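The algebra behind this equivalence can be verified directly. The sketch below (Python with NumPy, randomly generated data) refits the same two group model with MANOVA-style 1 and -1 coding and confirms that the constant becomes the unweighted mean of the two group means, the coefficient becomes -1/2 of the dummy-coded B, and the predicted values are unchanged:

import numpy as np

rng = np.random.default_rng(5)
group = np.repeat([0, 1], [14, 36])
y = rng.normal(loc=5.0 + 3.0 * group, scale=3.0)    # made-up dependent variable

dummy = group.astype(float)                         # 0 = first group, 1 = second group
effect = np.where(group == 0, 1.0, -1.0)            # 1 = first group, -1 = second group

Xd = np.column_stack([np.ones(y.size), dummy])
Xe = np.column_stack([np.ones(y.size), effect])
bd = np.linalg.lstsq(Xd, y, rcond=None)[0]
be = np.linalg.lstsq(Xe, y, rcond=None)[0]

m0, m1 = y[group == 0].mean(), y[group == 1].mean()
print(np.allclose(be[0], (m0 + m1) / 2))   # constant = unweighted mean of the group means
print(np.allclose(be[1], -bd[1] / 2))      # coefficient = -1/2 of the dummy-coded B
print(np.allclose(Xd @ bd, Xe @ be))       # identical predicted values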

What if we wanted to produce the same estimates in MANOVA that we obtained in REGRESSION? The only straightforward way to produce exactly the same estimates would be to enter the DEATHPEN predictor as a covariate coded 0-1. (There is a way to trick MANOVA into providing the same coefficients as REGRESSION even with DEATHPEN as a factor, but we'll ignore that here.) The reason for this is that in its automatic reparameterization or internal recoding of the factor(s), MANOVA enforces a sum-to-0 restriction on the values of the category codings. Thus 0-1 coding is not available. We can still obtain the same parameter value for the difference between the two groups of states, however. This can be done by using SIMPLE contrasts with the first category as the reference category. This parameterization uses category codes of -1/2 and 1/2, so that an increase of one unit in the predictor means a change from no death penalty to the death penalty, and the resulting coefficient is the same in both magnitude and sign as that given in REGRESSION. However, the constant or intercept term would still be the unweighted mean of the two group means. Using the CONSTANT coefficient from the MANOVA output plus or minus 1/2 times the DEATHPEN coefficient (which under SIMPLE contrasts equals the one from the REGRESSION output), you can verify that this parameterization again produces exactly the same predicted values as our earlier approaches.
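The same kind of check works for this parameterization. In the sketch below (Python with NumPy, randomly generated data), coding the two groups -1/2 and 1/2 yields a coefficient equal to the difference between the group means, just as 0-1 coding does, while the constant remains the unweighted mean of the two group means, and the implied predicted values are again the group means:

import numpy as np

rng = np.random.default_rng(6)
group = np.repeat([0, 1], [14, 36])
y = rng.normal(loc=5.0 + 3.0 * group, scale=3.0)    # made-up dependent variable

simple = np.where(group == 0, -0.5, 0.5)            # -1/2 = first group, 1/2 = second group
Xs = np.column_stack([np.ones(y.size), simple])
const, b = np.linalg.lstsq(Xs, y, rcond=None)[0]

m0, m1 = y[group == 0].mean(), y[group == 1].mean()
print(np.allclose(b, m1 - m0))                  # same coefficient as with 0-1 coding
print(np.allclose(const, (m0 + m1) / 2))        # constant = unweighted mean of the group means
print(np.allclose(const - b / 2, m0), np.allclose(const + b / 2, m1))  # predicted values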

So much for the simple situation of a dichotomous predictor. As we have seen, in this situation the coding of the variable is important in interpreting the value of the regression coefficient, but not when we want to test whether the predictor has a nonzero population relationship with the dependent variable. One way to think about this fact is that when there are only two values of a predictor, there is only one interval between those values, so the assumption of equal meanings of intervals is automatically satisfied. However, once we move to predictors with more than two levels, things become more complicated. We'll save those complications for the next issue.