FROM MANOVA TO GLM: BASICS OF PARAMETERIZATION

David P. Nichols

From SPSS Keywords, Number 64, 1997

In Release 7.0, SPSS introduced a new GLM (General Linear Models) procedure. In Release 7.5, dialog box support for the MANOVA procedure was removed (MANOVA remains available via command syntax). This article is the first of a planned series introducing GLM and easing users' transition from MANOVA to GLM. Though perhaps of most immediate interest to users of Release 7.0 and later, the statistical topics discussed are relevant to users of any version of SPSS.

As regular readers of the Statistically Speaking section of SPSS Keywords are no doubt aware, I believe that understanding how a model is parameterized is crucial to interpreting the results of any statistical modeling procedure. Understanding the differences between the GLM and MANOVA approaches to parameterizing models is important both for learning how to use the procedures and for what the comparison can show us about aspects of modeling that apply to any approach.

The topics we will undertake are quite complicated, so we're going to start at the beginning and move to more complex situations in time. The simplest linear model we might fit to a set of data is one in which each value is predicted to be the same; that is, the model contains only a constant term. The basis or design matrix for a constant-only model contains a single column, with the value 1 for every case. The least squares solution for such a model predicts the same value for every case: the mean of all cases. Obviously, this is not a model that we are likely to entertain seriously in the great majority of circumstances. However, it is quite useful as a baseline model against which to compare models one step up in complexity.
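To make this concrete, here is a minimal numpy sketch (with made-up data, not taken from any SPSS example) showing that the least squares fit of a constant-only model, whose design matrix is a single column of 1s, is simply the mean of the dependent variable:

    import numpy as np

    # Made-up data; any values would do
    y = np.array([4.0, 7.0, 5.0, 8.0, 6.0])

    # Constant-only model: the design matrix is a single column of 1s
    X = np.ones((len(y), 1))

    # Least squares estimate of the lone parameter
    b, *_ = np.linalg.lstsq(X, y, rcond=None)
    print(b[0], y.mean())   # both 6.0: the predicted value for every case is the grand mean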

Moving up one level, we have models in which the dependent variable is presumed to be a function of a single predictor variable. If the predictor is quantitative (sometimes loosely called "continuous"), parameterization is straightforward: we simply add to the constant column a column containing the measured values of the predictor. However, if the predictor is categorical, things become more complicated, and alternative approaches to parameterization are possible.
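For the quantitative case, the sketch below (again with made-up numbers) simply places the predictor's measured values alongside the constant column and solves by least squares; the two estimates are the familiar intercept and slope:

    import numpy as np

    y = np.array([3.0, 5.0, 4.0, 7.0, 8.0])
    x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])   # measured values of a quantitative predictor

    # Constant column plus the predictor's own values, as described above
    X = np.column_stack([np.ones_like(x), x])

    b, *_ = np.linalg.lstsq(X, y, rcond=None)
    print(b)   # [1.8, 1.2]: the intercept and slope of the ordinary regression line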

For example, if we have a single three-level factor (A), there are a number of different ways to represent the model. A representation that could be referred to as canonical or basic would be to add a parameter for each level of factor A, in addition to the one for the constant term. The design matrix, after deleting redundant rows (so that we have only one row per cell or unique predicted value), would look as shown in Figure 1. We refer to this design matrix as an overparameterized indicator design matrix.

Figure 1: Overparameterized Indicator Design Matrix
-------------------------------------------------------------------------------
Level of A      C     A1     A2     A3
     1          1      1      0      0
     2          1      0      1      0
     3          1      0      0      1
-------------------------------------------------------------------------------

Here we have a 3x4 matrix, with one row for each possible value of A. There is one parameter for each level of A, in addition to that for the constant. The problem we confront here is that the final column, that for A3, can be expressed as a linear combination of the preceding columns. Specifically,

A3 = C - A1 - A2.

Because A3 can be expressed as a linear combination of preceding columns, it is mathematically redundant. That is, estimation of the fourth parameter in the model cannot add any information to that provided by estimation of the previous three parameters.
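A quick way to see the redundancy is to build the Figure 1 matrix and check its rank; the following numpy sketch confirms that the A3 column is exactly C - A1 - A2 and that the 3x4 matrix has only three linearly independent columns:

    import numpy as np

    # The overparameterized indicator design matrix of Figure 1 (columns C, A1, A2, A3)
    X = np.array([[1., 1., 0., 0.],
                  [1., 0., 1., 0.],
                  [1., 0., 0., 1.]])

    # The final column is a linear combination of the preceding ones: A3 = C - A1 - A2
    print(np.allclose(X[:, 3], X[:, 0] - X[:, 1] - X[:, 2]))   # True
    print(np.linalg.matrix_rank(X))                            # 3, not 4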

The GLM procedure handles computation of overparameterized models via a "sweep" operator that produces a generalized inverse of X'X (where X is the original basis or design matrix). The practical effect of this computational method is to alias redundant model parameters to 0; that is, no parameter estimate is produced for a redundant column. The non-aliased estimates produced for a one-factor model are the same as those produced in a linear regression model by using indicator or dummy coding, with the last category of the factor as the reference category.
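The sketch below (hypothetical data, two cases per level of A) is not GLM's sweep algorithm, but it mimics the effect of the aliasing: fixing the redundant A3 parameter at 0 by dropping its column leaves exactly the dummy-coded model with level 3 as the reference category, so the remaining estimates are the level 3 mean and the differences of the other level means from it:

    import numpy as np

    # Hypothetical data: two observations at each of the three levels of A
    y = np.array([10., 12., 14., 16., 20., 22.])
    a = np.array([ 1,   1,   2,   2,   3,   3 ])

    # Overparameterized indicator coding: constant plus one column per level of A
    X = np.column_stack([np.ones_like(y)] + [(a == k).astype(float) for k in (1, 2, 3)])

    # Mimic the aliasing: fix the redundant A3 parameter at 0 by dropping its column,
    # which leaves the dummy-coded model with level 3 as the reference category
    b_reduced, *_ = np.linalg.lstsq(X[:, :3], y, rcond=None)
    b_aliased = np.append(b_reduced, 0.0)
    print(b_aliased)   # [21., -10., -6., 0.]: constant = level 3 mean,
                       # A1 and A2 = differences from level 3, A3 aliased to 0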

Unlike MANOVA, GLM does not use user-specified contrast types to determine which parameters are to be estimated (contrast results are available in GLM, but they are produced after the initial parameter estimation phase; in MANOVA, the design matrix is actually built from the contrasts specified by the user or supplied by default). The use of a single set of basic or canonical parameters allows GLM to approach problem situations more systematically. It also makes the parameter estimates simpler to apply: regardless of the model, when reproducing predicted values each parameter estimate for a factor is either used or not used (multiplied by 1 or 0, respectively). Recall from earlier articles that the basis or design matrices used in MANOVA often include a variety of other values, depending on the type of contrasts specified and the number of levels of the factor(s) involved.
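To illustrate that last point with numbers, here is a small self-contained sketch using the hypothetical estimates from the previous sketch: the design row for a case at level 2 of A contains only 0s and 1s, so reproducing its predicted value is just a matter of adding the constant and the A2 estimate:

    import numpy as np

    # Design row for a case at level 2 of A (columns C, A1, A2, A3) and the
    # aliased estimates from the sketch above (hypothetical values)
    x_row = np.array([1., 0., 1., 0.])
    b     = np.array([21., -10., -6., 0.])

    # Each estimate is simply used (times 1) or not used (times 0)
    print(x_row @ b)   # 15.0, the predicted value (the level 2 cell mean)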

In the next installment, we'll begin to illustrate the differences between the two approaches, and highlight the essential commonalities that are at the heart of what we wish to discover when we use quantitative models.