MY TESTS DON'T AGREE!

David P. Nichols

From SPSS Keywords, Number 66, 1998

A common question asked of SPSS Statistical Support is how to interpret a set of tests that address the same or logically related null hypotheses yet produce different conclusions. The prime example is a situation where an omnibus F-test in an analysis of variance (ANOVA) produces a significance level (p-value) less than a critical alpha (such as .05), but follow-up tests comparing levels of the factor produce no p-values less than alpha, or conversely, where the omnibus F-test is not significant at the given alpha level while one or more pairwise comparisons are significant.

As an example, consider a one-way ANOVA model of a very simple form: three groups, equal sample sizes, with the standard independence, normality, and homogeneity of variance assumptions met. Further, assume that our numbers have been measured without error. The null hypothesis tested by the omnibus F-test is that all three population means are equal:

μ1 = μ2 = μ3.

The null hypothesis tested by a pairwise comparison of groups i and j is that these two population means are equal:

μi = μj.
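
To make the two kinds of tests concrete, here is a minimal sketch in Python with the scipy library (not SPSS syntax), run on data simulated to match the setup above; the sample size, population means, and random seed are invented for illustration:

    import numpy as np
    from scipy import stats

    # Three groups, equal n, independent normal samples with a common
    # variance, per the assumptions stated above (illustration values).
    rng = np.random.default_rng(42)
    g1, g2, g3 = (rng.normal(loc=m, scale=1.0, size=20) for m in (0.0, 0.5, 1.0))

    # Omnibus test of the null hypothesis mu1 = mu2 = mu3.
    result = stats.f_oneway(g1, g2, g3)
    print(f"omnibus F = {result.statistic:.3f}, p = {result.pvalue:.4f}")

    # Pairwise tests of mu_i = mu_j (simple pooled-variance t tests).
    for name, (a, b) in [("1 vs 2", (g1, g2)), ("1 vs 3", (g1, g3)), ("2 vs 3", (g2, g3))]:
        t = stats.ttest_ind(a, b)
        print(f"groups {name}: t = {t.statistic:.3f}, p = {t.pvalue:.4f}")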

The hypotheses tested by the omnibus test and the pairwise comparisons are thus logically related: the omnibus null hypothesis is the intersection of all the pairwise null hypotheses. That is, the population means can all be equal only if every pair of means chosen from the set is equal. However, as many readers may know, significance tests computed from sample data frequently produce logically contradictory results. How can this be?

The reason that such contradictory results can occur is that when we make inferences about population parameters (such as population means) using sample data, our estimates are subject to sampling error. Were we dealing with entire finite populations, we could simply compute the mean or other parameter(s) of interest in each population and compare the results. There would be no sampling error, and hence no need for measures of precision of estimation, such as standard errors. Our decisions with regard to the null hypotheses stated above would then be logically consistent: the numbers would all be equal, or else some would differ from others.

Since we generally do not have the luxury of addressing problems where we can identify entire finite populations and measure all values, we are forced to work with samples and to make inferences about the unknown population values. The mean or other parameter values that we compute are estimates of the true unknown values, and these estimates are subject to sampling error. Thus, the means computed from several random samples from populations with the same mean will not generally be equal. We are not able to specify what the value of a sample mean will be even if we know the population value. What we can specify is the distribution of sample means and various related statistics under such circumstances.
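
The behavior of sample means under repeated sampling is easy to demonstrate. In the short Python sketch below (the population mean, standard deviation, and sample size are arbitrary illustration values), the means of repeated samples scatter around the population mean with a standard deviation near the theoretical standard error:

    import numpy as np

    # 1000 samples of size n from one normal population; the sample
    # means vary around mu with standard error sigma / sqrt(n).
    rng = np.random.default_rng(3)
    mu, sigma, n = 50.0, 10.0, 25
    sample_means = rng.normal(loc=mu, scale=sigma, size=(1000, n)).mean(axis=1)
    print(sample_means.std(), sigma / np.sqrt(n))  # both close to 2.0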

Thus, the logic behind the standard F-test in an ANOVA is that if all of the assumptions are met and the null hypothesis is true, the F-values produced by repeated sampling will follow the theoretical central F distribution with the appropriate degrees of freedom. The logic behind the pairwise comparison tests is identical: if the model assumptions are met and the two population means of interest are equal, the t or F statistics produced by repeated sampling will follow the appropriate theoretical central t or F distributions. The important point is that the methodology of statistical inference does not allow us to state what will happen in a particular case, only the distributions of results in repeated random samples. It thus does not preclude the possibility of logically contradictory results. This state of affairs, while disconcerting to many, is simply part of the price we pay when we seek to make inferences based on samples.
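
We can watch this play out by simulating repeated samples under a true null hypothesis and counting how often the omnibus and pairwise conclusions disagree. A sketch, again in Python with scipy, using simple two-sample t tests for the pairwise comparisons and invented settings throughout:

    import numpy as np
    from scipy import stats

    # All three population means are equal, so any rejection is a
    # Type I error; count samples where the omnibus decision and the
    # pairwise decisions point in different directions at alpha = .05.
    rng = np.random.default_rng(1)
    alpha, n, trials = 0.05, 15, 5000
    disagree = 0
    for _ in range(trials):
        g = [rng.normal(size=n) for _ in range(3)]
        omni = stats.f_oneway(*g).pvalue < alpha
        pairs = any(stats.ttest_ind(g[i], g[j]).pvalue < alpha
                    for i, j in ((0, 1), (0, 2), (1, 2)))
        if omni != pairs:
            disagree += 1
    print(f"contradictory conclusions in {disagree / trials:.1%} of samples")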

In the case of a significant omnibus F-statistic and nonsignificant pairwise comparisons, some people have proposed the explanation that while no two means are different, some more complicated contrast among the means is nonzero, leading to the significant omnibus F. Such an explanation mistakes the mechanics of the F-statistic for the hypothesis it tests. That is, while the F-statistic can be constructed as a function of the maximal single degree of freedom contrast computable from the sample data, the hypothesis tested is still that the population means are all equal, and that contrast can be nonzero in the population only if at least one population mean differs from the others.
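
This relationship can be checked numerically. With equal group sizes, the contrast whose coefficients are the deviations of the sample group means from the grand mean reproduces the between-groups sum of squares, so its contrast F (a squared t) divided by k - 1 equals the omnibus F. A sketch under those assumptions, with invented data:

    import numpy as np
    from scipy import stats

    rng = np.random.default_rng(7)
    k, n = 3, 20
    groups = [rng.normal(loc=m, size=n) for m in (0.0, 0.4, 0.8)]

    # Omnibus one-way ANOVA F statistic.
    f_omni = stats.f_oneway(*groups).statistic

    # Maximal single degree of freedom contrast: coefficients equal to
    # (group mean - grand mean); with equal n the grand mean is just the
    # unweighted mean of the group means.
    means = np.array([g.mean() for g in groups])
    c = means - means.mean()
    ss_contrast = (c @ means) ** 2 / (np.sum(c ** 2) / n)  # = SS between
    mse = sum(((g - g.mean()) ** 2).sum() for g in groups) / (k * (n - 1))

    # The contrast F recovers the omnibus F after dividing by k - 1.
    print(f_omni, ss_contrast / mse / (k - 1))  # the two values agree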

To broaden the discussion a bit and reinforce the point, consider a simple two-way crosstabulation or contingency table. The two most popular statistics for testing the null hypothesis of no population association between rows and columns are the Pearson and the likelihood ratio (LR) chi-squared tests. These statistics test the same null hypothesis and follow the same theoretical distribution under that null hypothesis, but they will sometimes yield different conclusions for a given set of sample data. Again, the reason is sampling variability: we can know only the distributions of the test statistics, not the values they will take in particular cases.
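
Both statistics are a one-line computation in scipy. The 2 x 2 table below is invented, and was chosen so that the two tests land on opposite sides of alpha = .05 (the Pearson p-value falls just above .05, the LR p-value below it); the Yates continuity correction is turned off so that the textbook formulas are used:

    import numpy as np
    from scipy import stats

    # A hypothetical 2 x 2 table chosen only for illustration.
    table = np.array([[0, 5],
                      [5, 5]])

    # Pearson chi-squared test of no row-by-column association.
    chi2, p_pearson, dof, expected = stats.chi2_contingency(table, correction=False)
    # Likelihood ratio (G-squared) test of the same null hypothesis.
    g2, p_lr, _, _ = stats.chi2_contingency(
        table, correction=False, lambda_="log-likelihood")

    print(f"Pearson: chi2 = {chi2:.3f}, p = {p_pearson:.4f}")  # p above .05
    print(f"LR:      G2   = {g2:.3f},  p = {p_lr:.4f}")        # p below .05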

What can we do about this? We cannot generally avoid the problem, as we are not usually in a position to identify finite populations of interest and measure all of their members. The best we can do is to understand the true nature of the problem and accept its implications. One implication is that the problem will always be with us in standard situations. The other is that we can minimize it by using larger samples, which provide greater precision and reduce the probability of seeing such results. As sample sizes increase toward infinity, sampling errors converge to zero. Though we cannot achieve infinite sample sizes, the larger our samples, all other things being equal, the firmer our results.
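
The effect of sample size is also easy to simulate. Fixing a set of unequal population means (invented here) and increasing n, both the omnibus and the pairwise tests gain power, and the rate of contradictory conclusions falls; a sketch under the same assumptions as before:

    import numpy as np
    from scipy import stats

    # With truly unequal means, larger samples push both kinds of tests
    # toward rejection, so their conclusions disagree less often.
    rng = np.random.default_rng(2)
    alpha, trials = 0.05, 2000
    for n in (10, 40, 160):
        disagree = 0
        for _ in range(trials):
            g = [rng.normal(loc=m, size=n) for m in (0.0, 0.3, 0.6)]
            omni = stats.f_oneway(*g).pvalue < alpha
            pairs = any(stats.ttest_ind(g[i], g[j]).pvalue < alpha
                        for i, j in ((0, 1), (0, 2), (1, 2)))
            if omni != pairs:
                disagree += 1
        print(f"n = {n:4d}: contradictions in {disagree / trials:.1%} of samples")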