Rater 1 | Rater 2: 1 | Rater 2: 2 | Rater 2: 3 | Total |
1 | p11 | p12 | p13 | p1. |
2 | p21 | p22 | p23 | p2. |
3 | p31 | p32 | p33 | p3. |
Total | p.1 | p.2 | p.3 | 1.0 |
Here pij denotes the proportion of all cases assigned to category i by Rater 1 and category j by Rater 2. (The table elements could as easily be frequencies.) The terms p1., p2., and p3. denote the marginal proportions for Rater 1--i.e., the total proportion of times Rater 1 uses categories 1, 2 and 3, respectively. Similarly, p.1, p.2, and p.3 are the marginal proportions for Rater 2.
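As a small illustration of this notation, the following Python sketch computes the cell proportions and both sets of marginal proportions from a table of frequencies. The 3×3 counts and the function name are hypothetical, chosen only for illustration.

```python
def marginal_proportions(table):
    """Return (cell proportions, Rater 1 marginals, Rater 2 marginals).

    table[i][j] is the number of cases placed in category i by Rater 1
    (rows) and in category j by Rater 2 (columns).
    """
    total = sum(sum(row) for row in table)
    p = [[n / total for n in row] for row in table]
    row_marginals = [sum(row) for row in p]        # p1., p2., ...
    col_marginals = [sum(col) for col in zip(*p)]  # p.1, p.2, ...
    return p, row_marginals, col_marginals


# Hypothetical frequencies for two raters and three categories.
counts = [[20, 10, 5],
          [3, 30, 15],
          [0, 5, 40]]

p, rater1, rater2 = marginal_proportions(counts)
for k, (r1, r2) in enumerate(zip(rater1, rater2), start=1):
    print(f"category {k}: Rater 1 = {r1:.3f}, Rater 2 = {r2:.3f}")
```

Large gaps between a rater's row proportion and the other rater's column proportion for the same category are the differences that tests of marginal homogeneity assess.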
Marginal homogeneity refers to equality (lack of significant difference) between one or more of the row marginal proportions and the corresponding column proportion(s). Testing marginal homogeneity is often useful in analyzing rater agreement. One reason raters disagree is that they differ in their propensity to use each rating category. When such differences are observed, it may be possible to provide feedback or improve instructions to make raters' marginal proportions more similar and improve agreement.
Differences in raters' marginal rates can be formally assessed with statistical tests of marginal homogeneity (Barlow, 1998; Bishop, Fienberg & Holland, 1975, Ch. 8). If each rater rates different cases, testing marginal homogeneity is straightforward: one can compare the marginal frequencies of different raters with a simple chi-squared test. However, this cannot be done when different raters rate the same cases--the usual situation with rater agreement studies. In that case the ratings of different raters are not statistically independent, and this must be accounted for.
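Several statistical approaches to this problem are available. Alternatives include nonparametric tests; bootstrapping; loglinear, association and quasi-symmetry modeling; and latent trait and related models.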
[Figure: Marginal Distributions of Categories for Rater 1 (**) and Rater 2 (==). Paired-bar histogram for categories 1 to 6. x-axis: category number or level; y-axis: proportion of cases, from 0 to 0.304.]
Vertical or horizontal stacked-bar histograms are good ways to summarize
the data. With ordered-category ratings, a related type of figure shows
the cumulative proportion of cases below each rating level for each
rater. An example, again from the MH program, is as follows:
[Figure: Proportion of cases below each level. One horizontal line per rater (Rater 1, Rater 2), with markers at the cumulative proportion of cases below each of rating levels 1 to 6, plotted against a scale from 0 to 1.0.]
These are merely examples. Many other ways to graphically compare
marginal distributions are possible.
 
Nonparametric tests
The main nonparametric test for assessing marginal homogeneity is the McNemar test. The McNemar test assesses marginal
homogeneity in a 2×2 table. Suppose, however, that one has an
N×N crossclassification frequency table that
summarizes ratings by two raters for an N-category rating system.
By collapsing the N×N table into various 2×2
tables, one can use the McNemar test to assess marginal homogeneity of
each rating category. With ordered-category data one can also collapse
the N×N table in other ways to test rater equality
of category thresholds, or test raters for overall bias (i.e., a
tendency to make higher or lower ratings than other raters).
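The collapsing step can be sketched in Python as follows. This is a minimal illustration, not the MH program's implementation: it assumes the uncorrected McNemar chi-squared statistic with 1 df, and the function name and example counts are hypothetical. For each category, the N×N table is collapsed to "this category" versus "any other category", and only the two discordant cells enter the statistic.

```python
import math


def mcnemar_by_category(table):
    """McNemar chi-squared (1 df, no continuity correction) for each category.

    table[i][j] = number of cases rated i by Rater 1 and j by Rater 2.
    Returns a list of (category, chi2, p_value) tuples, categories from 1.
    """
    k = len(table)
    results = []
    for c in range(k):
        row_total = sum(table[c])                       # Rater 1 used c
        col_total = sum(table[i][c] for i in range(k))  # Rater 2 used c
        n12 = row_total - table[c][c]  # Rater 1 chose c, Rater 2 did not
        n21 = col_total - table[c][c]  # Rater 2 chose c, Rater 1 did not
        if n12 + n21 == 0:
            results.append((c + 1, 0.0, 1.0))  # no discordant cases
            continue
        chi2 = (n12 - n21) ** 2 / (n12 + n21)
        # Upper-tail probability of a chi-squared variate with 1 df.
        p_value = math.erfc(math.sqrt(chi2 / 2.0))
        results.append((c + 1, chi2, p_value))
    return results


# Hypothetical frequencies for two raters and three categories.
counts = [[20, 10, 5],
          [3, 30, 15],
          [0, 5, 40]]

results = mcnemar_by_category(counts)
for category, chi2, p_value in results:
    print(f"category {category}: chi2 = {chi2:.3f}, p = {p_value:.4f}")
```

With small numbers of discordant cases, an exact binomial version of the test may be preferable to the chi-squared approximation used here.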
The Stuart-Maxwell test can be used to test marginal homogeneity between two raters across all categories simultaneously. It thus complements McNemar tests of individual categories by providing an overall significance value.
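A minimal Python sketch of the Stuart-Maxwell statistic is given below, assuming the usual quadratic-form definition: the vector of row-minus-column marginal differences for the first N-1 categories, weighted by the inverse of its estimated covariance matrix. The function names and example counts are hypothetical, and the small linear solver is included only to keep the example free of external libraries.

```python
def solve_linear(a, b):
    """Solve a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [[float(v) for v in row] + [float(rhs)] for row, rhs in zip(a, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        if abs(m[pivot][col]) < 1e-12:
            raise ValueError("singular matrix; consider merging sparse categories")
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        tail = sum(m[r][c] * x[c] for c in range(r + 1, n))
        x[r] = (m[r][n] - tail) / m[r][r]
    return x


def stuart_maxwell(table):
    """Return (chi2, df) for the Stuart-Maxwell test on a square table.

    table[i][j] = number of cases rated i by Rater 1 and j by Rater 2.
    """
    k = len(table)
    rows = [sum(table[i]) for i in range(k)]
    cols = [sum(table[i][j] for i in range(k)) for j in range(k)]
    # The K differences sum to zero, so the last category is dropped.
    d = [rows[i] - cols[i] for i in range(k - 1)]
    s = [[rows[i] + cols[i] - 2 * table[i][i] if i == j
          else -(table[i][j] + table[j][i])
          for j in range(k - 1)]
         for i in range(k - 1)]
    x = solve_linear(s, d)  # x = S^{-1} d
    chi2 = sum(di * xi for di, xi in zip(d, x))
    return chi2, k - 1


# Hypothetical frequencies for two raters and three categories.
counts = [[20, 10, 5],
          [3, 30, 15],
          [0, 5, 40]]

chi2, df = stuart_maxwell(counts)
print(f"Stuart-Maxwell chi2 = {chi2:.3f} on {df} df")
```

The statistic is referred to a chi-squared distribution with N-1 degrees of freedom, for example with a table of chi-squared probabilities.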
MH, a computer program for testing marginal homogeneity with these methods, is available online.
These tests are remarkably easy to use and are usually just as effective as more complex methods. Because the tests are nonparametric, they make few or no assumptions about the data. While some of the methods described below are potentially more powerful, this comes at the price of making assumptions which may or may not be true. The simplicity of the nonparametric tests lends persuasiveness to their results.
A mild limitation is that these tests apply only for comparisons of two
raters. With more than two raters, of course, one can apply the tests
for each pair of raters.
 
Bootstrapping
Bootstrap and related jackknife methods (Efron, 1982; Efron &
Tibshirani, 1993) provide a very general and flexible framework for
testing marginal homogeneity. Again, suppose one has an
N×N crossclassification frequency table summarizing
agreement between two raters on an N-category rating. Using what
is termed the nonparametric bootstrap, one would repeatedly
sample from this table to produce a large number (e.g., 500) of
pseudo-tables, each with the same total frequency as the original table.
Various measures of marginal homogeneity would be calculated for each pseudo-table; for example, one might calculate the difference between the row marginal proportion and the column marginal proportion for each category, or construct an overall measure of row vs. column marginal differences.
Let d* denote such a measure calculated for a given pseudo-table, and let d denote the same measure calculated for the original table. From the pseudo-tables, one can empirically calculate the standard deviation of d*, or sd*. Let d' denote the true population value of d. Assuming that d' = 0 corresponds to the null hypothesis of marginal homogeneity, one can test this null hypothesis by calculating the z value:
$$ z = \frac{d}{\mathrm{sd}^{*}} $$
and determining the significance of the standard normal deviate z by usual methods (e.g., a table of z value probabilities).
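The procedure just described can be sketched in Python as follows. This is one possible variation, under stated assumptions: the measure d is taken to be the difference between the row and column marginal proportions for a single category, pseudo-tables are drawn by resampling cases with replacement, and the function name, seed, and example counts are hypothetical.

```python
import math
import random
import statistics


def bootstrap_marginal_test(table, category, n_boot=500, seed=0):
    """Bootstrap z test of d = p(category, Rater 1) - p(category, Rater 2).

    table[i][j] = number of cases rated i by Rater 1 and j by Rater 2;
    category is a zero-based index. Returns (d, sd_star, z, p_value).
    """
    k = len(table)
    cells = [(i, j) for i in range(k) for j in range(k)]
    weights = [table[i][j] for i, j in cells]
    n = sum(weights)

    def diff(cell_counts):
        row = sum(c for (i, _), c in cell_counts.items() if i == category)
        col = sum(c for (_, j), c in cell_counts.items() if j == category)
        return (row - col) / n

    d = diff(dict(zip(cells, weights)))

    rng = random.Random(seed)
    d_stars = []
    for _ in range(n_boot):
        # Each pseudo-table has the same total frequency n as the original.
        pseudo = dict.fromkeys(cells, 0)
        for cell in rng.choices(cells, weights=weights, k=n):
            pseudo[cell] += 1
        d_stars.append(diff(pseudo))

    sd_star = statistics.stdev(d_stars)
    if sd_star == 0:
        raise ValueError("d* did not vary across pseudo-tables")
    z = d / sd_star
    p_value = math.erfc(abs(z) / math.sqrt(2.0))  # two-sided normal p-value
    return d, sd_star, z, p_value


# Hypothetical frequencies for two raters and three categories.
counts = [[20, 10, 5],
          [3, 30, 15],
          [0, 5, 40]]

d, sd_star, z, p_value = bootstrap_marginal_test(counts, category=0)
print(f"d = {d:.4f}, sd* = {sd_star:.4f}, z = {z:.2f}, p = {p_value:.4f}")
```

Fixing the seed makes the result reproducible; increasing n_boot reduces the Monte Carlo error in sd*.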
The method above is merely an example. Many variations are possible within the framework of bootstrap and jackknife methods.
An advantage of bootstrap and jackknife methods is their flexibility. For example, one could potentially adapt them for simultaneous comparisons among more than two raters.
A potential disadvantage of these methods is that the user may need to
write a computer program to apply them. However, such a program could
also be used for other purposes, such as providing bootstrap
significance tests and/or confidence intervals for various
raw agreement indices.
 
Loglinear, association and quasi-symmetry modeling
If one is using a loglinear, association or
quasi-symmetry model to analyze agreement data, one can adapt the
model to test marginal homogeneity.
For each type of model the basic approach is the same. First one estimates a general form of the model--that is, one without assuming marginal homogeneity; let this be termed the "unrestricted model." Next one adds the assumption of marginal homogeneity to the model. This is done by applying equality restrictions to some model parameters so as to require homogeneity of one or more marginal probabilities (Barlow, 1998). Let this be termed the "restricted model."
Marginal homogeneity can then be tested using the difference G2 statistic, calculated as:
$$ G^2_{\mathrm{difference}} = G^2(\mathrm{restricted}) - G^2(\mathrm{unrestricted}) $$
where G2(restricted) and G2(unrestricted) are the likelihood-ratio chi-squared model fit statistics (Bishop, Fienberg & Holland, 1975) calculated for the restricted and unrestricted models.
The difference G2 can be interpreted as a chi-squared value and its significance determined from a table of chi-squared probabilities. The df are equal to the difference in df for the unrestricted and restricted models. A significant value implies that the rater marginal probabilities are not homogeneous.
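The arithmetic of the difference test can be sketched in Python as below. The sketch assumes the two models have already been fitted with other software, so the G2 values and df passed in are hypothetical inputs; the function names are also illustrative. The chi-squared tail probability is computed from a standard recurrence for integer df so that no external library is needed.

```python
import math


def chi2_sf(x, df):
    """Upper-tail probability of a chi-squared variate with integer df >= 1."""
    if x <= 0:
        return 1.0
    y = x / 2.0
    if df % 2 == 0:
        a, total = 1.0, math.exp(-y)             # Q(1, y)
    else:
        a, total = 0.5, math.erfc(math.sqrt(y))  # Q(1/2, y)
    # Recurrence: Q(a + 1, y) = Q(a, y) + y**a * exp(-y) / Gamma(a + 1).
    while a < df / 2.0:
        total += y ** a * math.exp(-y) / math.gamma(a + 1.0)
        a += 1.0
    return total


def difference_g2_test(g2_restricted, df_restricted,
                       g2_unrestricted, df_unrestricted):
    """Return (difference G2, df, p_value) for two nested models."""
    diff = g2_restricted - g2_unrestricted
    df = df_restricted - df_unrestricted
    if df <= 0:
        raise ValueError("restricted model must have more df than unrestricted")
    return diff, df, chi2_sf(diff, df)


# Hypothetical fit statistics from two previously fitted models.
diff, df, p_value = difference_g2_test(15.2, 6, 3.1, 4)
print(f"difference G2 = {diff:.2f} on {df} df, p = {p_value:.4f}")
```

The restricted model has the larger df because each equality restriction removes a free parameter.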
An advantage of this approach is that one can test marginal homogeneity for one category, several categories, or all categories using a unified approach. Another is that, if one is already analyzing the data with a loglinear, association, or quasi-symmetry model, the addition of marginal homogeneity tests may require relatively little extra work.
A possible limitation is that loglinear, association, and quasi-symmetry models are only well-developed for analysis of two-way tables. Another is that use of the difference G2 test typically requires that the unrestricted model fit the data, which sometimes might not be the case.
For an excellent discussion of these and related models (including
linear-by-linear models), see Agresti (2002).
 
Latent trait and related models
Latent trait models and related methods such as
the tetrachoric and polychoric correlation
coefficients can be used to test marginal homogeneity for
dichotomous or ordered-category ratings. The general strategy using
these methods is similar to that described for loglinear and related
models. That is, one estimates both an unrestricted version of the
model and a restricted version that assumes marginal homogeneity, and
compares the two models with a difference G2 test.
With latent trait and related models, the restricted models are usually constructed by assuming that the thresholds for one or more rating levels are equal across raters.
A variation of this method tests overall rater bias. That is done by estimating a restricted model in which the thresholds of one rater are equal to those of another plus a fixed constant. A comparison of this restricted model with the corresponding unrestricted model tests the hypothesis that the fixed constant, which corresponds to bias of a rater, is 0.
Another way to test marginal homogeneity using latent trait models is with the asymptotic standard errors of estimated category thresholds. These can be used to estimate the standard error of the difference between the thresholds of two raters for a given category, and this standard error used to test the significance of the observed difference.
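A sketch of this standard-error approach in Python follows. It assumes that threshold estimates and their asymptotic standard errors have been obtained from latent trait software; the numeric values and the function name are hypothetical. Because both raters rate the same cases, the two estimates may be correlated, so the sketch accepts their covariance as an optional argument. Treating it as zero is an assumption, not a recommendation.

```python
import math


def threshold_difference_test(t1, se1, t2, se2, cov=0.0):
    """z test of the difference between two raters' thresholds.

    t1, t2   : estimated thresholds for the same category, one per rater
    se1, se2 : their asymptotic standard errors
    cov      : estimated covariance of the two estimates (assumed 0 if unknown)
    Returns (difference, se_difference, z, two-sided p_value).
    """
    variance = se1 ** 2 + se2 ** 2 - 2.0 * cov
    if variance <= 0:
        raise ValueError("variance of the difference must be positive")
    diff = t1 - t2
    se_diff = math.sqrt(variance)
    z = diff / se_diff
    p_value = math.erfc(abs(z) / math.sqrt(2.0))
    return diff, se_diff, z, p_value


# Hypothetical threshold estimates for one category and two raters.
diff, se_diff, z, p_value = threshold_difference_test(0.50, 0.12, 0.20, 0.16)
print(f"difference = {diff:.2f}, SE = {se_diff:.2f}, z = {z:.2f}, p = {p_value:.3f}")
```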
An advantage of the latent trait approach is that it can be used to assess marginal homogeneity among any number of raters simultaneously. A disadvantage is that these methods require more computation than nonparametric tests. If one is only interested in testing marginal homogeneity, the nonparametric methods might be a better choice. If one is already using latent trait models for other reasons, such as to estimate accuracy of individual raters or to estimate the correlation of their ratings, one might also use them to examine marginal homogeneity. Even then, the nonparametric tests of marginal homogeneity might be simpler to use.
If there are many raters and categories, data may be sparse
(i.e., many possible patterns of ratings across raters with 0 observed
frequencies). With very sparse data, the difference
G2 statistic is no longer distributed as chi-squared,
so that standard methods cannot be used to determine its statistical
significance.
 
References
Agresti A. Categorical data analysis. New York: Wiley, 2002.
Barlow W. Modeling of categorical agreement. In: Armitage P, Colton T, eds. The encyclopedia of biostatistics, pp. 541-545. New York: Wiley, 1998.
Bishop YMM, Fienberg SE, Holland PW. Discrete multivariate analysis: theory and practice. Cambridge, Massachusetts: MIT Press, 1975.
Efron B. The jackknife, the bootstrap and other resampling plans. Philadelphia: Society for Industrial and Applied Mathematics, 1982.
Efron B, Tibshirani RJ. An introduction to the bootstrap. New York: Chapman and Hall, 1993.
Last updated: 31 August 2006 (added reference, counter)