                      Rater 2
               1       2       3
Rater 1   1  p_{11}  p_{12}  p_{13}  p_{1.}
          2  p_{21}  p_{22}  p_{23}  p_{2.}
          3  p_{31}  p_{32}  p_{33}  p_{3.}
             p_{.1}  p_{.2}  p_{.3}  1.0
Here p_{ij} denotes the proportion of all cases assigned to category i by Rater 1 and category j by Rater 2. (The table elements could as easily be frequencies.) The terms p_{1.}, p_{2.}, and p_{3.} denote the marginal proportions for Rater 1, i.e., the total proportion of times Rater 1 uses categories 1, 2 and 3, respectively. Similarly, p_{.1}, p_{.2}, and p_{.3} are the marginal proportions for Rater 2.
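As a small illustration of the notation, the sketch below computes both sets of marginal proportions from a table of counts. The function name and the counts are hypothetical and serve only as an example.

```python
def marginal_proportions(table):
    """Return (row_marginals, column_marginals) as proportions.

    table[i][j] is the count (or proportion) of cases placed in
    category i+1 by Rater 1 and category j+1 by Rater 2.
    """
    total = sum(sum(row) for row in table)
    rows = [sum(row) / total for row in table]
    cols = [sum(row[j] for row in table) / total for j in range(len(table))]
    return rows, cols


# Hypothetical counts for 100 cases (illustrative only).
counts = [
    [20, 5, 0],
    [10, 30, 5],
    [0, 10, 20],
]

rater1, rater2 = marginal_proportions(counts)
print("Rater 1:", rater1)  # p_{1.}, p_{2.}, p_{3.}
print("Rater 2:", rater2)  # p_{.1}, p_{.2}, p_{.3}
```

In this made-up table Rater 1 uses category 1 less often than Rater 2 does, which is the kind of difference the tests discussed on this page are meant to assess.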
Marginal homogeneity refers to equality (lack of significant difference) between one or more of the row marginal proportions and the corresponding column proportion(s). Testing marginal homogeneity is often useful in analyzing rater agreement. One reason raters disagree is that they have different propensities to use each rating category. When such differences are observed, it may be possible to provide feedback or improve instructions to make raters' marginal proportions more similar and improve agreement.
Differences in raters' marginal rates can be formally assessed with statistical tests of marginal homogeneity (Barlow, 1998; Bishop, Fienberg & Holland, 1975, Ch. 8). If each rater rates different cases, testing marginal homogeneity is straightforward: one can compare the marginal frequencies of different raters with a simple chi-squared test. However, this cannot be done when different raters rate the same cases, which is the usual situation in rater agreement studies; then the ratings of different raters are not statistically independent and this must be accounted for.
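For the simple case in which each rater rates different cases, the comparison can be sketched as an ordinary Pearson chi-squared test on a 2 x K table of category frequencies. The function name and the frequencies below are hypothetical; the code returns only the statistic and its df, which can then be looked up in a table of chi-squared probabilities.

```python
def chi_squared_independent(freq_a, freq_b):
    """Pearson chi-squared statistic for a 2 x K table of category counts.

    freq_a and freq_b are the category frequencies of two raters who
    rated DIFFERENT (independent) sets of cases. Returns (statistic, df).
    """
    if len(freq_a) != len(freq_b):
        raise ValueError("both raters must use the same categories")
    n_a, n_b = sum(freq_a), sum(freq_b)
    n = n_a + n_b
    statistic = 0.0
    for a, b in zip(freq_a, freq_b):
        column_total = a + b
        if column_total == 0:
            raise ValueError("a category unused by both raters should be dropped")
        for observed, row_total in ((a, n_a), (b, n_b)):
            expected = row_total * column_total / n
            statistic += (observed - expected) ** 2 / expected
    return statistic, len(freq_a) - 1


# Hypothetical frequencies: each rater rated a separate sample of 100 cases.
stat, df = chi_squared_independent([30, 50, 20], [20, 50, 30])
print(stat, df)
```

This test is not valid when both raters rate the same cases, for the reason given above.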
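Several statistical approaches to this problem are available. Alternatives include nonparametric tests, bootstrap and jackknife methods, loglinear, association and quasi-symmetry models, and latent trait and related models.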

[Figure: Marginal Distributions of Categories for Rater 1 (**) and Rater 2 (==). Paired bars for categories 1 to 6. Notes: x-axis is category number or level; y-axis is proportion of cases, from 0 to 0.304.]
Vertical or horizontal stacked-bar histograms are good ways to summarize
the data. With ordered-category ratings, a related type of figure shows
the cumulative proportion of cases below each rating level for each
rater. An example, again from the MH program, is as follows:
[Figure: Proportion of cases below each level. One horizontal bar per rater (Rater 1, Rater 2), divided at levels 1 to 6, plotted against a scale from 0 to 1.0.]
These are merely examples. Many other ways to graphically compare
marginal distributions are possible.
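The quantities plotted in a cumulative figure of this kind can be computed directly from each rater's marginal proportions. The following is a minimal sketch; the function name and the proportions are hypothetical and are not taken from the MH program.

```python
from itertools import accumulate


def cumulative_below(marginals):
    """Proportion of cases rated below each level.

    marginals[k] is the proportion of cases a rater placed in level k+1;
    the result has one entry per level, starting at 0.0 for the lowest.
    """
    running = list(accumulate(marginals))
    return [0.0] + running[:-1]


# Hypothetical marginal proportions for two raters on a 3-level scale.
rater1_below = cumulative_below([0.25, 0.45, 0.30])
rater2_below = cumulative_below([0.30, 0.45, 0.25])
for level, (a, b) in enumerate(zip(rater1_below, rater2_below), start=1):
    print(f"level {level}: Rater 1 {a:.2f}   Rater 2 {b:.2f}")
```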
The Stuart-Maxwell test can be used to test marginal homogeneity between two raters across all categories simultaneously. It thus complements McNemar tests of individual categories by providing an overall significance value.
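Both tests can be sketched in a few lines. The code below uses the standard textbook forms of the statistics (McNemar without continuity correction; Stuart-Maxwell as a quadratic form in the first K-1 marginal differences) and is not necessarily what the MH program computes. The function names and the counts are hypothetical. Each function returns the chi-squared statistic and its df.

```python
def solve_linear(a, b):
    """Solve a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        if abs(m[pivot][col]) < 1e-12:
            raise ValueError("singular matrix: statistic is undefined")
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for i in range(n - 1, -1, -1):
        tail = sum(m[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (m[i][n] - tail) / m[i][i]
    return x


def mcnemar(table, category):
    """McNemar chi-squared (no continuity correction) for one category."""
    k = len(table)
    # b: Rater 1 used the category, Rater 2 did not; c: the reverse.
    b = sum(table[category][j] for j in range(k) if j != category)
    c = sum(table[i][category] for i in range(k) if i != category)
    if b + c == 0:
        raise ValueError("no discordant cases for this category")
    return (b - c) ** 2 / (b + c), 1


def stuart_maxwell(table):
    """Stuart-Maxwell chi-squared for overall marginal homogeneity."""
    k = len(table)
    row = [sum(table[i]) for i in range(k)]
    col = [sum(table[i][j] for i in range(k)) for j in range(k)]
    # Differences in marginal frequencies for the first K-1 categories.
    d = [row[i] - col[i] for i in range(k - 1)]
    # Estimated covariance matrix of d under the null hypothesis.
    v = [[row[i] + col[i] - 2 * table[i][i] if i == j
          else -(table[i][j] + table[j][i])
          for j in range(k - 1)] for i in range(k - 1)]
    x = solve_linear(v, d)
    return sum(di * xi for di, xi in zip(d, x)), k - 1


# Hypothetical counts: rows are Rater 1, columns are Rater 2.
counts = [
    [20, 5, 0],
    [10, 30, 5],
    [0, 10, 20],
]
sm_stat, sm_df = stuart_maxwell(counts)
mc_stat, mc_df = mcnemar(counts, category=0)
print(f"Stuart-Maxwell: {sm_stat:.3f} on {sm_df} df")
print(f"McNemar, category 1: {mc_stat:.3f} on {mc_df} df")
```

The statistics can then be referred to a table of chi-squared probabilities with the stated df.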
MH, a computer program for testing marginal homogeneity with these methods, is available online.
These tests are remarkably easy to use and are usually just as effective as more complex methods. Because the tests are nonparametric, they make few or no assumptions about the data. While some of the methods described below are potentially more powerful, this comes at the price of making assumptions which may or may not be true. The simplicity of the nonparametric tests lends persuasiveness to their results.
A mild limitation is that these tests apply only to comparisons of two
raters. With more than two raters, of course, one can apply the tests
to each pair of raters.
Various measures of marginal homogeneity would be calculated for each pseudo-table; for example, one might calculate the difference between the row marginal proportion and the column marginal proportion for each category, or construct an overall measure of row vs. column marginal differences.
Let d^{*} denote such a measure calculated for a given pseudo-table, and let d denote the same measure calculated for the original table. From the pseudo-tables, one can empirically calculate the standard deviation of d^{*}, or s_{d*}. Let d' denote the true population value of d. Assuming that d' = 0 corresponds to the null hypothesis of marginal homogeneity, one can test this null hypothesis by calculating the z value:

    z = d / s_{d*}

and determining the significance of the standard normal deviate z by usual methods (e.g., a table of z value probabilities).
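A minimal bootstrap sketch of this calculation follows. It assumes that each pseudo-table is formed by resampling N cases with replacement from the original table of counts, and it takes d to be the difference between the row and column marginal proportions for a single category. The function name, the number of pseudo-tables, the seed and the counts are all hypothetical choices made for illustration.

```python
import random
import statistics


def bootstrap_z(table, category, n_boot=2000, seed=0):
    """Bootstrap z test of marginal homogeneity for one category.

    Returns (z, d, s) where d = p_{c.} - p_{.c} in the original table and
    s is the standard deviation of d* over the bootstrap pseudo-tables.
    """
    rng = random.Random(seed)
    k = len(table)
    cells = [(i, j) for i in range(k) for j in range(k)]
    weights = [table[i][j] for i, j in cells]
    n = sum(weights)

    def measure(pairs):
        # Row marginal proportion minus column marginal proportion.
        in_row = sum(1 for i, _ in pairs if i == category)
        in_col = sum(1 for _, j in pairs if j == category)
        return (in_row - in_col) / len(pairs)

    # d for the original table, computed from its margins.
    d = (sum(table[category]) - sum(row[category] for row in table)) / n

    # Each pseudo-table: n cases drawn with replacement, with cell
    # probabilities proportional to the observed counts.
    d_star = [measure(rng.choices(cells, weights=weights, k=n))
              for _ in range(n_boot)]
    s = statistics.stdev(d_star)
    return d / s, d, s


# Hypothetical counts: rows are Rater 1, columns are Rater 2.
counts = [
    [20, 5, 0],
    [10, 30, 5],
    [0, 10, 20],
]
z, d, s = bootstrap_z(counts, category=0)
print(f"d = {d:.3f}, s_d* = {s:.4f}, z = {z:.2f}")
```

The same resampling loop could be reused with a different `measure` function, for example an overall index of row versus column differences.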
The method above is merely an example. Many variations are possible within the framework of bootstrap and jackknife methods.
An advantage of bootstrap and jackknife methods is their flexibility. For example, one could potentially adapt them for simultaneous comparisons among more than two raters.
A potential disadvantage of these methods is that the user may need to
write a computer program to apply them. However, such a program could
also be used for other purposes, such as providing bootstrap
significance tests and/or confidence intervals for various
raw agreement indices.
For each type of model the basic approach is the same. First one estimates a general form of the model, that is, one without assuming marginal homogeneity; let this be termed the "unrestricted model." Next one adds the assumption of marginal homogeneity to the model. This is done by applying equality restrictions to some model parameters so as to require homogeneity of one or more marginal probabilities (Barlow, 1998). Let this be termed the "restricted model."
Marginal homogeneity can then be tested using the difference G^{2} statistic, calculated as:

    difference G^{2} = G^{2}(restricted) - G^{2}(unrestricted)

where G^{2}(restricted) and G^{2}(unrestricted) are the likelihood-ratio chi-squared model fit statistics (Bishop, Fienberg & Holland, 1975) calculated for the restricted and unrestricted models.
The difference G^{2} can be interpreted as a chi-squared value and its significance determined from a table of chi-squared probabilities. The df are equal to the difference in df for the unrestricted and restricted models. A significant value implies that the rater marginal probabilities are not homogeneous.
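The arithmetic of the test can be sketched as follows. The code does not fit any model: it assumes the expected frequencies, or the G^{2} values and df, have already been obtained from model-fitting software. The function names and all numbers are hypothetical.

```python
import math


def g2(observed, expected):
    """Likelihood-ratio chi-squared: 2 * sum(obs * ln(obs / exp)).

    observed and expected are flat lists of cell frequencies; cells with
    an observed frequency of 0 contribute nothing to the sum.
    """
    return 2.0 * sum(o * math.log(o / e)
                     for o, e in zip(observed, expected) if o > 0)


def difference_g2(g2_restricted, df_restricted, g2_unrestricted, df_unrestricted):
    """Return (difference G^2, df) for a restricted vs. unrestricted model."""
    df = df_restricted - df_unrestricted
    if df <= 0:
        raise ValueError("the restricted model must have more df")
    return g2_restricted - g2_unrestricted, df


# Hypothetical fit statistics reported by some model-fitting program.
diff, df = difference_g2(g2_restricted=9.84, df_restricted=3,
                         g2_unrestricted=2.10, df_unrestricted=1)
print(f"difference G^2 = {diff:.2f} on {df} df")
```

With these made-up values the difference is 7.74 on 2 df, which exceeds the .05 critical value of 5.99 for 2 df, so marginal homogeneity would be rejected at that level.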
An advantage of this approach is that one can test marginal homogeneity for one category, several categories, or all categories using a unified approach. Another is that, if one is already analyzing the data with a loglinear, association, or quasi-symmetry model, the addition of marginal homogeneity tests may require relatively little extra work.
A possible limitation is that loglinear, association, and quasi-symmetry models are only well-developed for analysis of two-way tables. Another is that use of the difference G^{2} test typically requires that the unrestricted model fit the data, which sometimes might not be the case.
For an excellent discussion of these and related models (including
linear-by-linear models), see Agresti (2002).
With latent trait and related models, the restricted models are usually constructed by assuming that the thresholds for one or more rating levels are equal across raters.
A variation of this method tests overall rater bias. That is done by estimating a restricted model in which the thresholds of one rater are equal to those of another plus a fixed constant. A comparison of this restricted model with the corresponding unrestricted model tests the hypothesis that the fixed constant, which corresponds to bias of a rater, is 0.
Another way to test marginal homogeneity using latent trait models is with the asymptotic standard errors of estimated category thresholds. These can be used to estimate the standard error of the difference between the thresholds of two raters for a given category, and this standard error used to test the significance of the observed difference.
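A minimal sketch of this calculation follows. It assumes the threshold estimates and their asymptotic standard errors have already been obtained from a fitted latent trait model, and it treats the two estimates as uncorrelated unless a covariance is supplied. The function name and the numbers are hypothetical.

```python
import math


def threshold_difference_z(t1, se1, t2, se2, cov=0.0):
    """z test for the difference between two raters' thresholds.

    t1, t2   : estimated thresholds for the same category.
    se1, se2 : their asymptotic standard errors.
    cov      : covariance of the two estimates (assumed 0 if unknown).
    Returns (z, two_sided_p) based on the standard normal distribution.
    """
    variance = se1 ** 2 + se2 ** 2 - 2.0 * cov
    if variance <= 0:
        raise ValueError("variance of the difference must be positive")
    z = (t1 - t2) / math.sqrt(variance)
    p = math.erfc(abs(z) / math.sqrt(2.0))
    return z, p


# Hypothetical estimates for one category threshold.
z, p = threshold_difference_z(t1=0.40, se1=0.12, t2=0.10, se2=0.16)
print(f"z = {z:.2f}, two-sided p = {p:.3f}")
```

If the software reports the covariance of the two estimates, passing it as `cov` gives a more accurate standard error than the independence assumption used by default here.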
An advantage of the latent trait approach is that it can be used to assess marginal homogeneity among any number of raters simultaneously. A disadvantage is that these methods require more computation than nonparametric tests. If one is only interested in testing marginal homogeneity, the nonparametric methods might be a better choice. If one is already using latent trait models for other reasons, such as to estimate accuracy of individual raters or to estimate the correlation of their ratings, one might also use them to examine marginal homogeneity, although even then the nonparametric tests may be simpler to apply.
If there are many raters and categories, data may be sparse
(i.e., many possible patterns of ratings across raters with 0 observed
frequencies). With very sparse data, the difference
G^{2} statistic is no longer distributed as chi-squared,
so that standard methods cannot be used to determine its statistical
significance.
Agresti A. Categorical data analysis. New York: Wiley, 2002.
Barlow W. Modeling of categorical agreement. The encyclopedia of biostatistics, P. Armitage, T. Colton, eds., pp. 541-545. New York: Wiley, 1998.
Bishop YMM, Fienberg SE, Holland PW. Discrete multivariate analysis: theory and practice. Cambridge, Massachusetts: MIT Press, 1975.
Efron B. The jackknife, the bootstrap and other resampling plans. Philadelphia: Society for Industrial and Applied Mathematics, 1982.
Efron B, Tibshirani RJ. An introduction to the bootstrap. New York: Chapman and Hall, 1993.
Last updated: 31 August 2006 (added reference, counter)