Note. This program is almost ridiculously easy to use. It runs in a Command Prompt (DOS) window under Windows 95/98/NT/2000/XP. If you're not familiar with using the Command Prompt window, check my Command Prompt Quick and Handy Guide.
MH can be run in several ways. The easiest is just to navigate within Windows to the folder where the program (mh.exe) resides, and click its icon.
Some users may prefer to open a Command Prompt window themselves, navigate to the folder where mh.exe resides, and at the prompt type:
mh
and press the Enter key.
The program will prompt for the name of the input file. Supply the input file name, and include a path if the file is in a different folder, for example:
If you press Enter without supplying a file name, the default input file name of input.txt will be assumed.
If the input file is in a different folder than mh.exe, then unless you supply the path with the file name as shown above, you will get an error message and the program will not run.
Because the only program file is mh.exe (i.e., there are no additional profile, configuration or library files) you can copy this small file and place it in any folder, have several versions on your machine, etc. (Hint: if you place a copy in the same folder as your data files, then there is no need to type path names.)
There are also several clever ways of sending filenames to your Command Prompt window without typing anything. See the Quick and Handy Command Prompt guide for details.
Next supply an output file name, with optional path, in response to the program prompt. If no output file name is specified, a default file name of output.txt will be assumed.
The input and output file names may each be at most 60 characters long, including the path specification, if supplied. (If you seem to have supplied the file/path names correctly but the program does not appear to run, try limiting the file name to 7 characters or fewer and the file extension to 3 characters or fewer, and use no path.)
The program will then read and process the input file and write results to the output file.
(Note: If by any chance you have a pre-Pentium-era machine without a math co-processor,
this version of MH will not run. Please contact the author to obtain a
3 Input File
The input file contains five command lines plus the data to be analyzed.
All command lines must be present, even if some are left blank. The
five command lines are as follows:
ord in the first three columns. If the categories are purely nominal (i.e., unordered), specify
nom in the first three columns.
Important! Be sure to hit the Enter key after entering the last number. This will ensure placement of an ASCII end-of-record mark on the last line. (Many editors do this automatically, but others, including Notepad, do not.) Otherwise data may not be read completely, producing a Fortran error message. To be really safe, you can add an extra line, with just a blank character or two, following your data.
An example input file is as follows:
Classification of 113 screening mammograms (Source: Barlow, 1998)
75 1 3 1 0
1 1 0 0 1
5 2 4 0 1
0 0 2 1 3
0 0 0 0 12
In constructing the input file, verify that the row and column variables
are correctly labeled. Note that with, for example, rater agreement
data, some sources make Rater 1 the row variable, whereas others make
Rater 1 the column variable.
The largest table MH will analyze is 50×50. If this is insufficient, please contact the author.
MH requires a square data table. That is consistent with the premise of
testing marginal homogeneity--i.e., that the row and column categories
are exactly the same. If a row or column has all 0 frequencies, include
the row or column as required to maintain a square table. However, this
should be done only if the row or column could potentially have
had non-zero frequencies.
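The table requirements above (square, at most 50×50) can be checked before running MH. The following is a minimal Python sketch, not part of MH itself; the function name check_table and the lower limit of two categories are assumptions made for illustration. It uses the example table from this section.

```python
MAX_CATEGORIES = 50  # largest table size stated in this section


def check_table(table):
    """Return the number of categories N, or raise ValueError if the table is unusable."""
    n = len(table)
    # Assumption: at least two categories are needed for any test to make sense.
    if n < 2 or n > MAX_CATEGORIES:
        raise ValueError(f"table must have 2 to {MAX_CATEGORIES} categories, got {n}")
    for i, row in enumerate(table, start=1):
        if len(row) != n:
            raise ValueError(f"row {i} has {len(row)} entries; expected {n} (square table)")
        if any(f < 0 or f != int(f) for f in row):
            raise ValueError(f"row {i} contains a negative or non-integer frequency")
    return n


# Example data table from this section (113 screening mammograms).
example = [
    [75, 1, 3, 1, 0],
    [1, 1, 0, 0, 1],
    [5, 2, 4, 0, 1],
    [0, 0, 2, 1, 3],
    [0, 0, 0, 0, 12],
]
total_cases = sum(map(sum, example))
print(check_table(example), "categories,", total_cases, "cases")
```

A row or column of zeros passes this check, which matches the advice above to keep such categories when they could potentially have been used.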
4 Tests Performed
MH always performs the following tests:
+--------------------------------------------+
|                                            |
|                  Table 1                   |
|                                            |
|                    Rater 2                 |
|                   -       +                |
|               +-------+-------+            |
|            -  |   a   |   b   |  a + b     |
|  Rater 1      +-------+-------+            |
|            +  |   c   |   d   |  c + d     |
|               +-------+-------+            |
|                 a + c   b + d    total     |
|                                            |
+--------------------------------------------+
The MH program calculates the McNemar statistic as
The value X² can be viewed as a chi-squared statistic with 1 df. A significant value (e.g., p < .05) implies that the marginal rates significantly differ between the rows and columns. The chi-squared test is inherently two-tailed. In theory, one could adapt the method to perform a one-tailed McNemar test.
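As an illustration, the sketch below computes the McNemar statistic in its common uncorrected form, (b - c)² / (b + c), with b and c the off-diagonal cells of Table 1, and converts it to a two-tailed p value with 1 df. That MH uses exactly this form (with no continuity correction) is an assumption here, and the function name mcnemar_chi2 is hypothetical.

```python
import math


def mcnemar_chi2(b, c):
    """Uncorrected McNemar statistic and its two-tailed p value (1 df).

    Assumption: no continuity correction is applied.
    """
    if b + c == 0:
        raise ValueError("b + c must be positive")
    x2 = (b - c) ** 2 / (b + c)
    # For a chi-squared variable with 1 df, P(X > x2) = erfc(sqrt(x2 / 2)).
    p = math.erfc(math.sqrt(x2 / 2))
    return x2, p


x2, p = mcnemar_chi2(15, 5)
print(f"chi-squared = {x2:.3f}, p = {p:.4f}")
```

Only the discordant cells b and c enter the calculation; the diagonal cells a and d do not affect the result.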
If (b + c) < 10, a two-tailed exact test, based on the
cumulative binomial distribution, is performed instead of calculating
4.2 McNemar tests for each category
MH first tests marginal homogeneity separately for each category. For
each of these tests the N×N table is collapsed to
form a 2×2 table. Specifically, for each rating category k
(k = 1, ..., N), all categories other than k are
combined, producing a 2×2 table for the k vs. not-k
distinction. The McNemar test is then performed on this table.
N such tests are performed.
Of these, N - 1 are independent. To
account for the multiple tests, one may wish to adjust (decrease) the
p value required for statistical significance. The MH program
reports a Bonferroni-adjusted significance level, calculated as
.05/(N - 1). However, the user may instead wish
to use a less conservative adjustment, or no adjustment.
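The collapsing step and the Bonferroni-adjusted level can be sketched as follows. This is an illustrative Python version, not MH's own code. The helper names are hypothetical, treating category k as the "+" level of Table 1 is an assumption (the test result does not depend on it), and the switch to a two-tailed exact binomial test when b + c < 10 follows the rule given in the McNemar section above.

```python
import math


def collapse_category(table, k):
    """Collapse an N x N table to (a, b, c, d) for category k (0-based) vs. all others.

    Assumption: category k plays the role of '+' in Table 1.
    """
    total = sum(map(sum, table))
    d = table[k][k]
    c = sum(table[k]) - d                 # row is k, column is not k
    b = sum(row[k] for row in table) - d  # row is not k, column is k
    a = total - b - c - d
    return a, b, c, d


def mcnemar_p(b, c):
    """Two-tailed p: exact binomial if b + c < 10, else chi-squared with 1 df."""
    n = b + c
    if n == 0:
        return 1.0  # no discordant cases, so no evidence of a difference
    if n < 10:
        tail = sum(math.comb(n, i) for i in range(min(b, c) + 1)) / 2 ** n
        return min(1.0, 2 * tail)
    return math.erfc(math.sqrt((b - c) ** 2 / n / 2))


table = [
    [75, 1, 3, 1, 0],
    [1, 1, 0, 0, 1],
    [5, 2, 4, 0, 1],
    [0, 0, 2, 1, 3],
    [0, 0, 0, 0, 12],
]
alpha = 0.05 / (len(table) - 1)  # Bonferroni-adjusted level, .05/(N - 1)
for k in range(len(table)):
    a, b, c, d = collapse_category(table, k)
    p = mcnemar_p(b, c)
    print(f"category {k + 1}: a={a} b={b} c={c} d={d} p={p:.4f} significant={p < alpha}")
```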
4.3 Bhapkar and Stuart-Maxwell tests
As overall tests of marginal homogeneity (i.e., across all categories
simultaneously) MH performs the Bhapkar test (Bhapkar, 1966) and
the Stuart-Maxwell test (Stuart, 1955; Maxwell, 1970; Everitt, 1977; for more details
The Stuart-Maxwell statistic is interpreted as a chi-squared value. The df are ordinarily N - 1, where N is the number of categories. If, for any category k, all frequencies in Row k and Column k are 0, except possibly for the main diagonal element (e.g., for agreement data, if there is perfect agreement for category k or the category is never used), then the category is not included in the test. The df for the test then could be considered to be N - m - 1, where m is the number of categories dropped from the test. However, a more conservative approach is to regard the df as N - 1 even though some categories were not included in the calculations. MH reports the p values associated with both df.
The Bhapkar test is a more powerful alternative to the Stuart-Maxwell test. It is similar to the latter in computational details, and again produces a test statistic which is interpreted as a chi-squared value. The df are as described above. See Agresti (2002, p. 422) for details.
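For readers who want to check these statistics independently, here is a minimal Python sketch under stated assumptions. It uses the textbook form of the Stuart-Maxwell statistic (a quadratic form in the differences between row and column totals) and obtains the Bhapkar statistic from the identity W = X² / (1 - X²/n), where n is the total number of cases. Both come from the general literature rather than from this guide, so MH's internal computations may differ in detail. The function names are hypothetical. Categories with no off-diagonal frequencies are dropped, and the returned df is N - m - 1.

```python
def solve(matrix, rhs):
    """Solve matrix * x = rhs by Gaussian elimination with partial pivoting."""
    n = len(rhs)
    aug = [list(map(float, row)) + [float(v)] for row, v in zip(matrix, rhs)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        if abs(aug[pivot][col]) < 1e-12:
            raise ValueError("singular covariance matrix")
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for r in range(col + 1, n):
            factor = aug[r][col] / aug[col][col]
            for j in range(col, n + 1):
                aug[r][j] -= factor * aug[col][j]
    x = [0.0] * n
    for i in reversed(range(n)):
        tail = sum(aug[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (aug[i][n] - tail) / aug[i][i]
    return x


def marginal_homogeneity(table):
    """Return (Stuart-Maxwell X2, Bhapkar X2, df) for a square frequency table."""
    size = len(table)
    total = sum(map(sum, table))
    rows = [sum(r) for r in table]
    cols = [sum(r[j] for r in table) for j in range(size)]
    # Keep only categories that have at least one off-diagonal frequency.
    keep = [k for k in range(size) if rows[k] + cols[k] - 2 * table[k][k] > 0]
    use = keep[:-1]  # the last retained category is redundant
    if not use:
        raise ValueError("no off-diagonal frequencies; nothing to test")
    d = [rows[i] - cols[i] for i in use]
    s = [[rows[i] + cols[i] - 2 * table[i][i] if i == j
          else -(table[i][j] + table[j][i]) for j in use] for i in use]
    sm = sum(di * xi for di, xi in zip(d, solve(s, d)))
    # Assumed identity linking the two statistics (not stated in this guide).
    bhapkar = sm / (1 - sm / total)
    return sm, bhapkar, len(use)


sm, bh, df = marginal_homogeneity([[10, 6], [2, 12]])
print(f"Stuart-Maxwell = {sm:.3f}, Bhapkar = {bh:.3f}, df = {df}")
```

For a 2×2 table the Stuart-Maxwell value reduces to the uncorrected McNemar statistic, which gives a convenient sanity check. The p values for df greater than 1 are not computed here, since that requires a general chi-squared distribution function.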
The Bhapkar and Stuart-Maxwell tests are asymptotically equivalent.
With a large sample (i.e., a large total number of cases), both will produce essentially the same chi-squared value.
As it is more powerful, the Bhapkar test is preferred in most circumstances. The
Stuart-Maxwell statistic is included mainly for comparison with
other results of models one might apply to the data, such as
4.4 Bowker symmetry test
This tests symmetry of the table above and below the main diagonal. The null
hypothesis is that p(i,j) = p(j,i)
for all i ≠ j, where p(i,j) is the probability
of an observation falling in row category i and column category j.
The statistic is calculated as:
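A minimal Python sketch of this calculation, assuming the standard Bowker (1948) form: the sum over all pairs i < j of (n(i,j) - n(j,i))² / (n(i,j) + n(j,i)), where n(i,j) is the observed frequency in row i and column j. Skipping pairs whose two cells are both zero, and reporting both the full df of N(N - 1)/2 and the number of non-empty pairs, are choices made for this illustration; how MH handles empty pairs is not stated here. The function name bowker is hypothetical.

```python
def bowker(table):
    """Return (statistic, full df, number of non-empty pairs) for a square table."""
    size = len(table)
    stat = 0.0
    pairs_used = 0
    for i in range(size):
        for j in range(i + 1, size):
            n = table[i][j] + table[j][i]
            if n == 0:
                continue  # an empty pair carries no information about symmetry
            stat += (table[i][j] - table[j][i]) ** 2 / n
            pairs_used += 1
    return stat, size * (size - 1) // 2, pairs_used


table = [
    [75, 1, 3, 1, 0],
    [1, 1, 0, 0, 1],
    [5, 2, 4, 0, 1],
    [0, 0, 2, 1, 3],
    [0, 0, 0, 0, 12],
]
stat, full_df, pairs_used = bowker(table)
print(f"chi-squared = {stat}, full df = {full_df}, non-empty pairs = {pairs_used}")
```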
The tests described above are always performed by MH. If data are
specified as ordered-categorical, the following tests are also
4.5 McNemar test of overall bias or directional change
This compares the total frequency of cases above the main diagonal of
the data table with the total frequency of cases below the main diagonal
using the McNemar test (Bishop, Fienberg & Holland, 1975; pp. 284-285).
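The comparison just described can be sketched in Python as follows. The function name overall_bias is hypothetical, and the uncorrected chi-squared form of the McNemar statistic is assumed.

```python
import math


def overall_bias(table):
    """Return (above, below, chi-squared, p) comparing the two off-diagonal totals."""
    size = len(table)
    above = sum(table[i][j] for i in range(size) for j in range(i + 1, size))
    below = sum(table[i][j] for i in range(size) for j in range(i))
    if above + below == 0:
        return above, below, 0.0, 1.0  # no off-diagonal cases at all
    # Assumption: uncorrected McNemar statistic, referred to chi-squared with 1 df.
    x2 = (above - below) ** 2 / (above + below)
    p = math.erfc(math.sqrt(x2 / 2))
    return above, below, x2, p


table = [
    [75, 1, 3, 1, 0],
    [1, 1, 0, 0, 1],
    [5, 2, 4, 0, 1],
    [0, 0, 2, 1, 3],
    [0, 0, 0, 0, 12],
]
print(overall_bias(table))
```

In the example table the totals above and below the diagonal happen to be equal (10 each), so the statistic is 0 and there is no evidence of overall bias.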
The test's interpretation depends on the particular application. For
example, with rater agreement data a significant result implies that one
rater's ratings are generally higher or lower than the other rater's
ratings, indicating overall bias. If the row and column
variables are pre- and post-treatment measures, a significant result
implies overall improvement or worsening of cases associated with
4.6 McNemar tests for equal thresholds
Ordered categories often result from the discretization of a trait that
is fundamentally continuous. When this is true, there is a connection
between the cumulative proportion of cases below various levels of the
variable and graded thresholds associated with each
Level k, for 1 < k < N, results when a case exceeds the threshold for level k but does not exceed the threshold for level k + 1.
Level k = N results if a case exceeds the threshold for level N.
MH tests homogeneity of row and column cumulative proportions of cases below each level k = 2, ..., N. Each test is done by collapsing the N×N table into a 2×2 table and performing the McNemar test. For a given level k, the 2×2 table is constructed by combining all rows/columns less than k and all rows/columns greater than or equal to k.
This produces N - 1 separate tests. For each test, a significant
chi-squared value implies that the row and column variables have
different cumulative proportions below level k and therefore that
the row and column variables have different thresholds for level
k. As before, a Bonferroni or similar adjustment to the alpha
level may be made to account for the multiple comparisons.
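The collapsing rule for the threshold tests can be sketched in Python as below. This is an illustration rather than MH's own code. The helper names are hypothetical, levels below k are treated as the "-" level of Table 1 (an assumption), and the exact binomial test is used when b + c < 10, as described for the McNemar test earlier.

```python
import math


def collapse_at_level(table, k):
    """Collapse to (a, b, c, d): levels below k vs. levels k and above (k = 2..N)."""
    size = len(table)
    cut = k - 1  # number of levels that fall below level k
    low, high = range(cut), range(cut, size)
    a = sum(table[i][j] for i in low for j in low)
    b = sum(table[i][j] for i in low for j in high)
    c = sum(table[i][j] for i in high for j in low)
    d = sum(table[i][j] for i in high for j in high)
    return a, b, c, d


def mcnemar_p(b, c):
    """Two-tailed p: exact binomial if b + c < 10, else chi-squared with 1 df."""
    n = b + c
    if n == 0:
        return 1.0
    if n < 10:
        tail = sum(math.comb(n, i) for i in range(min(b, c) + 1)) / 2 ** n
        return min(1.0, 2 * tail)
    return math.erfc(math.sqrt((b - c) ** 2 / n / 2))


table = [
    [75, 1, 3, 1, 0],
    [1, 1, 0, 0, 1],
    [5, 2, 4, 0, 1],
    [0, 0, 2, 1, 3],
    [0, 0, 0, 0, 12],
]
total = sum(map(sum, table))
for k in range(2, len(table) + 1):
    a, b, c, d = collapse_at_level(table, k)
    row_cum = (a + b) / total  # proportion of cases below level k, row variable
    col_cum = (a + c) / total  # proportion of cases below level k, column variable
    print(f"level {k}: row {row_cum:.3f}  column {col_cum:.3f}  p = {mcnemar_p(b, c):.4f}")
```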
5 Output File
The program output has four sections: the Input section,
the Basic Tests section, the Tests for Ordered-Category Data section and
the Graphic Output section.
5.1 Input section
This section prints the command file and the total number of cases.
5.2 Basic tests
This section first prints the frequencies (i.e., a, b, c, d, in that order, of Table 1) for the collapsed tables associated with the marginal homogeneity test of each category (see section 4.2 above).
Marginal homogeneity tests for each category
The next table shows the results of the McNemar test of row/column marginal homogeneity for each category:
Bhapkar and Stuart-Maxwell tests
Next the results of the Bhapkar and Stuart-Maxwell tests of overall marginal homogeneity appear. Reported are the calculated chi-squared values, the df, and the associated p value.
If categories were not included in the tests because of reasons discussed in Section 4.3, the number of such categories is reported; df and p values for both the conservative (i.e., with respect to all categories) and the nonconservative (i.e., with respect only to the categories used for calculations) interpretations of the tests are shown.
Bowker symmetry test
This section shows the chi-squared value, the df, and the p value
for the test of table symmetry.
5.3 Tests for Ordered-Category Data
Test of overall bias or direction of change
This section reports the number of cases above the main diagonal of the data table, the number of cases below the main diagonal, and the chi-squared value, df, and p value for the McNemar test of overall bias or directional change.
Fourfold tables
This section first prints the frequencies (i.e., a, b, c, d, in that order, of Table 1) for the fourfold table associated with the test of equal thresholds for each level of the variable.
Tests of equal thresholds
This table shows the results of the McNemar test of equality of the row and column thresholds for each level of the variable:
C_k is the proportion of cases with levels less than Level k for the row or column variable; and
F^-1() is the inverse of the standard normal cumulative distribution function (probit function).
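As a hedged illustration of the probit transformation defined above, the Python sketch below applies the inverse standard normal CDF to cumulative proportions computed from marginal totals. The function name probit_threshold and the handling of proportions of exactly 0 or 1 (returning None, since the threshold is then unbounded) are assumptions made for this example. The marginal totals are those of the example table in the Input File section.

```python
from statistics import NormalDist


def probit_threshold(cum_prop):
    """Return F^-1(C_k), or None when C_k is 0 or 1 and the threshold is unbounded."""
    if cum_prop <= 0.0 or cum_prop >= 1.0:
        return None
    return NormalDist().inv_cdf(cum_prop)


# Row and column totals of the example table (113 cases).
row_totals = [80, 3, 12, 6, 12]
col_totals = [81, 4, 9, 2, 17]
n = sum(row_totals)
for k in range(2, len(row_totals) + 1):
    row_c = sum(row_totals[:k - 1]) / n  # proportion below level k, row variable
    col_c = sum(col_totals[:k - 1]) / n  # proportion below level k, column variable
    print(f"level {k}: row {probit_threshold(row_c):.3f}  column {probit_threshold(col_c):.3f}")
```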
Note that the assumption of a normally distributed trait is not tested. The threshold values are printed for comparison purposes only and should generally not be reported. (If one wishes to test the assumption of a normally distributed underlying trait and estimate thresholds under this assumption, one can calculate the polychoric correlation, using, for example, programs that can be downloaded from that page.)
Note, however, that the normality assumption does not enter into the calculation of p values--the tests themselves are nonparametric. A significant p value implies that the row and column thresholds for Level k differ, even though the actual values of the thresholds may be unknown. Thresholds, regardless of distributional assumptions, are monotonically related to the cumulative proportions. In reporting results, then, one may give the cumulative proportions below Level k for the row and column variables and note that the threshold of the variable with the larger cumulative proportion significantly exceeds the threshold of the other variable.
5.4 Graphic output
Marginal distribution histogram
If there are 12 or fewer categories, a histogram is printed comparing the marginal distributions of categories for the row and column variables.
Cumulative proportions figure
If there are 20 or fewer levels, MH will print figures showing the cumulative proportions of cases below each level and the associated probit-based category thresholds.
The graph of cumulative proportions shows the proportion of cases below levels k = 1, ..., N for the row and column variables. Levels 1 to 9 are labeled with the integers 1 to 9. Level 10 is labeled with a 0. Levels 11-20 are labeled with the lower-case letters a, b, ..., j. Note that some labels may overprint others.
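The labeling rule can be written out as a small Python function, which may help when reading the printed figures. The function name level_label is hypothetical.

```python
def level_label(k):
    """Return the one-character plot label for level k (1 to 20)."""
    if not 1 <= k <= 20:
        raise ValueError("figures are printed only for 20 or fewer levels")
    if k <= 9:
        return str(k)  # levels 1-9 use their own digit
    if k == 10:
        return "0"     # level 10 is shown as 0
    return "abcdefghij"[k - 11]  # levels 11-20 use a through j


print("".join(level_label(k) for k in range(1, 21)))
```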
Category thresholds figure
The graph of probit-based thresholds shows the estimated thresholds of
levels k = 2, ..., N for the row and column variables.
Thresholds are labeled as described above. The scale is relative to the
standard normal curve (e.g., -3 means three standard deviations below
the mean, etc.).
This program is distributed as-is. It has not undergone extensive testing. The author does not guarantee accuracy and assumes no responsibility for unintended consequences of its use.
Barlow W. Modeling of categorical agreement. The encyclopedia of biostatistics, P. Armitage, T. Colton, eds., pp. 541-545. New York: Wiley, 1998.
Bhapkar VP. A note on the equivalence of two test criteria for hypotheses in categorical data. Journal of the American Statistical Association, 1966, 61, 228-235.
Bishop YMM, Fienberg SE, Holland PW. Discrete multivariate analysis: theory and practice. Cambridge, Massachusetts: MIT Press, 1975.
Bowker AH. A test for symmetry in contingency tables. Journal of the American Statistical Association, 1948, 43, 572-574.
Everitt BS. The analysis of contingency tables. London: Chapman & Hall, 1977.
Fleiss JL. Statistical methods for rates and proportions (second ed.). New York: Wiley, 1981.
Maxwell AE. Comparing the classification of subjects by two independent judges. British Journal of Psychiatry, 1970, 116, 651-655.
McNemar Q. Note on the sampling error of the difference between correlated proportions or percentages. Psychometrika, 1947, 12, 153-157.
Sheskin DJ. Handbook of parametric and nonparametric statistical procedures (second edition). Boca Raton: Chapman & Hall, 2000.
Somes G. McNemar test. Encyclopedia of statistical sciences, vol. 5, S. Kotz & N. Johnson, eds., pp. 361-363. New York: Wiley, 1983.
Stuart A. A test for homogeneity of the marginal distributions in a two-way classification. Biometrika, 1955, 42, 412-416.
This manual is available online at: http://john-uebersax.com/stat/mh.htm
I hope you find the MH program helpful. Please let me know if the program does not work correctly. If so, please include the input file you tried to process along with your email.
Either of the following formats may be used to cite the MH program (or this page):
Uebersax JS. User Guide for the MH Program (Vers. 1.2). Computer program documentation. 2006.
Uebersax JS. User guide for the MH program (vers. 1.2). Statistical Methods for Rater Agreement website. 2006. Available at: http://john-uebersax.com/stat/mh.htm. Accessed: month dd, yyyy.
Last updated: 10 April 2007