Much neglected, raw agreement indices are important descriptive statistics. They have unique common-sense value. A study that reports only simple agreement rates can be very useful; a study that omits them but reports complex statistics may fail to inform readers at a practical level.
Raw agreement measures and their calculation are explained below. We examine first the case of agreement between two raters on dichotomous ratings.
Consider the ratings of two raters (or experts, judges, diagnostic procedures, etc.) summarized by Table 1:
| Rater 1 \ Rater 2 | + | - | total |
|---|---|---|---|
| + | a | b | a + b |
| - | c | d | c + d |
| total | a + c | b + d | N |
The values a, b, c and d here denote the observed frequencies for
each possible combination of ratings by Rater 1 and Rater 2.
 
 
Proportion of overall agreement
The proportion of overall agreement (po) is the proportion of cases for which Raters 1 and 2 agree. That is:
          a + d          a + d
po  =  --------------  =  -----.     (1)
        a + b + c + d       N

This proportion is informative and useful but, taken by itself, has limitations. One is that it does not distinguish between agreement on positive ratings and agreement on negative ratings.
Consider, for example, an epidemiological application where a positive rating corresponds to a positive diagnosis for a very rare disease -- one, say, with a prevalence of 1 in 1,000,000. Here we might not be much impressed if po is very high -- even above .99. This result would be due almost entirely to agreement on disease absence; we are not directly informed as to whether diagnosticians agree on disease presence.
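To make Eq. (1) concrete, here is a minimal Python sketch using hypothetical counts for a low-prevalence condition; the point is that po is driven almost entirely by the d cell (agreement on absence).

```python
# Proportion of overall agreement, Eq. (1): po = (a + d) / N.
# The counts below are hypothetical, chosen so that agreement on the
# rare "positive" category contributes almost nothing to po.

def overall_agreement(a, b, c, d):
    """Return po for a 2x2 agreement table with cells a, b, c, d."""
    n = a + b + c + d
    return (a + d) / n

# Hypothetical low-prevalence table: 2 agreements on presence,
# 6 disagreements, 992 agreements on absence.
a, b, c, d = 2, 3, 3, 992
print(overall_agreement(a, b, c, d))   # 0.994 -- driven almost entirely by d
```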
Further, one may consider Cohen's (1960) criticism of po: that it can be high even with hypothetical raters who randomly guess on each case according to probabilities equal to the observed base rates. In this example, if both raters simply guessed "negative" the large majority of times, they would usually agree on the diagnosis. Cohen proposed to remedy this by comparing po to a corresponding quantity, pc, the proportion of agreement expected from raters who randomly guess. As described on the kappa coefficients page, this logic is questionable; in particular, it is not clear what advantage there is in comparing an actual level of agreement, po, with a hypothetical value, pc, that would occur under an obviously unrealistic model.
A much simpler way to address this issue is described immediately below.
 
 
Positive agreement and negative agreement
We may also compute observed agreement relative to each rating category individually.
Generically the resulting indices are called the proportions of specific agreement (Cicchetti & Feinstein, 1990;
Spitzer & Fleiss, 1974).
With binary ratings, there are two such indices, positive agreement (PA) and
negative agreement (NA).
They are calculated as follows:
         2a                  2d
PA =  ----------;   NA =  ----------.     (2)
      2a + b + c          2d + b + c
PA, for example, estimates the conditional probability that, given that one of the raters,
randomly selected, makes a positive rating, the other rater will also do so.
A joint consideration of PA and NA addresses the potential concern that,
when base rates are extreme, po is liable to chance-related inflation or bias.
Such inflation, if it exists at all, would affect only the more frequent category. Thus if both PA and
NA are satisfactorily large, there is arguably less need or purpose in comparing actual to chance-
predicted agreement using a kappa statistic.
But in any case, PA and NA provide more information relevant to understanding and improving ratings
than a single omnibus index (see Cicchetti and Feinstein, 1990).
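Continuing the hypothetical low-prevalence counts used in the earlier sketch, a minimal Python version of Eq. (2) shows how PA and NA separate the two kinds of agreement that po lumps together.

```python
# Proportions of specific agreement, Eq. (2):
#   PA = 2a / (2a + b + c),   NA = 2d / (2d + b + c).

def specific_agreement(a, b, c, d):
    pa = 2 * a / (2 * a + b + c)   # positive agreement
    na = 2 * d / (2 * d + b + c)   # negative agreement
    return pa, na

# Same hypothetical low-prevalence table as above.
a, b, c, d = 2, 3, 3, 992
pa, na = specific_agreement(a, b, c, d)
print(round(pa, 3), round(na, 3))   # 0.4 and 0.997: po looked excellent, PA does not
```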
 
 
Significance, standard errors, interval estimation
Statistical significance. In testing the significance of po, the null hypothesis is that raters are independent, with their marginal assignment probabilities equal to the observed marginal proportions. For a 2×2 table, this is the same as the usual test of statistical independence in a contingency table; any of the standard tests (a chi-squared test, a Fisher exact test, or the significance test of Cohen's kappa or of the odds ratio) could be used.
Standard error.
One can use standard methods applicable to proportions to estimate the standard error and confidence
limits of po.
For a sample size N, the standard error of
po is:
SE(po) = sqrt[po(1 - po)/N] (3.1)
One can alternatively estimate SE(po) using resampling methods, e.g., the
nonparametric bootstrap or the jackknife, as described in the next
section.
Confidence intervals. The Wald or "normal approximation" method
estimates confidence limits of a proportion as follows:
CL = po - SE × zcrit (3.2)
CU = po + SE × zcrit (3.3)
where SE here is SE(po) as estimated by Eq. (3.1), CL and CU are the lower and
upper confidence limits, and zcrit is the z-value associated with a confidence
range with coverage probability crit. For a 95% confidence range, zcrit = 1.96;
for a 90% confidence range, zcrit = 1.645.
When po is either very large or very small (and especially with small sample sizes) the Wald method may produce confidence limits less than 0 or greater than 1; in this case better approximate methods (see Agresti, 1996), exact methods, or resampling methods (see below) can be used instead.
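As an illustration of Eqs. (3.1)-(3.3), the following Python sketch computes po with its Wald limits; the 2×2 counts are hypothetical.

```python
import math

def wald_ci_po(a, b, c, d, z_crit=1.96):
    """po with its Wald standard error and confidence limits, Eqs. (3.1)-(3.3)."""
    n = a + b + c + d
    po = (a + d) / n
    se = math.sqrt(po * (1 - po) / n)    # Eq. (3.1)
    lower = po - z_crit * se             # Eq. (3.2)
    upper = po + z_crit * se             # Eq. (3.3)
    # For extreme po or small N these limits can fall outside [0, 1];
    # see the caveat above about better approximate, exact, or resampling methods.
    return po, se, (lower, upper)

print(wald_ci_po(40, 10, 15, 35))        # hypothetical counts, 95% limits
```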
Statistical significance. Logically, there is only one test of independence in a 2×2 table; therefore if PA significantly differs from chance, so too would NA, and vice versa. Spitzer and Fleiss (1974) described kappa tests for specific rating levels; in a 2×2 there are two such "specific kappas", but both have the same value and statistical significance as the overall kappa.
Standard errors.
SE(PA) = sqrt[4a (c + b)(a + c + b)] / (2a + b + c)^2     (3.4)

SE(NA) = sqrt[4d (c + b)(d + c + b)] / (2d + b + c)^2     (3.5)
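A direct Python transcription of Eqs. (3.4) and (3.5) as given above might look as follows; the counts are hypothetical.

```python
import math

def se_specific_agreement(a, b, c, d):
    """Asymptotic standard errors of PA and NA, as printed in Eqs. (3.4)-(3.5)."""
    se_pa = math.sqrt(4 * a * (b + c) * (a + b + c)) / (2 * a + b + c) ** 2
    se_na = math.sqrt(4 * d * (b + c) * (d + b + c)) / (2 * d + b + c) ** 2
    return se_pa, se_na

print(se_specific_agreement(40, 10, 15, 35))   # hypothetical 2x2 counts
```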
Confidence intervals. Wald-type confidence limits for PA and NA can be obtained by combining the standard errors above with Eqs. (3.2) and (3.3); alternatively, one may estimate them by bootstrapping.
An advantage of bootstrapping is that one can use the same simulated data sets to estimate not only the standard errors and confidence limits of PA and NA, but also those of po or any other statistic defined for the 2×2 table.
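Here is a minimal nonparametric bootstrap sketch in Python, assuming the N cases are iid draws over the four cells of Table 1; it returns percentile confidence limits for po, PA, and NA from the same resamples. The counts are hypothetical.

```python
import random

def bootstrap_ci(a, b, c, d, n_boot=2000, alpha=0.05, seed=1):
    """Percentile bootstrap confidence limits for po, PA, and NA from a 2x2 table."""
    rng = random.Random(seed)
    cases = ["a"] * a + ["b"] * b + ["c"] * c + ["d"] * d    # one cell label per case
    stats = {"po": [], "PA": [], "NA": []}
    for _ in range(n_boot):
        resample = [rng.choice(cases) for _ in cases]        # resample cases with replacement
        ra, rb, rc, rd = (resample.count(x) for x in "abcd")
        n = ra + rb + rc + rd
        stats["po"].append((ra + rd) / n)
        if 2 * ra + rb + rc:
            stats["PA"].append(2 * ra / (2 * ra + rb + rc))
        if 2 * rd + rb + rc:
            stats["NA"].append(2 * rd / (2 * rd + rb + rc))
    limits = {}
    for name, vals in stats.items():
        vals.sort()
        lo = vals[int(alpha / 2 * len(vals))]
        hi = vals[int((1 - alpha / 2) * len(vals)) - 1]
        limits[name] = (round(lo, 3), round(hi, 3))
    return limits

print(bootstrap_ci(40, 10, 15, 35))   # hypothetical counts
```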
A SAS program to estimate the asymptotic standard errors and confidence limits of PA and NA has been written. A standalone program (executable and Fortran 90 source code) that supplies both bootstrap and asymptotic standard errors and confidence limits can be downloaded here.
Readers are referred to Graham and Bull (1998) for fuller coverage of this
topic, including a comparison of different methods for estimating confidence
intervals for PA and NA.
 
We now consider results for two raters making polytomous (either ordered category or purely nominal) ratings. Let C denote the number of rating categories or levels. Results for the two raters may be summarized as a C × C table such as Table 2.
| Rater 1 \ Rater 2 | 1 | 2 | ... | C | total |
|---|---|---|---|---|---|
| 1 | n11 | n12 | ... | n1C | n1. |
| 2 | n21 | n22 | ... | n2C | n2. |
| ... | ... | ... | ... | ... | ... |
| C | nC1 | nC2 | ... | nCC | nC. |
| total | n.1 | n.2 | ... | n.C | N |
Here nij denotes the number of cases assigned rating category i by
Rater 1 and category j by Rater 2, with i, j = 1, ..., C. When a "."
appears in a subscript, it denotes a marginal sum over the corresponding
index; e.g., ni. is the sum of nij for j = 1, ...,
C, or the row marginal sum for category i; n.. = N
denotes the total number of cases.
 
Overall Agreement
For this design, po is the sum of the frequencies on the main
diagonal of table {nij} divided by the sample size, or
C
po = 1/N SUM nii (4)
i=1
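As a quick illustration of Eq. (4), here is a Python sketch with a hypothetical 3 × 3 table.

```python
# Eq. (4): po = (sum of the diagonal frequencies) / N.
def overall_agreement_cxc(table):
    n = sum(sum(row) for row in table)
    diag = sum(table[i][i] for i in range(len(table)))
    return diag / n

# Hypothetical 3x3 cross-classification of two raters' ratings.
table = [[20, 3, 1],
         [4, 15, 2],
         [0, 2, 13]]
print(overall_agreement_cxc(table))   # 0.8
```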
Statistical significance
The significance of po can be tested with a parametric bootstrap. The procedure generates many simulated data sets of size N under the null hypothesis that the raters are independent, assigning each case to cell (i, j) with probability equal to the product of the observed marginal proportions,

        ni. n.j
πij  =  -------,     (5)
          N^2

and tabulates overall agreement, denoted p*o, for each simulated sample. The po for the actual data is considered statistically significant at, say, the .05 level if it exceeds 95% of the p*o values.
If one already has a computer program for nonparametric bootstrapping, only slight modifications are needed to adapt it to perform a parametric bootstrap significance test.
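A sketch of that parametric bootstrap test in Python, assuming the observed table is available as a nested list; cell probabilities follow Eq. (5), the cutoff follows the rule just described, and the counts are hypothetical.

```python
import random

def parametric_bootstrap_p(table, n_boot=2000, seed=1):
    """One-tailed bootstrap p-value for po under rater independence, using Eq. (5)."""
    rng = random.Random(seed)
    c = len(table)
    n = sum(sum(row) for row in table)
    row_marg = [sum(table[i][j] for j in range(c)) / n for i in range(c)]
    col_marg = [sum(table[i][j] for i in range(c)) / n for j in range(c)]
    cells = [(i, j) for i in range(c) for j in range(c)]
    probs = [row_marg[i] * col_marg[j] for (i, j) in cells]     # pi_ij of Eq. (5)
    po_obs = sum(table[i][i] for i in range(c)) / n

    exceed = 0
    for _ in range(n_boot):
        draws = rng.choices(cells, weights=probs, k=n)          # simulated sample of N cases
        po_star = sum(1 for (i, j) in draws if i == j) / n
        if po_star >= po_obs:
            exceed += 1
    return exceed / n_boot                                      # small values indicate significance

table = [[20, 3, 1], [4, 15, 2], [0, 2, 13]]                    # hypothetical counts
print(parametric_bootstrap_p(table))
```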
Specific agreement
With respect to Table 2, the proportion of agreement specific to category i
is:
2nii
ps(i) = ---------. (6)
ni. + n.i
Statistical significance
Eq. (6) amounts to collapsing the C × C table into a 2×2 table relative to category i, considering this category a 'positive' rating, and then computing the positive agreement (PA) index of Eq. (2). This is done for each category i successively. In each reduced table one may perform a test of statistical independence using Cohen's kappa, the odds ratio, or chi-squared, or use a Fisher exact test.
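The following Python sketch computes Eq. (6) for each category of a C × C table and also returns the collapsed 2×2 counts that the independence tests just mentioned would use; the table is hypothetical.

```python
def specific_agreement_cxc(table):
    """ps(i) of Eq. (6) and the collapsed 2x2 counts for each category i."""
    c = len(table)
    n = sum(sum(row) for row in table)
    results = []
    for i in range(c):
        row_i = sum(table[i])                       # n_i.
        col_i = sum(table[r][i] for r in range(c))  # n_.i
        a = table[i][i]                             # both raters chose category i
        b = row_i - a                               # Rater 1 chose i, Rater 2 did not
        cc = col_i - a                              # Rater 2 chose i, Rater 1 did not
        d = n - a - b - cc                          # neither rater chose i
        ps = 2 * a / (row_i + col_i)                # Eq. (6)
        results.append((ps, (a, b, cc, d)))
    return results

table = [[20, 3, 1], [4, 15, 2], [0, 2, 13]]        # hypothetical counts
for i, (ps, cells) in enumerate(specific_agreement_cxc(table), start=1):
    print(f"category {i}: ps = {ps:.3f}, collapsed 2x2 = {cells}")
```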
Standard errors and confidence limits
Before proceeding to the fully general case, it will help to look at the simpler situation of estimating specific positive agreement given multiple binary ratings.
For a given case with two or more binary (positive/negative) ratings, let n and m denote the number of ratings and the number of positive ratings, respectively. For this case there are exactly y = m(m − 1) observed pairwise agreements on a positive rating, and x = m(n − 1) opportunities for such agreement. If we compute x and y for each case and sum both over all cases, then the sum of y divided by the sum of x is the proportion of specific positive agreement in the entire sample.
This SAS program illustrates the calculations.
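A minimal Python sketch of the calculation just described may also be helpful; the per-case rating counts are hypothetical.

```python
def specific_positive_agreement(cases):
    """cases: list of (n, m) pairs -- total ratings and positive ratings per case."""
    agree = sum(m * (m - 1) for n, m in cases)        # sum of y = m(m - 1)
    possible = sum(m * (n - 1) for n, m in cases)     # sum of x = m(n - 1)
    return agree / possible

# Hypothetical data: each tuple is (number of ratings, number positive) for one case.
cases = [(3, 3), (3, 2), (4, 1), (3, 0), (4, 4)]
print(round(specific_positive_agreement(cases), 3))   # 0.8
```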
We may now proceed to fully generalized formulas for the proportions of overall and
specific agreement. They apply to binary, ordered category, or nominal
ratings and permit any number of raters, with potentially different
numbers of raters or different raters for each case.
 
Let there be K rated cases indexed by k = 1, ..., K. The ratings made
on case k are summarized as:
{njk} (j = 1, ..., C) = {n1k, n2k, ..., nCk}
where njk is the number of times category j (j = 1, ..., C) is applied to case k. For example, if a case k is rated five times and receives ratings of 1, 1, 1, 2, and 2, then n1k = 3, n2k = 2, and {njk} = {3, 2}.
Let nk denote the total number of ratings made on case k; that is,
C
nk = SUM njk. (7)
j=1
For case k, the number of actual agreements on rating level j is
njk (njk - 1). (8)
The total number of agreements specifically on rating level j, across all
cases is
K
S(j) = SUM njk (njk - 1). (9)
k=1
The number of possible agreements specifically on category j for case k is
equal to
njk (nk - 1) (10)
and the number of possible agreements on category j across all cases is:
K
Sposs(j) = SUM njk (nk - 1). (11)
k=1
The proportion of agreement specific to category j is equal to the total
number of agreements on category j divided by the
total number of opportunities for agreement on category j, or
S(j)
ps(j) = -------. (12)
Sposs(j)
The total number of actual agreements, regardless of category,
is equal to the sum of Eq. (9) across all categories, or
C
O = SUM S(j). (13)
j=1
The total number of possible agreements is
K
Oposs = SUM nk (nk - 1). (14)
k=1
Dividing Eq. (13) by Eq. (14) gives the overall proportion of
observed agreement, or
O
po = ------. (15)
Oposs
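Eqs. (7)-(15) translate directly into code. The Python sketch below takes, for each case, the count vector {njk} and returns ps(j) for every category together with po; the example data are hypothetical.

```python
def generalized_agreement(count_vectors):
    """count_vectors: list over cases; each entry is [n_1k, ..., n_Ck]."""
    c = len(count_vectors[0])
    s = [0] * c          # S(j), Eq. (9): actual agreements on category j
    s_poss = [0] * c     # Sposs(j), Eq. (11): possible agreements on category j
    o = 0                # O, Eq. (13)
    o_poss = 0           # Oposs, Eq. (14)
    for njk in count_vectors:
        nk = sum(njk)                        # Eq. (7)
        o_poss += nk * (nk - 1)              # Eq. (14)
        for j, n in enumerate(njk):
            s[j] += n * (n - 1)              # Eq. (9)
            s_poss[j] += n * (nk - 1)        # Eq. (11)
            o += n * (n - 1)                 # Eq. (13)
    ps = [s[j] / s_poss[j] if s_poss[j] else float("nan") for j in range(c)]  # Eq. (12)
    po = o / o_poss                          # Eq. (15)
    return ps, po

# Hypothetical data: three categories, varying numbers of ratings per case.
cases = [[3, 2, 0], [4, 0, 0], [1, 1, 2], [0, 3, 1]]
ps, po = generalized_agreement(cases)
print([round(p, 3) for p in ps], round(po, 3))
```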
The jackknife or, preferably, the nonparametric bootstrap can be used to estimate standard errors of ps(j) and po in the generalized case. The bootstrap is uncomplicated if one can assume cases are independent and identically distributed (iid).
When this assumption is reasonable, one may construct each simulated sample by repeated random sampling with replacement from the set of K cases.
If cases cannot be assumed iid (for example, if ratings are not missing at random or a study systematically rotates raters), simple modifications of the bootstrap method, such as two-stage sampling, can be made.
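Under the iid assumption, case resampling is only a few lines of code. Here is a minimal Python sketch of a bootstrap standard error for po (hypothetical data; a two-stage or otherwise modified resampling scheme could replace the simple resampling step when the iid assumption fails).

```python
import random

def po_from_cases(count_vectors):
    """Overall agreement po of Eq. (15) from per-case category counts."""
    o = sum(n * (n - 1) for njk in count_vectors for n in njk)
    o_poss = sum(sum(njk) * (sum(njk) - 1) for njk in count_vectors)
    return o / o_poss

def bootstrap_se_po(count_vectors, n_boot=1000, seed=1):
    """Bootstrap standard error of po: resample whole cases with replacement."""
    rng = random.Random(seed)
    k = len(count_vectors)
    po_stars = []
    for _ in range(n_boot):
        resample = [count_vectors[rng.randrange(k)] for _ in range(k)]
        po_stars.append(po_from_cases(resample))
    mean = sum(po_stars) / n_boot
    var = sum((p - mean) ** 2 for p in po_stars) / (n_boot - 1)
    return var ** 0.5

cases = [[3, 2, 0], [4, 0, 0], [1, 1, 2], [0, 3, 1]]   # hypothetical, as above
print(round(bootstrap_se_po(cases), 3))
```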
The parametric bootstrap can be used for significance testing. A variation of this method, patterned after the Monte Carlo approach described by Uebersax (1982), is as follows:
Loop through s, where s indexes simulated data sets
    Loop through all cases k
        Loop through all ratings on case k
            For each actual rating, generate a random simulated rating, chosen such that:
                Pr(Rating category = j | Rater = i) = base rate of category j for Rater i.
            (If rater identities are unknown, or for a reproducibility study, the total base rate for category j is used.)
        End loop through case k's ratings
    End loop through cases
    Calculate p*o and p*s(j) (and any other statistics of interest) for sample s.
End main loop
The significance of po, ps(j), or any other statistic
calculated, is determined with reference to the distribution of corresponding
values in the simulated data sets. For example, po is significant at
the .05 level (1-tailed) if it exceeds 95% of the p*o
values obtained for the simulated data sets.
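Below is a Python rendering of the pseudocode above for the situation in which rater identities are unknown or a reproducibility study is involved (so the total base rate of each category is used); the per-case count vectors are hypothetical.

```python
import random

def po_from_cases(count_vectors):
    """Overall agreement po of Eq. (15) from per-case category counts."""
    o = sum(n * (n - 1) for njk in count_vectors for n in njk)
    o_poss = sum(sum(njk) * (sum(njk) - 1) for njk in count_vectors)
    return o / o_poss

def parametric_bootstrap_po(count_vectors, n_boot=2000, seed=1):
    """One-tailed p-value for po: simulate ratings from the overall category base rates."""
    rng = random.Random(seed)
    c = len(count_vectors[0])
    totals = [sum(njk[j] for njk in count_vectors) for j in range(c)]
    grand = sum(totals)
    base_rates = [t / grand for t in totals]          # Pr(rating = j), rater identity unknown
    po_obs = po_from_cases(count_vectors)

    exceed = 0
    for _ in range(n_boot):
        sim = []
        for njk in count_vectors:
            nk = sum(njk)                             # same number of ratings as the real case
            draws = rng.choices(range(c), weights=base_rates, k=nk)
            sim.append([draws.count(j) for j in range(c)])
        if po_from_cases(sim) >= po_obs:
            exceed += 1
    return exceed / n_boot                            # small values indicate significant agreement

cases = [[3, 2, 0], [4, 0, 0], [1, 1, 2], [0, 3, 1]]  # hypothetical
print(parametric_bootstrap_po(cases))
```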
 
Cicchetti DV, Feinstein AR. High agreement but low kappa: II. Resolving the paradoxes. Journal of Clinical Epidemiology, 1990, 43, 551-558.
Cohen J. A coefficient of agreement for nominal scales. Educational and Psychological Measurement, 1960, 20, 37-46.
Cook RJ, Farewell VT. Conditional inference for subject-specific and marginal agreement: two families of agreement measures. Canadian Journal of Statistics, 1995, 23, 333-344.
Efron B. The jackknife, the bootstrap and other resampling plans. Philadelphia: Society for Industrial and Applied Mathematics, 1982.
Efron B, Tibshirani RJ. An introduction to the bootstrap. New York: Chapman and Hall, 1993.
Fleiss JL. Measuring nominal scale agreement among many raters. Psychological Bulletin, 1971, 76, 378-381.
Fleiss JL. Statistical methods for rates and proportions, 2nd Ed. New York: John Wiley, 1981.
Graham P, Bull B. Approximate standard errors and confidence intervals for indices of positive and negative agreement. Journal of Clinical Epidemiology, 1998, 51(9), 763-771.
Mackinnon, A. A spreadsheet for the calculation of comprehensive statistics for the assessment of diagnostic tests and inter-rater agreement. Computers in Biology and Medicine, 2000, 30, 127-134.
Spitzer R, Fleiss J. A re-analysis of the reliability of psychiatric diagnosis. British Journal of Psychiatry, 1974, 125, 341-347.
Uebersax JS. A design-independent method for measuring the reliability of psychiatric diagnosis. Journal of Psychiatric Research, 1982-83, 17(4), 335-342.
Rev: 19 Sep 2018 (SAS program for multiple binary ratings)