Much neglected, raw agreement indices are important descriptive statistics. They have unique common-sense value. A study that reports only simple agreement rates can be very useful; a study that omits them but reports complex statistics may fail to inform readers at a practical level.
Raw agreement measures and their calculation are explained below. We examine first the case of agreement between two raters on dichotomous ratings.
Consider the ratings of two raters (or experts, judges, diagnostic procedures, etc.) summarized by Table 1:
Rater 1 | Rater 2: + | Rater 2: - | total |
---|---|---|---|
+ | a | b | a + b |
- | c | d | c + d |
total | a + c | b + d | N |
The values a, b, c and d here denote the observed frequencies for
each possible combination of ratings by Rater 1 and Rater 2.
The proportion of overall agreement (p_{o}) is the proportion of cases for which Raters 1 and 2 agree. That is:
$$p_{o} = \frac{a + d}{a + b + c + d} = \frac{a + d}{N}. \tag{1}$$

This proportion is informative and useful but, taken by itself, has limitations. One is that it does not distinguish between agreement on positive ratings and agreement on negative ratings.
Consider, for example, an epidemiological application where a positive rating corresponds to a positive diagnosis for a very rare disease -- one, say, with a prevalence of 1 in 1,000,000. Here we might not be much impressed if p_{o} is very high -- even above .99. This result would be due almost entirely to agreement on disease absence; we are not directly informed as to whether diagnosticians agree on disease presence.
Further, one may consider Cohen's (1960) criticism of p_{o}: that it can be high even with hypothetical raters who randomly guess on each case according to probabilities equal to the observed base rates. In this example, if both raters simply guessed "negative" the large majority of times, they would usually agree on the diagnosis. Cohen proposed to remedy this by comparing p_{o} to a corresponding quantity, p_{c}, the proportion of agreement expected from raters who randomly guess. As described on the kappa coefficients page, this logic is questionable; in particular, it is not clear what advantage there is in comparing an actual level of agreement, p_{o}, with a hypothetical value, p_{c}, which would occur under an obviously unrealistic model.
A much simpler way to address this issue is described immediately below.
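The proportions of positive agreement (PA) and negative agreement (NA) are:

$$\mathrm{PA} = \frac{2a}{2a + b + c}; \qquad \mathrm{NA} = \frac{2d}{2d + b + c}. \tag{2}$$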
A joint consideration of PA and NA addresses the potential concern that,
when base rates are extreme, p_{o} is liable to chance-related inflation or bias.
Such inflation, if it exists at all, would affect only the more frequent category. Thus if both PA and
NA are satisfactorily large, there is arguably less need or purpose in comparing actual to chance-
predicted agreement using a kappa statistic.
But in any case, PA and NA provide more information relevant to understanding and improving ratings
than a single omnibus index (see Cicchetti and Feinstein, 1990).
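The contrast between p_{o} and the pair PA, NA can be illustrated with a minimal sketch of Eqs. (1) and (2). The function name `raw_agreement_2x2` and the cell counts are hypothetical, chosen only to mimic a rare condition; they are not data from any study.

```python
def raw_agreement_2x2(a, b, c, d):
    """Return (p_o, PA, NA) for a 2x2 table with cell counts a, b, c, d."""
    n = a + b + c + d
    p_o = (a + d) / n                      # Eq. (1)
    # PA and NA are undefined when their denominators are zero.
    pa = 2 * a / (2 * a + b + c) if (2 * a + b + c) > 0 else float("nan")
    na = 2 * d / (2 * d + b + c) if (2 * d + b + c) > 0 else float("nan")
    return p_o, pa, na                     # PA, NA as in Eq. (2)


# Hypothetical counts: 1000 cases, very few positive ratings.
p_o, pa, na = raw_agreement_2x2(a=2, b=4, c=4, d=990)
print(f"p_o = {p_o:.3f}, PA = {pa:.3f}, NA = {na:.3f}")
# prints: p_o = 0.992, PA = 0.333, NA = 0.996
```

With these invented counts, overall agreement looks excellent while agreement on the positive rating is poor, which is the situation a joint look at PA and NA is meant to expose.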
Statistical significance. In testing the significance of p_{o}, the null hypothesis is that raters are independent, with their marginal assignment probabilities equal to the observed marginal proportions. For a 2×2 table, the test is the same as a usual test of statistical independence in a contingency table. Any of the following could potentially be used: a test of Cohen's kappa, a test of the odds ratio, a chi-squared test, or a Fisher exact test.
Standard error. One can use standard methods applicable to proportions to estimate the standard error and confidence limits of p_{o}. For a sample size N, the standard error of p_{o} is:
$$\mathrm{SE}(p_{o}) = \sqrt{\frac{p_{o}(1 - p_{o})}{N}} \tag{3.1}$$

One can alternatively estimate SE(p_{o}) using resampling methods, e.g., the nonparametric bootstrap or the jackknife, as described in the next section.
Confidence intervals. The Wald or "normal approximation" method estimates confidence limits of a proportion as follows:
$$\mathrm{CL} = p_{o} - \mathrm{SE} \times z_{crit} \tag{3.2}$$

$$\mathrm{CU} = p_{o} + \mathrm{SE} \times z_{crit} \tag{3.3}$$

where SE here is SE(p_{o}) as estimated by Eq. (3.1), CL and CU are the lower and upper confidence limits, and z_{crit} is the z-value associated with a confidence range with coverage probability crit. For a 95% confidence range, z_{crit} = 1.96; for a 90% confidence range, z_{crit} = 1.645.
When p_{o} is either very large or very small (and especially with small sample sizes) the Wald method may produce confidence limits less than 0 or greater than 1; in this case better approximate methods (see Agresti, 1996), exact methods, or resampling methods (see below) can be used instead.
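Eqs. (3.1) to (3.3) can be sketched in a few lines. The function name `wald_ci_overall_agreement` and the example counts are hypothetical; the sketch makes no attempt to correct limits that fall outside the range 0 to 1.

```python
import math


def wald_ci_overall_agreement(a, b, c, d, z_crit=1.96):
    """Return (p_o, SE, lower, upper) using the Wald method."""
    n = a + b + c + d
    p_o = (a + d) / n
    se = math.sqrt(p_o * (1 - p_o) / n)    # Eq. (3.1)
    lower = p_o - se * z_crit              # Eq. (3.2)
    upper = p_o + se * z_crit              # Eq. (3.3)
    return p_o, se, lower, upper


# Hypothetical 2x2 counts with N = 100.
p_o, se, lower, upper = wald_ci_overall_agreement(40, 5, 10, 45)
print(f"p_o = {p_o:.3f}, SE = {se:.4f}, 95% limits = ({lower:.3f}, {upper:.3f})")
```

Passing `z_crit=1.645` gives 90% limits instead.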
Statistical significance. Logically, there is only one test of independence in a 2×2 table; therefore if PA significantly differs from chance, so too would NA, and vice versa. Spitzer and Fleiss (1974) described kappa tests for specific rating levels; in a 2×2 table there are two such "specific kappas", but both have the same value and statistical significance as the overall kappa.
Standard errors.
$$\mathrm{SE}(\mathrm{PA}) = \frac{\sqrt{4a\,(c + b)(a + c + b)}}{(2a + b + c)^{2}} \tag{3.4}$$

$$\mathrm{SE}(\mathrm{NA}) = \frac{\sqrt{4d\,(c + b)(d + c + b)}}{(2d + b + c)^{2}} \tag{3.5}$$
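As a sketch, Eqs. (3.4) and (3.5) translate directly into code. The function name `se_pa_na` and the example counts are hypothetical.

```python
import math


def se_pa_na(a, b, c, d):
    """Return (SE(PA), SE(NA)) from Eqs. (3.4) and (3.5)."""
    se_pa = math.sqrt(4 * a * (c + b) * (a + c + b)) / (2 * a + b + c) ** 2
    se_na = math.sqrt(4 * d * (c + b) * (d + c + b)) / (2 * d + b + c) ** 2
    return se_pa, se_na


# Hypothetical 2x2 counts with N = 100.
se_pa, se_na = se_pa_na(40, 5, 10, 45)
print(f"SE(PA) = {se_pa:.4f}, SE(NA) = {se_na:.4f}")
```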
Confidence intervals.
An advantage of bootstrapping is that one can use the same simulated data sets to estimate not only the standard errors and confidence limits of PA and NA, but also those of p_{o} or any other statistic defined for the 2×2 table.
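The point about reusing the same simulated data sets can be illustrated with a minimal nonparametric bootstrap. This is a sketch under stated assumptions: cases are treated as independent, confidence limits use the simple percentile method, and the names `agreement_indices` and `bootstrap_2x2`, the number of replicates, and the example counts are all hypothetical.

```python
import random
import statistics


def agreement_indices(a, b, c, d):
    """Return p_o, PA and NA; PA or NA is None when its denominator is zero."""
    n = a + b + c + d
    pa = 2 * a / (2 * a + b + c) if 2 * a + b + c else None
    na = 2 * d / (2 * d + b + c) if 2 * d + b + c else None
    return {"p_o": (a + d) / n, "PA": pa, "NA": na}


def bootstrap_2x2(a, b, c, d, n_boot=2000, alpha=0.05, seed=0):
    """Bootstrap SEs and percentile confidence limits for p_o, PA and NA."""
    rng = random.Random(seed)
    cases = ["a"] * a + ["b"] * b + ["c"] * c + ["d"] * d  # one cell label per case
    draws = {"p_o": [], "PA": [], "NA": []}
    for _ in range(n_boot):
        sample = rng.choices(cases, k=len(cases))  # resample cases with replacement
        counts = [sample.count(cell) for cell in "abcd"]
        for name, value in agreement_indices(*counts).items():
            if value is not None:  # skip replicates where the index is undefined
                draws[name].append(value)
    results = {}
    for name, values in draws.items():
        values.sort()
        lower = values[int((alpha / 2) * len(values))]
        upper = values[min(len(values) - 1, int((1 - alpha / 2) * len(values)))]
        results[name] = {"se": statistics.stdev(values), "lower": lower, "upper": upper}
    return results


# Hypothetical 2x2 counts with N = 100.
for name, r in bootstrap_2x2(40, 5, 10, 45).items():
    print(f"{name}: SE = {r['se']:.3f}, 95% limits = ({r['lower']:.3f}, {r['upper']:.3f})")
```

Each replicate yields all three indices at once, so no extra resampling is needed to add further statistics defined on the 2×2 table.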
A SAS program to estimate the asymptotic standard errors and asymptotic confidence limits of PA and NA has been written. A standalone program (executable program and Fortran 90 source code) that supplies both bootstrap and asymptotic standard errors and confidence limits can be downloaded here.
Readers are referred to Graham and Bull (1998) for fuller coverage of this
topic, including a comparison of different methods for estimating confidence
intervals for PA and NA.
We now consider results for two raters making polytomous (either ordered category or purely nominal) ratings. Let C denote the number of rating categories or levels. Results for the two raters may be summarized as a C × C table such as Table 2.
Rater 1 | Rater 2: 1 | Rater 2: 2 | ... | Rater 2: C | total |
---|---|---|---|---|---|
1 | n_{11} | n_{12} | ... | n_{1C} | n_{1.} |
2 | n_{21} | n_{22} | ... | n_{2C} | n_{2.} |
... | ... | ... | ... | ... | ... |
C | n_{C1} | n_{C2} | ... | n_{CC} | n_{C.} |
total | n_{.1} | n_{.2} | ... | n_{.C} | N |
Here n_{ij} denotes the number of cases assigned rating category i by
Rater 1 and category j by Rater 2, with i, j = 1, ..., C. When a "."
appears in a subscript, it denotes a marginal sum over the corresponding
index; e.g., n_{i.} is the sum of n_{ij} for j = 1, ...,
C, or the row marginal sum for category i; n_{..} = N
denotes the total number of cases.
For this design, p_{o} is the sum of frequencies of the main diagonal of table {n_{ij}} divided by sample size, or
$$p_{o} = \frac{1}{N} \sum_{i=1}^{C} n_{ii} \tag{4}$$
Statistical significance
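A parametric bootstrap can be used: one generates many simulated samples of size N under the null hypothesis of independence, with the expected frequency of cell (i, j) given by

$$\pi_{ij} = \frac{n_{i.}\, n_{.j}}{N} \tag{5}$$

and then tabulates overall agreement, denoted p^{*}_{o}, for each simulated sample. The p_{o} for the actual data is considered statistically significant if it exceeds a specified percentage (e.g., 95%) of the p^{*}_{o} values.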
If one already has a computer program for nonparametric bootstrapping, only slight modifications are needed to adapt it to perform a parametric bootstrap significance test.
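A parametric bootstrap significance test for p_{o} in a C × C table might be sketched as follows. The assumptions are loud ones: simulated cases are drawn with cell probabilities proportional to n_{i.}n_{.j} (the quantity in Eq. (5), rescaled to sum to 1), and the function names, the number of simulated samples, and the example table are hypothetical.

```python
import random


def overall_agreement(table):
    """p_o of Eq. (4): main-diagonal sum divided by N."""
    n = sum(sum(row) for row in table)
    return sum(table[i][i] for i in range(len(table))) / n


def parametric_bootstrap_test(table, n_sim=2000, seed=0):
    """Return (p_o, fraction of simulated p*_o values that p_o exceeds)."""
    rng = random.Random(seed)
    c = len(table)
    n = sum(sum(row) for row in table)
    row_tot = [sum(row) for row in table]
    col_tot = [sum(table[i][j] for i in range(c)) for j in range(c)]
    cells = [(i, j) for i in range(c) for j in range(c)]
    # Null (independence) weights, proportional to n_i. * n_.j.
    weights = [row_tot[i] * col_tot[j] for i, j in cells]
    p_obs = overall_agreement(table)
    exceeded = 0
    for _ in range(n_sim):
        draws = rng.choices(cells, weights=weights, k=n)  # one simulated sample
        p_star = sum(1 for i, j in draws if i == j) / n
        if p_obs > p_star:
            exceeded += 1
    return p_obs, exceeded / n_sim


# Hypothetical 3x3 table of counts.
table = [[30, 5, 2], [4, 25, 6], [3, 5, 20]]
p_obs, fraction = parametric_bootstrap_test(table)
print(f"p_o = {p_obs:.2f}; exceeds {fraction:.1%} of simulated values")
```

Under this sketch, p_{o} would be called significant at the .05 level (1-tailed) when the returned fraction is above 0.95.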
With respect to Table 2, the proportion of agreement specific to category i is:
$$p_{s}(i) = \frac{2n_{ii}}{n_{i.} + n_{.i}}. \tag{6}$$
Statistical significance
Eq. (6) amounts to collapsing the C × C table into a 2×2 table relative to category i, considering this category a 'positive' rating, and then computing the positive agreement (PA) index of Eq. (2). This is done for each category i successively. In each reduced table one may perform a test of statistical independence using Cohen's kappa, the odds ratio, or chi-squared, or use a Fisher exact test.
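The equivalence between Eq. (6) and PA computed on the collapsed table can be checked with a short sketch. The function names and the example table are hypothetical, and categories are indexed from 0 in the code.

```python
def specific_agreement(table, i):
    """p_s(i) of Eq. (6) for category index i (0-based)."""
    row_total = sum(table[i])
    col_total = sum(row[i] for row in table)
    return 2 * table[i][i] / (row_total + col_total)


def collapse_to_2x2(table, i):
    """Collapse a C x C table to cells (a, b, c, d), treating category i as 'positive'."""
    n = sum(sum(row) for row in table)
    a = table[i][i]
    b = sum(table[i]) - a                  # Rater 1 says i, Rater 2 says other
    c = sum(row[i] for row in table) - a   # Rater 2 says i, Rater 1 says other
    d = n - a - b - c
    return a, b, c, d


# Hypothetical 3x3 table of counts.
table = [[30, 5, 2], [4, 25, 6], [3, 5, 20]]
for i in range(len(table)):
    a, b, c, d = collapse_to_2x2(table, i)
    pa = 2 * a / (2 * a + b + c)           # PA of Eq. (2) on the collapsed table
    print(f"category {i + 1}: p_s = {specific_agreement(table, i):.3f}, PA = {pa:.3f}")
```

The cells returned by `collapse_to_2x2` are also what one would pass to a test of independence on the reduced table.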
Standard errors and confidence limits
We now consider generalized formulas for the proportions of overall and
specific agreement. They apply to binary, ordered category, or nominal
ratings and permit any number of raters, with potentially different
numbers of raters or different raters for each case.
Let there be K rated cases indexed by k = 1, ..., K. The ratings made on case k are summarized as:
{n_{jk}} (j = 1, ..., C) = {n_{1k}, n_{2k}, ..., n_{Ck}}
where n_{jk} is the number of times category j (j = 1, ..., C) is applied to case k. For example, if a case k is rated five times and receives ratings of 1, 1, 1, 2, and 2, then n_{1k} = 3, n_{2k} = 2, and {n_{jk}} = {3, 2}.
Let n_{k} denote the total number of ratings made on case k; that is,
$$n_{k} = \sum_{j=1}^{C} n_{jk}. \tag{7}$$
For case k, the number of actual agreements on rating level j is
n_{jk} (n_{jk} - 1). (8)
The total number of agreements specifically on rating level j, across all cases is
$$S(j) = \sum_{k=1}^{K} n_{jk} (n_{jk} - 1). \tag{9}$$
The number of possible agreements specifically on category j for case k is equal to
n_{jk} (n_{k} - 1) (10)
and the number of possible agreements on category j across all cases is:
$$S_{poss}(j) = \sum_{k=1}^{K} n_{jk} (n_{k} - 1). \tag{11}$$
The proportion of agreement specific to category j is equal to the total number of agreements on category j divided by the total number of opportunities for agreement on category j, or
$$p_{s}(j) = \frac{S(j)}{S_{poss}(j)}. \tag{12}$$
The total number of actual agreements, regardless of category, is equal to the sum of Eq. (9) across all categories, or
$$O = \sum_{j=1}^{C} S(j). \tag{13}$$

The total number of possible agreements is
$$O_{poss} = \sum_{k=1}^{K} n_{k} (n_{k} - 1). \tag{14}$$

Dividing Eq. (13) by Eq. (14) gives the overall proportion of observed agreement, or
$$p_{o} = \frac{O}{O_{poss}}. \tag{15}$$
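Eqs. (7) to (15) can be sketched as a single pass over the cases. The function name `generalized_agreement` and the data layout (one list of category counts per case) are assumptions made for illustration, and the example data are hypothetical.

```python
def generalized_agreement(cases, n_categories):
    """Return (p_o, [p_s(1), ..., p_s(C)]) from per-case category counts n_jk."""
    s = [0] * n_categories        # S(j), Eq. (9)
    s_poss = [0] * n_categories   # S_poss(j), Eq. (11)
    o_poss = 0                    # O_poss, Eq. (14)
    for counts in cases:
        n_k = sum(counts)         # Eq. (7)
        for j, n_jk in enumerate(counts):
            s[j] += n_jk * (n_jk - 1)        # Eq. (8)
            s_poss[j] += n_jk * (n_k - 1)    # Eq. (10)
        o_poss += n_k * (n_k - 1)
    p_s = [s[j] / s_poss[j] if s_poss[j] else float("nan")
           for j in range(n_categories)]     # Eq. (12)
    p_o = sum(s) / o_poss if o_poss else float("nan")  # Eqs. (13) and (15)
    return p_o, p_s


# Hypothetical two-rater, two-category data: 40 cases rated (+,+),
# 15 cases with one rating of each kind, 45 cases rated (-,-).
cases = [[2, 0]] * 40 + [[1, 1]] * 15 + [[0, 2]] * 45
p_o, p_s = generalized_agreement(cases, n_categories=2)
print(f"p_o = {p_o:.3f}")                    # prints: p_o = 0.850
print("p_s =", [round(x, 3) for x in p_s])   # prints: p_s = [0.842, 0.857]
```

With two raters and two categories, these values coincide with p_{o}, PA and NA computed from a 2×2 table with a = 40, b + c = 15 and d = 45, which is a useful sanity check on the generalized formulas.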
The jackknife or, preferably, the nonparametric bootstrap can be used to estimate standard errors of p_{s}(j) and p_{o} in the generalized case. The bootstrap is uncomplicated if one assumes cases are independent and identically distributed (iid). In general, this assumption will be accepted when:
In these cases, one may construct each simulated sample by repeated random sampling with replacement from the set of K cases.
If cases cannot be assumed iid (for example, if ratings are not missing at random, or, say, a study systematically rotates raters), simple modifications of the bootstrap method, such as two-stage sampling, can be made.
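For the simple iid situation, resampling whole cases with replacement might be sketched as follows. This covers only the iid case, not the two-stage variant; the function names, the number of replicates, and the example data are hypothetical.

```python
import random
import statistics


def overall_agreement(cases):
    """p_o of Eq. (15); each case is a list of category counts n_jk."""
    agree = sum(n_jk * (n_jk - 1) for counts in cases for n_jk in counts)
    possible = sum(sum(counts) * (sum(counts) - 1) for counts in cases)
    return agree / possible if possible else None


def bootstrap_se(cases, statistic, n_boot=1000, seed=0):
    """Bootstrap standard error of `statistic`, resampling the K cases."""
    rng = random.Random(seed)
    values = []
    for _ in range(n_boot):
        sample = rng.choices(cases, k=len(cases))  # K cases drawn with replacement
        value = statistic(sample)
        if value is not None:  # skip replicates where the statistic is undefined
            values.append(value)
    return statistics.stdev(values)


# Hypothetical data: 50 cases, each rated three times on two categories.
cases = [[3, 0]] * 20 + [[0, 3]] * 15 + [[2, 1]] * 10 + [[1, 2]] * 5
print(f"p_o = {overall_agreement(cases):.3f}")
print(f"bootstrap SE = {bootstrap_se(cases, overall_agreement):.3f}")
```

Because `bootstrap_se` takes the statistic as an argument, the same routine could be reused for p_{s}(j) by passing a function that computes Eq. (12).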
The parametric bootstrap can be used for significance testing. A variation of this method, patterned after the Monte Carlo approach described by Uebersax (1982), is as follows:
    Loop through s, where s indexes simulated data sets
        Loop through all cases k
            Loop through all ratings on case k
                For each actual rating, generate a random simulated rating, chosen such that:
                    Pr(Rating category=j|Rater=i) = base rate of category j for Rater i.
                If rater identities are unknown or for a reproducibility study,
                the total base rate for category j is used.
            End loop through case k's ratings
        End loop through cases
        Calculate p^{*}_{o} and p^{*}_{s}(j) (and any other statistics of interest) for sample s.
    End main loop
The significance of p_{o}, p_{s}(j), or any other statistic
calculated, is determined with reference to the distribution of corresponding
values in the simulated data sets. For example, p_{o} is significant at
the .05 level (1-tailed) if it exceeds 95% of the p^{*}_{o}
values obtained for the simulated data sets.
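The procedure above might be sketched as follows for the situation where rater identities are unknown, so that the total base rate of each category is used for every simulated rating. Only p^{*}_{o} is computed here, to keep the sketch short; the function names, the number of simulated data sets, and the example data are hypothetical.

```python
import random


def overall_agreement(cases):
    """p_o of Eq. (15); each case is a list of category counts n_jk."""
    agree = sum(n_jk * (n_jk - 1) for counts in cases for n_jk in counts)
    possible = sum(sum(counts) * (sum(counts) - 1) for counts in cases)
    return agree / possible if possible else float("nan")


def monte_carlo_test(cases, n_sim=1000, seed=0):
    """Return (p_o, fraction of simulated p*_o values that p_o exceeds)."""
    rng = random.Random(seed)
    n_cat = len(cases[0])
    # Total (pooled) base rates, used as sampling weights for every rating.
    totals = [sum(counts[j] for counts in cases) for j in range(n_cat)]
    p_obs = overall_agreement(cases)
    exceeded = 0
    for _ in range(n_sim):                   # loop over simulated data sets
        simulated = []
        for counts in cases:                 # loop over cases
            sim_counts = [0] * n_cat
            # One simulated rating for each actual rating on this case.
            for j in rng.choices(range(n_cat), weights=totals, k=sum(counts)):
                sim_counts[j] += 1
            simulated.append(sim_counts)
        if p_obs > overall_agreement(simulated):
            exceeded += 1
    return p_obs, exceeded / n_sim


# Hypothetical data: 45 cases, each rated three times on two categories.
cases = [[3, 0]] * 20 + [[0, 3]] * 20 + [[2, 1]] * 5
p_obs, fraction = monte_carlo_test(cases)
print(f"p_o = {p_obs:.3f}; exceeds {fraction:.1%} of simulated values")
```

Rater-specific base rates would require keeping track of which rater made each rating, which this count-based data layout does not record.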
Cicchetti DV, Feinstein AR. High agreement but low kappa: II. Resolving the paradoxes. Journal of Clinical Epidemiology, 1990, 43, 551-558.
Cohen J. A coefficient of agreement for nominal scales. Educational and Psychological Measurement, 1960, 20, 37-46.
Cook RJ, Farewell VT. Conditional inference for subject-specific and marginal agreement: two families of agreement measures. Canadian Journal of Statistics, 1995, 23, 333-344.
Efron B. The jackknife, the bootstrap and other resampling plans. Philadelphia: Society for Industrial and Applied Mathematics, 1982.
Efron B, Tibshirani RJ. An introduction to the bootstrap. New York: Chapman and Hall, 1993.
Fleiss JL. Measuring nominal scale agreement among many raters. Psychological Bulletin, 1971, 76, 378-381.
Fleiss JL. Statistical methods for rates and proportions, 2nd Ed. New York: John Wiley, 1981.
Graham P, Bull B. Approximate standard errors and confidence intervals for indices of positive and negative agreement. Journal of Clinical Epidemiology, 1998, 51(9), 763-771.
Mackinnon A. A spreadsheet for the calculation of comprehensive statistics for the assessment of diagnostic tests and inter-rater agreement. Computers in Biology and Medicine, 2000, 30, 127-134.
Spitzer R, Fleiss J. A re-analysis of the reliability of psychiatric diagnosis. British Journal of Psychiatry, 1974, 341-47.
Uebersax JS. A design-independent method for measuring the reliability of psychiatric diagnosis. Journal of Psychiatric Research, 1982-1983, 17(4), 335-342.
Rev: 05 Aug 2014