Much neglected, raw agreement indices are important descriptive statistics. They have unique common-sense value. A study that reports only simple agreement rates can be very useful; a study that omits them but reports complex statistics may fail to inform readers at a practical level.
Raw agreement measures and their calculation are explained below. We examine first the case of agreement between two raters on dichotomous ratings.
Consider the ratings of two raters (or experts, judges, diagnostic procedures, etc.) summarized by Table 1:
| Rater 1 \ Rater 2 | + | - | total |
|---|---|---|---|
| + | a | b | a + b |
| - | c | d | c + d |
| total | a + c | b + d | N |
The values a, b, c and d here denote the observed frequencies for
each possible combination of ratings by Rater 1 and Rater 2.
 
 
Proportion of overall agreement
The proportion of overall agreement (po) is the proportion of cases for which Raters 1 and 2 agree. That is:
          a + d          a + d
po  =  --------------  =  -----.     (1)
        a + b + c + d       N

This proportion is informative and useful but, taken by itself, has limitations. One is that it does not distinguish between agreement on positive ratings and agreement on negative ratings.
Consider, for example, an epidemiological application where a positive rating corresponds to a positive diagnosis for a very rare disease -- one, say, with a prevalence of 1 in 1,000,000. Here we might not be much impressed if po is very high -- even above .99. This result would be due almost entirely to agreement on disease absence; we are not directly informed as to whether diagnosticians agree on disease presence.
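To make Eq. (1) concrete, here is a minimal Python sketch using hypothetical counts for a low-prevalence condition; the point is that po is driven almost entirely by the d cell (agreement on absence).

```python
# Proportion of overall agreement, Eq. (1): po = (a + d) / N.
# The counts below are hypothetical, chosen so that agreement on the
# rare "positive" category contributes almost nothing to po.

def overall_agreement(a, b, c, d):
    """Return po for a 2x2 agreement table with cells a, b, c, d."""
    n = a + b + c + d
    return (a + d) / n

# Hypothetical low-prevalence table: 2 agreements on presence,
# 6 disagreements, 992 agreements on absence.
a, b, c, d = 2, 3, 3, 992
print(overall_agreement(a, b, c, d))   # 0.994 -- driven almost entirely by d
```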
Further, one may consider Cohen's (1960) criticism of po: that it can be high even with hypothetical raters who randomly guess on each case according to probabilities equal to the observed base rates. In this example, if both raters simply guessed "negative" the large majority of times, they would usually agree on the diagnosis. Cohen proposed to remedy this by comparing po to a corresponding quantity, pc, the proportion of agreement expected from raters who randomly guess. As described on the kappa coefficients page, this logic is questionable; in particular, it is not clear what advantage there is in comparing an actual level of agreement, po, with a hypothetical value, pc, that would occur under an obviously unrealistic model.
A much simpler way to address this issue is described immediately below.
 
 
Positive agreement and negative agreement
We may also compute observed agreement relative to each rating category individually.
Generically the resulting indices are called the proportions of specific agreement (Cicchetti & Feinstein, 1990;
Spitzer & Fleiss, 1974).
With binary ratings, there are two such indices, positive agreement (PA) and
negative agreement (NA).
They are calculated as follows:
         2a                  2d
PA =  ----------;   NA =  ----------.     (2)
      2a + b + c          2d + b + c
PA, for example, estimates the conditional probability that, given that one of the raters,
randomly selected, makes a positive rating, the other rater will also do so.
A joint consideration of PA and NA addresses the potential concern that,
when base rates are extreme, po is liable to chance-related inflation or bias.
Such inflation, if it exists at all, would affect only the more frequent category. Thus if both PA and
NA are satisfactorily large, there is arguably less need or purpose in comparing actual to chance-
predicted agreement using a kappa statistic.
But in any case, PA and NA provide more information relevant to understanding and improving ratings
than a single omnibus index (see Cicchetti and Feinstein, 1990).
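Continuing the hypothetical low-prevalence counts used in the earlier sketch, a minimal Python version of Eq. (2) shows how PA and NA separate the two kinds of agreement that po lumps together.

```python
# Proportions of specific agreement, Eq. (2):
#   PA = 2a / (2a + b + c),   NA = 2d / (2d + b + c).

def specific_agreement(a, b, c, d):
    pa = 2 * a / (2 * a + b + c)   # positive agreement
    na = 2 * d / (2 * d + b + c)   # negative agreement
    return pa, na

# Same hypothetical low-prevalence table as above.
a, b, c, d = 2, 3, 3, 992
pa, na = specific_agreement(a, b, c, d)
print(round(pa, 3), round(na, 3))   # 0.4 and 0.997: po looked excellent, PA does not
```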
 
 
Significance, standard errors, interval estimation
Statistical significance. In testing the significance of po, the null hypothesis is that raters are independent, with their marginal assignment probabilities equal to the observed marginal proportions. For a 2×2 table, this is the same as the usual test of statistical independence in a contingency table; any of the standard tests (a chi-squared test, a Fisher exact test, or the significance test of Cohen's kappa or of the odds ratio) could be used.
Standard error.
One can use standard methods applicable to proportions to estimate the standard error and confidence
limits of po.
For a sample size N, the standard error of
po is:
SE(po) = sqrt[po(1 - po)/N] (3.1)
One can alternatively estimate SE(po) using resampling methods, e.g., the
nonparametric bootstrap or the jackknife, as described in the next
section.
Confidence intervals. The Wald or "normal approximation" method
estimates confidence limits of a proportion as follows:
CL = po - SE × zcrit (3.2)
CU = po + SE × zcrit (3.3)
where SE here is SE(po) as estimated by Eq. (3.1), CL and CU are the lower and
upper confidence limits, and zcrit is the z-value associated with a confidence
range with coverage probability crit. For a 95% confidence range, zcrit = 1.96;
for a 90% confidence range, zcrit = 1.645.
When po is either very large or very small (and especially with small sample sizes) the Wald method may produce confidence limits less than 0 or greater than 1; in this case better approximate methods (see Agresti, 1996), exact methods, or resampling methods (see below) can be used instead.
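As an illustration of Eqs. (3.1)-(3.3), the following Python sketch computes po with its Wald limits; the 2×2 counts are hypothetical.

```python
import math

def wald_ci_po(a, b, c, d, z_crit=1.96):
    """po with its Wald standard error and confidence limits, Eqs. (3.1)-(3.3)."""
    n = a + b + c + d
    po = (a + d) / n
    se = math.sqrt(po * (1 - po) / n)    # Eq. (3.1)
    lower = po - z_crit * se             # Eq. (3.2)
    upper = po + z_crit * se             # Eq. (3.3)
    # For extreme po or small N these limits can fall outside [0, 1];
    # see the caveat above about better approximate, exact, or resampling methods.
    return po, se, (lower, upper)

print(wald_ci_po(40, 10, 15, 35))        # hypothetical counts, 95% limits
```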
Statistical significance. Logically, there is only one test of independence in a 2×2 table; therefore if PA significantly differs from chance, so too would NA, and vice versa. Spitzer and Fleiss (1974) described kappa tests for specific rating levels; in a 2×2 there are two such "specific kappas", but both have the same value and statistical significance as the overall kappa.
Standard errors.
SE(PA) = sqrt[4a (c + b)(a + c + b)] / (2a + b + c)^2     (3.4)

SE(NA) = sqrt[4d (c + b)(d + c + b)] / (2d + b + c)^2     (3.5)
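A direct Python transcription of Eqs. (3.4) and (3.5) as given above might look as follows; the counts are hypothetical.

```python
import math

def se_specific_agreement(a, b, c, d):
    """Asymptotic standard errors of PA and NA, as printed in Eqs. (3.4)-(3.5)."""
    se_pa = math.sqrt(4 * a * (b + c) * (a + b + c)) / (2 * a + b + c) ** 2
    se_na = math.sqrt(4 * d * (b + c) * (d + b + c)) / (2 * d + b + c) ** 2
    return se_pa, se_na

print(se_specific_agreement(40, 10, 15, 35))   # hypothetical 2x2 counts
```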
Confidence intervals. Wald-type confidence limits for PA and NA can be obtained by combining the standard errors above with Eqs. (3.2) and (3.3); alternatively, one may estimate them by bootstrapping.
An advantage of bootstrapping is that one can use the same simulated data sets to estimate not only the standard errors and confidence limits of PA and NA, but also those of po or any other statistic defined for the 2×2 table.
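Here is a minimal nonparametric bootstrap sketch in Python, assuming the N cases are iid draws over the four cells of Table 1; it returns percentile confidence limits for po, PA, and NA from the same resamples. The counts are hypothetical.

```python
import random

def bootstrap_ci(a, b, c, d, n_boot=2000, alpha=0.05, seed=1):
    """Percentile bootstrap confidence limits for po, PA, and NA from a 2x2 table."""
    rng = random.Random(seed)
    cases = ["a"] * a + ["b"] * b + ["c"] * c + ["d"] * d    # one cell label per case
    stats = {"po": [], "PA": [], "NA": []}
    for _ in range(n_boot):
        resample = [rng.choice(cases) for _ in cases]        # resample cases with replacement
        ra, rb, rc, rd = (resample.count(x) for x in "abcd")
        n = ra + rb + rc + rd
        stats["po"].append((ra + rd) / n)
        if 2 * ra + rb + rc:
            stats["PA"].append(2 * ra / (2 * ra + rb + rc))
        if 2 * rd + rb + rc:
            stats["NA"].append(2 * rd / (2 * rd + rb + rc))
    limits = {}
    for name, vals in stats.items():
        vals.sort()
        lo = vals[int(alpha / 2 * len(vals))]
        hi = vals[int((1 - alpha / 2) * len(vals)) - 1]
        limits[name] = (round(lo, 3), round(hi, 3))
    return limits

print(bootstrap_ci(40, 10, 15, 35))   # hypothetical counts
```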
A SAS program to estimate the asymptotic standard errors and confidence limits of PA and NA has been written. A standalone program (executable and Fortran 90 source code) that supplies both bootstrap and asymptotic standard errors and confidence limits can be downloaded here.
Readers are referred to Graham and Bull (1998) for fuller coverage of this
topic, including a comparison of different methods for estimating confidence
intervals for PA and NA.
 
We now consider results for two raters making polytomous (either ordered category or purely nominal) ratings. Let C denote the number of rating categories or levels. Results for the two raters may be summarized as a C × C table such as Table 2.
| Rater 1 \ Rater 2 | 1 | 2 | ... | C | total |
|---|---|---|---|---|---|
| 1 | n11 | n12 | ... | n1C | n1. |
| 2 | n21 | n22 | ... | n2C | n2. |
| ... | ... | ... | ... | ... | ... |
| C | nC1 | nC2 | ... | nCC | nC. |
| total | n.1 | n.2 | ... | n.C | N |
Here nij denotes the number of cases assigned rating category i by
Rater 1 and category j by Rater 2, with i, j = 1, ..., C. When a "."
appears in a subscript, it denotes a marginal sum over the corresponding
index; e.g., ni. is the sum of nij for j = 1, ...,
C, or the row marginal sum for category i; n.. = N
denotes the total number of cases.
 
Overall Agreement
For this design, po is the sum of the frequencies on the main
diagonal of table {nij} divided by the sample size, or
C
po = 1/N SUM nii (4)
i=1
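As a quick illustration of Eq. (4), here is a Python sketch with a hypothetical 3 × 3 table.

```python
# Eq. (4): po = (sum of the diagonal frequencies) / N.
def overall_agreement_cxc(table):
    n = sum(sum(row) for row in table)
    diag = sum(table[i][i] for i in range(len(table)))
    return diag / n

# Hypothetical 3x3 cross-classification of two raters' ratings.
table = [[20, 3, 1],
         [4, 15, 2],
         [0, 2, 13]]
print(overall_agreement_cxc(table))   # 0.8
```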
Statistical significance
The significance of po can be tested with a parametric bootstrap. The procedure generates many simulated data sets of size N under the null hypothesis that the raters are independent, assigning each case to cell (i, j) with probability equal to the product of the observed marginal proportions,

        ni. n.j
πij  =  -------,     (5)
          N^2

and tabulates overall agreement, denoted p*o, for each simulated sample. The po for the actual data is considered statistically significant at, say, the .05 level if it exceeds 95% of the p*o values.
If one already has a computer program for nonparametric bootstrapping, only slight modifications are needed to adapt it to perform a parametric bootstrap significance test.
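A sketch of that parametric bootstrap test in Python, assuming the observed table is available as a nested list; cell probabilities follow Eq. (5), the cutoff follows the rule just described, and the counts are hypothetical.

```python
import random

def parametric_bootstrap_p(table, n_boot=2000, seed=1):
    """One-tailed bootstrap p-value for po under rater independence, using Eq. (5)."""
    rng = random.Random(seed)
    c = len(table)
    n = sum(sum(row) for row in table)
    row_marg = [sum(table[i][j] for j in range(c)) / n for i in range(c)]
    col_marg = [sum(table[i][j] for i in range(c)) / n for j in range(c)]
    cells = [(i, j) for i in range(c) for j in range(c)]
    probs = [row_marg[i] * col_marg[j] for (i, j) in cells]     # pi_ij of Eq. (5)
    po_obs = sum(table[i][i] for i in range(c)) / n

    exceed = 0
    for _ in range(n_boot):
        draws = rng.choices(cells, weights=probs, k=n)          # simulated sample of N cases
        po_star = sum(1 for (i, j) in draws if i == j) / n
        if po_star >= po_obs:
            exceed += 1
    return exceed / n_boot                                      # small values indicate significance

table = [[20, 3, 1], [4, 15, 2], [0, 2, 13]]                    # hypothetical counts
print(parametric_bootstrap_p(table))
```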
Specific agreement
With respect to Table 2, the proportion of agreement specific to category i
is:
2nii
ps(i) = ---------. (6)
ni. + n.i
Statistical significance
Eq. (6) amounts to collapsing the C × C table into a 2×2 table relative to category i, considering this category a 'positive' rating, and then computing the positive agreement (PA) index of Eq. (2). This is done for each category i successively. In each reduced table one may perform a test of statistical independence using Cohen's kappa, the odds ratio, or chi-squared, or use a Fisher exact test.
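The following Python sketch computes Eq. (6) for each category of a C × C table and also returns the collapsed 2×2 counts that the independence tests just mentioned would use; the table is hypothetical.

```python
def specific_agreement_cxc(table):
    """ps(i) of Eq. (6) and the collapsed 2x2 counts for each category i."""
    c = len(table)
    n = sum(sum(row) for row in table)
    results = []
    for i in range(c):
        row_i = sum(table[i])                       # n_i.
        col_i = sum(table[r][i] for r in range(c))  # n_.i
        a = table[i][i]                             # both raters chose category i
        b = row_i - a                               # Rater 1 chose i, Rater 2 did not
        cc = col_i - a                              # Rater 2 chose i, Rater 1 did not
        d = n - a - b - cc                          # neither rater chose i
        ps = 2 * a / (row_i + col_i)                # Eq. (6)
        results.append((ps, (a, b, cc, d)))
    return results

table = [[20, 3, 1], [4, 15, 2], [0, 2, 13]]        # hypothetical counts
for i, (ps, cells) in enumerate(specific_agreement_cxc(table), start=1):
    print(f"category {i}: ps = {ps:.3f}, collapsed 2x2 = {cells}")
```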
Standard errors and confidence limits
Before proceeding to the fully general case, it will help to look at the simpler situation of estimating specific positive agreement given multiple binary ratings.
For a given case with two or more binary (positive/negative) ratings, let n and m denote the number of ratings and the number of positive ratings, respectively. For this case there are exactly y = m(m − 1) observed pairwise agreements on a positive rating, and x = m(n − 1) opportunities for such agreement. If we compute x and y for each case and sum both over all cases, then the sum of y divided by the sum of x is the proportion of specific positive agreement in the entire sample.
This SAS program illustrates the calculations.
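A minimal Python sketch of the calculation just described may also be helpful; the per-case rating counts are hypothetical.

```python
def specific_positive_agreement(cases):
    """cases: list of (n, m) pairs -- total ratings and positive ratings per case."""
    agree = sum(m * (m - 1) for n, m in cases)        # sum of y = m(m - 1)
    possible = sum(m * (n - 1) for n, m in cases)     # sum of x = m(n - 1)
    return agree / possible

# Hypothetical data: each tuple is (number of ratings, number positive) for one case.
cases = [(3, 3), (3, 2), (4, 1), (3, 0), (4, 4)]
print(round(specific_positive_agreement(cases), 3))   # 0.8
```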
We may now proceed to fully generalized formulas for the proportions of overall and
specific agreement. They apply to binary, ordered category, or nominal
ratings and permit any number of raters, with potentially different
numbers of raters or different raters for each case.
 
Let there be K rated cases indexed by k = 1, ..., K. The ratings made
on case k are summarized as:
{njk} (j = 1, ..., C) = {n1k, n2k, ..., nCk}
where njk is the number of times category j (j = 1, ..., C) is applied to case k. For example, if a case k is rated five times and receives ratings of 1, 1, 1, 2, and 2, then n1k = 3, n2k = 2, and {njk} = {3, 2}.
Let nk denote the total number of ratings made on case k; that is,
C
nk = SUM njk. (7)
j=1
For case k, the number of actual agreements on rating level j is
njk (njk - 1). (8)
The total number of agreements specifically on rating level j, across all
cases is
K
S(j) = SUM njk (njk - 1). (9)
k=1
The number of possible agreements specifically on category j for case k is
equal to
njk (nk - 1) (10)
and the number of possible agreements on category j across all cases is:
K
Sposs(j) = SUM njk (nk - 1). (11)
k=1
The proportion of agreement specific to category j is equal to the total
number of agreements on category j divided by the
total number of opportunities for agreement on category j, or
S(j)
ps(j) = -------. (12)
Sposs(j)
The total number of actual agreements, regardless of category,
is equal to the sum of Eq. (9) across all categories, or
C
O = SUM S(j). (13)
j=1
The total number of possible agreements is
K
Oposs = SUM nk (nk - 1). (14)
k=1
Dividing Eq. (13) by Eq. (14) gives the overall proportion of
observed agreement, or
O
po = ------. (15)
Oposs
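Eqs. (7)-(15) translate directly into code. The Python sketch below takes, for each case, the count vector {njk} and returns ps(j) for every category together with po; the example data are hypothetical.

```python
def generalized_agreement(count_vectors):
    """count_vectors: list over cases; each entry is [n_1k, ..., n_Ck]."""
    c = len(count_vectors[0])
    s = [0] * c          # S(j), Eq. (9): actual agreements on category j
    s_poss = [0] * c     # Sposs(j), Eq. (11): possible agreements on category j
    o = 0                # O, Eq. (13)
    o_poss = 0           # Oposs, Eq. (14)
    for njk in count_vectors:
        nk = sum(njk)                        # Eq. (7)
        o_poss += nk * (nk - 1)              # Eq. (14)
        for j, n in enumerate(njk):
            s[j] += n * (n - 1)              # Eq. (9)
            s_poss[j] += n * (nk - 1)        # Eq. (11)
            o += n * (n - 1)                 # Eq. (13)
    ps = [s[j] / s_poss[j] if s_poss[j] else float("nan") for j in range(c)]  # Eq. (12)
    po = o / o_poss                          # Eq. (15)
    return ps, po

# Hypothetical data: three categories, varying numbers of ratings per case.
cases = [[3, 2, 0], [4, 0, 0], [1, 1, 2], [0, 3, 1]]
ps, po = generalized_agreement(cases)
print([round(p, 3) for p in ps], round(po, 3))
```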
The jackknife or, preferably, the nonparametric bootstrap can be used to estimate standard errors of ps(j) and po in the generalized case. The bootstrap is uncomplicated if one can assume cases are independent and identically distributed (iid).
When this assumption is reasonable, one may construct each simulated sample by repeated random sampling with replacement from the set of K cases.
If cases cannot be assumed iid (for example, if ratings are not missing at random or a study systematically rotates raters), simple modifications of the bootstrap method, such as two-stage sampling, can be made.
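Under the iid assumption, case resampling is only a few lines of code. Here is a minimal Python sketch of a bootstrap standard error for po (hypothetical data; a two-stage or otherwise modified resampling scheme could replace the simple resampling step when the iid assumption fails).

```python
import random

def po_from_cases(count_vectors):
    """Overall agreement po of Eq. (15) from per-case category counts."""
    o = sum(n * (n - 1) for njk in count_vectors for n in njk)
    o_poss = sum(sum(njk) * (sum(njk) - 1) for njk in count_vectors)
    return o / o_poss

def bootstrap_se_po(count_vectors, n_boot=1000, seed=1):
    """Bootstrap standard error of po: resample whole cases with replacement."""
    rng = random.Random(seed)
    k = len(count_vectors)
    po_stars = []
    for _ in range(n_boot):
        resample = [count_vectors[rng.randrange(k)] for _ in range(k)]
        po_stars.append(po_from_cases(resample))
    mean = sum(po_stars) / n_boot
    var = sum((p - mean) ** 2 for p in po_stars) / (n_boot - 1)
    return var ** 0.5

cases = [[3, 2, 0], [4, 0, 0], [1, 1, 2], [0, 3, 1]]   # hypothetical, as above
print(round(bootstrap_se_po(cases), 3))
```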
The parametric bootstrap can be used for significance testing. A variation of this method, patterned after the Monte Carlo approach described by Uebersax (1982), is as follows:
Loop through s, where s indexes simulated data sets
    Loop through all cases k
        Loop through all ratings on case k
            For each actual rating, generate a random simulated rating, chosen such that:
                Pr(Rating category = j | Rater = i) = base rate of category j for Rater i.
            (If rater identities are unknown, or for a reproducibility study, the total base rate for category j is used.)
        End loop through case k's ratings
    End loop through cases
    Calculate p*o and p*s(j) (and any other statistics of interest) for sample s.
End main loop
The significance of po, ps(j), or any other statistic
calculated, is determined with reference to the distribution of corresponding
values in the simulated data sets. For example, po is significant at
the .05 level (1-tailed) if it exceeds 95% of the p*o
values obtained for the simulated data sets.
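Below is a Python rendering of the pseudocode above for the situation in which rater identities are unknown or a reproducibility study is involved (so the total base rate of each category is used); the per-case count vectors are hypothetical.

```python
import random

def po_from_cases(count_vectors):
    """Overall agreement po of Eq. (15) from per-case category counts."""
    o = sum(n * (n - 1) for njk in count_vectors for n in njk)
    o_poss = sum(sum(njk) * (sum(njk) - 1) for njk in count_vectors)
    return o / o_poss

def parametric_bootstrap_po(count_vectors, n_boot=2000, seed=1):
    """One-tailed p-value for po: simulate ratings from the overall category base rates."""
    rng = random.Random(seed)
    c = len(count_vectors[0])
    totals = [sum(njk[j] for njk in count_vectors) for j in range(c)]
    grand = sum(totals)
    base_rates = [t / grand for t in totals]          # Pr(rating = j), rater identity unknown
    po_obs = po_from_cases(count_vectors)

    exceed = 0
    for _ in range(n_boot):
        sim = []
        for njk in count_vectors:
            nk = sum(njk)                             # same number of ratings as the real case
            draws = rng.choices(range(c), weights=base_rates, k=nk)
            sim.append([draws.count(j) for j in range(c)])
        if po_from_cases(sim) >= po_obs:
            exceed += 1
    return exceed / n_boot                            # small values indicate significant agreement

cases = [[3, 2, 0], [4, 0, 0], [1, 1, 2], [0, 3, 1]]  # hypothetical
print(parametric_bootstrap_po(cases))
```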
 
Cicchetti DV, Feinstein AR. High agreement but low kappa: II. Resolving the paradoxes. Journal of Clinical Epidemiology, 1990, 43, 551-558.
Cohen J. A coefficient of agreement for nominal scales. Educational and Psychological Measurement, 1960, 20, 37-46.
Cook RJ, Farewell VT. Conditional inference for subject-specific and marginal agreement: two families of agreement measures. Canadian Journal of Statistics, 1995, 23, 333-344.
Efron B. The jackknife, the bootstrap and other resampling plans. Philadelphia: Society for Industrial and Applied Mathematics, 1982.
Efron B, Tibshirani RJ. An introduction to the bootstrap. New York: Chapman and Hall, 1993.
Fleiss JL. Measuring nominal scale agreement among many raters. Psychological Bulletin, 1971, 76, 378-381.
Fleiss JL. Statistical methods for rates and proportions, 2nd Ed. New York: John Wiley, 1981.
Graham P, Bull B. Approximate standard errors and confidence intervals for indices of positive and negative agreement. Journal of Clinical Epidemiology, 1998, 51(9), 763-771.
Mackinnon, A. A spreadsheet for the calculation of comprehensive statistics for the assessment of diagnostic tests and inter-rater agreement. Computers in Biology and Medicine, 2000, 30, 127-134.
Spitzer R, Fleiss J. A re-analysis of the reliability of psychiatric diagnosis. British Journal of Psychiatry, 1974, 125, 341-347.
Uebersax JS. A design-independent method for measuring the reliability of psychiatric diagnosis. Journal of Psychiatric Research, 1982-83, 17(4), 335-342.
Rev: 19 Sep 2018 (SAS program for multiple binary ratings)