Raw Agreement Indices



Introduction

Much neglected, raw agreement indices are important descriptive statistics. They have unique common-sense value. A study that reports only simple agreement rates can be very useful; a study that omits them but reports complex statistics may fail to inform readers at a practical level.

Raw agreement measures and their calculation are explained below. We examine first the case of agreement between two raters on dichotomous ratings.


Two Raters, Dichotomous Ratings

Consider the ratings of two raters (or experts, judges, diagnostic procedures, etc.) summarized by Table 1:

Table 1
Summary of binary ratings by two raters

                          Rater 2
                     +       -      total
     Rater 1    +    a       b      a + b
                -    c       d      c + d
            total  a + c   b + d      N

The values a, b, c and d here denote the observed frequencies for each possible combination of ratings by Rater 1 and Rater 2.  
 

Proportion of overall agreement

The proportion of overall agreement (po) is the proportion of cases for which Raters 1 and 2 agree. That is:

                a + d         a + d
     po  =  -------------  =  -----.    (1)
            a + b + c + d       N
This proportion is informative and useful but, taken by itself, has limitations. One is that it does not distinguish between agreement on positive ratings and agreement on negative ratings.
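
To make the calculation concrete, the following minimal Python sketch applies Eq. (1) to hypothetical cell counts for Table 1 (the counts are illustrative only, not from an actual study):

     # Proportion of overall agreement (Eq. 1) for a hypothetical 2x2 table.
     a, b, c, d = 40, 5, 10, 45   # cell counts laid out as in Table 1 (hypothetical)

     N = a + b + c + d
     po = (a + d) / N
     print(f"po = {po:.3f}")      # (40 + 45) / 100 = 0.850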

Consider, for example, an epidemiological application where a positive rating corresponds to a positive diagnosis for a very rare disease -- one, say, with a prevalence of 1 in 1,000,000. Here we might not be much impressed if po is very high -- even above .99. This result would be due almost entirely to agreement on disease absence; we are not directly informed as to whether diagnosticians agree on disease presence.

Further, one may consider Cohen's (1960) criticism of po: that it can be high even with hypothetical raters who randomly guess on each case according to probabilities equal to the observed base rates. In this example, if both raters simply guessed "negative" the large majority of the time, they would usually agree on the diagnosis. Cohen proposed to remedy this by comparing po to a corresponding quantity, pc, the proportion of agreement expected from raters who randomly guess. As described on the kappa coefficients page, this logic is questionable; in particular, it is not clear what advantage there is in comparing an actual level of agreement, po, with a hypothetical value, pc, that would occur under an obviously unrealistic model.

A much simpler way to address this issue is described immediately below.  
 

Positive agreement and negative agreement

We may also compute observed agreement relative to each rating category individually. Generically, the resulting indices are called the proportions of specific agreement (Cicchetti & Feinstein, 1990; Spitzer & Fleiss, 1974). With binary ratings, there are two such indices, positive agreement (PA) and negative agreement (NA). They are calculated as follows:
 
                2a                    2d
     PA  =  ----------;    NA  =  ----------.    (2)
            2a + b + c            2d + b + c
PA, for example, estimates the conditional probability that, given that one of the raters, selected at random, makes a positive rating, the other rater will also do so.

A joint consideration of PA and NA addresses the potential concern that, when base rates are extreme, po is liable to chance-related inflation or bias. Such inflation, if it exists at all, would affect only the more frequent category. Thus, if both PA and NA are satisfactorily large, there is arguably less need or purpose in comparing actual to chance-predicted agreement using a kappa statistic. In any case, PA and NA provide more information relevant to understanding and improving ratings than a single omnibus index (see Cicchetti & Feinstein, 1990).
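
A similarly minimal Python sketch of Eq. (2), using the same hypothetical cell counts, shows how PA and NA are obtained from the 2×2 table:

     # Positive agreement (PA) and negative agreement (NA), Eq. (2),
     # for a hypothetical 2x2 table with cells a, b, c, d as in Table 1.
     a, b, c, d = 40, 5, 10, 45

     PA = 2 * a / (2 * a + b + c)   # agreement specific to positive ratings
     NA = 2 * d / (2 * d + b + c)   # agreement specific to negative ratings
     print(f"PA = {PA:.3f}, NA = {NA:.3f}")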
 

Significance, standard errors, interval estimation

Proportion of overall agreement

Statistical significance. In testing the significance of po, the null hypothesis is that raters are independent, with their marginal assignment probabilities equal to the observed marginal proportions. For a 2×2 table, the test is the same as a usual test of statistical independence in a contingency table; for example, a chi-squared test, a Fisher exact test, a test of the odds ratio, or a significance test of Cohen's kappa could be used.

A potential advantage of a kappa significance test is that the magnitude of kappa can be interpreted as approximately an intra-class correlation coefficient. These tests can be done with SAS PROC FREQ.

Standard error. One can use standard methods applicable to proportions to estimate the standard error and confidence limits of po. For a sample size N, the standard error of po is:

     SE(po) = sqrt[po(1 - po)/N]    (3.1) 
One can alternatively estimate SE(po) using resampling methods, e.g., the nonparametric bootstrap or the jackknife, as described in the next section.

Confidence intervals. The Wald or "normal approximation" method estimates confidence limits of a proportion as follows:

     CL = po - SE × zcrit    (3.2)
     CU = po + SE × zcrit    (3.3)
where SE here is SE(po) as estimated by Eq. (3.1), CL and CU are the lower and upper confidence limits, and zcrit is the standard normal critical value for the chosen confidence level. For a 95% confidence interval, zcrit = 1.96; for a 90% interval, zcrit = 1.645.

When po is either very large or very small (and especially with small sample sizes) the Wald method may produce confidence limits less than 0 or greater than 1; in this case better approximate methods (see Agresti, 1996), exact methods, or resampling methods (see below) can be used instead.
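
The following Python sketch illustrates Eqs. (3.1)-(3.3) for a hypothetical po and N; it simply applies the Wald formulas and is not a substitute for the better methods just mentioned when po is extreme:

     import math

     # Standard error (Eq. 3.1) and Wald confidence limits (Eqs. 3.2-3.3) for po.
     po, N = 0.85, 100                  # hypothetical values
     z_crit = 1.96                      # 95% confidence level

     se = math.sqrt(po * (1 - po) / N)  # Eq. (3.1)
     cl = po - z_crit * se              # lower limit, Eq. (3.2)
     cu = po + z_crit * se              # upper limit, Eq. (3.3)
     print(f"SE = {se:.4f}, 95% CI = ({cl:.3f}, {cu:.3f})")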

Positive agreement and negative agreement

Statistical significance. Logically, there is only one test of independence in a 2×2 table; therefore if PA significantly differs from chance, so too would NA, and vice versa. Spitzer and Fleiss (1974) described kappa tests for specific rating levels; in a 2×2 table there are two such "specific kappas", but both have the same value and statistical significance as the overall kappa.

Standard errors. Asymptotic (large-sample) standard errors for PA and NA have been derived; see Graham and Bull (1998).

Alternatively, one can estimate standard errors using the nonparametric bootstrap or the jackknife. With the bootstrap, one draws a large number of samples of N cases, with replacement, from the observed cases, computes PA (or NA) for each such sample, and takes the standard deviation of the resulting values as the estimated standard error. With the jackknife, the statistic is recomputed N times, each time leaving out one case, and the standard error is estimated from the variation among these leave-one-out values.

Confidence intervals. Approximate confidence limits for PA and NA can be obtained with the Wald method of Eqs. (3.2) and (3.3), using the corresponding asymptotic or bootstrap standard errors, or taken directly from percentiles of the bootstrap distribution (e.g., the 2.5th and 97.5th percentiles for a 95% interval).

An advantage of bootstrapping is that one can use the same simulated data sets to estimate not only the standard errors and confidence limits of PA and NA, but also those of po or any other statistic defined for the 2×2 table.
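
As an illustration of this point, the Python sketch below (again with hypothetical cell counts; this is not one of the downloadable programs mentioned in the next paragraph) draws bootstrap samples from a 2×2 table and obtains standard errors and percentile confidence limits for PA, NA, and po together:

     import numpy as np

     rng = np.random.default_rng(12345)

     # Hypothetical 2x2 cell counts (a, b, c, d) as in Table 1.
     a, b, c, d = 40, 5, 10, 45
     # Expand the table into one entry per case:
     # 0 = (+,+), 1 = (+,-), 2 = (-,+), 3 = (-,-).
     cases = np.repeat([0, 1, 2, 3], [a, b, c, d])

     def indices(cells):
         """Return (PA, NA, po) from a length-4 array of cell counts."""
         a_, b_, c_, d_ = cells
         pa = 2 * a_ / (2 * a_ + b_ + c_)
         na = 2 * d_ / (2 * d_ + b_ + c_)
         po = (a_ + d_) / cells.sum()
         return pa, na, po

     B = 2000
     stats = np.empty((B, 3))
     for i in range(B):
         resample = rng.choice(cases, size=cases.size, replace=True)
         stats[i] = indices(np.bincount(resample, minlength=4))

     se = stats.std(axis=0, ddof=1)
     ci = np.percentile(stats, [2.5, 97.5], axis=0)
     for name, s, lo, hi in zip(["PA", "NA", "po"], se, ci[0], ci[1]):
         print(f"{name}: bootstrap SE = {s:.4f}, 95% CI = ({lo:.3f}, {hi:.3f})")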

A SAS program to estimate the asymptotic standard errors and asymptotic confidence limits of PA and NA has been written. A standalone program (executable and Fortran 90 source code) that supplies both bootstrap and asymptotic standard errors and confidence limits can be downloaded here.

Readers are referred to Graham and Bull (1998) for fuller coverage of this topic, including a comparison of different methods for estimating confidence intervals for PA and NA.
 


Two Raters, Polytomous Ratings

We now consider results for two raters making polytomous (either ordered category or purely nominal) ratings. Let C denote the number of rating categories or levels. Results for the two raters may be summarized as a C × C table such as Table 2.

Table 2
Summary of polytomous ratings by two raters

                              Rater 2
                    1     2     ...     C      total
     Rater 1   1   n11   n12    ...    n1C      n1.
               2   n21   n22    ...    n2C      n2.
               .    .     .             .        .
               .    .     .             .        .
               C   nC1   nC2    ...    nCC      nC.
           total   n.1   n.2    ...    n.C       N

Here nij denotes the number of cases assigned rating category i by Rater 1 and category j by Rater 2, with i, j = 1, ..., C. When a "." appears in a subscript, it denotes a marginal sum over the corresponding index; e.g., ni. is the sum of nij for j = 1, ..., C, or the row marginal sum for category i; n.. = N denotes the total number of cases.

Overall Agreement

For this design, po is the sum of frequencies of the main diagonal of table {nij} divided by sample size, or

                  C
     po  =  1/N  SUM  nii    (4)
                 i=1
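
For example, a minimal Python sketch of Eq. (4) for a hypothetical 3×3 table:

     import numpy as np

     # Hypothetical 3x3 cross-classification {nij} of two raters (Table 2 layout).
     n = np.array([[20,  3,  2],
                   [ 4, 25,  6],
                   [ 1,  5, 34]])

     po = np.trace(n) / n.sum()   # Eq. (4): sum of the diagonal divided by N
     print(f"po = {po:.3f}")      # 79 / 100 = 0.790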

Statistical significance

As in the 2×2 case, the significance of po may be assessed by testing the null hypothesis that the two raters' classifications are statistically independent, using, for example, a chi-squared test on the C × C table or a significance test of the overall kappa coefficient.
Standard error and confidence limits. Here the standard error and confidence intervals of po can again be calculated with the methods described for 2×2 tables.  

Specific agreement

With respect to Table 2, the proportion of agreement specific to category i is:

                  2nii
     ps(i)  =  ---------.    (6)
               ni. + n.i

Statistical significance

Eq. (6) amounts to collapsing the C × C table into a 2×2 table relative to category i, considering this category a 'positive' rating, and then computing the positive agreement (PA) index of Eq. (2). This is done for each category i successively. In each reduced table one may perform a test of statistical independence using Cohen's kappa, the odds ratio, or chi-squared, or use a Fisher exact test.
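
The Python sketch below, using the same kind of hypothetical 3×3 table, computes ps(i) for each category directly from Eq. (6) and confirms that it equals the PA index of the corresponding collapsed 2×2 table:

     import numpy as np

     # Hypothetical 3x3 table {nij}, laid out as in Table 2.
     n = np.array([[20,  3,  2],
                   [ 4, 25,  6],
                   [ 1,  5, 34]])
     N = n.sum()

     for i in range(n.shape[0]):
         # Proportion of agreement specific to category i, Eq. (6).
         ps_i = 2 * n[i, i] / (n[i, :].sum() + n[:, i].sum())

         # Equivalent view: collapse to a 2x2 table with category i as 'positive'.
         a = n[i, i]                    # both raters chose category i
         b = n[i, :].sum() - n[i, i]    # Rater 1 chose i, Rater 2 did not
         c = n[:, i].sum() - n[i, i]    # Rater 2 chose i, Rater 1 did not
         d = N - a - b - c              # neither rater chose i
         pa_collapsed = 2 * a / (2 * a + b + c)

         print(f"category {i + 1}: ps = {ps_i:.3f} (collapsed PA = {pa_collapsed:.3f})")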

Standard errors and confidence limits

The standard error and confidence limits of each ps(i) can be estimated by applying the methods described above for PA to the corresponding collapsed 2×2 table, or by using the bootstrap.
 

Generalized Case

Before proceeding to the fully general case, it will help to look at the simpler situation of estimating specific positive agreement given multiple binary ratings.

For a given case with two or more binary (positive/negative) ratings, let n and m denote the number of ratings and the number of positive ratings, respectively. For this case there are exactly y = m(m - 1) observed pairwise agreements on a positive rating, and x = m(n - 1) opportunities for such agreement. If we compute x and y for each case and sum both terms over all cases, the sum of y divided by the sum of x is the proportion of specific positive agreement in the entire sample.

This SAS program illustrates the calculations.
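
The SAS program itself is not reproduced here; the following Python sketch, with hypothetical ratings, carries out the same calculation:

     # Specific positive agreement from multiple binary ratings per case.
     # Each inner list holds one case's ratings (1 = positive, 0 = negative);
     # cases may have different numbers of ratings. Hypothetical data.
     cases = [
         [1, 1, 1, 0],     # n = 4 ratings, m = 3 positive
         [0, 0, 1],        # n = 3, m = 1
         [1, 1],           # n = 2, m = 2
         [0, 0, 0, 0, 1],  # n = 5, m = 1
     ]

     sum_y = 0  # observed pairwise agreements on a positive rating
     sum_x = 0  # opportunities for such agreement
     for ratings in cases:
         n = len(ratings)
         m = sum(ratings)
         sum_y += m * (m - 1)
         sum_x += m * (n - 1)

     ps_pos = sum_y / sum_x
     print(f"specific positive agreement = {ps_pos:.3f}")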

We may now proceed to fully generalized formulas for the proportions of overall and specific agreement. They apply to binary, ordered category, or nominal ratings and permit any number of raters, with potentially different numbers of raters or different raters for each case.  

Specific agreement

Let there be K rated cases indexed by k = 1, ..., K. The ratings made on case k are summarized as:

     {njk} (j = 1, ..., C) = {n1k, n2k, ..., nCk}

where njk is the number of times category j (j = 1, ..., C) is applied to case k. For example, if a case k is rated five times and receives ratings of 1, 1, 1, 2, and 2, then n1k = 3, n2k = 2, and {njk} = {3, 2}.

Let nk denote the total number of ratings made on case k; that is,

           C
     nk = SUM  njk.     (7)
          j=1

For case k, the number of actual agreements on rating level j is

     njk (njk - 1).     (8)

The total number of agreements specifically on rating level j, across all cases is

              K
     S(j) =  SUM njk (njk - 1).    (9)
             k=1

The number of possible agreements specifically on category j for case k is equal to

     njk (nk - 1)     (10)

and the number of possible agreements on category j across all cases is:

                   K
     Sposs(j)  =  SUM njk (nk - 1).     (11)
                  k=1

The proportion of agreement specific to category j is equal to the total number of agreements on category j divided by the total number of opportunities for agreement on category j, or

                S(j)
     ps(j)  =  -------.     (12)
               Sposs(j)

Overall agreement

The total number of actual agreements, regardless of category, is equal to the sum of Eq. (9) across all categories, or

            C
     O  =  SUM  S(j).     (13)
           j=1
The total number of possible agreements is
                K
     Oposs  =  SUM  nk (nk - 1).     (14)
               k=1
Dividing Eq. (13) by Eq. (14) gives the overall proportion of observed agreement, or
             O
     po =  ------.     (15)
            Oposs
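
A Python sketch of Eqs. (7)-(15), using hypothetical ratings in which the number of ratings varies from case to case:

     from collections import Counter

     # Generalized proportions of specific and overall agreement, Eqs. (7)-(15).
     # Each inner list holds the ratings given to one case; the number of ratings
     # may differ by case. Hypothetical data with C = 3 categories.
     cases = [
         [1, 1, 2],
         [2, 2, 2, 3],
         [3, 3, 1],
         [1, 1, 1, 1],
         [2, 3, 3],
     ]
     categories = [1, 2, 3]

     S = Counter()        # S(j): agreements on category j, Eq. (9)
     S_poss = Counter()   # Sposs(j): possible agreements on category j, Eq. (11)
     O_poss = 0           # total possible agreements, Eq. (14)

     for ratings in cases:
         n_k = len(ratings)                        # Eq. (7)
         for j, n_jk in Counter(ratings).items():  # {njk}
             S[j] += n_jk * (n_jk - 1)             # Eq. (8), summed over cases
             S_poss[j] += n_jk * (n_k - 1)         # Eq. (10), summed over cases
         O_poss += n_k * (n_k - 1)

     O = sum(S.values())                           # Eq. (13)

     for j in categories:
         print(f"ps({j}) = {S[j] / S_poss[j]:.3f}")   # Eq. (12)
     print(f"po = {O / O_poss:.3f}")                  # Eq. (15)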

Standard errors, interval estimation, significance

The jackknife or, preferably, the nonparametric bootstrap can be used to estimate standard errors of ps(j) and po in the generalized case. The bootstrap is uncomplicated if one assumes cases are independent and identically distributed (iid); in general, this assumption is reasonable when cases are sampled randomly and each case is rated independently of the others.

In that situation, one may construct each simulated sample by repeated random sampling with replacement from the set of K cases.

If cases cannot be assumed iid (for example, if ratings are not missing at random or, say, a study systematically rotates raters), simple modifications of the bootstrap method, such as two-stage sampling, can be made.
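
A minimal Python sketch of this case-resampling bootstrap, assuming iid cases and using hypothetical data, estimates the standard error of po:

     import random
     from collections import Counter

     random.seed(2024)

     # Nonparametric bootstrap standard error of po in the generalized case.
     # Hypothetical data: one inner list of ratings per case, assumed iid.
     cases = [
         [1, 1, 2],
         [2, 2, 2, 3],
         [3, 3, 1],
         [1, 1, 1, 1],
         [2, 3, 3],
     ]

     def overall_agreement(case_list):
         """po of Eq. (15): observed / possible pairwise agreements."""
         O, O_poss = 0, 0
         for ratings in case_list:
             n_k = len(ratings)
             O += sum(m * (m - 1) for m in Counter(ratings).values())
             O_poss += n_k * (n_k - 1)
         return O / O_poss

     B = 2000
     boot = [overall_agreement(random.choices(cases, k=len(cases)))
             for _ in range(B)]

     mean = sum(boot) / B
     se = (sum((x - mean) ** 2 for x in boot) / (B - 1)) ** 0.5
     print(f"po = {overall_agreement(cases):.3f}, bootstrap SE = {se:.3f}")

Percentile confidence limits can be taken from the same set of bootstrap values, and ps(j) can be handled in exactly the same way.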

The parametric bootstrap can be used for significance testing. A variation of this method, patterned after the Monte Carlo approach described by Uebersax (1982), is as follows:

   Loop through s, where s indexes simulated data sets
        Loop through all cases k
             Loop through all ratings on case k
 
                  For each actual rating, generate a
                  random simulated rating, chosen such that:
 
                     Pr(Rating category=j|Rater=i) = base
                     rate of category j for Rater i.
 
                  If rater identities are unknown or for a
                  reproducibility study, the total base rate
                  for category j is used.
 
             End loop through case k's ratings
        End loop through cases
        Calculate p*o and p*s(j) (and any other
        statistics of interest) for sample s.
   End main loop

The significance of po, ps(j), or any other statistic calculated, is determined with reference to the distribution of corresponding values in the simulated data sets. For example, po is significant at the .05 level (1-tailed) if it exceeds 95% of the p*o values obtained for the simulated data sets.
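
A Python sketch of this procedure for po, with hypothetical data and rater identities treated as unknown (so that the pooled base rates of the categories are used), might look as follows:

     import random
     from collections import Counter

     random.seed(7)

     # Parametric-bootstrap (Monte Carlo) significance test of po.
     # Hypothetical data: one inner list of ratings per case.
     cases = [
         [1, 1, 2],
         [2, 2, 2, 3],
         [3, 3, 1],
         [1, 1, 1, 1],
         [2, 3, 3],
     ]

     def overall_agreement(case_list):
         """po of Eq. (15): observed / possible pairwise agreements."""
         O, O_poss = 0, 0
         for ratings in case_list:
             n_k = len(ratings)
             O += sum(m * (m - 1) for m in Counter(ratings).values())
             O_poss += n_k * (n_k - 1)
         return O / O_poss

     # Pooled base rates of each category across all ratings.
     all_ratings = [r for ratings in cases for r in ratings]
     levels = sorted(set(all_ratings))
     weights = [all_ratings.count(j) for j in levels]

     po_obs = overall_agreement(cases)

     S = 2000
     po_sim = []
     for _ in range(S):
         # Simulate a data set with the same layout (same number of ratings
         # per case) but with ratings drawn independently from the base rates.
         sim = [random.choices(levels, weights=weights, k=len(ratings))
                for ratings in cases]
         po_sim.append(overall_agreement(sim))

     p_value = sum(p >= po_obs for p in po_sim) / S   # one-tailed
     print(f"observed po = {po_obs:.3f}, Monte Carlo p = {p_value:.3f}")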
 


References

Agresti, A. (1996). An Introduction to Categorical Data Analysis. New York: Wiley.

Cicchetti, D. V., & Feinstein, A. R. (1990). High agreement but low kappa: II. Resolving the paradoxes. Journal of Clinical Epidemiology, 43, 551-558.

Cohen, J. (1960). A coefficient of agreement for nominal scales. Educational and Psychological Measurement, 20, 37-46.

Graham, P., & Bull, B. (1998). Approximate standard errors and confidence intervals for indices of positive and negative agreement. Journal of Clinical Epidemiology, 51, 763-771.

Spitzer, R. L., & Fleiss, J. L. (1974). A re-analysis of the reliability of psychiatric diagnosis. British Journal of Psychiatry, 125, 341-347.

Uebersax, J. S. (1982). A design-independent method for measuring the reliability of psychiatric diagnosis. Journal of Psychiatric Research, 17, 335-342.