## Raw Agreement Indices


### Introduction

Much neglected, raw agreement indices are important descriptive statistics. They have unique common-sense value. A study that reports only simple agreement rates can be very useful; a study that omits them but reports complex statistics may fail to inform readers at a practical level.

Raw agreement measures and their calculation are explained below. We examine first the case of agreement between two raters on dichotomous ratings.

### Two Raters, Dichotomous Ratings

Consider the ratings of two raters (or experts, judges, diagnostic procedures, etc.) summarized by Table 1:

Table 1
Summary of binary ratings by two raters

|            | Rater 2: + | Rater 2: - | Total |
|------------|------------|------------|-------|
| Rater 1: + | a          | b          | a + b |
| Rater 1: - | c          | d          | c + d |
| Total      | a + c      | b + d      | N     |

The values a, b, c and d here denote the observed frequencies for each possible combination of ratings by Rater 1 and Rater 2.

#### Proportion of overall agreement

The proportion of overall agreement (po) is the proportion of cases for which Raters 1 and 2 agree. That is:

```
              a + d          a + d
    po  =  -------------  =  -----     (1)
           a + b + c + d       N
```
This proportion is informative and useful but, taken by itself, has limitations. One is that it does not distinguish between agreement on positive ratings and agreement on negative ratings.

Consider, for example, an epidemiological application where a positive rating corresponds to a positive diagnosis for a very rare disease -- one, say, with a prevalence of 1 in 1,000,000. Here we might not be much impressed if po is very high -- even above .99. This result would be due almost entirely to agreement on disease absence; we are not directly informed as to whether diagnosticians agree on disease presence.

Further, one may consider Cohen's (1960) criticism of po: that it can be high even with hypothetical raters who randomly guess on each case according to probabilities equal to the observed base rates. In the present example, if both raters simply guessed "negative" the large majority of the time, they would usually agree on the diagnosis. Cohen proposed to remedy this by comparing po to a corresponding quantity, pc, the proportion of agreement expected of raters who randomly guess. As described on the kappa coefficients page, this logic is questionable; in particular, it is not clear what advantage there is in comparing an actual level of agreement, po, with a hypothetical value, pc, that would occur under an obviously unrealistic model.

A much simpler way to address this issue is described immediately below.

#### Positive agreement and negative agreement

We may also compute observed agreement relative to each rating category individually. Generically, the resulting indices are called the proportions of specific agreement (Cicchetti & Feinstein, 1990; Spitzer & Fleiss, 1974). With binary ratings, there are two such indices, positive agreement (PA) and negative agreement (NA). They are calculated as follows:

```
               2a                    2d
    PA  =  ----------;    NA  =  ----------     (2)
           2a + b + c            2d + b + c
```
PA, for example, estimates the conditional probability that, given that one of the raters, randomly selected, makes a positive rating, the other rater will also do so.

A joint consideration of PA and NA addresses the potential concern that, when base rates are extreme, po is liable to chance-related inflation or bias. Such inflation, if it exists at all, would affect only the more frequent category. Thus if both PA and NA are satisfactorily large, there is arguably less need or purpose in comparing actual to chance-predicted agreement using a kappa statistic. In any case, PA and NA provide more information relevant to understanding and improving ratings than a single omnibus index (see Cicchetti and Feinstein, 1990).
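As a concrete illustration, the following sketch computes po, PA, and NA directly from Eqs. (1) and (2); the cell counts used here are hypothetical.

```python
# Sketch of Eqs. (1) and (2): overall, positive, and negative agreement
# for a 2x2 table. Cells a, b, c, d follow the layout of Table 1.

def agreement_2x2(a, b, c, d):
    """Return (po, PA, NA) for a 2x2 agreement table."""
    n = a + b + c + d
    po = (a + d) / n              # Eq. (1): proportion of overall agreement
    pa = 2 * a / (2 * a + b + c)  # Eq. (2): positive agreement
    na = 2 * d / (2 * d + b + c)  # Eq. (2): negative agreement
    return po, pa, na

po, pa, na = agreement_2x2(a=40, b=9, c=6, d=45)  # hypothetical counts
```

Note that PA and NA weight the off-diagonal cells b and c equally, which is why each index conditions on a randomly selected rater.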

#### Proportion of overall agreement

Statistical significance. In testing the significance of po, the null hypothesis is that raters are independent, with their marginal assignment probabilities equal to the observed marginal proportions. For a 2×2 table, the test is the same as a usual test of statistical independence in a contingency table; any standard test -- e.g., a chi-squared test, a Fisher exact test, a test of the odds ratio, or a significance test of Cohen's kappa -- could potentially be used.

A potential advantage of a kappa significance test is that the magnitude of kappa can be interpreted as approximately an intra-class correlation coefficient. All of these tests can be done with SAS PROC FREQ.

Standard error. One can use standard methods applicable to proportions to estimate the standard error and confidence limits of po. For a sample size N, the standard error of po is:

`     SE(po) = sqrt[po(1 - po)/N]    (3.1) `
One can alternatively estimate SE(po) using resampling methods, e.g., the nonparametric bootstrap or the jackknife, as described in the next section.

Confidence intervals. The Wald or "normal approximation" method estimates confidence limits of a proportion as follows:

```
    CL  =  po - SE × zcrit     (3.2)
    CU  =  po + SE × zcrit     (3.3)
```
where SE here is SE(po) as estimated by Eq. (3.1), CL and CU are the lower and upper confidence limits, and zcrit is the standard normal quantile associated with the desired coverage probability. For a 95% confidence range, zcrit = 1.96; for a 90% confidence range, zcrit = 1.645.

When po is either very large or very small (and especially with small sample sizes) the Wald method may produce confidence limits less than 0 or greater than 1; in this case better approximate methods (see Agresti, 1996), exact methods, or resampling methods (see below) can be used instead.
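A minimal sketch of Eqs. (3.1)-(3.3) follows; the clipping to [0, 1] is only a crude guard against the overshoot just described, not a substitute for the better methods cited above.

```python
import math

def wald_ci(p, n, z=1.96):
    """Wald confidence limits for a proportion (Eqs. 3.1-3.3),
    clipped to [0, 1] when the normal approximation overshoots."""
    se = math.sqrt(p * (1 - p) / n)          # Eq. (3.1)
    return max(0.0, p - z * se), min(1.0, p + z * se)

lo, hi = wald_ci(p=0.85, n=100)              # 95% limits for po = .85, N = 100
```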

#### Positive agreement and negative agreement

Statistical significance. Logically, there is only one test of independence in a 2×2 table; therefore if PA significantly differs from chance, so too does NA, and vice versa. Spitzer and Fleiss (1974) described kappa tests for specific rating levels; in a 2×2 table there are two such "specific kappas," but both have the same value and statistical significance as the overall kappa.

Standard errors.

• Graham and Bull (1998) and Mackinnon (2000) used the delta method to derive formulas for the asymptotic (large sample) standard errors of PA and NA. As given by Mackinnon (2000; p. 130), the formulas are:
```
    SE(PA)  =  sqrt[4a(b + c)(a + b + c)] / (2a + b + c)^2     (3.4)

    SE(NA)  =  sqrt[4d(b + c)(d + b + c)] / (2d + b + c)^2     (3.5)
```
Alternatively, one can estimate standard errors using the nonparametric bootstrap or the jackknife. These are described with reference to PA as follows:

• With the nonparametric bootstrap (Efron & Tibshirani, 1993), one constructs a large number of simulated data sets of size N by sampling with replacement from the observed data. For a 2×2 table, this can be done simply by using random numbers to assign simulated cases to cells with probabilities a/N, b/N, c/N and d/N (with large N, more efficient algorithms are preferable). One then computes the proportion of positive agreement for each simulated data set -- which we denote PA*. The standard deviation of PA* across all simulated data sets estimates the standard error SE(PA).

• The delete-1 (Efron, 1982) jackknife works by calculating PA for four alternative tables where 1 is subtracted from each of the four cells of the original 2 × 2 table. A few simple calculations then provide an estimate of the standard error SE(PA). The delete-1 jackknife requires less computation, but the nonparametric bootstrap is usually considered more accurate.
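The asymptotic formulas (3.4) and (3.5) can be sketched as follows; the cell counts used are hypothetical.

```python
import math

def se_pa_na(a, b, c, d):
    """Asymptotic standard errors of PA and NA per Mackinnon (2000),
    Eqs. (3.4) and (3.5)."""
    se_pa = math.sqrt(4 * a * (b + c) * (a + b + c)) / (2 * a + b + c) ** 2
    se_na = math.sqrt(4 * d * (b + c) * (d + b + c)) / (2 * d + b + c) ** 2
    return se_pa, se_na

se_pa, se_na = se_pa_na(a=40, b=9, c=6, d=45)   # hypothetical counts
```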

Confidence intervals.

• Asymptotic confidence limits for PA and NA can be obtained as in Eqs. 3.2 and 3.3, substituting PA and NA for po and using the asymptotic standard errors given by Eqs. 3.4 and 3.5.

• Alternatively, the nonparametric bootstrap can be used. Again, we describe the method for PA. As with bootstrap standard error estimation, one generates a large number (e.g., 100,000) of simulated data sets, computing an estimate PA* for each one. Results are then sorted by increasing value of PA*. Confidence limits of PA are obtained with reference to the percentiles of this ranking. For example, the 95% confidence range of PA is estimated by the values of PA* that correspond to the 2.5th and 97.5th percentiles of this distribution.

An advantage of bootstrapping is that one can use the same simulated data sets to estimate not only the standard errors and confidence limits of PA and NA, but also those of po or any other statistic defined for the 2×2 table.
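The bootstrap procedure just described can be sketched as below, combining standard-error and percentile confidence-limit estimation for PA; the cell counts and number of resamples are illustrative only.

```python
import random

def bootstrap_pa(a, b, c, d, n_boot=2000, seed=1):
    """Nonparametric bootstrap for PA: resample N cases with replacement
    from the observed cell proportions, recompute PA* for each simulated
    table, and return (SE estimate, 95% percentile confidence limits)."""
    rng = random.Random(seed)
    n = a + b + c + d
    cells = [a, b, c, d]
    stats = []
    for _ in range(n_boot):
        counts = [0, 0, 0, 0]
        for _ in range(n):                 # draw N cases with replacement
            r = rng.randrange(n)
            for i in range(4):
                r -= cells[i]
                if r < 0:
                    counts[i] += 1
                    break
        a2, b2, c2, d2 = counts
        denom = 2 * a2 + b2 + c2
        if denom:                          # skip degenerate resamples
            stats.append(2 * a2 / denom)   # PA* for this simulated table
    mean = sum(stats) / len(stats)
    se = (sum((s - mean) ** 2 for s in stats) / (len(stats) - 1)) ** 0.5
    stats.sort()                           # percentile confidence limits
    lo = stats[int(0.025 * len(stats))]
    hi = stats[int(0.975 * len(stats)) - 1]
    return se, (lo, hi)

se, (lo, hi) = bootstrap_pa(a=40, b=9, c=6, d=45)
```

As noted above, the same resampled tables could also be used to estimate po, NA, or any other statistic of the 2×2 table.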

A SAS program to estimate the asymptotic standard errors and asymptotic confidence limits of PA and NA has been written. A standalone program (executable program and Fortran 90 source code) that supplies both bootstrap and asymptotic standard errors and confidence limits can be downloaded here.

Readers are referred to Graham and Bull (1998) for fuller coverage of this topic, including a comparison of different methods for estimating confidence intervals for PA and NA.


### Two Raters, Polytomous Ratings

We now consider results for two raters making polytomous (either ordered category or purely nominal) ratings. Let C denote the number of rating categories or levels. Results for the two raters may be summarized as a C × C table such as Table 2.

Table 2
Summary of polytomous ratings by two raters

| Rater 1 \ Rater 2 | 1   | 2   | ... | C   | Total |
|-------------------|-----|-----|-----|-----|-------|
| 1                 | n11 | n12 | ... | n1C | n1.   |
| 2                 | n21 | n22 | ... | n2C | n2.   |
| ...               | ... | ... | ... | ... | ...   |
| C                 | nC1 | nC2 | ... | nCC | nC.   |
| Total             | n.1 | n.2 | ... | n.C | N     |

Here nij denotes the number of cases assigned rating category i by Rater 1 and category j by Rater 2, with i, j = 1, ..., C. When a "." appears in a subscript, it denotes a marginal sum over the corresponding index; e.g., ni. is the sum of nij for j = 1, ..., C, or the row marginal sum for category i; n.. = N denotes the total number of cases.

#### Overall Agreement

For this design, po is the sum of frequencies of the main diagonal of table {nij} divided by sample size, or

```
                  C
    po  =  1/N   SUM  nii     (4)
                 i=1
```

Statistical significance

• One may test the statistical significance of po with Cohen's kappa. If kappa is significant/nonsignificant, then po may be assumed significant/nonsignificant, and vice versa. Note that the numerator of kappa is the difference between po and the level of agreement expected under the null hypothesis of statistical independence.

• The parametric bootstrap can also be used to test statistical significance. This is like the nonparametric bootstrap already described, except that samples are generated from the null hypothesis distribution. Specifically, one constructs many -- say 5000 -- simulated samples of size N from the probability distribution {πij}, where
```
             ni. n.j
    πij  =  ---------     (5)
               N^2
```

and then tabulates overall agreement, denoted p*o, for each simulated sample. The po for the actual data is considered statistically significant if it exceeds a specified percentage (e.g., 95%) of the p*o values.

If one already has a computer program for nonparametric bootstrapping, only slight modifications are needed to adapt it to perform a parametric bootstrap significance test.
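The parametric bootstrap test just described can be sketched as follows; the 2×2 table and the number of simulated samples are illustrative only.

```python
import random

def parametric_bootstrap_p(table, n_sims=2000, seed=1):
    """One-tailed parametric-bootstrap p-value for po in a CxC table:
    simulate tables of the same N under independence and count how often
    the simulated overall agreement p*o reaches the observed po."""
    rng = random.Random(seed)
    C = len(table)
    N = sum(sum(row) for row in table)
    row = [sum(table[i]) for i in range(C)]
    col = [sum(table[i][j] for i in range(C)) for j in range(C)]
    po = sum(table[i][i] for i in range(C)) / N
    # Cell probabilities under independence: pi_ij = (ni./N)(n.j/N)
    probs = [(i, j, row[i] * col[j] / N ** 2)
             for i in range(C) for j in range(C)]
    exceed = 0
    for _ in range(n_sims):
        agree = 0
        for _ in range(N):                 # simulate one case at a time
            u = rng.random()
            for i, j, p in probs:
                u -= p
                if u < 0:
                    agree += (i == j)
                    break
        if agree / N >= po:                # p*o at least as large as po
            exceed += 1
    return exceed / n_sims                 # small value: po is significant

p_value = parametric_bootstrap_p([[40, 9], [6, 45]])
```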

Standard error and confidence limits. Here the standard error and confidence intervals of po can again be calculated with the methods described for 2×2 tables.

#### Specific agreement

With respect to Table 2, the proportion of agreement specific to category i is:

```
                2nii
    ps(i)  =  ---------     (6)
              ni. + n.i
```

Statistical significance

Eq. (6) amounts to collapsing the C × C table into a 2×2 table relative to category i, considering this category a 'positive' rating, and then computing the positive agreement (PA) index of Eq. (2). This is done for each category i successively. In each reduced table one may perform a test of statistical independence using Cohen's kappa, the odds ratio, or chi-squared, or use a Fisher exact test.
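Eq. (6) can be illustrated as follows; the 3×3 table of counts is hypothetical.

```python
def specific_agreement(table):
    """Proportion of specific agreement ps(i) for each category of a
    CxC table, per Eq. (6): 2*nii / (ni. + n.i)."""
    C = len(table)
    out = []
    for i in range(C):
        row_i = sum(table[i])                      # ni.
        col_i = sum(table[r][i] for r in range(C)) # n.i
        out.append(2 * table[i][i] / (row_i + col_i))
    return out

# Hypothetical 3x3 table of counts for two raters
ps = specific_agreement([[20, 3, 2], [4, 15, 1], [1, 2, 12]])
```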

Standard errors and confidence limits

• Again, for each category i, we may collapse the original C × C table into a 2×2 table, taking i as the 'positive' rating level. The asymptotic standard error formula of Eq. (3.4) for PA may then be used, and the Wald method confidence limits given by Eqs. (3.2) and (3.3) may be computed.

• Alternatively, one can use the nonparametric bootstrap to estimate standard errors and/or confidence limits. Note that this does not require a successive collapsing of the original table.

• The delete-1 jackknife can be used to estimate standard errors, but this does require successive collapsings of the C × C table.


### Generalized Case

We now consider generalized formulas for the proportions of overall and specific agreement. They apply to binary, ordered category, or nominal ratings and permit any number of raters, with potentially different numbers of raters or different raters for each case.

#### Specific agreement

Let there be K rated cases indexed by k = 1, ..., K. The ratings made on case k are summarized as:

```     {njk} (j = 1, ..., C) = {n1k, n2k, ..., nCk}
```

where njk is the number of times category j (j = 1, ..., C) is applied to case k. For example, if a case k is rated five times and receives ratings of 1, 1, 1, 2, and 2, then n1k = 3, n2k = 2, and {njk} = {3, 2}.

Let nk denote the total number of ratings made on case k; that is,

```
            C
    nk  =  SUM  njk     (7)
           j=1
```

For case k, the number of actual agreements on rating level j is

```     njk (njk - 1).     (8)
```

The total number of agreements specifically on rating level j, across all cases is

```
              K
    S(j)  =  SUM  njk (njk - 1)     (9)
             k=1
```

The number of possible agreements specifically on category j for case k is equal to

```     njk (nk - 1)     (10)
```

and the number of possible agreements on category j across all cases is:

```
                  K
    Sposs(j)  =  SUM  njk (nk - 1)     (11)
                 k=1
```

The proportion of agreement specific to category j is equal to the total number of agreements on category j divided by the total number of opportunities for agreement on category j, or

```
                 S(j)
    ps(j)  =  ----------     (12)
               Sposs(j)
```

#### Overall agreement

The total number of actual agreements, regardless of category, is equal to the sum of Eq. (9) across all categories, or

```
            C
    O  =  SUM  S(j)     (13)
          j=1
```
The total number of possible agreements is
```
                 K
    Oposs  =  SUM  nk (nk - 1)     (14)
                k=1
```
Dividing Eq. (13) by Eq. (14) gives the overall proportion of observed agreement, or
```
              O
    po  =  -------     (15)
            Oposs
```
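Eqs. (7)-(15) can be combined in one short sketch. The per-case counts below are hypothetical, except that the first case reproduces the 1, 1, 1, 2, 2 example above, i.e., {njk} = {3, 2}.

```python
def generalized_agreement(counts):
    """Generalized agreement from per-case category counts.
    counts[k][j] = number of times category j was applied to case k
    (the {njk} of the text). Returns (po, [ps(j) for each category])."""
    C = len(counts[0])
    S = [0] * C          # agreements per category, Eq. (9)
    Sposs = [0] * C      # possible agreements per category, Eq. (11)
    for njk in counts:
        nk = sum(njk)    # total ratings on this case, Eq. (7)
        for j in range(C):
            S[j] += njk[j] * (njk[j] - 1)        # Eq. (8)
            Sposs[j] += njk[j] * (nk - 1)        # Eq. (10)
    O = sum(S)                                    # Eq. (13)
    Oposs = sum(Sposs)                            # Eq. (14)
    po = O / Oposs                                # Eq. (15)
    ps = [S[j] / Sposs[j] if Sposs[j] else float('nan')  # Eq. (12)
          for j in range(C)]
    return po, ps

# Three cases, two categories; first case rated 1,1,1,2,2
po, ps = generalized_agreement([[3, 2], [4, 0], [2, 3]])
```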

#### Standard errors, interval estimation, significance

The jackknife or, preferably, the nonparametric bootstrap can be used to estimate standard errors of ps(j) and po in the generalized case. The bootstrap is uncomplicated if one assumes cases are independent and identically distributed (iid). In general, this assumption is reasonable when:

• the same raters rate each case, and either there are no missing ratings or ratings are missing completely at random.

• the raters for each case are randomly sampled and the number of ratings per case is constant or random.

• in a replicate rating (reproducibility) study, each case is rated by the procedure the same number of times or else the number of replications for any case is completely random.

In these cases, one may construct each simulated sample by repeated random sampling with replacement from the set of K cases.

If cases cannot be assumed iid (for example, if ratings are not missing at random or, say, a study systematically rotates raters), simple modifications of the bootstrap method, such as two-stage sampling, can be made.

The parametric bootstrap can be used for significance testing. A variation of this method, patterned after the Monte Carlo approach described by Uebersax (1982), is as follows:

```
Loop through s, where s indexes simulated data sets
    Loop through all cases k
        Loop through all ratings on case k

            For each actual rating, generate a random
            simulated rating, chosen such that:

                Pr(Rating category = j | Rater = i)
                    = base rate of category j for Rater i.

            If rater identities are unknown, or for a
            reproducibility study, the total base rate
            for category j is used.

        End loop through case k's ratings
    End loop through cases

    Calculate p*o and p*s(j) (and any other
    statistics of interest) for sample s.
End main loop
```

The significance of po, ps(j), or any other statistic calculated, is determined with reference to the distribution of corresponding values in the simulated data sets. For example, po is significant at the .05 level (1-tailed) if it exceeds 95% of the p*o values obtained for the simulated data sets.


## References

Agresti, A. (1996). An introduction to categorical data analysis. New York: Wiley.

Cicchetti, D. V., & Feinstein, A. R. (1990). High agreement but low kappa: II. Resolving the paradoxes. Journal of Clinical Epidemiology, 43, 551-558.

Cohen, J. (1960). A coefficient of agreement for nominal scales. Educational and Psychological Measurement, 20, 37-46.

Efron, B. (1982). The jackknife, the bootstrap and other resampling plans. Philadelphia: Society for Industrial and Applied Mathematics.

Efron, B., & Tibshirani, R. J. (1993). An introduction to the bootstrap. New York: Chapman & Hall.

Graham, P., & Bull, R. (1998). Approximate standard errors and confidence intervals for indices of positive and negative agreement. Journal of Clinical Epidemiology, 51, 763-771.

Mackinnon, A. (2000). A spreadsheet for the calculation of comprehensive statistics for the assessment of diagnostic tests and inter-rater agreement. Computers in Biology and Medicine, 30, 127-134.

Spitzer, R. L., & Fleiss, J. L. (1974). A re-analysis of the reliability of psychiatric diagnosis. British Journal of Psychiatry, 125, 341-347.

Uebersax, J. S. (1982). A design-independent method for measuring the reliability of psychiatric diagnosis. Journal of Psychiatric Research, 17, 335-342.


Rev: 05 Aug 2014

(c) 2000-2014 John Uebersax PhD