Summary: In measuring the accuracy of a diagnostic test, we don't correct sensitivity (Se) or specificity (Sp) for the effects of chance; why do so in measuring rater agreement?
Despite its reputation as a chance-corrected agreement measure, kappa does not correct for chance agreement. Nor has the need for such an adjustment been convincingly shown.
To begin we should ask what chance agreement is. A plausible view is this: when raters are uncertain about the correct classification, a degree of guessing may occur. Guessing can be total (as with "I'm making a complete guess here") or partial (e.g., "My choice is based partly on guesswork"). When two raters both guess, sometimes they'll agree. The question is then whether such agreements should count in a statistical index of agreement.
In theory, if one could estimate how many agreements are produced by joint guessing, their effect could be removed to produce a more accurate measure of "true" agreement. This is what the kappa coefficient supposedly does, but really doesn't.
The formula for kappa is:
kappa = (po - pc) / (1 - pc),
where po denotes observed agreement, and pc denotes chance agreement and is computed based on marginal probabilities under the assumption of complete statistical independence of raters. That is, pc estimates the proportion of times raters would agree if they (1) guessed completely on every case, and (2) guessed with probabilities that match the marginal proportions of the observed ratings.
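As a concrete illustration of the formula, here is a minimal sketch in Python for the two-rater, two-category case; the function name and the example counts are hypothetical.

    # Kappa for a 2x2 cross-classification of two raters.
    # n11, n12, n21, n22 are cell counts; rows index Rater 1's
    # category, columns index Rater 2's category.
    def kappa_2x2(n11, n12, n21, n22):
        n = n11 + n12 + n21 + n22
        po = (n11 + n22) / n                # observed agreement
        p1 = (n11 + n12) / n                # Rater 1's marginal rate for category 1
        p2 = (n11 + n21) / n                # Rater 2's marginal rate for category 1
        pc = p1 * p2 + (1 - p1) * (1 - p2)  # agreement expected under complete independence
        return (po - pc) / (1 - pc)

    print(kappa_2x2(40, 10, 5, 45))         # hypothetical counts; prints approximately 0.70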
Assumption (1) is completely untenable. Raters might resort to guessing sometimes -- probably only in a minority of cases, and certainly not in all cases. Thus the basic logic behind viewing pc as an explicit chance-correction term is flawed.
By using explicit models of rater decision making, one could potentially apply a valid chance correction (Uebersax, 1987). This would require both a theoretically defensible model and sufficient data to verify empirically that the observed data conform to the model. In any case, this becomes an exercise in modeling rater agreement (Agresti, 1992; Uebersax, 1992) rather than merely calculating a simple index.
However, an even stronger argument against making a chance correction to observed agreement is as follows. A problem analogous to measuring rater agreement is the assessment of correspondence between a diagnostic test and a criterion or gold standard, i.e., a rating accuracy paradigm. For example, when evaluating a diagnostic test, one commonly estimates its sensitivity (Se) and its specificity (Sp). With Se and Sp, no adjustment is made for possible chance agreements between the test and the gold standard.
Yet, from a logical standpoint, there is not much difference between this situation and a rater agreement paradigm. If a chance correction is deemed necessary for agreement, why not for measuring accuracy? If a disease has a very high prevalence and a diagnostic test has a high rate of positive results, then Se will be large even if the test and the diagnosis are statistically independent. Consider the data summarized in Table 1:
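Table 1. Hypothetical results for a diagnostic test and a gold-standard diagnosis

                      Gold standard
    Test result    Positive   Negative    Total
      Positive         81          9         90
      Negative          9          1         10
      Total            90         10        100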
Se for this table is 81/90 = .90, which is reasonably large. Yet here the test and the criterion are statistically independent: the level of Se is exactly what one would expect if test results were random -- e.g., made by flipping a biased coin with Pr(heads) = .90. This example shows that when marginal rates are extreme, chance alone may produce a high level of Se. This is precisely the reasoning originally used to justify using kappa as an index of agreement rather than simply reporting the raw agreement rate, po.
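The arithmetic behind this claim can be checked directly; the following sketch (variable names are illustrative) computes Se and Sp for the Table 1 counts, along with the marginal test rates they would be expected to equal if the test and the criterion were independent.

    # Table 1 counts: rows = test result, columns = gold standard.
    tp, fp = 81, 9          # test positive: 81 diseased, 9 non-diseased
    fn, tn = 9, 1           # test negative: 9 diseased, 1 non-diseased
    n = tp + fp + fn + tn

    se = tp / (tp + fn)             # sensitivity = 81/90 = 0.90
    sp = tn / (tn + fp)             # specificity = 1/10  = 0.10

    p_test_pos = (tp + fp) / n      # marginal rate of positive test results = 0.90
    p_test_neg = (fn + tn) / n      # marginal rate of negative test results = 0.10

    # Under independence, Se is expected to equal the marginal positive-test
    # rate, and Sp the marginal negative-test rate.
    print(se, p_test_pos)           # 0.9 0.9 -- Se is exactly at its chance-expected level
    print(sp, p_test_neg)           # 0.1 0.1 -- and so is Sp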
Why, then, is a similar chance correction not considered in the case of Se? The answer is probably that when one estimates Se one also generally estimates Sp. The use of both these indices together avoids the possibility that an extreme marginal split might cause a poor diagnostic test to appear accurate. When a test and gold standard are independent or weakly associated, and if the base rates are extreme -- the usual situation in which a chance correction becomes a potential issue -- Se and Sp will not both be high.
In Table 1, for example, Sp = 1/10 = .1, which is extremely low. A low Sp would lead one to view a high value of Se with skepticism. When both Se and Sp are considered together, there is no obvious, compelling need to correct for possible effects of chance (especially given that to do so properly might require considerable effort).
The same principle should logically apply to assessing agreement between two raters or tests. In this case we have the option to compute the proportions of specific positive agreement (PA) and specific negative agreement (NA), which are closely analogous to Se and Sp. By verifying that both PA and NA are acceptably high one is protected against unfairly capitalizing on extreme base rates when evaluating the level of rater agreement.
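For the two-rater, two-category case, with a denoting the number of cases both raters call positive, d the number both call negative, and b and c the two kinds of disagreement, the usual formulas are PA = 2a/(2a + b + c) and NA = 2d/(2d + b + c). The following is a minimal sketch; the function name and the example counts are hypothetical.

    # Specific positive agreement (PA) and specific negative agreement (NA)
    # for a 2x2 cross-classification of two raters:
    #   a = both rate positive, d = both rate negative,
    #   b, c = the two kinds of disagreement.
    def specific_agreement(a, b, c, d):
        pa = 2 * a / (2 * a + b + c)
        na = 2 * d / (2 * d + b + c)
        return pa, na

    # Hypothetical counts with an extreme base rate: raw agreement is
    # 90/100 = 0.90, but NA shows that agreement on negatives is only modest.
    print(specific_agreement(85, 5, 5, 5))   # approximately (0.94, 0.50)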
It is clear that Se and Sp are widely used and trusted indices, and if there were a need to correct them for chance, this would have been mentioned long ago. There is simply no need, and the same principle should apply to measuring agreement between two tests or raters.
This argument is so compelling that it ought to settle the issue completely. The problem is that many researchers continue to use kappa coefficients imitatively, without thinking through the issues involved. It is also unfortunate that statistics texts neglect to explain PA and NA.
References

Agresti A. Modelling patterns of agreement and disagreement. Statistical Methods in Medical Research, 1992, 1(2), 201-218.
Uebersax JS. Diversity of decision-making models and the measurement of interrater agreement. Psychological Bulletin, 1987, 101, 140-146.
Uebersax JS. Modeling approaches for the analysis of observer agreement. Investigative Radiology, 1992, 27(9), 738-743.