Kappa Coefficients: A Critical Appraisal


This is only one page of a larger website. For more information, including alternatives to kappa, visit the Agreement Statistics main page.

Summary

There is wide disagreement about the usefulness of kappa statistics to assess rater agreement. At the least, it can be said that (1) kappa statistics should not be viewed as the unequivocal standard or default way to quantify agreement; (2) one should be concerned about using a statistic that is the source of so much controversy; and (3) one should consider alternatives and make an informed choice.

One can distinguish between two possible uses of kappa: as a way to test rater independence (i.e., as a test statistic), and as a way to quantify the level of agreement (i.e., as an effect-size measure). The first use involves testing the null hypothesis that there is no more agreement than might occur by chance given random guessing; that is, one makes a qualitative, "yes or no" decision about whether raters are independent or not. Kappa is appropriate for this purpose (although to know that raters are not independent is not very informative; raters are dependent by definition, inasmuch as they are rating the same cases).

It is the second use of kappa--quantifying actual levels of agreement--that is the source of concern. Kappa's calculation uses a term called the proportion of chance (or expected) agreement. This is interpreted as the proportion of times raters would agree by chance alone. However, the term is relevant only under the condition of statistical independence of the raters. Since raters are clearly not independent, the relevance of this term, and its appropriateness as a correction to actual agreement levels, is very questionable.
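To make the role of this term concrete, recall how kappa is computed (the numbers below are a hypothetical illustration). If po is the observed proportion of agreement and pe is the proportion of chance (expected) agreement, then

   kappa = (po - pe) / (1 - pe),

where pe is found by multiplying the two raters' marginal proportions for each category and summing across categories. For example, if each of two raters assigns Category 1 to 60% of cases and Category 2 to 40%, then pe = (.60)(.60) + (.40)(.40) = .52; with observed agreement po = .85, kappa = (.85 - .52) / (1 - .52) = approximately .69. Note that pe equals the agreement expected only if the raters assigned categories independently at those marginal rates--which is exactly the independence assumption questioned above.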

Thus, the common statement that kappa is a "chance-corrected measure of agreement" is misleading. As a test statistic, kappa can verify that agreement exceeds chance levels. But as a measure of the level of agreement, kappa is not "chance-corrected"; indeed, in the absence of some explicit model of rater decision making, it is by no means clear how chance affects the decisions of actual raters and how one might correct for it.

A better case for using kappa to quantify rater agreement is that, under certain conditions, it approximates the intraclass correlation. But this, too, is problematic in that (1) these conditions are not always met, and (2) one could instead directly calculate the intraclass correlation.
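For reference, the simplest form of the intraclass correlation (the one-way random-effects version; several variants exist, and which is appropriate depends on the rating design) can be computed directly from an analysis of variance on the ratings:

   ICC = (MSB - MSW) / (MSB + (k - 1) MSW),

where MSB is the between-cases mean square, MSW is the within-cases mean square, and k is the number of raters per case. This is shown only to illustrate that the direct calculation is straightforward.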



Pros and Cons

Pros

Cons



Calculating Kappa and Weighted Kappa with SAS®
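One standard approach (sketched below with purely hypothetical data) is PROC FREQ with the AGREE option, which reports both simple and weighted kappa; the TEST statement gives asymptotic tests of the null hypothesis that kappa equals zero.

/* Hypothetical ratings: two raters, three ordered categories.   */
/* Each record is one cell of the cross-classification table.    */
data ratings;
   input rater1 rater2 count;
   datalines;
1 1 20
1 2 5
1 3 1
2 1 4
2 2 30
2 3 6
3 1 1
3 2 5
3 3 28
;
run;

/* AGREE requests kappa and weighted kappa; TEST requests        */
/* asymptotic tests of H0: kappa = 0.                            */
proc freq data=ratings;
   weight count;
   tables rater1*rater2 / agree;
   test kappa wtkap;
run;

By default the weighted kappa uses Cicchetti-Allison weights; specifying AGREE(WT=FC) instead requests Fleiss-Cohen (quadratic) weights, the version of weighted kappa that approximates the intraclass correlation discussed above.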



Updated:
01 Oct 2009 (Myth of chance correction)
18 Mar 2010 (link updated)


John Uebersax Enterprises LLC
(c) 2000-2010 John Uebersax PhD