Intraclass Correlation and Related Methods


This page discusses use of the ICC to assess reliability of ordered-category and Likert-type ratings. Some comments also apply to the ICC for continuous-level data. For information on other ways to analyze rater agreement, visit the Agreement Statistics main page.

Introduction

The Intraclass Correlation (ICC) assesses rating reliability by comparing the variability of different ratings of the same subject to the total variation across all ratings and all subjects.

The theoretical formula for the ICC is:

    ICC  =  σ²(b) / [σ²(b) + σ²(w)]        [1]

where σ²(w) is the pooled within-subject variance, and σ²(b) is the variance of the trait between subjects.

It is easily shown that σ²(b) + σ²(w) equals the total variance of ratings--i.e., the variance of all ratings, regardless of whether they are for the same subject or not. Hence the ICC can be interpreted as the proportion of total variance accounted for by between-subject variation, that is, by true differences among subjects.
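As a quick illustration (not part of the original page), the following Python sketch simulates ratings as a subject's true trait level plus independent rater error; the variable names and variance values are arbitrary.

    import numpy as np

    rng = np.random.default_rng(0)
    n_subjects, n_raters = 500, 4
    sigma2_b, sigma2_w = 4.0, 1.0     # illustrative variance components

    true_level = rng.normal(0.0, np.sqrt(sigma2_b), size=n_subjects)          # between-subject part
    error = rng.normal(0.0, np.sqrt(sigma2_w), size=(n_subjects, n_raters))   # within-subject part
    ratings = true_level[:, None] + error                                     # one row per subject

    # The variance of all ratings is approximately sigma2_b + sigma2_w,
    # and Equation [1] gives ICC = 4 / (4 + 1) = 0.80.
    print(ratings.var(), sigma2_b / (sigma2_b + sigma2_w))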

Equation [1] would apply if we knew the true values σ²(w) and σ²(b). But we rarely do, and must instead estimate them from sample data. For this we wish to use all available information; doing so adds correction terms to Equation [1].

For example, σ²(b) is the variance of true trait levels between subjects. Since we do not know a subject's true trait level, we estimate it by the subject's mean rating across the raters who rate that subject. Each mean rating is subject to sampling variation--deviation from the subject's true trait level, or its surrogate, the mean rating that would be obtained from a very large number of raters. Since the actual mean ratings are often based on only two or a few ratings, these deviations are appreciable and inflate the estimate of the between-subject variance.

We can estimate the amount of this extra error variation and correct for it. If all subjects have k ratings, then for the Case 1 ICC (see definition below) the extra variation is estimated as (1/k) σ²(w), where σ²(w) is the pooled estimate of the within-subject variance; when all subjects have k ratings, σ²(w) equals the average variance of the k ratings of each subject (each calculated using k-1 as the denominator). To get the ICC we then:

1. estimate the between-subject variance as the variance of the subjects' mean ratings;
2. subtract the correction term (1/k) σ²(w) from this estimate, giving a corrected estimate of σ²(b); and
3. substitute the corrected estimate of σ²(b), together with σ²(w), into Equation [1] (a sketch of this calculation follows).
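The following Python sketch (my own illustration; the function name and example data are hypothetical) carries out those three steps for an n-subjects by k-ratings table, and notes the equivalent one-way ANOVA "computational" form.

    import numpy as np

    def icc_case1(ratings):
        # Case 1 ICC of a single rating, from an n-subjects x k-ratings array,
        # computed in the "derivational" form described above.
        n, k = ratings.shape
        s2_w = ratings.var(axis=1, ddof=1).mean()     # pooled within-subject variance
        var_means = ratings.mean(axis=1).var(ddof=1)  # variance of the subject mean ratings
        s2_b = var_means - s2_w / k                   # remove the (1/k)*sigma2(w) inflation
        return s2_b / (s2_b + s2_w)

    # The result is identical to the usual one-way ANOVA ("computational") form,
    # ICC(1,1) = (MSB - MSW) / (MSB + (k - 1) * MSW).
    rng = np.random.default_rng(1)
    example = rng.normal(0, 1, size=(30, 3)) + rng.normal(0, 2, size=(30, 1))
    print(icc_case1(example))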

For the various other types of ICCs, different corrections are used, each producing its own equation. Unfortunately, these formulas are usually expressed in their computational form--with terms arranged in a way that facilitates calculation--rather than in their derivational form, which would make clear the nature and rationale of the correction terms.

Different Types of ICC

In their important paper, Shrout and Fleiss (1979) describe three classes of ICC for reliability, which they term Case 1, Case 2 and Case 3. Each Case applies to a different rater agreement study design.

Case 1   The raters who rate each subject are selected at random.
Case 2   The same raters rate every subject; they are a random sample from a larger population of raters.
Case 3   The same raters rate every subject; they are the only raters of interest.

Case 1. One has a pool of raters. For each subject, k different raters are randomly sampled from the pool to rate that subject. Therefore the raters who rate one subject are not necessarily the same as those who rate another. This design corresponds to a 1-way Analysis of Variance (ANOVA) in which Subject is a random effect and Rater is treated as measurement error.

Case 2. The same set of k raters rates each subject. This corresponds to a fully-crossed (Rater × Subject), 2-way ANOVA design in which Subject and Rater are separate effects. In Case 2, Rater is considered a random effect; that is, the k raters in the study are treated as a random sample from a larger population of potential raters. The Case 2 ICC therefore estimates the reliability that generalizes to this larger population of raters.

Case 3. This is like Case 2--a fully-crossed, 2-way ANOVA design. But here one estimates the ICC that applies only to the k raters in the study. Since this does not permit generalization to other raters, the Case 3 ICC is not often used.

Shrout and Fleiss (1979) also show that for each of the three Cases above, one can use the ICC in two ways:

- to estimate the reliability of a single rating, or
- to estimate the reliability of the mean of k ratings.

For each of the Cases, then, there are two forms, producing a total of 6 different versions of the ICC.
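For reference, here is a sketch (in Python; it is not part of the original page) of the standard Shrout and Fleiss (1979) computational formulas for all six versions, assuming a complete n-subjects by k-raters table; for Case 1 the columns are simply the k ratings obtained for each subject. The example data are, I believe, the 6 × 4 ratings table widely reproduced from Shrout and Fleiss (1979).

    import numpy as np

    def shrout_fleiss_iccs(x):
        # All six Shrout-Fleiss ICCs from an n-subjects x k-raters array
        # (complete data; for Case 1 the columns are just each subject's k ratings).
        x = np.asarray(x, dtype=float)
        n, k = x.shape
        grand = x.mean()

        # Mean squares of the two-way layout
        msb = k * x.mean(axis=1).var(ddof=1)    # between subjects
        msj = n * x.mean(axis=0).var(ddof=1)    # between raters ("judges")
        msw = ((x - x.mean(axis=1, keepdims=True)) ** 2).sum() / (n * (k - 1))   # within subjects
        mse = ((x - x.mean(axis=1, keepdims=True)
                  - x.mean(axis=0, keepdims=True) + grand) ** 2).sum() / ((n - 1) * (k - 1))

        return {
            "ICC(1,1)": (msb - msw) / (msb + (k - 1) * msw),
            "ICC(2,1)": (msb - mse) / (msb + (k - 1) * mse + k * (msj - mse) / n),
            "ICC(3,1)": (msb - mse) / (msb + (k - 1) * mse),
            "ICC(1,k)": (msb - msw) / msb,
            "ICC(2,k)": (msb - mse) / (msb + (msj - mse) / n),
            "ICC(3,k)": (msb - mse) / msb,
        }

    ratings = np.array([[ 9, 2, 5, 8],
                        [ 6, 1, 3, 2],
                        [ 8, 4, 6, 8],
                        [ 7, 1, 2, 6],
                        [10, 5, 6, 9],
                        [ 6, 2, 4, 7]])
    print(shrout_fleiss_iccs(ratings))

The ICC(•,1) forms give the reliability of a single rating; the ICC(•,k) forms give the reliability of the mean of the k ratings.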



Pros and Cons

Pros

- The ICC automatically controls for the scaling factor of an instrument, so that (within the same population) ICCs of instruments with different numbers of rating categories can be meaningfully compared (see The Comparability Issue below).

Cons

- The ICC is strongly dependent on the trait variance, σ²(b), of the population in which it is measured, which complicates comparison of ICCs obtained in different populations and generalization of results (see The Comparability Issue below).



The Comparability Issue

Above it was noted that the ICC is strongly dependent on the trait variance within the population for which it is measured. This can complicate the comparison of ICCs measured in different populations, or the generalization of results beyond a single population.

Some suggest avoiding this problem by eliminating or holding constant the "problematic" term, σ²(b).

Holding the term constant would mean choosing some fixed value for σ²(b), and using this in place of the different value estimated in each population. For example, one might pick as σ²(b) the trait variance in the general adult population--regardless of what population the ICC is measured in.

However, if one is going to hold σ²(b) constant, one may well question using it at all! Why not simply report the value of σ²(w) for a study as the index of unreliability? Indeed, this has been suggested, though it has not been much used in practice.

But if one is going to disregard σ²(b) because it complicates comparisons, why not go a step further and express reliability simply as raw agreement rates--for example, the percentage of times two raters agree on the exact same category, and the percentage of times they are within one level of one another?
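To make these alternatives concrete, here is a small Python sketch (the ratings and the reference value of σ²(b) are made up for illustration) computing the within-subject variance alone, an ICC based on a fixed reference σ²(b), and the two raw agreement rates just mentioned.

    import numpy as np

    # Two raters' ratings of ten subjects on a 1-5 ordered-category scale (made-up data).
    r1 = np.array([3, 4, 2, 5, 3, 1, 4, 4, 2, 3])
    r2 = np.array([3, 5, 2, 4, 3, 2, 4, 3, 2, 4])

    # (a) Within-subject variance alone as the index of unreliability
    s2_w = np.column_stack([r1, r2]).var(axis=1, ddof=1).mean()

    # (b) ICC with sigma2(b) held at a fixed reference value (arbitrary here),
    #     e.g. an assumed trait variance for the general adult population
    sigma2_b_ref = 1.5
    icc_fixed = sigma2_b_ref / (sigma2_b_ref + s2_w)

    # (c) Raw agreement: exact, and within one level
    exact = np.mean(r1 == r2)
    within_one = np.mean(np.abs(r1 - r2) <= 1)

    print(s2_w, icc_fixed, exact, within_one)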

An advantage of including σ²(b) is that it automatically controls for the scaling factor of an instrument. Thus (at least within the same population), ICCs for instruments with different numbers of categories can be meaningfully compared. This is not the case with raw agreement measures or with σ²(w) alone. Therefore, someone reporting the reliability of a new scale may wish to include the ICC along with other measures, in case later researchers want to compare the results with those of a different instrument having fewer or more categories.



Software

SPSS

SPSS has excellent features for calculating the ICC. The sources below explain them:

SAS

SAS does not have a built-in ICC procedure, but these user-written macros fill the gap:



ICC Online Calculators



Websites


Articles

Barrett P. Assessing the reliability of rating data.

Fleiss JL. Statistical methods for rates and proportions. 2nd ed. New York: John Wiley, 1981, 38-46.

Garson D. Intraclass correlation in relation to inter-rater reliability.

Landis JR, Koch GG. The measurement of observer agreement for categorical data. Biometrics 1977a;33:159-174.

Landis JR, Koch GG. An application of hierarchical kappa-type statistics in the assessment of majority agreement among multiple observers. Biometrics 1977b; 33: 363-374.

McGraw KO, Wong SP. Forming inferences about some intraclass correlations. Psychological Methods 1996;1:30-46.

Muller R, Buttner P. A critical discussion of intraclass correlation coefficients. Statistics in Medicine 1994;13(23-24):2465-2476.

Shrout PE, Fleiss JL. Intraclass correlations: Uses in assessing rater reliability. Psychological Bulletin 1979;86:420-428.

Shrout PE. Measurement reliability and agreement in psychiatry. Statistical Methods in Medical Research 1998;7(3):301-317.



Last updated: 2 April 2007 (removed background)


(c) 2006 John Uebersax PhD