Analyzing Agreement on Interval-Level Ratings

1. Introduction

Here we depart from the main subject of our inquiry--agreement on categorical ratings--to consider interval-level ratings.

Methods for analyzing agreement on interval-level ratings are better established than those for categorical data. Still, there is far from complete consensus about which methods are best, and many, if not most, published studies use less than ideal methods.

Our view is that the basic premise outlined for analyzing categorical data, that there are different components of agreement and that these should be separately measured, applies equally to interval-level data. It is only by separating the different components of agreement that we can tell what steps are needed to improve agreement.

Before considering specific methods, it is helpful to consider a common question: When should ratings be treated as ordered-categorical and when as interval-level data?

Some guidelines are as follows:

If the rating levels have integer anchors on a Likert-type scale, treat the ratings as interval data. By a Likert-type scale we mean that the actual form on which raters record ratings contains an explicit scale, such as the following two examples:
```
      lowest                              highest
      level                               level
        1     2     3     4     5     6     7    (circle one)

    strongly                              strongly
    disagree                              agree
        1     2     3     4     5     6     7      check level
       __    __    __    __    __    __    __      that applies
```
These examples contain verbal anchors only for the two extreme levels, but there are other examples where each integer is anchored with a verbal label. The basic principle here is that the format itself strongly implies that raters should regard the rating levels as exactly or approximately evenly-spaced.

If rating categories have been chosen by the researcher to represent at least approximately equally-spaced levels, strongly consider treating the data as interval level. For example, for rating the level of cigarette consumption, one has latitude in defining categories such as "1-2 cigarettes per day," "1/2 pack per day," "1 pack per day," etc. It is my observation, at least, that researchers instinctively choose categories that represent more or less equal increments in a construct of "degree of smoking involvement," which justifies treating the data as interval level.

If the categories are few in number and the rating level anchors are chosen, worded, or formatted in a way that does not imply any kind of equal spacing between rating levels, treat the data as ordered categorical. This may apply even when response levels are labelled with integers--for example, response levels of "1. None," "2. Mild," "3. Moderate," and "4. Severe." Note that here one could as easily substitute the letters A, B, C and D for the integers 1, 2, 3 and 4.

If the ratings will, in subsequent research, be statistically analyzed as interval-level data, then treat them as interval-level data for the reliability study. Conversely, if they will be analyzed as ordered-categorical in subsequent research, treat them as ordered-categorical in the reliability study.

Some who are statistically sophisticated may insist that nearly all ratings of this type should be treated as ordered-categorical and analyzed with nonparametric methods. However, this view fails to consider that one may also err by applying nonparametric methods when ratings do meet the assumptions of interval-level data; specifically, by using nonparametric methods, significant statistical power may be lost.

2. General Issues

In this section we consider two general issues. The first is an explanation of three different components of disagreement on interval-level ratings. The second issue concerns the general strategy for examining rater differences.

Different causes may result in rater disagreement on a given case. With interval-level data, these various causes have effects that can be broadly grouped into three categories: effects on the correlation or association of raters' ratings, rater bias, and rater differences in the distribution of ratings.

2.1 Rater Association

In making a rating, raters typically consider many factors. For example, in rating life quality, a rater may consider separate factors of satisfaction with social relationships, job satisfaction, economic security, health, etc. Judgments on these separate factors are combined by the rater to produce a single overall rating.

Raters may vary in what factors they consider. Moreover, different raters may weight the same factors differently, or they may use different "algorithms" to combine information on each factor to produce a final rating.

Finally, a certain amount of random error affects a rating process. A patient's symptoms may vary over time, raters may be subject to distractions, or the focus of a rater may vary. Because of such random error, we would not even expect two ratings by a single rater of the same case to always agree.

The combined effect of these issues is to reduce the correlation of ratings by different raters. (This can be easily shown with formulas and a formal measurement model.) Said another way, to the extent that raters' ratings correlate less than 1, we have evidence that the raters are considering or weighting different factors, that random error and noise affect the rating process, or both. When rater association is low, it implies that the study coordinator needs to apply training methods to improve the consistency of raters' criteria. Group discussion conferences may also be useful to clarify rater differences in their criteria, definitions, and interpretation of the basic construct.
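One way to make this concrete is a short sketch under an assumed one-factor measurement model with standardized variables. The symbols X_j, T, lambda_j, and e_j are introduced here only for illustration; they are not notation from the text above, and other measurement models would lead to the same qualitative conclusion.

```latex
% Assumed model: rating by rater j = loading * latent trait + rater-specific part
X_j = \lambda_j T + e_j, \qquad
\operatorname{Var}(T) = \operatorname{Var}(X_j) = 1, \qquad
\operatorname{Cov}(T, e_j) = 0, \quad
\operatorname{Cov}(e_i, e_j) = 0 \quad (i \neq j)

% Correlation between two different raters under these assumptions
\operatorname{Corr}(X_i, X_j) = \lambda_i \lambda_j
```

In this sketch e_j absorbs both random error and any factors that only rater j considers, so Var(e_j) = 1 - lambda_j^2. Any such rater-specific variance pushes lambda_j below 1, and with it the correlation between raters.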

2.2 Rater Bias

Rater bias refers to the tendency of a rater to make ratings generally higher or lower than those of other raters. Bias may occur for several reasons. For example, in clinical situations, some raters may tend to "overpathologize" or "underpathologize." Some raters may also simply interpret the calibration of the rating scale differently so as to make generally higher or lower ratings.

With interval-level ratings, rater bias can be assessed by calculating the mean rating of a rater across all cases that they rate. High or low means, relative to the mean of all raters, indicate positive or negative rater bias, respectively.
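The calculation just described can be sketched in a few lines of Python. The rater names and ratings below are hypothetical, and the sketch assumes every rater rates the same cases, so that the mean of the rater means equals the overall mean.

```python
from statistics import mean

# Hypothetical ratings: one list per rater, same six cases in the same order.
ratings = {
    "rater_a": [3, 4, 2, 5, 4, 3],
    "rater_b": [4, 5, 3, 5, 5, 4],
    "rater_c": [2, 3, 2, 4, 3, 2],
}

def rater_bias(ratings):
    """Return each rater's mean rating minus the mean across all raters."""
    rater_means = {name: mean(values) for name, values in ratings.items()}
    grand_mean = mean(list(rater_means.values()))
    return {name: m - grand_mean for name, m in rater_means.items()}

bias = rater_bias(ratings)
for name in sorted(bias):
    # Positive values suggest generally higher ratings than the group.
    print(f"{name}: {bias[name]:+.2f}")
```

A positive value indicates a rater who tends to rate higher than the group, a negative value one who tends to rate lower.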

2.3 Rating Distribution

Sometimes an individual rater will have, when one examines all ratings made by the rater, a noticeably different distribution than the distribution of ratings for all raters combined. The reasons for this are somewhat more complex than is true for differences in rater association and bias. Partly it may relate to rater differences in what they believe is the distribution of the trait in the sample or population considered. At present, we mainly regard this as an empirical issue: examination of the distribution of ratings by each rater may sometimes reveal important differences. When such differences exist, some attempt should be made to reduce them, as they are associated with decreased rater agreement.

2.4 Rater vs. Rater or Rater vs. Group

In analyzing and interpreting results from a rater agreement study, and when more than two raters are involved, one often thinks in terms of a comparison of each rater with every other rater. This is relatively inefficient and, it turns out, often unnecessary. Most of the important information to be gained can be more easily obtained by comparing each rater to some measure of overall group performance. We term the former approach the Rater vs. Rater strategy, and the latter the Rater vs. Group strategy.

Though it is the more common, the Rater vs. Rater approach requires more effort. For example, with 10 raters, one needs to consider a 10 x 10 correlation matrix (actually, 45 correlations between distinct rater pairs). In contrast, a Rater vs. Group approach, which might, for example, instead correlate each rater's ratings with the average rating across all raters, would summarize results with only 10 correlations.

The prevalence of the Rater vs. Rater view is perhaps historical and accidental. Originally, most rater agreement studies used only two raters--so methods naturally developed for the analysis of rater pairs. As studies grew to include more raters, the same basic methods (e.g., kappa coefficients) were applied by considering all pairs of raters. What did not happen (as it seldom does when paradigms evolve gradually) was a basic re-examination and fresh selection of methods.

This is not to say that the Rater vs. Rater approach is always bad, or that the Rater vs. Group is always better. There is a place for both. Sometimes one wants to know the extent to which different rater pairs vary in their level of agreement; then the Rater vs. Rater approach is better. Other times one will wish merely to obtain information on the performance of each rater in order to provide feedback and improve rating consistency; then the Rater vs. Group approach may be better. (Of course, there is nothing to prevent the researcher from using both approaches.) It is important mainly that researchers realize that they have a choice and make an informed selection of methods.

3. Measuring Rater Agreement

We now direct attention to the question of which statistical methods to use to assess association, bias, and rater distribution differences in an agreement study.

3.1 Measuring Rater Association

As already mentioned, from the Rater vs. Rater perspective, association can be summarized by calculating a Pearson correlation (r) of the ratings for each distinct pair of raters. Sometimes one may wish to report the entire matrix of such correlations. Other times it will make sense to summarize the data as a mean, standard deviation, minimum and maximum across all pairwise correlations.

From a Rater vs. Group perspective, there are two relatively simple ways to summarize rater association. The first, already mentioned, is to calculate the correlation of each rater's ratings with the average of all raters' ratings (this generally presupposes that all raters rate the same set of cases or, at least, that each case is rated by the same number of raters). The alternative is to calculate the average correlation of a rater with every other rater--that is, to consider row or column averages of the rater x rater correlation matrix. It should be noted that correlations produced by the former method will be, on average, higher than those produced by the latter. This is because average ratings are more reliable than individual ratings. However, the main interest will be to compare different raters in terms of their correlation with the mean rating, which is still possible; that is, the raters with the highest/lowest correlations with one method will generally also be those with the highest/lowest correlations with the other.
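The three summaries discussed so far (pairwise correlations, correlation with the group mean, and average correlation with the other raters) can be sketched as follows. The data and names are hypothetical, and the sketch assumes all raters rate the same cases in the same order.

```python
import math
from itertools import combinations

def pearson(x, y):
    """Pearson correlation of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

# Hypothetical ratings: one list per rater, all raters rating the same cases.
ratings = {
    "rater_a": [1, 2, 3, 4, 5, 6],
    "rater_b": [2, 1, 4, 3, 6, 5],
    "rater_c": [1, 3, 2, 5, 4, 6],
    "rater_d": [3, 1, 2, 6, 4, 5],
}
names = list(ratings)

# Rater vs. Rater: one correlation per distinct pair of raters.
pairwise = {(i, j): pearson(ratings[i], ratings[j])
            for i, j in combinations(names, 2)}

# Rater vs. Group, method 1: correlate each rater with the mean of all raters.
n_cases = len(ratings[names[0]])
case_means = [sum(ratings[r][k] for r in names) / len(names)
              for k in range(n_cases)]
with_mean = {r: pearson(ratings[r], case_means) for r in names}

# Rater vs. Group, method 2: average correlation with every other rater.
def pair_r(i, j):
    """Look up a pairwise correlation regardless of the order of the pair."""
    return pairwise[(i, j)] if (i, j) in pairwise else pairwise[(j, i)]

avg_with_others = {
    r: sum(pair_r(r, other) for other in names if other != r) / (len(names) - 1)
    for r in names
}

for r in names:
    print(f"{r}: r with group mean = {with_mean[r]:.2f}, "
          f"mean r with others = {avg_with_others[r]:.2f}")
```

Note that the group mean in method 1 includes the rater's own ratings, which is one reason those correlations run higher than the method 2 averages.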

A much better method, however, is factor analysis. With this method, one estimates the association of each rater with a latent factor. The factor is understood as a "proto-rating," or the latent trait of which each rater's opinions are an imperfect representation. (If one wanted to take an even stronger view, the factor could be viewed as representing the actual trait which is being rated.)

The details of such analysis are as follows.

Using any standard statistical software such as SAS or SPSS, one uses the appropriate routine to request a factor analysis of the data. In SAS, for example, one would use PROC FACTOR.
A common factor model is requested (not principal components analysis).
A one-factor solution is specified; note that factor rotation does not apply with a one-factor solution, so do not request this.
One has some latitude in choice of estimation methods, but iterated principal factor analysis is recommended. In SAS, this is called the PRINIT method.
Do not request communalities fixed at 1.0. Instead, let the program estimate communalities. If the program requests that you specify starting communality values, request that squared multiple correlations (SMC) be used.
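For readers who want to see what these steps amount to computationally, here is a minimal pure-Python sketch of a one-factor iterated principal factor analysis with SMC starting communalities. It is an illustration under simplifying assumptions, not a replacement for PROC FACTOR or a comparable routine: the function names are hypothetical, the correlation matrix is made up so that the answer is known, and no safeguards (for example against communalities exceeding 1) are included.

```python
import math

def invert(matrix):
    """Invert a square matrix by Gauss-Jordan elimination with partial pivoting."""
    n = len(matrix)
    aug = [row[:] + [float(i == j) for j in range(n)]
           for i, row in enumerate(matrix)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        scale = aug[col][col]
        aug[col] = [v / scale for v in aug[col]]
        for r in range(n):
            if r != col:
                factor = aug[r][col]
                aug[r] = [v - factor * w for v, w in zip(aug[r], aug[col])]
    return [row[n:] for row in aug]

def leading_eigenpair(matrix, sweeps=500):
    """Largest eigenvalue and unit eigenvector of a symmetric matrix (power iteration)."""
    n = len(matrix)
    vec = [1.0 / math.sqrt(n)] * n
    for _ in range(sweeps):
        prod = [sum(matrix[i][j] * vec[j] for j in range(n)) for i in range(n)]
        norm = math.sqrt(sum(p * p for p in prod))
        vec = [p / norm for p in prod]
    prod = [sum(matrix[i][j] * vec[j] for j in range(n)) for i in range(n)]
    return sum(v * p for v, p in zip(vec, prod)), vec

def one_factor_loadings(corr, max_iter=500, tol=1e-8):
    """Iterated principal factor estimate of loadings on a single common factor."""
    n = len(corr)
    inverse = invert(corr)
    # Starting communalities: squared multiple correlations (SMC).
    communality = [1.0 - 1.0 / inverse[i][i] for i in range(n)]
    for _ in range(max_iter):
        # Reduced correlation matrix: communalities replace the unit diagonal.
        reduced = [[communality[i] if i == j else corr[i][j] for j in range(n)]
                   for i in range(n)]
        eigenvalue, vec = leading_eigenpair(reduced)
        loadings = [math.sqrt(eigenvalue) * v for v in vec]
        updated = [value * value for value in loadings]
        change = max(abs(u - c) for u, c in zip(updated, communality))
        communality = updated
        if change < tol:
            break
    return loadings

# Hypothetical rater x rater correlation matrix built from known loadings,
# so the estimates can be compared with the values that generated it.
true_loadings = [0.9, 0.8, 0.7, 0.6]
corr = [[1.0 if i == j else a * b for j, b in enumerate(true_loadings)]
        for i, a in enumerate(true_loadings)]

loadings = one_factor_loadings(corr)
print([round(value, 3) for value in loadings])
```

Because the example matrix has an exact one-factor structure, the estimated loadings should come out very close to the generating values. Real rating data will not fit this cleanly.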

In examining the results, two parts of the output should be considered. First are the loadings of each rater on the common factor. These are the same as the correlations of each rater's ratings with the common factor. They can be interpreted as the degree to which a rater's ratings are associated with the latent trait. The latent trait or factor is not, logically speaking, the same as the construct being measured. For example, a patient's level of depression (the construct) is a real entity. On the other hand, a factor or latent trait inferred from raters' ratings is a surrogate--it is the shared perception or interpretation of this construct. It may be very close to the true construct, or it may represent a shared misinterpretation. Still, lacking a "gold standard," and if we are to judge only on the basis of raters' ratings, the factor represents our best information about the level of the construct. And the correlation of raters with the factor represents our best guess as to the correlation of raters' ratings with the true construct.

Within certain limitations, therefore, one can regard the factor loadings as upper-bound estimates for the correlation of ratings with the true construct--that is, upper-bound estimates of the validity of ratings. If a loading is very high, then we only know that the validity of this rater is below this number--not very useful information. However, if the loading is low, then we know that the validity of the rater, which must be lower, is also low. Thus, in pursuing this method, we are potentially able to draw certain inferences about rating validity--or, at least, lack thereof--from agreement data (Uebersax, 1988).

Knowledgeable statisticians and psychometricians recognize that the factor-analytic approach is appropriate, if not optimal, for this application. Still, one may encounter criticism from peers or reviewers who are perhaps overly attached to convention. Some may say, for example, that one should really use Cronbach's alpha or calculate the intraclass correlation with such data. One should not be overly concerned about such comments. (I recognize that it would help many researchers to be armed with references to published articles that back up what is said here. Such references exist and I'll try to supply them. In the meantime, you might try a literature search using keywords like "factor analysis" and "agreement" or "reliability.")

While on this subject, it should be mentioned that there has been recent controversy about using the Pearson correlation vs. using the intraclass correlation vs. using a new coefficient of concordance. (Again, I will try to supply references.) I believe this controversy is misguided. Critics are correct in saying that, for example, the Pearson correlation only assesses certain types of disagreement. For example, if, for two raters, one rater's ratings are consistently X units higher than another rater's ratings, the two raters will have a perfect Pearson correlation, even though they disagree on every case.
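The constant-offset case is easy to verify numerically. The following sketch uses made-up ratings in which the second rater is always exactly 2 units higher than the first.

```python
import math

def pearson(x, y):
    """Pearson correlation of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

# Hypothetical ratings; rater 2 is consistently 2 units above rater 1.
rater_1 = [2, 4, 3, 5, 1, 4]
rater_2 = [value + 2 for value in rater_1]

r = pearson(rater_1, rater_2)
mean_difference = sum(b - a for a, b in zip(rater_1, rater_2)) / len(rater_1)
exact_agreements = sum(a == b for a, b in zip(rater_1, rater_2))

print(f"Pearson r = {r:.3f}")
print(f"mean difference = {mean_difference:.1f}")
print(f"cases with identical ratings = {exact_agreements}")
```

The correlation captures the association component only; the offset shows up in the mean difference, which is the bias component.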

However, our perspective is that this is really a strength of the Pearson correlation. The goal should be to assess each component of rater agreement (association, bias, and distributional differences) separately. The problem with these other measures is precisely that they attempt to serve as omnibus indices that summarize all types of disagreement into a single number. In so doing, they provide information of relatively little practical value; as they do not distinguish among different components of disagreement, they do not enable one to identify steps necessary to improve agreement.

Here is a "generic" statement that one can adapt to answer any criticisms of this nature:

"There is growing awareness that rater agreement should be viewed as having distinct components, and that these components should be assessed distinctly, rather than combined into a single omnibus index. To this end, a statistical modeling approach to such data has been advocated (Agresti, 1992; Uebersax, 1992)."

3.2 Measuring Rater Bias

The simplest way to express rater bias is to calculate the mean rating level made by each rater.

To compare rater differences (Rater vs. Rater approach), the simplest method is to perform a paired t-test between each pair of raters. One may wish to perform a Bonferroni adjustment to control the overall (across all comparisons) alpha level. However, this is not strictly necessary, especially if one's aims are more exploratory or oriented toward informing "remedial" intervention.
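As a rough sketch of the pairwise comparison, the code below computes the paired t statistic and its degrees of freedom for each rater pair, along with a Bonferroni-adjusted per-comparison alpha level. The data are hypothetical. The sketch stops short of p-values, which require the t distribution and would normally come from a statistics package.

```python
import math
from itertools import combinations
from statistics import mean, stdev

# Hypothetical ratings: one list per rater, same eight cases in the same order.
ratings = {
    "rater_a": [3, 4, 2, 5, 4, 3, 4, 2],
    "rater_b": [4, 4, 3, 5, 5, 4, 5, 3],
    "rater_c": [3, 3, 2, 4, 4, 2, 4, 2],
}

def paired_t(x, y):
    """Paired t statistic and degrees of freedom for two raters on the same cases."""
    diffs = [a - b for a, b in zip(x, y)]
    n = len(diffs)
    standard_error = stdev(diffs) / math.sqrt(n)
    return mean(diffs) / standard_error, n - 1

pairs = list(combinations(ratings, 2))
alpha = 0.05
# Bonferroni: divide the overall alpha by the number of comparisons.
bonferroni_alpha = alpha / len(pairs)

results = {}
for i, j in pairs:
    results[(i, j)] = paired_t(ratings[i], ratings[j])
    t, df = results[(i, j)]
    print(f"{i} vs {j}: t = {t:.2f}, df = {df}")
print(f"per-comparison alpha after Bonferroni: {bonferroni_alpha:.4f}")
```

Each t statistic would then be compared against a t distribution with the stated degrees of freedom, at the adjusted alpha level if the Bonferroni adjustment is used.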

Another possibility is a one-way Analysis of Variance (ANOVA), in which raters are viewed as the independent variable and ratings are the dependent variable. An ANOVA can assess whether there are bias differences among raters considering all raters simultaneously (i.e., this is related to the Rater vs. Group approach). If the ANOVA approach is used, however, one will still want to list the mean ratings for each rater, and potentially perform "post hoc" comparisons of each rater-pair's means--this is more rigorous, but will likely produce results comparable to the t-test methods described above.
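The one-way ANOVA described here reduces to a ratio of between-rater to within-rater mean squares. A minimal sketch with hypothetical data follows; as with the t-test, the p-value for the F statistic would come from a statistics package.

```python
# Hypothetical ratings: one list per rater (the "groups" of the ANOVA).
ratings = {
    "rater_a": [3, 4, 2, 5, 4, 3, 4, 2],
    "rater_b": [4, 4, 3, 5, 5, 4, 5, 3],
    "rater_c": [3, 3, 2, 4, 4, 2, 4, 2],
}

def one_way_anova_f(groups):
    """F statistic and (between, within) degrees of freedom for a one-way ANOVA."""
    all_values = [v for group in groups for v in group]
    grand_mean = sum(all_values) / len(all_values)
    group_means = [sum(group) / len(group) for group in groups]
    ss_between = sum(len(group) * (m - grand_mean) ** 2
                     for group, m in zip(groups, group_means))
    ss_within = sum((v - m) ** 2
                    for group, m in zip(groups, group_means) for v in group)
    df_between = len(groups) - 1
    df_within = len(all_values) - len(groups)
    f_stat = (ss_between / df_between) / (ss_within / df_within)
    return f_stat, df_between, df_within

f_stat, df_between, df_within = one_way_anova_f(list(ratings.values()))
print(f"F({df_between}, {df_within}) = {f_stat:.2f}")

# The per-rater means should still be listed alongside the overall test.
for name, values in ratings.items():
    print(f"{name}: mean = {sum(values) / len(values):.2f}")
```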

If a paper is to be written for publication in a medical journal, my suggestion would be to perform paired t-tests for each rater pair and to report which or how many pairs showed significant differences. Then one should perform an ANOVA simply to obtain an overall p-value (via an F-test) and add mention of this overall p-value.

If the paper will be sent to, say, a psychology journal, it might be advisable to report results of a one-way ANOVA along with results of formal "post-hoc" comparisons of each rater pair.

3.3 Rater Distribution Differences

It is possible to calculate statistical indices that reflect the similarity of one rater's ratings distribution with that of another, or between each rater's distribution and the distribution for all ratings. However, such indices usually do not characterize precisely how two distributions differ--merely whether or not they do differ. Therefore, if this is of interest, it is probably more useful to rely on graphical methods. That is, one can graphically display the distribution of each rater's ratings, and the overall distribution, and base comparisons on these displays.
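A simple text display is often enough for this purpose. The sketch below tabulates the percentage of ratings at each level and prints a row of asterisks per level, for all raters combined and for each rater. The ratings, the 5-point scale, and the function names are assumptions made for the example.

```python
from collections import Counter

# Hypothetical ratings on a 5-point scale, one list per rater.
ratings = {
    "rater_a": [1, 2, 2, 3, 3, 3, 4, 5],
    "rater_b": [1, 1, 1, 2, 2, 3, 3, 4],
}
levels = [1, 2, 3, 4, 5]

def percent_distribution(values, levels):
    """Percentage of ratings falling at each rating level."""
    counts = Counter(values)
    return {level: 100.0 * counts[level] / len(values) for level in levels}

def print_histogram(title, values, levels, width=40):
    """Print one row of asterisks per rating level, scaled so 100% spans `width`."""
    print(title)
    for level, pct in percent_distribution(values, levels).items():
        print(f"  {level} | {'*' * round(pct / 100 * width)} {pct:.0f}%")

all_ratings = [v for values in ratings.values() for v in values]
print_histogram("All raters", all_ratings, levels)
for name, values in ratings.items():
    print_histogram(name, values, levels)
```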

4. Using the Results

Often rater agreement data is collected during a specific training phase of a project. In other cases, there is no formal training phase, but it is nonetheless expected that results can be used to increase the consistency of future ratings.

Several formal and informal methods can be used to assist these tasks. Two are described here.

4.1 The Delphi Method

The Delphi Method is a technique developed at the RAND Corporation to aid group decision making. The essential feature is the use of quantitative feedback given to each participant. The feedback consists of a numerical summary of that participant's decisions, opinions, or, as applies here, ratings, along with a summary of the average decisions, opinions or ratings across all participants. The assumption is that, provided with this feedback, a rater will begin to make decisions or ratings more consistent with the group norm.

The method is easily adapted to multi-rater, interval-level data paradigms. It can be used in conjunction with each of the three components of rater agreement already described.

4.2 Rater Bias

To apply the method to rater bias, one would first calculate the mean rating for each rater in the training phase. One would then prepare a figure showing the distribution of averages. Figure 1 is a hypothetical example for 10 raters using a 5-point rating scale.

                            *  * *
                          * *  * **  *  *<---you
               |----+----|----+----|----+----|----+----|
               1         2         3         4         5


                  Distribution of Average Rating Level
                               Figure 1

A copy of the figure is given to each rater. Each is annotated to show the average for that rater, as shown in Figure 1.

4.3 Rater Association

A similar figure is used to give quantitative feedback on the association of each rater's ratings with those of the other raters.

If one has performed a factor analysis of ratings, then the figure would show the distribution of factor loadings across raters. If not, simpler alternatives are to display the distribution of the average correlation of each rater with the other raters, or the correlation of each rater's ratings with the average of all raters (or, alternatively, with the average of all raters other than that particular rater).

Once again, a specifically annotated copy of the distribution is given to each rater. Figure 2 is a hypothetical example for 10 raters.


                      you-->*  * *   *  *   *  * * *  *
          |----+----|----+----|----+----|----+----|----+----|
          .5       .6        .7        .8        .9        1.0


             Distribution of Correlation with Common Factor
                                Figure 2

4.4 Distribution of Ratings

Finally, one might consider giving raters feedback on the distribution of their ratings and how this compares with raters overall. For this, each rater would receive a figure such as Figure 3, showing the distribution of ratings for all raters and for the particular rater.


         |                               |
         |            **                 |  **
  % of   |            **                 |  **
 ratings |       **   **   **            |  **   **
         |  **   **   **   **            |  **   **   **   **
         |  **   **   **   **   **       |  **   **   **   **   **
       0 +---+----+----+----+----+       +---+----+----+----+----+
             1    2    3    4    5           1    2    3    4    5
                 Rating Level                    Rating Level


         Distribution of Ratings             Your Distribution
              for All Raters
                                Figure 3

Use of figures such as Figure 3 might be considered optional, as, to some extent, this overlaps with the information provided in the Rater Bias figures. On the other hand, it may make a rater's differences from the group norm more clear.

4.5 Discussion of Ambiguous Cases

The second technique consists of having all raters or pairs of raters meet to discuss disagreed-on cases.

The first step is to identify the cases that are the subject of most disagreement. If all raters rate the same cases, one can simply calculate, for each case, the standard deviation of the different ratings for that case. Cases with the largest standard deviations--say the top 10%--may be regarded as ambiguous cases. These cases may then be re-presented to the set of raters, who meet as a group, discuss features of these cases, and identify sources of disagreement. Alternatively, or if not all raters rate the same cases, a similar method can be applied at the level of pairs of raters. That is, for each pair of raters I and J, the cases that are most disagreed on (cases with the greatest absolute difference between the rating by Rater I and the rating by Rater J) are reviewed by the two raters, who meet to discuss these and iron out differences.
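Both selection rules can be sketched briefly. The case identifiers, ratings, and function names below are hypothetical, and the 10% cutoff is simply the example figure mentioned above.

```python
from statistics import stdev

def most_disagreed_cases(ratings_by_case, fraction=0.10):
    """Case ids with the largest standard deviation of ratings (top `fraction`)."""
    spread = {case: stdev(values) for case, values in ratings_by_case.items()}
    n_keep = max(1, round(fraction * len(spread)))
    # Sort by descending spread; ties are broken by case id for a stable order.
    return sorted(spread, key=lambda case: (-spread[case], case))[:n_keep]

def most_disagreed_for_pair(ratings_i, ratings_j, n_keep):
    """Cases rated by both raters, ordered by largest absolute rating difference."""
    shared = ratings_i.keys() & ratings_j.keys()
    return sorted(
        shared, key=lambda case: (-abs(ratings_i[case] - ratings_j[case]), case)
    )[:n_keep]

# All raters rate the same cases: one list of ratings per case.
ratings_by_case = {
    "c01": [3, 3, 4, 3], "c02": [2, 2, 2, 3], "c03": [5, 4, 5, 5],
    "c04": [1, 5, 2, 5], "c05": [4, 4, 3, 4], "c06": [2, 3, 2, 2],
    "c07": [3, 3, 3, 4], "c08": [1, 1, 2, 1], "c09": [4, 5, 4, 4],
    "c10": [2, 2, 3, 2],
}
ambiguous = most_disagreed_cases(ratings_by_case)
print("cases for group discussion:", ambiguous)

# Raters who rated only partly overlapping sets of cases: compare pairwise.
rater_i = {"c01": 3, "c02": 4, "c03": 2}
rater_j = {"c02": 1, "c03": 2, "c04": 5}
pair_review = most_disagreed_for_pair(rater_i, rater_j, n_keep=1)
print("cases for raters I and J to review:", pair_review)
```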

References

Agresti, A. (1992). Modelling patterns of agreement and disagreement. Statistical Methods in Medical Research, 1, 201-218.

Uebersax, J. S. (1988). Validity inferences from interobserver agreement. Psychological Bulletin, 104, 405-416.

Uebersax, J. S. (1992). A review of modeling approaches for the analysis of observer agreement. Investigative Radiology, 27, 738-743.

This page maintained by John Uebersax PhD

Revised: 24 May 2000