Estimating a Latent Trait Model by Factor Analysis of Tetrachoric Correlations

Introduction

Estimating threshold parameters

Estimating correlation parameters

The tetrachoric correlation
Factoring the matrix of tetrachoric correlations
Factoring tetrachoric correlations with SAS

Example

Calculating factor scores

Final thoughts

Citation

References

Introduction

We consider here estimation of a fully Gaussian latent trait model. "Fully Gaussian" means that both the latent trait and measurement error are assumed to be normally distributed. This model, equivalent to what Item Response Theory (IRT) terms a two-parameter normal (2PN) model with a normally distributed latent trait, is often used in educational testing, social science research, and, increasingly, in health research. It is also useful--perhaps one of the best methods--for binary data factor analysis.

For general background on latent trait and IRT models (we use these terms interchangeably), the two classic texts are Lord and Novick (1968) and Lazarsfeld and Henry (1968; 1968 was a good year!). Heinen (1996) and Hulin, Drasgow and Parsons (1983) are also good sources. For background on Gaussian latent trait models see Bock and Aitkin (1981), Bock and Lieberman (1970), and Bock, Gibbons and Muraki (1988)--the last reference is less technical. For interesting discussion of newer, Bayesian methods for estimating IRT models, see Johnson and Albert (1999).

The best way to estimate the fully Gaussian latent trait model, a method developed by Bock and colleagues, is FIML (full-information maximum likelihood) estimation. With this approach, all model parameters are estimated simultaneously by maximum likelihood. The method is very effective, but, for multidimensional latent trait models, requires specialized software, such as TESTFACT (Wilson, Wood & Gibbons, 1998). [Unidimensional versions can be estimated with LEM (Vermunt, 1988), or LTM (Uebersax, 2000), both free programs.]

Another way to estimate the fully Gaussian latent trait model has long been known--in fact, it was the only option for many years. The Bock method was rightly seen as an improvement. Fairly recently, though, studies (Knol & Berger, 1991; Parry & McArdle, 1991) reexamined the older method, and concluded that it is still very viable.

This older method--sometimes called the heuristic method--is based on factor analysis of the tetrachoric correlations between item or variable pairs. (For convenience, we will use the term 'item' herein to refer generically to any binary variable.)

It is no surprise that the FIML and heuristic estimation methods produce very similar results; despite apparent differences, both are based on the same fundamental model. The FIML method formulates the problem as a standard latent trait or IRT model, whereas the heuristic method views it as a factor analysis problem. Mislevy (1986), and, more formally, Takane and de Leeuw (1987) describe how these models are the same.

In the following sections, we will describe the fully Gaussian latent trait model and its parameters, explain how to estimate the parameters by the heuristic method, and consider an example.

This discussion will not consider the case of ordered-polytomous data. However, methods similar to those described here are applicable to such data.

The Fully Gaussian Latent Trait Model

As noted above, the fully Gaussian latent trait model can be understood as a factor analysis model (to be precise, a common-factor model with probit link functions; this will become clear), or a latent trait/IRT model. In the former case, the model assumes that, underlying each dichotomous manifest variable (observed variable), X_i, is a corresponding latent continuous variable, Y_i. Dichotomous responses result from the operation of thresholds relative to the latent continuous variables.

The latent trait/IRT version is most easily described for the unidimensional case: (a) there is a normally distributed latent trait; (b) associated with each item is an item response function that gives the probability of a positive response for a respondent/examinee with each level of the latent trait; and (c) item response functions have a characteristic "S" shape--specifically, that of a "normal ogive," better known as the cumulative distribution function (cdf) of a normal curve.

Both models--the factor analysis model and the latent trait/IRT model--have two sets of parameters. These have various names; we shall simply call them the correlation parameters and the threshold parameters. We will consider these parameters mainly in terms of the factor analysis model, which is consistent with present aims.

Threshold parameters

These are the values of thresholds that discretize latent continuous variables to produce dichtomous observed values. These correspond--that is, they are related in a simple way--to what are alternatively termed location, difficulty, or b parameters in various latent trait/IRT models.

Correlation parameters

These are the correlations of latent continuous variables with the underlying factors. That is, they are latent factor loadings. These correspond--that is, again, are related by very simple functions--to what are variously known as slope, dispersion, a, or measurement error parameters in latent trait/IRT models.

We shall now consider how to estimate the threshold and correlation parameters by the heuristic method.

Estimating Threshold Parameters

Threshold parameters are estimated by applying the probit function to the marginal response proportions for each item.

The probit function

The probit function is the inverse of the cumulative distribution function (cdf) of the standard normal curve. We typically denote the latter by Phi() [i.e., F()] and the former function by Phi^-1(), or probit(). The Phi function is familiar from its use in significance testing. For some normal deviate, z, Phi(z) is the area under the standard normal curve to the left of z. For example, if z = 0, then

Phi(z) = Phi(0) = .5.

because half the standard normal curve is to the left of 0.

The probit function does the reverse: one supplies the proportion of the curve to the left of some z value, and the probit function gives the z value. So, for example,

probit(.5) = 0.,

Applying the probit function

The heuristic estimation of threshold parameters is, therefore, quite simple. First, for each item, one notes the proportion of negative responses. This is the proportion of cases that fall below the response threshold on the corresponding latent continuous variable--for example, the proportion of patients who are below a physician's threshold for making a positive diagnosis, or the proportion of respondents whose opinion strength is below the threshold required to answer "yes" to a survey question. Let p_i denote this proportion. Letting t_i denote the threshold for item i,

t_i = probit(p_i).

Doing this calculation for each item gives the heuristic item threshold estimates.

Calculating the probit function

The probit function is usually calculated with a polynomial approximation. Several good routines are available. Some are described in Abramowitz and Stegun (1972) and Kennedy and Gentle (1980).

See a simple BASIC probit function program.

Download an executable probit function program.

The thresholds calculated by this method are relative to the latent continuous variables. For a unidimensional latent trait model, one can also consider the location of the thresholds relative to the latent trait. These latter thresholds, sometimes called the location parameters in IRT, correspond to the latent trait level where probability of a positive response for an item is .5 (i.e., where the item response function is .5, at the inflection point of the "S" curve.) Letting b_i denote these thresholds,

b_i = t_i/rho_i, where rho_i is the correlation parameter for item i, calculated as described in the next section.

Estimating Correlation Parameters

To estimate the correlation parameters of the latent trait model, we will first calculate the matrix of tetrachoric correlations for all item pairs, and will then factor analyze this matrix.

The tetrachoric correlation

The tetrachoric correlation (Drasgow, 1988; Harris, 1988; the former reference is more complete) between two dichotomous items estimates the Pearson correlation one would obtain if the two constructs were measured continuously. In terms of our definitions, the tetrachoric correlation for manifest variables X_i and X_j, which we denote by r^*_ij, is equal to the Pearson correlation between the corresponding latent continuous variables Y₁ and Y₂. That is:

r^*_ij = r(Y₁, Y₂).

To dispel any possible confusion, it bears mentioning that we have referred to two different types of latent correlations. What we've termed the correlation parameters of the latent trait model are the correlations of latent continuous variables with factors. These we will derive from the tetrachoric correlations, which are the correlations of latent continuous variables with each other.

An intuitive explanation of the tetrachoric correlation

Figure 1 portrays the joint distribution of the latent continuous variables, Y₁ and Y₂, corresponding to manifest items X₁ and X₂.



Figure 1 (draft)

Bivariate normal distribution of latent continuous
variables Y₁ and Y₂ and discretizing thresholds
t₁ and t₂


             Y₂
              |                    .
              |              .       .
              |          . |          .
              |       .    |
           t  |    .    d  |  c     .
            2 +------------+----------------
              |  .      a  |  b .
              |          . |
              |   .        |
              +------------|-------------------> Y₁
                          t₁

a, b, c and d denote the proportions of cases that fall in each region defined by the two items' thresholds. For example, a is the proportion below both items' thresholds--e.g., the cases that fail or respond "no" to both items.

These proportions correspond to the summary of data as a 2 x 2 cross-classification table, as in Table 1.



Table 1 (draft)

Cross-classification of responses for Item 1 and Item 2


                              Item 1
                            0       1
                        +-------+-------+
                      0 |   a   |   b   |  p₂
              Item 2    +-------+-------+
                      1 |   c   |   d   |  q₂
                        +-------+-------+
                            p₁      q₁      1

Note that a, b, c and d here are defined as proportions, not frequencies.

Once we know the observed cross-classification proportions a, b, c and d for a study, it is a simple matter to estimate the model represented by Figure 1. Specifically, we want to know the locations of the discretizing thresholds, t₁ and t₂, and a third parameter, which determines the "fatness" of the ellipse: this third parameter is the tetrachoric correlation for Items 1 and 2, or r^*_ij.

The principle of estimation is very simple: basically, a computer program tries various combinations for t₁, t₂ and r^*, until values are found for which the expected proportions for a, b, c and d in Figure 1 are as close as possible to the actual, observed proportions in Table 1. The parameter values that do so are regarded as (estimates of) the true, population values.

In fact, our probit-based threshold estimates are exactly right (this is so for the tetrachoric, but not the polychoric correlation, a generalization of the tetrachoric to ordered category data), so we need only find the right value of r^*.

Estimating tetrachoric correlations

A computer program is necessary to calculate the tetrachoric correlation. Several algorithms are available. One method (the original approach) is based on a series expansion, called the tetrachoric series. More recent approaches rely on numerical integration. Nearly all modern algorithms are accurate to many decimal places. Older algorithms that rely on a series expansion with too few terms (e.g., a sixth-degree polynomial expansion) are potentially inaccurate. A good subroutine (Brown, 1977) can be downloaded from the Applied Statistics section of StatLib.

Download TetMat program to estimate tetrachoric correlations.

For more information about software options, click here.

SAS will calculate tetrachoric correlations, as an option for PROC FREQ. The only drawback is that it calculates these one at a time, rather than constructing a matrix of tetrachoric correlations between all item pairs. A search of the SAS web site for the term "polychoric", though, will reveal a macro for constructing such a matrix.

Once one has constructed a matrix of tetrachoric correlations between all item pairs (i, j), the next step is to factor analyze the matrix.

Factoring the matrix of tetrachoric correlations

One may factor analyze the matrix of tetrachoric correlations just as one would a matrix of Pearson correlations. One can use any software that will estimate a common factor model.

It is possible for a tetrachoric correlation matrix to not be positive definite. Knol and Berger (1991) suggest that, with ML or Generalized (Weighted) least-squares (GLS/WLS) estimation of the common factor model, if the matrix is indefinite or ill-conditioned with respect to inversion, a smoothing procedure is required; they also describe a procedure for this.

The extent to which this extra step is required is unclear. It perhaps becomes more of an issue as the number of items increases. If one uses iterated principal factor analysis or unweighted least-squares estimation, then the smoothing procedure is not required.

LISREL can be used for the factor analysis. LISREL works well for a 1-factor model (see Example below). However, it may be awkward for multi-factor models; LISREL is better suited for confirmatory than for exploratory factor analysis. If LISREL is used, one should not use ML estimation, as the assumptions of that method do not apply to tetrachoric correlations.

Factoring tetrachorics with SAS PROC FACTOR

Another possibility is to use SAS PROC FACTOR. For this, one reads the matrix of tetrachoric correlations into a special SAS data set that is specified as a correlation matrix. This data set is then used as the input for SAS PROC FACTOR.

For example, let "matrix.in" be the name of an external file (e.g., a DOS, Windows, or Unix file) that contains the tetrachoric correlations among five items, as follows:


    ITEM1      1.000       .          .          .          .
    ITEM2       .170      1.000       .          .          .
    ITEM3       .228       .189      1.000       .          .
    ITEM4       .107       .111       .187      1.000       .
    ITEM5       .067       .172       .105       .201      1.000

Note that: (1) this matrix has the form of a lower-half matrix, with main diagonal elements included; (2) the main diagonal elements are unities; (3) elements of the upper half of the matrix are explicitly supplied as missing values ("."); and (4) a variable name begins each row of the matrix.

I believe one could also supply a full correlation matrix here. However it is important to supply *some* values for the upper half of the matrix; otherwise when reading the data SAS will advance to the next line looking for the values above the main diagonal. (In theory you could use the SAS input option MISSOVER to prevent that, but not if any row of the correlation matrix occupies more than one physical line of the file.)

Also, this is just an example; in practice one might better supply the correlations to more decimal places.

A sample SAS program to analyze such data is as follows:

     * Input the matrix of tetrachoric correlations   ;
     * and store as a SAS correlation data set (refer ;
     * to SAS manual under PROC CORR for details)     ;

     data one (type=corr);
       infile 'matrix.in';
       _type_='corr';
       input _name_ $ item1-item5;
     run;

     * Do the factor analysis;

     proc factor
        data   = one
        method = prinit
        n      = 2
        rotate = v
        scree
        ;
     run;

The PROC FACTOR step requests estimation by the "PRINIT" (iterated principal factor analysis or IPFA) method, a two-factor model, varimax rotation, and a scree test of eigenvalues.

Based on limited experience, I have found the PRINIT method better for factoring tetrachorics than most other SAS factoring methods (a comparable method is available with SPSS). Knol and Berger found good results with this method, but suggested that unweighted least-squares estimation may be preferable in order to avoid possible "Heywood cases." (A Heywood case is an estimation problem where a commonality estimate becomes 1.0 or greater).

Regardless of the program used, one will find on the output a listing of factor loadings of each item on each factor. These loadings, the item--factor correlations, are the correlation parameters we seek--that is, the estimated correlations of latent continuous variables with the factors.

For a unidimensional model, we denote this value by rho_i (i.e., r_i).

For a multidimensional model, we could denote the latent correlation of item i with Factor c by rho_ic (i.e., r_ic). We could alternatively denote these as rho(Y_i, theta_c) [i.e., r(Y_i, q_c)], that is, the correlation of latent continuous variable Y_i with the c'th factor/latent trait, theta_c.

As with ordinary factor analysis, one may test models with varying numbers of factors. Or one may use analysis to confirm certain hypotheses. However, any goodness-of-fit statistic that depends on the assumption of continuous variables does not strictly apply; such statistics may roughly guide model selection and comparison, but disregard the p-values.

Example

This example uses the LSAT-6 data analyzed by Bock and Lieberman (1970) and others. The data are responses to five items of the Law School Admissions Test (LSAT) test by 1000 examinees (see Bock & Lieberman for data). We proceed in two steps.

Step 1: Calculate the threshold estimates and tetrachoric correlations.

For this, we use the PRELIS (Joreskog & Sorbom, 1988) program. The PRELIS command file is as follows:

     LSAT-6 data (Bock & Lieberman, 1970)
     da ni = 5   nobs = 1000    missing = -9    treatment = listwise
     raw-data-from file = lsat6.raw
     labels
     item1 item2 item3 item4 item5
     ordinal item1 - item5
     output  matrix = pmatrix sm = lsat6.mat sa=test.acv

These commands are for PRELIS Version 1; commands for Version 2 might differ slightly. The commands state that there are five variables (items 1 to 5), to be treated as ordinal, not continuous data. A matrix of estimated Pearson correlations (pmatrix) is requested; since the data are dichotomous, PRELIS will estimate the Pearson correlations as tetrachoric correlations. Raw data are supplied in the file LSAT6.RAW. The tetrachoric correlation matrix will be written to file LSAT6.MAT. The asymptotic variance/covariance matrix for estimated parameters is written to the file TEST.ACV. LISREL uses this matrix for Weighted Least Squares estimation (if you plan to use Unweighted Least Squares estimation with LISREL, this matrix is not needed). PRELIS will only calculate this matrix if listwise (not pairwise) deletion of missing cases is specified.

The output file produced by these commands gives the one-way item marginal totals as follows:

                 CATEGORY
      VARIABLE      1   2
                 ________

         ITEM1     76 924
         ITEM2    291 709
         ITEM3    447 553
         ITEM4    237 763
         ITEM5    130 870

From these we calculate, for each item i, the marginal proportion p_i and threshold t_i = probit(p_i). For example, for the first item,

p₁ = 76 / (76 + 924) = 76/1000 = .076

and

t₁ = probit(.076) = -1.4325.

Doing this for all items gives the heuristic threshold estimates shown in Table 2.

Note: It appears that PRELIS 2 supplies these thresholds directly; there is no need to calculate them from marginal proportions.

The matrix of tetrachoric correlations, stored in file LSAT6.MAT, is as follows:

(6D13.6)
  .100000D+01  .170316D+00  .100000D+01  .227522D+00  .189091D+00  .100000D+01
  .107186D+00  .111147D+00  .186680D+00  .100000D+01  .665002D-01  .172422D+00
  .105492D+00  .200924D+00  .100000D+01

The matrix as shown on the printed output, more readable, is:

         ESTIMATED CORRELATION MATRIX
               ITEM1      ITEM2      ITEM3      ITEM4      ITEM5
            ________   ________   ________   ________   ________
    ITEM1      1.000
    ITEM2       .170      1.000
    ITEM3       .228       .189      1.000
    ITEM4       .107       .111       .187      1.000
    ITEM5       .067       .172       .105       .201      1.000

Step 2. Factor analyze the matrix of tetrachoric correlations.

For this we use LISREL. The command file is:

FACTOR ANALYSIS OF DICHOTOMOUS VARIABLES: LSAT6 DATA -- WEIGHTED
LEAST SQUARES
DA NI=5 NO=1000 MA=PM
PM
  .100000D+01  .170316D+00  .100000D+01  .227522D+00  .189091D+00  .100000D+01
  .107186D+00  .111147D+00  .186680D+00  .100000D+01  .665002D-01  .172422D+00
  .105492D+00  .200924D+00  .100000D+01
AC FI=TEST.ACV
MO NX=5 NK=1 LX=FU,FR PH=ST TD=SY,FI
FR TD 1 1 TD 2 2 TD 3 3 TD 4 4 TD 5 5
OU

This simply requests a 1-common factor model of the data. The AC command tells LISREL where to find an asymptotic variance/covariance matrix for used with Weighted Least Squares (WLS) estimation. Because of this, LISREL will use WLS estimation even though it is not explicitly requested.

The LISREL output shows the loadings of each latent continuous variable on the common factor, as follows:

      LISREL ESTIMATES (WEIGHTED LEAST SQUARES)
              LAMBDA X

                    KSI 1
                 ________
         VAR 1       .389
         VAR 2       .397
         VAR 3       .471
         VAR 4       .377
         VAR 5       .342

These values (here labeled KSI 1) are the heuristic estimates of the correlation parameters rho_i of the latent trait model.

Table 2 compares the heuristic threshold and correlation parameter estimates with the FIML estimates reported by Bock and Lieberman (1970).


Table 2.  Heuristic and Full Information Maximum Likelihood (FIML)
Parameter Estimates for LSAT-6 Data

         ------------------------------------------------
                   Heuristic estimates    FIML estimates^a
                   -------------------   ----------------
                   Thresh-    Corre-    Thresh-    Corre-
         Item       olds     lations     olds     lations
          (i)       (t_i)      (rho_i)      (t_i)      (rho_i)
         -------------------------------------------------
           1      -1.4325     .389     -1.4329     .3856
           2       -.5505     .397      -.5505     .3976
           3       -.1332     .471      -.1332     .4732
           4       -.7160     .377      -.7159     .3749
           5      -1.1264     .342     -1.1264     .3377
         -------------------------------------------------
         ^aBock & Lieberman (1970); p. 191.

The heuristic and FIML estimates are very close, especially for the threshold parameters. Possibly, using more accurate FIML estimation than was available in 1970, the threshold estimates would be identical.

Calculating Factor Scores

For some studies one will be content simply to reveal the dimensional structure of the data. Other times one will want to locate each object in the multidimensional latent trait space. This is equivalent to calculating a factor score for each case, relative to each factor.

Bock and Aitkin (1981) describe three methods to estimate such factor scores: Maximum Likelihood (ML), Maximum a posteriori (MAP), and Expected a posteriori (EAP) estimation. The EAP estimates are fairly easily obtained, especially for a one-factor model; for details, click here .

Final Thoughts

At the outset it was suggested that the heuristic method may be useful mainly to those who do not have access to the specialized software needed for FIML estimation of a multidimensional latent trait/IRT model. However, after writing this, it occurs to me that the heuristic model may be somewhat more flexible than current versions of the FIML approach. For example, with the heuristic method, one can easily combine binary and continuous variables in the same analysis. One would simply construct a correlation matrix where correlations are estimated as:

Tetrachoric/polychoric correlations, if both variables are dichtomous/ordered categorical
Biserial/polyserial correlations, if one variable is continuous and one is dichotomous/ordered categorical
Pearson correlations, if both variables are continuous

One would then factor-analyze the resulting matrix.

Similarly, it would appear that the heuristic method can more easily permit certain uses of covariates to improve estimation of latent trait scores for individual cases.

These same things can, in principle, be done with FIML estimation, but it is unclear that software exists to do them.

To cite this article:

Uebersax JS. Estimating a latent trait model by factor analysis of tetrachoric correlations. 2000. Statistical Methods for Rater Agreement web site. Available at: http://john-uebersax.com/stat/irt.htm. Accessed mmm dd, yyyy.

References

Abramowitz M, Stegun IA. Handbook of mathematical functions. Washington DC: U. S. Government Printing Office, 1972.

Bock RD, Aitkin M. Marginal maximum likelihood estimation of item parameters: application of an EM algorithm. Psychometrika, 1981, 46, 443-459.

Bock RD, Lieberman M. Fitting a response curve model for dichotomously scored items. Psychometrika, 1970, 35, 179-198.

Bock RD, Gibbons R, Muraki E. Full-information item factor analysis. Applied Psychological Measurement, 1988, 12, 261-280.

Brown MB. Algorithm AS 116: the tetrachoric correlation and its standard error. Applied Statistics, 1977, 26, 343-351.

Drasgow F. Polychoric and polyserial correlations. In Kotz L, Johnson NL (Eds.), Encyclopedia of statistical sciences. Vol. 7 (pp. 69-74). New York: Wiley, 1988.

Gibbons RD. TESTFACT Version 3 users manual. Chicago: Scientific Software International, 1998.

Harris B. Tetrachoric correlation coefficient. In Kotz L, Johnson NL (Eds.), Encyclopedia of statistical sciences. Vol. 9 (pp. 223-225). New York: Wiley, 1988.

Heinen T. (1996). Latent class and discrete latent trait models: similarities and differences. Thousand Oaks, California: Sage, 1996.

Hulin CL, Drasgow F, Parsons CK. Item response theory. Homewood, Illinois: Dow Jones-Irwin, 1983.

Johnson VE, Albert JH. Modeling ordinal data. New York: Springer, 1999.

Joreskog KG, Sorbom, D. PRELIS 1 User's Manual, Version 1. Chicago: Scientific Software, Inc., 1988.

Kennedy WJ Jr, Gentle JE. Statistical computing. New York: Dekker, 1980.

Knol DL, Berger MP. Empirical comparison between factor analysis and multidimensional item response models. Multivariate Behavioral Research, 1991, 46, 457-477.

Lazarsfeld PF, Henry NW. Latent Structure Analysis, Boston: Houghton Mifflin, 1968.

Lord FM, Novick MR. Statistical theories of mental test scores. Reading, Massachusetts: Addison-Wesley, 1968.

Mislevy RJ. Recent developments in the factor analysis of categorical variables. Journal of Educational Statistics, 1986, 11, 3-31.

Parry CD, McArdle JJ. An applied comparison of methods for least-squares factor analysis of dichotomous variables. Applied Psychological Measurement, 1991, 15, 35-46

Takane Y, de Leeuw J. On the relationship between item response theory and factor analysis of discretized variables. Psychometrika, 1987, 52, 393-408.

Uebersax J. LTM: a program for unidimensional latent trait modeling. In preparation, 2000.

Vermunt J. LEM users manual. Tilburg, The Netherlands: Tilburg University, 1998.

Wichura MJ. Algorithm AS 241: Percentage points of the normal distribution. Applied Statistics, 1988, 37(3) 477-484.

Go to Top of page
Go to Latent Structure Analysis
Go to Statistical Methods for Rater Agreement

This page maintained by John Uebersax PhD

Revised: 10 August 2006 (corrected example matrix, new software link)