User Guide

February 2008

It is assumed the reader is already familiar with the tetrachoric correlation coefficient. If not, please see here:

and here:

The tetrachoric correlation coefficient estimates the correlation between two continuous variables/constructs based on dichotomous measures of these variables.

That is, let x1 and x2 denote two dichotomous manifest variables. We assume these are dichotomized realizations of two underlying continuous variables, which we denote by y1 and y2. The tetrachoric correlation estimates, from x1 and x2 , what the correlation of y1 and y2 is.

Definitions:

• x1, x2 -- binary (e.g., 0/1) manifest variables
• y1, y2 -- continuous latent variables that correspond to x1 and x2
• θ -- A continuous latent trait or factor which represents the common variance of y1 and y2 (what they have in common)

A potential limitation of the tetrachoric correlation coefficient is that it assumes a bivariate normal distribution for y1 and y2. The glc program makes it possible to relax this assumption, providing a generalization of the tetrachoric correlation coefficient.

For this, we make use of a convenient identity. The usual tetrachoric correlation coefficient model can be re-expressed as a latent trait model. The latter replaces the latent bivariate normal distribution of y1 and y2 with (1) a continuously distributed latent trait (θ), and (2) item response functions, one for each item. When the latent trait distribution is assumed Gaussian, and the item response functions are assumed to have the form of normal ogives (i.e., cdf's of normal distributions; this basically follows from the assumption of normally-distributed measurement error; Lord & Novick, 1968), then the latent trait model and tetrachoric correlation models produce identical results (Takane & de Leeuw, 1987).

This correspondence gives us a way to relax the bivariate normality assumptions of the tetrachoric correlation model. Specifically, we may let the latent trait distribution have other distributional shapes -- including a skewed distribution.

Constructing a Skewed Distribution

A skewed latent trait distribution is easily modeled as a mixture of two Gaussian distributions. That is the approach taken here. An arbitrary skewed distribution is produced by specifying the following parameters of a two-Gaussian mixture:

1. means of the two component distributions
2. standard deviations of the two component distributions
3. prevalences of the two component distributions (mixing proportions)

Figure 1. A skewed latent trait distribution modeled as a miture of two
normal distributions.

For 1., we need only specify the difference between the two means, as the location of the distributions on the x-axis is unimportant.

These parameters are specified arbitrarily, according to ones prior judgment of how much skew the latent trait distribution has.

(Note that for two binary items, there are no df available to estimate the amount of skew from the data themselves; any form of the latent trait distribution can be perfectly fitted to the data. For 3×3 or larger tables, the skew parameters can be estimated from the data using the LTMA program.)

The glc program uses maximum likelihood estimation to estimate the correlation of the latent variables y1 and y2, or ρ(y1, y2), given the stipulated distribution parameters and the observed data. One might colloquially call this that skewed tetrachoric correlation or skewed latent correlation, but a more correct description would be the latent correlation assuming a skewed latent distribution. For simplicity, we could call this the generalized latent correlation and denote it as ρg.

Running the Program

The glc program runs in a Command Prompt (or DOS) window on Windows computers.

Easy method. Navigate to the folder where glc.exe is, and click on the file icon. A command prompt window will open automatically as the program runs. After the program finishes, press the Enter key when prompted.

Recommended method. A better approach is to manually open a command prompt window in the folder where the glc.exe program resides, and then to execute the command:

glc

The command prompt window is very easy to use; for information, consult the help page here:

After you start the program, many numbers may scroll in your command prompt window. That is normal. The more datasets you have, the longer this will take.

When prompted, press the Enter key to terminate the program.

Input File

The input file contains run parameters and data. This file:

• must be named input.txt
• must be in the folder from which you run glc
If you use the simple method, or if you manually run the program from the folder in which glc.exe resides, then input.txt must be in the same folder as glc.exe.

However, if you execute glc.exe from a different folder (which is possible in Command Prompt mode, but not recommended), then input.txt must be in your current folder.

The safest solution is to always run glc.exe from your current folder. There is no problem making copies of glc.exe and placing one in your current working folder.

The input file contains five command lines followed by data lines. The command lines are as follows:

Line 1. Title (72 characters)
Line 2. delta value. Difference between means of the component distributions.
Line 3. sd(1). Standard deviation of the first component.
Line 4. sd(2). Standard deviation of the second component.
Line 5. r(2). Prevalence of the second component.

On Lines 2-5 the numeric value must appear in the first 10 columns. Remaining columns can be used for comments.

Note that only the prevalence of the second component, r(2), needs to be specified. The prevalence of the first component, r(1), is then determined by the requirement r(1) + r(2) = 1.

The parameters sd(1) and sd(2) should almost always be set to 1.0. In any case, they mustn't be too much larger or smaller than 1.0 (e.g., within the range .5 to 2.0). That should be sufficient to model a wide range of skewed distributions.

The Excel spreadsheet mixture.xls, will let you preview various combinations of mixing parameters, and shows a plot of the resulting distribution.

Data Lines

After the command lines come the data lines.

The first data line specifies how many data sets you want to analyze.

Following this, place on successive lines the data for each study to analyze. That is, each line contains the four frequencies of a 2×2 crossclassification table.

Finally, after all datasets, place the command 'end'

An example input file looks like this:

```
Example 1:   Diagnostic agreement data (2-normal mixture)
1.000     DELTA      Difference between means
1.000     SD1        Std dev 1
1.000     SD2        Std dev 2
.200     R1         Prevalence of class 2
3
144   72    8   15
100   15   17   23
211   12   43   75
end
```

Tetrachoric Correlations

Note that you can use this program to estimate regular tetrachoric correlations. One just needs to specify a 0 separation between the means of the two component distributions. For example, you could use these values:

```    0.000     DELTA      Difference between means
1.000     SD1        Std dev 1
1.000     SD2        Std dev 2
.500     R1         Prevalence of class 2
```

Output File

Output is written to the file summary.txt.

The file shows, for each dataset, the raw data, the chi-square fit value, and two parameters, r and rho(y1, y2). r is a parameter of the latent trait model; you may disregard it. What you want is rho(y1, y2): this is the latent correlation between your two variables.

The value chisq should always be 0.00, meaning that the latent trait model was able to perfectly reproduce your observed data. If it is not 0, then there was a problem with the estimation algorithm; you are welcome to email me the data and summary.txt file to investigate.

Data Exceptions Special cases in the data are treated as follows:

• When either row marginal or either column marginal of a 2×2 table is 0, the generalized latent correlation (and the tetrachoric correlation) is undefined. Here such cases are simply reported as a 0 correlations.

• If only one cell in a given 2×2 table is equal to 0, then a discontinuity correction (adding +/- .5 to each observed frequency) is applied to the values before estimation.

• If both values on the main diagonal are 0 or both off-diagonal values are 0, then the correlation is automatically set at 1.0 or -1.0, respectively, and the estimation algorithm is not used.

At present, standard errors of the latent correlations are not estimated. Potentially these could be added to the program, as the necessary information to compute them is already produced.

The file dist.csv gives the data for the latent trait distribution in a format suitable for plotting in Excel. Five columns are shown as follows: (a) the quadrature node index (not important); ( b) the latent trait abscissa value; (c) and (d) the densities of the two component distributions, and (e) the probability density of the marginal latent trait distribution. Ignore the values in row 102.

The file output.txt contains estimation details on each dataset for which ML estimation was applied; this output is basically an artifact of the estimation engine, and you you can disregard it.

```------------------------------------------------------------------
Disclaimer:
This program is distributed 'as-is'.  The user assumes all risks.
The author is not responsible for any errors or technical problems
that may result from running it.
------------------------------------------------------------------
```

Citation

The following may be used to cite the glc program:

• Uebersax JS. Generalized latent correlation v1.0 (computer program); 2008 (Downloded from http://ourworld.compuserve.com/homepages/glc.htm ).

• Uebersax JS, Grove WM. A latent trait finite mixture model for the analysis of rating agreement. Biometrics, 49, 823-835, 1993.

References

Bock RD, Aitkin M. Marginal maximum likelihood estimation of item parameters: application of an EM algorithm. Psychometrika, 1981, 46, 443-459.

Bock RD, Lieberman M. Fitting a response curve model for dichotomously scored items. Psychometrika, 1970, 35, 179-198.

Lazarsfeld PF, Henry NW. Latent Structure Analysis, Boston: Houghton Mifflin, 1968.

Lord FM, Novick MR. Statistical Theories of Mental Test Scores. Reading, Massachusetts: Addison-Wesley, 1968.

Takane Y, de Leeuw J. On the relationship between item response theory and factor analysis of discretized variables. Psychometrika, 1987, 52, 393-408.

Uebersax JS, Grove WM. A latent trait finite mixture model for the analysis of rating agreement. Biometrics, 1993, 49, 823-835.

Go to Latent correlation with a skewed latent trait distribution
Go to Tetrachoric and polychoric correlation page
Go to Agreement Statistics
Go to Latent Structure Analysis

Last updated: 6 Febrary 2008

(c) 2008 John Uebersax PhD    email