The advanced version of POLYCORR has several features that the basic version does not include. The basic version, with only a few simple commands, is mainly intended as a teaching tool. Features added in the advanced version include: the ability to use different numbers of levels for the row and column variables, the ability to combine data cells, and options to control the accuracy of estimation. Appendix A gives a complete list of the advanced features.
Data are supplied as a table of observed frequencies. Output includes the polychoric correlation and its standard error, estimated thresholds and, possibly, their standard errors, and model fit statistics.
The user can choose joint maximum likelihood (ML) or two-step estimation
of parameters (Drasgow, 1988).
2 Running POLYCORR
To run the advanced version of POLYCORR, while in DOS or a DOS window,
navigate to the directory where XPC.EXE resides. Type the command:
XPC
The program can also be run from the Windows File Manager, or from the Windows "Run program" prompt. If you are using a pre-Pentium machine without a math coprocessor, this version of XPC will not run; contact the author to obtain a suitable version.
The program will first prompt for input and output filenames. In response to each of these prompts, supply a valid DOS file name, including, if appropriate, a path, for example:
c:\datasets\laruche.xpc
Simply pressing the Return key will cause the default file name to be used. The default input and output filenames are input.txt and output.txt, respectively.
Numbers will then scroll past as the program runs. These are the values of the likelihood-ratio chi-squared statistic calculated at each iteration. These values should generally decrease.
With two-step estimation, fewer than 50 iterations may be needed; with
joint ML estimation, 1000 or more may be required for a large table. If
the program doesn't converge in the number of allotted iterations, enter
a "1" in Command Line 3 of the input file and re-run the program.
POLYCORR will then resume estimation where it left off.
3 Input File
To run POLYCORR you must construct an input file. It must contain
14 command lines and the data to be analyzed. (It may also
contain meta-cell definitions, as described below.)
3.1 Command lines
The 14 command lines of the input file are as follows:
Line 1. A run title of up to 80 characters.
Line 2. Maximum number of iterations. One can usually leave this set at 5000. (It is unlikely that that many iterations will be needed.)
Line 3. Use previous start values? The default value of 0 causes POLYCORR to begin estimation with default start values that it calculates. A value of 1 means the user will instead supply start values (see User-supplied Start Values).
Line 4. Levels for Item/Rater 1. This is the number of ordered categories associated with the first Item or Rater (the number of rows of the input table). The current maximum is 18.
Line 5. Levels for Item/Rater 2. This is the number of ordered categories associated with the second Item or Rater (the number of columns of the input table). The current maximum is 18.
Line 6. Estimation method. The default value of 0 means joint ML estimation will be used. A value of 1 specifies two-step estimation.
Line 7. Criterion. The default value of 0 means POLYCORR will minimize the likelihood-ratio chi-squared (G-squared) statistic. G-squared is equal to -2 log L plus a data-dependent constant. Therefore minimizing G-squared is equivalent to maximizing log L; that is, it produces maximum likelihood (ML) estimates. A value of 1 specifies minimization of the Pearson chi-squared (X-squared) statistic. Estimated parameter standard errors are not calculated with minimum-X-squared estimation.
Line 8. This option is reserved for future use. Specify a value of 0 or leave this line blank.
Line 9. Suppress standard errors. The default value of 0 means standard errors will be estimated. A value of 1 suppresses standard error calculation.
The following lines are more technical. Many users can leave these set to the default values of 0.
Line 10. Algorithm used to calculate normal cdf. The default value of 0 means ALNORM (Applied Statistics algorithm AS 66) is used to calculate values for the normal cumulative distribution function (cdf). This should be adequate for most applications. If this value is 1 POLYCORR will use a more accurate cdf routine (NORMP). If the value is 2, an alternative accurate routine (NPROB) is used.
Line 11. Latent trait range. This defines the range of the latent trait over which integration is performed in the calculation of expected frequencies. The default value is the range (relative to a standard normal curve) of -/+ 5. To extend the range, supply a (positive) value of up to 10.0. The latent trait range will be set to minus/plus this value; for example, if 10.0 is specified, the range will be from -10 to 10. The format is F4.0. (If you include a decimal place, it will override the F4.0 format, but the value must be in columns 1-4).
Line 12. Number of quadrature points for integration. Integration is performed by dividing the latent trait into a finite number of equally-spaced points. A value of 0 in this field results in the default number of 51 points being used. It is recommended that this value not be changed without reason. For more accuracy, a larger number of up to 81 can be specified. For technical reasons it is probably better to specify an odd number. A number less than 51 will increase program speed, but this should probably not be done without a good reason (in any case, the number should never be less than 21).
Line 13. Output format. This controls the number of decimal places for printing of expected frequencies, as follows:
Value          Decimal places printed
0 (default)    2
1 to 7         1 to 7, respectively
8              8
9 or more      0
Line 14. Number of meta-cells. A meta-cell is the combination of two or more cells in the original data table. When cells are combined, their observed and expected frequencies are pooled for purposes of parameter estimation. Up to 20 meta-cells can be defined.

For Command Lines 2--14 (except Line 11), values are supplied in I4 format--that is, the integer value must (a) be in Columns 1-4 and (b) be right-justified. Leaving Columns 1-4 blank is the same as supplying a value of 0.
Comments can be supplied on Command Lines 2--14 anywhere after Column 4. It is recommended that comments be used to identify the option associated with each line.
The file input.txt supplied with POLYCORR shows proper construction of an input file.
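For further illustration, here is a sketch of a complete input file for a hypothetical 5 x 5 table, following the rules above. The title, comments, and frequency values are invented for this example, and the layout of the data table is also illustrative; the input.txt shipped with the program remains the authoritative example.

Example: two raters, five ordered categories
5000 Maximum iterations
   0 Use previous start values? (0 = no)
   5 Levels for Item/Rater 1 (rows)
   5 Levels for Item/Rater 2 (columns)
   0 Estimation method (0 = joint ML)
   0 Criterion (0 = minimize G-squared)
   0 Reserved for future use
   0 Suppress standard errors? (0 = no)
   0 Normal cdf algorithm (0 = ALNORM)
 5.0 Latent trait range (F4.0 format)
   0 Quadrature points (0 = default of 51)
   0 Output format (0 = 2 decimal places)
   0 Number of meta-cells
  12   5   1   0   0
   4  18   6   1   0
   1   5  20   4   1
   0   1   6  17   3
   0   0   1   4  10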
The elements of the meta-cell pattern matrix correspond one-for-one with the cells of the observed frequency table. Supply a "0" in the pattern matrix to show that a cell is not to be combined. Supply a positive integer from 1 to 20 to indicate meta-cell membership; all data cells with the same nonzero pattern value comprise the corresponding meta-cell. For example, all cells with a "1" in the pattern matrix define Meta-cell 1, all cells with a "2" define Meta-cell 2, etc.
The following example shows a meta-cell pattern matrix for a 5 x 5 table:

0 0 0 1 1
0 0 0 0 1
0 0 0 0 0
2 0 0 0 0
2 2 0 0 0

This specifies that cells (4, 1), (5, 1), and (5, 2) of the data table are to be combined (Meta-cell 2), and that cells (1, 4), (1, 5), and (2, 5) are to be combined (Meta-cell 1), for purposes of estimation.
Format for the pattern matrix is free-field. One or more blank lines can separate the observed frequency table and the pattern matrix.
The use of meta-cells is experimental at this point. The idea is to improve parameter estimation by reducing data sparseness. Definitely do not use meta-cells to combine entire rows or columns of the data table--doing so will make the solution unidentified; instead, collapse the rows or columns before running POLYCORR. Nonidentifiability may arise in other situations as well. There probably should be at least one cell in each row and each column that is not combined with other cells.
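The pooling that meta-cells accomplish can be illustrated with a short sketch in Python (POLYCORR itself performs this internally, in its own code; the function below is hypothetical and for illustration only):

def pool_by_metacells(freqs, pattern):
    """Pool cell frequencies according to a meta-cell pattern matrix.

    freqs   -- R x C list of cell frequencies (observed or expected)
    pattern -- R x C list of integers; 0 = cell stands alone,
               1..20 = membership in the corresponding meta-cell
    Returns pooled frequencies: one entry per uncombined cell,
    plus one entry per meta-cell.
    """
    singles, meta = [], {}
    for row_f, row_p in zip(freqs, pattern):
        for f, p in zip(row_f, row_p):
            if p == 0:
                singles.append(f)
            else:
                meta[p] = meta.get(p, 0) + f   # pool into meta-cell p
    return singles + [meta[k] for k in sorted(meta)]

# Example: the 5 x 5 pattern shown above pools six cells into two
# meta-cells, leaving 19 single cells (21 pooled frequencies total).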
Chi-squared df are calculated as (R × C) - 1 - k, where R and C are the numbers of levels of the row and column variables and k is the number of estimated parameters (for joint ML estimation, k = R + C - 1).
If meta-cells are defined, df are adjusted (reduced) accordingly.
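As a sketch of how the fit statistics and df fit together (standard formulas, shown in Python for illustration only; the observed and expected frequencies are assumed already computed):

from math import log

def fit_stats(obs, expd, R, C, k):
    """G-squared, X-squared, and df for an R x C table.

    obs, expd -- flat lists of observed and expected cell frequencies
    k         -- number of estimated parameters (R + C - 1 for joint ML)
    """
    g2 = 2.0 * sum(o * log(o / e) for o, e in zip(obs, expd) if o > 0)
    x2 = sum((o - e) ** 2 / e for o, e in zip(obs, expd))
    df = R * C - 1 - k          # reduce further if meta-cells are used
    return g2, x2, df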
If the G-squared and/or X-squared statistics show significant lack of model fit (e.g., p < .10), the user may consider the following options:
This section also reports whether the program converged or not.
4.3 Parameter estimates
This section first reports the estimated polychoric correlation (rho)
and its standard error.
Next it reports a test of a zero polychoric correlation. This is simply a chi-squared test of statistical independence for the data, to which the polychoric model reduces when rho = 0. A non-significant result means that a model that assumes a zero polychoric correlation fits the data; this can be interpreted as evidence that the null hypothesis H0: rho = 0 is tenable. At present, POLYCORR does not consider meta-cells when performing this test.
Next the Pearson correlation between the two manifest variables is reported (i.e., the correlation obtained treating the variables as interval data).
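Both quantities just described can be computed directly from the frequency table. The following Python sketch uses standard formulas (it is not POLYCORR's internal code; the df shown is the usual (R - 1)(C - 1) for an independence test, and integer scores 1..R and 1..C are assumed for the Pearson correlation):

from math import log, sqrt

def independence_and_pearson(table):
    """table: R x C list of observed frequencies.

    Returns (G2, df) for the test of independence (the rho = 0 model)
    and the Pearson correlation treating levels as interval scores.
    """
    R, C = len(table), len(table[0])
    n = sum(sum(row) for row in table)
    rowt = [sum(row) for row in table]
    colt = [sum(table[i][j] for i in range(R)) for j in range(C)]
    # Likelihood-ratio test of independence
    g2 = 2.0 * sum(table[i][j] * log(table[i][j] * n / (rowt[i] * colt[j]))
                   for i in range(R) for j in range(C) if table[i][j] > 0)
    df = (R - 1) * (C - 1)
    # Pearson correlation with integer scores
    mx = sum((i + 1) * rowt[i] for i in range(R)) / n
    my = sum((j + 1) * colt[j] for j in range(C)) / n
    sxy = sum(table[i][j] * (i + 1 - mx) * (j + 1 - my)
              for i in range(R) for j in range(C))
    sx = sqrt(sum(rowt[i] * (i + 1 - mx) ** 2 for i in range(R)))
    sy = sqrt(sum(colt[j] * (j + 1 - my) ** 2 for j in range(C)))
    return g2, df, sxy / (sx * sy)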
Following this the threshold estimates are reported. Standard errors
of threshold estimates are not calculated if two-step estimation of the
polychoric correlation is used.
4.4 Observed/expected frequencies
This section shows, for each combination of levels of the row and column
variables, the observed and expected frequency. Observed and expected
marginal frequencies are also reported.
If meta-cells have been defined, meta-cell memberships are shown.
The observed and expected meta-cell frequencies are also printed.
4.5 First derivatives
If the program meets its internal convergence criteria, the first derivative of G-squared with respect to each estimated parameter will be printed (this is the same as the first derivative of -2 log L with respect to each parameter). For a truly convergent solution, these values should be close to 0--ideally, less than 0.001. An occasional value as large as 0.1 might be no cause for concern, but a value much larger than that means the program did not converge.
First derivatives are printed twice, once to 4 decimal places, and once
in scientific notation.
5 Model and Estimation Method
5.1 Model
POLYCORR reformulates the polychoric correlation model as a latent trait
or "variable-in-common" model (Hutchinson, 1993). The approach is
explained in Uebersax (2000). The latent trait reformulation is not an approximation--it is mathematically equivalent to the usual bivariate-normal view of the polychoric correlation.
Let X1 and X2 denote the observed levels of the row and column variables, respectively, for a given case. Let Y1 and Y2 denote values of the pre-discretized continuous variables associated with X1 and X2.
The measurement model is:
Y1 = bT + e1,    Y2 = bT + e2.
In the above equations, T is a latent trait--analogous to a common factor--which Y1 and Y2 have in common and which accounts for their correlation; b is a regression coefficient, and e1 and e2 represent random errors.
The standard model assumes that the latent trait T is normally distributed. As scaling is arbitrary, we specify that T ~ N(0, 1). Error is similarly assumed to be normally distributed (and independent both between raters and across cases). A consequence of these assumptions is that Y1 and Y2 must also be normally distributed. To fix the scale, we specify that var(Y1) = var(Y2) = 1. It follows that b is the correlation of both Y1 and Y2 with the latent trait, and that b^2 is the correlation of Y1 and Y2 (it is also the polychoric correlation of X1 and X2--the correlation of the two variables we would observe if both were measured continuously).
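This variance algebra is easy to check by simulation. A minimal Python sketch (not part of POLYCORR; the value b = 0.8, the seed, and the sample size are arbitrary):

import random, math

random.seed(1)
b, n = 0.8, 200_000
se = math.sqrt(1 - b * b)           # error SD chosen so that var(Y) = 1

y1, y2 = [], []
for _ in range(n):
    t = random.gauss(0, 1)          # latent trait, T ~ N(0, 1)
    y1.append(b * t + random.gauss(0, se))
    y2.append(b * t + random.gauss(0, se))

m1 = sum(y1) / n; m2 = sum(y2) / n
cov = sum((a - c1) * (c - c2) for a, c, c1, c2
          in zip(y1, y2, [m1] * n, [m2] * n)) / n
v1 = sum((a - m1) ** 2 for a in y1) / n
v2 = sum((c - m2) ** 2 for c in y2) / n
print(cov / math.sqrt(v1 * v2), b * b)   # both approximately 0.64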
The assumptions of the polychoric correlation coefficient may be summarized as follows:
Assumption 1 is essentially true by definition. The existence of a latent trait is implied by the existence of a nonzero polychoric correlation, and vice versa. Just as with a common factor in factor analysis, the latent trait is "what the variables have in common." It may correspond to a more-or-less real but unobserved variable--such as intelligence or disease severity. Or it may simply be a shared component of variation.
Assumptions 2, 3 and 4 can be alternatively expressed as the assumption that Y1 and Y2 follow a bivariate normal distribution.
Assumption 5 is essentially true by definition, since any consistent association between the two variables is accounted for by the latent trait. Assumption 6, a standard assumption for statistical methods, is usually considered met with random sampling.
Assumptions 2, 3 and 4, then, are the main assumptions tested with model
fit statistics. Assumption 2 can be relaxed by considering other
distributional forms for the latent trait, or modeling a
nonparametric latent trait distribution. Methods for relaxing
Assumption 4 are described by Hutchinson (2000); a version of POLYCORR
that permits relaxation of this assumption is currently being tested
(users may contact the author to obtain a preliminary version).
5.2 Estimation method
Expected frequencies are calculated by numerical integration over the range of the latent trait, T; the method is described in Uebersax (1993). Bivariate integration is not necessary. At each level of T, the product of two normal cumulative distribution function (cdf) values (calculated via an accurate polynomial approximation), one associated with Y1 and one associated with Y2, is calculated.
Accuracy depends on the following: the accuracy of the normal cdf routine, the range of the latent trait over which integration is performed, and the number of quadrature points (see Command Lines 10-12). Based both on experience and reference to earlier literature (e.g., Bock and Aitkin, 1981), a latent trait range of -/+ 5 (relative to a standard normal curve) is taken as the default.
POLYCORR uses the most elementary integration method--literally "integration by rectangles." Greater efficiency could be obtained by using Simpson's rule or Gauss-Hermite quadrature. However, with 51 quadrature points over the range +/- 5, this simpler method is sufficient. (Doubling the number of quadrature points, for example, has little effect on results).
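The scheme just described can be sketched as follows. This is an illustrative Python reimplementation, not the program's own code: the cdf uses math.erf rather than ALNORM, the function name cell_probs is invented, and rho is assumed nonnegative (see Section 7 for negative rho).

from math import erf, exp, sqrt, pi

def ncdf(x):
    """Standard normal cdf via the error function."""
    return 0.5 * (1.0 + erf(x / sqrt(2.0)))

def cell_probs(rho, tau1, tau2, lo=-5.0, hi=5.0, npts=51):
    """Expected cell probabilities by 'integration by rectangles'.

    rho        -- polychoric correlation; b = sqrt(rho) in Y = b*T + e
    tau1, tau2 -- interior thresholds for the row and column variables
    """
    b = sqrt(rho)                           # assumes rho >= 0
    s = sqrt(1.0 - rho)                     # error SD, so var(Y) = 1
    t1 = [-1e9] + list(tau1) + [1e9]        # pad with -inf / +inf
    t2 = [-1e9] + list(tau2) + [1e9]
    step = (hi - lo) / (npts - 1)
    P = [[0.0] * (len(t2) - 1) for _ in range(len(t1) - 1)]
    for k in range(npts):
        t = lo + k * step
        w = exp(-0.5 * t * t) / sqrt(2.0 * pi) * step   # rectangle weight
        p1 = [ncdf((t1[i + 1] - b * t) / s) - ncdf((t1[i] - b * t) / s)
              for i in range(len(t1) - 1)]
        p2 = [ncdf((t2[j + 1] - b * t) / s) - ncdf((t2[j] - b * t) / s)
              for j in range(len(t2) - 1)]
        for i, a in enumerate(p1):
            for j, c in enumerate(p2):
                P[i][j] += w * a * c        # product of two cdf differences
    return P    # multiply by sample size N to get expected frequencies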
Parameter estimates are obtained by iteratively adjusting parameter values to find those that best fit the observed data by the criterion of maximum likelihood (or, if specified, minimum X-squared). The iterative adjustments are handled by STEPIT, a general algorithm for multivariate minimization/maximization (Chandler, 1969).
With joint ML estimation, all parameters (the polychoric correlation and thresholds) are estimated by this means. With two-step estimation, thresholds are estimated directly from cumulative marginal proportions, and only rho is estimated iteratively.
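Step one of two-step estimation amounts to inverting the normal cdf at the cumulative marginal proportions. A minimal sketch (Python's statistics.NormalDist is used here; this is not POLYCORR's own routine):

from statistics import NormalDist

def thresholds_from_margins(margin_freqs):
    """Threshold estimates from one variable's marginal frequencies:
    tau_i = Phi^{-1}(cumulative proportion through category i)."""
    n = sum(margin_freqs)
    cum, taus = 0, []
    for f in margin_freqs[:-1]:     # last category has no upper threshold
        cum += f
        taus.append(NormalDist().inv_cdf(cum / n))
    return taus

# Example: margins (10, 20, 40, 30) give thresholds at the 10th,
# 30th, and 70th percentiles of the standard normal.
print(thresholds_from_margins([10, 20, 40, 30]))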
Standard errors are calculated by inverting the observed information matrix (the matrix of second derivatives of -log L with respect to the model parameters). The observed information matrix is calculated by finite differences. For two-step estimation, when estimating the standard error of rho, the thresholds are treated as fixed parameters. This appears consistent with Drasgow (1988) and others. It is debatable, however, as the thresholds are still subject to sampling variability even when calculated from the marginals. At present, the question of standard errors for two-step estimation is left open.
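Generically, this calculation can be sketched as follows (numpy is assumed for the matrix inversion; negloglik is a placeholder for the model's -log L function, and the step size h is illustrative):

import numpy as np

def std_errors(negloglik, params, h=1e-4):
    """SEs from the observed information matrix, approximated by
    central finite differences of negloglik around the estimates."""
    p = np.asarray(params, dtype=float)
    k = len(p)
    H = np.zeros((k, k))
    for i in range(k):
        for j in range(k):
            pp = p.copy(); pp[i] += h; pp[j] += h
            pm = p.copy(); pm[i] += h; pm[j] -= h
            mp = p.copy(); mp[i] -= h; mp[j] += h
            mm = p.copy(); mm[i] -= h; mm[j] -= h
            H[i, j] = (negloglik(pp) - negloglik(pm)
                       - negloglik(mp) + negloglik(mm)) / (4 * h * h)
    cov = np.linalg.inv(H)      # inverse information = covariance matrix
    return np.sqrt(np.diag(cov))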
POLYCORR has been benchmarked against: PRELIS Version 1.0 (Joreskog &
Sorbom, 1993) for two-step estimation; against SAS PROC FREQ PLCORR and
the calculations of Tallis (1962) and Drasgow (1988) for joint ML
estimation; and against Applied Statistics algorithm AS 116 (Brown,
1977) for the tetrachoric correlation. In each case POLYCORR appears
at least as accurate as the benchmark source.
6 User-Supplied Start Values
One can specify the initial parameter values by constructing a special
file. The file, named START.XPC, has k lines, where k is the
number of estimated parameters.
For joint ML estimation, k = R + C - 1, where R is the number of row levels and C is the number of column levels. The first line gives the start value for rho. Next, on successive lines, are the start values for thresholds 2, 3, ..., R for the first item/rater (row variable), followed by start values for thresholds 2, 3, ..., C for the second item/rater (column variable). Within each variable, threshold start values must be in ascending order--that is, higher-numbered thresholds must be greater than lower-numbered thresholds. In general, one may use successive integers, e.g., -2., -1., 0., 1., 2., as start values for each rater's thresholds.
For two-step estimation, k = 1. There is only one line, containing the start value for rho.
Values must include a decimal place and appear one per line, with no blank lines. Other than that, the format is unimportant. To see an example, run POLYCORR and examine the START.XPC file it creates.
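For example, for joint ML estimation with a 4 x 4 table (k = 4 + 4 - 1 = 7), a valid START.XPC might contain (the rho start value of 0.3 is arbitrary):

0.3
-1.
0.
1.
-1.
0.
1.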
7 Negative Polychoric Correlations
A minor adjustment must be made to the latent trait model to accommodate
a negative polychoric correlation. For technical reasons, in a given
run POLYCORR will estimate rho either within the range 0 to 1.0 or -1.0
to 0, but not both.
This is unlikely to affect the user. The default start value for rho is the Pearson r calculated for the data. If the Pearson r is positive, a positive rho will be estimated; if the Pearson r is negative, a negative rho is estimated.
It is unlikely that rho would have a sign opposite to that of the Pearson r. Still, should this be the case, the user has an option. Suppose that the Pearson r is positive, and that POLYCORR attempts to estimate a positive rho. If the true rho is negative, one of two things will happen: (1) rho will be reported as 0; or (2) the program will terminate with an error message.
In either case the user should re-run the program using user-supplied start values. A negative value should be specified for the rho start value. This will cause POLYCORR to estimate a negative-valued rho.
Similarly, a user-supplied positive-valued rho will cause POLYCORR to
estimate a positive-valued rho.
8 Limitations
It is possible to construct unusual data sets where POLYCORR will fail.
(The same is probably true of any program to estimate the polychoric
correlation. For example, even SAS has reported bugs associated with
PROC FREQ PLCORR.) Estimating the polychoric correlation, like many
forms of latent structure modeling, is a fairly complex numerical
procedure and cannot be guaranteed to work in every case. However, that
does not mean one should doubt the results in the large majority of
cases where it does work.
With POLYCORR, any computational problem that might occur is usually obvious. Signs that something is wrong include a negative G-squared value or a program crash. If these occur, first try two-step estimation to see if that eliminates the problem. If that doesn't work, please send me email (including the input file) and I will try to correct the problem.
For added assurance that POLYCORR has worked correctly, examine the
first derivatives in the printed output. If these are all near-zero, it
is likely that the estimates are correct.
9 Technical Output
The STEPIT subroutine writes a small amount of output to the file STEPIT.OUT. Most users need not be concerned with this file. The most useful information is potentially the matrix of second derivatives of the objective function (in this case G-squared) with respect to the estimated model parameters, which is produced if standard errors are estimated.
POLYCORR is copyrighted (all rights reserved). It may be downloaded from this site, and the user may retain multiple copies of the downloaded version for his or her personal use. But it may not be transmitted to other users. It may not be translated to other programming languages without the express permission and consent of the author. You may not decompile, disassemble, modify, decrypt, or otherwise exploit this program.
The POLYCORR program can be downloaded at:
http://www.john-uebersax.com/bin/xpc.zip.
This user guide is available at:
http://www.john-uebersax.com/stat/xpc.htm.
I hope you find the POLYCORR program helpful. Please notify me if the program does not work correctly, or to suggest additions or changes that might make it more useful.
John Uebersax PhD
This program is distributed as-is. It has not undergone extensive testing. The author does not guarantee accuracy and assumes no responsibility for unintended consequences of its use.
Brown MB. Algorithm AS 116: the tetrachoric correlation and its standard error. Applied Statistics, 1977, 26, 343-351.
Chandler JP. STEPIT--Finds local minima of a smooth function of several parameters. Computer program abstract. Behavioral Science, 1969, 14, 81-82.
Hutchinson TP. Kappa muddles together two sources of disagreement: tetrachoric correlation is better. Research in Nursing and Health, 1993, 16, 313-315.
Hutchinson TP. Assessing the health of plants: Simulation helps us understand observer disagreements. Environmetrics, 2000, 11, 305-314.
Olsson U. Maximum likelihood estimation of the polychoric correlation coefficient. Psychometrika, 1979, 44, 443-460.
Uebersax JS. Statistical modeling of expert ratings on medical treatment appropriateness. Journal of the American Statistical Association, 1993, 88, 421-427.
Uebersax JS. The tetrachoric and polychoric correlation coefficients. (http://www.john-uebersax.com/stat/tetra.htm). July, 2000.
Appendix A
The following are the features of the advanced version of POLYCORR that, as of September, 2000, are not included in the basic version:
Appendix B
The file xpc.zip contains the following files:
xpc.htm      User guide for the POLYCORR program (advanced version); HTML format
xpc.exe      Executable version of POLYCORR (advanced version)
input.txt    Sample input file
output.txt   Sample output file
BENCHMARK\   Folder containing benchmark input and output files
Uebersax JS. User Guide for POLYCORR 1.1. Statistical Methods for Rater Agreement web site. 2007. Available at: http://john-uebersax.com/stat/xpc.htm . Accessed mmmm dd, yyyy.
Last updated: 5 November 2010 (corrected links; xpc.exe now compatible with 64-bit Windows 7)
(c) 2006-2010
John Uebersax PhD