A Brief Study of Local Maximum Solutions in Latent Class Analysis

Introduction

Studies that use latent class analysis (LCA) are prone to two problems: (1) too many latent classes, with an associated increased likelihood of a local maximum solution; and (2) failure to account for local dependence among manifest variables. It is important to understand these issues, which have not been adequately addressed in standard references. This web page considers the problem of local maximum solutions; a separate page discusses the subject of conditional dependence.

All modern LCA programs estimate parameters by the method of maximum likelihood (ML) or some variant, and all are iterative in nature. The ML solution of a latent class model (LCM) consists of the parameter values that maximize the likelihood function and its natural log, log L. Ideally the estimation algorithm will converge on the global maximum solution--the parameter values associated with the single largest log L. However, no existing LCA algorithm can distinguish between a global maximum and a local maximum of log L. If a solution is reached that is locally optimal--such that a minor change in any parameter value decreases log L--the algorithm will terminate. An algorithm only recognizes that a maximum has been reached, not whether the maximum is global or local.

The possibility of local maximum solutions with LCA has long been known (Goodman, 1974). Beyond passing mention, though, the subject has not been studied. One should not conclude from this that the problem is trivial or that it does not threaten the validity of results. Methodologists have simply not emphasized the issue. My experience is that local maxima are a more serious potential problem than one would gather from the literature. But note the word "potential" in the preceding sentence; with proper steps the problem can be mostly eliminated. The goals here, then, are to examine the problem empirically and to suggest practical steps for avoiding it.
An Empirical Study

This section describes a short empirical study of local maxima in LCA. For this I took a representative set of data, estimated the same latent class model several times, and noted the various solutions obtained.

Data

The data were responses to eight dichotomous items on problem behaviors by 834 male high school students. These self-report items address issues such as trouble with police, acts of vandalism, fighting, truancy, etc. The items and their method of collection are described in Uebersax (1997). The data are typical of the data analyzed by LCA in psychiatry, psychology and medical research. The items have a fairly strong unidimensional structure--much of their structure can be accounted for by a latent trait (e.g., a continuous trait of "deviance"; for more on this topic and how it relates to LCA, see Uebersax, 1997). But the same seems true of many LCA applications in the areas noted above. [1]

Methods

The basic method was as follows:

The data were first used as input on ten successive runs of the program LEM (Vermunt, 1997). On each run, an unconstrained six-latent-class model was specified. LEM was left to choose starting parameter values randomly for each run. Finding the global maximum solution of a six-latent-class model is a rigorous test for any LCA program. Some would maintain that one has no business estimating an unconstrained six-latent-class model in the first place--in part, because of the problem of local maxima. Nevertheless--rightly or wrongly--this is typical of the complexity of models in many applied studies.

For each run, the likelihood-ratio chi-squared (G-squared) and the estimated model parameter values were recorded. The G-squared statistic is equal to -2 log L plus a data-dependent constant; thus a global/local maximum likelihood solution corresponds to a global/local minimum G-squared solution. The criterion for convergence was a change in log L between successive iterations of less than .00000001 (1.E-8). LEM uses the loglinear formulation of the latent class model (Clogg, 1995 concisely describes the different formulations; Haberman, 1979 and Hagenaars, 1990 give details on the loglinear approach).

The procedure above was repeated using a second LCA program, PANMARK (van de Pol, Langeheine & de Jong, 1998). PANMARK uses the classical formulation of LCA, i.e., the latent class model as parameterized by Lazarsfeld and Henry (1968) and Goodman (1974). One reason for using a second program was to see whether either LCA formulation is less prone to local maxima.

The same procedure was then repeated using the Latent GOLD program (Vermunt & Magidson, 2000). Latent GOLD, like LEM, is based on the loglinear LCA formulation. Default options were reset to make the runs comparable with those of LEM and PANMARK. [2]

The entire process above was then repeated for five-, four-, three- and two-latent class models.

Results

Table 1 summarizes the main findings. Shown, for each program and each number of latent classes, is the number of times out of ten runs that the program found the global maximum solution.

[Table 1: Number of runs out of ten in which each program found the global maximum solution, by number of latent classes]
No program did well with the six-latent class model: LEM found the global maximum only half of the time; PANMARK and Latent GOLD found it two and three times, respectively. Across all programs there was only one chance in three of finding the correct solution for the six-class model in a single run. [3]

PANMARK and Latent GOLD have the option to randomly generate and test multiple sets of start values. Separate additional runs of both programs were made using each program's default system for this. [4] In these runs each program found the global maximum solution of the six-class model in only two of five runs.

All programs performed generally well with fewer latent classes. PANMARK found the global maximum solution of the five-latent class model in only four of ten runs. However, when the option for multiple random start values was used in five separate runs, PANMARK found the correct solution each time. [5]

For the six-latent class model LEM produced three separate local maximum solutions, PANMARK six, and Latent GOLD four. A total of eight different local maximum solutions were found; two were found by all three programs. [6]

A natural question to ask is "Even though local maximum solutions are technically not the 'best' solution, are they close enough for practical purposes?" The results here suggest otherwise. Tables 2 and 3 compare the parameter estimates for the global maximum solution with the best and worst (as measured by the value of G-squared) local maximum solutions found by LEM. For these comparisons a correspondence between the latent classes of different solutions was established, so as to maximize the similarity of conditional response probabilities across items and latent classes between the models.
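Aligning the classes of two solutions in this way amounts to an assignment problem. The following is a minimal sketch of one way to do it (my own illustration, not the procedure actually used for Tables 2 and 3); it assumes each solution's conditional response probabilities are held in a K x J NumPy array, and the function name is hypothetical:

    import numpy as np
    from scipy.optimize import linear_sum_assignment

    def match_classes(p_ref, p_alt):
        """Permute the classes of p_alt to best match p_ref.

        p_ref, p_alt : (K, J) arrays of conditional response probabilities
                       (K latent classes, J dichotomous items).
        """
        # cost[a, b] = mean absolute difference between class a of the
        # reference solution and class b of the alternative solution
        cost = np.abs(p_ref[:, None, :] - p_alt[None, :, :]).mean(axis=2)
        row, col = linear_sum_assignment(cost)   # Hungarian algorithm
        return col                               # p_alt[col] aligns with p_ref

    # e.g., average absolute difference after alignment (cf. Table 3):
    # perm = match_classes(p_global, p_local)
    # diff = np.abs(p_global - p_local[perm]).mean()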
Table 2 shows estimated latent class prevalences for the different
solutions. For the best-fitting local maximum solution (Solution ii)
the average absolute difference between a latent class prevalence
estimate and the corresponding value in the global maximum solution
(Solution i) was .03; for the worst-fitting local maximum solution
(Solution iv) this value was .07. Note also the differences in the
estimated prevalence of Latent Class 1 among the solutions.
[Table 2: Estimated latent class prevalences for the global maximum solution and the local maximum solutions]
Often latent class prevalences are a minor concern, especially since they may be highly influenced by the particular population and setting of a study. Incorrect estimates of conditional response probabilities are more serious, as these parameters define the nature of the subgroups or taxa inferred from the analysis.
Table 3 shows the average absolute differences between the conditional
response probabilities of the local maximum solutions and those of the
global maximum solution, by latent class.
[Table 3: Average absolute differences between the conditional response probabilities of the local maximum solutions and those of the global maximum solution, by latent class]
Considered across all latent classes, the average absolute difference between the estimated conditional response probabilities of local maximum Solution ii and the global maximum solution was .07. For local maximum Solution iv, the corresponding average absolute difference was .12. For some latent classes the average differences were as high as .25 and .36. Note that these are averages; for some individual items even greater differences occurred. These differences are large enough to affect the substantive interpretations of individual latent classes.
The results of Tables 2 and 3, then, show that the parameter estimates of local maximum solutions differ from those of the global maximum solution by an amount sufficient to affect conclusions.
Discussion

Clearly the six-latent class model posed problems for all three programs. Local maximum solutions seem often, or perhaps always, to be associated with convergence of parameter estimates to boundary values of 0 or 1. Local maximum solutions for the six-class model had from six to nine independent response probability estimates with boundary values. (The global maximum solution had nine boundary-valued response probability estimates.)
The better results for this model with LEM and Latent GOLD raise the possibility that the loglinear formulation is better than the classical formulation at avoiding local maxima in large models. Whereas in the classical model a conditional probability estimate may tend to 1/0, in the loglinear formulation the corresponding parameter would tend to positive/negative infinity. The latter situation may help prevent the algorithm from converging on suboptimal boundary solutions.
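The contrast can be stated schematically with the usual logit link between the two parameterizations (a sketch only; the programs' exact parameterizations differ in detail):

    p \;=\; \frac{e^{\lambda}}{1 + e^{\lambda}},
    \qquad p \to 1 \iff \lambda \to +\infty,
    \qquad p \to 0 \iff \lambda \to -\infty .

Under the loglinear parameterization the boundary is approached only asymptotically, which may be what keeps the algorithm away from boundary-valued solutions.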
That is merely a possibility, however. The different results might instead derive from how the programs assign random start values. Further, PANMARK seemed slightly less prone to local maxima with the four-class model. It is also significant that the same local maximum solution was often found by more than one program: such solutions reflect the nature of the "likelihood surface," which itself has local maxima, rather than the characteristics of any one program, algorithm, or formulation.
The value of the convergence criterion was found to affect the likelihood of local maximum solutions. The LEM default convergence criterion of .000001 (1.E-6) was found to be too large. [7]
The default systems of PANMARK and Latent GOLD for automatic generation and testing of multiple start values did not seem as effective as hoped with large models. However, changing the default method for PANMARK appeared to help considerably: as already noted, pursuing 100 random start value sets for 9 iterations each did not seem as effective as pursuing 25 random sets for 40 iterations each; it appeared that nine first-round iterations were simply not enough to distinguish promising from unpromising initial estimates.
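To make the two-round idea concrete, here is a minimal sketch (an assumed design of my own, not the actual code of any program discussed) of EM estimation for an unconstrained latent class model with dichotomous items, plus two-round screening of random start value sets. X is an (N, J) 0/1 NumPy array; all function names are mine:

    import numpy as np

    def em_lca(X, K, rng, tol=1e-8, max_iter=5000):
        """One EM run from random start values; returns (log L, pi, p)."""
        N, J = X.shape
        pi = rng.dirichlet(np.ones(K))            # latent class prevalences
        p = rng.uniform(0.1, 0.9, size=(K, J))    # conditional response probs
        prev_ll = -np.inf
        for _ in range(max_iter):
            # E step: log P(x_i, class k) under local independence
            log_joint = (np.log(pi)[None, :]
                         + X @ np.log(p).T + (1 - X) @ np.log(1 - p).T)
            m = log_joint.max(axis=1, keepdims=True)
            lik = np.exp(log_joint - m)
            ll = float((m.ravel() + np.log(lik.sum(axis=1))).sum())
            if ll - prev_ll < tol:                # tight criterion (cf. 1.E-8)
                break
            prev_ll = ll
            post = lik / lik.sum(axis=1, keepdims=True)
            # M step: update prevalences and response probabilities
            nk = post.sum(axis=0)
            pi = nk / N
            p = np.clip((post.T @ X) / nk[:, None], 1e-12, 1 - 1e-12)
        return ll, pi, p

    def screened_multistart(X, K, n_sets=25, first_round=40, keep=3, seed=1):
        """Round 1: run each random start set briefly; round 2: run the most
        promising sets to full convergence. (EM is deterministic from a given
        start, so re-running from the same seed 'continues' a round-1 set.)"""
        seeds = np.random.default_rng(seed).integers(0, 2**31, size=n_sets)
        short = [(em_lca(X, K, np.random.default_rng(s),
                         max_iter=first_round)[0], s) for s in seeds]
        best = sorted(short, reverse=True)[:keep]
        full = [em_lca(X, K, np.random.default_rng(s)) for _, s in best]
        return max(full, key=lambda r: r[0])      # largest log L wins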
While both PANMARK and Latent GOLD use the above method of multi-step testing, one might prefer a simpler method: just generate a number of random sets of start values, say 10 to 20, and estimate each to full convergence.
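The simpler strategy, again as a sketch, reusing the hypothetical em_lca helper from the preceding sketch:

    import numpy as np

    def multistart_lca(X, K, n_starts=20, seed=1):
        """Estimate from n_starts random start sets, each run to full
        convergence, and keep the solution with the largest log L."""
        rng = np.random.default_rng(seed)
        runs = [em_lca(X, K, rng) for _ in range(n_starts)]
        return max(runs, key=lambda r: r[0])      # (log L, prevalences, probs)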
Latent GOLD has the option to use Bayes constants in estimation. These apply a Bayesian prior distribution to probability parameter values--such that values near the boundaries of 0 or 1 are considered less likely. [8] Different values for the Bayes constant associated with conditional response probabilities were tested to find one that would help avoid local maxima but was not too large. With a value of 10 for the constant, Latent GOLD found the same solution in five of five runs for the six-class model. [9] The substantive interpretation of the solution was very close to that of the global ML Solution i. It appeared, therefore, that use of the Bayes constant succeeded in avoiding local maxima without distorting results.
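Latent GOLD's exact formula is given in its manual; a common device with the same qualitative effect is a symmetric Beta prior, i.e., b pseudo-observations added in the M step. The sketch below (mine; b is analogous to, not identical with, Latent GOLD's Bayes constant) is a drop-in replacement for the response-probability update in the em_lca sketch above:

    def m_step_bayes(post, X, b=10.0):
        """post: (N, K) posterior class memberships; X: (N, J) 0/1 data.
        A symmetric Beta prior adds b/2 pseudo-observations of each response,
        pulling estimates away from the 0/1 boundaries (b = 0 gives ordinary
        ML; the result is then a posterior mode, not an ML estimate)."""
        nk = post.sum(axis=0)                     # expected class sizes
        return (post.T @ X + b / 2.0) / (nk[:, None] + b)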
Recommendations

1. Use no more latent classes than necessary
Several ways to accomplish this are as follows:
Allow for local dependence. One way to reduce the number of latent classes needed is to model local dependence that may exist among items. Local dependence means that items are associated within a latent class. This is technically a violation of the basic latent class model, though it can be compensated for by adding latent classes. Handling the problem by adding latent classes, however, is undesirable for at least three reasons: (1) with more latent classes, local maximum solutions are more likely; (2) the extra latent classes tend not to correspond to real (e.g., biological or genetic) population subgroups; and (3) if items are locally dependent, then an LCM that posits local independence is incorrect a priori.
It is much better to relax the latent class model to explicitly allow local dependence of the items involved. This can be easily done, as described on a separate web page.
Consider bootstrapping. Perhaps the best statistical criterion for determining the number of latent classes is the nonparametric bootstrap method (Langeheine, Pannekoek & van de Pol, 1996; van der Heijden, 't Hart & Dessens, 1997; von Davier, 1997). This is especially helpful when there are many (e.g., more than 7 or 8) items, such that the G-squared statistic cannot be assumed to have an approximate chi-squared distribution, and its use to statistically test model fit is doubtful. In this case, the parametric bootstrap can be used to statistically test model fit (a sketch appears after this list). Potentially this method may show that a model with fewer latent classes fits, making it unnecessary to test models with more latent classes (it is possible, however, that the reverse may be true). PANMARK 3 will perform a nonparametric bootstrap test of model fit.
Be wary of solutions with many parameter estimates of 1 and/or 0. The more boundary-value parameter estimates there are in a given solution, the more one should be concerned that the solution is a local maximum. Some maintain that the presence of many boundary-valued parameter estimates is a sign that data have been overfit. Boundary-value estimates also present possible interpretive difficulties.
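The following is a minimal sketch of the parametric bootstrap test of fit mentioned above (not PANMARK's implementation; it reuses the hypothetical em_lca and multistart_lca helpers from the earlier sketches, and X is again an (N, J) 0/1 array):

    import numpy as np

    def g_squared(X, pi, p):
        """Likelihood-ratio chi-squared from observed vs. expected counts."""
        N = len(X)
        patt, obs = np.unique(X, axis=0, return_counts=True)
        # expected count of each observed response pattern under the model
        cell = (patt[:, None, :] * p[None, :, :]
                + (1 - patt[:, None, :]) * (1 - p[None, :, :])).prod(axis=2)
        exp = N * (cell @ pi)
        return 2.0 * (obs * np.log(obs / exp)).sum()

    def bootstrap_p(X, K, R=200, seed=1):
        rng = np.random.default_rng(seed)
        _, pi, p = multistart_lca(X, K)
        g_obs = g_squared(X, pi, p)
        exceed = 0
        for _ in range(R):
            # simulate a data set of the same size from the fitted model
            z = rng.choice(K, size=len(X), p=pi)
            Xb = (rng.random((len(X), X.shape[1])) < p[z]).astype(int)
            _, pib, pb = multistart_lca(Xb, K, n_starts=5)
            exceed += g_squared(Xb, pib, pb) >= g_obs
        return (exceed + 1) / (R + 1)      # bootstrap p-value for model fit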
2. Use a tight convergence criterion
The convergence criterion (change in log L between two successive iterations) should be no larger than .00000001 (1.E-8). If possible, it should be set smaller--e.g., 1.E-10.
3. Always test multiple start values
It appears unwise to assume that the results of a single run of any LCA program will produce a global maximum solution--even with as few as three latent classes. A suggested protocol is the simple one described earlier: generate 10 to 20 random sets of start values, estimate each to full convergence, and retain the solution with the largest log L (see the multistart_lca sketch above).
4. Consider Bayesian methods
Though the present study was too limited to draw any strong conclusions about the effectiveness of Bayes constants for avoiding local maxima, the results were encouraging. Researchers, then, may wish to experiment with this method, which is currently available with Latent GOLD.
Bayesian prior distributions can also be placed on parameter values
using Markov Chain Monte Carlo (MCMC) and Gibbs sampler estimation.
Garrett and Zeger (in press) recently described use of these methods for
LCA.
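For orientation, here is a bare-bones sketch of Gibbs sampling for the latent class model--the general recipe only, under conjugate Beta/Dirichlet priors, not Garrett and Zeger's specific method; all names are mine:

    import numpy as np

    def gibbs_lca(X, K, n_iter=2000, seed=1):
        rng = np.random.default_rng(seed)
        N, J = X.shape
        pi = np.full(K, 1.0 / K)
        p = rng.uniform(0.2, 0.8, size=(K, J))
        draws = []
        for _ in range(n_iter):
            # 1. sample each subject's class given current parameters
            logw = (np.log(pi)[None, :]
                    + X @ np.log(p).T + (1 - X) @ np.log(1 - p).T)
            w = np.exp(logw - logw.max(axis=1, keepdims=True))
            w /= w.sum(axis=1, keepdims=True)
            z = (w.cumsum(axis=1) > rng.random((N, 1))).argmax(axis=1)
            # 2. sample prevalences from their Dirichlet full conditional
            nk = np.bincount(z, minlength=K)
            pi = rng.dirichlet(1.0 + nk)
            # 3. sample response probabilities from Beta full conditionals
            ones = np.array([X[z == k].sum(axis=0) for k in range(K)])
            p = rng.beta(1.0 + ones, 1.0 + nk[:, None] - ones)
            draws.append((pi.copy(), p.copy()))
        return draws   # discard an initial burn-in before summarizing

Note that before summarizing the draws one must deal with label switching (the classes may swap identities across iterations), a standard complication of MCMC for mixture-type models.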
5. When appropriate, use unidimensional latent class models
Unidimensional latent class models, also known as discrete latent trait models (Heinen, 1996; Lindsay, Clogg & Grego, 1991; Uebersax, 1993a) are appropriate when there is a unidimensional structure to the data. This is often the case in psychiatric, psychological, medical, and related applications. By "unidimensional structure" it is meant that one can imagine the latent classes as corresponding to gradations of some underlying trait, such as disease severity (a sketch of one such parameterization follows). For a given number of latent classes, unidimensional LCMs are less prone to local maximum solutions. [10] Such models can be estimated with LEM or the LLCA program (Uebersax, 1993b).
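One common parameterization (a sketch, essentially the semiparametric Rasch form of Lindsay, Clogg & Grego, 1991, and not the only possibility) places the K classes at ordered locations on a single trait and lets item j's response probability follow a logistic curve:

    P(x_j = 1 \mid \text{class } k) \;=\;
    \frac{\exp(\theta_k - \beta_j)}{1 + \exp(\theta_k - \beta_j)},
    \qquad \theta_1 < \theta_2 < \dots < \theta_K .

Because the K x J response probabilities are generated by roughly K + J parameters, far fewer of them are free to drift to boundary values of 0 or 1 (cf. footnote 10).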
Concluding Remarks

This brief study necessarily leaves many questions unanswered. One is how results would be affected with more than eight items. Possibly with more items one can have a larger number of latent classes before local maxima become problematic. On the other hand, with more items there are more parameters, and more opportunities for boundary-value estimates, which might make local maximum solutions more common. The best way to resolve this question is to conduct analyses like those here on data sets with more items. However, without firm evidence to the contrary, it seems prudent to assume that local maxima are a potential problem any time there are five or more latent classes.
The conclusions here apply only to models with dichotomous items. The subject of local maxima with polytomous-item LCMs requires separate study.
If nothing else, it is hoped the need for further study of this issue has been made evident. Other data sets should be analyzed by the methods here. Actual rather than simulated data should be used, as the latter are generally less prone to local maxima (i.e., an algorithm usually has little difficulty recovering the parameter values used to generate the simulated data).
Notes

[1]
The data are also reasonably non-sparse: out of 256 possible response
patterns, 167 occurred at least once.
[2]
Several Latent GOLD default options were changed to make the runs comparable with those of LEM and PANMARK.
[3]
Evaluation of the matrix of second derivatives showed that the
model was not weakly identified. Therefore this does not explain
the occurrence of the local maxima.
[4]
The default PANMARK sampling method tests 100 randomly generated sets of
start values for nine iterations each and selects the best three sets.
These three sets are continued to a convergence criterion of .0035.
(The user can change the number of initial sets, the number of
first-round iterations, and the second-round convergence criterion).
The set associated with the largest log L after the second round is then
continued to full convergence.
Latent GOLD's method tests a user-specified number (for this study, 100) of randomly sampled start value sets for 20 iterations each, continues the best 10% for 40 more iterations, and continues the best one of these to full convergence.
[5]
For these runs 25 initial sets of starting values were tested for 40
iterations each; the default of nine first-round iterations did not
appear sufficient to detect good start values.
[6]
The values of G-squared for the local maximum solutions ranged from
217.75 to 226.83. The G-squared for the global maximum solution was
216.67.
[7]
Several times it was observed that the change in log L between
successive iterations, after falling below .000001 (1.E-6), increased
again for many iterations before decreasing. In one run this happened
several times.
[8]
The method works by penalizing log L. The penalty increases as
parameter values approach 0 or 1. With use of Bayes constants, the
solution is technically no longer an ML solution, but instead a maximum
posterior likelihood solution.
[9]
This was presumably the global maximum of the penalized likelihood. The
value of G-squared increased to 224.06 (df = 202, p = .137).
[10]
That is so because they place structural constraints on conditional
response probabilities, making them less likely to achieve boundary
values. One might also speculate that with such models the frequency of
local maximum solutions does not increase with the number of latent
classes, due to the nature of the constraints.
References

Clogg CC. Latent class models. In Arminger G, Clogg CC, Sobel ME (eds), Handbook of statistical modeling for the social and behavioral sciences (pp. 311-359). New York: Plenum, 1995.
Garrett ES, Zeger SL. Latent class model diagnosis. Biometrics, in press.
Goodman LA. Exploratory latent structure analysis using both identifiable and unidentifiable models. Biometrika, 1974, 61, 215-231.
Haberman SJ. Qualitative data analysis: Vol. 2. New developments. New York: Academic Press, 1979.
Hagenaars JA. Categorical longitudinal data. Newbury Park, California: Sage, 1990.
Heinen T. Latent class and discrete latent trait models: Similarities and differences. Thousand Oaks, California: Sage, 1996.
Langeheine R, Pannekoek J, van de Pol, F. Bootstrapping goodness-of-fit measures in categorical data analysis. Sociological Methods and Research, 1996, 24, 492-516.
Lazarsfeld PF, Henry NW. Latent structure analysis. Boston: Houghton Mifflin, 1968.
Lindsay B, Clogg CC, Grego J. Semiparametric estimation in the Rasch model and related exponential response models, including a simple latent class model for item analysis. Journal of the American Statistical Association, 1991, 86, 96-107.
Uebersax JS. Statistical modeling of expert ratings on medical treatment appropriateness. Journal of the American Statistical Association, 1993a, 88, 421-427.
Uebersax JS. LLCA: Located latent class analysis. Computer program documentation, 1993b.
Uebersax JS. Analysis of student problem behaviors with latent trait, latent class, and related probit mixture models. In Rost J, Langeheine R (eds), Applications of latent trait and latent class models in the social sciences (pp. 188-195). New York: Waxmann, 1997.
van de Pol F, Langeheine R, de Jong W. PANMARK user manual, version 3. Netherlands Central Bureau of Statistics, Voorburg, The Netherlands, 1998.
van der Heijden P, 't Hart H, Dessens J. A parametric bootstrap procedure to perform statistical tests in a LCA of anti-social behaviour. In Rost J, Langeheine R (eds), Applications of latent trait and latent class models in the social sciences (pp. 196-208). New York: Waxmann, 1997.
Vermunt JK. LEM: A general program for the analysis of categorical data. Tilburg University, Department of Methodology, 1997.
Vermunt JK, Magidson J. Latent GOLD User's Guide. Belmont, Mass.: Statistical Innovations, Inc., 2000.
von Davier M. Bootstrapping goodness-of-fit statistics for sparse categorical data: results of a Monte Carlo study. Methods of Psychological Research, 1997, 2(2). ( http://www.pabst-publishers.de/mpr/issue3/art5/article.html )
This page maintained by:
John Uebersax Revised: 10 August 2000