LCA is used in a way analogous to cluster analysis (see FAQ, How does
LCA compare to other statistical methods?). That is, given a sample of
cases (subjects, objects, respondents, patients, etc.) measured on several
variables, one wishes to know if there is a small number of basic groups
into which cases fall. A more precise analogy is between LCA and a type
of cluster analysis called finite mixture estimation (Day, 1969; Titterington,
Smith & Makov, 1985; Wolfe, 1970).
Another application of LCA, more or less specific to medicine, is the evaluation
of diagnostic tests in the absence of a "gold standard." For example, if
one has several tests for detecting presence/absence of a disease, but
no comparison "gold standard" that indicates disease status with certainty,
LCA can be used to provide estimates of diagnostic accuracy (sensitivity,
specificity, proportion of correct diagnoses, etc.) of the different tests.
LCA may also serve simply as a convenient data-reduction tool.
To say this differently, latent classes are defined such that, if one
removes the effect of latent class membership on the data, all that remains
is randomness (understood here as complete independence among measures).
Paul Lazarsfeld (Lazarsfeld & Henry, 1968; see also his earlier papers),
the main originator of LCA, argued that this criterion leads to the most
natural and useful groups.
For some applications, conditional independence may be an inappropriate
assumption. For example, one may have two very similar items, such that
responses on them are probably always associated. For this and certain
related situations, extensions of the latent class model exist.
The model parameters are: (1) the prevalence of each of C case subpopulations
or latent classes (they are called 'latent' because a case's class
membership is not directly observed); and (2) conditional response probabilities--i.e.,
the probabilities, for each combination of latent class, item or variable
(the items or variables are termed the manifest variables), and
response level for the item or variable, that a randomly selected member
of that class will make that response to that item/variable. A conditional
response probability parameter, then, might be the probability that a member
of Latent Class 1 answers 'yes' to Question 1.
Consider a simple medical example with five symptoms (coded 'present'
or 'absent') and two latent classes ('disease present' and 'disease absent').
The model parameters are: (1) the prevalence of cases in the 'disease present'
and 'disease absent' latent classes (but only one of the two prevalences
needs to be estimated, since they must sum to 1.0); and (2) for each symptom
and each latent class, the probability of the symptom being present/absent
for a member of the latent class (once again, for each symptom and latent
class, only the probability of symptom presence or symptom absence needs
to be estimated, since one probability is obtained by subtracting the other
from 1.0).
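To make the structure concrete, here is a minimal Python sketch of this two-class, five-symptom model; all of the numbers are hypothetical. Because symptoms are conditionally independent within a class, the probability of any symptom pattern is a prevalence-weighted mixture of simple products:

```python
# A minimal sketch of the two-class, five-symptom model (all numbers
# hypothetical). prev[c] is the prevalence of latent class c; rho[c][j] is
# the conditional probability that a member of class c shows symptom j.
prev = [0.30, 0.70]                      # 'disease present', 'disease absent'
rho = [
    [0.90, 0.80, 0.75, 0.60, 0.85],      # P(symptom j present | class 0)
    [0.10, 0.15, 0.05, 0.20, 0.10],      # P(symptom j present | class 1)
]

def pattern_prob(pattern):
    """P(pattern), where pattern[j] = 1 if symptom j is present.
    Within a class the symptoms are conditionally independent, so each
    class-conditional probability is a simple product; the overall
    probability is the prevalence-weighted mixture over classes."""
    total = 0.0
    for c in range(2):
        cond = 1.0
        for j, x in enumerate(pattern):
            cond *= rho[c][j] if x == 1 else 1 - rho[c][j]
        total += prev[c] * cond
    return total

print(pattern_prob([1, 1, 1, 0, 1]))
```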
Parameters are estimated by the maximum likelihood (ML) criterion. The
ML estimates are the parameter values under which the observed results are most likely.
Estimation requires iterative computation, but this is fairly trivial for
a computer.
Goodman, L. A. (1974). Exploratory latent structure analysis using both identifiable and unidentifiable models. Biometrika, 61, 215-231.

Rost, J., & Langeheine, R. (1997). A guide through latent structure models for categorical data. In J. Rost & R. Langeheine (Eds.), Applications of latent trait and latent class models in the social sciences. New York: Waxmann. In fact, this entire book is a good introductory resource. It includes many papers that illustrate applications of LCA in various areas. The papers, written mainly by methodologists, convey the "state of the art" for use of LCA.

Lindsay, B., Clogg, C. C., & Grego, J. (1991). Semiparametric estimation in the Rasch model and related exponential response models, including a simple latent class model for item analysis. Journal of the American Statistical Association, 86, 96-107.

Uebersax, J. S. (1993). Statistical modeling of expert ratings on medical treatment appropriateness. Journal of the American Statistical Association, 88, 421-427. An introduction to probit discrete latent trait models, not covered by either of the above references.
For example, with three dichotomous variables coded 1 and 2, a line of the raw data (the responses of one case) might have the form:

1 2 2
More often, however, one supplies a frequency table. An indexed frequency table lists, along with each observed response pattern, the number of cases with that pattern. Lines in an indexed frequency table would have a form such as:

1 1 1 143
1 1 2  58
1 2 1  22
1 2 2  15
2 1 1  12
2 1 2  32
2 2 1  55
2 2 2 245

The alternative is a full frequency table. With this format, one supplies only the frequencies. However, the frequencies must have a precise form. First, frequencies for all possible rating patterns (even those not observed) must be supplied. Second, frequencies follow a "last variable fastest, second-to-last variable second-fastest, ..., first variable slowest" order. Here "fastest/slowest" refers to the incrementing of rating levels. The level of the fastest variable changes first; after all its levels have been completed, the level of the second-fastest variable increments, etc. The data above are in this form. As a full frequency table, they could be supplied simply as:

143 58 22 15 12 32 55 245
Raw data can be converted to either table format with standard computer
programs, such as SAS PROC FREQ.
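For illustration, the following Python sketch (hypothetical raw data) produces both table formats; it relies on the fact that itertools.product increments its rightmost variable fastest, matching the required ordering:

```python
from collections import Counter
from itertools import product

# Hypothetical raw data: one row per case, three dichotomous variables (1/2).
raw = [(1, 1, 1), (1, 2, 2), (2, 2, 2), (1, 1, 1), (2, 1, 2)]
counts = Counter(raw)

# Indexed frequency table: each observed pattern with its frequency.
for pattern, n in sorted(counts.items()):
    print(*pattern, n)

# Full frequency table: every possible pattern, whether observed or not,
# with the last variable incrementing fastest -- exactly the order in which
# itertools.product generates the patterns.
print([counts.get(p, 0) for p in product((1, 2), repeat=3)])
```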
Most current LCA programs do not allow missing values, so that a case
with missing data on any variable must be excluded.
The second thing needed is software to perform LCA. As of this writing,
no major statistical package (SAS, SPSS, Systat, etc.) includes a module
for LCA. You will therefore need a standalone program. Fortunately, there
are several good programs to choose from, some free. See the FAQ section,
What are some good programs for LCA?
With binary or nominal data, LCA is straightforward. With ordered-category
or Likert-scale data, one may wish to apply certain constraints to response
probability parameters (see FAQ, How are ordered category data handled?).
There is no technical barrier to analyzing models that combine categorical
and continuous data. At least two computer programs, Multimix and Mplus,
allow this.
For information about free and commercial LCA software, please see
the LCA software page.
The oldest forms of LCA used complicated estimation methods based on
matrix manipulation and simultaneous linear equations. A breakthrough
came when Goodman (1974) showed how simple iterative proportional
fitting could be used to find ML parameter values; this method is a type
of EM algorithm.
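For concreteness, here is a bare-bones Python sketch of EM estimation for an unconstrained latent class model with dichotomous items, applied to the example frequency table from earlier. It is not the code of any published program, just the alternating E-step/M-step logic; the names em_lca, prev, and rho are mine:

```python
import numpy as np
from itertools import product

def em_lca(patterns, freqs, n_classes, n_iter=500, seed=0):
    """EM for an unconstrained latent class model with 0/1 items."""
    rng = np.random.default_rng(seed)
    n_pat, n_items = patterns.shape
    prev = np.full(n_classes, 1.0 / n_classes)           # class prevalences
    rho = rng.uniform(0.2, 0.8, (n_classes, n_items))    # P(item = 1 | class)
    for _ in range(n_iter):
        # E step: posterior probability of each class for each pattern.
        like = np.stack([
            np.prod(rho[c] ** patterns * (1 - rho[c]) ** (1 - patterns), axis=1)
            for c in range(n_classes)], axis=1)
        joint = like * prev
        post = joint / joint.sum(axis=1, keepdims=True)
        # M step: re-estimate parameters from posterior-weighted counts.
        w = post * freqs[:, None]
        prev = w.sum(axis=0) / freqs.sum()
        rho = (w.T @ patterns) / w.sum(axis=0)[:, None]
    # Loglikelihood of the final solution.
    mix = sum(prev[c] * np.prod(rho[c] ** patterns * (1 - rho[c]) ** (1 - patterns), axis=1)
              for c in range(n_classes))
    return prev, rho, float(np.sum(freqs * np.log(mix)))

# The three-item example data, as a full frequency table.
pats = np.array(list(product((0, 1), repeat=3)))
freqs = np.array([143, 58, 22, 15, 12, 32, 55, 245], float)
print(em_lca(pats, freqs, n_classes=2))
```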
Haberman, working within a loglinear modeling framework, successfully used
the Newton-Raphson method for estimation.
More generally, estimation can be approached as a problem of
multivariate nonlinear optimization. The simplex method, gradient
methods, the Davidon-Fletcher-Powell method, and many other algorithms
(see Press et al., 1989), as implemented, for example, by subroutines in
the IMSL or NAG subroutine libraries, can be used for parameter
estimation. The advantage of approaching the problem as one of
generalized optimization is that it is very easy to apply various
constraints, including structural constraints, to model parameters. Some
subroutines also calculate asymptotic parameter standard errors and
supply output that can be used to test model identifiability. I have
found the STEPIT subroutine (Chandler, 1969) very useful.
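As a sketch of the generalized-optimization approach, the following Python code hands the negative loglikelihood to SciPy's general-purpose minimizer; the setup is the hypothetical two-class, three-item example used earlier. Probabilities are reparameterized through logit/softmax transforms so an unconstrained algorithm can be used, and constraints could be imposed simply by changing how the parameter vector is unpacked:

```python
import numpy as np
from scipy.optimize import minimize
from scipy.special import expit, softmax

# Hypothetical 2-class, 3-item setup; same example data as above.
patterns = np.array([[a, b, c] for a in (0, 1) for b in (0, 1) for c in (0, 1)])
freqs = np.array([143, 58, 22, 15, 12, 32, 55, 245], float)
C, K = 2, 3

def neg_loglik(theta):
    # Unpack: softmax keeps prevalences on the simplex, expit keeps the
    # response probabilities in (0, 1), so no explicit bounds are needed.
    prev = softmax(np.append(theta[:C - 1], 0.0))
    rho = expit(theta[C - 1:].reshape(C, K))
    mix = sum(prev[c] * np.prod(rho[c] ** patterns * (1 - rho[c]) ** (1 - patterns), axis=1)
              for c in range(C))
    return -np.sum(freqs * np.log(mix))

x0 = np.random.default_rng(1).normal(scale=0.5, size=(C - 1) + C * K)
res = minimize(neg_loglik, x0, method="BFGS")
print(-res.fun)   # maximized loglikelihood
```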
Recently, several people have successfully used Markov Chain Monte
Carlo (MCMC) and Gibbs sampling to estimate latent class models.
This remains an area of active research on latent class models.
From the conditional response probabilities, the estimated prevalence of each
latent class, and Bayes' theorem, one easily calculates the a posteriori
probability of a case's membership in each class. One may then assign the
case to the latent class with the highest a posteriori probability (modal
assignment), or leave classification "fuzzy"--i.e., view the case as belonging
probabilistically to each latent class to the degree indicated.
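A minimal Python sketch of this calculation, with hypothetical parameter estimates:

```python
import numpy as np

# Hypothetical estimates: prevalences and conditional response probabilities.
prev = np.array([0.3, 0.7])
rho = np.array([[0.9, 0.8, 0.7],    # P(item j = 1 | class 0)
                [0.1, 0.2, 0.3]])   # P(item j = 1 | class 1)

def class_posterior(pattern):
    """P(class | pattern) by Bayes' theorem, items coded 0/1."""
    pattern = np.asarray(pattern)
    cond = np.prod(rho ** pattern * (1 - rho) ** (1 - pattern), axis=1)
    joint = prev * cond               # P(class) * P(pattern | class)
    return joint / joint.sum()

post = class_posterior([1, 1, 0])
print(post, "modal assignment:", post.argmax())
```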
The fit of a latent class model is usually assessed with the likelihood-ratio chi-squared statistic

G² = 2 Σ_s n_s ln(n_s / m_s),   s = 1, ..., S,

where:
s indexes response patterns,
S is the number of observed response patterns,
n_s is the observed frequency of pattern s, and
m_s is the expected frequency of pattern s under the model.
G² has a theoretical chi-squared distribution, with degrees of freedom (df)
equal to (S - 1 - p), where p is the number of estimated model parameters.
Therefore, to assess fit of a given model, one calculates the p-value for
(G², df) from a chi-squared table or computer program (e.g., the PROBCHI
function in SAS). Since this is a goodness-of-fit test, a conservative
critical value, say one that corresponds to p = .10, is appropriate. A model
whose G² exceeds the critical value for the given df is considered not to
fit the data; otherwise the model is considered plausible.
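A small worked Python sketch, using the example data from earlier and hypothetical expected frequencies for a constrained model with five estimated parameters:

```python
import numpy as np
from scipy.stats import chi2

# Observed frequencies (from the example above) and hypothetical expected
# frequencies from some fitted, constrained model with 5 parameters.
observed = np.array([143, 58, 22, 15, 12, 32, 55, 245], float)
expected = np.array([140.2, 60.1, 23.5, 13.8, 13.1, 30.9, 53.2, 247.2])

S, p = len(observed), 5
g2 = 2 * np.sum(observed * np.log(observed / expected))
df = S - 1 - p
print(g2, df, chi2.sf(g2, df))   # p-value from the chi-squared distribution
```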
A complication may arise with large, sparse tables--this is especially
a concern where there are many multi-category variables, such that the
number of observed rating patterns is extremely large. For large sparse
tables, the G² statistic no longer has a theoretical chi-squared
distribution (Agresti & Yang, 1986). Thus statistical assessment by
the method described above is inappropriate. In this case, while it may
not be possible to statistically evaluate a single model, one may obtain
some insight by means of comparing the fit of alternative models, either
with a difference chi-squared test, or with parsimony indices. von Davier
(1997) explored the use of parametric bootstrapping to assess model fit
for large, sparse tables.
Difference Chi-Squared Test
Two latent class models (or two models of some other form, such as two
latent trait models or two loglinear models) for the same data are often
compared via the difference G² statistic. This is calculated
as the difference in the G² statistics for the two models,
with df equal to the difference in the dfs for the two models (or,
alternatively, the difference in their numbers of estimated parameters).
The difference G² statistic again has a theoretical
chi-squared distribution, and critical values and/or p-values can again
be obtained by the usual methods.
Here a significant difference implies that one model fits better than
the other; a nonsignificant difference implies no demonstrable difference
in fit. For this test, a conventional alpha level (e.g.,
p = .05) is appropriate.
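A minimal sketch of the test, with hypothetical fit results for two nested models:

```python
from scipy.stats import chi2

# Hypothetical fit results for two nested models of the same data.
g2_restricted, df_restricted = 18.4, 9   # Model B: extra constraints
g2_full, df_full = 11.2, 6               # Model A: less restrictive

diff_g2 = g2_restricted - g2_full        # 7.2
diff_df = df_restricted - df_full        # 3
print(chi2.sf(diff_g2, diff_df))         # p ~ .066: not significant at .05
```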
Some caveats apply, however, to use of the difference G²
statistic. First, the two models must be "nested": the
parameters of one model are a subset of the parameters of the other.
This usually occurs when, say, Model B is a restricted version of Model
A, constructed by placing fixed-value or equality constraints on some of
Model A's parameters. A significant difference implies that the
additional constraints--or, strictly speaking, the substantive hypotheses
that suggest the constraints--are false.
Second, for the difference G² statistic to have a theoretical
chi-squared distribution, the less restrictive model should fit the data.
Third, for large, sparse tables, the difference G² statistic
again does not have a true chi-squared distribution. Agresti and Yang (1986)
suggested that the difference G² statistic is more robust
to violations of this assumption than the ordinary G² statistic.
Often the magnitude of the difference G² is large enough to
demonstrate a substantial difference between two nested models even without
formal calculation of a p value.
Fourth, the difference G² is not appropriate for comparing
models with different numbers of latent classes--unfortunately so, since
this is often a main interest. Models that differ only in the number of
assumed latent classes are nested, but in a somewhat different way than
other nested models. Certain regularity assumptions required for the
difference G² test to have a theoretical chi-squared distribution
are not met. While some have suggested simple modifications to the difference
G² statistic to adjust for this, this approach is questionable.
Parsimony Indices
Partly due to this, there has been much recent interest in assessing
model fit via so-called information statistics. These statistics are based
mainly on the value of -2 times the loglikelihood of the model, adjusted
for the number of parameters in the model, the sample size, and, potentially,
other factors. The main idea is that, given
two models with equal loglikelihoods, the model with fewer parameters
is better. Appropriately, these measures are called parsimony indices.
Common parsimony indices include the Akaike Information Criterion (AIC),

AIC = -2 ln L + 2p;

the Schwarz Bayesian Criterion (SBC, also called BIC),

SBC = -2 ln L + p ln N;

and the CAIC,

CAIC = -2 ln L + p (ln N + 1),

where ln L is the model loglikelihood, p is the number of estimated parameters, and N is the sample size. For these indices, smaller values indicate a better balance of fit and parsimony. In comparing different models for the same data, then, one prefers models with lower values on these indices.
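A small Python sketch computing all three indices from a model's loglikelihood (the input values are hypothetical):

```python
import numpy as np

def parsimony_indices(loglik, p, n):
    """AIC, SBC (BIC), and CAIC from a model's loglikelihood,
    number of estimated parameters p, and sample size n."""
    aic = -2 * loglik + 2 * p
    sbc = -2 * loglik + p * np.log(n)
    caic = -2 * loglik + p * (np.log(n) + 1)
    return aic, sbc, caic

# Hypothetical values for a fitted model.
print(parsimony_indices(loglik=-1850.3, p=11, n=582))
```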
A more computation-intensive approach relies on bootstrapping, Monte
Carlo, or similar methods (see Aitkin et al., 1981; and especially Langeheine
et al., 1996 and van der Heijden et al., 1997). These methods require no
assumptions about the data such as those required for chi-squared tests.
Other, more "heuristic" methods include use of parsimony indices (AIC,
BIC or CAIC), a "scree"-type test (where one plots model fit against number
of latent classes, and looks for a leveling-off point of the curve), and
examination of parameter estimates (for example, one might reject
models as having too many latent classes if some latent classes are
associated with very small prevalences or have many estimated conditional
probabilities of 1 or 0.)
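The scree-type idea can be sketched as follows, with hypothetical fit results for one- through four-class models:

```python
import numpy as np

# Hypothetical results: {number of classes: (loglikelihood, n_parameters)}.
fits = {1: (-2012.5, 5), 2: (-1870.1, 11), 3: (-1851.8, 17), 4: (-1849.9, 23)}
n = 582
for c, (loglik, p) in fits.items():
    print(c, round(-2 * loglik, 1), round(-2 * loglik + p * np.log(n), 1))
# -2lnL drops sharply from 1 to 2 classes and little thereafter, and BIC is
# smallest at 2 classes: the curve "levels off" at two classes.
```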
In some applications there may be no "right" answer to the question,
How many latent classes are there? For example, in a population
of depressed patients, two latent classes of "Reactive depression" and
"Endogenous depression" may, in one sense, accurately represent the taxonic
structure. However, it may be that there are two subtypes of
Endogenous depression--so that a three-class solution is also in
some sense correct.
LCA is often called a categorical-data analogue to factor analysis.
The precise rationale for this comparison is unclear. Factor analysis is
concerned with the structure of variables (i.e., their correlations), whereas
LCA is more concerned with the structures of cases (i.e., the latent taxonomic
structure). While there is clearly some connection between these two issues,
LCA does seem more strongly related to cluster analysis than to factor
analysis.
Still, there are some methodological similarities between LCA and factor
analysis worth noting. First, both are useful for data reduction. Second,
latent classes, like factors, are unobserved constructs, inferred from
observed data. Third, the problem of determining the number of latent classes
is in certain respects analogous to that of determining the number of factors:
as the number of classes/factors increases, fit of the latent class/factor
model to the observed data improves, but one seeks a balance between fit
to the data and the number of latent classes/factors required.
There are several connections--historical and mathematical--between
LCA and latent trait analysis (LTA; including item response theory (IRT)
and Rasch models). It is common to consider LCA and LTA as two variations
of latent structure analysis. They are united by the assumption that the observed
data structure results from a latent structure. With LCA, the latent variable
that determines data structure is nominal (latent class membership). With
LTA, the latent variable that determines data structure is continuous--a
latent (continuous) trait. With both LCA and LTA, manifest variables are
assumed independent, conditional on values of the latent variable.
"In between" LCA and LTA, as it were, are discrete latent class models
(Heinen, 1996). With these models, the latent variable is discrete, and
unidimensional. There are latent classes, as with LCA, but the classes
are viewed as ordered along a latent continuum, as with LTA. (See FAQ,
What are discrete latent trait models?).
Another related statistical method is latent distribution analysis (LDA;
Mislevy, 1984; Uebersax & Grove, 1993; Qu, Tan & Kutner, 1996).
LDA also includes elements of both LCA and LTA. In LDA, there is a unidimensional,
continuous latent trait. However, relative to this continuum are two or
more separate distributions of cases--corresponding to different latent
classes. For more information about the relationship between LCA, LTA,
discrete latent trait models and LDA, see Uebersax (1997).
Grade-of-Membership (GOM) analysis (Woodbury & Manton, 1982) has
often been used to discover taxonomic structure, mainly in health-related
applications. LCA is similar to, but simpler than, GOM analysis. GOM analysis
views cases as having partial membership (grades of membership) in two
or more latent classes. With LCA, class membership is not known precisely--one
merely knows the probabilities of membership. Thus, with both methods,
class membership is "fuzzy." What distinguishes the two approaches is that
GOM analysis estimates, for each case, parameters that reflect the case's
grade of membership in each latent class--this can be a very large
number of parameters. With LCA, these parameters are not directly estimated;
however, once the other model parameters are estimated, these probabilities
are easily obtained a posteriori by Bayes' theorem. As a result, LCA
requires far fewer estimated parameters.
Connections between LCA and loglinear modeling should also be noted.
Espeland and Handelman (1988) approached LCA as a mixture of loglinear
models. Haberman (1979) and Hagenaars (1988) also approached LCA from the
standpoint of loglinear models.
The problem is like climbing a mountain in the dark. By proceeding
constantly uphill, always taking the steepest slope, you will reach the
top of whatever peak you are already on. However, the highest peak may
actually be across a valley; to reach it, you would need to first go
downhill, and then uphill again. Finding a global maximum can be
difficult for most estimation algorithms, because their strategy is to
move "uphill" at all times.
Local maxima are related to the complexity of the model; they become
more common as the number of latent classes increases. For example,
with say eight dichotomous items and only two or three latent classes,
chances are good that an algorithm will reach the global maximum. With,
say, five latent classes, however, a single run has a good chance of
reaching a local maximum.
To guard against local maxima solutions, one should run the
estimation algorithm several times with different parameter start values
and either (1) verify that the same solution is reached each time, or
(2) if there are differences, choose the best solution. The PanMark and
Latent GOLD programs have options for automatic testing of numerous
starting values.
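A sketch of this strategy in Python, reusing the hypothetical em_lca routine sketched earlier:

```python
# Guarding against local maxima: run the estimator from several random start
# values and keep the solution with the highest loglikelihood. em_lca is the
# hypothetical routine sketched earlier.
def best_of_n_starts(patterns, freqs, n_classes, n_starts=20):
    best = None
    for seed in range(n_starts):
        prev, rho, loglik = em_lca(patterns, freqs, n_classes, seed=seed)
        if best is None or loglik > best[2]:
            best = (prev, rho, loglik)
    return best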
When adequate precautions are taken, local maxima do not pose a
serious obstacle to the effective use of LCA. For more information on
this subject, including specific strategies for avoiding local maxima
solutions, see the separate page on local maxima.
Progress has been made in recent years in methods for
detecting conditional dependence and in relaxing the conditional
independence assumptions of LCA. For more detailed discussion,
including example programs, see the separate page on conditional dependence.
Discrete latent trait models are often compared with unconstrained LCA
to test whether the latent structure is unidimensional. Specifically, one
compares model fit for an unconstrained LCA model with C latent classes
to the fit of a unidimensional discrete latent trait model with C latent
classes for the same data. If the difference G² statistic for
the comparison is nonsignificant, one concludes that the latent class structure
is unidimensional.
Discrete latent trait models are also potentially helpful in problems
of measurement and scaling (Clogg, 1988).
For a basic latent class model, the covariance parameters are assumed
equal to 0, which is the same as assuming conditional independence. One
advantage of the probit latent class model, however, is precisely that
this assumption can be easily relaxed to accommodate various conditional
dependencies among manifest variables. Variances are often fixed to a constant,
say 1.0.
For binary data, with covariances equal to 0 and constant variances,
the probit latent class model is, for most practical purposes, equivalent
to the standard latent class model. However, the probit latent
class model allows useful and plausible structural constraints to be applied
to the latent class model. For example, as mentioned above, various forms
of conditional dependence may be introduced, or a unidimensional (or, say,
two-dimensional) structure imposed on latent classes. The probit LCA model
automatically provides an appropriate constraint system for ordered category
data.
The probit latent class model also provides a unifying framework for
understanding various latent structure models; a number of models, including
latent class analysis, latent trait analysis, and latent distribution analysis,
are subsumed under the model. The model also approaches mixtures of binary
or ordered-category data in precisely the same way as multivariate mixture
estimation with continuous data. Thus it leads directly to mixture estimation
models for mixed-mode measurement--that is, combinations of continuous,
binary, and ordered-category data.
For more discussion on probit latent class models, see Uebersax (in
press), available on the "Some of my papers and programs" page.
One may distinguish two types of nonidentifiability: intrinsic and empirical
nonidentifiability. With intrinsic nonidentifiability, it is the model
design--that is, the number of manifest variables, number of response levels
for each manifest variable, and number of latent classes--that results
in nonidentification; all instances of the same such design are unidentified
(with the possible exception of certain degenerate data structures). With
empirical nonidentifiability, a model may or may not be identified, depending
on the particular values of the observed data. We consider intrinsic nonidentifiability
first.
The most common cause of intrinsic model nonidentifiability is a poorly
specified model. Specifying too complex a model--usually, one with too
many latent classes--can cause the problem. For every new latent class
added to a model, more parameters require estimation; the maximum number
of estimable parameters is limited by the available degrees of freedom
(the number of unique observed rating patterns, minus 1).
For example, with three binary manifest variables, there are 2 x 2 x
2 = 8 possible rating patterns; and, if all rating patterns are observed,
(8 - 1) = 7 total df. An unconstrained two-class model requires exactly
seven independent estimated parameters (1 latent class prevalence, and,
for each latent class, three response probabilities). Because the number
of estimated parameters and total df are the same, this model is "just
identified"; a three class model in this case is nonidentified, however,
as it would require estimation of an additional latent class prevalence
and three additional response probabilities.
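This parameter-counting check is easily mechanized; the following sketch implements the necessary (though not sufficient) condition just described:

```python
# Necessary (not sufficient) identifiability condition for an unconstrained
# model with k dichotomous items and C latent classes.
def enough_df(k, C):
    df_total = 2 ** k - 1        # possible patterns minus 1
    n_params = (C - 1) + C * k   # prevalences plus response probabilities
    return n_params <= df_total

print(enough_df(3, 2))   # True: 7 parameters, 7 df ("just identified")
print(enough_df(3, 3))   # False: 11 parameters, only 7 df
```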
It happens that the basic LCA model is intrinsically unidentified whenever
there are only two manifest variables (regardless of the number of rating
levels; Goodman, 1974). In practice, this is seldom a limitation, as one
can make the model identifiable by adding one or more equality constraints
to the model, as suggested by substantive considerations.
With polytomous data, there are potentially a few specific combinations
of number of manifest variables, levels of the variables, and number of
latent classes where a model is intrinsically unidentified, even though
there appear to be sufficient total df for the number of estimated parameters.
At least one such combination is known (McCutcheon, 1987). Again, instances
such as this can be easily handled with equality constraints.
Empirical nonidentifiability has been examined less and is perhaps less
common. It occurs due to certain accidental structures of the
observed data. For example, observed data might conform perfectly to the
results expected for a two-latent class model. If one specifies a three-class
model, the model will be unidentified. This is not a common occurrence,
but it may be more likely with small sample sizes and sparse tables.
As noted above, if a model is not identified, it can usually be made
so by adding plausible equality constraints. For example, if two manifest
variables have response levels of "low," "medium," and "high," one might
require that the conditional response probability of "low" be the same for
both variables in one or more latent classes. Therefore, the main concern
is not so much nonidentifiability per se, but that a model might be
unidentified without the researcher realizing it. This is a potential
problem because, if a model is unidentified, a researcher may mistakenly
accept the results of LCA as "the" solution when, in fact, it is merely
one of many possible solutions.
Fortunately, it is fairly easy to detect nonidentifiability. The best
method tests the matrix of second partial derivatives of the loglikelihood
with respect to all free, independent estimated model parameters
(the Hessian matrix; van de Pol, Langeheine & de Jong, 1989). If this
matrix is of less than full rank, the model is not identified. A similar
method can be used based on the Jacobian matrix of partial derivatives
(Goodman, 1974; Clogg, 1977).
Typically this test is performed after the algorithm has converged on
a solution. This is slightly inefficient inasmuch as one must first obtain
estimates, only then to find, for a nonidentified model, that the estimates
are meaningless. Compounding this, one characteristic of an unidentified
model is that convergence takes an unusually long time
(in fact, this is one way nonidentifiability can be detected).
A more efficient way to check for intrinsic nonidentifiability is as
follows:

1. Select plausible, arbitrary values for all model parameters.
2. Calculate the expected frequencies implied by these parameter values, and supply these expected frequencies to the program as the "data," using the generating values as start values; the algorithm then converges almost immediately.
3. Apply the program's mathematical test of identifiability to the resulting solution.
The above supposes one is using a program such as PanMark or MLLSA that
features a mathematical test of identifiability. For software without this
feature, other methods can be used to detect nonidentifiability. One method
is to run the estimation algorithm two or more times, using the same data,
but different start values. If the same solution is reached from different
start values, the model is probably identified. Similarly, one can follow
Steps 1 and 2 above, then change the starting values and see whether the
algorithm recovers the values used to generate the first set of expected
frequencies; if so, the model is probably identified.
Derivatives can be calculated either analytically, or numerically--i.e.,
by evaluating how much the log-likelihood changes when adding and/or subtracting
a small value (delta) to/from model parameter values.
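A numerical sketch of the Hessian-based check, using central finite differences; f would be the (negative) loglikelihood as a function of the free parameter vector, and only the rank of the result matters:

```python
import numpy as np

def hessian_rank(f, theta, delta=1e-4):
    """Rank of the numerically estimated Hessian of f at theta."""
    n = len(theta)
    H = np.zeros((n, n))
    for i in range(n):
        for j in range(n):
            t = np.asarray(theta, float)
            fpp = f(t + delta * (np.eye(n)[i] + np.eye(n)[j]))
            fpm = f(t + delta * (np.eye(n)[i] - np.eye(n)[j]))
            fmp = f(t - delta * (np.eye(n)[i] - np.eye(n)[j]))
            fmm = f(t - delta * (np.eye(n)[i] + np.eye(n)[j]))
            H[i, j] = (fpp - fpm - fmp + fmm) / (4 * delta ** 2)
    return np.linalg.matrix_rank(H)

# Less than full rank (rank < len(theta)) signals nonidentification.
```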
Parameter standard errors can also be estimated using the parametric
bootstrap method. This method resamples (constructs multiple
simulated data sets) using the expected frequencies of a given latent
class model. Specifically, for a set of observed data, one first
estimates a latent class model, second, calculates the expected
frequencies given the parameter estimates so obtained, third, constructs
numerous "pseudosamples" from the expected frequencies, and fourth, fits
a latent class model to each pseudosample. The variation of a parameter
estimate across pseudosamples gives an empirical estimate of its
standard error.
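A sketch of this procedure, again reusing the hypothetical em_lca routine:

```python
import numpy as np

def bootstrap_se(patterns, expected, n_classes, n_boot=200, seed=0):
    """Parametric-bootstrap standard errors for the class prevalences.
    expected: expected frequencies from the fitted model. em_lca is the
    hypothetical routine sketched earlier. Label switching across
    pseudosamples is ignored here; real code would need to align classes."""
    rng = np.random.default_rng(seed)
    n = expected.sum()
    draws = []
    for _ in range(n_boot):
        pseudo = rng.multinomial(int(n), expected / n).astype(float)
        prev, rho, loglik = em_lca(patterns, pseudo, n_classes)
        draws.append(prev)
    return np.std(draws, axis=0)
```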
Last updated: 08 July 2009 (new domain)