
Residual Structural Equation Modeling - A short sketch

Manuel Haqiqatkhah


Intro

This document is an informal introduction to, and a subsequent literature review of, [residual] dynamic structural equation modeling ([R]DSEM) of (intensive) longitudinal data. Although [R]DSEM is mostly applied to multilevel problems wherein the within- and between-person differences are modeled, here I assume these sources of variation can be disentangled. Then I will demonstrate how a simple, non-hierarchical case of such models can be implemented using the R package lavaan.

Without some basic knowledge of factor analysis (or, SEM with latent variables/measurement models), the reader might find the text a bit cryptic. Hence, I will first review factor analysis formally, and we will see how FA/SEM works on non-longitudinal data. If you are acquainted with FA, you can safely skip the next section. However, although it may seem tedious, I suggest not doing so, because I use a similar notation afterwards.

Before moving on, notice that I use vertical vectors in this post, and write them as transposes ($[\cdot]^\top$) of horizontal vectors to fit inline typesetting; sorry if that is too confusing.

Conventional R-factor analysis: Modeling between-person differences

Latent variable modeling (LVM) tries to estimate a (multidimensional) latent construct, AKA factor, that is believed to have "caused" the variations observed in a measurement made on a sample. This factor need not actually exist; it can be an abstract reification of the processes involved in generating the variations observed in the data, and one need not make any ontological commitments to its actual existence (what does it mean to exist, by the way?).

I know many psychometricians/psychologists/philosophers would disagree with what I just mentioned. I believe this disagreement lies (at least partly) in the definition of (ontological) existence and causality. I take a mechanistic approach towards the explanation of phenomena (as discussed in my master's thesis in AI, which I will link here when I publish it online), together with my interpretation of "vertical causation".

Anyways.

The classic case of LVM is R-factor analysis, wherein cross-sectional measurements of $p$ variables in a sample of $N$ individuals are analyzed to find out whether some (abstract) "entities" (i.e., factors) can explain the variations in the data collected from the individuals; if people differ in their responses, there should be something there to cause their differences, right?

Suppose you have collected such cross-sectional data, so that for each individual you have a $p$-dimensional vector of manifest variables (that are manifested in the measurement; also known as items in the context of psychometrics), $Y_{p\times1} = [y_1, y_2, \dots, y_p]^\top$.

The idea is, if an $m$-dimensional latent variable $\Theta_{m\times1}$ can explain the variations in the $y_i$s among the individuals, then the following system of equations would hold (assuming zero intercepts, i.e., centered variables, for the sake of simplicity):

$$
\begin{cases}
y_1 = \lambda_1 \Theta + \epsilon_1\\
y_2 = \lambda_2 \Theta + \epsilon_2\\
\quad\vdots\\
y_p = \lambda_p \Theta + \epsilon_p
\end{cases}
\tag{1}
$$

where $\lambda_i$ are $1\times m$ vectors of loadings, showing how much $\Theta$ is present (or manifested) in variable $y_i$, and the $\epsilon_i$s are the residual terms of the regressions. Since this system of equations consists of linear regression models, the assumptions of such models (most importantly, normality, independence, and homoscedasticity of residuals) should hold. The individual differences between subjects $j$ and $k$ will then be summarized in their estimated $\Theta_j$ and $\Theta_k$.

In a (more elegant) matrix notation, we have:

$$Y_{p\times1} = \Lambda_{p\times m}\,\Theta_{m\times1} + E_{p\times1}\tag{2}$$

wherein $\Lambda_{p\times m} = [\lambda_1, \lambda_2, \dots, \lambda_p]^\top$ is the loading matrix, and $E_{p\times1} = [\epsilon_1, \epsilon_2, \dots, \epsilon_p]^\top$, usually referred to as the measurement errors, basically encapsulates the unexplained variations of the items that the factor model could not capture. Remembering the assumptions of regression models, it has to be normally distributed. More formally, $E \sim \mathcal{N}(0, \Psi)$, wherein $\Psi_{p\times p}$ is the covariance matrix of the residuals. Independence of the error terms implies that $\operatorname{cov}(\epsilon_i, \epsilon_{j\neq i}) = 0$. Hence, $\Psi$ is a diagonal matrix:

$$\Psi_{p\times p} =
\begin{bmatrix}
\psi_{11} & 0 & \cdots & 0\\
0 & \psi_{22} & \cdots & 0\\
\vdots & \vdots & \ddots & \vdots\\
0 & 0 & \cdots & \psi_{pp}
\end{bmatrix}\tag{3}$$

However, the covariance matrix of the manifest variables (i.e., $S_{p\times p}$) is, in general, non-diagonal.
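To make this concrete, here is a minimal simulation sketch in R (the numbers, four items and a single factor, are all assumed for illustration): data are generated from the factor model, and the sample covariance matrix $S$ of the manifest variables turns out non-diagonal even though $\Psi$ is diagonal.

```r
# Toy example: p = 4 items, m = 1 factor (all values assumed for illustration)
set.seed(1)
n      <- 500                                      # sample size
Lambda <- matrix(c(0.8, 0.7, 0.6, 0.5), nrow = 4)  # loading matrix (p x m)
Phi    <- matrix(1, 1, 1)                          # latent covariance matrix (m x m)
Psi    <- diag(c(0.36, 0.51, 0.64, 0.75))          # diagonal residual covariance matrix

Theta <- matrix(rnorm(n, sd = sqrt(Phi[1, 1])), ncol = 1)  # latent scores (n x m)
E     <- matrix(rnorm(n * 4), ncol = 4) %*% chol(Psi)      # residuals with covariance Psi
Y     <- Theta %*% t(Lambda) + E                           # manifest data (n x p)

S <- cov(Y)      # non-diagonal, unlike Psi
round(S, 2)
```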

The factor structure (expressed in the system of linear regression models), if the $\Theta_i$s are known, implies a covariance matrix $\Sigma_{p\times p}$. However, we do not know the $\Theta_i$s; they are latent, after all. Hence, in order to estimate $\Theta$, one can try to make $\Sigma$ as similar as possible to $S$.

To do so, let the covariance matrix of the latent variables be $\Phi_{m\times m}$. Playing with the system of regression models, it can be shown that the following holds:

$$\Sigma = \Lambda\,\Phi\,\Lambda^\top + \Psi\tag{4}$$
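Continuing the simulation sketch above with the same (assumed) matrices, equation $(4)$ is a one-liner; for a large sample, the model-implied $\Sigma$ and the sample $S$ should be close.

```r
Sigma <- Lambda %*% Phi %*% t(Lambda) + Psi  # model-implied covariance, equation (4)
round(Sigma, 2)                              # compare with round(S, 2) above
```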

Eventually, we are interested in $\Lambda$. However, we also have to estimate $\Phi$ and $\Psi$ to identify the model.

To do so, $\Sigma$ (as defined by $\Lambda$, $\Phi$, and $\Psi$ in $(4)$) is estimated to resemble the actual covariance matrix of the measurements as much as possible. This can be done by minimizing the fit function $F_{ML}$ through maximum likelihood estimation:

$$F_{ML} = \ln(|\Sigma|) - \ln(|S|) + \operatorname{trace}\left[(S)(\Sigma^{-1})\right] - p\tag{5}$$
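As a quick numerical check (reusing $S$ and the model-implied $\Sigma$ from the sketches above), $F_{ML}$ can be computed directly; it is non-negative and reaches zero exactly when $\Sigma = S$.

```r
p    <- ncol(S)
F_ML <- log(det(Sigma)) - log(det(S)) + sum(diag(S %*% solve(Sigma))) - p
F_ML  # close to zero here, since the true parameters generated the data
```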

So, the estimations and calculations go as follows:

$$Y \rightarrow S \xrightarrow{\ \mathrm{ML}\ } \Lambda, \Phi, \,\&\, \Psi \qquad\qquad \Lambda \,\&\, Y_j \rightarrow \Theta_j\tag{7}$$
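Here is a minimal lavaan sketch of this flow, using the simulated data from above. Note that lavaan's internal names differ from the notation here: lavaan's `lambda` is our $\Lambda$, its `psi` is our $\Phi$, and its `theta` is our $\Psi$.

```r
library(lavaan)

df <- as.data.frame(Y)
colnames(df) <- paste0("y", 1:4)

# One-factor measurement model in lavaan syntax
model <- 'f1 =~ y1 + y2 + y3 + y4'

fit <- cfa(model, data = df)   # Y -> S -> (ML) -> Lambda, Phi, & Psi

est <- lavInspect(fit, "est")
est$lambda            # loadings (our Lambda)
est$psi               # latent covariances (our Phi)
est$theta             # residual covariances (our Psi)

head(lavPredict(fit)) # Lambda & Y_j -> Theta_j (factor scores)
```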

This estimation requires fixing some of the parameters to make the model identifiable; otherwise, the unknowns (i.e., the parameters) outnumber the knowns (i.e., the equations), and there would be an infinite number of equivalent models. Usually, this is done by fixing the variances of certain latent variables, or some loadings, to 1.
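In lavaan, for instance, the two common identification choices look like this (a sketch with the model defined above):

```r
fit_marker <- cfa(model, data = df)                 # default: first loading of each factor fixed to 1
fit_stdlv  <- cfa(model, data = df, std.lv = TRUE)  # alternative: latent variances fixed to 1
```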

Note that one can (manually) set some elements of the loading matrix to zero (so that some factors are not manifested in some $y_i$s), set some elements of the latent covariance matrix to zero, or allow non-zero off-diagonal elements of $\Psi$. The latter is known as a factor model with structured residuals, and can be shown to be equivalent to including another factor in the model (hence not violating the independence assumption on $E$).
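In lavaan syntax, such a structured residual is a freed off-diagonal element of $\Psi$ (lavaan's `theta`), declared with the `~~` operator; a minimal sketch:

```r
model_sr <- '
  f1 =~ y1 + y2 + y3 + y4
  y1 ~~ y2        # free the residual covariance between y1 and y2
'
fit_sr <- cfa(model_sr, data = df)
```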

One can add further equations to the system of regression models to capture additional relations (mostly regressions) between the latent variables and other manifest variables. Such models, which contain more than a factor model, are called structural equation models, wherein their factor parts are called measurement models.

What if we wanted to find factors underlying within-person variations of the manifest variables? This is discussed in what follows.

P-factor analysis: Explaining within-person variations

Now suppose that, instead of measuring a sample of a population, you have measured the variables of one person at different times (e.g., measured their emotions over the course of a year) and want to model the idiographic process underlying the variations in the manifest variables.

More formally, suppose you have collected some intensive longitudinal data of a participant, and now you have a $p$-dimensional multivariate time series, say, $Y^t_{p\times1} = [y^t_1, y^t_2, \dots, y^t_p]^\top$ at times $0 < t < T$. Now you want to model the $m$-dimensional latent constructs 'causing' the measured values, i.e., $\Theta^t_{m\times1} = [\theta^t_1, \theta^t_2, \dots, \theta^t_m]^\top$. Similar to R-factor analysis, you might want to write a factor model for each time point as

$$Y^t_{p\times1} = \Lambda^t_{p\times m}\,\Theta^t_{m\times1} + E^t_{p\times1}\tag{8}$$

As in R-factor analysis, the loading matrix (here, $\Lambda^t$) should be shared by all observations (here, for all $t$s); hence $\Lambda^t = \Lambda$.

So far so good? No!

As in R-factor analysis, the assumptions of linear regression must hold. Again, most importantly, the residuals should be independent across times $t_i$ and $t_j$; in other words, $\operatorname{cov}(\epsilon^{t_i}, \epsilon^{t_j\neq t_i}) = 0$, and that is not necessarily the case here.

Conventionally, the simple/classic P-factor analysis ignores the serial temporal dependencies between measurements.
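In practice, the classic P-technique thus amounts to fitting the same measurement model to a single person's time series, treating the time points as if they were independent cases. A minimal lavaan sketch, assuming a hypothetical data frame `df_person` with one row per measurement occasion and columns `y1` through `y4`:

```r
# Classic P-technique: time points treated as independent observations,
# ignoring serial dependence (df_person is a hypothetical data frame)
model_p <- 'f1 =~ y1 + y2 + y3 + y4'
fit_p   <- cfa(model_p, data = df_person)
summary(fit_p, standardized = TRUE)
```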

To be completed later

A short literature review

Most of the research in this area (at least the works enumerated here) belongs to psychology and the social sciences.

A quick review of these references

A lacks the measurement model and consequently does not take into account AR at the measurement level. [Mplus]

B and C belong to the ctsem R package developed in D, which is Driver's Ph.D. dissertation. Driver claims ctsem is capable of dynamic factor analysis. However, the suggested measurement model assumes serial independence of the manifest residuals (in contrast to, e.g., RDSEM), so it cannot be used for our problem. (Driver and I had a discussion on Twitter.) [R]

E briefly addresses trended data but lacks a measurement model (and consequently, residual AR). [Mplus]

F is a (rather comprehensive) tutorial on DSEM which also (rather briefly) addresses RDSEM. It has an appendix on RDSEM estimation which helps one better understand the model. In short, they break down the error/residual term and consequently form "a special case of the DSEM model where the residual variables are modeled as within-level latent variables." [Mplus]

G is a better, more detailed comparison of DSEM and RDSEM. [Mplus]

H and I are good texts explaining DFMs. The former discusses the role of serial and contemporaneous idiosyncratic noise (~ residuals in the RDSEM context) and how to include it in the model. [R]

J discusses DFM in more detail, with additional examples in R, but doesn't go deep into residual AR as far as I've noticed. [R]

K is a great reference discussing the building of such AR models (called Latent Variable-Autoregressive Latent Trajectory, LV-ALT, models). [no implementation]

L is an interesting case of ARMA (D)SEM but still does not model residual temporal dependencies. [Mx]

M, good to be compared with L, is an integration of two approaches: "standard time series analysis ($T$ large and $N = 1$) and conventional SEM ($N$ large and $T = 1$ or small)," and addresses ergodicity too. It does not model residual serial dependencies, however. [OpenMx]

Finally, one should also cf. N for evidence supporting the robustness of P-factor analysis in time series of affect. If this is true for multivariate time series, one does not need to deal with complicated (D)SEM models. [Nonlis]

The solution

To be completed later