Hidden Variables

We now consider how one can incorporate latent variables into our simulation, explicitly giving their simulated values.

In this example we use the survey package to get robust standard errors when reweighting. We start by loading the libraries.

library(causl)
library(survey)

Set Up the Model

We begin by setting up the formulas, families and parameter values:

formulas = list(list(U ~ 1, L ~ A0), list(A0 ~ 1, A1 ~ A0*L), Y ~ A0*A1, ~A0*A1)
fam = list(c(4,3), c(5,5), c(3), c(1,1,1))

pars <- list(A0 = list(beta = 0),
             U = list(beta = 0, phi=1),
             L = list(beta = c(0.3,-0.2), phi=1),
             A1 = list(beta = c(-0.3,0.4,0.3,0)),
             Y = list(beta = c(-0.5,0.2,0.3,0), phi=1),
             cop = list(beta = matrix(c(1,0,0,0,
                                 1,0,0,0,
                                 0.5,0,0,0), nrow=4)))

Simulate data and check distributions

Now simulate the data.

set.seed(123)
n <- 1e4
dat <- rfrugalParam(n, formulas, family=fam, pars=pars)

## Inversion method selected: using pair-copula parameterization

# dat <- causalSamp(n, formulas, family=fam, pars=pars)

We can then check that the parameter values match their intended values:

summary(svyglm(L ~ A0, family=Gamma(link="log"), 
               design=svydesign(id=~1, weights=~1, data=dat)))$coef

##             Estimate Std. Error t value Pr(>|t|)
## (Intercept)    0.289     0.0144    20.1 3.35e-88
## A0            -0.216     0.0202   -10.7 1.34e-26

glmA1 <- glm(A1 ~ A0*L, family=binomial, data=dat)
summary(glmA1)$coef

##             Estimate Std. Error z value Pr(>|z|)
## (Intercept)  -0.3539     0.0414  -8.538 1.36e-17
## A0            0.3811     0.0592   6.433 1.25e-10
## L             0.3233     0.0243  13.298 2.39e-40
## A0:L          0.0113     0.0398   0.285 7.76e-01

These look close to the desired values.

Wrong models and inverse probability weighting

We can start by fitting some naïve regressions.

## wrong models
mod_w <- svyglm(Y ~ A0*A1, family=Gamma(link="log"), 
                design=svydesign(id=~1, weights=rep(1,nrow(dat)), data=dat))

Coef	Truth	Est.	Std. Err.	p-value
(Intercept)	-0.5	-0.550	0.020	0.01
A0	0.2	0.216	0.030	0.30
A1	0.3	0.353	0.027	0.03
A0:A1	0.0	0.012	0.040	0.38

We see that the p-values for both the intercept at the A₁ coefficient are highly significant.

We can also try estimating the inverse probability of treatment weights and using them to obtain a consistent estimate of the treatment effects.

w <- predict(glmA1, type="response")
ps <- dat$A1/w + (1-dat$A1)/(1-w)

## correct model
mod_c <- svyglm(Y ~ A0*A1, family=Gamma(link="log"), 
                design=svydesign(id=~1, weights=ps, data=dat))

Coef	Truth	Est.	Std. Err.	p-value
(Intercept)	-0.5	-0.501	0.021	0.47
A0	0.2	0.205	0.032	0.44
A1	0.3	0.272	0.028	0.16
A0:A1	0.0	0.029	0.041	0.24

Indeed, they are all within two standard errors of their nominal values.

Maximum likelihood approach

We can also fit the data using maximum likelihood directly. Set eval=TRUE to run this chunk.

out <- fitCausal(dat, formulas = list(L ~ A0, Y ~ A0*A1, ~1), family = c(3,3,1))
out

Again, all estimates are within two standard errors of the true values.

Including the latent variable

We can transform U to a standard normal (Z) and include it in the model as well. (Again, set eval=TRUE to run this chunk.)

dat <- dplyr::mutate(dat, Z = qnorm(U))  # transform to normal for fitting
out <- fitCausal(dat, formulas = list(Z ~ 1, L ~ A0, Y ~ A0*A1, ~A0), family = c(1,3,3,1))
out

Note that again the results look correct including the ‘hidden’ variable U/Z.