---
title: "Plasmode Simulation"
author: "Robin Evans"
date: "`r Sys.Date()`"
output: html_document
vignette: >
  %\VignetteIndexEntry{Plasmode}
  %\VignetteEngine{knitr::rmarkdown}
  %\VignetteEncoding{UTF-8}
---

```{r setup, include=FALSE}
knitr::opts_chunk$set(echo = TRUE, cache = TRUE)
options(digits=3)
```

## Introduction

This document describes how to use the `causl` package to perform plasmode 
simulation; that is, producing datasets that combining real covariates (and 
possibly treatments) with synthetic outcomes.  This enables us to have realistic
datasets, but also to have knowledge of the underlying causal effect, making 
them very useful for testing the relative effectiveness of different inference
methods.

Start by loading the `causl` and `dplyr` packages.

```{r lib, message=FALSE}
library(dplyr)
library(causl)
```

## Dataset

We will use the dataset from a competition held at the Atlantic Causal Inference 
Conference in 2016.  To obtain this, first run the following commands:
```{r install_dorie, eval=FALSE}
install.packages("remotes")
remotes::install_github("vdorie/aciccomp/2016")
```
Then load the `aciccomp2016` package.
```{r load_dorie}
library(aciccomp2016)
(dat <- as_tibble(input_2016)) # show 10 rows of first few variables
```

## Model

Let us consider a model for the causal effect of smoking on birthweight.  We will
start with a binary variable $A$ to indicate whether the mother smoked during the
pregnancy, and later extend this to a zero-inflated continuous variable.  We
start by setting up the model inputs:

```{r setup1}
forms <- list(list(),
              list(A ~ x_1 + x_3 + x_4),
              list(Y ~ A),
              list(~ 1))
# fams <- list(integer(0), 5, 1, 1)
fams <- list(integer(0), "binomial", "gaussian", 1)
pars <- list(A = list(beta=c(-1.5,0.03,0.02,0.05)),
             Y = list(beta=c(3200, -500), phi=400^2),
             cop = list(beta=-1))
```
Then call `rfrugalParam` to simulate $A$ and $Y$.
```{r simdat1}
set.seed(123)   # to obtain consistent results
datAY <- rfrugalParam(formulas=forms, family=fams, pars=pars, dat=dat)
```
We can now check that this basic simulation was performed correctly.  First we
fit the treatment variable using the correct model.
```{r chkA1}
glmA <- glm(A ~ x_1 + x_3 + x_4, family=binomial, data=datAY)
summary(glmA)$coefficients
```
Indeed, the parameters appear correct.  Then we can use inverse probability 
weighting (IPW) to estimate the parameters for the outcome model.  We will need 
to load the `survey` package to get robust standard errors after the weighting.
```{r chkY1, message=FALSE, warning=FALSE}
library(survey)
ps <- predict(glmA, type="response")
wt <- datAY$A/ps + (1-datAY$A)/(1-ps)
glmY <- svyglm(Y ~ A, design = svydesign(~1, weights = wt, data=datAY))
summary(glmY)$coef
```

We can also compare with a naïve method that ignores reweighting.

```{r chkY1n, message=FALSE, warning=FALSE}
glmYn <- glm(Y ~ A, data=datAY)
summary(glmYn)$coef
```

### Automatic parameter generation

We can also use the function `gen_cop_pars()` to sample parameters for the 
copula, rather than having to specify a vector for every bivariate copula.  To
do this, we simply provide the list of formulas, the dataset, and a range that
we would like the (partial) correlations to ultimately fall into.  

For example, if we want the correlations in the range $(-0.5,0)$, we can call
```{r sim_cop_pars}
pars2 <- pars
pars2$cop <- gen_cop_pars(formulas = forms, data=dat, range=c(-0.5,0))
```
```{r, echo=FALSE}
cors <- 2*expit(unlist(pars2$cop)) - 1
```
This method gives (partial) correlations `r cors[1]`,  `r cors[2]`, and  `r cors[3]`.

We can then simulate using this set of parameters.
```{r simdat2}
set.seed(124)
datAY2 <- rfrugalParam(formulas=forms, family=fams, pars=pars2, dat=dat)
```
```{r chkAY2, echo=FALSE}
glmA <- glm(A ~ x_1 + x_3 + x_4, family=binomial, data=datAY2)
# summary(glmA)$coefficients
ps <- predict(glmA, type="response")
wt <- datAY2$A/ps + (1-datAY2$A)/(1-ps)
glmY <- svyglm(Y ~ A, design = svydesign(~1, weights = wt, data=datAY2))
# summary(glmY)$coef
est2 <- summary(glmY)$coef[1:2,1:2]
```
We can check the parameters in the same manner as above, and indeed we obtain
an intercept of `r round(est2[1,1])` (se $=$ `r round(est2[1,2],1)`) and a 
causal effect of `r round(est2[2,1],1)` (`r round(est2[2,2],1)`).

We can also use a more complicated method for generating copula parameters.  Suppose
we want these to depend upon `x_2`, which is a factor with six levels.  Then 
we specify
```{r setup3}
forms3 <- list(list(),
              list(A ~ x_1 + x_3 + x_4),
              list(Y ~ A),
              list(~ x_2))
```
and simulate as:
```{r gen_pars3}
set.seed(125)
pars3 <- pars
pars3$cop <- gen_cop_pars(formulas = forms3, data=dat, range=c(-0.5,0))
datAY3 <- rfrugalParam(formulas=forms3, family=fams, pars=pars3, dat=dat)
```

```{r chkAY3, echo=FALSE}
glmA <- glm(A ~ x_1 + x_3 + x_4, family=binomial, data=datAY3)
# summary(glmA)$coefficients
ps <- predict(glmA, type="response")
wt <- datAY3$A/ps + (1-datAY3$A)/(1-ps)
glmY <- svyglm(Y ~ A, design = svydesign(~1, weights = wt, data=datAY3))
# summary(glmY)$coef
est3 <- summary(glmY)$coef[1:2,1:2]
```
This time we obtain an intercept of `r round(est3[1,1])` (`r round(est3[1,2],1)`) and a 
causal effect of `r round(est3[2,1],1)` (`r round(est3[2,2],1)`).  In this case
the copula parameter is a vector of six entries; for example for the $Y$-$X_1$ 
effect this is (`r round(pars3$cop$Y$x_1$beta, 2)`).

## Other copula families

Until now we have only used Gaussian copulas, but we can easily specify a 
different family.  For example, suppose we wish to consider Student t-copulas.
In this case we need to choose a degrees of freedom parameter, though this is 
easily managed with `gen_cop_pars()` which allows additional arguments as `...`.  

```{r gen_pars4}
set.seed(126)
pars4 <- pars3
(pars4$cop <- gen_cop_pars(forms3, data=dat, range=c(-0.5,0), par2=4))
fams4 <- fams
fams4[[4]] <- 2
datAY4 <- rfrugalParam(formulas=forms3, family=fams4, pars=pars4, dat=dat)
```
```{r chkAY4, echo=FALSE}
glmA <- glm(A ~ x_1 + x_3 + x_4, family=binomial, data=datAY4)
# summary(glmA)$coefficients
ps <- predict(glmA, type="response")
wt <- datAY4$A/ps + (1-datAY4$A)/(1-ps)
glmY <- svyglm(Y ~ A, design = svydesign(~1, weights = wt, data=datAY4))
# summary(glmY)$coef
est4 <- summary(glmY)$coef[1:2,1:2]
```
Now the estimates are `r round(est4[1,1])` (`r round(est4[1,2],1)`) and a 
causal effect of `r round(est4[2,1],1)` (`r round(est4[2,2],1)`).  


## Strength of relationships

We also introduce the function `adj_vars()`, which allows users to modify the 
strength of the partial correlations (or other parameters) by scaling the 
parameters of the linear predictor.  The argument `factor` can be modified to
control the strength of the `strong` and `weak` variables respectively; the 
default values are 5 and 0.2.  Returning to the example above, suppose 
that we want `x_4` to be much more closely related than `x_1` or `x_3`.  Then
we can apply:
```{r gen_pars5}
set.seed(127)
pars5 <- pars3
pars_tmp <- gen_cop_pars(forms3, data=dat, range=c(-0.25,0))
(pars5$cop <- causl:::adj_vars(pars_tmp, strong="x_4", weak=c("x_1","x_3"), factor=c(2,0.25)))
fams5 <- fams
fams5[[4]] <- 5  # use Frank copula
datAY5 <- rfrugalParam(formulas=forms3, family=fams5, pars=pars5, dat=dat)
```
```{r chkAY5, echo=FALSE}
glmA <- glm(A ~ x_1 + x_3 + x_4, family=binomial, data=datAY5)
# summary(glmA)$coefficients
ps <- predict(glmA, type="response")
wt <- datAY5$A/ps + (1-datAY5$A)/(1-ps)
glmY <- svyglm(Y ~ A, design = svydesign(~1, weights = wt, data=datAY5))
# summary(glmY)$coef
est5 <- summary(glmY)$coef[1:2,1:2]
```
This time the estimates are `r round(est5[1,1])` (`r round(est5[1,2],1)`) and a 
causal effect of `r round(est5[2,1],1)` (`r round(est5[2,2],1)`).