---
title: "Copula Simulation"
author: "Robin J. Evans"
date: "27/05/2021"
output: rmarkdown::html_vignette
vignette: >
  %\VignetteIndexEntry{Copula Simulation}
  %\VignetteEngine{knitr::rmarkdown}
  %\VignetteEncoding{UTF-8}
---

```{r, include = FALSE}
knitr::opts_chunk$set(
  collapse = TRUE,
  comment = "#>",
  cache = TRUE,
  fig.width=7, 
  fig.height=7
)
```

```{r setup}
library(causl)
```

## Simulation

The main function for simulating data is `causalSamp()`.  Suppose we wish to
have
\begin{align*}
Z &\sim N(0, \; \sigma_z^2)\\
X \mid Z=z &\sim N(\mu_x, \; \sigma_x^2)\\
Y \mid \operatorname{do}(X=x) &\sim N(\mu_y, \; \sigma_y^2)
\end{align*}
with $\mu_x = 0.5 z$, $\mu_y = 0.3 x$, and $\sigma_z^2 = \sigma_x^2 = \sigma_y^2 = 1$.
Then we set:
```{r parameters}
pars <- list(z = list(beta = 0, phi=1),
             x = list(beta = c(0,0.5), phi=1),
             y = list(beta = c(0,0.3), phi=1),
             cop = list(beta = matrix(c(0.5,0.25), ncol=1)))
```
Note that the copula parameters have been selected for a linear predictor of the
form $\eta = 0.5 + 0.25x$, and the default link function for the Gaussian
copula is $g(\eta) = \operatorname{logit}((\eta+1)/2)$.

Then we simulate as follows:
```{r simulate}
set.seed(124)  # for consistency
dat <- causalSamp(1e3, formulas=list(z~1, x~z, y~x, ~x), family=rep(1,4), pars=pars)
pairs(dat)
```
Note that the distribution is as we expect.  Regressing \code{x} on \code{z}
should give a coefficient close to 0.5.
```{r check}
lmXZ <- lm(x ~ z, data=dat)
summary(lmXZ)$coef
# lmYX <- lm(y ~ x, data=dat, weights = 1/dnorm(predict(lmXZ)))
# summary(lmYX)
```
Note that it is not necessary to use the standard variable names `z`, `x` and
`y`, but if you choose not to then you must specify the names you want to use
with the `formulas` argument.
```{r abc}
pars <- list(a = list(beta = 0, phi=1),
             b = list(beta = c(0,0.5), phi=1),
             c = list(beta = c(0,0.3), phi=1),
             cop = list(beta = matrix(c(0.5,0.25), ncol=1)))
set.seed(124)
dat2 <- causalSamp(1e3, formulas=list(a ~ 1, b ~ a, c ~ b,  ~ b), pars=pars)
pairs(dat2)
```

`causalSamp()` has various default settings.  The copula $C_{YZ}$, and the
distribution families of $X$, $Y$ and $Z$ all default to Gaussian.  It
over-samples by a factor of 10 so that some may be rejected later, though this
is increased later on if necessary up to the control parameter `max_oversamp`
(with default value 1000).

## Fitting Models

We can fit a model using maximum likelihood.
```{r fit_model}
out <- fit_causl(dat, formulas = list(z~1, y~x, ~x))
out
```
We see that the `x` coefficient in the regression parameter is correct (even
surprisingly so!).

Note that if we allow `z` to depend upon `x` we obtain a biased estimate.
```{r fit_other}
out2 <- fit_causl(dat, form = list(y~x, z~x, ~x))
out2
```
<!-- We can also use a different copula. -->
<!-- ```{r fit_another} -->
<!-- fit_cop(dat, forms = list(~ x, ~ 1, ~ 1), fam_cop=3) -->
<!-- ``` -->

### Other families

We can also perform maximum likelihood estimation with other parametric 
families.  Suppose that $Z$ is t-distributed and $Y$ Gamma distributed.  We can
simulate with 
```{r gammaY}
forms <- list(Z ~ 1, X ~ Z, Y ~ X, ~ 1)
fams <- c(2, 5, 3, 1)
pars <- list(Z = list(beta = 0, phi=1, par2=4),
             X = list(beta = c(0,0.5)),
             Y = list(beta = c(1,-0.3), phi=2),
             cop = list(beta = 1))
set.seed(124)
dat <- rfrugalParam(1e3, formulas=forms, family=fams, pars=pars)
```
Then we can perform maximum likelihood estimation on this data:
```{r fit_gammaY}
(out <- fit_causl(dat, formulas = forms[-2], family = fams[-2], other_pars=list(Z=list(par2=4L))))
```