--- title: "Copula Simulation" author: "Robin J. Evans" date: "27/05/2021" output: rmarkdown::html_vignette vignette: > %\VignetteIndexEntry{Copula Simulation} %\VignetteEngine{knitr::rmarkdown} %\VignetteEncoding{UTF-8} --- ```{r, include = FALSE} knitr::opts_chunk$set( collapse = TRUE, comment = "#>", cache = TRUE, fig.width=7, fig.height=7 ) ``` ```{r setup} library(causl) ``` ## Simulation The main function for simulating data is `causalSamp()`. Suppose we wish to have \begin{align*} Z &\sim N(0, \; \sigma_z^2)\\ X \mid Z=z &\sim N(\mu_x, \; \sigma_x^2)\\ Y \mid \operatorname{do}(X=x) &\sim N(\mu_y, \; \sigma_y^2) \end{align*} with $\mu_x = 0.5 z$, $\mu_y = 0.3 x$, and $\sigma_z^2 = \sigma_x^2 = \sigma_y^2 = 1$. Then we set: ```{r parameters} pars <- list(z = list(beta = 0, phi=1), x = list(beta = c(0,0.5), phi=1), y = list(beta = c(0,0.3), phi=1), cop = list(beta = matrix(c(0.5,0.25), ncol=1))) ``` Note that the copula parameters have been selected for a linear predictor of the form $\eta = 0.5 + 0.25x$, and the default link function for the Gaussian copula is $g(\eta) = \operatorname{logit}((\eta+1)/2)$. Then we simulate as follows: ```{r simulate} set.seed(124) # for consistency dat <- causalSamp(1e3, formulas=list(z~1, x~z, y~x, ~x), family=rep(1,4), pars=pars) pairs(dat) ``` Note that the distribution is as we expect. Regressing \code{x} on \code{z} should give a coefficient close to 0.5. ```{r check} lmXZ <- lm(x ~ z, data=dat) summary(lmXZ)$coef # lmYX <- lm(y ~ x, data=dat, weights = 1/dnorm(predict(lmXZ))) # summary(lmYX) ``` Note that it is not necessary to use the standard variable names `z`, `x` and `y`, but if you choose not to then you must specify the names you want to use with the `formulas` argument. ```{r abc} pars <- list(a = list(beta = 0, phi=1), b = list(beta = c(0,0.5), phi=1), c = list(beta = c(0,0.3), phi=1), cop = list(beta = matrix(c(0.5,0.25), ncol=1))) set.seed(124) dat2 <- causalSamp(1e3, formulas=list(a ~ 1, b ~ a, c ~ b, ~ b), pars=pars) pairs(dat2) ``` `causalSamp()` has various default settings. The copula $C_{YZ}$, and the distribution families of $X$, $Y$ and $Z$ all default to Gaussian. It over-samples by a factor of 10 so that some may be rejected later, though this is increased later on if necessary up to the control parameter `max_oversamp` (with default value 1000). ## Fitting Models We can fit a model using maximum likelihood. ```{r fit_model} out <- fit_causl(dat, formulas = list(z~1, y~x, ~x)) out ``` We see that the `x` coefficient in the regression parameter is correct (even surprisingly so!). Note that if we allow `z` to depend upon `x` we obtain a biased estimate. ```{r fit_other} out2 <- fit_causl(dat, form = list(y~x, z~x, ~x)) out2 ``` ### Other families We can also perform maximum likelihood estimation with other parametric families. Suppose that $Z$ is t-distributed and $Y$ Gamma distributed. We can simulate with ```{r gammaY} forms <- list(Z ~ 1, X ~ Z, Y ~ X, ~ 1) fams <- c(2, 5, 3, 1) pars <- list(Z = list(beta = 0, phi=1, par2=4), X = list(beta = c(0,0.5)), Y = list(beta = c(1,-0.3), phi=2), cop = list(beta = 1)) set.seed(124) dat <- rfrugalParam(1e3, formulas=forms, family=fams, pars=pars) ``` Then we can perform maximum likelihood estimation on this data: ```{r fit_gammaY} (out <- fit_causl(dat, formulas = forms[-2], family = fams[-2], other_pars=list(Z=list(par2=4L)))) ```