--- title: "Custom Families" author: "Robin Evans" date: "`r Sys.Date()`" output: rmarkdown::html_vignette vignette: > %\VignetteIndexEntry{Custom Families} %\VignetteEngine{knitr::rmarkdown} %\VignetteEncoding{UTF-8} --- ```{r setup, include=FALSE} knitr::opts_chunk$set(echo = TRUE, cache=TRUE) options(digits=3) ``` The `causl` package now allows users to specify custom families with which to perform their simulations. To do this, one must create a function that takes as an argument the link function, and returns a list containing: * `name`: the name of the relevant family; * `ddist`: a function returning the density of the distributions; * `qdist`: a function returning the quantiles from probabilities; * `rdist`: a function to sample values from the distribution; * `pdist`: a cumulative distribution function; * `pars`: a list of the names of the parameters used; * `default`: a function that returns a list of the default values for an observation and each of the parameters; * `link`: the specified link function. The function should also give the output the class `"causl_family"`, so that it is interpreted appropriately. ## Example (Poisson distribution) Suppose we wish to simulate from a Poisson distribution as part of a frugal model. Then we can define a function such as the below: ```{r pois, message=FALSE} library(causl) poisson_causl_fam <- function (link) { if (missing(link)) link = "log" ## write functions dens <- function (x, mu, log=FALSE) dpois(x, lambda=mu, log=log) quan <- function (p, mu) qpois(p, lambda=mu) sim <- function (n, mu) { qx <- runif(n) x <- qpois(qx, lambda=mu) attr(x, "quantile") <- qx return(x) } probs <- function (x, mu) ppois(x, lambda=mu) default <- function(theta) list(x=1, mu=1) ## define family out <- list(name="poisson", ddist=dens, qdist=quan, rdist=sim, pdist=probs, pars=c("mu"), default=default, link=link) class(out) <- "causl_family" return(out) } ``` Some remarks are in order. Note that the `ddist` function must have a `log` argument to allow the log-density to be evaluated. In addition, since this is a discrete distribution, if we use it in a copula we must know the exact quantile used in the simulation (not just the one returned by applying `ppois`). Hence we simulate a uniform and then obtain the Poisson random variables by inversion from `qpois`. For continuous distributions this is unnecessary, and `glm_sim` will just apply the `pdist` function. Now, let us try to simulate some data using this distribution. We will need to use `poisson_causl_fam` in our `family` argument: ```{r params} forms <- list(Z ~ 1, X ~ Z, Y ~ X, ~ 1) fams <- list(1, poisson_causl_fam(), 1, 1) pars <- list(Z = list(beta = 0, phi = 1), X = list(beta = c(0.3,1)), Y = list(beta = c(-1,0.5), phi = 1), cop = list(beta = 1)) cm <- causl_model(formulas = forms, family = fams, pars = pars) set.seed(123) dat <- rfrugal(n=1e3, causl_model=cm) ``` ### Treatment model Then we can check that the new variables have been simulated correctly. We ought to have: \begin{align*} P(X = x \mid z) &= \frac{1}{x!} \exp\{\alpha_0 + \alpha_1 z x - e^{\alpha_0 + \alpha_1 z}\}, \end{align*} with $\alpha = `r pars$X$beta[1]`$ and $\alpha_1 = `r pars$X$beta[2]`$. Let's perform a regression to check the parameters: ```{r chk_pois} modX <- glm(X ~ Z, family=quasipoisson, data=dat) summary(modX)$coefficients ``` So we can see that the estimates are well within two standard errors of the true value. In addition, ```{r chk_pois_disp} summary(modX)$dispersion ``` the dispersion parameter is close to its nominal value of 1. ### Outcome model We can also fit the outcome model using maximum likelihood estimation with sandwich errors. ```{r fit_causl} out <- fit_causl(dat = dat, formulas = list(Y ~ X, Z ~ 1, cop ~ 1), family = c(1, 1, 1)) ``` This gives us estimates of $\beta_0 = `r out$par[1]`$ (`r out$sandwich_se[1]`) and $\beta_1 = `r out$par[2]`$ (`r out$sandwich_se[2]`), which is consistent with the values $-1$ and $0.5$ that we used to simulate. ## Custom Link Functions The example above works because `log` is an existing link function specified in the package. Suppose, however, that we want to use the `sqrt` function as the link; this is not used with any built-in families. For this we need to specify a `custom_links` argument to `poisson_causl_fam`; each entry in this named list should itself be a list, containing: * `linkfun`: the link function $g$; * `linkinv`: the inverse of $g$. The names of the entries should correspond to the string used as the name of the link function. ```{r pois2} poisson_causl_fam <- function (link) { if (missing(link)) link = "log" ## write functions dens <- function (x, mu, log=FALSE) dpois(x, lambda=mu, log=log) quan <- function (p, mu) qpois(p, lambda=mu) sim <- function (n, mu) { qx <- runif(n) x <- qpois(qx, lambda=mu) attr(x, "quantile") <- qx return(x) } probs <- function (x, mu) ppois(x, lambda=mu) default <- function(theta) list(x=1, mu=1) custom_links <- list(sqrt = list(linkfun = function (x) sqrt(x), linkinv = function (x) x^2)) ## define family out <- list(name="poisson", ddist=dens, qdist=quan, rdist=sim, pdist=probs, pars=c("mu"), default=default, link=link, custom_links=custom_links) class(out) <- "causl_family" return(out) } ``` Now we can suitably modify our earlier example, ```{r params_sqrt} fams <- list(1, poisson_causl_fam(link="sqrt"), 1, 1) pars$X$beta <- c(0.9, 0.05) cm_sq <- causl::modify(cm, family=fams, pars=pars) ``` and re-simulate: ```{r sim2} set.seed(124) dat <- rfrugal(1e3, causl_model = cm_sq, control=list(quiet=TRUE)) ``` We can again check that the distribution was correctly sampled. ```{r chk_pois_sqrt} modX <- glm(X ~ Z, family=quasipoisson(sqrt), data=dat, start=c(1,0)) summary(modX)$coefficients ``` So we again see that the estimates are (just about) within two standard errors of the true values (0.9 and 0.05), and the dispersion parameter is estimated to be `r summary(modX)$dispersion`, again close to 1.