---
title: "Controlling Effect Size Distributions"
output: rmarkdown::html_vignette
vignette: >
  %\VignetteIndexEntry{Controlling Effect Size Distributions}
  %\VignetteEngine{knitr::rmarkdown}
  %\VignetteEncoding{UTF-8}
---

```{r, include = FALSE}
knitr::opts_chunk$set(
  collapse = TRUE,
  comment = "#>"
)
```

```{r setup}
library(GWASBrewer)
library(dplyr)
library(ggplot2)
library(reshape2)
```

## Introduction

`GWASBrewer` can support a range of effect size distributions. At one end of the spectrum, effect sizes and 
direct effect variants can be completely deterministic. At the other end of the spectrum, 
both are random. There are multiple ways to control the distributions of these
random values demonstrated in this vignette.

## Default Behavior


The default behavior of `sim_mv` is to sample effect variants randomly by first randomly 
selecting direct effect variants and then choosing the standardized effect size of these variants from 
a normal distribution with mean zero appropriately chosen variance.
Standardized effects are effects with units of change in outcome on SDs per unit increase in genotype SD.
This procedure results 
in  effect variants spread evenly throughout the genome and per-allele effect sizes that 
are inversely proportional to $\sqrt{2 f_j (1-f_j)}$.

We demonstrate this by simulating data for only one trait. We will set $N = 0$ since we don't need 
the summary statistics for this demonstration.
```{r}
set.seed(1)
sim_dat1 <- sim_mv(G = 1,
                   J = 50000,
                   N = 0,  
                   h2 = 0.3, 
                   pi = 1000/50000, 
                   af = function(n){rbeta(n, 1, 5)})
```

In this specification, we expect to have about 1000 effect variants. 
Below, we plot the distribution of the standardized and per-allele effects.
Note that in this simulation, `direct_SNP_effects_joint` and `beta_joint` are
the same because there is only one trait.


```{r}
sum(sim_dat1$direct_SNP_effects_joint != 0)

ix <- which(sim_dat1$direct_SNP_effects_joint != 0)
per_allele_effects <- sim_dat1$direct_SNP_effects_joint[ix]
effect_afs <- sim_dat1$snp_info$AF[ix]
std_effects <- per_allele_effects*sqrt(2*effect_afs*(1-effect_afs))

## first plot per-allele effects
hist(per_allele_effects, breaks = 30, freq = F, 
     main = "Per-Allele Direct Effects")

## next show that standardized effects are normal
hist(std_effects, breaks = 30, freq = F, 
     main = "Standardized Direct Effects")
curve(dnorm(x, sd = sqrt(0.3/1000)), add = T)
```


## Controlling which Variants are Effect Variants

The `pi` argument to `sim_mv` gives the probability that each variant is an effect variant. 
This can be a scalar in which calse the probability is the same for all variants and all traits,
a vector giving a different probability for each trait or $J \times M$ matrix. 
If `pi` has matrix format then  `pi[j,m]` gives the probability than
variant `j` has a direct effect on trait `m`. To make the set of 
direct effect variants fully deterministic, set the values of `pi` to be either 0
or 1. Using `pi` in matrix format is incompatible with options `pi_exact = TRUE`
and `sporadic_pleiotropy = FALSE`. 

Below, we use the matrix format to force all effect variants to be in the 
first half of the genome. This strategy could be used to designate
regions with high concentrations of effect variants or assign different
probabilities to variants with different features.

```{r}
pi_mat <- matrix(0, nrow = 50000, ncol = 1)
pi_mat[1:25000,1] <- 1000/25000
sim_dat2 <- sim_mv(G = 1,
                   J = 50000,
                   N = 60000,  
                   h2 = 0.3, 
                   pi = pi_mat, 
                   af = function(n){rbeta(n, 1, 5)})
```

```{r}
plot(sim_dat2$direct_SNP_effects_joint)
```

In another application of this option, we can make the probability that a
variant is an effect variant a function of the allele frequency.

```{r}
af <- rbeta(n = 50000, 1, 5)
pi_mat <- matrix(0, nrow = 50000, ncol = 1)
pi_mat[,1] <- (1-pmin(af, 1-af))^4
# ensure there are still 1000 effect variants expected
pi_mat <- 1000*pi_mat/sum(pi_mat) 
sim_dat3 <- sim_mv(G = 1,
                   J = 50000,
                   N = 60000,  
                   h2 = 0.3, 
                   pi = pi_mat, 
                   af = af)
```
To verify that we observe the desired effect, we plot the proportion of effect
variants by bins of minor allele frequency.

```{r}
x <- seq(0, 0.5, length.out = 100)
maf <- pmin(af, 1-af)
snp_dat <- data.frame(maf = maf, 
                      is_effect = sim_dat3$direct_SNP_effects_joint !=0, 
                      maf_cat = cut(maf, breaks = x), 
                      maf_cat_order = cut(maf, breaks = x, labels = FALSE)) %>%
           group_by(maf_cat) %>% 
           summarize(n = n(), 
                     n_eff = sum(is_effect), 
                     maf_min = first(maf_cat_order)) %>%
           mutate(eff_prob = n_eff/n)
snp_dat$maf_min <- x[snp_dat$maf_min]
with(snp_dat, plot(maf_min, eff_prob))
```


## Controlling Effect Size Distribution

To sample standardized effects from a distribution other
than the normal distribution, we use the `snp_effect_function` 
argument in `sim_mv`. This argument can accept a function 
for randomly sampling from any scale family distribution.
The function given to `snp_effect_function`
must accept three arguments, `n`, 
`sd`, and `snp_info`.
The function will be called internally by `sim_mv`. 
It should return a length `n` vector of real numbers such that
the expected sum of squares is `sd^2`. That is, if
`f` is our function, and `x <- f(n, sd, snp_info)` then `x` should be 
length `n` and the expected value of `sum(x^2)` should be `sd^2`.

The `snp_info` argument allows the function to receive variant information such as allele frequency. 
The simplest effect size functions will not use this argument at all, though all functions should 
be able to accept it.

Currently, `GWASBrewer` will test that your function executes and will 
test that it gives output with approximately the expected variance for 
one test case. This warning will flag some major errors but lack of warning
does not guarantee that your function is working correctly. You should be sure
to verify it's behavior for yourself. You may see the warning occasionally 
for correctly working functions due to randomness or if you function 
produces effect sizes with an eccentric distribution. 


### Drawing Effects from a Mixture of Normals

Frequently, we observe a pattern that many effect variants have a small "background" effect
and some have a larger "foreground" effect. This pattern is consistent with an omnigenic
model for genetic architecture. To simulate this, we can draw standardized
effects from a mixture of normal distributions rather than from a single normal distribution.
There is a helper function `mixnorm_to_scale_fam` which generates functions 
appropriate to pass
to `sim_mv`. For example, suppose we want a distribution in which 90\% of variants 
have standardized direct effects following a $N(0, \sigma^2_1)$ distribution and 10\% 
have standardized effects from a $N(0, 10 \sigma^2_1)$ distribution. Then we can use

```{r}
myf <- mixnorm_to_scale_fam(sigma = c(1, sqrt(10)), pi = c(0.9, 0.1))
```

Note that the `sigma` argument just specifies the relative scaling of the
variances so it would be equivalent to use 

```{r}
myf <- mixnorm_to_scale_fam(sigma = c(10, 10*sqrt(10)), pi = c(0.9, 0.1))
```


```{r}
set.seed(2)
sim_dat4 <- sim_mv(G = 1,
                   J = 50000,
                   N = 0,  
                   h2 = 0.3, 
                   pi = 5000/50000, 
                   snp_effect_function = myf,
                   af = function(n){rbeta(n, 1, 5)})
```

This gives a distribution of standardized direct effects with much heavier tails than 
using the default behavior. 

```{r}
std_effects <- with(sim_dat4, 
                    direct_SNP_effects_joint*sqrt(2*snp_info$AF*(1-snp_info$AF)))
hist(std_effects[!std_effects == 0], breaks = 30, freq = F, 
     main = "Standardized Direct Effects")
```

Rather than use the built-in helper function, we could have gotten the same result using

```{r, eval = FALSE}
myf <- function(n, sd, ...){
  Vtarget <- sd^2 # Target sum of squares
  Vbase <- sum(0.9*1 + 0.1*10)*n # base sum of squares
  a <- sqrt(Vtarget/Vbase)  # scale factor
  mix_sds <- a*c(1, sqrt(10))
  # choose which variants are from which component
  Z <- rbinom(n =n, size = 1, prob = 0.1)
  x <- rep(NA, n)
  x[Z == 0] <- stats::rnorm(n = sum(Z == 0), sd = mix_sds[1])
  x[Z == 1] <- stats::rnorm(n = sum(Z == 1), sd = mix_sds[2])
  return(x)
}

## double check that we got the scaling right
x <- myf(1e5, sd = sqrt(0.2))
sum(x^2) # should be about 0.2
```

Note that because our function has heavier tails, this particular example will 
generate a warning about 2\% of the time. Seeing this warning doesn't necessarily mean that anything is wrong. 
When in doubt, check your code and check that the resulting heritability (e.g. `sim_dat4$h2`) is
close to the expected heritability.

### Providing a fixed list of relative effect sizes

If you want to provide a specific fixed list of relative variant effects, you can do this using the helper function `fixed_to_scalefam`.  Effects will still be scaled to give the desired heritability, however the relative sizes of effects will be given by a fixed list, repeated as many times as necessary to give the desired number of variants. For example, the function below provides two standardized effect sizes, one twice as big as the other. This is not a very realistic distribution but it is ok for an example.

```{r}
myf <- fixed_to_scale_fam(c(1, 2))
```


```{r}
set.seed(2)
sim_dat5 <- sim_mv(G = 1,
                   J = 50000,
                   N = 0,  
                   h2 = 0.3, 
                   pi = 5000/50000, 
                   snp_effect_function = myf,
                   af = function(n){rbeta(n, 1, 5)})
```

This gives us standardized effects that alternate between two values.

```{r}
per_allele_effects <- sim_dat5$direct_SNP_effects_joint
genotype_variance <- with(sim_dat5$snp_info, 2*AF*(1-AF))
std_effects <- per_allele_effects*sqrt(genotype_variance)
std_effects[!std_effects == 0][1:10]
```

Note that the per-allele effects take on many values because there is variation in allele frequency.

```{r}
per_allele_effects[!per_allele_effects == 0][1:10]
```


### Providing an Exact Set of Direct Effects

The function `sim_mv_determined` is a wrapper function for `sim_mv` that allows you to directly specify direct effects for all traits and variants. 
The only differences between `sim_mv_determined` and `sim_mv` are that `sim_mv_determined` accepts the matrix of direct effects plus parameters `geno_scale` and `pheno_sd` to indicate the scale of the provided effects, and `sim_mv_determined` does not accept parameters related to the distribution of effect sizes. Note that for `sim_mv_determined`, the `G` matrix gives effects on the phenotype scale of the effect sizes, indicated with the `pheno_sd` argument. 
In `sim_mv`, the phenotype scale is always 1 (phenotypes have unit variance).
The resulting object can be translated to a different phenotype scale using `rescale_sumstats` if desired.


Below is a very simple example using `sim_mv_determined` for ten variants and two traits. We will put the desired direct effects into a matrix and then pass these to `sim_mv_determined`. Here we use effects on the per-genotype SD scale, so we set `geno_scale=sd`. 

```{r}
G <- matrix(c(0, 0.5, 0, 0), nrow = 2, byrow =TRUE)
my_effects <- matrix(0, nrow = 10, ncol = 2)
my_effects[c(1, 5),1] <- c(-0.008, 0.01)
my_effects[c(3, 6, 9), 2] <- c(-0.02, 0.06, 0.009)
my_effects

# for fun, lets include some sample overlap
N <- matrix(c(40000, 10000, 10000, 20000), nrow = 2)
sim_dat6 <- sim_mv_determined(N = N,
                              direct_SNP_effects_joint = my_effects,
                              geno_scale = "sd",
                              pheno_sd = 1,
                              G=G,
                              est_s = TRUE)

sim_dat6$direct_SNP_effects_joint
sim_dat6$beta_joint
sim_dat6$Sigma_G
```


### Different effect distributions for different traits

The `snp_effect_function` parameter can accept a list with length equal to the number of traits. In this case, the direct effect variants for each trait will be drawn from the corresponding element of `snp_effect_function`. If `snp_effect_function` is a list, each element should either be an acceptable function or the string `normal` to use the default option. The code below generates data for two independent traits. The effects for the first trait are normally distributed. The effects for the second trait take on one of two values as in the previous section. 


```{r}
myf <- fixed_to_scale_fam(c(1, 2))
sim_dat7 <- sim_mv(G = 2,
                   J = 50000,
                   N = 60000,  
                   h2 = 0.3, 
                   pi = 5000/50000, 
                   snp_effect_function = list('normal', myf),
                   af = function(n){rbeta(n, 1, 5)})


```


```{r}
std_effects1 <- with(sim_dat7, 
                     direct_SNP_effects_joint[,1]*sqrt(2*snp_info$AF*(1-snp_info$AF)))
hist(std_effects1[!std_effects1 == 0], breaks = 30, freq = F, main = "Standardized Direct Effects Trait 1")
```

```{r}
std_effects2 <- with(sim_dat7, 
                     direct_SNP_effects_joint[,2]*sqrt(2*snp_info$AF*(1-snp_info$AF)))
hist(std_effects2[!std_effects2 == 0], breaks = 30, freq = F, main = "Standardized Direct Effects Trait 2")
```

### Example Using Pre-Specified Effects

Below is some code that combines some of the effect size specification options we have seen to simulate GWAS data for one LD block and two traits. The first trait will have two effect variants that have equal standardized effect size. The second trait has two effect variants with different effect sizes. One pair of variants co-localizes across traits. Simulations of this type could be used to test co-localization methods.  Since we are looking at only one block, we will set the heritability of the block to be a reasonable amount of heritability provided by two variants, 6e-4 for one trait and 1e-3 for the other. 

```{r}
data("ld_mat_list")
data("AF")
# Pull out sub section of one of the built in LD blocks
ld_block <- ld_mat_list[[10]][118:350, 118:350]
J <- nrow(ld_block)
# Corresponding allele frequency
start_ix <- sum(sapply(ld_mat_list[1:9], nrow)) + 118 
af <- AF[start_ix:(start_ix + J-1)]
## plot the LD block
as.matrix(ld_block) %>% 
    reshape2::melt() %>% 
    ggplot(aes(x = Var1, y = Var2, fill = value))  + 
      geom_tile() + scale_fill_gradient2()
```

```{r}
## Select causal variants
pi_mat <- matrix(0, nrow = J, ncol = 2)
pi_mat[c(75, 150), 1] <- 1
pi_mat[c(75, 130), 2] <- 1

# Set effect size functions
myf1 <- fixed_to_scale_fam(c(1, 1))
myf2 <- fixed_to_scale_fam(c(-1, 2))

sim_dat7 <- sim_mv(G = 2,
                   J = J,
                   N = 100000,  
                   h2 = c(6e-4, 1e-3), 
                   pi = pi_mat,
                   R_LD = list(ld_block),
                   snp_effect_function = list(myf1, myf2),
                   af = af)

## Plotting joint effects
plot(sim_dat7$direct_SNP_effects_joint[,1], main = "Trait 1 Joint Effects", ylab = "beta joint")
abline(v = c(75, 150))
plot(sim_dat7$direct_SNP_effects_joint[,2], main = "Trait 2 Joint Effects", ylab = "beta joint")
abline(v = c(75, 130))
```

Note that the effects for the first trait are not the same because we specified standardized effects but have plotted per-allele effects.
Below we plot the marginal associations and observed $z$-scores for this data.

```{r}
## Plotting marginal effects
plot(sim_dat7$direct_SNP_effects_marg[,1], main = "Trait 1 Marginal Effects", ylab = "beta marg")
abline(v = c(75, 150), h = 0)
plot(sim_dat7$direct_SNP_effects_marg[,2], main = "Trait 2 Marginal Effects", ylab = "beta marg")
abline(v = c(75, 130), h = 0)
```


```{r}
## Plotting observed z-scores
with(sim_dat7, plot(beta_hat[,1]/se_beta_hat[,1], main = "Trait 1 Obs. Z-scores", ylab = "zhat marg"))
abline(v = c(75, 150), h = 0)
with(sim_dat7, plot(beta_hat[,2]/se_beta_hat[,2], main = "Trait 2 Obs. Z-scores", ylab = "zhat marg"))
abline(v = c(75, 130), h = 0)
```


### Custom Effect Size Distributions 

If you write your own function to pass to `snp_effect_function`, it needs to accept three parameters:
number of variants (`n`), 
standard deviation (square root of total variance explained) (`sd`), and SNP information (`snp_info`). 
This last argument can be used to create effect distributions that depend on 
allele frequency or other features. 

The default model used by `GWASBrewer` is a sparse version of the GCTA model in which
all causal variants have the same expected heritability explained. We might, instead
want to use a heritability model like the [LDAK](https://dougspeed.com/) model
or we might want to allow the expected heritability explained to depend on an annotation. 

For this example, we assume that we would like to use the heritability model

$$E[h_j^2] = x(f_j (1-f_j))^{0.75}(0.1 A_{1,j} + 0.13 A_{2,j} + 0.9)$$

where $x$ is a scalar that depends on the total heritability and $A_{1,j}$ and $A_{2,j}$ 
are the values of two annotations for variant $j$. 
This is a simplified version of the
LDAK model. 
To use this model, we need a data frame containing the annotations that we will pass to the `sim_mv` function via the `snp_info` argument. 
If you are using LD, the input `snp_info` data frame should have number of rows equal to the size of the LD pattern. 
Otherwise, it should have number of rows equal to the number of variants. 
For this example, we will generate the two annotations randomly and will not use LD. 
To really simulate from the LDAK model, it would be better to generate data with LD and realistic annotations.

```{r}
my_snp_info <- data.frame(A1 = rexp(n = 50000, rate = 1), 
                          A2 = rexp(n = 50000, rate = 2))
```

Next, we write a function to generate effect sizes. The `snp_info` data frame that our function
receives will include the annotations given to the input `snp_info` argument  of 
`sim_mv` **plus** columns called `SNP` for SNP name (usually not useful) and `AF` for allele frequency.


```{r}
myf <- function(n, sd, snp_info){
  stopifnot(nrow(snp_info) == n)
  # expected heritability explained by each variant up to a scalar
  # AF, A1, and A2 are all columns of snp_info
  # we provided A1 and A2. sim_mv added the AF column.
  s2 <- with(snp_info, 
             ((AF*(1-AF))^0.75)*(0.1*A1 + 0.13*A2 + 0.9))
  # Scale s2 to give sum(s2) = sd^2
  scale <- (sd^2)/sum(s2)
  s2 <- s2*scale
  # sample standardized effect sizes from normal distributions
  x <- stats::rnorm(n = n, mean = 0, sd = sqrt(s2))
  return(x)
}
```

Now we generate data, 

```{r}
set.seed(10)
sim_dat8 <- sim_mv(G = 1,
                    J = 50000, 
                    N = 0, 
                    h2 = 0.3, 
                    pi = 1, 
                    snp_effect_function = myf,
                    af = function(n){rbeta(n, 1, 5)},
                    snp_info = my_snp_info)
```

The annotations will be included in the `snp_info` data frame included in the output along with allele frequency and SNP number.

```{r}
head(sim_dat8$snp_info)
```
Let's verify that our function did what we thought it would by plotting the average heritability explained by bins of $(f_j (1-f_j))^{0.75}(0.1 A_{1,j} + 0.13 A_{2,j} + 0.9)$.

```{r}
my_snp_info <- sim_dat8$snp_info
my_snp_info$beta_joint <- sim_dat8$beta_joint[,1]
my_snp_info <- mutate(my_snp_info, 
                      exp_h2_factor = ((AF*(1-AF))^0.75) *( 0.1*A1 + 0.13*A2 + 0.9),
                      h2_explained = 2*AF*(1-AF)*(beta_joint^2))
my_snp_info$exp_h2_bin <- cut(my_snp_info$exp_h2_factor, breaks = 500)
my_snp_info %>% 
       group_by(exp_h2_bin) %>%
       summarize(mean_h2 = mean(h2_explained),
                 mean_exp_h2 = mean(exp_h2_factor))%>%
  ggplot() + geom_point(aes(x = mean_exp_h2, y = mean_h2)) + 
  xlab("Average Expected h2 (up to scale factor)") + 
  ylab("Average observed h2")

```