Simulating Data

library(GWASBrewer)
library(DiagrammeR)
library(dplyr)
library(reshape2)
library(ggplot2)
set.seed(1)

Introduction

This vignette demonstrates how to use the sim_mv function to simulate a few different types of GWAS data.

Introduction to sim_mv

The sim_mv function generates GWAS summary statistics for multiple continuous traits from a linear structural equation model encoded as a matrix of direct effects. Variants can be generated with or without LD. There are also some helper functions for LD-pruning and generating special kinds of direct effect matrices.

The sim_mv function internally calls a more general function sim_lf which generates summary statistics given a latent factor structure for a set of traits.

Throughout this vignette, we will use M to represent the number of traits and J to represent the number of variants.

Basic Usage

Input

The sim_mv function has five required arguments:

  • N: The GWAS sample size for each trait. This can be a scalar, vector, matrix, or data frame. If a vector, N should have length equal to M. If there are overlapping samples between GWAS, N should be an M × M matrix or data frame with format described in the “Sample Overlap” section below.
  • J: The number of SNPs to simulate (scalar).
  • h2: The heritability of each trait. This can be a scalar or a length M vector.
  • pi: The proportion of SNPs that have a direct effect on each trait. This can be a scalar, a length M vector, or a J × M matrix (more details on the matrix format can be found in the Effect Size Distribution vignette).
  • G: A matrix specifying direct causal effects between traits. G should be an M × M matrix described below. Alternatively, if there are no causal effects between any traits, G can be given as the positive integer M. This is equivalent to G = matrix(0, nrow = M, ncol = M).

There are additional optional arguments:

  • R_E and R_obs: Alternative ways to specify environmental correlation (see “Sample Overlap” section below for more details).
  • R_LD: A list of LD blocks (see “Simulating Data with LD”).
  • af: Vector of allele frequencies (required if R_LD is specified). If R_LD is not specified, af can be a scalar, a vector of length J, or a function (more details below). If R_LD is specified, af must be a vector with length corresponding to the size of the LD pattern (see “Simulating Data with LD”).
  • sporadic_pleiotropy: Allow a single variant to have direct effects on multiple traits at random. Defaults to TRUE.
  • pi_exact: If TRUE, the number of direct effect SNPs for each trait will be exactly equal to round(pi*J). Defaults to FALSE.
  • h2_exact: If TRUE, the heritability of each trait will be exactly h2. Defaults to FALSE.
  • est_s: If TRUE, return estimates of se(beta_hat). Defaults to FALSE but we generally recommend setting est_s to TRUE if you will be making use of standard errors.
  • snp_effect_function and snp_info: Parameters useful for specifying non-default distributions of effect sizes. These are not covered in this vignette but are covered in the Effect Size Distribution vignette.

Output

The sim_mv function returns an object with class sim_mv. This object contains the following elements:

GWAS summary statistics are contained in two or three matrices:

  • beta_hat: Simulated GWAS effect estimates.
  • se_beta_hat: True standard errors of beta_hat.
  • s_estimate: If est_s = TRUE, a simulated estimate of se_beta_hat.

True marginal and joint, total and direct effects are contained in four matrices:

  • beta_joint: Total causal effects of SNPs on traits.
  • beta_marg: Expected marginal association of SNPs on traits. beta_marg is the expected value of beta_hat. When there is no LD, beta_marg and beta_joint are the same.
  • direct_SNP_effects_joint: Direct causal effects of SNPs on traits. Direct means not mediated by other traits.
  • direct_SNP_effects_marg: Like beta_marg but considering only direct rather than total effects.

The relationship between traits is contained in two matrices:

  • direct_trait_effects: Matrix of direct effects between traits. This should be the same as the input G.
  • total_trait_effects: Matrix of total effects between traits.

Trait covariance is described by four matrices:

  • Sigma_G: Genetic variance-covariance matrix, determined by heritability and G.
  • Sigma_E: Environmental variance-covariance matrix. This is determined by heritability and R_E.
  • trait_corr: Population trait correlation, equal to cov2cor(Sigma_G + Sigma_E). For data produced with sim_mv, trait variance is always equal to 1, so trait_corr = Sigma_G + Sigma_E.
  • R: Correlation in sampling error of beta_hat across traits, equal to trait_corr scaled by a matrix of sample overlap proportions.

Some other pieces of information useful in more complicated scenarios:

  • h2: Realized trait heritability. For data produced with sim_mv, this will always be equal to diag(Sigma_G). However this may not be the case for resampled data (see the Resampling Data vignette).
  • pheno_sd: Standard deviation of traits. For data produced with sim_mv, this is always a vector of 1’s.
  • snp_info: A data frame of variant information including allele frequency and possibly other information (see the Effect Size Distribution vignette).
  • geno_scale: Equal to allele if effect sizes are per allele or sd if effect sizes are per genotype SD (i.e. standardized).

The order of the columns of all results corresponds to the order of traits in G.

Simplest Usage

The simplest thing to do with sim_mv is to generate summary statistics for M traits with no causal relationships and no LD between variants. In the code below, we generate data for 3 traits and 100,000 variants. Since the sample size is specified as a scalar, all three GWAS have the same sample size and there are no overlapping samples. For some variety, we give each trait a different heritability and a different proportion of causal variants.

dat_simple <- sim_mv(G = 3, # using the shortcut for specifying unrelated traits.
                     # equivalent to G = matrix(0, nrow = 3, ncol = 3)
                     N = 50000, # sample size, same for all three traits
                     J = 100000, # number of variants
                     h2 = c(0.1, 0.25, 0.4), # heritability
                     pi = c(0.01, 0.005, 0.02), # proportion of causal variants
                     est_s = TRUE # generate standard error estimates.
                     )
#> SNP effects provided for 100000 SNPs and 3 traits.

The sim_mv function always assumes that traits have been scaled to have variance equal to 1, so effect sizes are interpretable as the expected change in the trait, in units of SD, per alternate allele (if geno_scale = allele) or per genotype SD (if geno_scale = sd). In our case, dat_simple$geno_scale is equal to sd because no allele frequencies were provided. The realized genetic and environmental variance matrices are in Sigma_G and Sigma_E.

dat_simple$Sigma_G
#>              [,1]         [,2]         [,3]
#> [1,] 1.045339e-01 4.515705e-06 0.0005262836
#> [2,] 4.515705e-06 2.530663e-01 0.0009271551
#> [3,] 5.262836e-04 9.271551e-04 0.4429666631
dat_simple$Sigma_E
#>           [,1]      [,2]      [,3]
#> [1,] 0.8954661 0.0000000 0.0000000
#> [2,] 0.0000000 0.7469337 0.0000000
#> [3,] 0.0000000 0.0000000 0.5570333

The diagonal of dat_simple$Sigma_G is equal to the trait heritability (also stored in dat_simple$h2). We can see that these numbers are close to but not exactly equal to the values input to the h2 parameter. This is because h2 provides the expected heritability. If we want to force the realized heritability to be exactly equal to the input h2, we can use h2_exact = TRUE. The genetic covariance (the non-diagonal elements of Sigma_G) is slightly non-zero because the traits share a small number of causal variants by chance. If we want to prevent this, we can use sporadic_pleiotropy=FALSE. However, in some cases, it is not possible to satisfy sporadic_pleiotropy = FALSE and that option will generate an error.
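
For example, a minimal sketch of these options (using the same parameters as dat_simple above; not run in this vignette) might look like this:

# Sketch: force realized heritability to equal h2 exactly and prevent any
# variant from having direct effects on more than one trait.
dat_exact <- sim_mv(G = 3,
                    N = 50000,
                    J = 100000,
                    h2 = c(0.1, 0.25, 0.4),
                    pi = c(0.01, 0.005, 0.02),
                    h2_exact = TRUE,
                    sporadic_pleiotropy = FALSE)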

The simulated summary effect estimates are in dat_simple$beta_hat and simulated standard error estimates are in dat_simple$s_estimate. In our case, these will both be 100,000 by 3 matrices.

head(dat_simple$beta_hat)
#>              [,1]         [,2]         [,3]
#> [1,] 0.0063610849 -0.007271609  0.003926946
#> [2,] 0.0009913487  0.012587585  0.003931295
#> [3,] 0.0031938057 -0.002387450 -0.003170087
#> [4,] 0.0030629600  0.001481305 -0.002369274
#> [5,] 0.0020414224  0.007953421  0.005325885
#> [6,] 0.0030602370 -0.002037463 -0.008673129
head(dat_simple$s_estimate)
#>             [,1]        [,2]        [,3]
#> [1,] 0.004478273 0.004480611 0.004483377
#> [2,] 0.004468891 0.004478611 0.004482622
#> [3,] 0.004472107 0.004466985 0.004465830
#> [4,] 0.004465952 0.004460529 0.004476516
#> [5,] 0.004449150 0.004493462 0.004482853
#> [6,] 0.004480817 0.004464715 0.004479402

In actual GWAS data, these two matrices are the only information we get to observe. Everything else stored in dat_simple is information that is unobservable in “real” data but useful for benchmarking analysis methods. The effect estimates in dat_simple$beta_hat are always estimates of the true marginal effects in dat_simple$beta_marg. Since there is no LD in this data, marginal and joint effects are the same, so you will find that dat_simple$beta_marg and dat_simple$beta_joint are identical. We can identify causal variants as variants with non-zero values of beta_joint. For example, which(dat_simple$beta_joint[,1] != 0) would give the indices of variants causal for trait 1.
The estimates in dat_simple$s_estimate are estimates of the standard error of dat_simple$beta_hat. The true standard errors are stored in dat_simple$se_beta_hat. If we had left est_s at its default value of FALSE, dat_simple would not contain the s_estimate matrix.
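
As an illustration of working only from the observed matrices (a sketch; output not shown), we can compute p-values for trait 1 from beta_hat and s_estimate and compare the genome-wide significant variants to the true causal set:

z1 <- dat_simple$beta_hat[, 1] / dat_simple$s_estimate[, 1]
p1 <- 2 * pnorm(-abs(z1))
significant <- which(p1 < 5e-8)                   # variants detected from observed data
causal <- which(dat_simple$beta_joint[, 1] != 0)  # true causal variants
length(intersect(significant, causal))            # detected true positives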

Specifying Causal Relationships Between Traits

The matrix G specifies a linear structural equation model for a set of traits. To generate a set of M traits with no causal relationships, G can be set either equal to M or to an M × M matrix of 0’s. Otherwise, G must be an M × M matrix with G[i,j] specifying the direct linear effect of trait i on trait j. The diagonal entries of G should be 0 (no self effects). An error will be generated if G specifies a graph that contains cycles. Since all traits have variance equal to 1, G[i,j]^2 is the proportion of trait j variance explained by the direct effect of trait i.

For example, the matrix

G <- matrix(c(0, sqrt(0.25), 0, sqrt(0.15), 
              0, 0, 0, sqrt(0.1), 
              sqrt(0.2), 0, 0, -sqrt(0.3), 
              0, 0, 0, 0), nrow = 4, byrow = TRUE)
colnames(G) <- row.names(G) <- c("X", "Y", "Z", "W")
G
#>           X   Y Z          W
#> X 0.0000000 0.5 0  0.3872983
#> Y 0.0000000 0.0 0  0.3162278
#> Z 0.4472136 0.0 0 -0.5477226
#> W 0.0000000 0.0 0  0.0000000

corresponds to the graph

To simulate simple data from this graph, we can use

sim_dat1 <- sim_mv(G = G,
                  N = 50000,
                  J = 100000,
                  h2 = c(0.3, 0.3, 0.5, 0.4), 
                  pi = 0.01,
                  est_s = TRUE)
#> SNP effects provided for 100000 SNPs and 4 traits.

In this specification, we have four GWAS with sample size 50,000, again with no overlapping samples. Since J = 100000 and pi = 0.01, we expect each trait to have 1000 direct effect variants.
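
As a quick check (output not shown), we can count the realized number of direct effect variants for each trait; with pi_exact left at its default of FALSE, these counts will be close to, but not exactly, 1000:

colSums(sim_dat1$direct_SNP_effects_joint != 0)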

With causal relationships between traits, there are now some interesting things to notice about the output. First, we can see that there is both genetic and environmental covariance between the traits.

sim_dat1$Sigma_G
#>            X          Y          Z           W
#> X 0.28249614 0.14135064  0.2179663  0.03485498
#> Y 0.14135064 0.29912608  0.1089297  0.08886324
#> Z 0.21796635 0.10892974  0.4892269 -0.14942217
#> W 0.03485498 0.08886324 -0.1494222  0.40289524
sim_dat1$Sigma_E
#>           X         Y          Z          W
#> X 0.7175039 0.3545701  0.2288111  0.2619020
#> Y 0.3545701 0.7008739  0.1130720  0.2951464
#> Z 0.2288111 0.1130720  0.5107731 -0.1531603
#> W 0.2619020 0.2951464 -0.1531603  0.5971048
sim_dat1$trait_corr
#>           X         Y          Z          W
#> X 1.0000000 0.4959207  0.4467774  0.2967570
#> Y 0.4959207 1.0000000  0.2220017  0.3840097
#> Z 0.4467774 0.2220017  1.0000000 -0.3025824
#> W 0.2967570 0.3840097 -0.3025824  1.0000000

By default, sim_mv assumes that direct environmental components of each trait are independent, meaning that the DAG explains all of the correlation between traits. This is modifiable using the R_E and R_obs arguments, discussed a bit later in this vignette.

Relationships between traits are described by two matrices, direct_trait_effects which is equal to the input G matrix and total_trait_effects which gives the total effect of each trait on each other trait. Here we can notice that the direct effect of Z on W is -0.548 as specified but the total effect includes the effects mediated by X and Y as well, −0.548 + 0.447 ⋅ 0.387 + 0.447 ⋅ 0.5 ⋅ 0.316 = −0.304.

sim_dat1$direct_trait_effects
#>           X   Y Z          W
#> X 0.0000000 0.5 0  0.3872983
#> Y 0.0000000 0.0 0  0.3162278
#> Z 0.4472136 0.0 0 -0.5477226
#> W 0.0000000 0.0 0  0.0000000
sim_dat1$total_trait_effects
#>           X         Y Z          W
#> X 0.0000000 0.5000000 0  0.5454122
#> Y 0.0000000 0.0000000 0  0.3162278
#> Z 0.4472136 0.2236068 0 -0.3038068
#> W 0.0000000 0.0000000 0  0.0000000
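
These total effects are the sums of directed path products implied by G: the total effect matrix equals G + G^2 + G^3 + …, which for an acyclic graph is G(I − G)^{-1}. A quick check (a sketch; output not shown) should reproduce total_trait_effects up to numerical precision:

# Sum of all path products for the 4-trait DAG
G %*% solve(diag(4) - G)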

We can also use the output to understand which variants have direct effects on each trait and which have indirect (mediated) effects. The direct_SNP_effects_joint object gives the direct effect of each variant on each trait while beta_joint gives the total effect of each variant. Direct and total marginal effects are stored in direct_SNP_effects_marg and beta_marg. Since we again have no LD, the _marg and _joint matrices are the same.

Direct SNP effects are always independent across traits while total SNP effects are the sum of direct effects and indirect effects mediated by other traits. We will make some plots to see the difference.

First we plot direct SNP effects on Z vs direct SNP effects on W

plot(sim_dat1$direct_SNP_effects_joint[,3], sim_dat1$direct_SNP_effects_joint[,4], 
     xlab = "Direct Z effect", ylab = "Direct W effect")

Most variants have direct effects on at most one of Z or W but a small number affect both because sporadic_pleiotropy = TRUE by default.
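
We can count these variants directly (output not shown):

# Variants with non-zero direct effects on both Z and W
with(sim_dat1, sum(direct_SNP_effects_joint[, 3] != 0 &
                   direct_SNP_effects_joint[, 4] != 0))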

Next we plot the total SNP effects on Z vs the total SNP effects on W. Because Z has a causal effect on W, all variants with effects on Z also affect W. The line in the plot has slope equal to the total effect of Z on W. The majority of SNPs that have non-zero effect on Z fall exactly on this line. With sporadic_pleiotropy= FALSE, all of the variants with non-zero effect on Z would fall on this line. The variants on the vertical line at 0 are variants with non-zero direct effect on W but no direct effect on Z.

plot(sim_dat1$beta_joint[,3], sim_dat1$beta_joint[,4], 
     xlab = "Total Z effect", ylab = "Total W effect")
abline(0, sim_dat1$total_trait_effects[3,4], col = "red", lty = 2, lwd = 2)

Specifying Allele Frequencies

Allele frequencies can be specified using the af argument which can accept a scalar, a vector of length J, or a function that takes a single argument and returns a vector of allele frequencies with length determined by the argument. If af is a scalar, the same allele frequency is used for all variants. The function specification is used in the example below. If the af argument is provided, sim_mv will return all results on the per-allele scale and the geno_scale element of the returned object will be equal to allele. The snp_info element of the returned object will also include the allele frequency of each variant.

sim_dat2 <- sim_mv(G = G,
                  N = 50000, 
                  J = 10000, 
                  h2 = c(0.3, 0.3, 0.5, 0.4), 
                  pi = 0.01, 
                  af = function(n){rbeta(n, 1, 5)})
#> SNP effects provided for 10000 SNPs and 4 traits.

sim_dat2$geno_scale
#> [1] "allele"
head(sim_dat2$snp_info)
#>   SNP         AF
#> 1   1 0.05448172
#> 2   2 0.52558863
#> 3   3 0.02522715
#> 4   4 0.29653930
#> 5   5 0.29400173
#> 6   6 0.09964018

Simulating Data with LD

The sim_mv function can be used to generate data with LD by inputting a list of LD matrices and corresponding allele frequency information. The function will work fastest if the LD matrix is broken into smallish independent blocks. The input data format for the LD pattern is a list of either a) matrices, b) sparse matrices (class dsCMatrix) or c) eigen decompositions (class eigen). R_LD is interpreted as providing blocks in a block-diagonal SNP correlation matrix.

Importantly, the supplied LD pattern does not have to be the same size as the number of SNPs we wish to generate (J). It will be repeated or subsetted as necessary to create an LD pattern of the appropriate size.

The package contains a built-in data set containing the LD pattern from Chromosome 19 in HapMap3 broken into 39 blocks. This LD pattern was estimated from the HapMap3 European subset using LDShrink. This data set can also be downloaded here. The LD pattern must be accompanied by a vector of allele frequencies with length equal to the total size of the LD pattern (i.e. the sum of the size of each block in the list).

Let’s look at the built-in LD data

data("ld_mat_list")
data("AF")

length(ld_mat_list)
#> [1] 39

sapply(ld_mat_list, class)
#>  [1] "dsCMatrix" "dsCMatrix" "dsCMatrix" "dsCMatrix" "dsCMatrix" "dsCMatrix"
#>  [7] "dsCMatrix" "dsCMatrix" "dsCMatrix" "dsCMatrix" "dsCMatrix" "dsCMatrix"
#> [13] "dsCMatrix" "dsCMatrix" "dsCMatrix" "dsCMatrix" "dsCMatrix" "dsCMatrix"
#> [19] "dsCMatrix" "dsCMatrix" "dsCMatrix" "dsCMatrix" "dsCMatrix" "dsCMatrix"
#> [25] "dsCMatrix" "dsCMatrix" "dsCMatrix" "dsCMatrix" "dsCMatrix" "dsCMatrix"
#> [31] "dsCMatrix" "dsCMatrix" "dsCMatrix" "dsCMatrix" "dsCMatrix" "dsCMatrix"
#> [37] "dsCMatrix" "dsCMatrix" "dsCMatrix"
# This prints the number of SNPs in each block
sapply(ld_mat_list, nrow)
#>  [1] 140 519 339 435 523 280 675 325 651 548 274 483 442 744 460 177 469 173 358
#> [20] 564 392 737 596 818 307 863 276 435 204 364 480 381 757 844 753 656 483 856
#> [39] 709

sapply(ld_mat_list, nrow) %>% sum()
#> [1] 19490

length(AF)
#> [1] 19490

The LD pattern covers 19,490 SNPs, equal to the length of the AF vector. The built-in LD pattern corresponds to a density of about 1.2 million variants per genome. However, for this example, we will generate data for only 100k variants. This means that causal effects will be denser than they might be in more realistic data with the same number of effect variants.

set.seed(10)
sim_dat1_LD <- sim_mv(G = G,
                      J = 1e5, 
                      N = 50000, 
                      h2 = c(0.3, 0.3, 0.5, 0.4), 
                      pi = 0.01, 
                      R_LD = ld_mat_list, 
                      af = AF)
#> SNP effects provided for 100000 SNPs and 4 traits.

In data with LD, the _joint objects and _marg objects are not identical. For example, we can compare beta_joint and beta_marg for the third trait (Z).

with(sim_dat1_LD, plot(beta_joint[,3], beta_marg[,3]))
abline(0, 1, lty = 2, lwd =2, col = "red")

Variants with non-zero values of beta_joint[,3] have causal effects on Z while those with non-zero values of beta_marg[,3] have non-zero marginal association with Z, meaning that they are in LD with at least one causal variant. In the plot, we see that many variants with no causal effect have non-zero marginal association, which is expected. The causal variants don’t fall exactly on the red line because multiple causal variants may fall into the same LD block, so a causal variant’s marginal effect also picks up contributions from other causal variants it is in LD with.
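
To count the variants that have no causal effect on Z but a non-zero marginal association (output not shown):

with(sim_dat1_LD, sum(beta_joint[, 3] == 0 & beta_marg[, 3] != 0))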

LD-Pruning, LD-Proxies, and LD Matrix Extraction

Many post-GWAS applications such as Mendelian randomization and polygenic risk score construction require an LD-pruned set of variants. GWASBrewer contains a few LD-related functions to help with pruning and testing methods that require input LD matrices. Note that all of these methods use the true LD pattern rather than estimated LD.

The sim_ld_prune function will perform LD-clumping on simulated data, prioritizing variants according to a supplied pvalue vector. Although this argument is called pvalue, it can be any numeric vector used to prioritize variants. The pvalue argument can also accept an integer. If pvalue = i, variants will be prioritized according to the p-value for the ith trait in the simulated data. If pvalue is omitted, variants will be prioritized randomly (so a different result will be obtained each re-run unless a seed is set).

To speed up performance, if you only need variants with p-value less than a certain threshold, supply the pval_thresh argument. Below we prune based on the p-values for trait Z in two equivalent ways.

pruned_set1 <- sim_ld_prune(dat = sim_dat1_LD, 
                            pvalue = 3, 
                            R_LD = ld_mat_list, 
                            r2_thresh = 0.1,
                            pval_thresh = 1e-6)
#> Prioritizing variants based on p-value for trait 3
length(pruned_set1)
#> [1] 354
pval3 <- with(sim_dat1_LD, 2*pnorm(-abs(beta_hat[,3]/se_beta_hat[,3])))
pruned_set2 <- sim_ld_prune(dat = sim_dat1_LD, 
                            pvalue = pval3, 
                            R_LD = ld_mat_list, 
                            r2_thresh = 0.1,
                            pval_thresh = 1e-6)
all.equal(pruned_set1, pruned_set2)
#> [1] TRUE

sim_ld_prune returns a vector of indices corresponding to an LD-pruned set of variants.
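
These indices can be used to subset the summary statistics, for example (a usage sketch) to build an LD-pruned set of associations for trait Z:

# Effect estimates and standard errors for the pruned variants only
pruned_beta <- sim_dat1_LD$beta_hat[pruned_set1, 3]
pruned_se <- sim_dat1_LD$se_beta_hat[pruned_set1, 3]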

The sim_ld_proxy function will return indices of LD-proxies (variants with LD above a given threshold) for a supplied set of variants. Here we extract proxies for a few arbitrary variants. The return_mat option will cause the function to return the LD matrix for the proxies as well as the indices of the proxies.

ld_proxies <- sim_ld_proxy(sim_dat1_LD, index = c(100, 400, 600), R_LD = ld_mat_list, r2_thresh = 0.64, return_mat = TRUE)
ld_proxies
#> [[1]]
#> [[1]]$index
#> [1] 100
#> 
#> [[1]]$proxy_index
#> [1] 98 99
#> 
#> [[1]]$Rproxy
#>           100        98        99
#> 100 1.0000000 0.9568668 0.9745924
#> 98  0.9568668 1.0000000 0.9798682
#> 99  0.9745924 0.9798682 1.0000000
#> 
#> 
#> [[2]]
#> [[2]]$index
#> [1] 400
#> 
#> [[2]]$proxy_index
#> [1] 395 396 397 398 399 401 402
#> 
#> [[2]]$Rproxy
#>           400       395       396       397       398       399       401
#> 400 1.0000000 0.9495092 0.9661446 0.9793188 0.9730871 0.9306611 0.9105497
#> 395 0.9495092 1.0000000 0.9812294 0.9675572 0.9742288 0.9302869 0.8876271
#> 396 0.9661446 0.9812294 1.0000000 0.9841205 0.9905250 0.9454487 0.9054550
#> 397 0.9793188 0.9675572 0.9841205 1.0000000 0.9810106 0.9367627 0.9175545
#> 398 0.9730871 0.9742288 0.9905250 0.9810106 1.0000000 0.9522425 0.9119614
#> 399 0.9306611 0.9302869 0.9454487 0.9367627 0.9522425 1.0000000 0.8731959
#> 401 0.9105497 0.8876271 0.9054550 0.9175545 0.9119614 0.8731959 1.0000000
#> 402 0.8080771 0.7821419 0.7934351 0.8097815 0.7991366 0.7643812 0.7522628
#>           402
#> 400 0.8080771
#> 395 0.7821419
#> 396 0.7934351
#> 397 0.8097815
#> 398 0.7991366
#> 399 0.7643812
#> 401 0.7522628
#> 402 1.0000000
#> 
#> 
#> [[3]]
#> [[3]]$index
#> [1] 600
#> 
#> [[3]]$proxy_index
#> [1] 601 606
#> 
#> [[3]]$Rproxy
#>            600        601        606
#> 600  1.0000000 -0.8044757  0.8670863
#> 601 -0.8044757  1.0000000 -0.7193148
#> 606  0.8670863 -0.7193148  1.0000000

Finally, the sim_extract_ld function will extract the LD matrix for a set of variants.

ld_mat1 <- sim_extract_ld(sim_dat1_LD, index = 600:606, R_LD = ld_mat_list)
ld_mat1
#>            600        601        602        603        604        605
#> 600  1.0000000 -0.8044757 -0.6715969 -0.7920287 -0.7323562 -0.4605856
#> 601 -0.8044757  1.0000000  0.8354364  0.9791149  0.9404418  0.6067155
#> 602 -0.6715969  0.8354364  1.0000000  0.8138201  0.7891286  0.6555054
#> 603 -0.7920287  0.9791149  0.8138201  1.0000000  0.9363088  0.5849580
#> 604 -0.7323562  0.9404418  0.7891286  0.9363088  1.0000000  0.5890082
#> 605 -0.4605856  0.6067155  0.6555054  0.5849580  0.5890082  1.0000000
#> 606  0.8670863 -0.7193148 -0.5968298 -0.7200977 -0.7469780 -0.3902207
#>            606
#> 600  0.8670863
#> 601 -0.7193148
#> 602 -0.5968298
#> 603 -0.7200977
#> 604 -0.7469780
#> 605 -0.3902207
#> 606  1.0000000

Specifying Sample Size, Sample Overlap, and Environmental Correlation

If two GWAS are performed on different traits using overlapping samples, the sampling errors of effect estimates will be correlated. If the two GWAS have sample sizes $N_1$ and $N_2$ with $N_c$ overlapping samples, then the correlation of $\hat{z}_{1j}$ and $\hat{z}_{2j}$, the z-scores for variant $j$ in study 1 and study 2, is approximately $\frac{N_c}{\sqrt{N_1 N_2}} \rho_{1,2}$, where $\rho_{1,2}$ is the observational trait correlation (assuming the studies are conducted in the same super population). Below we describe how to specify the observational correlation and sample overlap.
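
For example (hypothetical numbers), two GWAS of 60,000 samples each that share 30,000 samples, studying traits with observational correlation 0.4, would have z-score correlation of about 0.2:

N1 <- 60000
N2 <- 60000
Nc <- 30000
rho_12 <- 0.4
Nc / sqrt(N1 * N2) * rho_12  # 0.5 * 0.4 = 0.2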

Specifying Sample Size and Sample Overlap

The sample size argument, N, can be specified as a scalar, a vector, a matrix, or a data frame. Both the scalar and vector specifications indicate no overlapping samples between GWAS. To specify sample overlap, we need to use the matrix or data frame formats. If N is a matrix, it should have dimension M × M with N[i,i] giving the sample size of study i and N[i,j] giving the number of samples that are in both study i and study j. In data frame format, N should have columns named trait_1, …, trait_[M] and N. The trait_[x] columns will be interpreted as logicals and the N column should give the number of samples in each combination of studies. For example, the following specifications for two traits are equivalent.

N <- matrix(c(60000, 30000, 30000, 60000), nrow = 2, ncol = 2)
N
#>       [,1]  [,2]
#> [1,] 60000 30000
#> [2,] 30000 60000

Ndf <- data.frame(trait_1 = c(1, 1, 0), 
                  trait_2 = c(0, 1, 1), 
                  N = rep(30000, 3))
Ndf
#>   trait_1 trait_2     N
#> 1       1       0 30000
#> 2       1       1 30000
#> 3       0       1 30000

When there are more than two traits, the data frame format contains more information than the matrix format. This format is required by the resample_inddata function (covered in a different vignette). For sim_mv, either format is sufficient.
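
For example (hypothetical numbers), the data frame below specifies three studies of 30,000 samples each in which 10,000 samples are shared by all three studies. The matrix format records only the pairwise overlaps and so could not distinguish this design from one with three distinct pairwise overlaps of 10,000.

Ndf3 <- data.frame(trait_1 = c(1, 0, 0, 1),
                   trait_2 = c(0, 1, 0, 1),
                   trait_3 = c(0, 0, 1, 1),
                   N = c(20000, 20000, 20000, 10000))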

Using Sample Size 0 to Omit Traits

In some circumstances, we may want to generate true effects for a trait but not generate summary statistics. In this case, we can use a sample size of 0. Setting N = 0 will mean that beta_hat, s_estimate, and se_beta_hat are NA for all traits. Alternatively, if N is a vector with some elements equal to 0, summary statistics will be missing only for those traits. Finally, if N is a matrix, we can set the row and column corresponding to the omitted trait to zero. For example, in the specification below, we omit summary statistics for Z, the third trait. We also specify some overlapping samples in the studies for the other three traits.

N <- matrix(c(50000, 10000, 0, 0, 
              10000, 40000, 0, 10000, 
              0, 0, 0, 0, 
              0, 10000, 0, 20000), nrow = 4)
N
#>       [,1]  [,2] [,3]  [,4]
#> [1,] 50000 10000    0     0
#> [2,] 10000 40000    0 10000
#> [3,]     0     0    0     0
#> [4,]     0 10000    0 20000
sim_dat2 <- sim_mv(G = G,
                  N = N,
                  J = 100000,
                  h2 = c(0.3, 0.3, 0.5, 0.4), 
                  pi = 0.01,
                  est_s = TRUE)
#> SNP effects provided for 100000 SNPs and 4 traits.
head(sim_dat2$beta_hat)
#>               [,1]         [,2] [,3]          [,4]
#> [1,]  1.258828e-03 -0.002383374   NA  1.074629e-02
#> [2,] -8.580051e-03 -0.015482893   NA -7.281355e-03
#> [3,] -2.515356e-03 -0.001070368   NA -2.189328e-02
#> [4,]  5.422213e-05 -0.003164621   NA  8.103383e-03
#> [5,] -8.150978e-03 -0.005807492   NA  5.922625e-05
#> [6,] -5.861632e-03 -0.003012872   NA  3.302201e-03
head(sim_dat2$se_beta_hat)
#>             [,1]  [,2] [,3]        [,4]
#> [1,] 0.004472136 0.005   NA 0.007071068
#> [2,] 0.004472136 0.005   NA 0.007071068
#> [3,] 0.004472136 0.005   NA 0.007071068
#> [4,] 0.004472136 0.005   NA 0.007071068
#> [5,] 0.004472136 0.005   NA 0.007071068
#> [6,] 0.004472136 0.005   NA 0.007071068
head(sim_dat2$s_estimate)
#>             [,1]        [,2] [,3]        [,4]
#> [1,] 0.004453566 0.004976698   NA 0.007050114
#> [2,] 0.004471810 0.005008104   NA 0.007081267
#> [3,] 0.004471716 0.004976063   NA 0.007033821
#> [4,] 0.004459041 0.004982154   NA 0.007068942
#> [5,] 0.004480086 0.004961544   NA 0.007006829
#> [6,] 0.004472674 0.004987358   NA 0.007045390

Understanding Genetic and Environmental Covariance

In the model used by GWASBrewer, each trait has a direct genetic component, a direct environmental component, and components from effects of other traits in the DAG. For example, in the four trait DAG we have been working with, the underlying model really looks like this:

In this picture, Gx, Gy, Gz, and Gw are direct genetic components of each trait and Ex, Ey, Ez, and Ew are direct environmental components. We always assume that the direct genetic components are independent of each other and of the environmental components. By default, we also assume that the environmental components are independent of each other (so all blue circles in the picture above are mutually independent). This means that, by default, the observational trait correlation is explained completely by the specified DAG. In this case, the default trait correlation is

sim_dat1$trait_corr
#>           X         Y          Z          W
#> X 1.0000000 0.4959207  0.4467774  0.2967570
#> Y 0.4959207 1.0000000  0.2220017  0.3840097
#> Z 0.4467774 0.2220017  1.0000000 -0.3025824
#> W 0.2967570 0.3840097 -0.3025824  1.0000000

The simulation data object contains four matrices that describe trait covariance: Sigma_E, the total environmental trait covariance; Sigma_G, the total genetic trait covariance; trait_corr, the trait correlation, equal to Sigma_G + Sigma_E; and R, the row correlation of beta_hat. R is equal to trait_corr scaled by a matrix of sample overlap proportions.

Sigma_G is always determined by the DAG and the heritabilities. Currently, there are two ways to modify Sigma_E. The first is to specify R_obs, which directly specifies the observational trait correlation (trait_corr). In some cases, the requested observational correlation matrix is incompatible with the trait relationships and heritabilities. In our example, Z has a strong negative effect on W, so it is not possible for all four traits to be strongly positively correlated with each other. We get an error using

R_obs <- matrix(0.8, nrow = 4, ncol = 4)
diag(R_obs) <- 1

wont_run <- sim_mv(G = G,
                   J = 50000,
                  N = 60000,  
                  h2 = c(0.3, 0.3, 0.5, 0.4), 
                  pi = 1000/50000, 
                  R_obs = R_obs )
#> Error in value[[3L]](cond): R_obs is incompatible with trait relationships and heritability.

A quick way to find out if your desired observational correlation is feasible is to compute R_obs - Sigma_G and check that this matrix is positive definite (i.e. check that it has all positive eigenvalues).
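
For example (a sketch using Sigma_G from sim_dat1, which was generated from the same G and h2; output not shown), we would expect at least one negative eigenvalue here, consistent with the error above:

eigen(R_obs - sim_dat1$Sigma_G, symmetric = TRUE, only.values = TRUE)$values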

An alternative way to specify the environmental correlation is to specify R_E which gives the total correlation of environmental components, i.e. cov2cor(Sigma_E). Importantly, R_E is the correlation of the total environmental components, not the direct environmental components shown in the graph above. Any positive definite correlation matrix is a valid input for R_E, so we could use

R_E <- matrix(0.8, nrow = 4, ncol = 4)
diag(R_E) <- 1

sim_dat3 <- sim_mv(G = G,
                   J = 50000,
                  N = 60000,  
                  h2 = c(0.3, 0.3, 0.5, 0.4), 
                  pi = 1000/50000, 
                  R_E = R_E)
#> SNP effects provided for 50000 SNPs and 4 traits.

which results in a total trait correlation of

sim_dat3$trait_corr
#>           X         Y         Z         W
#> X 1.0000000 0.7109713 0.7020261 0.5677883
#> Y 0.7109713 1.0000000 0.5966303 0.6110510
#> Z 0.7020261 0.5966303 1.0000000 0.3158510
#> W 0.5677883 0.6110510 0.3158510 1.0000000

It is not currently possible to specify the correlation of the direct environmental components.

Note that the environmental covariance and observational trait correlation only influence the distribution of summary statistics if there is overlap between GWAS samples. Since our specification has no sample overlap, specifying R_E in sim_dat3 does not change the distribution of the simulated summary statistics at all. We can tell that this is the case because sim_dat3$R is the identity, indicating that all summary statistics are independent across traits.

sim_dat3$R
#>      [,1] [,2] [,3] [,4]
#> [1,]    1    0    0    0
#> [2,]    0    1    0    0
#> [3,]    0    0    1    0
#> [4,]    0    0    0    1

By contrast, in sim_dat2, we did have sample overlap, so there is non-zero correlation between the summary statistics.

sim_dat2$R
#>           [,1]      [,2] [,3]      [,4]
#> [1,] 1.0000000 0.1120147    0 0.0000000
#> [2,] 0.1120147 1.0000000    0 0.1355998
#> [3,] 0.0000000 0.0000000    0 0.0000000
#> [4,] 0.0000000 0.1355998    0 1.0000000
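
Consistent with the approximation given at the start of this section, each off-diagonal entry of sim_dat2$R equals the sample overlap proportion times the observational trait correlation. For example (output not shown), for traits 1 and 2:

# Overlap proportion times observational correlation; should match
# sim_dat2$R[1,2]
10000 / sqrt(50000 * 40000) * sim_dat2$trait_corr[1, 2]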