Resampling and Re-Scaling Summary and Individual Level Data

library(GWASBrewer)
library(dplyr)
library(ggplot2)
set.seed(1)

Introduction

In some applications, we may need to generate multiple data sets for the same set of effect sizes. This would mimic performing multiple GWAS on the same trait. There are two functions, resample_sumstats and resample_inddata that facilitate this. This vignette will demonstrate these functions as well as discuss important considerations when the simulation involves sampling GWAS for multiple ancestries for the same trait.

To get started, we will generate a small sim_mv object. For a review of that function, see the “Simulating Data” vignette. For this vignette, we will keep things very simple by having only 12 variants with very large effect sizes. Of course, this is not realistic, but it will let us see exactly what is going on. In our example, we will have two traits. The first trait has an effect of 0.2 on the second trait. Recalling that sim_mv always sets the variance of each trait to 1, this means, that trait 1 explains 4% (0.2² = 0.04) of the variance of trait 2. We will also use an LD pattern. Here we just specify the LD pattern for five variants. This will be repeated 2.2 times to cover 12 variants. We will specify N so that there is no sample overlap.

set.seed(1000)
my_ld1 <- matrix(c( 1.00,  0.60,  0.40, -0.1, -0.07,
                 0.60,  1.00,  0.60, -0.1, -0.07,
                 0.40,  0.60,  1.00, -0.1, -0.07,
                 -0.10, -0.10, -0.10,  1.0,  0.90,
                 -0.07, -0.07, -0.07,  0.9,  1.00), nrow = 5, byrow = T)
af1 <- c(0.35, 0.3, 0.4, 0.72, 0.75)

G <- matrix(c(0, 0.2, 0, 0), nrow = 2, byrow =TRUE) # matrix of causal effects
orig_dat <- sim_mv(N = c(10000, 20000),
                 J = 12,
                 h2 = c(0.2, 0.3),
                 pi = 0.4,
                 G = G,
                 R_LD = list(my_ld1),
                 af = af1, 
                 est_s = TRUE)
#> SNP effects provided for 12 SNPs and 2 traits.

Resampling Summary Statistics or Individual Level Data from the Same Population

In the simplest scenario, we want to simulate performing a new GWAS or collecting new individual level data for the original set of traits in the original population, meaning that the LD pattern and allele frequencies are the same.

Resampling Summary Statistics from the Same Population

To resample summary statistics, we can use resample_sumstats. Here, we will regenerate data assuming that samples for the new GWAS are totally overlapping.

N1 <- matrix(50000, nrow = 2, ncol = 2)
new_dat1 <- resample_sumstats(dat = orig_dat,
                              N = N1, 
                              R_LD = list(my_ld1), 
                              af = af1, 
                              est_s = TRUE)
#> I will assume that the environmental variance is the same in the old and new population.
#> I will assume that environmental correlation is the same in the old and new population. Note that this could result in different overall trait correlations.
#> SNP effects provided for 12 SNPs and 2 traits.

The new_dat1 object is another object of class sim_mv. The new and old data have the same effect sizes so beta_joint, beta_marg, direct_SNP_effects_joint, and direct_SNP_effects_marg are all the same between the two objects. They also have the same causal effect structure so the trait effect matrices direct_trait_effects and total_trait_effects are the same. However, the summary statistic matrices beta_hat and s_estimate are different due to random sampling variation. The true standard errors, se_beta_hat are also different because the sample size is different. Finally, since there is sample overlap in the samples for new_dat1 but not in orig_dat, the R matrices which give the correlation of effect sizes across traits are different.

all.equal(orig_dat$beta_joint, new_dat1$beta_joint)
#> [1] TRUE
all.equal(orig_dat$beta_marg, new_dat1$beta_marg)
#> [1] TRUE

head(orig_dat$beta_hat)
#>               [,1]         [,2]
#> [1,] -0.1080910136 -0.009750299
#> [2,] -0.1571366506  0.013286133
#> [3,] -0.1125931870  0.004225466
#> [4,]  0.2322018052  0.020554680
#> [5,]  0.2105222770  0.019324228
#> [6,]  0.0002105625 -0.299937884
head(new_dat1$beta_hat)
#>              [,1]         [,2]
#> [1,] -0.106418649  0.019449404
#> [2,] -0.169053933  0.029430270
#> [3,] -0.101005542  0.009112222
#> [4,]  0.249366373  0.050046188
#> [5,]  0.233752764  0.051418782
#> [6,] -0.002706624 -0.291194649

head(orig_dat$se_beta_hat)
#>            [,1]       [,2]
#> [1,] 0.01482499 0.01048285
#> [2,] 0.01543033 0.01091089
#> [3,] 0.01443376 0.01020621
#> [4,] 0.01574852 0.01113589
#> [5,] 0.01632993 0.01154701
#> [6,] 0.01482499 0.01048285
head(new_dat1$se_beta_hat)
#>             [,1]        [,2]
#> [1,] 0.006629935 0.006629935
#> [2,] 0.006900656 0.006900656
#> [3,] 0.006454972 0.006454972
#> [4,] 0.007042952 0.007042952
#> [5,] 0.007302967 0.007302967
#> [6,] 0.006629935 0.006629935

head(orig_dat$s_estimate)
#>            [,1]       [,2]
#> [1,] 0.01459221 0.01049184
#> [2,] 0.01519329 0.01098407
#> [3,] 0.01429748 0.01022185
#> [4,] 0.01563252 0.01119631
#> [5,] 0.01623488 0.01164188
#> [6,] 0.01472800 0.01022575
head(new_dat1$s_estimate)
#>             [,1]        [,2]
#> [1,] 0.006608105 0.006621619
#> [2,] 0.006870459 0.006907517
#> [3,] 0.006450718 0.006463427
#> [4,] 0.006970491 0.007052360
#> [5,] 0.007221276 0.007289588
#> [6,] 0.006631894 0.006499691

Resampling Individual Level Data from the Same Population

We can also generate individual level data. For this we use the resample_inddata function. This function can be used to produce three types of output:

Genotypes only
Genotypes and phenotypes
Phenotypes only

resample_inddata uses the hapsim package to generate individual level data. The sample size argument, N, in resample_inddata can accept three types of input: scalar, vector, or data frame. These formats are the same as those used by sim_mv, so if N is a scalar or vector, resample_inddata will assume there is no sample overlap. This means that if we use N = 10 with two traits, we will get out genotype information for 20 individuals, of which half have phenotype 1 information and half have phenotype 2 information. To specify sample overlap, we can use the data frame format. As a reminder, a sample size data frame should have columns named trait_1, … trait_[M] and N. The trait_[x] columns will be interpreted as logicals and the N column should give the number of samples in each combination of studies.

Generating genotypes only

If resample_inddata is used to generate genotypes only, we are not really “resampling” anything because there is no need to input trait data. In this case, we only need to supply the total number of individuals, the number of variants, and the LD pattern if desired. Here we will generate data for 15 individuals with an LD pattern matching the one used in orig_dat. The returned data includes the genotype matrix and a vector of population allele frequencies.

genos_only <- resample_inddata(N = 15, 
                               J = 12,
                               R_LD = list(my_ld1), 
                               af = af1)
#> Generating genotype matrix only.
names(genos_only)
#> [1] "X"  "af"
dim(genos_only$X)
#> [1] 15 12
length(genos_only$af)
#> [1] 12
genos_only$X
#>     [,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9] [,10] [,11] [,12]
#> S1     0    0    0    2    2    1    1    1    2     2     1     0
#> S2     1    1    0    2    2    0    0    0    2     2     1     1
#> S3     0    0    0    2    2    1    0    2    2     2     1     1
#> S4     0    0    1    0    0    0    0    1    0     0     2     2
#> S5     2    1    1    2    2    1    0    0    0     0     0     0
#> S6     1    0    2    1    1    0    0    0    2     2     1     0
#> S7     0    0    2    2    2    1    0    0    1     1     2     2
#> S8     0    0    0    2    2    0    1    0    2     2     2     2
#> S9     1    0    0    2    2    0    0    0    2     2     2     2
#> S10    2    2    2    1    1    1    1    1    1     1     2     2
#> S11    1    0    1    2    2    2    2    2    1     1     1     1
#> S12    0    1    1    2    2    0    2    2    2     2     1     1
#> S13    0    0    1    1    1    2    2    2    1     1     1     1
#> S14    0    0    0    1    1    1    1    0    2     2     1     1
#> S15    0    1    0    2    2    1    1    2    2     2     2     2

Generating Genotypes and Phenotypes

To generate both genotypes and phenotypes, we need to include a sim_mv object that contains effect sizes. For our example, we will use the data frame sample size format to indicate that three individuals have both phenotypes measured and seven have only one or the other measured. Note that we don’t need to include the J argument because the sim_mv object contains all the necessary information. The vector of M phenotypes for each individual is simulated as Y_i = Y_G, i + Y_E, i where Y_G, i is the vector of genetic components computed from the genotypes and effect sizes stored in the original sim_mv object and Y_E, i is a vector of environmental components sampled from a normal distribution with variance given by the Sigma_E matrix in the original sim_mv object.

N <- data.frame("trait_1" = c(1, 1, 0), "trait_2" = c(0, 1, 1), "N" = c(3, 3, 4))
new_ind_dat1 <- resample_inddata(N = N, 
                                 dat = orig_dat, 
                                 R_LD = list(my_ld1), 
                                 af = af1, 
                                 calc_sumstats = FALSE)
#> Generating both genotypes and phenotypes.
#> SNP effects provided for 12 SNPs and 2 traits.
#> I will assume that the environmental variance is the same in the old and new population.
#> I will assume that environmental correlation is the same in the old and new population. Note that this could result in different overall trait correlations.

Genotype data are stored in new_ind_dat1$X and phenotype data are in new_ind_dat1$Y.

names(new_ind_dat1)
#> [1] "X"          "Y"          "af"         "Sigma_G"    "Sigma_E"   
#> [6] "pheno_sd"   "h2"         "trait_corr" "beta_joint"
new_ind_dat1$X
#>     [,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9] [,10] [,11] [,12]
#> S1     0    0    1    1    1    0    0    0    1     1     1     2
#> S2     2    2    2    1    1    0    0    0    2     2     2     1
#> S3     0    0    1    1    1    1    1    1    0     0     0     0
#> S4     0    0    1    2    2    0    0    0    1     1     0     1
#> S5     0    0    0    2    2    1    0    0    2     2     1     1
#> S6     0    0    0    2    2    1    1    0    1     2     1     1
#> S7     1    1    1    1    1    2    1    1    2     2     2     1
#> S8     1    0    1    2    2    0    1    1    2     2     1     1
#> S9     1    1    2    1    1    2    2    2    0     0     2     2
#> S10    0    0    0    1    1    0    0    0    2     2     1     1
new_ind_dat1$Y
#>           y_1         y_2
#> 1  -0.9660107          NA
#> 2  -1.6707661          NA
#> 3  -0.4127266          NA
#> 4   0.3558235  1.57133493
#> 5   0.5363123  0.05390283
#> 6  -1.2896112  0.12940470
#> 7          NA  0.48527806
#> 8          NA -2.00566078
#> 9          NA -2.01050960
#> 10         NA  0.97844193

The returned object also includes some additional information including Sigma_G, Sigma_E, and beta_joint which have the same meaning as the objects with the same name in a sim_mv object. If we had set calc_sumstats to TRUE, the object would also include effect size and standard error estimates for the association between each variant and each trait.

Generating Phenotypes Only

Finally, we can use resample_inddata to calculate phenotypes for a previously generated set of genotypes from a sim_mv object. To do this, we supply the genotypes matrix to the genos argument. If this is supplied, no genotypes matrix will be returned. Even though the genotype matrix is supplied, we also need to supply the R_LD and af arguments for the population LD and allele frequencies. These are used to correctly calculate the genetic variance-covariance matrix. Note that it is important that the total number of individuals implied by the sample size argument match the number of rows in the genotypes matrix. The following call will generate an error because we supplied 15 genotypes but the sample size data frame only includes 10 individuals.

phenos_only <- resample_inddata(N = N, 
                                dat = orig_dat, 
                                genos = genos_only$X,
                                R_LD = list(my_ld1),
                                af = af1,
                                calc_sumstats = FALSE)
#> Generating phenotypes only.
#> Error in check_matrix(genos, "genos", ntotal, J): Expected genos to have 10 rows, found 15

We can fix the poblem by only including the first 10 rowsof the genotype matrix.

phenos_only <- resample_inddata(N = N, 
                                dat = orig_dat, 
                                genos = genos_only$X[1:10,],
                                R_LD = list(my_ld1),
                                af = af1,
                                calc_sumstats = FALSE)
#> Generating phenotypes only.
#> SNP effects provided for 12 SNPs and 2 traits.
#> I will assume that the environmental variance is the same in the old and new population.
#> I will assume that environmental correlation is the same in the old and new population. Note that this could result in different overall trait correlations.
names(phenos_only)
#> [1] "Y"          "af"         "Sigma_G"    "Sigma_E"    "pheno_sd"  
#> [6] "h2"         "trait_corr" "beta_joint"
phenos_only$Y
#>           y_1        y_2
#> 1  -0.6443696         NA
#> 2   0.7978988         NA
#> 3  -0.2142834         NA
#> 4  -0.2455013 -0.3309879
#> 5  -0.4448521 -0.9023457
#> 6  -0.1022008 -0.6312573
#> 7          NA -0.1918427
#> 8          NA  1.0820581
#> 9          NA -0.4664964
#> 10         NA -0.7820182

Resampling Data from a Different Population

Both resample_sumstats and resample_inddata can be used as above with different LD and allele frequencies than used to create the original object. However, there are some special considerations about trait variance and scaling that we will discuss in this section.

Understanding Effect Size Units

To understand resampled data that use different LD or allele frequencies than the original data, it is important to understand the units of effect sizes produced by sim_mv. sim_mv always assumes that phenotype variance is equal to 1, so the interpretation of the effects in beta_joint is the change in phenotype in units of phenotype SD per change in genotype in units of either allele or genotype SD. To see the genotype unit, we can check the geno_scale object. For example,

orig_dat$geno_scale
#> [1] "allele"

By default, sim_mv will use the per-allele scale if allele frequencies are available and otherwise use the per-SD scale. We can also check the the phenotypes are scaled to unit variance in pheno_sd which will always be a vector of 1’s if the object was produced by sim_mv.

orig_dat$pheno_sd
#> [1] 1 1

Changing LD and Allele Frequencies

The resampling functions in GWASBrewer assume that effect sizes in the new population are the same as in the old population on the genotype and phenotype scale given. If you are changing populations, it is much more sensible to assume that effect sizes are constant on the per-allele scale than on the per-genotype SD scale. This means that if you want to regenerate data from a different population, it is a good idea to start with input data that have geno_scale equal to “allele”.

One consequence of differing allele frequencies and LD structure is that the total genetic variance will be different in the new population than in the old. By default, the resampling functions will assume that the environmental variance is the same in the two populations. This means that the overall variance of the phenotype in the new population will probably not be 1. The resampling functions will not rescale the phenotype in the new population because this would mean that the phenotypes in the two populations had different units and were not comparable. However, you can rescale the output after the fact using if you would like the phenotypes to have unit variance.

We can see this in action by resampling summary statistics using different allele frequencies and LD pattern than used originally.

af2 <- rep(0.1, 5)
my_ld2 <- matrix(0.3, nrow = 5, ncol = 5)
diag(my_ld2) <- 1
new_dat2 <- resample_sumstats(dat = orig_dat, 
                              N = N, 
                              R_LD = list(my_ld2), 
                              af = af2)
#> Genetic variance in the new population differs from the genetic variance in the old population.
#> I will assume that the environmental variance is the same in the old and new population.
#> I will assume that environmental correlation is the same in the old and new population. Note that this could result in different overall trait correlations.
#> Note that the phenotype in the new population has a different variance from the phenotype in the old population.
#> I will keep the phenotype on the same scale as the original data, so effect sizes in the old and new object are comparable. If you would like to rescale the phenotype to have variance 1, use rescale_sumstats.
#> SNP effects provided for 12 SNPs and 2 traits.

Notice that the function produces a message to let us know that the phenotype has a different variance in the new population. We can check by looking at pheno_sd. The genetic covariance matrix, heritability, and overall trait correlation will also be different. However, the environmental covariance will be the same.

new_dat2$pheno_sd
#> [1] 0.9649663 0.9164662
orig_dat$pheno_sd
#> [1] 1 1
new_dat2$h2
#> [1] 0.04683348 0.11545049
orig_dat$h2
#> [1] 0.1124495 0.2570578
new_dat2$Sigma_G
#>             [,1]        [,2]
#> [1,] 0.043609461 0.008093447
#> [2,] 0.008093447 0.096968051
orig_dat$Sigma_G
#>            [,1]       [,2]
#> [1,] 0.11244947 0.01934268
#> [2,] 0.01934268 0.25705783
new_dat2$Sigma_E
#>           [,1]      [,2]
#> [1,] 0.8875505 0.1736201
#> [2,] 0.1736201 0.7429422
orig_dat$Sigma_E
#>           [,1]      [,2]
#> [1,] 0.8875505 0.1736201
#> [2,] 0.1736201 0.7429422

Note that the features that are kept constant are the environmental variance Sigma_E and the per-allele joint effects, beta_joint.

Changing Environmental Variance or Covariance

Although the default behavior is to keep Sigma_E constant, we can change this using the arguments new_env_var, new_h2, new_R_E and new_R_obs. These are options for both resample_sumstats and resample_inddata. The new_env_var and new_h2 arguments provide different ways of specifying the environmental variance so only one of these can be used at a time. new_h2 gives the heritability in the new population while new_env_var directly gives the environmental variance. The parameters new_R_E and new_R_obs are alternate ways of specifying environmental correlation. new_R_E directly specifies the enviromental correlation while new_R_obs gives the total trait correlation. Only one of these two parameters can be supplied.

new_R_E <- diag(2)
new_R_E[1,2] <- new_R_E[2,1] <- 0.4
new_dat3 <- resample_sumstats(dat = orig_dat, 
                              N = N, 
                              R_LD = list(my_ld2), 
                              af = af2,
                              new_env_var = c(0.9, 1.3), 
                              new_R_E = new_R_E)
#> Genetic variance in the new population differs from the genetic variance in the old population.
#> Note that the phenotype in the new population has a different variance from the phenotype in the old population.
#> I will keep the phenotype on the same scale as the original data, so effect sizes in the old and new object are comparable. If you would like to rescale the phenotype to have variance 1, use rescale_sumstats.
#> SNP effects provided for 12 SNPs and 2 traits.
new_dat3$Sigma_E
#>           [,1]      [,2]
#> [1,] 0.9000000 0.4326662
#> [2,] 0.4326662 1.3000000
cov2cor(new_dat3$Sigma_E)
#>      [,1] [,2]
#> [1,]  1.0  0.4
#> [2,]  0.4  1.0
new_dat3$h2
#> [1] 0.04621558 0.06941322

Note that specifying new_h2 will always result in exactly the heritability specified. This is in contrast with the h2 argument of sim_mv which gives the expected heritability.

new_dat4 <- resample_sumstats(dat = orig_dat, 
                              N = N, 
                              R_LD = list(my_ld2), 
                              af = af2,
                              new_h2 = c(0.15, 0.25), 
                              new_R_E = new_R_E)
#> Genetic variance in the new population differs from the genetic variance in the old population.
#> Note that the phenotype in the new population has a different variance from the phenotype in the old population.
#> I will keep the phenotype on the same scale as the original data, so effect sizes in the old and new object are comparable. If you would like to rescale the phenotype to have variance 1, use rescale_sumstats.
#> SNP effects provided for 12 SNPs and 2 traits.
new_dat4$h2
#> [1] 0.15 0.25
new_dat4$Sigma_E
#>           [,1]      [,2]
#> [1,] 0.2471203 0.1072480
#> [2,] 0.1072480 0.2909042

Rescaling Effect Size Units

If for any reason we want to change the units of effects in an object produced by sim_mv or resmaple_sumstats, we can use rescale_sumstats. For example, in the code below, we rescale the effects in orig_dat to be on the per-SD scale. Note that by doing this, we will delete the allele frequency information in the snp_info table. This is because allele frequency is irrelevant for per-SD scaled effects. To convert back to per-allele scale, we would need to supply the allele frequency.


orig_rescale1 <- rescale_sumstats(dat = orig_dat, 
                                  output_geno_scale = "sd")
orig_rescale1$beta_joint
#>               [,1]          [,2]
#>  [1,]  0.000000000  0.0000000000
#>  [2,] -0.097136125  0.0188220320
#>  [3,]  0.000000000  0.0000000000
#>  [4,]  0.146918495  0.0293836989
#>  [5,]  0.000000000  0.0000000000
#>  [6,]  0.000000000  0.0061492991
#>  [7,] -0.003777444 -0.0007554889
#>  [8,]  0.000000000 -0.5047783058
#>  [9,]  0.000000000  0.0000000000
#> [10,]  0.000000000  0.0000000000
#> [11,]  0.000000000  0.0000000000
#> [12,] -0.280286489 -0.0560572978
orig_rescale1$geno_scale
#> [1] "sd"
orig_dat$beta_joint
#>               [,1]         [,2]
#>  [1,]  0.000000000  0.000000000
#>  [2,] -0.149884295  0.029043026
#>  [3,]  0.000000000  0.000000000
#>  [4,]  0.231374881  0.046274976
#>  [5,]  0.000000000  0.000000000
#>  [6,]  0.000000000  0.009116327
#>  [7,] -0.005828723 -0.001165745
#>  [8,]  0.000000000 -0.728584727
#>  [9,]  0.000000000  0.000000000
#> [10,]  0.000000000  0.000000000
#> [11,]  0.000000000  0.000000000
#> [12,] -0.432491442 -0.086498288

## back to per-allele scale
orig_rescale2 <- rescale_sumstats(dat = orig_rescale1, 
                                  output_geno_scale = "allele", 
                                  af = orig_dat$snp_info$AF)
orig_rescale2$beta_joint
#>               [,1]         [,2]
#>  [1,]  0.000000000  0.000000000
#>  [2,] -0.149884295  0.029043026
#>  [3,]  0.000000000  0.000000000
#>  [4,]  0.231374881  0.046274976
#>  [5,]  0.000000000  0.000000000
#>  [6,]  0.000000000  0.009116327
#>  [7,] -0.005828723 -0.001165745
#>  [8,]  0.000000000 -0.728584727
#>  [9,]  0.000000000  0.000000000
#> [10,]  0.000000000  0.000000000
#> [11,]  0.000000000  0.000000000
#> [12,] -0.432491442 -0.086498288

We can also change the scale of the outcomes.

orig_rescale3 <- rescale_sumstats(dat = orig_dat, 
                                  output_geno_scale = "allele", 
                                  output_pheno_sd = c(1.4, 0.8))

There are a few effects of changing the phenotype scale. First the effect sizes are different. In this case the effect sizes for trait 1 have been multiplied by 1.4 and the effect sizes for trait 2 have been multiplied by 0.8. Second, the genetic and environmental covariance matrices have been scaled appropriately.

orig_rescale3$Sigma_G
#>            [,1]       [,2]
#> [1,] 0.22040097 0.02166381
#> [2,] 0.02166381 0.16451701
orig_rescale3$Sigma_E
#>           [,1]      [,2]
#> [1,] 1.7395990 0.1944545
#> [2,] 0.1944545 0.4754830
orig_rescale3$Sigma_G + orig_rescale3$Sigma_E
#>           [,1]      [,2]
#> [1,] 1.9600000 0.2161183
#> [2,] 0.2161183 0.6400000

Note that the heritability, orig_rescale3$h2 has not changed but is no longer equal to the diagonal of Sigma_G. The trait correlation in trait_corr is also the same. The final difference is in the total and direct trait effects matrix. In our case, trait 1 has been multiplied by 1.4 and trait 2 was multiplied by 0.8. On the original scale, a one unit increase in trait 1 caused a 0.2 unit increase in trait 2. So on the new scale, a 1 unit increase in trait 1 causes a 0.2 ⋅ (0.8/1.4) = 0.114 unit increase in trait 2.

Note that the causal relationship between the traits implies the following relationship between effect sizes:

with(orig_dat, all.equal(direct_SNP_effects_joint[,2] + total_trait_effects[1,2]*direct_SNP_effects_joint[,1], beta_joint[,2]))
#> [1] TRUE

This is the basis of methods like Mendelian randomization. We can see that after scaling, this relationship is still true.

with(orig_rescale3, all.equal(direct_SNP_effects_joint[,2] + total_trait_effects[1,2]*direct_SNP_effects_joint[,1], beta_joint[,2]))
#> [1] TRUE