In some applications, we may need to generate multiple data sets for
the same set of effect sizes. This would mimic performing multiple GWAS
on the same trait. There are two functions,
resample_sumstats
and resample_inddata
that
facilitate this. This vignette will demonstrate these functions as well
as discuss important considerations when the simulation involves
sampling GWAS for multiple ancestries for the same trait.
To get started, we will generate a small sim_mv
object.
For a review of that function, see the “Simulating Data” vignette. For
this vignette, we will keep things very simple by having only 12
variants with very large effect sizes. Of course, this is not realistic,
but it will let us see exactly what is going on. In our example, we will
have two traits. The first trait has an effect of 0.2 on the second
trait. Recalling that sim_mv
always sets the variance of
each trait to 1, this means, that trait 1 explains 4% (0.22 = 0.04) of the variance of
trait 2. We will also use an LD pattern. Here we just specify the LD
pattern for five variants. This will be repeated 2.2 times to cover 12
variants. We will specify N
so that there is no sample
overlap.
set.seed(1000)
my_ld1 <- matrix(c( 1.00, 0.60, 0.40, -0.1, -0.07,
0.60, 1.00, 0.60, -0.1, -0.07,
0.40, 0.60, 1.00, -0.1, -0.07,
-0.10, -0.10, -0.10, 1.0, 0.90,
-0.07, -0.07, -0.07, 0.9, 1.00), nrow = 5, byrow = T)
af1 <- c(0.35, 0.3, 0.4, 0.72, 0.75)
G <- matrix(c(0, 0.2, 0, 0), nrow = 2, byrow =TRUE) # matrix of causal effects
orig_dat <- sim_mv(N = c(10000, 20000),
J = 12,
h2 = c(0.2, 0.3),
pi = 0.4,
G = G,
R_LD = list(my_ld1),
af = af1,
est_s = TRUE)
#> SNP effects provided for 12 SNPs and 2 traits.
In the simplest scenario, we want to simulate performing a new GWAS or collecting new individual level data for the original set of traits in the original population, meaning that the LD pattern and allele frequencies are the same.
To resample summary statistics, we can use
resample_sumstats
. Here, we will regenerate data assuming
that samples for the new GWAS are totally overlapping.
N1 <- matrix(50000, nrow = 2, ncol = 2)
new_dat1 <- resample_sumstats(dat = orig_dat,
N = N1,
R_LD = list(my_ld1),
af = af1,
est_s = TRUE)
#> I will assume that the environmental variance is the same in the old and new population.
#> I will assume that environmental correlation is the same in the old and new population. Note that this could result in different overall trait correlations.
#> SNP effects provided for 12 SNPs and 2 traits.
The new_dat1
object is another object of class
sim_mv
. The new and old data have the same effect sizes so
beta_joint
, beta_marg
,
direct_SNP_effects_joint
, and
direct_SNP_effects_marg
are all the same between the two
objects. They also have the same causal effect structure so the trait
effect matrices direct_trait_effects
and
total_trait_effects
are the same. However, the summary
statistic matrices beta_hat
and s_estimate
are
different due to random sampling variation. The true standard errors,
se_beta_hat
are also different because the sample size is
different. Finally, since there is sample overlap in the samples for
new_dat1
but not in orig_dat
, the
R
matrices which give the correlation of effect sizes
across traits are different.
all.equal(orig_dat$beta_joint, new_dat1$beta_joint)
#> [1] TRUE
all.equal(orig_dat$beta_marg, new_dat1$beta_marg)
#> [1] TRUE
head(orig_dat$beta_hat)
#> [,1] [,2]
#> [1,] -0.1080910136 -0.009750299
#> [2,] -0.1571366506 0.013286133
#> [3,] -0.1125931870 0.004225466
#> [4,] 0.2322018052 0.020554680
#> [5,] 0.2105222770 0.019324228
#> [6,] 0.0002105625 -0.299937884
head(new_dat1$beta_hat)
#> [,1] [,2]
#> [1,] -0.106418649 0.019449404
#> [2,] -0.169053933 0.029430270
#> [3,] -0.101005542 0.009112222
#> [4,] 0.249366373 0.050046188
#> [5,] 0.233752764 0.051418782
#> [6,] -0.002706624 -0.291194649
head(orig_dat$se_beta_hat)
#> [,1] [,2]
#> [1,] 0.01482499 0.01048285
#> [2,] 0.01543033 0.01091089
#> [3,] 0.01443376 0.01020621
#> [4,] 0.01574852 0.01113589
#> [5,] 0.01632993 0.01154701
#> [6,] 0.01482499 0.01048285
head(new_dat1$se_beta_hat)
#> [,1] [,2]
#> [1,] 0.006629935 0.006629935
#> [2,] 0.006900656 0.006900656
#> [3,] 0.006454972 0.006454972
#> [4,] 0.007042952 0.007042952
#> [5,] 0.007302967 0.007302967
#> [6,] 0.006629935 0.006629935
head(orig_dat$s_estimate)
#> [,1] [,2]
#> [1,] 0.01459221 0.01049184
#> [2,] 0.01519329 0.01098407
#> [3,] 0.01429748 0.01022185
#> [4,] 0.01563252 0.01119631
#> [5,] 0.01623488 0.01164188
#> [6,] 0.01472800 0.01022575
head(new_dat1$s_estimate)
#> [,1] [,2]
#> [1,] 0.006608105 0.006621619
#> [2,] 0.006870459 0.006907517
#> [3,] 0.006450718 0.006463427
#> [4,] 0.006970491 0.007052360
#> [5,] 0.007221276 0.007289588
#> [6,] 0.006631894 0.006499691
We can also generate individual level data. For this we use the
resample_inddata
function. This function can be used to
produce three types of output:
resample_inddata
uses the hapsim
package to
generate individual level data. The sample size argument,
N
, in resample_inddata
can accept three types
of input: scalar, vector, or data frame. These formats are the same as
those used by sim_mv
, so if N
is a scalar or
vector, resample_inddata
will assume there is no sample
overlap. This means that if we use N = 10
with two traits,
we will get out genotype information for 20 individuals, of which half
have phenotype 1 information and half have phenotype 2 information. To
specify sample overlap, we can use the data frame format. As a reminder,
a sample size data frame should have columns named trait_1
,
… trait_[M]
and N
. The trait_[x]
columns will be interpreted as logicals and the N
column
should give the number of samples in each combination of studies.
If resample_inddata
is used to generate genotypes only,
we are not really “resampling” anything because there is no need to
input trait data. In this case, we only need to supply the total number
of individuals, the number of variants, and the LD pattern if desired.
Here we will generate data for 15 individuals with an LD pattern
matching the one used in orig_dat
. The returned data
includes the genotype matrix and a vector of population allele
frequencies.
genos_only <- resample_inddata(N = 15,
J = 12,
R_LD = list(my_ld1),
af = af1)
#> Generating genotype matrix only.
names(genos_only)
#> [1] "X" "af"
dim(genos_only$X)
#> [1] 15 12
length(genos_only$af)
#> [1] 12
genos_only$X
#> [,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9] [,10] [,11] [,12]
#> S1 0 0 0 2 2 1 1 1 2 2 1 0
#> S2 1 1 0 2 2 0 0 0 2 2 1 1
#> S3 0 0 0 2 2 1 0 2 2 2 1 1
#> S4 0 0 1 0 0 0 0 1 0 0 2 2
#> S5 2 1 1 2 2 1 0 0 0 0 0 0
#> S6 1 0 2 1 1 0 0 0 2 2 1 0
#> S7 0 0 2 2 2 1 0 0 1 1 2 2
#> S8 0 0 0 2 2 0 1 0 2 2 2 2
#> S9 1 0 0 2 2 0 0 0 2 2 2 2
#> S10 2 2 2 1 1 1 1 1 1 1 2 2
#> S11 1 0 1 2 2 2 2 2 1 1 1 1
#> S12 0 1 1 2 2 0 2 2 2 2 1 1
#> S13 0 0 1 1 1 2 2 2 1 1 1 1
#> S14 0 0 0 1 1 1 1 0 2 2 1 1
#> S15 0 1 0 2 2 1 1 2 2 2 2 2
To generate both genotypes and phenotypes, we need to include a
sim_mv
object that contains effect sizes. For our example,
we will use the data frame sample size format to indicate that three
individuals have both phenotypes measured and seven have only one or the
other measured. Note that we don’t need to include the J
argument because the sim_mv
object contains all the
necessary information. The vector of M
phenotypes for each
individual is simulated as Yi = YG, i + YE, i
where YG, i
is the vector of genetic components computed from the genotypes and
effect sizes stored in the original sim_mv
object and YE, i
is a vector of environmental components sampled from a normal
distribution with variance given by the Sigma_E
matrix in
the original sim_mv
object.
N <- data.frame("trait_1" = c(1, 1, 0), "trait_2" = c(0, 1, 1), "N" = c(3, 3, 4))
new_ind_dat1 <- resample_inddata(N = N,
dat = orig_dat,
R_LD = list(my_ld1),
af = af1,
calc_sumstats = FALSE)
#> Generating both genotypes and phenotypes.
#> SNP effects provided for 12 SNPs and 2 traits.
#> I will assume that the environmental variance is the same in the old and new population.
#> I will assume that environmental correlation is the same in the old and new population. Note that this could result in different overall trait correlations.
Genotype data are stored in new_ind_dat1$X
and phenotype
data are in new_ind_dat1$Y
.
names(new_ind_dat1)
#> [1] "X" "Y" "af" "Sigma_G" "Sigma_E"
#> [6] "pheno_sd" "h2" "trait_corr" "beta_joint"
new_ind_dat1$X
#> [,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9] [,10] [,11] [,12]
#> S1 0 0 1 1 1 0 0 0 1 1 1 2
#> S2 2 2 2 1 1 0 0 0 2 2 2 1
#> S3 0 0 1 1 1 1 1 1 0 0 0 0
#> S4 0 0 1 2 2 0 0 0 1 1 0 1
#> S5 0 0 0 2 2 1 0 0 2 2 1 1
#> S6 0 0 0 2 2 1 1 0 1 2 1 1
#> S7 1 1 1 1 1 2 1 1 2 2 2 1
#> S8 1 0 1 2 2 0 1 1 2 2 1 1
#> S9 1 1 2 1 1 2 2 2 0 0 2 2
#> S10 0 0 0 1 1 0 0 0 2 2 1 1
new_ind_dat1$Y
#> y_1 y_2
#> 1 -0.9660107 NA
#> 2 -1.6707661 NA
#> 3 -0.4127266 NA
#> 4 0.3558235 1.57133493
#> 5 0.5363123 0.05390283
#> 6 -1.2896112 0.12940470
#> 7 NA 0.48527806
#> 8 NA -2.00566078
#> 9 NA -2.01050960
#> 10 NA 0.97844193
The returned object also includes some additional information
including Sigma_G
, Sigma_E
, and
beta_joint
which have the same meaning as the objects with
the same name in a sim_mv
object. If we had set
calc_sumstats
to TRUE
, the object would also
include effect size and standard error estimates for the association
between each variant and each trait.
Finally, we can use resample_inddata
to calculate
phenotypes for a previously generated set of genotypes from a
sim_mv
object. To do this, we supply the genotypes matrix
to the genos
argument. If this is supplied, no genotypes
matrix will be returned. Even though the genotype matrix is supplied, we
also need to supply the R_LD
and af
arguments
for the population LD and allele frequencies. These are used to
correctly calculate the genetic variance-covariance matrix. Note that it
is important that the total number of individuals implied by the sample
size argument match the number of rows in the genotypes matrix. The
following call will generate an error because we supplied 15 genotypes
but the sample size data frame only includes 10 individuals.
phenos_only <- resample_inddata(N = N,
dat = orig_dat,
genos = genos_only$X,
R_LD = list(my_ld1),
af = af1,
calc_sumstats = FALSE)
#> Generating phenotypes only.
#> Error in check_matrix(genos, "genos", ntotal, J): Expected genos to have 10 rows, found 15
We can fix the poblem by only including the first 10 rowsof the genotype matrix.
phenos_only <- resample_inddata(N = N,
dat = orig_dat,
genos = genos_only$X[1:10,],
R_LD = list(my_ld1),
af = af1,
calc_sumstats = FALSE)
#> Generating phenotypes only.
#> SNP effects provided for 12 SNPs and 2 traits.
#> I will assume that the environmental variance is the same in the old and new population.
#> I will assume that environmental correlation is the same in the old and new population. Note that this could result in different overall trait correlations.
names(phenos_only)
#> [1] "Y" "af" "Sigma_G" "Sigma_E" "pheno_sd"
#> [6] "h2" "trait_corr" "beta_joint"
phenos_only$Y
#> y_1 y_2
#> 1 -0.6443696 NA
#> 2 0.7978988 NA
#> 3 -0.2142834 NA
#> 4 -0.2455013 -0.3309879
#> 5 -0.4448521 -0.9023457
#> 6 -0.1022008 -0.6312573
#> 7 NA -0.1918427
#> 8 NA 1.0820581
#> 9 NA -0.4664964
#> 10 NA -0.7820182
Both resample_sumstats
and resample_inddata
can be used as above with different LD and allele frequencies than used
to create the original object. However, there are some special
considerations about trait variance and scaling that we will discuss in
this section.
To understand resampled data that use different LD or allele
frequencies than the original data, it is important to understand the
units of effect sizes produced by sim_mv
.
sim_mv
always assumes that phenotype variance is equal to
1, so the interpretation of the effects in beta_joint
is
the change in phenotype in units of phenotype SD per change in genotype
in units of either allele or genotype SD. To see the genotype unit, we
can check the geno_scale
object. For example,
By default, sim_mv
will use the per-allele scale if
allele frequencies are available and otherwise use the per-SD scale. We
can also check the the phenotypes are scaled to unit variance in
pheno_sd
which will always be a vector of 1’s if the object
was produced by sim_mv
.
The resampling functions in GWASBrewer
assume that
effect sizes in the new population are the same as in the old population
on the genotype and phenotype scale given. If you are changing
populations, it is much more sensible to assume that effect sizes are
constant on the per-allele scale than on the per-genotype SD scale. This
means that if you want to regenerate data from a different population,
it is a good idea to start with input data that have
geno_scale
equal to “allele”.
One consequence of differing allele frequencies and LD structure is that the total genetic variance will be different in the new population than in the old. By default, the resampling functions will assume that the environmental variance is the same in the two populations. This means that the overall variance of the phenotype in the new population will probably not be 1. The resampling functions will not rescale the phenotype in the new population because this would mean that the phenotypes in the two populations had different units and were not comparable. However, you can rescale the output after the fact using if you would like the phenotypes to have unit variance.
We can see this in action by resampling summary statistics using different allele frequencies and LD pattern than used originally.
af2 <- rep(0.1, 5)
my_ld2 <- matrix(0.3, nrow = 5, ncol = 5)
diag(my_ld2) <- 1
new_dat2 <- resample_sumstats(dat = orig_dat,
N = N,
R_LD = list(my_ld2),
af = af2)
#> Genetic variance in the new population differs from the genetic variance in the old population.
#> I will assume that the environmental variance is the same in the old and new population.
#> I will assume that environmental correlation is the same in the old and new population. Note that this could result in different overall trait correlations.
#> Note that the phenotype in the new population has a different variance from the phenotype in the old population.
#> I will keep the phenotype on the same scale as the original data, so effect sizes in the old and new object are comparable. If you would like to rescale the phenotype to have variance 1, use rescale_sumstats.
#> SNP effects provided for 12 SNPs and 2 traits.
Notice that the function produces a message to let us know that the
phenotype has a different variance in the new population. We can check
by looking at pheno_sd
. The genetic covariance matrix,
heritability, and overall trait correlation will also be different.
However, the environmental covariance will be the same.
new_dat2$pheno_sd
#> [1] 0.9649663 0.9164662
orig_dat$pheno_sd
#> [1] 1 1
new_dat2$h2
#> [1] 0.04683348 0.11545049
orig_dat$h2
#> [1] 0.1124495 0.2570578
new_dat2$Sigma_G
#> [,1] [,2]
#> [1,] 0.043609461 0.008093447
#> [2,] 0.008093447 0.096968051
orig_dat$Sigma_G
#> [,1] [,2]
#> [1,] 0.11244947 0.01934268
#> [2,] 0.01934268 0.25705783
new_dat2$Sigma_E
#> [,1] [,2]
#> [1,] 0.8875505 0.1736201
#> [2,] 0.1736201 0.7429422
orig_dat$Sigma_E
#> [,1] [,2]
#> [1,] 0.8875505 0.1736201
#> [2,] 0.1736201 0.7429422
Note that the features that are kept constant are the environmental
variance Sigma_E
and the per-allele joint effects,
beta_joint
.
Although the default behavior is to keep Sigma_E
constant, we can change this using the arguments
new_env_var
, new_h2
, new_R_E
and
new_R_obs
. These are options for both
resample_sumstats
and resample_inddata
. The
new_env_var
and new_h2
arguments provide
different ways of specifying the environmental variance so only one of
these can be used at a time. new_h2
gives the heritability
in the new population while new_env_var
directly gives the
environmental variance. The parameters new_R_E
and
new_R_obs
are alternate ways of specifying environmental
correlation. new_R_E
directly specifies the enviromental
correlation while new_R_obs
gives the total trait
correlation. Only one of these two parameters can be supplied.
new_R_E <- diag(2)
new_R_E[1,2] <- new_R_E[2,1] <- 0.4
new_dat3 <- resample_sumstats(dat = orig_dat,
N = N,
R_LD = list(my_ld2),
af = af2,
new_env_var = c(0.9, 1.3),
new_R_E = new_R_E)
#> Genetic variance in the new population differs from the genetic variance in the old population.
#> Note that the phenotype in the new population has a different variance from the phenotype in the old population.
#> I will keep the phenotype on the same scale as the original data, so effect sizes in the old and new object are comparable. If you would like to rescale the phenotype to have variance 1, use rescale_sumstats.
#> SNP effects provided for 12 SNPs and 2 traits.
new_dat3$Sigma_E
#> [,1] [,2]
#> [1,] 0.9000000 0.4326662
#> [2,] 0.4326662 1.3000000
cov2cor(new_dat3$Sigma_E)
#> [,1] [,2]
#> [1,] 1.0 0.4
#> [2,] 0.4 1.0
new_dat3$h2
#> [1] 0.04621558 0.06941322
Note that specifying new_h2
will always result in
exactly the heritability specified. This is in contrast with the
h2
argument of sim_mv
which gives the expected
heritability.
new_dat4 <- resample_sumstats(dat = orig_dat,
N = N,
R_LD = list(my_ld2),
af = af2,
new_h2 = c(0.15, 0.25),
new_R_E = new_R_E)
#> Genetic variance in the new population differs from the genetic variance in the old population.
#> Note that the phenotype in the new population has a different variance from the phenotype in the old population.
#> I will keep the phenotype on the same scale as the original data, so effect sizes in the old and new object are comparable. If you would like to rescale the phenotype to have variance 1, use rescale_sumstats.
#> SNP effects provided for 12 SNPs and 2 traits.
new_dat4$h2
#> [1] 0.15 0.25
new_dat4$Sigma_E
#> [,1] [,2]
#> [1,] 0.2471203 0.1072480
#> [2,] 0.1072480 0.2909042
If for any reason we want to change the units of effects in an object
produced by sim_mv
or resmaple_sumstats
, we
can use rescale_sumstats
. For example, in the code below,
we rescale the effects in orig_dat
to be on the per-SD
scale. Note that by doing this, we will delete the allele frequency
information in the snp_info
table. This is because allele
frequency is irrelevant for per-SD scaled effects. To convert back to
per-allele scale, we would need to supply the allele frequency.
orig_rescale1 <- rescale_sumstats(dat = orig_dat,
output_geno_scale = "sd")
orig_rescale1$beta_joint
#> [,1] [,2]
#> [1,] 0.000000000 0.0000000000
#> [2,] -0.097136125 0.0188220320
#> [3,] 0.000000000 0.0000000000
#> [4,] 0.146918495 0.0293836989
#> [5,] 0.000000000 0.0000000000
#> [6,] 0.000000000 0.0061492991
#> [7,] -0.003777444 -0.0007554889
#> [8,] 0.000000000 -0.5047783058
#> [9,] 0.000000000 0.0000000000
#> [10,] 0.000000000 0.0000000000
#> [11,] 0.000000000 0.0000000000
#> [12,] -0.280286489 -0.0560572978
orig_rescale1$geno_scale
#> [1] "sd"
orig_dat$beta_joint
#> [,1] [,2]
#> [1,] 0.000000000 0.000000000
#> [2,] -0.149884295 0.029043026
#> [3,] 0.000000000 0.000000000
#> [4,] 0.231374881 0.046274976
#> [5,] 0.000000000 0.000000000
#> [6,] 0.000000000 0.009116327
#> [7,] -0.005828723 -0.001165745
#> [8,] 0.000000000 -0.728584727
#> [9,] 0.000000000 0.000000000
#> [10,] 0.000000000 0.000000000
#> [11,] 0.000000000 0.000000000
#> [12,] -0.432491442 -0.086498288
## back to per-allele scale
orig_rescale2 <- rescale_sumstats(dat = orig_rescale1,
output_geno_scale = "allele",
af = orig_dat$snp_info$AF)
orig_rescale2$beta_joint
#> [,1] [,2]
#> [1,] 0.000000000 0.000000000
#> [2,] -0.149884295 0.029043026
#> [3,] 0.000000000 0.000000000
#> [4,] 0.231374881 0.046274976
#> [5,] 0.000000000 0.000000000
#> [6,] 0.000000000 0.009116327
#> [7,] -0.005828723 -0.001165745
#> [8,] 0.000000000 -0.728584727
#> [9,] 0.000000000 0.000000000
#> [10,] 0.000000000 0.000000000
#> [11,] 0.000000000 0.000000000
#> [12,] -0.432491442 -0.086498288
We can also change the scale of the outcomes.
orig_rescale3 <- rescale_sumstats(dat = orig_dat,
output_geno_scale = "allele",
output_pheno_sd = c(1.4, 0.8))
There are a few effects of changing the phenotype scale. First the effect sizes are different. In this case the effect sizes for trait 1 have been multiplied by 1.4 and the effect sizes for trait 2 have been multiplied by 0.8. Second, the genetic and environmental covariance matrices have been scaled appropriately.
orig_rescale3$Sigma_G
#> [,1] [,2]
#> [1,] 0.22040097 0.02166381
#> [2,] 0.02166381 0.16451701
orig_rescale3$Sigma_E
#> [,1] [,2]
#> [1,] 1.7395990 0.1944545
#> [2,] 0.1944545 0.4754830
orig_rescale3$Sigma_G + orig_rescale3$Sigma_E
#> [,1] [,2]
#> [1,] 1.9600000 0.2161183
#> [2,] 0.2161183 0.6400000
Note that the heritability, orig_rescale3$h2
has not
changed but is no longer equal to the diagonal of Sigma_G
.
The trait correlation in trait_corr
is also the same. The
final difference is in the total and direct trait effects matrix. In our
case, trait 1 has been multiplied by 1.4 and trait 2 was multiplied by
0.8. On the original scale, a one unit increase in trait 1 caused a 0.2
unit increase in trait 2. So on the new scale, a 1 unit increase in
trait 1 causes a 0.2 ⋅ (0.8/1.4) = 0.114 unit increase in
trait 2.
Note that the causal relationship between the traits implies the following relationship between effect sizes:
with(orig_dat, all.equal(direct_SNP_effects_joint[,2] + total_trait_effects[1,2]*direct_SNP_effects_joint[,1], beta_joint[,2]))
#> [1] TRUE
This is the basis of methods like Mendelian randomization. We can see that after scaling, this relationship is still true.