| Title: | Structural equation modeling based on GWAS summary statistics |
|---|---|
| Description: | Later |
| Authors: | Andrew Grotzinger, Matthijs van der Zee, Mijke Rhemtulla, Hill Ip, Michel Nivard, Elliot Tucker-Drob |
| Maintainer: | Andrew Grotzinger <[email protected]> |
| License: | GPL-3.0 |
| Version: | 0.0.5 |
| Built: | 2026-06-02 18:50:45 UTC |
| Source: | https://github.com/GenomicSEM/GenomicSEM |
k' is the number of variables in the model fit' is the fit function of the regression model names' is a vector of variable names in the order you used
.rearrange(k, fit, names).rearrange(k, fit, names)
Function to expand the S and V matrices to include SNP effects for multivariate GWAS in GenomicSEM
addSNPs(covstruc, SNPs,SNPSE=FALSE,parallel=TRUE,cores=NULL,GC="standard", ...)addSNPs(covstruc, SNPs,SNPSE=FALSE,parallel=TRUE,cores=NULL,GC="standard", ...)
covstruc |
Output from Genomic SEM 'ldsc' function |
SNPs |
Summary statistics file created using the 'sumstats' function |
SNPSE |
Whether the user wants to provide a different standard error (SE) of the SNP variance than the package default. The default is to use .0005 to reflect the fact that the SNP SE is assumed to be population fixed. |
parallel |
addSNPs automatically uses mclapply to create the S and V matrices in parallel. Sometimes running in parallel can cause memory issues within the computing cores. If this is the case, the parallel argument can be set to FALSE, and addSNPs will create the S and V matrices serially. |
cores |
addSNPs automatically uses mclapply to create the S and V matrices in parallel. If the user does not provide an argument to the cores option, then addSNPs will automatically use one less than the total number of cores available. |
GC |
Level of Genomic Control (GC) you want the function to use. The default is 'standard' which adjusts the univariate GWAS standard errors by multiplying them by the square root of the univariate LDSC intercept. Additional options include 'conserv' which corrects standard errors using the univariate LDSC intercept, and 'none' which does not correct the standard errors. |
The function expands the S and V matrices to include SNP effects. As many S and V matrices will be created as there are rows in the summary statistics file (i.e., one S and V matrix per SNP). The function returns a list with 3 named entries:
V_Full |
variance covariance matrix of the parameter estimates in S that includes an individual SNP effect |
S_Full |
genetic covaraiance matrix including individual SNP effect |
RS |
A list containing relevant genetic information (e.g., rsID, basepair, A1/A2) to be appended to the output from other functions (e.g., userGWAS) |
Function to run a common factor model based on output from multivariable LDSC
commonfactor(covstruc,estimation="DWLS", ...)commonfactor(covstruc,estimation="DWLS", ...)
covstruc |
Output from the multivariable LDSC function of Genomic SEM |
estimation |
Options are either Diagonally Weighted Least Squares ("DWLS"; the default) or Maximum Likelihood ("ML") |
The function estimates a common factor model, along with model fit indices, using output from GenomicSEM LDSC. The function returns a list with 2 named entries
modelfit |
The model fit results (e.g., model chi-square, AIC, CFI) from the specified model. |
results |
Parameter estimates and sandwich corrected standard errors from the specified model. |
Function to obtain SNP effects on common factor along with index of SNP heterogeneity
commonfactorGWAS(covstruc=NULL,SNPs=NULL,estimation="DWLS",cores=NULL,toler=FALSE,SNPSE=FALSE,parallel=TRUE,GC="standard",MPI=FALSE,TWAS=FALSE,smooth_check=FALSE, ...)commonfactorGWAS(covstruc=NULL,SNPs=NULL,estimation="DWLS",cores=NULL,toler=FALSE,SNPSE=FALSE,parallel=TRUE,GC="standard",MPI=FALSE,TWAS=FALSE,smooth_check=FALSE, ...)
covstruc |
Output from Genomic SEM 'ldsc' function |
SNPs |
Summary statistics file created using the 'sumstats' function |
estimation |
The estimation method to be used when running the factor model. The options are Diagonally Weighted Least Squares ("DWLS", this is the default) or Maximum Likelihood ("ML") |
cores |
The number of cores to use on your computer for parallel processing. If no number is provided, the default is to use one less core then is available on your computer |
toler |
The tolerance to use for matrix inversion. |
SNPSE |
Whether the user wants to provide a different standard error (SE) of the SNP variance than the package default. The default is to use .0005 to reflect the fact that the SNP SE is assumed to be population fixed. |
parallel |
Whether the function should run using parallel or serial processing. Default = TRUE |
GC |
Level of Genomic Control (GC) you want the function to use. The default is 'standard' which adjusts the univariate GWAS standard errors by multiplying them by the square root of the univariate LDSC intercept. Additional options include 'conserv' which corrects standard errors using the univariate LDSC intercept, and 'none' which does not correct the standard errors. |
MPI |
Whether the function should use multi-node processing (i.e., MPI). Please note that this should only be used on a computing cluster on which the R package Rmpi is already installed. |
TWAS |
Whether the function is being used to estimate a multivariate TWAS using read_fusion output for the SNPs argument. |
smooth_check |
Whether the function should save the consequent largest Zstatistic difference between the pre and post-smooth matrices. |
The function outputs a series of SNP effects with their SEs and estimate of QSNP (the heterogeneity index). The output is a single object.
Function to take output from multivariable S-LDSC and estimate enrichment of model parameter for user specified model
enrich(s_covstruc, model = "",params,fix= "regressions",std.lv=FALSE,rm_flank=TRUE,tau=FALSE,base=TRUE,toler=NULL,fixparam=NULL, ...)enrich(s_covstruc, model = "",params,fix= "regressions",std.lv=FALSE,rm_flank=TRUE,tau=FALSE,base=TRUE,toler=NULL,fixparam=NULL, ...)
s_covstruc |
Output from the multivariable S-LDSC function of Genomic SEM (s_ldsc) |
model |
Model to be specified using lavaan notation |
params |
Parameters of interest to be examined for enrichment (e.g., factor variances). |
fix |
What components of the model should be fixed for follow-up enrichment models. Default = "regressions", which will fix all regression parameters. |
std.lv |
Optional argument to denote whether all latent variables are standardized using unit variance identification (default = FALSE) |
rm_flank |
Optional argument to denote whether flanking window annotations should automatically be removed from output (default = TRUE) |
tau |
Optional argument to denote whether the user wants to use the tau genetic covariance matrices, as opposed to the default zero-order matrices, for estimation of enrichment (default = FALSE) |
base |
Optional argument to denote whether the user wants to include the full model output from the genome-wide (i.e., baseline) matrix (default = TRUE) |
toler |
Optional argument to manually set tolerance for matrix inverison. |
fixparam |
Optional argument to manually fix paramters when estimating the model within annotations |
Function to take output from multivariable S-LDSC and estimate enrichment of model parameter for user specified model
Function to run HDL (https://github.com/zhenin/HDL) to compute the genetic covariance between a series of traits based on genome wide summary statistics obtained from GWAS. The results generate by this function are necessary and sufficient to facilitate the fit of structural equation models (or other multivariate models) to the genetic covariance matrix. HDL is more powerfull than LDSC but if the LD structure in the reference file mismatches the GWAS LD structure, LDSC seems to perfrom better, expescially for estiamtes of heritability. For medium samples (N > 50.000) with moderate SNP-h2 (snp h2 > 0.07) where the LD structure isnt similar we would recomend ldsc, especially for GWAS. If you have small GWAS ( N < 25.000) the extra power HDL provides is worth the downward bias in snp h2 estimates relative to ldsc.
hdl(traits,sample.prev,population.prev,LDpath,Nref,trait.names=NULL,method, ...)hdl(traits,sample.prev,population.prev,LDpath,Nref,trait.names=NULL,method, ...)
traits |
A vector of strings which point to munged files for trait you want to include in a Genomic SEM model. the HDL function works with standard munged files |
sample.prev |
A vector of sample prevalences for dichotomous traits and |
population.prev |
A vector of population prevalences for dichotomous traits and |
LDpath |
String which contains the path to the folder in which the LD matrices used in the analysis are located. Expects LD matirices formated as required by the original HDL software. |
Nref |
Sample size of the reference file, default is 335265 |
trait.names |
A character vector specifying how the traits should be named in the genetic covariance (S) matrix. These variable names can subsequently be used in later steps for model specification. If no value is provided, the function will automatically name the variables using the generic from of V1-VX. |
method |
sting, either "piecewise" which estimates the heritability or genetic covariance locally in chunks across the genome and then sums these estimates, or "jackknife" which uses a genoem wide estiamte and uses a jackknife estimator for the variance of the parameter. defaults to "piecewise" the original HDL implementation is equal to "jackknife" |
The function returns a list with 3 named entries
S |
estimated genetic covariance matrix |
V |
variance covariance matrix of the parameter estimates in S |
I |
matrix containing the "cross trait intercepts", or the error covariance between traits induced by overlap, in terms of subjects, between the GWASes on which the analyses are based |
Ning, Z., Pawitan, Y. & Shen, X. High-definition likelihood inference of genetic correlations across human complex traits. Nat Genet (2020).
Function to run LD score regression (https://github.com/bulik/ldsc) to compute the genetic covariance between a series of traits based on genome wide summary statistics obtained from GWAS. The results generate by this function are necessary and sufficient to facilitate the fit of structural equation models (or other multivariate models) to the genetic covariance matrix.
ldsc(traits,sample.prev,population.prev,ld,wld,trait.names=NULL, sep_weights = FALSE,chr=22,n.blocks=200,ldsc.log=NULL,stand=FALSE,select=FALSE, ...)ldsc(traits,sample.prev,population.prev,ld,wld,trait.names=NULL, sep_weights = FALSE,chr=22,n.blocks=200,ldsc.log=NULL,stand=FALSE,select=FALSE, ...)
traits |
A vector of strings which point to LDSC munged files for trait you want to include in a Genomic SEM model. |
sample.prev |
A vector of sample prevalences for dichotomous traits and |
population.prev |
A vector of population prevalences for dichotomous traits and |
ld |
String which contains the path to the folder in which the LD scores used in the analysis are located. Expects LD scores formated as required by the original LD score regression software. |
wld |
String which contains the path to the folder in which the LD score weights used in the analysis are located. Expects LD scores formated as required by the original LD score regression software. |
trait.names |
A character vector specifying how the traits should be named in the genetic covariance (S) matrix. These variable names can subsequently be used in later steps for model specification. If no value is provided, the function will automatically name the variables using the generic from of V1-VX. |
sep_weights |
Logical which indicates wheter the weights are different form the LD scores used for the regression, defaults to FALSE. |
chr |
number of chromosomes over which the LDSC weights are split, defalts to 22 (Human) but can be switched for other species |
n.blocks |
Number of blocks to use for the jacknive procedure which is used to estiamte entries in V, higher values will be optimal if you have a large number of variables and also slower, defaults to 200 |
ldsc.log |
How you want to name your .log file for ldsc. The default is NULL as the package will automatically name the log based in the file names unless a log file name is provided to this arugment. |
stand |
Whether you want the package to also output a genetic correlation and sampling correlation matrix. Default is FALSE. |
select |
Whether you want the package to estimate LDSC using only even or odd chromosomes by setting select to "ODD" and "EVEN" respectively. It can also be set to a set of numbers, such as c(1,3,10), to run ldsc on a specific chromosome or chromosomes. Default is FALSE, in which case LDSC is estimated using all chromosomes. |
chisq.max |
Maximum value of the squared Z statistics for a SNP that is considered in the LD-score regression. Default behaviour is to set to the maximum of 80 and N*0.001 |
The function returns a list with 5 named entries
S |
estimated genetic covariance matrix |
V |
variance covariance matrix of the parameter estimates in S |
I |
matrix containing the "cross trait intercepts", or the error covariance between traits induced by overlap, in terms of subjects, between the GWASes on which the analyses are based |
N |
a vector contsaining the sample size (for the genetic variances) and the geometric mean of sample sizes (i.e. sqrt(N1,N2)) between two samples for the covariances |
m |
number of SNPs used to compute the LD scores with. |
Given a set of parameter values from Genomic SEM models, calculate the extent to which groups differ on these parameters.
localSRMD(unconstrained, constrained, lhsvar, rhsvar, ...)localSRMD(unconstrained, constrained, lhsvar, rhsvar, ...)
unconstrained |
A vector of parameter values from an unconstrained structural equation model, where focal parameters are estimated freely in each group. |
constrained |
A vector of parameter values from a constrained structural equation model, where focal parameters are constrained to be equal across groups. |
lhsvar |
A list containing the variances, in each group, of the variables in the usermodel() results column lhs |
rhsvar |
A list containing the variances, in each group, of the variables in the usermodel() results column rhs |
The function returns the average standardized extent to which estimates from a constrained set of structural equation model parameters differ from those obtained when the same set of parameters are freely estimated.
Function to expand the S and V matrices to include multiple SNP effects in a single matrix, along with LD information across these SNPs.
multiSNP (covstruc, SNPs, LD, SNPSE = FALSE, SNPlist = NA, ...)multiSNP (covstruc, SNPs, LD, SNPSE = FALSE, SNPlist = NA, ...)
covstruc |
Output from Genomic SEM multivariable LDSC |
SNPs |
Summary statistics file created using the sumstats function |
LD |
Matrix of LD information across the SNPs. If only independent SNPs are being provided a matrix of 0s can be entered. Note that the function requires that A1 and A2 be included in the LD matrix column names (e.g., rs12345_A_T) |
SNPSE |
User provided SE of the SNP variance for entry in the V matrix. If no number is provided the package defaults to using .0005 to reflect a practically fixed population value taken from a reference panel |
SNPlist |
List of rsIDs if the user wishes to subset out a set of SNPs from a full set of summary statistics |
The function expands the S and V matrices to include multiple SNP effects. These matrices include LD information across the SNPs.
Function to process GWAS summary statistis files and prepair them for LD score regression
munge(files,hm3,trait.names=NULL,N,info.filter = .9,maf.filter=0.01, column.names=list(),parallel=FALSE,cores=NULL,overwrite=TRUE ...)munge(files,hm3,trait.names=NULL,N,info.filter = .9,maf.filter=0.01, column.names=list(),parallel=FALSE,cores=NULL,overwrite=TRUE ...)
files |
A vector of file names, files must be located in the working directory, or a path must be provided. |
hm3 |
A file of SNPs with A1, A2 and rsID used to allign alleles across traits. We suggest using an (UNZIPPED) file of HAPMAP3 SNPs with some basic cleaning applied (e.g., MHC region removed) that is supplied and created by the original LD score regression developers and available here: https://data.broadinstitute.org/alkesgroup/LDSCORE/w_hm3.snplist.bz2: |
trait.names |
A vector of trait names which will be used as names for the munged files |
N |
A vector of sample size |
info.filter |
Numeric value which is used as a lower bound for inputation quality (INFO) |
maf.filter |
Numeric value used as a lower bound for minor allel frequency |
column.names |
Optional list detailing which columns represent, SNP, MAF, etc. e.g. list(SNP=my_snp_column) |
parallel |
Indicates whether munge should process the summary statistics files in parallel or serial fashion. Default is TRUE, indicating that it will run in parallel. |
cores |
Indicates how many cores to use when running in parallel. The default is NULL, in which case munge will use 1 less than the total number of cores available in the local environment. |
overwrite |
Indicates whether existing .sumstats.gz files should be overwritten |
The function writes files of the ".sumstats" format, which can be used to estimate SNP heritability and genetic covariance using the ldsc() function. The function will also output a .log file that should be examined to ensure that column names are being interpret correctly.
paLDSC performs parallel analysis using LDSC-derived genetic (co)variance matrices to determine the number of non-spurious latent dimensions in genomic data. The function compares the eigenvalues from the LDSC matrix to those derived from null matrices generated under the multivariate LDSC sampling distribution.
paLDSC( S = S, V = V, r = NULL, p = NULL, save.pdf = F, diag = F, fa = F, fm = NULL, nfactors = NULL )paLDSC( S = S, V = V, r = NULL, p = NULL, save.pdf = F, diag = F, fa = F, fm = NULL, nfactors = NULL )
S |
A genetic correlation matrix ( |
V |
The corresponding multivariate sampling covariance matrix from LDSC ( |
r |
Number of Monte Carlo replications to simulate null correlation matrices. Default is |
p |
Percentile threshold for significance (default = |
save.pdf |
Logical. Whether to save scree plots to a PDF file. Default is |
diag |
Logical. If |
fa |
Logical. If |
fm |
Factor extraction method used when |
nfactors |
Number of factors to extract in factor analysis if |
only.values |
Logical. If |
This method adapts Horn’s (1965) classic parallel analysis approach to LDSC-based genomic data. Eigenvalues from the observed LDSC-derived matrix are compared against distributions of eigenvalues simulated under the null model using the multivariate LDSC sampling distribution. The number of components to retain is determined by comparing each observed eigenvalue to the corresponding percentile threshold from the simulated distribution.
If diag = TRUE, diagonal sampling variances are used to construct null matrices (faster, but less conservative). If diag = FALSE, the full LDSC sampling covariance matrix V is used to simulate the null distribution.
Factor analysis mode (fa = TRUE) follows the logic of psych::fa.parallel, allowing users to inspect common factor solutions.
A list containing observed eigenvalues, simulated eigenvalue distributions, and the suggested number of factors/components to extract.
de la Fuente, J., & Tucker-Drob, E. M. (2023). *paLDSC: Parallel Analysis Based on Multivariate LDSC*.
https://rpubs.com/JaFuente/paLDSC
Horn, J. L. (1965). A rationale and test for the number of factors in factor analysis. *Psychometrika*, 30(2), 179–185.
## Not run: # Load LDSC output load("LDSC_output_paLDSC_example.RData") # Run paLDSC with 10 replications paLDSC(S = LDSCoutputfull$S_Stand, V = LDSCoutputfull$V_Stand, r = 10) ## End(Not run)## Not run: # Load LDSC output load("LDSC_output_paLDSC_example.RData") # Run paLDSC with 10 replications paLDSC(S = LDSCoutputfull$S_Stand, V = LDSCoutputfull$V_Stand, r = 10) ## End(Not run)
QTrait tests whether the genetic association between a latent factor (defined by a set of indicator traits) and an external trait can be fully explained by the common factor, or whether specific indicator traits show heterogeneity in this relationship.
QTrait( LDSCoutput, indicators, traits, mresid = 0.25, mresidthreshold = 0.1, lsrmr = 0.25, lsrmrthreshold = 0.1, save.plots = TRUE, stdout = TRUE )QTrait( LDSCoutput, indicators, traits, mresid = 0.25, mresidthreshold = 0.1, lsrmr = 0.25, lsrmrthreshold = 0.1, save.plots = TRUE, stdout = TRUE )
LDSCoutput |
An object from the |
indicators |
A character vector specifying the names of indicator traits that define the latent factor. Must match trait names in |
traits |
A character vector of external traits for which QTrait statistics and genetic correlations will be estimated. Must match trait names in |
mresid |
A proportion (default = 0.25) of the RMS genetic correlation used to flag outlier indicator traits. |
mresidthreshold |
An absolute threshold (default = 0.10) for residual genetic correlation to define meaningful outliers. |
lsrmr |
Proportion (default = 0.25) used to define meaningful lSRMR based on the RMS genetic correlation. |
lsrmrthreshold |
Absolute threshold (default = 0.10) to define meaningful lSRMR. |
save.plots |
Logical. If TRUE, saves scatterplots of trait–indicator associations against loadings. Default is TRUE. |
stdout |
Logical. If TRUE, plots use standardized output (correlations vs. standardized loadings). If FALSE, uses covariances vs. unstandardized loadings. Default is TRUE. |
Common Pathway Model Output includes:
rGF1Trait_CPM, SErGF1Trait_CPM, pvalrGF1Trait_CPM: Genetic correlation, standard error, and p-value.
QTrait_CPM, df_CPM, p_value_CPM, Qsignificant_CPM: Q statistic, degrees of freedom, and p-value.
lSRMR_CPM, lSRMR_above_threshold_CPM, heterogeneity_CPM: Local SRMR and heterogeneity status.
Follow-Up Model Output includes analogous statistics when outlier indicators are freed:
rGF1Trait_FUM, SErGF1Trait_FUM, pvalrGF1Trait_FUM
QTrait_FUM, df_FUM, p_value_FUM, Qsignificant_FUM
lSRMR_FUM, lSRMR_above_threshold_FUM, heterogeneity_FUM
Unconstrained_paths: Names of outlier indicator traits.
A data frame summarizing genetic correlations between traits and the latent factor, QTrait statistics for both common and follow-up models, heterogeneity flags, lSRMR values, and identified outlier indicators.
Javier de la Fuente
https://rpubs.com/JaFuente/QTrait
## Not run: # Load example LDSC output load("LDSC_G_factor_QTRAIT_tutorial.RData") # Define indicators of genetic g indicators <- c("Matrix", "Memory", "RT", "Symbol_Digit", "TMTB", "Tower", "VNR") # Define external correlate traits <- "EA" # Run QTrait qtrait_out <- QTrait( LDSCoutput = LDSC_G_factor_QTRAIT_tutorial, indicators = indicators, traits = traits, mresid = 0.25, mresidthreshold = 0.10, lsrmr = 0.25, lsrmrthreshold = 0.10, save.plots = TRUE, stdout = TRUE ) print(qtrait_out) ## End(Not run)## Not run: # Load example LDSC output load("LDSC_G_factor_QTRAIT_tutorial.RData") # Define indicators of genetic g indicators <- c("Matrix", "Memory", "RT", "Symbol_Digit", "TMTB", "Tower", "VNR") # Define external correlate traits <- "EA" # Run QTrait qtrait_out <- QTrait( LDSCoutput = LDSC_G_factor_QTRAIT_tutorial, indicators = indicators, traits = traits, mresid = 0.25, mresidthreshold = 0.10, lsrmr = 0.25, lsrmrthreshold = 0.10, save.plots = TRUE, stdout = TRUE ) print(qtrait_out) ## End(Not run)
Function to format TWAS summary statistics for T-SEM
read_fusion(files,trait.names=NULL,binary=NULL,N=NULL,perm=FALSE, ...)read_fusion(files,trait.names=NULL,binary=NULL,N=NULL,perm=FALSE, ...)
files |
List of FUSION files. |
trait.names |
What names to use when naming the beta and SE columns for each trait |
binary |
Vector specifying whether each trait is binary (TRUE) or continuous (FALSE) |
N |
The sample size to use when backing out the betas and SEs from FUSION Z-statistics. This should reflect the sum of effective sample size for binary traits |
perm |
Whether you want to use the permutation test p-values from FUSION. Default = FALSE |
The function combines and formats FUSION TWAS summary statistics for subsequent T-SEM analyses. The function returns a single data.frame.
‘rgmodel' uses LDSC-derived output from Genomic SEM’s multivariable LD Score regression ('ldsc()') to specify and estimate a saturated genetic correlation matrix using the usermodel function. The function takes an object from 'ldsc()' and returns an expanded list that includes the genetic correlation matrix (R) and its sampling covariance matrix (V_R).
rgmodel(LDSCoutput)rgmodel(LDSCoutput)
LDSCoutput |
A list output from 'ldsc()' containing genetic covariance matrices and related data. |
model |
A lavaan-style syntax string or list of character vectors specifying the structural equation model. |
std.lv |
Logical; whether to standardize latent variances. Default is TRUE. |
estimation |
Logical; whether to estimate parameters. Default is TRUE. |
sub |
Optional character vector to subset phenotypes in the model. |
... |
Additional arguments passed to 'usermodel()'. |
A list containing an updated LDSC object with the following elements:
Observed genetic covariance matrix (on liability scale for case/control designs).
Sampling covariance matrix in lavaan format.
Matrix of LDSC intercepts and cross-trait (bivariate) intercepts.
Sample sizes for heritabilities and for co-heritabilities.
Number of SNPs used to construct the LD score.
Sampling covariance matrix for standardized genetic covariances, if present in input.
Standardized genetic covariance matrix, if present in input.
Genetic correlation matrix.
Sampling covariance matrix of the genetic correlation matrix.
Output list from 'usermodel()' if estimation = TRUE, containing parameter estimates.
usermodel, and the full tutorial at https://rpubs.com/JaFuente/rgmodel
Function to run Stratified LD score regression.
s_ldsc(traits,sample.prev=NULL,population.prev=NULL,ld,wld,frq,trait.names=NULL,n.blocks=200,ldsc.log=NULL,exclude_cont=TRUE, ...)s_ldsc(traits,sample.prev=NULL,population.prev=NULL,ld,wld,frq,trait.names=NULL,n.blocks=200,ldsc.log=NULL,exclude_cont=TRUE, ...)
traits |
A vector of file names which point to LDSC munged files for trait you want to include. |
sample.prev |
A vector of sample prevalences for dichotomous traits and |
population.prev |
A vector of population prevalences for dichotomous traits and |
ld |
A folder (or folders) of partitioned LD scores used as the independent variable in S-LDSC. |
wld |
A folder of non-partitioned LD scores used as regression weights. |
frq |
A folder of allele frequency files. |
trait.names |
A character vector specifying how the traits should be named. These variable names can subsequently be used in later steps for model specification. |
n.blocks |
Number of blocks to use for the jacknive procedure which is used to estiamte entries in V, higher values will be optimal if you have a large number of variables and also slower, defaults to 200. |
ldsc.log |
What to name the .log file if you want to overrride default to name file based on file names used as input. |
exclude_cont |
Whether to exclude continuous annotations from S-LDSC estimation. |
The function returns a list with 9 named entries
S |
The zero-order genetic covariance matrices for each annotation. |
V |
The zero-order sampling covariance matrices for each annotation. |
S_Tau |
The tau matrices for each annotation. |
V_Tau |
The tau sampling covariance matrices for each annotation. |
I |
matrix containing the "cross trait intercepts", or the error covariance between traits induced by overlap, in terms of subjects, between the GWASes on which the analyses are based |
N |
a vector contsaining the sample size (for the genetic variances) and the geometric mean of sample sizes (i.e. sqrt(N1,N2)) between two samples for the covariances |
m |
number of SNPs used to compute the LD scores with. |
Prop |
The proportional size of each annotation relative to the annotation containing all SNPs. |
Select |
A data.frame that codes flanking window and continuous annotations as 2 and all other annotations as 1. This is used by the 'enrich' function to exclude the flanking window and continuous annotations from enrichment estimates. |
simLDSC simulates GWAS summary statistics across multiple phenotypes given a population genetic covariance structure, phenotypic correlations, sample sizes, and LDSC intercepts. Useful for testing multivariate LDSC pipelines or validating SEM-based genomic models.
simLDSC( covmat, N, seed, ld, rPheno = NULL, int = NULL, N_overlap = 0.99, r = 1, gzip_output = TRUE, parallel = FALSE, cores = NULL )simLDSC( covmat, N, seed, ld, rPheno = NULL, int = NULL, N_overlap = 0.99, r = 1, gzip_output = TRUE, parallel = FALSE, cores = NULL )
covmat |
A population genetic covariance matrix (Σ) or lavaan model syntax. Can be provided directly as a matrix or |
N |
A sample size matrix (or |
seed |
An integer seed for random number generation. Default is 1234. |
ld |
A path to the folder containing LD score files (e.g., |
rPheno |
A phenotypic correlation matrix ( |
int |
A numeric vector of LDSC intercepts (length = k), or a single numeric value if the same intercept is used for all traits. |
N_overlap |
A numeric value between 0 and 1 representing the proportion of sample overlap, only used if |
r |
Number of replications to simulate. Default is 1. |
gzip_output |
Logical. Whether to compress output summary statistics as |
parallel |
Logical. Whether to run simulations in parallel. Default is |
cores |
Integer number of cores to use if |
This function generates simulated GWAS summary statistics from a specified multivariate genetic model using LDSC assumptions. The population genetic covariance matrix can be specified either directly or via lavaan-style syntax.
The function creates k simulated summary statistics files per replication. For example, simulating 3 traits across 1 replication yields:
iter1.GWAS1.sumstats.gz
iter1.GWAS2.sumstats.gz
iter1.GWAS3.sumstats.gz
The function writes simulated summary statistics files to disk for each trait and replication. No values are returned to the R environment.
A complete tutorial and walkthrough is available at:
https://rpubs.com/JaFuente/simLDSC
## Not run: # Define model-based genetic covariance matrix mod <- "F1 =~ 1*Pheno1 + 0.50*Pheno2 + 0.50*Pheno3 F1 ~~ 0.10*F1 Pheno1 ~~ 0*Pheno1 Pheno2 ~~ 0*Pheno2 Pheno3 ~~ 0*Pheno3" # Alternatively, specify it directly Spop <- matrix(NA, 3, 3) diag(Spop) <- c(0.10, 0.025, 0.025) Spop[lower.tri(Spop)] <- c(0.05, 0.05, 0.025) Spop[upper.tri(Spop)] <- c(0.05, 0.05, 0.025) # Sample size matrix N <- matrix(0, 3, 3) diag(N) <- c(75000, 300000, 300000) # Phenotypic correlations rPheno <- matrix(0.5, 3, 3) diag(rPheno) <- 1 # LDSC intercepts int <- c(1.02, 1.01, 1.03) # Run simulation simLDSC(covmat = mod, N = N, rPheno = rPheno, int = int, ld = "eur_w_ld_chr/") ## End(Not run)## Not run: # Define model-based genetic covariance matrix mod <- "F1 =~ 1*Pheno1 + 0.50*Pheno2 + 0.50*Pheno3 F1 ~~ 0.10*F1 Pheno1 ~~ 0*Pheno1 Pheno2 ~~ 0*Pheno2 Pheno3 ~~ 0*Pheno3" # Alternatively, specify it directly Spop <- matrix(NA, 3, 3) diag(Spop) <- c(0.10, 0.025, 0.025) Spop[lower.tri(Spop)] <- c(0.05, 0.05, 0.025) Spop[upper.tri(Spop)] <- c(0.05, 0.05, 0.025) # Sample size matrix N <- matrix(0, 3, 3) diag(N) <- c(75000, 300000, 300000) # Phenotypic correlations rPheno <- matrix(0.5, 3, 3) diag(rPheno) <- 1 # LDSC intercepts int <- c(1.02, 1.01, 1.03) # Run simulation simLDSC(covmat = mod, N = N, rPheno = rPheno, int = int, ld = "eur_w_ld_chr/") ## End(Not run)
Function to process GWAS summary statistics files and prepare them for a GWAS in genomicSEM
sumstats(files,ref,trait.names=NULL,se.logit,OLS=NULL,linprob=NULL,N=NULL,betas=NULL,info.filter = .6,maf.filter=0.01, keep.indel=FALSE,parallel=FALSE,cores=NULL,ambig=FALSE,direct.filter=FALSE, ...)sumstats(files,ref,trait.names=NULL,se.logit,OLS=NULL,linprob=NULL,N=NULL,betas=NULL,info.filter = .6,maf.filter=0.01, keep.indel=FALSE,parallel=FALSE,cores=NULL,ambig=FALSE,direct.filter=FALSE, ...)
files |
a vector of file names, files must be located in the working directory, or a path must be provided. |
ref |
A reference file of SNPs to keep in your GWAS, one based on 1000 genomes phase 3 is provided. |
trait.names |
a vector of trait names which will be used as names for the munged files |
se.logit |
a logical vector indicating whether the standard errors in each set of summary statistics is on the logit scale |
OLS |
a logical vector indicating whether the GWAS was for a continuous trait and used OLS (or a LMM) |
linprob |
a logical vector indicating whether the GWAS is a binary outcome with only Z-statistics or was analyzed using a linear probability model i.e. a dichotomous trait using OLS (or a LMM) |
N |
A vector of total sample sizes for continuous traits and the sum of effective sample sizes for binary traits |
betas |
A vector of column names of betas for continuous traits that are known to have been standardized prior to running the GWAS |
N |
A vector of sample size |
info.filter |
Numeric value which is used as a lower bound for imputation quality (INFO) |
maf.filter |
Numeric value used as a lower bound for minor allele frequency |
keep.indel |
Indicates whether insertion-deletion mutations (indels) should be included in your summary statistics. The default is FALSE. |
parallel |
Indicates whether sumstats should process the summary statistics files in parallel or serial fashion. Default is TRUE, indicating that it will run in parallel. |
cores |
Indicates how many cores to use when running in parallel. The default is NULL, in which case sumstats will use 1 less than the total number of cores available in the local environment. |
ambig |
Indicates whether strand ambiguous SNPs should be removed from output. |
direct.filter |
Indicates whether SNPs that have missing information for more than half of contributing cohorts, as indicated by missing information in the direction column, should be removed. |
The function ensures the SNPs in each file are aligned to the same reference allele, it attempts to filter strand issues, it retains SNPs present in the reference file. The function can deal with GWAS of continous traits, dichotomous traits using logistic regression and even dichotomous traits using (misspecified) OLS regression or a mixed model. The function returns .log files that should be inspected to ensure that all column names were appropriately interpreted.
Function to obtain model estimates for a user-specified model across SNPs.
userGWAS(covstruc=NULL,SNPs=NULL,estimation="DWLS",model="",printwarn=TRUE,sub=FALSE,cores=NULL,toler=FALSE,SNPSE=FALSE, parallel=TRUE,GC="standard",MPI=FALSE,smooth_check=FALSE,TWAS=FALSE,std.lv=FALSE,fix_measurement=TRUE, Q_SNP=FALSE,analytic=FALSE,batch_size=100000,usermod=NULL, ...)userGWAS(covstruc=NULL,SNPs=NULL,estimation="DWLS",model="",printwarn=TRUE,sub=FALSE,cores=NULL,toler=FALSE,SNPSE=FALSE, parallel=TRUE,GC="standard",MPI=FALSE,smooth_check=FALSE,TWAS=FALSE,std.lv=FALSE,fix_measurement=TRUE, Q_SNP=FALSE,analytic=FALSE,batch_size=100000,usermod=NULL, ...)
covstruc |
Output from Genomic SEM 'ldsc' function |
SNPs |
Summary statistics file created using the 'sumstats' function |
estimation |
The estimation method to be used when running the factor model. The options are Diagonally Weighted Least Squares ("DWLS", this is the default) or Maximum Likelihood ("ML") |
model |
The user-specified model to use in model estimation using lavaan syntax. The SNP is referred to as 'SNP' in the model. |
printwarn |
Whether you want warnings and errors printed for each run. This can take up significant space across all SNPs, but the default is set to TRUE as these warnings may not be safe to ignore. |
sub |
Whether you want to only output a piece of the model results (e.g., F1 ~ SNP). The argument takes a vector, as multiple pieces of the model result can be output. |
cores |
Indicates how many cores to use when running in parallel. The default is NULL, in which case sumstats will use 1 less than the total number of cores available in the local environment. |
toler |
The tolerance to use for matrix inversion. |
SNPSE |
Whether the user wants to provide a different standard error (SE) of the SNP variance than the package default. The default is to use .0005 to reflect the fact that the SNP SE is assumed to be population fixed. |
parallel |
Whether the function should run using parallel or serial processing. Default = TRUE |
GC |
Level of Genomic Control (GC) you want the function to use. The default is 'standard' which adjusts the univariate GWAS standard errors by multiplying them by the square root of the univariate LDSC intercept. Additional options include 'conserv' which corrects standard errors using the univariate LDSC intercept, and 'none' which does not correct the standard errors. |
MPI |
Whether the function should use multi-node processing (i.e., MPI). Please note that this should only be used on a computing cluster on which the R package Rmpi is already installed. |
smooth_check |
Whether the function should save the consequent largest Z-statistic difference between the pre and post-smooth matrices. |
TWAS |
Whether the function is being used to estimate a multivariate TWAS using read_fusion output for the SNPs argument. |
std.lv |
Optional argument to denote whether all latent variables are standardized using unit variance identification (default = FALSE) |
fix_measurement |
Optional argument to denote whether the measurement model should be fixed across all SNPs (default = TRUE) |
Q_SNP |
Logical. Whether to compute the Q_SNP heterogeneity statistic for each SNP. Default is |
analytic |
Logical. If |
batch_size |
Integer. Number of SNPs processed per batch when |
usermod |
Optional. A pre-fitted no-SNP model results data frame (the |
The function outputs results from the multivariate GWAS. If the sub argument is used, it will output as many list objects as there are sub objects requested.
If the sub argument is FALSE (as is the package default), the function will output as many list objects as there are SNPs.
When analytic = TRUE, the function returns a single data frame with one row per SNP containing factor-specific betas, standard errors, Z-statistics, p-values, and omnibus Q statistics.
Function to run a user specified model based on output from multivariable LDSC
usermodel(covstruc,estimation="DWLS", model = "",std.lv=FALSE,imp_cov=FALSE,fix_resid=TRUE,toler=FALSE, Q_Factor=FALSE, ...)usermodel(covstruc,estimation="DWLS", model = "",std.lv=FALSE,imp_cov=FALSE,fix_resid=TRUE,toler=FALSE, Q_Factor=FALSE, ...)
covstruc |
Output from the multivariable LDSC function of Genomic SEM |
estimation |
Options are either Diagonally Weighted Least Squares ("DWLS"; the default) or Maximum Likelihood ("ML") |
model |
Model to be specified using lavaan notation |
std.lv |
Optional argument to denote whether all latent variables are standardized using unit variance identification (default = FALSE) |
imp_cov |
Optional argument to denote whether the user wants the model implied and residual covariance matrix included in the usermodel output (default = FALSE) |
fix_resid |
Optional argument to denote whether the user wants Genomic SEM to try troubleshooting a model that does not converge by fixing residual variances to be above 0 (default = TRUE) |
toler |
Optional argument to set lower tolerance for matrix inversion used to produce sandwich corrected standard errors. (default = FALSE) |
Q_Factor |
Optional argument to get a heterogeneity statistic for factor correlations. (default = FALSE) |
The function estimates a user-specified model, along with model fit indices, using output from GenomicSEM LDSC.
Function to automate writing model syntax based on EFA loadings. This is most likely to be useful when examining larger numbers of traits (e.g., > 10).
write.model(Loadings,S_LD,cutoff,fix_resid=TRUE,bifactor=FALSE,mustload=FALSE,common=FALSE, ...)write.model(Loadings,S_LD,cutoff,fix_resid=TRUE,bifactor=FALSE,mustload=FALSE,common=FALSE, ...)
Loadings |
The matrix of EFA loadings. Note that the number of columns in this matrix determines how many factors are specifeid in the model. |
S_LD |
The LDSC genetic covariance matrix |
cutoff |
The EFA standardized loadings cutoff to determine which traits should load on a factor |
fix_resid |
Whether to apply constraint on all variables to keep residual variances above .001. Default is TRUE. |
bifactor |
Whether to specify a bifactor model in which a general factor predicts all included traits and the remaining factors are specifided to be orthogonal of one another. |
mustload |
Whether all variables should load on at least one factor, even if they dont meet the threshold specified using the cutoff argument. |
common |
Whether to specify a common factor model. |
The function outputs model syntax that can be used to run the model using the usermodel function in Genomic SEM.