navmix vignette

#Installation

library(devtools)
install_github("aj-grant/navmix", build_vignettes = TRUE)
library(navmix)

#Overview

The navmix package implements a method that clusters observed vectors according to their direction from the origin. It fits a mixture of von Mises–Fisher (vMF) distributions, together with a noise cluster that provides robustness to outliers. It also includes an automatic method for choosing the number of clusters using the BIC.
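For reference, the model described above can be written in the following standard form. The vMF density is the textbook expression; the mixture notation (the weights π_k and the noise density f_0) is an assumption introduced here for illustration, and this vignette does not specify the exact form of the noise density.

```latex
\begin{align*}
f(x;\mu,\kappa) &= C_m(\kappa)\,\exp\!\left(\kappa\,\mu^{\top}x\right),
\qquad
C_m(\kappa) = \frac{\kappa^{m/2-1}}{(2\pi)^{m/2}\, I_{m/2-1}(\kappa)},
\qquad \lVert x\rVert = \lVert\mu\rVert = 1,\ \kappa > 0, \\
p(x) &= \pi_0\, f_0(x) + \sum_{k=1}^{K} \pi_k\, f(x;\mu_k,\kappa_k),
\qquad \sum_{k=0}^{K}\pi_k = 1 .
\end{align*}
```

Here I_ν is the modified Bessel function of the first kind, μ_k is the mean direction of cluster k and κ_k is its concentration. A larger κ_k means that observations are more tightly concentrated around μ_k.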

The input is an n × m matrix, where m > 1 and n > m. Each of the n rows represents an m-dimensional observation. Because the method clusters based on direction, the first step in the algorithm is to normalise these observed vectors to have norm 1. Optional parameters are:

K, the maximum number of clusters to fit (default = 10);
select_K, which determines whether to fit up to K clusters and output the fitted model with the maximum BIC (default), or whether to fit exactly K clusters;
common_kappa, which determines whether to force the concentration parameters (κ) of all clusters to be equal (default = FALSE);
pj_ini, which sets the initial proportion of observations assigned to the noise cluster (default = 0.05). Note that if pj_ini is set to 0, no noise cluster will be fitted.

The remaining parameters control the convergence criteria and the plots.
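The normalisation step described above can be sketched in base R. This is only an illustration on simulated data: the names `X`, `row_norms` and `X_unit` are hypothetical, and navmix performs the normalisation itself, so the input does not need to be normalised beforehand. The guarded call at the end uses only the optional parameters listed above, with illustrative values, and runs only if the package is installed.

```r
# Toy data: n = 50 observations in m = 3 dimensions (simulated, illustration only).
set.seed(1)
X <- matrix(rnorm(50 * 3), nrow = 50, ncol = 3)

# Scale each row to unit Euclidean norm, so that only its direction remains.
# Dividing a matrix by a vector of length nrow(X) recycles down the columns,
# so entry [i, j] is divided by row_norms[i].
row_norms <- sqrt(rowSums(X^2))
X_unit <- X / row_norms

# Every row now has norm 1 (up to floating-point error).
stopifnot(all(abs(sqrt(rowSums(X_unit^2)) - 1) < 1e-12))

# Illustrative call with some of the optional parameters (values are arbitrary).
if (requireNamespace("navmix", quietly = TRUE)) {
  fit <- navmix::navmix(X, K = 5, common_kappa = TRUE, pj_ini = 0.05)
}
```

Two rows that point in the same direction but differ in length become identical after this step, which is why the clustering depends on direction only.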

One motivation for this clustering technique is the analysis of genetic association data. In this setting, the inputs are estimates of the association between n genetic variants and m different traits. These estimated associations may, for example, be obtained from GWAS summary statistics. The following section illustrates how to cluster genetic association data using navmix.

#Clustering genetic association data

Consider the case where we wish to cluster n genetic variants according to their associations with m traits. Let B be an n × m matrix containing the estimated associations (β coefficients) and let S be an n × m matrix containing their standard errors. The recommended input is the standardised associations given by B / S.

B_std = B/S
cluster_out = navmix(B_std)
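As a sketch of what the division does: `/` is elementwise in R, so each estimate is divided by its own standard error. The values below are simulated and the variant and trait names are hypothetical; in practice B and S would come from real summary statistics.

```r
# Simulated association estimates and standard errors for 6 variants and 3 traits.
set.seed(42)
n <- 6
m <- 3
labels <- list(paste0("rs", 1:n), c("trait1", "trait2", "trait3"))
B <- matrix(rnorm(n * m, sd = 0.05), n, m, dimnames = labels)
S <- matrix(runif(n * m, min = 0.01, max = 0.03), n, m, dimnames = labels)

# B and S must have the same shape, and standard errors must be positive.
stopifnot(identical(dim(B), dim(S)), all(S > 0))

# Elementwise division: B_std[i, j] is the estimate for variant i and trait j
# divided by its own standard error.
B_std <- B / S
```

The row and column names of B carry over to `B_std`, which can help when matching clusters back to variants and traits.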

An alternative way to standardise the associations is to include estimates of the correlation between traits. This may improve the clustering if the traits are highly correlated and the genetic associations are estimated in the same, or overlapping, sample(s). In this case, the full covariance matrices of the genetic variant-trait associations can be estimated from the standard errors and trait correlation estimates. Let R be an m × m matrix with (i, j)th entry the estimated correlation between traits i and j. The standardised associations can be obtained using the function row_standardise.

B_std = navmix::row_standardise(B, S, R)
cluster_out = navmix(B_std)
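To make the idea concrete, the sketch below builds the covariance matrix for each variant from its standard errors and the trait correlations, then uses it to standardise that variant's associations. This is an assumption about one reasonable way to do it, not a description of how `row_standardise` is implemented; the helper `whiten_rows` and the simulated inputs are hypothetical, and the help page of `row_standardise` should be consulted for the actual method.

```r
# Hypothetical helper (not part of navmix): standardise each row of B using
# the covariance matrix Sigma_i = diag(S[i, ]) %*% R %*% diag(S[i, ]).
whiten_rows <- function(B, S, R) {
  m <- ncol(B)
  stopifnot(identical(dim(B), dim(S)), all(dim(R) == m))
  out <- B
  for (i in seq_len(nrow(B))) {
    D_i <- diag(S[i, ], nrow = m)
    Sigma_i <- D_i %*% R %*% D_i
    # Symmetric inverse square root of Sigma_i via its eigendecomposition.
    eig <- eigen(Sigma_i, symmetric = TRUE)
    inv_sqrt <- eig$vectors %*% diag(1 / sqrt(eig$values), nrow = m) %*% t(eig$vectors)
    out[i, ] <- inv_sqrt %*% B[i, ]
  }
  out
}

# Simulated inputs: 4 variants, 3 traits, and a positive definite correlation matrix.
set.seed(7)
B <- matrix(rnorm(12, sd = 0.05), 4, 3)
S <- matrix(runif(12, min = 0.01, max = 0.03), 4, 3)
R <- matrix(c(1.0, 0.6, 0.2,
              0.6, 1.0, 0.3,
              0.2, 0.3, 1.0), 3, 3)
B_white <- whiten_rows(B, S, R)

# With uncorrelated traits (R = identity) this reduces to B / S.
stopifnot(isTRUE(all.equal(whiten_rows(B, S, diag(3)), B / S)))
```

The final check shows how this approach relates to the simpler one: when the traits are uncorrelated, the two standardisations coincide.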

Finally, the unstandardised associations may be used, in which case the input matrix is simply B.

cluster_out = navmix(B)

#Plots

There are four plot options:

Heat map of the proportional associations for each observation with each trait;
Heat map of the means of the fitted vMF distributions;
Parallel plot of the means of the fitted vMF distributions;
Radial plot of the means of the fitted vMF distributions.

Note that the heat maps and parallel plot are output as ggplot objects, whereas the radial plot is not.

##Heat map of the proportional associations

This plot is produced if the optional parameters plot and plot_heat are set to TRUE (both are TRUE by default). The plotted values are the normalised observations (representing, for example, the proportional associations of the genetic variants with each trait). The traits are re-ordered so that those that are more alike with respect to their associations with the observations are closer together in the plot. This re-ordering can be turned off by setting reorder_traits = FALSE.

##Plots of the means of the fitted vMF distributions

The mean vectors of the fitted vMF distributions represent observations at the centre of each cluster. If the optional parameter plot is set to TRUE, a heat map of the mean vectors is produced if plot_heat_mu = TRUE (default = FALSE), a parallel plot of the mean vectors is produced if plot_parallel = TRUE (default = TRUE), and a radial plot is produced if plot_radial = TRUE (default = FALSE). The parameter plot_radial_options is a list with optional parameters: plot_radial_separate determines whether the mean vectors are plotted on separate plots instead of the same plot (default = FALSE); radial_legend_pos determines the position of the legend in the plot; and radial_separate_col determines how many columns are used to arrange the radial plots when the mean vectors are plotted separately (default = 2).
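The plot options described above might be combined as in the following sketch. The argument names are those given in this vignette, but the particular values, the simulated matrix `X` and the use of `do.call` are illustrative assumptions. The call is guarded so that it runs only when navmix is installed.

```r
# Plot settings collected in a list (values chosen for illustration only).
plot_args <- list(
  plot = TRUE,
  plot_heat = FALSE,      # skip the heat map of the normalised observations
  plot_heat_mu = TRUE,    # heat map of the fitted mean vectors
  plot_parallel = TRUE,   # parallel plot of the fitted mean vectors
  plot_radial = TRUE,     # radial plot of the fitted mean vectors
  plot_radial_options = list(plot_radial_separate = TRUE)
)

# Simulated input: 200 observations in 4 dimensions.
set.seed(3)
X <- matrix(rnorm(200 * 4), nrow = 200, ncol = 4)

if (requireNamespace("navmix", quietly = TRUE)) {
  cluster_out <- do.call(navmix::navmix, c(list(X), plot_args))
}
```

Keeping the settings in a list makes it easy to reuse the same plot configuration across several calls.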