
Usage

genZmuY(
  n,
  p,
  k_unclustered,
  cluster_size,
  n_clusters,
  sig_clusters,
  beta_latent,
  beta_unclustered,
  snr,
  sigma_eps_sq
)

Arguments

n

Integer or numeric; the number of observations to generate. (The generated X and Z will have n rows, and the generated y and mu will have length n.)

p

Integer or numeric; the number of features to generate. The generated X will have p columns.

k_unclustered

Integer or numeric; the number of features in X, among those not generated from the n_clusters latent variables, that will have nonzero coefficients in the true model for y. (These are called "weak signal" features in the simulations from Faletto and Bien 2022.) The coefficients on these features are determined by beta_unclustered. Must be at least 1.

cluster_size

Integer or numeric; for each of the n_clusters latent variables, X will contain cluster_size noisy proxies that are correlated with the latent variable. Must be at least 2.

n_clusters

Integer or numeric; the number of latent variables to generate, each of which will be associated with an observed cluster in X. Must be at least 1. Default is 1.

sig_clusters

Integer or numeric; the number of generated latent features that will have nonzero coefficients in the true model for y (all of them will have coefficient beta_latent). Must be less than or equal to n_clusters. Default is 1.

beta_latent

Integer or numeric; the coefficient used for all sig_clusters latent variables that have nonzero coefficients in the true model for y. Can't equal 0. Default is 1.5.

beta_unclustered

Integer or numeric; the maximum coefficient in the model for y among the k_unclustered features in X not generated from the latent variables. The coefficients of the features will be beta_unclustered/sqrt(1:k_unclustered). Can't equal 0. Default is 1.
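As a small illustration of the formula above, the following sketch computes the weak signal coefficients for one choice of inputs. The values of beta_unclustered and k_unclustered and the name weak_coefs are chosen only for illustration.

```r
# Illustrative inputs (not defaults taken from the package).
beta_unclustered <- 1
k_unclustered <- 4

# Coefficients decay as 1 / sqrt(j) for j = 1, ..., k_unclustered,
# so the first weak signal feature gets the maximum, beta_unclustered.
weak_coefs <- beta_unclustered / sqrt(seq_len(k_unclustered))

print(round(weak_coefs, 3))
# 1.000 0.707 0.577 0.500
```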

snr

Integer or numeric; the signal-to-noise ratio of the response y. If sigma_eps_sq is not specified, the variance of the noise in y will be calculated using the formula sigma_eps_sq = sum(mu^2)/(n * snr). Only one of snr and sigma_eps_sq must be specified. Default is NA.
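The relationship between snr and the noise variance can be sketched in a few lines of base R. Here mu is a hypothetical stand-in vector rather than output from genZmuY, and drawing the noise as independent Gaussian with rnorm is an assumption made for this sketch.

```r
set.seed(1)

n <- 100
snr <- 2

# Hypothetical stand-in for the expected response; in practice this
# would be the mu element returned by the function.
mu <- rnorm(n)

# Noise variance implied by the requested signal-to-noise ratio.
sigma_eps_sq <- sum(mu^2) / (n * snr)

# Response: expected response plus noise with that variance
# (Gaussian noise is an assumption of this sketch).
y <- mu + rnorm(n, mean = 0, sd = sqrt(sigma_eps_sq))
```

With this definition, mean(mu^2) / sigma_eps_sq equals snr, so a larger snr yields a less noisy y.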

sigma_eps_sq

Integer or numeric; the variance on the noise added to y. Only one of snr and sigma_eps_sq must be specified. Default is NA.

Value

A list of the following elements.

Z

The latent features; either a numeric vector (if n_clusters is 1) or a numeric matrix (if n_clusters > 1). Note that (X, Z) is multivariate Gaussian.

mu

A length n numeric vector; the expected response given X, Z, and the true coefficient vector (equal to y minus the added noise).

y

A length n numeric vector; the response generated from X, the latent features from Z, and the coefficient vector, along with additive noise.

other_X

A numeric matrix of n observations from a multivariate normal distribution generated using the specified parameters, containing the weak signal features and the noise features that will eventually be in X. (The only missing features are the proxies for the latent features Z.)
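To show how the four returned elements relate to one another, here is a rough base R sketch that builds objects of the same shapes. It is not the package's implementation: drawing every feature as independent standard normal, placing the weak signal features in the first columns of other_X, and giving other_X p - n_clusters * cluster_size columns are all assumptions made for illustration.

```r
set.seed(1)

# Illustrative inputs (not package defaults).
n <- 50; p <- 12; k_unclustered <- 3; cluster_size <- 4
n_clusters <- 2; sig_clusters <- 1
beta_latent <- 1.5; beta_unclustered <- 1; snr <- 3

# Latent features: one column per cluster (assumed independent N(0, 1)).
Z <- matrix(rnorm(n * n_clusters), nrow = n, ncol = n_clusters)

# Weak signal and noise features: all columns of X except the proxies
# (column count is an assumption of this sketch).
n_other <- p - n_clusters * cluster_size
other_X <- matrix(rnorm(n * n_other), nrow = n, ncol = n_other)

# Expected response: beta_latent on the first sig_clusters latent features,
# beta_unclustered / sqrt(1:k_unclustered) on the weak signal features.
beta_weak <- beta_unclustered / sqrt(seq_len(k_unclustered))
mu <- as.numeric(
  Z[, seq_len(sig_clusters), drop = FALSE] %*% rep(beta_latent, sig_clusters) +
    other_X[, seq_len(k_unclustered), drop = FALSE] %*% beta_weak
)

# Response: expected response plus noise scaled to the requested snr.
sigma_eps_sq <- sum(mu^2) / (n * snr)
y <- mu + rnorm(n, sd = sqrt(sigma_eps_sq))

out <- list(Z = Z, mu = mu, y = y, other_X = other_X)
```

In this sketch, out$y - out$mu recovers the added noise, matching the description of mu as y minus the added noise.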

Generates Z, weak signal features in X, noise features in X, mu, and y from provided parameters

Faletto, G., & Bien, J. (2022). Cluster Stability Selection. arXiv preprint arXiv:2201.00494. https://arxiv.org/abs/2201.00494.

Gregory Faletto, Jacob Bien