6 Other useful functions
The following are some other functions that are useful for specific tasks.
genClusteredData()
generates jointly Gaussian data sets that include clusters of features that are correlated with a latent feature (the latent features themselves are also provided).getLassoLambda()
uses a data set to choose a good value of lambda for subsamples of size n/2 (as in cluster stability selection) by cross-validation.getModelSize()
estimates the best size for a lasso model on a provided data set.printCssDf()
converts the output of thecss()
function (an object of class cssr) to a data.frame that is ready to be printed. We provide end-user access toprintCssDf()
because this data.frame itself may be useful as an object that summarizes the output fromcss()
.print.cssr()
is the print function for objects of class cssr.-
checkGenClusteredDataInputs()
checks that the inputs togenClusteredData()
are as expected.makeCovarianceMatrix()
generates a covariance matrix for a jointly Gaussian random variable with distribution matching the provided inputs.makeCoefficients()
generates a coefficient vector for the response according to the provided inputs.- Finally,
genMuXZSd()
draws a random Z (matrix of latent variables), X (observed matrix, containing, in general, features correlated with latent features, directly observed features that are associated with y, and noise features), conditional means mu, and observed responses y.
-
getSelectionPrototypes()
identifies the prototypes from selected clusters (the feature in each cluster that is most correlated with the response)
#' Generate randomly sampled data including noisy observations of latent
#' variables
#'
#' TODO(gregfaletto) change cluster_size into a vector of sizes (maybe also
#' deprecate n_clusters as an input, since this would be inferred by the length
#' of cluster_sizes?)
#' Generate a data set including latent features Z, observed features X (which
#' may include noisy or noiseless observations of the latent features in Z),
#' an observed response y which is a linear model of features from Z and X as
#' well as independent mean zero noise, and mu (the responses from y without
#' the added noise). Data is generated in the same way as in the simulations
#' from Faletto and Bien (2022).
#' @param n Integer or numeric; the number of observations to generate. (The
#' generated X and Z will have n rows, and the generated y and mu will have
#' length n.)
#' @param p Integer or numeric; the number of features to generate. The
#' generated X will have p columns.
#' @param k_unclustered Integer or numeric; the number of features in X that
#' will have nonzero coefficients in the true model for y among those features
#' not generated from the n_clusters latent variables (called "weak signal"
#' features in the simulations from Faletto and Bien 2022). The coefficients on
#' these features will be determined by beta_unclustered.
#' @param cluster_size Integer or numeric; for each of the n_clusters latent
#' variables, X will contain cluster_size noisy proxies that are correlated with
#' the latent variable.
#' @param n_clusters Integer or numeric; the number of latent variables to
#' generate, each of which will be associated with an observed cluster in X.
#' Must be at least 1. Default is 1.
#' @param sig_clusters Integer or numeric; the number of generated latent
#' features that will have nonzero coefficients in the true model for y (all of
#' them will have coefficient beta_latent). Must be less than or equal to
#' n_clusters. Default is 1.
#' @param rho Integer or numeric; the covariance of the proxies in each cluster
#' with the latent variable (and each other). Note that the correlation between
#' the features in the cluster will be rho/var. Can't equal 0. Default is 0.9.
#' @param var Integer or numeric; the variance of all of the observed features
#' in X (both the proxies for the latent variables and the k_unclustered other
#' features). Can't equal 0. Default is 1.
#' @param beta_latent Integer or numeric; the coefficient used for all
#' sig_clusters latent variables that have nonzero coefficients in the true
#' model for y. Can't equal 0. Default is 1.5.
#' @param beta_unclustered Integer or numeric; the maximum coefficient in the
#' model for y among the k_unclustered features in X not generated from the
#' latent variables. The coefficients of the features will be
#' beta_unclustered/sqrt(1:k_unclustered). Can't equal 0. Default is 1.
#' @param snr Integer or numeric; the signal-to-noise ratio of the response
#' y. If sigma_eps_sq is not specified, the variance of the noise in y will be
#' calculated using the formula sigma_eps_sq = sum(mu^2)/(n * snr). Only one of
#' snr and sigma_eps_sq must be specified. Default is NA.
#' @param sigma_eps_sq Integer or numeric; the variance on the noise added
#' to y. Only one of snr and sigma_eps_sq must be specified. Default is NA.
#' @return A list of the following elements. \item{X}{An n x p numeric matrix of
#' n observations from a p-dimensional multivariate normal distribution
#' generated using the specified parameters. The first n_clusters times
#' cluster_size features will be the clusters of features correlated with the
#' n_clusters latent variables. The next k_unclustered features will be the
#' "weak signal" features, and the remaining p - n_clusters*cluster_size -
#' k_unclustered features will be the unclustered noise features.} \item{y}{A
#' length n numeric vector; the response generated from X, the latent features
#' from Z, and the coefficient vector, along with additive noise.} \item{Z}{The
#' latent features; either a numeric vector (if n_clusters > 1) or a numeric
#' matrix (if n_clusters > 1). Note that (X, Z) is multivariate Gaussian.}
#' item{mu}{A length `n` numeric vector; the expected response given X, Z, and
#' the true coefficient vector (equal to y minus the added noise).}
#' @author Gregory Faletto, Jacob Bien
#' @references
<<faletto2022>>
#' @export
genClusteredData <- function(n, p, k_unclustered, cluster_size, n_clusters=1,
sig_clusters=1, rho=0.9, var=1, beta_latent=1.5, beta_unclustered=1,
snr=as.numeric(NA), sigma_eps_sq=as.numeric(NA)){
# Check inputs
checkGenClusteredDataInputs(p, k_unclustered, cluster_size, n_clusters,
sig_clusters, rho, var, beta_latent, beta_unclustered, snr,
sigma_eps_sq)
# Generate covariance matrix (latent features are mixed in matrix, so each
# cluster will be of size cluster_size + 1)
<- makeCovarianceMatrix(p=p + n_clusters, nblocks=n_clusters,
Sigma block_size=cluster_size + 1, rho=rho, var=var)
# Generate coefficients
# Note that beta has length p + sig_clusters
<- makeCoefficients(p=p + n_clusters, k_unblocked=k_unclustered,
coefs beta_low=beta_unclustered, beta_high=beta_latent, nblocks=n_clusters,
sig_blocks=sig_clusters, block_size=cluster_size + 1)
# Generate mu, X, z, sd, y
<- genMuXZSd(n=n, p=p, beta=coefs$beta, Sigma=Sigma,
gen_mu_x_z_sd_res blocked_dgp_vars=coefs$blocked_dgp_vars, latent_vars=coefs$latent_vars,
block_size=cluster_size, n_blocks=n_clusters, snr=snr,
sigma_eps_sq=sigma_eps_sq)
<- gen_mu_x_z_sd_res$mu
mu <- gen_mu_x_z_sd_res$sd
sd
<- mu + sd * stats::rnorm(n)
y
return(list(X=gen_mu_x_z_sd_res$X, y=y, Z=gen_mu_x_z_sd_res$z, mu=mu))
}
checkGenClusteredDataInputs()
## NEED TO START WITH ROXYGEN FORMATTING
#' TODO(gregfaletto) fix this description!!!
#' Check inputs to genClusteredData
#'
#' Generate a data set including latent features Z, observed features X (which
#' may include noisy or noiseless observations of the latent features in Z),
#' an observed response y which is a linear model of features from Z and X as
#' well as independent mean zero noise, and mu (the responses from y without
#' the added noise). Data is generated in the same way as in the simulations
#' from Faletto and Bien (2022).
#' @param p Integer or numeric; the number of features to generate. The
#' generated X will have p columns.
#' @param k_unclustered Integer or numeric; the number of features in X that
#' will have nonzero coefficients in the true model for y among those features
#' not generated from the n_clusters latent variables (called "weak signal"
#' features in the simulations from Faletto and Bien 2022). The coefficients on
#' these features will be determined by beta_unclustered.
#' @param cluster_size Integer or numeric; for each of the n_clusters latent
#' variables, X will contain cluster_size noisy proxies that are correlated with
#' the latent variable.
#' @param n_clusters Integer or numeric; the number of latent variables to
#' generate, each of which will be associated with an observed cluster in X.
#' Must be at least 1. Default is 1.
#' @param sig_clusters Integer or numeric; the number of generated latent
#' features that will have nonzero coefficients in the true model for y (all of
#' them will have coefficient beta_latent). Must be less than or equal to
#' n_clusters. Default is 1.
#' @param rho Integer or numeric; the covariance of the proxies in each cluster
#' with the latent variable (and each other). Note that the correlation between
#' the features in the cluster will be rho/var. Can't equal 0. Default is 0.9.
#' @param var Integer or numeric; the variance of all of the observed features
#' in X (both the proxies for the latent variables and the k_unclustered other
#' features). Can't equal 0. Default is 1.
#' @param beta_latent Integer or numeric; the coefficient used for all
#' sig_clusters latent variables that have nonzero coefficients in the true
#' model for y. Can't equal 0. Default is 1.5.
#' @param beta_unclustered Integer or numeric; the maximum coefficient in the
#' model for y among the k_unclustered features in X not generated from the
#' latent variables. The coefficients of the features will be
#' beta_unclustered/sqrt(1:k_unclustered). Can't equal 0. Default is 1.
#' @param snr Integer or numeric; the signal-to-noise ratio of the response
#' y. If sigma_eps_sq is not specified, the variance of the noise in y will be
#' calculated using the formula sigma_eps_sq = sum(mu^2)/(n * snr). Only one of
#' snr and sigma_eps_sq must be specified. Default is NA.
#' @param sigma_eps_sq Integer or numeric; the variance on the noise added
#' to y. Only one of snr and sigma_eps_sq must be specified. Default is NA.
#' @return A list of the following elements. \item{X}{An n x p numeric matrix of
#' n observations from a p-dimensional multivariate normal distribution
#' generated using the specified parameters. The first n_clusters times
#' cluster_size features will be the clusters of features correlated with the
#' n_clusters latent variables. The next k_unclustered features will be the
#' "weak signal" features, and the remaining p - n_clusters*cluster_size -
#' k_unclustered features will be the unclustered noise features.} \item{y}{A
#' length n numeric vector; the response generated from X, the latent features
#' from Z, and the coefficient vector, along with additive noise.} \item{Z}{The
#' latent features; either a numeric vector (if n_clusters > 1) or a numeric
#' matrix (if n_clusters > 1). Note that (X, Z) is multivariate Gaussian.}
#' item{mu}{A length `n` numeric vector; the expected response given X, Z, and
#' the true coefficient vector (equal to y minus the added noise).}
#' @author Gregory Faletto, Jacob Bien
checkGenClusteredDataInputs <- function(p, k_unclustered, cluster_size,
n_clusters, sig_clusters, rho, var, beta_latent, beta_unclustered, snr,
sigma_eps_sq){
stopifnot(is.numeric(sig_clusters) | is.integer(sig_clusters))
stopifnot(sig_clusters <= n_clusters)
stopifnot(sig_clusters >= 0)
stopifnot(sig_clusters == round(sig_clusters))
stopifnot(is.numeric(n_clusters) | is.integer(n_clusters))
stopifnot(n_clusters == round(n_clusters))
# TODO(gregfaletto): is it easy to remove the requirement that n_clusters is
# at least 1 (so that it's possible to generate data with no latent
# features)? If so, should only check that cluster_size >= 1 if n_clusters
# >= 1, and in makeCovarianceMatrix function only need block_size >= 1
# rather than 2.
stopifnot(n_clusters >= 1)
stopifnot(cluster_size >= 1)
stopifnot(abs(rho) <= abs(var))
stopifnot(rho != 0)
stopifnot(var > 0)
stopifnot(beta_latent != 0)
stopifnot(beta_unclustered != 0)
stopifnot(is.numeric(k_unclustered) | is.integer(k_unclustered))
stopifnot(k_unclustered >= 0)
stopifnot(k_unclustered == round(k_unclustered))
stopifnot(p >= n_clusters*cluster_size + k_unclustered)
# Same as make_sparse_blocked_linear_model_random, but ith coefficient
# of weak signal features is beta_unclustered/sqrt(i) in order to have
# a definitive ranking of weak signal features.
if(is.na(snr) & is.na(sigma_eps_sq)){
stop("Must specify one of snr or sigma_eps_sq")
}
if(!is.na(snr)){
stopifnot(snr > 0)
}if(!is.na(sigma_eps_sq)){
stopifnot(sigma_eps_sq > 0)
} }
Tests for checkGenClusteredDataInputs()
::test_that("checkGenClusteredDataInputs works", {
testthatset.seed(7612)
# Should get no error
checkGenClusteredDataInputs(p=19, k_unclustered=2, cluster_size=5, n_clusters=3,
sig_clusters=2, rho=.8, var=1.1, beta_latent=1.5,
beta_unclustered=-2, snr=NA, sigma_eps_sq=.5)
checkGenClusteredDataInputs(p=19, k_unclustered=2, cluster_size=5, n_clusters=3,
sig_clusters=2, rho=.8, var=1.1, beta_latent=1.5,
beta_unclustered=-2, snr=1, sigma_eps_sq=NA)
# sig_clusters
::expect_error(checkGenClusteredDataInputs(p=19, k_unclustered=2,
testthatcluster_size=5, n_clusters=3,
sig_clusters="2", rho=.8,
var=1.1, beta_latent=1.5,
beta_unclustered=-2, snr=1,
sigma_eps_sq=NA),
"is.numeric(sig_clusters) | is.integer(sig_clusters) is not TRUE",
fixed=TRUE)
::expect_error(checkGenClusteredDataInputs(p=19, k_unclustered=2,
testthatcluster_size=5, n_clusters=3,
sig_clusters=4, rho=.8,
var=1.1, beta_latent=1.5,
beta_unclustered=-2, snr=1,
sigma_eps_sq=NA),
"sig_clusters <= n_clusters is not TRUE", fixed=TRUE)
::expect_error(checkGenClusteredDataInputs(p=19, k_unclustered=2,
testthatcluster_size=5, n_clusters=3,
sig_clusters=-1, rho=.8,
var=1.1, beta_latent=1.5,
beta_unclustered=-2, snr=1,
sigma_eps_sq=NA),
"sig_clusters >= 0 is not TRUE", fixed=TRUE)
::expect_error(checkGenClusteredDataInputs(p=19, k_unclustered=2,
testthatcluster_size=5, n_clusters=3,
sig_clusters=.6, rho=.8,
var=1.1, beta_latent=1.5,
beta_unclustered=-2, snr=1,
sigma_eps_sq=NA),
"sig_clusters == round(sig_clusters) is not TRUE",
fixed=TRUE)
# n_clusters
::expect_error(checkGenClusteredDataInputs(p=19, k_unclustered=2,
testthatcluster_size=5,
n_clusters="3",
sig_clusters=2, rho=.8,
var=1.1, beta_latent=1.5,
beta_unclustered=-2, snr=1,
sigma_eps_sq=NA),
"is.numeric(n_clusters) | is.integer(n_clusters) is not TRUE",
fixed=TRUE)
::expect_error(checkGenClusteredDataInputs(p=19, k_unclustered=2,
testthatcluster_size=5, n_clusters=3.2,
sig_clusters=2, rho=.8,
var=1.1, beta_latent=1.5,
beta_unclustered=-2, snr=1,
sigma_eps_sq=NA),
"n_clusters == round(n_clusters) is not TRUE",
fixed=TRUE)
::expect_error(checkGenClusteredDataInputs(p=19, k_unclustered=2,
testthatcluster_size=5, n_clusters=0,
sig_clusters=0, rho=.8,
var=1.1, beta_latent=1.5,
beta_unclustered=-2, snr=1,
sigma_eps_sq=NA),
"n_clusters >= 1 is not TRUE", fixed=TRUE)
::expect_error(checkGenClusteredDataInputs(p=19, k_unclustered=2,
testthatcluster_size=.3, n_clusters=3,
sig_clusters=2, rho=.8,
var=1.1, beta_latent=1.5,
beta_unclustered=-2, snr=1,
sigma_eps_sq=NA),
"cluster_size >= 1 is not TRUE", fixed=TRUE)
::expect_error(checkGenClusteredDataInputs(p=16, k_unclustered=2,
testthatcluster_size=5, n_clusters=3,
sig_clusters=2, rho=.8,
var=1.1, beta_latent=1.5,
beta_unclustered=-2, snr=1,
sigma_eps_sq=NA),
"p >= n_clusters * cluster_size + k_unclustered is not TRUE",
fixed=TRUE)
::expect_error(checkGenClusteredDataInputs(p=19, k_unclustered=2,
testthatcluster_size=5, n_clusters=3,
sig_clusters=2, rho=8,
var=1.1, beta_latent=1.5,
beta_unclustered=-2, snr=1,
sigma_eps_sq=NA),
"abs(rho) <= abs(var) is not TRUE", fixed=TRUE)
::expect_error(checkGenClusteredDataInputs(p=19, k_unclustered=2,
testthatcluster_size=5, n_clusters=3,
sig_clusters=2, rho=0,
var=1.1, beta_latent=1.5,
beta_unclustered=-2, snr=1,
sigma_eps_sq=NA),
"rho != 0 is not TRUE", fixed=TRUE)
::expect_error(checkGenClusteredDataInputs(p=19, k_unclustered=2,
testthatcluster_size=5, n_clusters=3,
sig_clusters=2, rho=0.8,
var=-1.1, beta_latent=1.5,
beta_unclustered=-2, snr=1,
sigma_eps_sq=NA),
"var > 0 is not TRUE", fixed=TRUE)
::expect_error(checkGenClusteredDataInputs(p=19, k_unclustered=2,
testthatcluster_size=5, n_clusters=3,
sig_clusters=2, rho=0.8,
var=1.1, beta_latent=0,
beta_unclustered=-2, snr=1,
sigma_eps_sq=NA),
"beta_latent != 0 is not TRUE", fixed=TRUE)
::expect_error(checkGenClusteredDataInputs(p=19, k_unclustered=2,
testthatcluster_size=5, n_clusters=3,
sig_clusters=2, rho=0.8,
var=1.1, beta_latent=1.5,
beta_unclustered=0, snr=1,
sigma_eps_sq=NA),
"beta_unclustered != 0 is not TRUE", fixed=TRUE)
# k_unclustered
::expect_error(checkGenClusteredDataInputs(p=19, k_unclustered="2",
testthatcluster_size=5, n_clusters=3,
sig_clusters=2, rho=0.8,
var=1.1, beta_latent=1.5,
beta_unclustered=-2, snr=1,
sigma_eps_sq=NA),
"is.numeric(k_unclustered) | is.integer(k_unclustered) is not TRUE",
fixed=TRUE)
::expect_error(checkGenClusteredDataInputs(p=19, k_unclustered=-2,
testthatcluster_size=5, n_clusters=3,
sig_clusters=2, rho=0.8,
var=1.1, beta_latent=1.5,
beta_unclustered=-2, snr=1,
sigma_eps_sq=NA),
"k_unclustered >= 0 is not TRUE", fixed=TRUE)
::expect_error(checkGenClusteredDataInputs(p=19, k_unclustered=.2,
testthatcluster_size=5, n_clusters=3,
sig_clusters=2, rho=0.8,
var=1.1, beta_latent=1.5,
beta_unclustered=-2, snr=1,
sigma_eps_sq=NA),
"k_unclustered == round(k_unclustered) is not TRUE",
fixed=TRUE)
::expect_error(checkGenClusteredDataInputs(p=19, k_unclustered=2,
testthatcluster_size=5, n_clusters=3,
sig_clusters=2, rho=0.8,
var=1.1, beta_latent=1.5,
beta_unclustered=-2, snr=NA,
sigma_eps_sq=NA),
"Must specify one of snr or sigma_eps_sq", fixed=TRUE)
::expect_error(checkGenClusteredDataInputs(p=19, k_unclustered=2,
testthatcluster_size=5, n_clusters=3,
sig_clusters=2, rho=0.8,
var=1.1, beta_latent=1.5,
beta_unclustered=-2, snr=-.2,
sigma_eps_sq=NA),
"snr > 0 is not TRUE", fixed=TRUE)
::expect_error(checkGenClusteredDataInputs(p=19, k_unclustered=2,
testthatcluster_size=5, n_clusters=3,
sig_clusters=2, rho=0.8,
var=1.1, beta_latent=1.5,
beta_unclustered=-2, snr=NA,
sigma_eps_sq=-.3),
"sigma_eps_sq > 0 is not TRUE", fixed=TRUE)
})
## Test passed 😀
#' Generate covariance matrix for simulated clustered data
#'
#' @param p Integer or numeric; the total number of features in the covariance
#' matrix to be created, including latent features, the associated noisy proxies
#' with each latent feature, and other (weak signal and noise) features.
#' @param n_blocks Integer or numeric; the number of latent variables in the
#' data, each of which is associated with an observed cluster in X. Must be at
#' least 1.
#' @param block_size Integer or numeric; for each of the n_blocks latent
#' variables, the covariance matrix will include the original latent feature
#' plus block_size - 1 noisy proxies that are correlated with the latent
#' variable.
#' @param rho Integer or numeric; the covariance of the proxies in each cluster
#' with the latent variable (and each other). Note that the correlation between
#' the features in the cluster will be rho/var. rho cannot equal 0.
#' @param var Integer or numeric; the variance of all of the observed features
#' in X (both the proxies for the latent variables and the k_unclustered other
#' features). var cannot equal 0.
#' @return A `p` x `p` numeric matrix representing the covariance matrix for
#' the latent features, the associated proxies, and the remaining features. All
#' features not in a block will be independent from each other and the blocks
#' and have variance var.
#' @author Gregory Faletto, Jacob Bien
makeCovarianceMatrix <- function(p, nblocks, block_size, rho, var) {
# Check inputs
stopifnot(nblocks >= 1)
stopifnot(rho != 0)
stopifnot(var != 0)
stopifnot(abs(rho) <= abs(var))
stopifnot(block_size >= 2)
stopifnot(p >= nblocks*block_size)
# start with p x p identity matrix
<- var*diag(p)
Sigma
# create matrix with nblocks rows, each containing a vector of
# indices of highly correlated features
<- matrix(seq(nblocks*block_size), nrow=nblocks, byrow=TRUE)
block_feats
stopifnot(length(unique(block_feats)) == length(block_feats))
# add covariances of highly correlated features to sigma
for(i in 1:nblocks){
for(j in 1:(block_size - 1)){
for(k in (j+1):block_size){
<- block_feats[i, j]
feat_1 <- block_feats[i, k]
feat_2 <- rho
Sigma[feat_1, feat_2] <- rho
Sigma[feat_2, feat_1]
}
}
}stopifnot(is.numeric(Sigma))
stopifnot(is.matrix(Sigma))
stopifnot(nrow(Sigma) == p & ncol(Sigma) == p)
stopifnot(all(Sigma == t(Sigma)))
return(Sigma)
}
Tests for makeCovarianceMatrix()
::test_that("makeCovarianceMatrix works", {
testthatset.seed(7612)
<- makeCovarianceMatrix(p=8, nblocks=2, block_size=3, rho=0.8, var=2)
ret
::expect_true(is.numeric(ret))
testthat::expect_true(is.matrix(ret))
testthat::expect_equal(nrow(ret), 8)
testthat::expect_equal(ncol(ret), 8)
testthat::expect_true(all(ret == t(ret)))
testthat# Test on-diagonal and off-diagonal entries to see if they are as expected
# (only need to check top half of matrix; we already confirmed the matrix is
# symmetric)
::expect_true(all(diag(ret) == 2))
testthat::expect_equal(ret[1, 2], 0.8)
testthat::expect_equal(ret[1, 3], 0.8)
testthat::expect_equal(ret[2, 3], 0.8)
testthat::expect_equal(ret[4, 5], 0.8)
testthat::expect_equal(ret[4, 6], 0.8)
testthat::expect_equal(ret[5, 6], 0.8)
testthat::expect_true(all(ret[1:3, 4:8] == 0))
testthat::expect_true(all(ret[4:6, c(1:3, 7:8)] == 0))
testthat::expect_true(all(ret[7, c(1:6, 8)] == 0))
testthat::expect_true(all(ret[8, 1:7] == 0))
testthat
# Bad inputs
::expect_error(makeCovarianceMatrix(p=8, nblocks=2, block_size=3,
testthatrho=0.8, var=.2),
"abs(rho) <= abs(var) is not TRUE", fixed=TRUE)
::expect_error(makeCovarianceMatrix(p=8, nblocks=-2, block_size=3,
testthatrho=0.8, var=2),
"nblocks >= 1 is not TRUE", fixed=TRUE)
::expect_error(makeCovarianceMatrix(p=8, nblocks=2, block_size=3,
testthatrho=0, var=2),
"rho != 0 is not TRUE", fixed=TRUE)
::expect_error(makeCovarianceMatrix(p=8, nblocks=2, block_size=3,
testthatrho=0.9, var=0),
"var != 0 is not TRUE", fixed=TRUE)
::expect_error(makeCovarianceMatrix(p=5, nblocks=2, block_size=3,
testthatrho=0.8, var=2),
"p >= nblocks * block_size is not TRUE",
fixed=TRUE)
})
## Test passed 😀
#' Generated coefficients for y in latent variable model
#'
#' @param p Integer or numeric; the number of features that will be observed in
#' x plus the number of latent variables (each corresponding to a cluster).
#' @param k_unblocked Integer or numeric; the number of features in X that
#' will have nonzero coefficients in the true model for y among those features
#' not generated from the n_clusters latent variables (called "weak signal"
#' features in the simulations from Faletto and Bien 2022). The coefficients on
#' these features will be determined by beta_low.
#' @param beta_low Integer or numeric; the maximum coefficient in the
#' model for y among the k_unblocked features in X not generated from the
#' latent variables. The coefficients of the features will be
#' beta_low/sqrt(1:k_unblocked).
#' @param beta_high Integer or numeric; the coefficient used for all
#' sig_blocks latent variables that have nonzero coefficients in the true
#' model for y.
#' @param nblocks Integer or numeric; the number of latent variables that were
#' generated, each of which will be associated with an observed cluster in X.
#' @param sig_blocks Integer or numeric; the number of generated latent
#' features that will have nonzero coefficients in the true model for y (all of
#' them will have coefficient beta_latent). In particular, the first sig_blocks
#' latent variables will have coefficient beta_latent, and the remaining nblocks
#' - sig_blocks features will have coefficient 0. Must be less than or equal to
#' n_clusters.
#' @param block_size Integer or numeric; for each of the n_blocks latent
#' variables, the covariance matrix will include the original latent feature
#' plus block_size - 1 noisy proxies that are correlated with the latent
#' variable.
#' @return A named list with the following elements: \item{beta}{A vector of
#' length `p` containing the coefficients for the true model for y. All entries
#' will equal 0 except for the sig_blocks latent variables that will have
#' coefficient beta_high and the k_unblocked independent features with
#' coefficient determined by beta_low.} \item{blocked_dgp_vars}{An integer
#' vector of length sig_blocks containing the indices of the features
#' corresponding to the latent features that will have nonzero coefficient
#' beta_high in the true model for y.} \item{sig_unblocked_vars}{An integer
#' vector of length k_unblocked containing the indices of the observed features
#' that are independent of the blocked features and have coefficient beta_low in
#' the true model for y. If k_unblocked = 0, this will just be NA.}
#' \item{insig_blocked_vars}{An integer vector containing the indices of the
#' features corresponding to the latent features that will have coefficient 0 in
#' the true model for y. If nblocks=0, this will just be NA.}
#' \item{latent_vars}{An integer vector of length nblocks containing the indices
#' of all of the latent features.}
#' @author Gregory Faletto, Jacob Bien
#' @references
<<faletto2022>>
makeCoefficients <- function(p, k_unblocked, beta_low, beta_high, nblocks,
sig_blocks, block_size){
# Check inputs
stopifnot(k_unblocked >= 0)
stopifnot(sig_blocks <= nblocks)
stopifnot(p >= nblocks*block_size + k_unblocked)
stopifnot(sig_blocks >= 0)
# Initialize beta
<- numeric(p)
beta
# identify indices of first coefficient in each significant block (these
# features will have coefficient beta_high)
<- NA
latent_vars if(nblocks >= 1){
<- as.integer(((0:(nblocks - 1))*block_size + 1))
latent_vars
stopifnot(all(latent_vars) %in% 1:p)
stopifnot(all(latent_vars) %in% 1:(block_size*nblocks))
stopifnot(length(unique(latent_vars)) == nblocks)
stopifnot(length(latent_vars) == nblocks)
}
<- latent_vars[1:sig_blocks]
blocked_dgp_vars
stopifnot(sig_blocks == length(blocked_dgp_vars))
<- beta_high
beta[blocked_dgp_vars]
# identify remaining coefficients in blocks (which ought to be set to 0)
<- NA
insig_blocked_vars
if(nblocks >= 1){
<- setdiff(1:(block_size*nblocks), blocked_dgp_vars)
insig_blocked_vars stopifnot(all(beta[insig_blocked_vars] == 0))
}# find significant unblocked variables (if applicable) and fill in
# coefficients
<- NA
sig_unblocked_vars
if(k_unblocked > 0){
# Range of weak signal coefficients
<- beta_low/sqrt(1:k_unblocked)
beta_lows <- (nblocks*block_size + 1):
sig_unblocked_vars *block_size + k_unblocked)
(nblocks<- as.integer(sig_unblocked_vars)
sig_unblocked_vars
stopifnot(length(sig_unblocked_vars) == k_unblocked)
stopifnot(length(unique(sig_unblocked_vars)) == k_unblocked)
stopifnot(all(sig_unblocked_vars) %in% 1:p)
<- beta_lows
beta[sig_unblocked_vars]
}
stopifnot(length(intersect(blocked_dgp_vars, sig_unblocked_vars)) == 0)
stopifnot(length(intersect(sig_unblocked_vars, insig_blocked_vars)) == 0)
stopifnot(length(intersect(blocked_dgp_vars, insig_blocked_vars)) == 0)
stopifnot(length(insig_blocked_vars) + length(blocked_dgp_vars) ==
*block_size)
nblocks
stopifnot(sig_blocks + length(insig_blocked_vars) + k_unblocked <= p)
stopifnot(sum(beta != 0) == sig_blocks + k_unblocked)
stopifnot(is.numeric(beta) | is.integer(beta))
return(list(beta=beta, blocked_dgp_vars=blocked_dgp_vars,
sig_unblocked_vars=sig_unblocked_vars,
insig_blocked_vars=insig_blocked_vars, latent_vars=latent_vars))
}
Tests for makeCoefficients()
::test_that("makeCoefficients works", {
testthatset.seed(5722)
<- makeCoefficients(p=9+2, k_unblocked=2, beta_low=.9, beta_high=-2,
ret nblocks=2, sig_blocks=1, block_size=3)
::expect_true(is.list(ret))
testthat::expect_identical(names(ret), c("beta", "blocked_dgp_vars",
testthat"sig_unblocked_vars",
"insig_blocked_vars", "latent_vars"))
::expect_true(is.numeric(ret$beta))
testthat::expect_equal(length(ret$beta), 9 + 2)
testthat::expect_equal(sum(ret$beta == 0), 9 + 2 - 3)
testthat# Checking structure
::expect_equal(ret$beta[1], -2)
testthat::expect_true(all(ret$beta[2:6] == 0))
testthat::expect_true(all(ret$beta[7:8] == c(.9, .9/sqrt(2))))
testthat::expect_true(all(ret$beta[9:11] == 0))
testthat
::expect_true(is.integer(ret$blocked_dgp_vars) | is.numeric(ret$blocked_dgp_vars))
testthat::expect_equal(ret$blocked_dgp_vars, 1)
testthat
::expect_identical(ret$sig_unblocked_vars, 7:8)
testthat
::expect_identical(ret$insig_blocked_vars, 2:6)
testthat
::expect_identical(ret$latent_vars, c(1L, 4L))
testthat
# Bad inputs
::expect_error(makeCoefficients(p=9+2, k_unblocked=-2, beta_low=.9,
testthatbeta_high=-2, nblocks=2, sig_blocks=1,
block_size=3),
"k_unblocked >= 0 is not TRUE", fixed=TRUE)
::expect_error(makeCoefficients(p=9+2, k_unblocked=2, beta_low=.9,
testthatbeta_high=-2, nblocks=2, sig_blocks=3,
block_size=3),
"sig_blocks <= nblocks is not TRUE", fixed=TRUE)
::expect_error(makeCoefficients(p=7, k_unblocked=2, beta_low=.9,
testthatbeta_high=-2, nblocks=2, sig_blocks=1,
block_size=3),
"p >= nblocks * block_size + k_unblocked is not TRUE",
fixed=TRUE)
::expect_error(makeCoefficients(p=9+2, k_unblocked=2, beta_low=.9,
testthatbeta_high=-2, nblocks=2,
sig_blocks=-1, block_size=3),
"sig_blocks >= 0 is not TRUE", fixed=TRUE)
})
## Test passed 😸
#' Generate observed and latent variables along with conditional mean
#'
#' @param n Integer or numeric; the number of observations to generate. (The
#' generated X and Z will have n rows, and the generated y and mu will have
#' length n.)
#' @param p Integer or numeric; the number of observed features (the generated X
#' will have p columns).
#' @param beta A numeric or integer vector of length `p` + sig_blocks containing
#' the coefficients for the true model for y.
#' @param Sigma A (`p` + n_blocks) x (`p` + n_blocks) numeric matrix
#' representing the covariance matrix for the latent features, the associated
#' proxies, and the remaining features.
#' @param blocked_dgp_vars An integer vector of length sig_blocks containing the
#' indices of the features corresponding to the latent features that have
#' nonzero coefficient beta_high in the true model for y.
#' @param latent_vars An integer vector of length n_blocks containing the
#' indices of all of the latent features.
#' @param block_size Integer or numeric; for each of the n_blocks latent
#' variables, X will contain block_size noisy proxies that are correlated with
#' the latent variable.
#' @param n_blocks Integer or numeric; the number of latent variables to
#' generate, each of which will be associated with an observed cluster in X.
#' Must be at least 1. Default is 1.
#' @param snr Integer or numeric; the signal-to-noise ratio of the response
#' y. If sigma_eps_sq is not specified, the variance of the noise in y will be
#' calculated using the formula sigma_eps_sq = sum(mu^2)/(n * snr). Only one of
#' snr and sigma_eps_sq must be specified. Default is NA.
#' @param sigma_eps_sq Integer or numeric; the variance on the noise added
#' to y. Only one of snr and sigma_eps_sq must be specified. Default is NA.
#' @return A named list with the following elements: \item{X}{An `n` x `p`
#' numeric matrix containing the observed proxies for the latent variables as
#' well as the observed unblocked (iid) variables.} \item{mu}{A length `n`
#' numeric vector; the expected response given X, Z, and the true
#' coefficient vector (equal to y minus the added noise).} \item{z}{An `n` x
#' n_blocks numeric matrix containing the n_blocks latent variables. Note that
#' (X, z) is multivariate Gaussian.} \item{sd}{Numeric; the standard deviation
#' of the noise added to mu to get y (calculated either from snr or
#' sigma_eps_sq).}
#' @author Gregory Faletto, Jacob Bien
genMuXZSd <- function(n, p, beta, Sigma, blocked_dgp_vars,
n_blocks=1, snr=NA, sigma_eps_sq=NA){
latent_vars, block_size, # Check inputs
stopifnot(length(blocked_dgp_vars) <= n_blocks)
stopifnot(nrow(Sigma) == p + n_blocks)
stopifnot(ncol(Sigma) == p + n_blocks)
if(any(!is.na(sigma_eps_sq))){
stopifnot(is.numeric(sigma_eps_sq) | is.integer(sigma_eps_sq))
stopifnot(length(sigma_eps_sq) == 1)
stopifnot(sigma_eps_sq >= 0)
else{
} if(any(is.na(snr))){
stop("Must provide one of snr or sigma_eps_sq")
}stopifnot(is.numeric(snr) | is.integer(snr))
stopifnot(length(snr) == 1)
stopifnot(snr > 0)
}
stopifnot(length(beta) == p + n_blocks)
stopifnot(all(beta[blocked_dgp_vars] != 0))
stopifnot(length(latent_vars) == n_blocks)
<- MASS::mvrnorm(n=n, mu=rep(0, p + n_blocks), Sigma=Sigma)
x
stopifnot(length(beta) == ncol(x))
<- as.numeric(x %*% beta)
mu
# Remove true blocked signal feature from each block from x now that I've
# generated mu
if(n_blocks > 0){
<- matrix(as.numeric(NA), nrow=n, ncol=n_blocks)
z stopifnot(length(latent_vars) > 0)
else{
} <- NA
z stopifnot(length(latent_vars) == 0)
}
if(length(latent_vars) > 0){
if(length(latent_vars) > 0){
1:n_blocks] <- x[, latent_vars]
z[,
}
}
<- x[, setdiff(1:(p + n_blocks), latent_vars)]
x
# If SNR is null, use sigma_eps_sq
if(!is.na(sigma_eps_sq)){
<- sqrt(sigma_eps_sq)
sd else{
}<- sqrt(sum(mu^2) / (n * snr)) # taking snr = ||mu||^2 /(n * sigma^2)
sd
}
# Check output
stopifnot(length(mu) == n)
stopifnot(nrow(x) == n)
stopifnot(ncol(x) == p)
if(any(!is.na(z))){
stopifnot(nrow(z) == n)
stopifnot(ncol(z) == n_blocks)
}
stopifnot(is.numeric(sd) | is.integer(sd))
stopifnot(length(sd) == 1)
stopifnot(!is.na(sd))
stopifnot(sd >= 0)
return(list(X=x, mu=mu, z=z, sd=sd))
}
Tests for genMuXZSd()
::test_that("genMuXZSd works", {
testthatset.seed(61232)
<- makeCovarianceMatrix(p=9+2, nblocks=2, block_size=3+1, rho=0.2499,
Sigma var=.25)
<- makeCoefficients(p=9+2, k_unblocked=2, beta_low=.9, beta_high=-2,
coefs nblocks=2, sig_blocks=1, block_size=3+1)
<- genMuXZSd(n=25, p=9, beta=coefs$beta, Sigma=Sigma,
ret blocked_dgp_vars=coefs$blocked_dgp_vars,
latent_vars=coefs$latent_vars, n_blocks=2, block_size=3,
snr=NA, sigma_eps_sq=1.2)
::expect_true(is.list(ret))
testthat::expect_identical(names(ret), c("X", "mu", "z", "sd"))
testthat
::expect_true(is.numeric(ret$X))
testthat::expect_true(is.matrix(ret$X))
testthat::expect_equal(nrow(ret$X), 25)
testthat::expect_equal(ncol(ret$X), 9)
testthat# X is Gaussian with mean 0 and variance 1/4; expect all observations to lie
# within 5 standard deviations of mean
::expect_true(all(abs(ret$X) < 5*sqrt(0.25)))
testthat
# Test that clusters are correlated--within-cluster correlation should be
# high, correlation with other features should be low
::expect_true(min(cor(ret$X[, 1:3])) > .9)
testthat::expect_true(max(abs(cor(ret$X[, 1:3], ret$X[, 4:9]))) < .6)
testthat
::expect_true(min(cor(ret$X[, 4:6])) > .9)
testthat::expect_true(max(abs(cor(ret$X[, 4:6],
testthat$X[, c(1:3, 7:9)]))) < .6)
ret
<- cor(ret$X[, 7:9])
cor_indeps ::expect_true(max(abs(cor_indeps[lower.tri(cor_indeps)])) < .6)
testthat
::expect_true(is.numeric(ret$mu))
testthat::expect_equal(length(ret$mu), 25)
testthat
::expect_true(is.numeric(ret$z))
testthat::expect_true(is.matrix(ret$z))
testthat::expect_equal(nrow(ret$z), 25)
testthat::expect_equal(ncol(ret$z), 2)
testthat
::expect_true(is.numeric(ret$sd))
testthat::expect_equal(ret$sd, sqrt(1.2))
testthat
# Specify SNR instead of sd
<- genMuXZSd(n=5, p=9, beta=coefs$beta, Sigma=Sigma,
ret blocked_dgp_vars=coefs$blocked_dgp_vars,
latent_vars=coefs$latent_vars, n_blocks=2, block_size=3,
snr=1, sigma_eps_sq=NA)
::expect_true(is.list(ret))
testthat::expect_identical(names(ret), c("X", "mu", "z", "sd"))
testthat
::expect_true(is.numeric(ret$X))
testthat::expect_true(is.matrix(ret$X))
testthat::expect_equal(nrow(ret$X), 5)
testthat::expect_equal(ncol(ret$X), 9)
testthat
::expect_true(is.numeric(ret$mu))
testthat::expect_equal(length(ret$mu), 5)
testthat
::expect_true(is.numeric(ret$z))
testthat::expect_true(is.matrix(ret$z))
testthat::expect_equal(nrow(ret$z), 5)
testthat::expect_equal(ncol(ret$z), 2)
testthat
::expect_true(is.numeric(ret$sd))
testthat::expect_true(ret$sd > 0)
testthat
# Try a single latent variable (z should be a one-column matrix)
<- makeCovarianceMatrix(p=9+1, nblocks=1, block_size=3, rho=0.8, var=2)
Sigma
<- makeCoefficients(p=9+1, k_unblocked=2, beta_low=.9, beta_high=-2,
coefs nblocks=1, sig_blocks=1, block_size=3)
<- genMuXZSd(n=5, p=9, beta=coefs$beta, Sigma=Sigma,
ret blocked_dgp_vars=coefs$blocked_dgp_vars,
latent_vars=coefs$latent_vars, n_blocks=1, block_size=3,
snr=NA, sigma_eps_sq=1.2)
::expect_true(is.list(ret))
testthat::expect_identical(names(ret), c("X", "mu", "z", "sd"))
testthat
::expect_true(is.numeric(ret$z))
testthat::expect_true(is.matrix(ret$z))
testthat::expect_equal(nrow(ret$z), 5)
testthat::expect_equal(ncol(ret$z), 1)
testthat
# Bad inputs
<- makeCovarianceMatrix(p=9+2, nblocks=2, block_size=3, rho=0.8, var=2)
Sigma
<- makeCoefficients(p=9+2, k_unblocked=2, beta_low=.9, beta_high=-2,
coefs nblocks=2, sig_blocks=1, block_size=3)
::expect_error(genMuXZSd(n=5, p=9, beta=coefs$beta, Sigma=Sigma,
testthatblocked_dgp_vars=coefs$blocked_dgp_vars,
latent_vars=coefs$latent_vars, n_blocks=0,
block_size=3, snr=NA, sigma_eps_sq=1.2),
"length(blocked_dgp_vars) <= n_blocks is not TRUE", fixed=TRUE)
::expect_error(genMuXZSd(n=5, p=9, beta=coefs$beta, Sigma=Sigma,
testthatblocked_dgp_vars=coefs$blocked_dgp_vars,
latent_vars=coefs$latent_vars, n_blocks=1,
block_size=3, snr=NA, sigma_eps_sq=1.2),
"nrow(Sigma) == p + n_blocks is not TRUE",
fixed=TRUE)
::expect_error(genMuXZSd(n=5, p=9, beta=coefs$beta, Sigma=Sigma,
testthatblocked_dgp_vars=coefs$blocked_dgp_vars,
latent_vars=coefs$latent_vars, n_blocks=2,
block_size=3, snr=NA, sigma_eps_sq="1.2"),
"is.numeric(sigma_eps_sq) | is.integer(sigma_eps_sq) is not TRUE",
fixed=TRUE)
::expect_error(genMuXZSd(n=5, p=9, beta=coefs$beta, Sigma=Sigma,
testthatblocked_dgp_vars=coefs$blocked_dgp_vars,
latent_vars=coefs$latent_vars, n_blocks=2,
block_size=3, snr=NA, sigma_eps_sq=1:2),
"length(sigma_eps_sq) == 1 is not TRUE",
fixed=TRUE)
::expect_error(genMuXZSd(n=5, p=9, beta=coefs$beta, Sigma=Sigma,
testthatblocked_dgp_vars=coefs$blocked_dgp_vars,
latent_vars=coefs$latent_vars, n_blocks=2,
block_size=3, snr=NA, sigma_eps_sq=-1),
"sigma_eps_sq >= 0 is not TRUE", fixed=TRUE)
::expect_error(genMuXZSd(n=5, p=9, beta=coefs$beta, Sigma=Sigma,
testthatblocked_dgp_vars=coefs$blocked_dgp_vars,
latent_vars=coefs$latent_vars, n_blocks=2,
block_size=3, snr=NA, sigma_eps_sq=NA),
"Must provide one of snr or sigma_eps_sq", fixed=TRUE)
::expect_error(genMuXZSd(n=5, p=9, beta=coefs$beta, Sigma=Sigma,
testthatblocked_dgp_vars=coefs$blocked_dgp_vars,
latent_vars=coefs$latent_vars, n_blocks=2,
block_size=3, snr="1", sigma_eps_sq=NA),
"is.numeric(snr) | is.integer(snr) is not TRUE",
fixed=TRUE)
::expect_error(genMuXZSd(n=5, p=9, beta=coefs$beta, Sigma=Sigma,
testthatblocked_dgp_vars=coefs$blocked_dgp_vars,
latent_vars=coefs$latent_vars, n_blocks=2,
block_size=3, snr=1:2, sigma_eps_sq=NA),
"length(snr) == 1 is not TRUE", fixed=TRUE)
::expect_error(genMuXZSd(n=5, p=9, beta=coefs$beta, Sigma=Sigma,
testthatblocked_dgp_vars=coefs$blocked_dgp_vars,
latent_vars=coefs$latent_vars, n_blocks=2,
block_size=3, snr=-1, sigma_eps_sq=NA),
"snr > 0 is not TRUE", fixed=TRUE)
::expect_error(genMuXZSd(n=5, p=9, beta=1:9, Sigma=Sigma,
testthatblocked_dgp_vars=coefs$blocked_dgp_vars,
latent_vars=coefs$latent_vars, n_blocks=2,
block_size=3, snr=1, sigma_eps_sq=NA),
"length(beta) == p + n_blocks is not TRUE", fixed=TRUE)
::expect_error(genMuXZSd(n=5, p=9, beta=rep(0, 9 + 2), Sigma=Sigma,
testthatblocked_dgp_vars=coefs$blocked_dgp_vars,
latent_vars=coefs$latent_vars, n_blocks=2,
block_size=3, snr=1, sigma_eps_sq=NA),
"all(beta[blocked_dgp_vars] != 0) is not TRUE",
fixed=TRUE)
::expect_error(genMuXZSd(n=5, p=9, beta=coefs$beta, Sigma=Sigma,
testthatblocked_dgp_vars=coefs$blocked_dgp_vars,
latent_vars=1:3, n_blocks=2,
block_size=3, snr=1, sigma_eps_sq=NA),
"length(latent_vars) == n_blocks is not TRUE",
fixed=TRUE)
})
## Test passed 🥇
Finally, tests for genClusteredData()
::test_that("genClusteredData works", {
testthatset.seed(23478)
<- genClusteredData(n=25, p=19, k_unclustered=2, cluster_size=5, n_clusters=3,
ret sig_clusters=2, rho=3.99, var=4, beta_latent=1.5,
beta_unclustered=-2, snr=NA, sigma_eps_sq=.5)
::expect_true(is.list(ret))
testthat::expect_identical(names(ret), c("X", "y", "Z", "mu"))
testthat
::expect_true(is.numeric(ret$X))
testthat::expect_true(is.matrix(ret$X))
testthat::expect_equal(ncol(ret$X), 19)
testthat::expect_equal(nrow(ret$X), 25)
testthat# X is Gaussian with mean 0 and variance 4; expect all observations to lie
# within 5 standard deviations of mean
::expect_true(all(abs(ret$X) < 5*2))
testthat# Test that clusters are correlated--within-cluster correlation should be
# high, correlation with other features should be low
::expect_true(min(cor(ret$X[, 1:5])) > .9)
testthat::expect_true(max(abs(cor(ret$X[, 1:5], ret$X[, 6:19]))) < .6)
testthat
::expect_true(min(cor(ret$X[, 6:10])) > .9)
testthat::expect_true(max(abs(cor(ret$X[, 6:10],
testthat$X[, c(1:5, 11:19)]))) < .6)
ret
::expect_true(min(cor(ret$X[, 11:15])) > .9)
testthat::expect_true(max(abs(cor(ret$X[, 11:15],
testthat$X[, c(1:10, 16:19)]))) < .6)
ret
<- cor(ret$X[, 16:19])
cor_indeps ::expect_true(max(abs(cor_indeps[lower.tri(cor_indeps)])) < .6)
testthat
::expect_true(is.numeric(ret$y))
testthat::expect_equal(length(ret$y), 25)
testthat
::expect_true(is.numeric(ret$Z))
testthat::expect_true(is.matrix(ret$Z))
testthat::expect_equal(nrow(ret$Z), 25)
testthat::expect_equal(ncol(ret$Z), 3)
testthat
::expect_true(is.numeric(ret$mu))
testthat::expect_equal(length(ret$mu), 25)
testthat# Because y is Gaussian with mean mu and standard deviation .5 conditional on
# mu, expect all observations to lie within 5 sds of mu
::expect_true(all(abs(ret$y - ret$mu) < 5*.5))
testthat
# Specify SNR instead of sigma_eps_sq
<- genClusteredData(n=5, p=19, k_unclustered=2, cluster_size=5, n_clusters=3,
ret sig_clusters=2, rho=.8, var=1.1, beta_latent=1.5,
beta_unclustered=-2, snr=1, sigma_eps_sq=NA)
::expect_true(is.list(ret))
testthat::expect_identical(names(ret), c("X", "y", "Z", "mu"))
testthat
# If sigma_eps_sq is specified, snr should be ignored. (Set an SNR that
# implies a very large noise variance to test this)
<- genClusteredData(n=5, p=19, k_unclustered=2, cluster_size=5, n_clusters=3,
ret sig_clusters=2, rho=.8, var=4, beta_latent=1.5,
beta_unclustered=-2, snr=.01, sigma_eps_sq=.25)
::expect_true(is.list(ret))
testthat::expect_identical(names(ret), c("X", "y", "Z", "mu"))
testthat
# Because y is Gaussian with mean mu and standard deviation .5 conditional on
# mu, expect all observations to lie within 5 sds of mu
::expect_true(all(abs(ret$y - ret$mu) < 5*.25))
testthat
# Try a single latent variable (z should be a one-column matrix)
<- genClusteredData(n=5, p=19, k_unclustered=2, cluster_size=5, n_clusters=1,
ret sig_clusters=1, rho=.8, var=4, beta_latent=1.5,
beta_unclustered=-2, snr=NA, sigma_eps_sq=.5)
::expect_true(is.list(ret))
testthat::expect_identical(names(ret), c("X", "y", "Z", "mu"))
testthat
::expect_true(is.numeric(ret$Z))
testthat::expect_true(is.matrix(ret$Z))
testthat::expect_equal(nrow(ret$Z), 5)
testthat::expect_equal(ncol(ret$Z), 1)
testthat
# Bad inputs
::expect_error(genClusteredData(n=5, p=19, k_unclustered=2,
testthatcluster_size=5, n_clusters=3,
sig_clusters="2", rho=.8, var=1.1,
beta_latent=1.5, beta_unclustered=-2,
snr=1, sigma_eps_sq=NA),
"is.numeric(sig_clusters) | is.integer(sig_clusters) is not TRUE",
fixed=TRUE)
::expect_error(genClusteredData(n=5, p=19, k_unclustered=2,
testthatcluster_size=5, n_clusters=3,
sig_clusters=4, rho=.8,
var=1.1, beta_latent=1.5,
beta_unclustered=-2, snr=1,
sigma_eps_sq=NA),
"sig_clusters <= n_clusters is not TRUE", fixed=TRUE)
::expect_error(genClusteredData(n=5, p=19, k_unclustered=2,
testthatcluster_size=5, n_clusters=3,
sig_clusters=-1, rho=.8,
var=1.1, beta_latent=1.5,
beta_unclustered=-2, snr=1,
sigma_eps_sq=NA),
"sig_clusters >= 0 is not TRUE", fixed=TRUE)
::expect_error(genClusteredData(n=5, p=19, k_unclustered=2,
testthatcluster_size=5, n_clusters=3,
sig_clusters=.6, rho=.8,
var=1.1, beta_latent=1.5,
beta_unclustered=-2, snr=1,
sigma_eps_sq=NA),
"sig_clusters == round(sig_clusters) is not TRUE",
fixed=TRUE)
# n_clusters
::expect_error(genClusteredData(n=5, p=19, k_unclustered=2,
testthatcluster_size=5,
n_clusters="3",
sig_clusters=2, rho=.8,
var=1.1, beta_latent=1.5,
beta_unclustered=-2, snr=1,
sigma_eps_sq=NA),
"is.numeric(n_clusters) | is.integer(n_clusters) is not TRUE",
fixed=TRUE)
::expect_error(genClusteredData(n=5, p=19, k_unclustered=2,
testthatcluster_size=5, n_clusters=3.2,
sig_clusters=2, rho=.8,
var=1.1, beta_latent=1.5,
beta_unclustered=-2, snr=1,
sigma_eps_sq=NA),
"n_clusters == round(n_clusters) is not TRUE",
fixed=TRUE)
::expect_error(genClusteredData(n=5, p=19, k_unclustered=2,
testthatcluster_size=5, n_clusters=0,
sig_clusters=0, rho=.8,
var=1.1, beta_latent=1.5,
beta_unclustered=-2, snr=1,
sigma_eps_sq=NA),
"n_clusters >= 1 is not TRUE", fixed=TRUE)
::expect_error(genClusteredData(n=5, p=19, k_unclustered=2,
testthatcluster_size=.3, n_clusters=3,
sig_clusters=2, rho=.8,
var=1.1, beta_latent=1.5,
beta_unclustered=-2, snr=1,
sigma_eps_sq=NA),
"cluster_size >= 1 is not TRUE", fixed=TRUE)
::expect_error(genClusteredData(p=16, k_unclustered=2,
testthatcluster_size=5, n_clusters=3,
sig_clusters=2, rho=.8,
var=1.1, beta_latent=1.5,
beta_unclustered=-2, snr=1,
sigma_eps_sq=NA),
"p >= n_clusters * cluster_size + k_unclustered is not TRUE",
fixed=TRUE)
::expect_error(genClusteredData(n=5, p=19, k_unclustered=2,
testthatcluster_size=5, n_clusters=3,
sig_clusters=2, rho=8,
var=1.1, beta_latent=1.5,
beta_unclustered=-2, snr=1,
sigma_eps_sq=NA),
"abs(rho) <= abs(var) is not TRUE", fixed=TRUE)
::expect_error(genClusteredData(n=5, p=19, k_unclustered=2,
testthatcluster_size=5, n_clusters=3,
sig_clusters=2, rho=0,
var=1.1, beta_latent=1.5,
beta_unclustered=-2, snr=1,
sigma_eps_sq=NA),
"rho != 0 is not TRUE", fixed=TRUE)
::expect_error(genClusteredData(n=5, p=19, k_unclustered=2,
testthatcluster_size=5, n_clusters=3,
sig_clusters=2, rho=0.8,
var=-1.1, beta_latent=1.5,
beta_unclustered=-2, snr=1,
sigma_eps_sq=NA),
"var > 0 is not TRUE", fixed=TRUE)
::expect_error(genClusteredData(n=5, p=19, k_unclustered=2,
testthatcluster_size=5, n_clusters=3,
sig_clusters=2, rho=0.8,
var=1.1, beta_latent=0,
beta_unclustered=-2, snr=1,
sigma_eps_sq=NA),
"beta_latent != 0 is not TRUE", fixed=TRUE)
::expect_error(genClusteredData(n=5, p=19, k_unclustered=2,
testthatcluster_size=5, n_clusters=3,
sig_clusters=2, rho=0.8,
var=1.1, beta_latent=1.5,
beta_unclustered=0, snr=1,
sigma_eps_sq=NA),
"beta_unclustered != 0 is not TRUE", fixed=TRUE)
# k_unclustered
::expect_error(genClusteredData(n=5, p=19, k_unclustered="2",
testthatcluster_size=5, n_clusters=3,
sig_clusters=2, rho=0.8,
var=1.1, beta_latent=1.5,
beta_unclustered=-2, snr=1,
sigma_eps_sq=NA),
"is.numeric(k_unclustered) | is.integer(k_unclustered) is not TRUE",
fixed=TRUE)
::expect_error(genClusteredData(n=5, p=19, k_unclustered=-2,
testthatcluster_size=5, n_clusters=3,
sig_clusters=2, rho=0.8,
var=1.1, beta_latent=1.5,
beta_unclustered=-2, snr=1,
sigma_eps_sq=NA),
"k_unclustered >= 0 is not TRUE", fixed=TRUE)
::expect_error(genClusteredData(n=5, p=19, k_unclustered=.2,
testthatcluster_size=5, n_clusters=3,
sig_clusters=2, rho=0.8,
var=1.1, beta_latent=1.5,
beta_unclustered=-2, snr=1,
sigma_eps_sq=NA),
"k_unclustered == round(k_unclustered) is not TRUE",
fixed=TRUE)
::expect_error(genClusteredData(n=5, p=19, k_unclustered=2,
testthatcluster_size=5, n_clusters=3,
sig_clusters=2, rho=0.8,
var=1.1, beta_latent=1.5,
beta_unclustered=-2, snr=NA,
sigma_eps_sq=NA),
"Must specify one of snr or sigma_eps_sq", fixed=TRUE)
::expect_error(genClusteredData(n=5, p=19, k_unclustered=2,
testthatcluster_size=5, n_clusters=3,
sig_clusters=2, rho=0.8,
var=1.1, beta_latent=1.5,
beta_unclustered=-2, snr=-.2,
sigma_eps_sq=NA),
"snr > 0 is not TRUE", fixed=TRUE)
::expect_error(genClusteredData(n=5, p=19, k_unclustered=2,
testthatcluster_size=5, n_clusters=3,
sig_clusters=2, rho=0.8,
var=1.1, beta_latent=1.5,
beta_unclustered=-2, snr=NA,
sigma_eps_sq=-.3),
"sigma_eps_sq > 0 is not TRUE", fixed=TRUE)
})
## Test passed 😀
#' Get lambda value for lasso
#'
#' Chooses a lambda value for the lasso used on a subsample of size n/2 (as in
#' cluster stability selection) by cross-validation.
#' @param X An n x p numeric matrix (preferably) or a data.frame (which will
#' be coerced internally to a matrix by the function model.matrix) containing
#' the p >= 2 features/predictors that will be used by cluster stability
#' selection.
#' @param y The response; an n-dimensional numeric or integer vector. (Unlike
#' in the more general css setup, this response must be real-valued since
#' lambda will be determined using the lasso with cross-validation.)
#' @param lambda_choice Character; either "min" or "1se". If "min", chooses
#' the lambda that minimizes the cross-validated error; if "1se", chooses the
#' largest lambda within one standard error of the minimum error lambda
#' (resulting in a smaller selected set, which may be desirable because the
#' model size corresponding to the minimum error lambda tends to be larger
#' than optimal. See, for example, Bühlmann and Meinshausen 2006, Prop. 1 and
#' Bühlmann and van de Geer 2011, Section 2.5.1.). Default is "1se".
#' @param nfolds Numeric or integer; the number of folds for cross-validation.
#' Must be at least 4 (as specified by cv.glmnet). Default is 10.
#' @return A numeric; the selected value of lambda.
#' @author Gregory Faletto, Jacob Bien
#' @references Bühlmann, P., & Meinshausen, N. (2006). High-Dimensional Graphs
#' and Variable Selection With the Lasso. \emph{The Annals of Statistics},
#' 34(3), 1436–1462. \url{https://doi.org/10.1214/009053606000000281}.
#' \cr Peter Bühlmann and Sara van de Geer. Statistics for High-Dimensional
#' Data. \emph{Springer Series in Statistics}. Springer, Heidelberg, 2011. ISBN
#' 978-3-642-20191-2. \url{http://dx.doi.org/10.1007/978-3-642-20192-9}. \cr
#' Jerome Friedman, Trevor Hastie, Robert Tibshirani (2010). Regularization
#' Paths for Generalized Linear Models via Coordinate Descent. \emph{Journal of
#' Statistical Software}, 33(1), 1-22. URL \url{https://www.jstatsoft.org/v33/i01/}.
#' @export
getLassoLambda <- function(X, y, lambda_choice="1se", nfolds=10){
stopifnot(is.character(lambda_choice))
stopifnot(length(lambda_choice) == 1)
stopifnot(!is.na(lambda_choice))
stopifnot(lambda_choice %in% c("min", "1se"))
stopifnot(is.matrix(X))
stopifnot(is.numeric(X) | is.integer(X))
<- nrow(X)
n
stopifnot(is.numeric(nfolds) | is.integer(nfolds))
stopifnot(length(nfolds) == 1)
stopifnot(nfolds == round(nfolds))
stopifnot(nfolds > 3)
# Since we are using the lasso, we require y to be a real-valued response
# (unlike for the general cluster stability selection procedure, where y
# can have a more general format as long as a suitable feature selection
# function is provided by the user)
stopifnot(is.numeric(y) | is.integer(y))
stopifnot(n == length(y))
# Sample size to use: inflate n/2 by a factor of nfolds/(nfolds - 1),
# so that each individual lasso fit is of size floor(n/2)
<- min(round(n/2*nfolds/(nfolds - 1)), n)
n_sample <- min(nfolds, n_sample)
nfolds
<- sample(1:n, n_sample)
inds_size <- glmnet::cv.glmnet(x=X[inds_size, ], y=y[inds_size],
size_results family="gaussian", nfolds=nfolds)
<- size_results[[paste("lambda.", lambda_choice, sep="")]]
lambda_ret
# Check output
stopifnot(length(lambda_ret) == 1)
stopifnot(is.numeric(lambda_ret) | is.integer(lambda_ret))
stopifnot(lambda_ret >= 0)
return(lambda_ret)
}
Tests for getLassoLambda()
:
::test_that("getLassoLambda works", {
testthatset.seed(7252)
<- matrix(stats::rnorm(10*6), nrow=10, ncol=6)
x <- stats::rnorm(10)
y
<- getLassoLambda(X=x, y=y, lambda_choice="1se", nfolds=4)
ret
::expect_true(is.numeric(ret) | is.integer(ret))
testthat::expect_true(!is.na(ret))
testthat::expect_equal(length(ret), 1)
testthat::expect_true(ret >= 0)
testthat
<- getLassoLambda(X=x, y=y, lambda_choice="min", nfolds=5)
ret
::expect_true(is.numeric(ret) | is.integer(ret))
testthat::expect_true(!is.na(ret))
testthat::expect_equal(length(ret), 1)
testthat::expect_true(ret >= 0)
testthat
# Bad inputs
::expect_error(getLassoLambda(X="x", y=y), "is.matrix(X) is not TRUE",
testthatfixed=TRUE)
::expect_error(getLassoLambda(X=x[1:9, ], y=y),
testthat"n == length(y) is not TRUE", fixed=TRUE)
::expect_error(getLassoLambda(X=x, y=TRUE),
testthat"is.numeric(y) | is.integer(y) is not TRUE",
fixed=TRUE)
# Error has quotation marks in it
::expect_error(getLassoLambda(X=x, y=y, lambda_choice="mni"))
testthat
::expect_error(getLassoLambda(X=x, y=y,
testthatlambda_choice=c("min", "1se")),
"length(lambda_choice) == 1 is not TRUE",
fixed=TRUE)
::expect_error(getLassoLambda(X=x, y=y, lambda_choice=1),
testthat"is.character(lambda_choice) is not TRUE",
fixed=TRUE)
::expect_error(getLassoLambda(X=x, y=y, nfolds="5"),
testthat"is.numeric(nfolds) | is.integer(nfolds) is not TRUE",
fixed=TRUE)
::expect_error(getLassoLambda(X=x, y=y, nfolds=1:4),
testthat"length(nfolds) == 1 is not TRUE", fixed=TRUE)
::expect_error(getLassoLambda(X=x, y=y, nfolds=3.2),
testthat"nfolds == round(nfolds) is not TRUE", fixed=TRUE)
::expect_error(getLassoLambda(X=x, y=y, nfolds=3),
testthat"nfolds > 3 is not TRUE", fixed=TRUE)
})
## ── Warning ('<text>:7'): getLassoLambda works ──────────────────────────────────
## Option grouped=FALSE enforced in cv.glmnet, since < 3 observations per fold
## Backtrace:
## 1. litr (local) getLassoLambda(X = x, y = y, lambda_choice = "1se", nfolds = 4)
## 2. glmnet::cv.glmnet(...)
## 3. glmnet:::cv.glmnet.raw(...)
##
## ── Warning ('<text>:14'): getLassoLambda works ─────────────────────────────────
## Option grouped=FALSE enforced in cv.glmnet, since < 3 observations per fold
## Backtrace:
## 1. litr (local) getLassoLambda(X = x, y = y, lambda_choice = "min", nfolds = 5)
## 2. glmnet::cv.glmnet(...)
## 3. glmnet:::cv.glmnet.raw(...)
getModelSize()
:
#' Automated estimation of model size
#'
#' This function is uses the lasso with cross-validation to estimate the
#' model size. Before using the lasso, in each cluster all features will be
#' dropped from X except the one feature with the highest marginal correlation
#' with y, as in the protolasso (Reid and Tibshirani 2016).
#'
#' @param X An n x p numeric matrix (preferably) or a data.frame (which will
#' be coerced internally to a matrix by the function model.matrix) containing
#' the p >= 2 features/predictors.
#' @param y A length-n numeric vector containing the responses; `y[i]` is the
#' response corresponding to observation `X[i, ]`. (Note that for the css
#' function, y does not have to be a numeric response, but for this function,
#' the underlying selection procedure is the lasso, so y must be a real-valued
#' response.)
#' @param clusters A named list where each entry is an integer vector of indices
#' of features that are in a common cluster, as in the output of css.
#' (The length of list clusters is equal to the number of clusters.) All
#' identified clusters must be non-overlapping, and all features must appear in
#' exactly one cluster (any unclustered features should be in their own
#' "cluster" of size 1). CAUTION: if the provided X is a data.frame that
#' contains a categorical feature with more than two levels, then the resulting
#' matrix made from model.matrix will have a different number of columns than
#' the provided data.frame, some of the feature numbers will change, and the
#' clusters argument will not work properly (in the current version of the
#' package). To get correct results in this case, please use model.matrix to
#' convert the data.frame to a numeric matrix on your own, then provide this
#' matrix and cluster assignments with respect to this matrix.
#' @return An integer; the estimated size of the model. The minimum returned
#' value is 1, even if the lasso with cross-validation chose no features.
#' @author Gregory Faletto, Jacob Bien
#' @references Reid, S., & Tibshirani, R. (2016). Sparse regression and marginal
#' testing using cluster prototypes. \emph{Biostatistics}, 17(2), 364–376.
#' \url{https://doi.org/10.1093/biostatistics/kxv049}.
#' @export
getModelSize <- function(X, y, clusters){
stopifnot(is.matrix(X) | is.data.frame(X))
# Check if x is a matrix; if it's a data.frame, convert to matrix.
if(is.data.frame(X)){
<- ncol(X)
p
<- stats::model.matrix(~ ., X)
X <- X[, colnames(X) != "(Intercept)"]
X
if(length(clusters) > 0 & (p != ncol(X))){
stop("When stats::model.matrix converted the provided data.frame X to a matrix, the number of columns changed (probably because the provided data.frame contained a factor variable with at least three levels). Please convert X to a matrix yourself using model.matrix and provide cluster assignments according to the columns of the new matrix.")
}
}
stopifnot(is.matrix(X))
stopifnot(all(!is.na(X)))
stopifnot(is.numeric(X) | is.integer(X))
<- nrow(X)
n
# Since the model size will be determined by cross-validation, the provided
# y must be real-valued (this should be checked internally in other
# functions before getModelSize is called, but this check is here just in
# case).
if(!is.numeric(y) & !is.integer(y)){
stop("getModelSize is trying to determine max_num_clusts using the lasso with cross-validation, but the y provided to getModelSize was not real-valued.")
}stopifnot(length(y) == n)
# Check clusters argument
<- checkCssClustersInput(clusters)
clusters
### Format clusters into a list where all features are in exactly one
# cluster (any unclustered features are put in their own "cluster" of size
# 1).
<- as.character(NA)
clust_names if(!is.null(names(clusters)) & is.list(clusters)){
<- names(clusters)
clust_names
}
<- formatClusters(clusters, p=ncol(X),
clusters clust_names=clust_names)$clusters
<- X
X_size
if(length(clusters) > 0){
# Create modified design matrix by dropping all but one feature from
# each cluster
<- integer()
feats_to_drop for(i in 1:length(clusters)){
<- clusters[[i]]
cluster_i if(length(cluster_i) > 1){
<- identifyPrototype(cluster_i, X, y)
feat_to_keep <- c(feats_to_drop, setdiff(cluster_i,
feats_to_drop
feat_to_keep))
}
}if(length(feats_to_drop) > 0){
<- X_size[, -1*feats_to_drop]
X_size
}
}
<- glmnet::cv.glmnet(x=X_size, y=y, family="gaussian")
size_results <- as.numeric(glmnet::coef.glmnet(size_results, s="lambda.1se"))
coefs
# Number of nonzero coefficients (subtract one in order to ignore intercept)
<- length(coefs[coefs != 0]) - 1
size
# Check output
stopifnot(is.numeric(size) | is.integer(size))
stopifnot(!is.na(size))
stopifnot(length(size) == 1)
stopifnot(size == round(size))
return(as.integer(max(size, 1)))
}
Tests for getModelSize()
:
::test_that("getModelSize works", {
testthatset.seed(1723)
<- genClusteredData(n=15, p=11, k_unclustered=1, cluster_size=3,
data n_clusters=2, sigma_eps_sq=10^(-6))
<- data$X
x <- data$y
y
<- list(red_cluster=1L:3L, green_cluster=4L:6L)
good_clusters
<- getModelSize(X=x, y=y, clusters=good_clusters)
ret
::expect_true(is.numeric(ret) | is.integer(ret))
testthat::expect_true(!is.na(ret))
testthat::expect_equal(length(ret), 1)
testthat::expect_equal(ret, round(ret))
testthat::expect_true(ret >= 1)
testthat# 11 features, but two clusters of size 3, so maximum size should
# be 11 - 2 - 2 = 7
::expect_true(ret <= 7)
testthat
## Trying other inputs
<- list(1L:3L, 5L:8L)
unnamed_clusters
<- getModelSize(X=x, y=y, clusters=unnamed_clusters)
ret
::expect_true(is.numeric(ret) | is.integer(ret))
testthat::expect_true(!is.na(ret))
testthat::expect_equal(length(ret), 1)
testthat::expect_equal(ret, round(ret))
testthat::expect_true(ret >= 1)
testthat# 11 features, but 3 in one cluster and 4 in another, so maximum size should
# be 11 - 2 - 3 = 6
::expect_true(ret <= 6)
testthat
# Single cluster
<- getModelSize(X=x, y=y, clusters=2:5)
ret
::expect_true(is.numeric(ret) | is.integer(ret))
testthat::expect_true(!is.na(ret))
testthat::expect_equal(length(ret), 1)
testthat::expect_equal(ret, round(ret))
testthat::expect_true(ret >= 1)
testthat# 11 features, but 4 in one cluster, so maximum size should be 11 - 3 = 8
::expect_true(ret <= 8)
testthat
# Intentionally don't provide clusters for all feature, mix up formatting,
# etc.
<- list(red_cluster=1:3, 5:8)
good_clusters
<- getModelSize(X=x, y=y, clusters=good_clusters)
ret
::expect_true(is.numeric(ret) | is.integer(ret))
testthat::expect_true(!is.na(ret))
testthat::expect_equal(length(ret), 1)
testthat::expect_equal(ret, round(ret))
testthat::expect_true(ret >= 1)
testthat# 11 features, but 3 in one cluster and 4 in another, so maximum size should
# be 11 - 2 - 3 = 6
::expect_true(ret <= 6)
testthat
## Trying bad inputs
::expect_error(getModelSize(X="x", y=y, clusters=good_clusters),
testthat"is.matrix(X) | is.data.frame(X) is not TRUE", fixed=TRUE)
::expect_error(getModelSize(X=x[1:5, ], y=y, clusters=good_clusters),
testthat"length(y) == n is not TRUE", fixed=TRUE)
::expect_error(getModelSize(X=x, y=FALSE, clusters=good_clusters),
testthat"getModelSize is trying to determine max_num_clusts using the lasso with cross-validation, but the y provided to getModelSize was not real-valued.",
fixed=TRUE)
::expect_error(getModelSize(X=x, y=y, clusters=list(3:7, 7:10)),
testthat"Overlapping clusters detected; clusters must be non-overlapping. Overlapping clusters: 1, 2.",
fixed=TRUE)
::expect_error(getModelSize(X=x, y=y, clusters=list(5:8, 5:8)),
testthat"length(clusters) == length(unique(clusters)) is not TRUE",
fixed=TRUE)
::expect_error(getModelSize(X=x, y=y, clusters=6:50),
testthat"length(all_clustered_feats) == p is not TRUE",
fixed=TRUE)
::expect_error(getModelSize(X=x, y=y,
testthatclusters=list(2:3, as.integer(NA))),
"!is.na(clusters) are not all TRUE",
fixed=TRUE)
::expect_error(getModelSize(X=x, y=y, clusters=list(2:3, c(4, 4, 5))),
testthat"length(clusters[[i]]) == length(unique(clusters[[i]])) is not TRUE",
fixed=TRUE)
::expect_error(getModelSize(X=x, y=y, clusters=list(1:4, -1)),
testthat"all(clusters[[i]] >= 1) is not TRUE",
fixed=TRUE)
::expect_error(getModelSize(X=x, y=y, clusters=list(1:4,
testthatc(2.3, 1.2))),
"all(clusters[[i]] == round(clusters[[i]])) is not TRUE",
fixed=TRUE)
})
## ── Warning ('<text>:12'): getModelSize works ───────────────────────────────────
## Option grouped=FALSE enforced in cv.glmnet, since < 3 observations per fold
## Backtrace:
## 1. litr (local) getModelSize(X = x, y = y, clusters = good_clusters)
## 2. glmnet::cv.glmnet(x = X_size, y = y, family = "gaussian")
## 3. glmnet:::cv.glmnet.raw(...)
##
## ── Warning ('<text>:27'): getModelSize works ───────────────────────────────────
## Option grouped=FALSE enforced in cv.glmnet, since < 3 observations per fold
## Backtrace:
## 1. litr (local) getModelSize(X = x, y = y, clusters = unnamed_clusters)
## 2. glmnet::cv.glmnet(x = X_size, y = y, family = "gaussian")
## 3. glmnet:::cv.glmnet.raw(...)
##
## ── Warning ('<text>:39'): getModelSize works ───────────────────────────────────
## Option grouped=FALSE enforced in cv.glmnet, since < 3 observations per fold
## Backtrace:
## 1. litr (local) getModelSize(X = x, y = y, clusters = 2:5)
## 2. glmnet::cv.glmnet(x = X_size, y = y, family = "gaussian")
## 3. glmnet:::cv.glmnet.raw(...)
##
## ── Warning ('<text>:54'): getModelSize works ───────────────────────────────────
## Option grouped=FALSE enforced in cv.glmnet, since < 3 observations per fold
## Backtrace:
## 1. litr (local) getModelSize(X = x, y = y, clusters = good_clusters)
## 2. glmnet::cv.glmnet(x = X_size, y = y, family = "gaussian")
## 3. glmnet:::cv.glmnet.raw(...)
printCssDf()
:
#' Prepares a data.frame summarazing cluster stability selection output to print
#'
#' Print a summary of the information from the css function.
#' @param css_results An object of class "cssr" (the output of the function
#' css).
#' @param cutoff Numeric; the outputted data.frame will display only those
#' clusters with selection proportions equal to at least cutoff. Must be between
#' 0 and 1. Default is 0 (in which case either all clusters are displayed, or
#' max_num_clusts are, if max_num_clusts is specified).
#' @param min_num_clusts Integer or numeric; the minimum number of clusters to
#' use regardless of cutoff. (That is, if the chosen cutoff returns fewer than
#' min_num_clusts clusters, the cutoff will be increased until at least
#' min_num_clusts clusters are selected.) Default is 1.
#' @param max_num_clusts Integer or numeric; the maximum number of clusters to
#' use regardless of cutoff. (That is, if the chosen cutoff returns more than
#' max_num_clusts clusters, the cutoff will be decreased until at most
#' max_num_clusts clusters are selected.) Default is NA (in which case
#' max_num_clusts is ignored).
#' @return A data.frame; each row contains a cluster, arranged in decreasing
#' order of cluster selection proportion from top to bottom. The columns are
#' ClustName (the name of the cluster that was either provided to css or made by
#' css if no name was provided); ClustProtoName (the name of the selection
#' prototype from the cluster, which is the feature with the greatest individual
#' selection proportion among all the cluster members, with ties broken by
#' choosing the feature with the highest correlation with the response if the
#' response is real-valued; only returned if the features are named),
#' ClustProtoNum (the column number of the prototype in the X matrix provided to
#' css), and ClustSize (the size of the cluster).
#' @author Gregory Faletto, Jacob Bien
#' @export
printCssDf <- function(css_results, cutoff=0, min_num_clusts=1,
max_num_clusts=NA){
# Check inputs
stopifnot(class(css_results) == "cssr")
checkCutoff(cutoff)
<- ncol(css_results$feat_sel_mat)
p
checkMinNumClusts(min_num_clusts, p, length(css_results$clusters))
<- checkMaxNumClusts(max_num_clusts, min_num_clusts, p,
max_num_clusts length(css_results$clusters))
<- getCssSelections(css_results, cutoff=cutoff,
sel_clusts min_num_clusts=min_num_clusts,
max_num_clusts=max_num_clusts)$selected_clusts
# sel_clusts is guaranteed to have length at least 1 by
# getCssSelections
# Get prototypes (feature from each cluster with highest selection
# proportion, breaking ties by using marginal correlations of features with
# y from data provided to css if y is real-valued)
<- getSelectionPrototypes(css_results, sel_clusts)
prototypes
# Cluster selection proportions
if(length(sel_clusts) > 1){
<- colMeans(css_results$clus_sel_mat[,
sel_clust_sel_props names(sel_clusts)])
else{
} <- mean(css_results$clus_sel_mat[,
sel_clust_sel_props names(sel_clusts)])
}
# Data.frame: name of cluster, cluster prototype, selection proportion,
# cluster size
if(!is.null(names(prototypes))){
<- data.frame(ClustName=names(sel_clusts),
print_df ClustProtoName=names(prototypes), ClustProtoNum=unname(prototypes),
ClustSelProp=sel_clust_sel_props, ClustSize=lengths(sel_clusts))
else{
} <- data.frame(ClustName=names(sel_clusts),
print_df ClustProtoNum=unname(prototypes), ClustSelProp=sel_clust_sel_props,
ClustSize=lengths(sel_clusts))
}
<- print_df[order(print_df$ClustSelProp, decreasing=TRUE), ]
print_df
rownames(print_df) <- NULL
stopifnot(is.data.frame(print_df))
stopifnot(nrow(print_df) >= 1)
return(print_df)
}
#' Identify selection prototypes from selected clusters
#'
#' Takes in list of selected clusters and returns an integer vector of the
#' indices of the features that were most frequently selected from each cluster
#'
#' @param css_results An object of class "cssr" (the output of the function
#' css).
#' @param selected_clusts A list of integer vectors; each vector must contain
#' the indices of features in a cluster.
#' @return An integer vector (of length equal to the number of clusters) of the
#' indices of the feature prototypes (the features from each cluster that were
#' selected the most often individually by the base method in cluster stability
#' selection). In the case of a tie, the tie is broken by choosing the feature
#' most correlated with the response in the full data set provided to css.
#' @author Gregory Faletto, Jacob Bien
getSelectionPrototypes <- function(css_results, selected_clusts){
# Check inputs
stopifnot(class(css_results) == "cssr")
stopifnot(is.list(selected_clusts))
<- length(selected_clusts)
n_selected_clusts stopifnot(n_selected_clusts >= 1)
stopifnot(all(lengths(selected_clusts) >= 1))
<- rep(as.integer(NA), n_selected_clusts)
prototypes for(i in 1:n_selected_clusts){
<- selected_clusts[[i]]
clust_i <- colMeans(css_results$feat_sel_mat)[clust_i]
sel_props_i <- clust_i[sel_props_i == max(sel_props_i)]
proto_i stopifnot(length(proto_i) >= 1)
if(length(proto_i) > 1){
if(is.numeric(css_results$y) | is.integer(css_results$y)){
# Break tie by looking at marginal correlations
<- stats::cor(css_results$X[, proto_i], css_results$y)[, 1]
corrs_i <- abs(corrs_i)
corrs_i <- proto_i[corrs_i == max(corrs_i)]
proto_i
}
}# If there is still a tie, break by choosing the first feature of those
# remaining
<- proto_i[1]
prototypes[i] names(prototypes)[i] <- colnames(css_results$X)[proto_i]
}
# Check output
stopifnot(is.integer(prototypes))
stopifnot(all(!is.na(prototypes)))
stopifnot(length(prototypes) == length(unique(prototypes)))
return(prototypes)
}
Tests for getSelectionPrototypes()
:
::test_that("getSelectionPrototypes works", {
testthatset.seed(67234)
<- genClusteredData(n=15, p=11, k_unclustered=1, cluster_size=3,
data n_clusters=2, sig_clusters=1, sigma_eps_sq=10^(-6))
<- data$X
x <- data$y
y
<- list(red_cluster=1L:3L, green_cluster=4L:6L)
good_clusters
<- css(X=x, y=y, lambda=0.01, clusters=good_clusters)
css_results
<- getSelectionPrototypes(css_results, selected_clusts=good_clusters)
ret
::expect_true(is.integer(ret))
testthat::expect_true(all(!is.na(ret)))
testthat::expect_equal(length(ret), length(good_clusters))
testthat::expect_equal(length(ret), length(unique(ret)))
testthatfor(i in 1:length(ret)){
::expect_true(ret[i] %in% good_clusters[[i]])
testthat# Find the largest selection proportion of any feature in cluster i
<- max(colMeans(css_results$feat_sel_mat[, good_clusters[[i]]]))
max_prop # Find the selection proportion of the identified prototype
<- colMeans(css_results$feat_sel_mat)[ret[i]]
proto_prop ::expect_equal(max_prop, proto_prop)
testthat
}
# Try with only one selected cluster (still should be in a list)
<- getSelectionPrototypes(css_results,
ret selected_clusts=list(red_cluster=1L:3L))
::expect_true(is.integer(ret))
testthat::expect_true(!is.na(ret))
testthat::expect_equal(length(ret), 1)
testthat::expect_true(ret %in% 1L:3L)
testthat# Find the largest selection proportion of any feature in the cluster
<- max(colMeans(css_results$feat_sel_mat[, 1L:3L]))
max_prop # Find the selection proportion of the identified prototype
<- colMeans(css_results$feat_sel_mat)[ret]
proto_prop ::expect_equal(max_prop, proto_prop)
testthat
## Trying bad inputs
# Error contains quotation marks
::expect_error(getSelectionPrototypes(x, good_clusters))
testthat
::expect_error(getSelectionPrototypes(css_results, 1L:3L),
testthat"is.list(selected_clusts) is not TRUE", fixed=TRUE)
::expect_error(getSelectionPrototypes(css_results, list()),
testthat"n_selected_clusts >= 1 is not TRUE", fixed=TRUE)
::expect_error(getSelectionPrototypes(css_results,
testthatlist(red_cluster=1L:3L,
green_cluster=4L:6L,
bad_cluster=integer())),
"all(lengths(selected_clusts) >= 1) is not TRUE",
fixed=TRUE)
})
## Test passed 😸
Tests for printCssDf()
:
::test_that("printCssDf works", {
testthatset.seed(67234)
<- genClusteredData(n=15, p=11, k_unclustered=1, cluster_size=3,
data n_clusters=2, sig_clusters=1, sigma_eps_sq=10^(-6))
<- data$X
x <- data$y
y
<- list(red_cluster=1L:3L, green_cluster=4L:6L)
good_clusters
<- css(X=x, y=y, lambda=0.01, clusters=good_clusters, B=10)
css_results
<- printCssDf(css_results)
ret
::expect_true(is.data.frame(ret))
testthat::expect_identical(colnames(ret), c("ClustName", "ClustProtoNum",
testthat"ClustSelProp", "ClustSize"))
# Total number of clusters is 11 - (3 - 1) - (3 - 1) = 7
::expect_equal(nrow(ret), 7)
testthat
::expect_true(is.character(ret$ClustName))
testthat::expect_true(all(names(good_clusters) %in% ret$ClustName))
testthat::expect_equal(length(ret$ClustName), length(unique(ret$ClustName)))
testthat
::expect_true(is.integer(ret$ClustProtoNum))
testthat::expect_true(ret[ret$ClustName=="red_cluster",
testthat"ClustProtoNum"] %in% 1L:3L)
::expect_true(ret[ret$ClustName=="green_cluster",
testthat"ClustProtoNum"] %in% 4L:6L)
<- !(ret$ClustName %in% c("red_cluster", "green_cluster"))
other_rows ::expect_true(all(ret[other_rows, "ClustProtoNum"] %in% 7L:11L))
testthat::expect_true(length(ret[other_rows, "ClustProtoNum"]) ==
testthatlength(unique(ret[other_rows, "ClustProtoNum"])))
::expect_true(is.numeric(ret$ClustSelProp))
testthat::expect_identical(ret$ClustSelProp, sort(ret$ClustSelProp,
testthatdecreasing=TRUE))
::expect_true(is.integer(ret$ClustSize))
testthat::expect_equal(ret[ret$ClustName=="red_cluster", "ClustSize"], 3)
testthat::expect_equal(ret[ret$ClustName=="green_cluster", "ClustSize"], 3)
testthat::expect_true(all(ret[other_rows, "ClustSize"] == 1))
testthat
# Try again naming features
<- x
x_named colnames(x_named) <- LETTERS[1:11]
<- css(X=x_named, y=y, lambda=0.01,
css_results_name_feats clusters=good_clusters, B=10)
<- printCssDf(css_results_name_feats)
ret
::expect_true(is.data.frame(ret))
testthat::expect_identical(colnames(ret), c("ClustName", "ClustProtoName",
testthat"ClustProtoNum", "ClustSelProp",
"ClustSize"))
# Total number of clusters is 11 - (3 - 1) - (3 - 1) = 7
::expect_equal(nrow(ret), 7)
testthat
::expect_true(is.character(ret$ClustName))
testthat::expect_true(all(names(good_clusters) %in% ret$ClustName))
testthat::expect_equal(length(ret$ClustName), length(unique(ret$ClustName)))
testthat
::expect_true(is.character(ret$ClustProtoName))
testthat::expect_true(ret[ret$ClustName=="red_cluster",
testthat"ClustProtoName"] %in% LETTERS[1:3])
::expect_true(ret[ret$ClustName=="green_cluster",
testthat"ClustProtoName"] %in% LETTERS[4:6])
<- !(ret$ClustName %in% c("red_cluster", "green_cluster"))
other_rows ::expect_true(all(ret[other_rows, "ClustProtoName"] %in% LETTERS[7:11]))
testthat::expect_true(length(ret[other_rows, "ClustProtoName"]) ==
testthatlength(unique(ret[other_rows, "ClustProtoName"])))
::expect_true(is.integer(ret$ClustProtoNum))
testthat::expect_true(ret[ret$ClustName=="red_cluster",
testthat"ClustProtoNum"] %in% 1L:3L)
::expect_true(ret[ret$ClustName=="green_cluster",
testthat"ClustProtoNum"] %in% 4L:6L)
::expect_true(all(ret[other_rows, "ClustProtoNum"] %in% 7L:11L))
testthat::expect_true(length(ret[other_rows, "ClustProtoNum"]) ==
testthatlength(unique(ret[other_rows, "ClustProtoNum"])))
::expect_true(is.numeric(ret$ClustSelProp))
testthat::expect_identical(ret$ClustSelProp, sort(ret$ClustSelProp,
testthatdecreasing=TRUE))
::expect_true(is.integer(ret$ClustSize))
testthat::expect_equal(ret[ret$ClustName=="red_cluster", "ClustSize"], 3)
testthat::expect_equal(ret[ret$ClustName=="green_cluster", "ClustSize"], 3)
testthat::expect_true(all(ret[other_rows, "ClustSize"] == 1))
testthat
# Unnamed clusters
<- list(1:3, 4:6)
unnamed_clusters
<- css(X=x, y=y, lambda=0.01, clusters=unnamed_clusters,
css_results_unnamed B=10)
<- printCssDf(css_results_unnamed)
ret
::expect_true(is.data.frame(ret))
testthat::expect_identical(colnames(ret), c("ClustName", "ClustProtoNum",
testthat"ClustSelProp", "ClustSize"))
# Total number of clusters is 11 - (3 - 1) - (3 - 1) = 7
::expect_equal(nrow(ret), 7)
testthat
::expect_true(is.character(ret$ClustName))
testthat::expect_equal(length(ret$ClustName), length(unique(ret$ClustName)))
testthat
::expect_true(is.integer(ret$ClustProtoNum))
testthat
::expect_true(is.numeric(ret$ClustSelProp))
testthat::expect_identical(ret$ClustSelProp, sort(ret$ClustSelProp,
testthatdecreasing=TRUE))
::expect_true(is.integer(ret$ClustSize))
testthat
# Try other settings for cutoff, min_num_clusts, max_num_clusts, etc.
<- printCssDf(css_results, max_num_clusts=3)
ret
::expect_true(is.data.frame(ret))
testthat::expect_identical(colnames(ret), c("ClustName", "ClustProtoNum",
testthat"ClustSelProp", "ClustSize"))
::expect_true(nrow(ret) <= 3)
testthat
::expect_true(is.character(ret$ClustName))
testthat::expect_equal(length(ret$ClustName), length(unique(ret$ClustName)))
testthat
::expect_true(is.integer(ret$ClustProtoNum))
testthat<- !(ret$ClustName %in% c("red_cluster", "green_cluster"))
other_rows ::expect_true(all(ret[other_rows, "ClustProtoNum"] %in% 7L:11L))
testthat::expect_true(length(ret[other_rows, "ClustProtoNum"]) ==
testthatlength(unique(ret[other_rows, "ClustProtoNum"])))
::expect_true(is.numeric(ret$ClustSelProp))
testthat::expect_identical(ret$ClustSelProp, sort(ret$ClustSelProp,
testthatdecreasing=TRUE))
::expect_true(is.integer(ret$ClustSize))
testthat::expect_true(all(ret[other_rows, "ClustSize"] == 1))
testthat
if("red_cluster" %in% ret$ClustName){
::expect_true(ret[ret$ClustName=="red_cluster",
testthat"ClustProtoNum"] %in% 1L:3L)
::expect_equal(ret[ret$ClustName=="red_cluster", "ClustSize"], 3)
testthat
}
if("green_cluster" %in% ret$ClustName){
::expect_true(ret[ret$ClustName=="green_cluster",
testthat"ClustProtoNum"] %in% 4L:6L)
::expect_equal(ret[ret$ClustName=="green_cluster", "ClustSize"], 3)
testthat
}
<- printCssDf(css_results, min_num_clusts=2, cutoff=1)
ret
::expect_true(is.data.frame(ret))
testthat::expect_identical(colnames(ret), c("ClustName", "ClustProtoNum",
testthat"ClustSelProp", "ClustSize"))
# Total number of clusters is 11 - (3 - 1) - (3 - 1) = 7
::expect_true(nrow(ret) >= 2)
testthat::expect_true(nrow(ret) <= 7)
testthat
::expect_true(is.character(ret$ClustName))
testthat::expect_equal(length(ret$ClustName), length(unique(ret$ClustName)))
testthat
::expect_true(is.integer(ret$ClustProtoNum))
testthat<- !(ret$ClustName %in% c("red_cluster", "green_cluster"))
other_rows ::expect_true(all(ret[other_rows, "ClustProtoNum"] %in% 7L:11L))
testthat::expect_true(length(ret[other_rows, "ClustProtoNum"]) ==
testthatlength(unique(ret[other_rows, "ClustProtoNum"])))
::expect_true(is.numeric(ret$ClustSelProp))
testthat::expect_identical(ret$ClustSelProp, sort(ret$ClustSelProp,
testthatdecreasing=TRUE))
::expect_true(is.integer(ret$ClustSize))
testthat::expect_true(all(ret[other_rows, "ClustSize"] == 1))
testthat
if("red_cluster" %in% ret$ClustName){
::expect_true(ret[ret$ClustName=="red_cluster",
testthat"ClustProtoNum"] %in% 1L:3L)
::expect_equal(ret[ret$ClustName=="red_cluster", "ClustSize"], 3)
testthat
}
if("green_cluster" %in% ret$ClustName){
::expect_true(ret[ret$ClustName=="green_cluster",
testthat"ClustProtoNum"] %in% 4L:6L)
::expect_equal(ret[ret$ClustName=="green_cluster", "ClustSize"], 3)
testthat
}
#
<- printCssDf(css_results, cutoff=1)
ret
::expect_true(is.data.frame(ret))
testthat::expect_identical(colnames(ret), c("ClustName", "ClustProtoNum",
testthat"ClustSelProp", "ClustSize"))
::expect_true(nrow(ret) >= 1)
testthat::expect_true(nrow(ret) <= 7)
testthat
::expect_true(is.character(ret$ClustName))
testthat::expect_equal(length(ret$ClustName), length(unique(ret$ClustName)))
testthat
::expect_true(is.integer(ret$ClustProtoNum))
testthat<- !(ret$ClustName %in% c("red_cluster", "green_cluster"))
other_rows ::expect_true(all(ret[other_rows, "ClustProtoNum"] %in% 7L:11L))
testthat::expect_true(length(ret[other_rows, "ClustProtoNum"]) ==
testthatlength(unique(ret[other_rows, "ClustProtoNum"])))
::expect_true(is.numeric(ret$ClustSelProp))
testthat::expect_identical(ret$ClustSelProp, sort(ret$ClustSelProp,
testthatdecreasing=TRUE))
::expect_true(is.integer(ret$ClustSize))
testthat::expect_true(all(ret[other_rows, "ClustSize"] == 1))
testthat
if("red_cluster" %in% ret$ClustName){
::expect_true(ret[ret$ClustName=="red_cluster",
testthat"ClustProtoNum"] %in% 1L:3L)
::expect_equal(ret[ret$ClustName=="red_cluster", "ClustSize"], 3)
testthat
}
if("green_cluster" %in% ret$ClustName){
::expect_true(ret[ret$ClustName=="green_cluster",
testthat"ClustProtoNum"] %in% 4L:6L)
::expect_equal(ret[ret$ClustName=="green_cluster", "ClustSize"], 3)
testthat
}
## Trying bad inputs
# Error has quotation marks in it
::expect_error(printCssDf("css_results"))
testthat
::expect_error(printCssDf(css_results, cutoff=-.1),
testthat"cutoff >= 0 is not TRUE", fixed=TRUE)
::expect_error(printCssDf(css_results, min_num_clusts=3.2),
testthat"min_num_clusts == round(min_num_clusts) is not TRUE",
fixed=TRUE)
::expect_error(printCssDf(css_results, max_num_clusts="5"),
testthat"is.numeric(max_num_clusts) | is.integer(max_num_clusts) is not TRUE",
fixed=TRUE)
})
## Test passed 🎉
#' Print cluster stability selection output
#'
#' Print a summary of the information from the css function (using output from
#' printCssDf function).
#' @param x An object of class "cssr" (the output of the function css).
#' @param cutoff Numeric; print.cssr will display only those
#' clusters with selection proportions equal to at least cutoff. Must be between
#' 0 and 1. Default is 0 (in which case either all clusters are displayed, or
#' max_num_clusts are, if max_num_clusts is specified).
#' @param min_num_clusts Integer or numeric; the minimum number of clusters to
#' use regardless of cutoff. (That is, if the chosen cutoff returns fewer than
#' min_num_clusts clusters, the cutoff will be increased until at least
#' min_num_clusts clusters are selected.) Default is 1.
#' @param max_num_clusts Integer or numeric; the maximum number of clusters to
#' use regardless of cutoff. (That is, if the chosen cutoff returns more than
#' max_num_clusts clusters, the cutoff will be decreased until at most
#' max_num_clusts clusters are selected.) Default is NA (in which case
#' max_num_clusts is ignored).
#' @param ... Additional arguments to generic print.data.frame function
#' @return A data.frame; each row contains a cluster, arranged in decreasing
#' order of cluster selection proportion from top to bottom. The columns are
#' ClustName (the name of the cluster that was either provided to css or made by
#' css if no name was provided); ClustProtoName (the name of the selection
#' prototype from the cluster, which is the feature with the greatest individual
#' selection proportion among all the cluster members, with ties broken by
#' choosing the feature with the highest correlation with the response if the
#' response is real-valued; only returned if the features are named),
#' ClustProtoNum (the column number of the prototype in the X matrix provided to
#' css), and ClustSize (the size of the cluster).
#' @author Gregory Faletto, Jacob Bien
#' @export
print.cssr <- function(x, cutoff=0, min_num_clusts=1, max_num_clusts=NA, ...){
<- printCssDf(css_results=x, cutoff=cutoff,
df min_num_clusts=min_num_clusts, max_num_clusts=max_num_clusts)
print.data.frame(df, ...)
}