6 Other useful functions
The following are some other functions that are useful for specific tasks.
genClusteredData()generates jointly Gaussian data sets that include clusters of features that are correlated with a latent feature (the latent features themselves are also provided), as in the simulation in Example 1, Section 5.1, and Section 5.2 of Faletto and Bien (2022).genClusteredDataWeighted()generates jointly Gaussian data sets that include clusters of features that are correlated with a latent feature with unequal correlations (the latent features themselves are also provided), as in the simulation Section 5.3 of Faletto and Bien (2022).genClusteredDataWeightedRandom()generates jointly Gaussian data sets that include clusters of features that are correlated with a latent feature with unequal correlations (the latent features themselves are also provided) that are determined at random.getLassoLambda()uses a data set to choose a good value of lambda for subsamples of size n/2 (as in cluster stability selection) by cross-validation.getModelSize()estimates the best size for a lasso model on a provided data set.printCssDf()converts the output of thecss()function (an object of class cssr) to a data.frame that is ready to be printed. We provide end-user access toprintCssDf()because this data.frame itself may be useful as an object that summarizes the output fromcss().print.cssr()is the print function for objects of class cssr.-
-
checkGenClusteredDataInputs()checks that the inputs togenClusteredData()are as expected. -
genZmuY()generates the latent Z, the weak signal and noise features in X, the latent mu, and the observed y from the specified parameters. -
getNoiseVar()calculates the variance of the noise to add to Z so that the proxies have the specified correlation with each other.
-
-
-
checkGenClusteredDataWeightedInputs()verifies the inputs togenClusteredDataWeighted().
-
-
-
getSelectionPrototypes()identifies the prototypes from selected clusters (the feature in each cluster that is most correlated with the response)
-
#' Generate randomly sampled data including noisy observations of latent
#' variables
#'
#' Generate a data set including latent features Z, observed features X (which
#' may include noisy or noiseless observations of the latent features in Z),
#' an observed response y which is a linear model of features from Z and X as
#' well as independent mean zero noise, and mu (the responses from y without
#' the added noise). Data is generated in the same way as in the simulations
#' from Faletto and Bien (2022).
#' @param n Integer or numeric; the number of observations to generate. (The
#' generated X and Z will have n rows, and the generated y and mu will have
#' length n.)
#' @param p Integer or numeric; the number of features to generate. The
#' generated X will have p columns.
#' @param k_unclustered Integer or numeric; the number of features in X that
#' will have nonzero coefficients in the true model for y among those features
#' not generated from the n_clusters latent variables (called "weak signal"
#' features in the simulations from Faletto and Bien 2022). The coefficients on
#' these features will be determined by beta_unclustered. Must be at least 1.
#' @param cluster_size Integer or numeric; for each of the n_clusters latent
#' variables, X will contain cluster_size noisy proxies that are correlated with
#' the latent variable. Must be at least 2.
#' @param n_clusters Integer or numeric; the number of latent variables to
#' generate, each of which will be associated with an observed cluster in X.
#' Must be at least 1. Default is 1.
#' @param sig_clusters Integer or numeric; the number of generated latent
#' features that will have nonzero coefficients in the true model for y (all of
#' them will have coefficient beta_latent). Must be less than or equal to
#' n_clusters. Default is 1.
#' @param rho Integer or numeric; the correlation of the proxies in each cluster
#' with the latent variable. Must be greater than 0. Default is 0.9.
#' @param beta_latent Integer or numeric; the coefficient used for all
#' sig_clusters latent variables that have nonzero coefficients in the true
#' model for y. Can't equal 0. Default is 1.5.
#' @param beta_unclustered Integer or numeric; the maximum coefficient in the
#' model for y among the k_unclustered features in X not generated from the
#' latent variables. The coefficients of the features will be
#' beta_unclustered/sqrt(1:k_unclustered). Can't equal 0. Default is 1.
#' @param snr Integer or numeric; the signal-to-noise ratio of the response
#' y. If sigma_eps_sq is not specified, the variance of the noise in y will be
#' calculated using the formula sigma_eps_sq = sum(mu^2)/(n * snr). Only one of
#' snr and sigma_eps_sq must be specified. Default is NA.
#' @param sigma_eps_sq Integer or numeric; the variance on the noise added
#' to y. Only one of snr and sigma_eps_sq must be specified. Default is NA.
#' @return A list of the following elements. \item{X}{An n x p numeric matrix of
#' n observations from a p-dimensional multivariate normal distribution
#' generated using the specified parameters. The first n_clusters times
#' cluster_size features will be the clusters of features correlated with the
#' n_clusters latent variables. The next k_unclustered features will be the
#' "weak signal" features, and the remaining p - n_clusters*cluster_size -
#' k_unclustered features will be the unclustered noise features.} \item{y}{A
#' length n numeric vector; the response generated from X, the latent features
#' from Z, and the coefficient vector, along with additive noise.} \item{Z}{The
#' latent features; either a numeric vector (if n_clusters > 1) or a numeric
#' matrix (if n_clusters > 1). Note that (X, Z) is multivariate Gaussian.}
#' \item{mu}{A length `n` numeric vector; the expected response given X, Z, and
#' the true coefficient vector (equal to y minus the added noise).}
#' @author Gregory Faletto, Jacob Bien
#' @references
<<faletto2022>>
#' @importFrom stats rnorm runif
#' @export
genClusteredData <- function(n, p, k_unclustered, cluster_size, n_clusters=1,
sig_clusters=1, rho=0.9, beta_latent=1.5, beta_unclustered=1,
snr=as.numeric(NA), sigma_eps_sq=as.numeric(NA)){
# Check inputs
checkGenClusteredDataInputs(p, k_unclustered, cluster_size, n_clusters,
sig_clusters, rho, beta_latent, beta_unclustered, snr,
sigma_eps_sq)
ret <- genZmuY(n=n, p=p, k_unclustered=k_unclustered,
cluster_size=cluster_size, n_clusters=n_clusters,
sig_clusters=sig_clusters, beta_latent=beta_latent,
beta_unclustered=beta_unclustered, snr=snr, sigma_eps_sq=sigma_eps_sq)
Z <- ret$Z
y <- ret$y
mu <- ret$mu
other_X <- ret$other_X
# Finally, generate clusters of proxies to complete X. First, get needed
# variances of noise to add
noise_var <- getNoiseVar(rho)
# Generate these noise features
noise_mat <- matrix(stats::rnorm(n*n_clusters*cluster_size, mean=0,
sd=sqrt(noise_var)), n, n_clusters*cluster_size)
# Create matrix of proxies
proxy_mat <- matrix(as.numeric(NA), n, n_clusters*cluster_size)
if(n_clusters > 1){
for(i in 1:n_clusters){
first_ind <- (i - 1)*cluster_size + 1
last_ind <- i*cluster_size
proxy_mat[, first_ind:last_ind] <- Z[, i] +
noise_mat[, first_ind:last_ind]
}
} else{
stopifnot(ncol(noise_mat) == cluster_size)
proxy_mat[, 1:cluster_size] <- Z + noise_mat
}
X <- cbind(proxy_mat, other_X)
Z <- as.matrix(Z)
# Check output
stopifnot(length(mu) == n)
stopifnot(nrow(X) == n)
stopifnot(ncol(X) == p)
if(any(!is.na(Z))){
stopifnot(nrow(Z) == n)
stopifnot(ncol(Z) == n_clusters)
}
return(list(X=X, y=y, Z=Z, mu=mu))
}#' Check inputs to genClusteredData
#'
#' @param p Integer or numeric; the number of features to generate. The
#' generated X will have p columns.
#' @param k_unclustered Integer or numeric; the number of features in X that
#' will have nonzero coefficients in the true model for y among those features
#' not generated from the n_clusters latent variables (called "weak signal"
#' features in the simulations from Faletto and Bien 2022). The coefficients on
#' these features will be determined by beta_unclustered.
#' @param cluster_size Integer or numeric; for each of the n_clusters latent
#' variables, X will contain cluster_size noisy proxies that are correlated with
#' the latent variable.
#' @param n_clusters Integer or numeric; the number of latent variables to
#' generate, each of which will be associated with an observed cluster in X.
#' Must be at least 1. Default is 1.
#' @param sig_clusters Integer or numeric; the number of generated latent
#' features that will have nonzero coefficients in the true model for y (all of
#' them will have coefficient beta_latent). Must be less than or equal to
#' n_clusters. Default is 1.
#' @param rho Integer or numeric; the covariance of the proxies in each cluster
#' with the latent variable (and each other). Note that the correlation between
#' the features in the cluster will be rho/var. Can't equal 0. Default is 0.9.
#' @param beta_latent Integer or numeric; the coefficient used for all
#' sig_clusters latent variables that have nonzero coefficients in the true
#' model for y. Can't equal 0. Default is 1.5.
#' @param beta_unclustered Integer or numeric; the maximum coefficient in the
#' model for y among the k_unclustered features in X not generated from the
#' latent variables. The coefficients of the features will be
#' beta_unclustered/sqrt(1:k_unclustered). Can't equal 0. Default is 1.
#' @param snr Integer or numeric; the signal-to-noise ratio of the response
#' y. If sigma_eps_sq is not specified, the variance of the noise in y will be
#' calculated using the formula sigma_eps_sq = sum(mu^2)/(n * snr). Only one of
#' snr and sigma_eps_sq must be specified. Default is NA.
#' @param sigma_eps_sq Integer or numeric; the variance on the noise added
#' to y. Only one of snr and sigma_eps_sq must be specified. Default is NA.
#' @return A list of the following elements. \item{X}{An n x p numeric matrix of
#' n observations from a p-dimensional multivariate normal distribution
#' generated using the specified parameters. The first n_clusters times
#' cluster_size features will be the clusters of features correlated with the
#' n_clusters latent variables. The next k_unclustered features will be the
#' "weak signal" features, and the remaining p - n_clusters*cluster_size -
#' k_unclustered features will be the unclustered noise features.} \item{y}{A
#' length n numeric vector; the response generated from X, the latent features
#' from Z, and the coefficient vector, along with additive noise.} \item{Z}{The
#' latent features; either a numeric vector (if n_clusters > 1) or a numeric
#' matrix (if n_clusters > 1). Note that (X, Z) is multivariate Gaussian.}
#' \item{mu}{A length `n` numeric vector; the expected response given X, Z, and
#' the true coefficient vector (equal to y minus the added noise).}
#' @author Gregory Faletto, Jacob Bien
checkGenClusteredDataInputs <- function(p, k_unclustered, cluster_size,
n_clusters, sig_clusters, rho, beta_latent, beta_unclustered, snr,
sigma_eps_sq){
stopifnot(is.numeric(sig_clusters) | is.integer(sig_clusters))
stopifnot(sig_clusters <= n_clusters)
stopifnot(sig_clusters >= 0)
stopifnot(sig_clusters == round(sig_clusters))
stopifnot(is.numeric(n_clusters) | is.integer(n_clusters))
stopifnot(n_clusters == round(n_clusters))
# TODO(gregfaletto): is it easy to remove the requirement that n_clusters is
# at least 1 (so that it's possible to generate data with no latent
# features)? If so, should only check that cluster_size >= 1 if n_clusters
# >= 1, and in makeCovarianceMatrix function only need block_size >= 1
# rather than 2.
stopifnot(n_clusters >= 1)
stopifnot(cluster_size >= 2)
stopifnot(rho > 0)
stopifnot(beta_latent != 0)
stopifnot(beta_unclustered != 0)
stopifnot(is.numeric(k_unclustered) | is.integer(k_unclustered))
stopifnot(k_unclustered >= 2)
stopifnot(k_unclustered == round(k_unclustered))
stopifnot(p >= n_clusters*cluster_size + k_unclustered)
# Same as make_sparse_blocked_linear_model_random, but ith coefficient
# of weak signal features is beta_unclustered/sqrt(i) in order to have
# a definitive ranking of weak signal features.
if(is.na(snr) & is.na(sigma_eps_sq)){
stop("Must specify one of snr or sigma_eps_sq")
}
if(is.na(snr)){
stopifnot(all(!is.na(sigma_eps_sq)))
stopifnot(is.numeric(sigma_eps_sq) | is.integer(sigma_eps_sq))
stopifnot(length(sigma_eps_sq) == 1)
stopifnot(sigma_eps_sq >= 0)
} else{
stopifnot(is.numeric(snr) | is.integer(snr))
stopifnot(length(snr) == 1)
stopifnot(snr > 0)
}
}Tests for checkGenClusteredDataInputs()
testthat::test_that("checkGenClusteredDataInputs works", {
set.seed(7612)
# Should get no error
checkGenClusteredDataInputs(p=19, k_unclustered=2, cluster_size=5, n_clusters=3,
sig_clusters=2, rho=.8, beta_latent=1.5,
beta_unclustered=-2, snr=NA, sigma_eps_sq=.5)
checkGenClusteredDataInputs(p=19, k_unclustered=2, cluster_size=5, n_clusters=3,
sig_clusters=2, rho=.8, beta_latent=1.5,
beta_unclustered=-2, snr=1, sigma_eps_sq=NA)
# sig_clusters
testthat::expect_error(checkGenClusteredDataInputs(p=19, k_unclustered=2,
cluster_size=5, n_clusters=3,
sig_clusters="2", rho=.8,
beta_latent=1.5,
beta_unclustered=-2, snr=1,
sigma_eps_sq=NA),
"is.numeric(sig_clusters) | is.integer(sig_clusters) is not TRUE",
fixed=TRUE)
testthat::expect_error(checkGenClusteredDataInputs(p=19, k_unclustered=2,
cluster_size=5, n_clusters=3,
sig_clusters=4, rho=.8,
beta_latent=1.5,
beta_unclustered=-2, snr=1,
sigma_eps_sq=NA),
"sig_clusters <= n_clusters is not TRUE", fixed=TRUE)
testthat::expect_error(checkGenClusteredDataInputs(p=19, k_unclustered=2,
cluster_size=5, n_clusters=3,
sig_clusters=-1, rho=.8,
beta_latent=1.5,
beta_unclustered=-2, snr=1,
sigma_eps_sq=NA),
"sig_clusters >= 0 is not TRUE", fixed=TRUE)
testthat::expect_error(checkGenClusteredDataInputs(p=19, k_unclustered=2,
cluster_size=5, n_clusters=3,
sig_clusters=.6, rho=.8,
beta_latent=1.5,
beta_unclustered=-2, snr=1,
sigma_eps_sq=NA),
"sig_clusters == round(sig_clusters) is not TRUE",
fixed=TRUE)
# n_clusters
testthat::expect_error(checkGenClusteredDataInputs(p=19, k_unclustered=2,
cluster_size=5,
n_clusters="3",
sig_clusters=2, rho=.8,
beta_latent=1.5,
beta_unclustered=-2, snr=1,
sigma_eps_sq=NA),
"is.numeric(n_clusters) | is.integer(n_clusters) is not TRUE",
fixed=TRUE)
testthat::expect_error(checkGenClusteredDataInputs(p=19, k_unclustered=2,
cluster_size=5, n_clusters=3.2,
sig_clusters=2, rho=.8,
beta_latent=1.5,
beta_unclustered=-2, snr=1,
sigma_eps_sq=NA),
"n_clusters == round(n_clusters) is not TRUE",
fixed=TRUE)
testthat::expect_error(checkGenClusteredDataInputs(p=19, k_unclustered=2,
cluster_size=5, n_clusters=0,
sig_clusters=0, rho=.8,
beta_latent=1.5,
beta_unclustered=-2, snr=1,
sigma_eps_sq=NA),
"n_clusters >= 1 is not TRUE", fixed=TRUE)
testthat::expect_error(checkGenClusteredDataInputs(p=19, k_unclustered=2,
cluster_size=.3, n_clusters=3,
sig_clusters=2, rho=.8,
beta_latent=1.5,
beta_unclustered=-2, snr=1,
sigma_eps_sq=NA),
"cluster_size >= 2 is not TRUE", fixed=TRUE)
testthat::expect_error(checkGenClusteredDataInputs(p=16, k_unclustered=2,
cluster_size=5, n_clusters=3,
sig_clusters=2, rho=.8,
beta_latent=1.5,
beta_unclustered=-2, snr=1,
sigma_eps_sq=NA),
"p >= n_clusters * cluster_size + k_unclustered is not TRUE",
fixed=TRUE)
testthat::expect_error(checkGenClusteredDataInputs(p=19, k_unclustered=2,
cluster_size=5, n_clusters=3,
sig_clusters=2, rho=0,
beta_latent=1.5,
beta_unclustered=-2, snr=1,
sigma_eps_sq=NA),
"rho > 0 is not TRUE", fixed=TRUE)
testthat::expect_error(checkGenClusteredDataInputs(p=19, k_unclustered=2,
cluster_size=5, n_clusters=3,
sig_clusters=2, rho=0.8,
beta_latent=0,
beta_unclustered=-2, snr=1,
sigma_eps_sq=NA),
"beta_latent != 0 is not TRUE", fixed=TRUE)
testthat::expect_error(checkGenClusteredDataInputs(p=19, k_unclustered=2,
cluster_size=5, n_clusters=3,
sig_clusters=2, rho=0.8,
beta_latent=1.5,
beta_unclustered=0, snr=1,
sigma_eps_sq=NA),
"beta_unclustered != 0 is not TRUE", fixed=TRUE)
# k_unclustered
testthat::expect_error(checkGenClusteredDataInputs(p=19, k_unclustered="2",
cluster_size=5, n_clusters=3,
sig_clusters=2, rho=0.8,
beta_latent=1.5,
beta_unclustered=-2, snr=1,
sigma_eps_sq=NA),
"is.numeric(k_unclustered) | is.integer(k_unclustered) is not TRUE",
fixed=TRUE)
testthat::expect_error(checkGenClusteredDataInputs(p=19, k_unclustered=-2,
cluster_size=5, n_clusters=3,
sig_clusters=2, rho=0.8,
beta_latent=1.5,
beta_unclustered=-2, snr=1,
sigma_eps_sq=NA),
"k_unclustered >= 2 is not TRUE", fixed=TRUE)
testthat::expect_error(checkGenClusteredDataInputs(p=19, k_unclustered=2.2,
cluster_size=5, n_clusters=3,
sig_clusters=2, rho=0.8,
beta_latent=1.5,
beta_unclustered=-2, snr=1,
sigma_eps_sq=NA),
"k_unclustered == round(k_unclustered) is not TRUE",
fixed=TRUE)
testthat::expect_error(checkGenClusteredDataInputs(p=19, k_unclustered=2,
cluster_size=5, n_clusters=3,
sig_clusters=2, rho=0.8,
beta_latent=1.5,
beta_unclustered=-2, snr=NA,
sigma_eps_sq=NA),
"Must specify one of snr or sigma_eps_sq", fixed=TRUE)
testthat::expect_error(checkGenClusteredDataInputs(p=19, k_unclustered=2,
cluster_size=5, n_clusters=3,
sig_clusters=2, rho=0.8,
beta_latent=1.5,
beta_unclustered=-2, snr=-.2,
sigma_eps_sq=NA),
"snr > 0 is not TRUE", fixed=TRUE)
testthat::expect_error(checkGenClusteredDataInputs(p=19, k_unclustered=2,
cluster_size=5, n_clusters=3,
sig_clusters=2, rho=0.8,
beta_latent=1.5,
beta_unclustered=-2, snr=NA,
sigma_eps_sq=-.3),
"sigma_eps_sq >= 0 is not TRUE", fixed=TRUE)
})## Test passed with 18 successes 😀.
genZmuY()
#' Generates Z, weak signal features in X, noise features in X, mu, and y
#' from provided parameters
#'
#' @param n Integer or numeric; the number of observations to generate. (The
#' generated X and Z will have n rows, and the generated y and mu will have
#' length n.)
#' @param p Integer or numeric; the number of features to generate. The
#' generated X will have p columns.
#' @param k_unclustered Integer or numeric; the number of features in X that
#' will have nonzero coefficients in the true model for y among those features
#' not generated from the n_clusters latent variables (called "weak signal"
#' features in the simulations from Faletto and Bien 2022). The coefficients on
#' these features will be determined by beta_unclustered. Must be at least 1.
#' @param cluster_size Integer or numeric; for each of the n_clusters latent
#' variables, X will contain cluster_size noisy proxies that are correlated with
#' the latent variable. Must be at least 2.
#' @param n_clusters Integer or numeric; the number of latent variables to
#' generate, each of which will be associated with an observed cluster in X.
#' Must be at least 1. Default is 1.
#' @param sig_clusters Integer or numeric; the number of generated latent
#' features that will have nonzero coefficients in the true model for y (all of
#' them will have coefficient beta_latent). Must be less than or equal to
#' n_clusters. Default is 1.
#' @param beta_latent Integer or numeric; the coefficient used for all
#' sig_clusters latent variables that have nonzero coefficients in the true
#' model for y. Can't equal 0. Default is 1.5.
#' @param beta_unclustered Integer or numeric; the maximum coefficient in the
#' model for y among the k_unclustered features in X not generated from the
#' latent variables. The coefficients of the features will be
#' beta_unclustered/sqrt(1:k_unclustered). Can't equal 0. Default is 1.
#' @param snr Integer or numeric; the signal-to-noise ratio of the response
#' y. If sigma_eps_sq is not specified, the variance of the noise in y will be
#' calculated using the formula sigma_eps_sq = sum(mu^2)/(n * snr). Only one of
#' snr and sigma_eps_sq must be specified. Default is NA.
#' @param sigma_eps_sq Integer or numeric; the variance on the noise added
#' to y. Only one of snr and sigma_eps_sq must be specified. Default is NA.
#' @return A list of the following elements. \item{Z}{The
#' latent features; either a numeric vector (if n_clusters > 1) or a numeric
#' matrix (if n_clusters > 1). Note that (X, Z) is multivariate Gaussian.}
#' \item{mu}{A length `n` numeric vector; the expected response given X, Z, and
#' the true coefficient vector (equal to y minus the added noise).} \item{y}{A
#' length n numeric vector; the response generated from X, the latent features
#' from Z, and the coefficient vector, along with additive noise.}
#' \item{other_X}{A numeric matrix of n observations from a multivariate normal
#' distribution generated using the specified parameters, containing the weak
#' signal features and the noise features that will eventually be in X. (The
#' only missing features are the proxies for the latent features Z.)}
#'
#' @author Gregory Faletto, Jacob Bien
#' @references
<<faletto2022>>
genZmuY <- function(n, p, k_unclustered, cluster_size, n_clusters, sig_clusters,
beta_latent, beta_unclustered, snr, sigma_eps_sq){
# Generate Z, weak signal features, and noise features (total of
# p - n_clusters*(cluster_size - 1)) features)
p_orig_feat_mat <- p - n_clusters*(cluster_size - 1)
stopifnot(p_orig_feat_mat >= k_unclustered + n_clusters)
orig_feat_mat <- matrix(stats::rnorm(n*p_orig_feat_mat), n, p_orig_feat_mat)
# First n_clusters features are Z. Next k_unclustered features are weak
# signal features. Any remaining features are noise features.
Z <- orig_feat_mat[, 1:n_clusters]
other_X <- orig_feat_mat[, (n_clusters + 1):p_orig_feat_mat]
# Ready to create mu and y
if(n_clusters > 1){
if(sig_clusters > 1){
mu <- Z[, 1:sig_clusters] %*% rep(beta_latent, sig_clusters)
} else{
mu <- Z[, 1:sig_clusters] * beta_latent
}
} else{
mu <- Z*beta_latent
}
for(j in 1:k_unclustered){
mu <- mu + beta_unclustered/sqrt(j)*other_X[, j]
}
mu <- as.numeric(mu)
# If SNR is null, use sigma_eps_sq
if(!is.na(sigma_eps_sq)){
sd <- sqrt(sigma_eps_sq)
}else{
sd <- sqrt(sum(mu^2) / (n * snr)) # taking snr = ||mu||^2 /(n * sigma^2)
}
stopifnot(is.numeric(sd) | is.integer(sd))
stopifnot(length(sd) == 1)
stopifnot(!is.na(sd))
stopifnot(sd >= 0)
y <- as.numeric(mu + rnorm(n, mean=0, sd=sd))
return(list(Z=Z, mu=mu, y=y, other_X=other_X))
}No tests for genZmuY().
getNoiseVar():
#' Get variance of noise to add to Z in order to yield proxies X with desired
#' correlations with Z
#'
#' @param cor A numeric vector of desired correlations for each proxy to have
#' with Z. Note: correlations must be positive.
#' @return A vector of variances of independent Gaussian random variables to add
#' to Z in order to yield proxies with the desired correlations with Z.
#' @author Gregory Faletto, Jacob Bien
#' @export
getNoiseVar <- function(cor){
# Correlation between standard normal Z and X = Z + epsilon where epsilon
# is normal, independent of Z, and has mean 0 and variance sig_eps_sq:
#
# E[Z X]/sqrt{Var(Z) Var(X)}
# = (E[Z^2] + E[Z*epsilon])/sqrt{1*(1 + sig_eps_sq)}
# = (1 + 0)/sqrt{1 + sig_eps_sq}
#
# So we have
# cor = 1/sqrt{1 + sig_eps_sq}
# \iff 1 + sig_eps_sq = 1/cor^2
# \iff sig_eps_sq = 1/cor^2 - 1
stopifnot(is.numeric(cor) | is.integer(cor))
stopifnot(all(!is.na(cor)))
stopifnot(length(cor) >= 1)
stopifnot(all(cor > 0))
stopifnot(all(cor <= 1))
return(1/cor^2 - 1)
}No tests for getNoiseVar().
Finally, tests for genClusteredData()
testthat::test_that("genClusteredData works", {
set.seed(23478)
ret <- genClusteredData(n=25, p=19, k_unclustered=2, cluster_size=5,
n_clusters=3, sig_clusters=2, rho=.99,
beta_latent=1.5, beta_unclustered=-2, snr=NA,
sigma_eps_sq=.5)
testthat::expect_true(is.list(ret))
testthat::expect_identical(names(ret), c("X", "y", "Z", "mu"))
testthat::expect_true(is.numeric(ret$X))
testthat::expect_true(is.matrix(ret$X))
testthat::expect_equal(ncol(ret$X), 19)
testthat::expect_equal(nrow(ret$X), 25)
# X is Gaussian with mean 0 and variance 4; expect all observations to lie
# within 5 standard deviations of mean
testthat::expect_true(all(abs(ret$X) < 5*2))
# Test that clusters are correlated--within-cluster correlation should be
# high, correlation with other features should be low
testthat::expect_true(min(cor(ret$X[, 1:5])) > .9)
testthat::expect_true(max(abs(cor(ret$X[, 1:5], ret$X[, 6:19]))) < .6)
testthat::expect_true(min(cor(ret$X[, 6:10])) > .9)
testthat::expect_true(max(abs(cor(ret$X[, 6:10],
ret$X[, c(1:5, 11:19)]))) < .6)
testthat::expect_true(min(cor(ret$X[, 11:15])) > .9)
testthat::expect_true(max(abs(cor(ret$X[, 11:15],
ret$X[, c(1:10, 16:19)]))) < .6)
cor_indeps <- cor(ret$X[, 16:19])
testthat::expect_true(max(abs(cor_indeps[lower.tri(cor_indeps)])) < .6)
testthat::expect_true(is.numeric(ret$y))
testthat::expect_equal(length(ret$y), 25)
testthat::expect_true(is.numeric(ret$Z))
testthat::expect_true(is.matrix(ret$Z))
testthat::expect_equal(nrow(ret$Z), 25)
testthat::expect_equal(ncol(ret$Z), 3)
testthat::expect_true(is.numeric(ret$mu))
testthat::expect_equal(length(ret$mu), 25)
# Because y is Gaussian with mean mu and standard deviation .5 conditional on
# mu, expect all observations to lie within 5 sds of mu
testthat::expect_true(all(abs(ret$y - ret$mu) < 5*.5))
# Specify SNR instead of sigma_eps_sq
ret <- genClusteredData(n=5, p=19, k_unclustered=2, cluster_size=5,
n_clusters=3, sig_clusters=2, rho=.8, beta_latent=1.5,
beta_unclustered=-2, snr=1, sigma_eps_sq=NA)
testthat::expect_true(is.list(ret))
testthat::expect_identical(names(ret), c("X", "y", "Z", "mu"))
# If sigma_eps_sq is specified, snr should be ignored. (Set an SNR that
# implies a very large noise variance to test this)
ret <- genClusteredData(n=5, p=19, k_unclustered=2, cluster_size=5, n_clusters=3,
sig_clusters=2, rho=.8, beta_latent=1.5,
beta_unclustered=-2, snr=.01, sigma_eps_sq=.25)
testthat::expect_true(is.list(ret))
testthat::expect_identical(names(ret), c("X", "y", "Z", "mu"))
# Because y is Gaussian with mean mu and standard deviation .5 conditional on
# mu, expect all observations to lie within 5 sds of mu
testthat::expect_true(all(abs(ret$y - ret$mu) < 5*.25))
# Try a single latent variable (z should be a one-column matrix)
ret <- genClusteredData(n=5, p=19, k_unclustered=2, cluster_size=5,
n_clusters=1, sig_clusters=1, rho=.8, beta_latent=1.5,
beta_unclustered=-2, snr=NA, sigma_eps_sq=.5)
testthat::expect_true(is.list(ret))
testthat::expect_identical(names(ret), c("X", "y", "Z", "mu"))
testthat::expect_true(is.numeric(ret$Z))
testthat::expect_true(is.matrix(ret$Z))
testthat::expect_equal(nrow(ret$Z), 5)
testthat::expect_equal(ncol(ret$Z), 1)
# Bad inputs
testthat::expect_error(genClusteredData(n=5, p=19, k_unclustered=2,
cluster_size=5, n_clusters=3,
sig_clusters="2", rho=.8,
beta_latent=1.5, beta_unclustered=-2,
snr=1, sigma_eps_sq=NA),
"is.numeric(sig_clusters) | is.integer(sig_clusters) is not TRUE",
fixed=TRUE)
testthat::expect_error(genClusteredData(n=5, p=19, k_unclustered=2,
cluster_size=5, n_clusters=3,
sig_clusters=4, rho=.8,
beta_latent=1.5, beta_unclustered=-2,
snr=1, sigma_eps_sq=NA),
"sig_clusters <= n_clusters is not TRUE", fixed=TRUE)
testthat::expect_error(genClusteredData(n=5, p=19, k_unclustered=2,
cluster_size=5, n_clusters=3,
sig_clusters=-1, rho=.8,
beta_latent=1.5, beta_unclustered=-2,
snr=1, sigma_eps_sq=NA),
"sig_clusters >= 0 is not TRUE", fixed=TRUE)
testthat::expect_error(genClusteredData(n=5, p=19, k_unclustered=2,
cluster_size=5, n_clusters=3,
sig_clusters=.6, rho=.8,
beta_latent=1.5,
beta_unclustered=-2, snr=1,
sigma_eps_sq=NA),
"sig_clusters == round(sig_clusters) is not TRUE",
fixed=TRUE)
# n_clusters
testthat::expect_error(genClusteredData(n=5, p=19, k_unclustered=2,
cluster_size=5,
n_clusters="3",
sig_clusters=2, rho=.8,
beta_latent=1.5,
beta_unclustered=-2, snr=1,
sigma_eps_sq=NA),
"is.numeric(n_clusters) | is.integer(n_clusters) is not TRUE",
fixed=TRUE)
testthat::expect_error(genClusteredData(n=5, p=19, k_unclustered=2,
cluster_size=5, n_clusters=3.2,
sig_clusters=2, rho=.8,
beta_latent=1.5,
beta_unclustered=-2, snr=1,
sigma_eps_sq=NA),
"n_clusters == round(n_clusters) is not TRUE",
fixed=TRUE)
testthat::expect_error(genClusteredData(n=5, p=19, k_unclustered=2,
cluster_size=5, n_clusters=0,
sig_clusters=0, rho=.8,
beta_latent=1.5,
beta_unclustered=-2, snr=1,
sigma_eps_sq=NA),
"n_clusters >= 1 is not TRUE", fixed=TRUE)
testthat::expect_error(genClusteredData(n=5, p=19, k_unclustered=2,
cluster_size=.3, n_clusters=3,
sig_clusters=2, rho=.8,
beta_latent=1.5,
beta_unclustered=-2, snr=1,
sigma_eps_sq=NA),
"cluster_size >= 2 is not TRUE", fixed=TRUE)
testthat::expect_error(genClusteredData(p=16, k_unclustered=2,
cluster_size=5, n_clusters=3,
sig_clusters=2, rho=.8,
beta_latent=1.5,
beta_unclustered=-2, snr=1,
sigma_eps_sq=NA),
"p >= n_clusters * cluster_size + k_unclustered is not TRUE",
fixed=TRUE)
testthat::expect_error(genClusteredData(n=5, p=19, k_unclustered=2,
cluster_size=5, n_clusters=3,
sig_clusters=2, rho=0,
beta_latent=1.5,
beta_unclustered=-2, snr=1,
sigma_eps_sq=NA),
"rho > 0 is not TRUE", fixed=TRUE)
testthat::expect_error(genClusteredData(n=5, p=19, k_unclustered=2,
cluster_size=5, n_clusters=3,
sig_clusters=2, rho=0.8,
beta_latent=0,
beta_unclustered=-2, snr=1,
sigma_eps_sq=NA),
"beta_latent != 0 is not TRUE", fixed=TRUE)
testthat::expect_error(genClusteredData(n=5, p=19, k_unclustered=2,
cluster_size=5, n_clusters=3,
sig_clusters=2, rho=0.8,
beta_latent=1.5,
beta_unclustered=0, snr=1,
sigma_eps_sq=NA),
"beta_unclustered != 0 is not TRUE", fixed=TRUE)
# k_unclustered
testthat::expect_error(genClusteredData(n=5, p=19, k_unclustered="2",
cluster_size=5, n_clusters=3,
sig_clusters=2, rho=0.8,
beta_latent=1.5,
beta_unclustered=-2, snr=1,
sigma_eps_sq=NA),
"is.numeric(k_unclustered) | is.integer(k_unclustered) is not TRUE",
fixed=TRUE)
testthat::expect_error(genClusteredData(n=5, p=19, k_unclustered=-2,
cluster_size=5, n_clusters=3,
sig_clusters=2, rho=0.8,
beta_latent=1.5,
beta_unclustered=-2, snr=1,
sigma_eps_sq=NA),
"k_unclustered >= 2 is not TRUE", fixed=TRUE)
testthat::expect_error(genClusteredData(n=5, p=19, k_unclustered=2.2,
cluster_size=5, n_clusters=3,
sig_clusters=2, rho=0.8,
beta_latent=1.5,
beta_unclustered=-2, snr=1,
sigma_eps_sq=NA),
"k_unclustered == round(k_unclustered) is not TRUE",
fixed=TRUE)
testthat::expect_error(genClusteredData(n=5, p=19, k_unclustered=2,
cluster_size=5, n_clusters=3,
sig_clusters=2, rho=0.8,
beta_latent=1.5,
beta_unclustered=-2, snr=NA,
sigma_eps_sq=NA),
"Must specify one of snr or sigma_eps_sq", fixed=TRUE)
testthat::expect_error(genClusteredData(n=5, p=19, k_unclustered=2,
cluster_size=5, n_clusters=3,
sig_clusters=2, rho=0.8,
beta_latent=1.5,
beta_unclustered=-2, snr=-.2,
sigma_eps_sq=NA),
"snr > 0 is not TRUE", fixed=TRUE)
testthat::expect_error(genClusteredData(n=5, p=19, k_unclustered=2,
cluster_size=5, n_clusters=3,
sig_clusters=2, rho=0.8,
beta_latent=1.5,
beta_unclustered=-2, snr=NA,
sigma_eps_sq=-.3),
"sigma_eps_sq >= 0 is not TRUE", fixed=TRUE)
})## Test passed with 52 successes 😀.
#' Generate randomly sampled data including noisy observations of latent
#' variables, where proxies differ in their relevance (noise level)
#'
#' Generate a data set including latent features Z, observed features X (which
#' may include noisy or noiseless observations of the latent features in Z),
#' an observed response y which is a linear model of features from Z and X as
#' well as independent mean zero noise, and mu (the responses from y without
#' the added noise). Data is generated in the same way as in the simulations
#' from Section 5.3 of Faletto and Bien (2022).
#' @param n Integer or numeric; the number of observations to generate. (The
#' generated X and Z will have n rows, and the generated y and mu will have
#' length n.)
#' @param p Integer or numeric; the number of features to generate. The
#' generated X will have p columns.
#' @param k_unclustered Integer or numeric; the number of features in X that
#' will have nonzero coefficients in the true model for y among those features
#' not generated from the n_clusters latent variables (called "weak signal"
#' features in the simulations from Faletto and Bien 2022). The coefficients on
#' these features will be determined by beta_unclustered.
#' @param cluster_size Integer or numeric; for each of the n_clusters latent
#' variables, X will contain cluster_size noisy proxies that are correlated with
#' the latent variable.
#' @param n_strong_cluster_vars Integer or numeric; among the cluster_size
#' proxies in each cluster, the first n_strong_cluster_vars will have a high
#' covariance (rho_high) with the latent variable and the next cluster_size -
#' n_strong_cluster_vars will have a low covariance (rho_low) with the latent
#' variable.
#' @param n_clusters Integer or numeric; the number of latent variables to
#' generate, each of which will be associated with an observed cluster in X.
#' Must be at least 1. Default is 1.
#' @param sig_clusters Integer or numeric; the number of generated latent
#' features that will have nonzero coefficients in the true model for y (all of
#' them will have coefficient beta_latent). Must be less than or equal to
#' n_clusters. Default is 1.
#' @param rho_high Integer or numeric; the covariance of the "strong proxies" in
#' each cluster with the latent variable (and each other). Note that the
#' correlation between the "strong proxy" features in the cluster will be
#' rho_high/var. rho_high cannot equal 0 and must be at least as large as
#' rho_low. Default is 0.9.
#' @param rho_low Integer or numeric; the covariance of the "weak proxies" in
#' each cluster with the latent variable (and each other). Note that the
#' correlation between the "weak proxy" features in the cluster will be
#' rho_low/var. rho_low cannot equal 0 and must be no larger than rho_high.
#' Default is 0.5.
#' @param beta_latent Integer or numeric; the coefficient used for all
#' sig_clusters latent variables that have nonzero coefficients in the true
#' model for y. Can't equal 0. Default is 1.5.
#' @param beta_unclustered Integer or numeric; the maximum coefficient in the
#' model for y among the k_unclustered features in X not generated from the
#' latent variables. The coefficients of the features will be
#' beta_unclustered/sqrt(1:k_unclustered). Can't equal 0. Default is 1.
#' @param snr Integer or numeric; the signal-to-noise ratio of the response
#' y. If sigma_eps_sq is not specified, the variance of the noise in y will be
#' calculated using the formula sigma_eps_sq = sum(mu^2)/(n * snr). Only one of
#' snr and sigma_eps_sq must be specified. Default is NA.
#' @param sigma_eps_sq Integer or numeric; the variance on the noise added
#' to y. Only one of snr and sigma_eps_sq must be specified. Default is NA.
#' @return A list of the following elements. \item{X}{An n x p numeric matrix of
#' n observations from a p-dimensional multivariate normal distribution
#' generated using the specified parameters. The first n_clusters times
#' cluster_size features will be the clusters of features correlated with the
#' n_clusters latent variables. The next k_unclustered features will be the
#' "weak signal" features, and the remaining p - n_clusters*cluster_size -
#' k_unclustered features will be the unclustered noise features.} \item{y}{A
#' length n numeric vector; the response generated from X, the latent features
#' from Z, and the coefficient vector, along with additive noise.} \item{Z}{The
#' latent features; either a numeric vector (if n_clusters > 1) or a numeric
#' matrix (if n_clusters > 1). Note that (X, Z) is multivariate Gaussian.}
#' \item{mu}{A length `n` numeric vector; the expected response given X, Z, and
#' the true coefficient vector (equal to y minus the added noise).}
#' @author Gregory Faletto, Jacob Bien
#' @references
<<faletto2022>>
#' @export
genClusteredDataWeighted <- function(n, p, k_unclustered, cluster_size,
n_strong_cluster_vars, n_clusters=1, sig_clusters=1, rho_high=0.9,
rho_low=0.5, beta_latent=1.5, beta_unclustered=1, snr=as.numeric(NA),
sigma_eps_sq=as.numeric(NA)){
# Check inputs
checkGenClusteredDataWeightedInputs(p, k_unclustered, cluster_size,
n_strong_cluster_vars, n_clusters, sig_clusters, rho_high, rho_low,
beta_latent, beta_unclustered, snr, sigma_eps_sq)
ret <- genZmuY(n=n, p=p, k_unclustered=k_unclustered,
cluster_size=cluster_size, n_clusters=n_clusters,
sig_clusters=sig_clusters, beta_latent=beta_latent,
beta_unclustered=beta_unclustered, snr=snr, sigma_eps_sq=sigma_eps_sq)
Z <- ret$Z
y <- ret$y
mu <- ret$mu
other_X <- ret$other_X
# Finally, generate clusters of proxies to complete X.
noise_var_high <- getNoiseVar(rho_high)
noise_var_low <- getNoiseVar(rho_low)
# Create matrix of proxies
Z <- as.matrix(Z)
proxy_mat <- matrix(as.numeric(NA), n, n_clusters*cluster_size)
for(i in 1:n_clusters){
for(j in 1:n_strong_cluster_vars){
proxy_mat[, (i - 1)*cluster_size + j] <- Z[, i] + rnorm(n,
mean=0, sd=sqrt(noise_var_high))
}
for(j in (n_strong_cluster_vars + 1):cluster_size){
proxy_mat[, (i - 1)*cluster_size + j] <- Z[, i] + rnorm(n,
mean=0, sd=sqrt(noise_var_low))
}
}
X <- cbind(proxy_mat, other_X)
# Check output
stopifnot(length(mu) == n)
stopifnot(nrow(X) == n)
stopifnot(ncol(X) == p)
if(any(!is.na(Z))){
stopifnot(nrow(Z) == n)
stopifnot(ncol(Z) == n_clusters)
}
return(list(X=X, y=y, Z=Z, mu=mu))
}checkGenClusteredDataWeightedInputs()
#' Check inputs to genClusteredDataWeighted
#'
#' @param p Integer or numeric; the number of features to generate. The
#' generated X will have p columns.
#' @param k_unclustered Integer or numeric; the number of features in X that
#' will have nonzero coefficients in the true model for y among those features
#' not generated from the n_clusters latent variables (called "weak signal"
#' features in the simulations from Faletto and Bien 2022). The coefficients on
#' these features will be determined by beta_unclustered.
#' @param cluster_size Integer or numeric; for each of the n_clusters latent
#' variables, X will contain cluster_size noisy proxies that are correlated with
#' the latent variable.
#' @param n_strong_cluster_vars Integer or numeric; among the cluster_size
#' proxies in each cluster, n_strong_cluster_vars will have a high covariance
#' (rho_high) with the latent variable and cluster_size - n_strong_cluster_vars
#' will have a low covariance (rho_low) with the latent variable.
#' @param n_clusters Integer or numeric; the number of latent variables to
#' generate, each of which will be associated with an observed cluster in X.
#' Must be at least 1. Default is 1.
#' @param sig_clusters Integer or numeric; the number of generated latent
#' features that will have nonzero coefficients in the true model for y (all of
#' them will have coefficient beta_latent). Must be less than or equal to
#' n_clusters. Default is 1.
#' @param rho_high Integer or numeric; the covariance of the "strong proxies" in
#' each cluster with the latent variable (and each other). Note that the
#' correlation between the "strong proxy" features in the cluster will be
#' rho_high/var. rho_high cannot equal 0 and must be at least as large as
#' rho_low. Default is 0.9.
#' @param rho_low Integer or numeric; the covariance of the "weak proxies" in
#' each cluster with the latent variable (and each other). Note that the
#' correlation between the "weak proxy" features in the cluster will be
#' rho_low/var. rho_low cannot equal 0 and must be no larger than rho_high.
#' Default is 0.5.
#' @param beta_latent Integer or numeric; the coefficient used for all
#' sig_clusters latent variables that have nonzero coefficients in the true
#' model for y. Can't equal 0. Default is 1.5.
#' @param beta_unclustered Integer or numeric; the maximum coefficient in the
#' model for y among the k_unclustered features in X not generated from the
#' latent variables. The coefficients of the features will be
#' beta_unclustered/sqrt(1:k_unclustered). Can't equal 0. Default is 1.
#' @param snr Integer or numeric; the signal-to-noise ratio of the response
#' y. If sigma_eps_sq is not specified, the variance of the noise in y will be
#' calculated using the formula sigma_eps_sq = sum(mu^2)/(n * snr). Only one of
#' snr and sigma_eps_sq must be specified. Default is NA.
#' @param sigma_eps_sq Integer or numeric; the variance on the noise added
#' to y. Only one of snr and sigma_eps_sq must be specified. Default is NA.
#' @return A list of the following elements. \item{X}{An n x p numeric matrix of
#' n observations from a p-dimensional multivariate normal distribution
#' generated using the specified parameters. The first n_clusters times
#' cluster_size features will be the clusters of features correlated with the
#' n_clusters latent variables. The next k_unclustered features will be the
#' "weak signal" features, and the remaining p - n_clusters*cluster_size -
#' k_unclustered features will be the unclustered noise features.} \item{y}{A
#' length n numeric vector; the response generated from X, the latent features
#' from Z, and the coefficient vector, along with additive noise.} \item{Z}{The
#' latent features; either a numeric vector (if n_clusters > 1) or a numeric
#' matrix (if n_clusters > 1). Note that (X, Z) is multivariate Gaussian.}
#' \item{mu}{A length `n` numeric vector; the expected response given X, Z, and
#' the true coefficient vector (equal to y minus the added noise).}
#' @author Gregory Faletto, Jacob Bien
checkGenClusteredDataWeightedInputs <- function(p, k_unclustered, cluster_size,
n_strong_cluster_vars, n_clusters, sig_clusters, rho_high, rho_low,
beta_latent, beta_unclustered, snr, sigma_eps_sq){
stopifnot(is.numeric(sig_clusters) | is.integer(sig_clusters))
stopifnot(sig_clusters <= n_clusters)
stopifnot(sig_clusters >= 0)
stopifnot(sig_clusters == round(sig_clusters))
stopifnot(is.numeric(n_clusters) | is.integer(n_clusters))
stopifnot(n_clusters == round(n_clusters))
# TODO(gregfaletto): is it easy to remove the requirement that n_clusters is
# at least 1 (so that it's possible to generate data with no latent
# features)? If so, should only check that cluster_size >= 1 if n_clusters
# >= 1, and in makeCovarianceMatrix function only need block_size >= 1
# rather than 2.
stopifnot(n_clusters >= 1)
stopifnot(cluster_size >= 2)
stopifnot(is.integer(n_strong_cluster_vars) |
is.numeric(n_strong_cluster_vars))
stopifnot(!is.na(n_strong_cluster_vars))
stopifnot(length(n_strong_cluster_vars) == 1)
stopifnot(n_strong_cluster_vars == round(n_strong_cluster_vars))
stopifnot(n_strong_cluster_vars >= 1)
stopifnot(n_strong_cluster_vars < cluster_size)
stopifnot(rho_high <= 1)
stopifnot(rho_high > 0)
stopifnot(rho_low > 0)
stopifnot(rho_high >= rho_low)
stopifnot(beta_latent != 0)
stopifnot(beta_unclustered != 0)
stopifnot(is.numeric(k_unclustered) | is.integer(k_unclustered))
stopifnot(k_unclustered >= 2)
stopifnot(k_unclustered == round(k_unclustered))
stopifnot(p >= n_clusters*cluster_size + k_unclustered)
# Same as make_sparse_blocked_linear_model_random, but ith coefficient
# of weak signal features is beta_unclustered/sqrt(i) in order to have
# a definitive ranking of weak signal features.
if(is.na(snr) & is.na(sigma_eps_sq)){
stop("Must specify one of snr or sigma_eps_sq")
}
if(!is.na(snr)){
stopifnot(snr > 0)
}
if(!is.na(sigma_eps_sq)){
stopifnot(sigma_eps_sq > 0)
}
}No tests for checkGenClusteredDataWeightedInputs()
Tests for genClusteredDataWeighted()
testthat::test_that("genClusteredDataWeighted works", {
set.seed(23478)
ret <- genClusteredDataWeighted(n=25, p=19, k_unclustered=2, cluster_size=5,
n_strong_cluster_vars=3, n_clusters=3,
sig_clusters=2, rho_high=.99, rho_low=.5,
beta_latent=1.5, beta_unclustered=-2,
snr=NA, sigma_eps_sq=.5)
testthat::expect_true(is.list(ret))
testthat::expect_identical(names(ret), c("X", "y", "Z", "mu"))
testthat::expect_true(is.numeric(ret$X))
testthat::expect_true(is.matrix(ret$X))
testthat::expect_equal(ncol(ret$X), 19)
testthat::expect_equal(nrow(ret$X), 25)
# X is Gaussian with mean 0 and variance 4; expect all observations to lie
# within 5 standard deviations of mean
testthat::expect_true(all(abs(ret$X) < 5*2))
# Test that clusters are correlated--within-cluster correlation should be
# high, correlation with other features should be low
testthat::expect_true(min(cor(ret$X[, 1:3])) > .9)
testthat::expect_true(max(abs(cor(ret$X[, 1:5], ret$X[, 6:19]))) < .6)
testthat::expect_true(min(cor(ret$X[, 6:8])) > .9)
testthat::expect_true(max(abs(cor(ret$X[, 6:10],
ret$X[, c(1:5, 11:19)]))) < .6)
testthat::expect_true(min(cor(ret$X[, 11:13])) > .9)
testthat::expect_true(max(abs(cor(ret$X[, 11:15],
ret$X[, c(1:10, 16:19)]))) < .7)
cor_indeps <- cor(ret$X[, 16:19])
testthat::expect_true(max(abs(cor_indeps[lower.tri(cor_indeps)])) < .6)
testthat::expect_true(is.numeric(ret$y))
testthat::expect_equal(length(ret$y), 25)
testthat::expect_true(is.numeric(ret$Z))
testthat::expect_true(is.matrix(ret$Z))
testthat::expect_equal(nrow(ret$Z), 25)
testthat::expect_equal(ncol(ret$Z), 3)
testthat::expect_true(is.numeric(ret$mu))
testthat::expect_equal(length(ret$mu), 25)
# Because y is Gaussian with mean mu and standard deviation .5 conditional on
# mu, expect all observations to lie within 5 sds of mu
testthat::expect_true(all(abs(ret$y - ret$mu) < 5*.5))
# Specify SNR instead of sigma_eps_sq
ret <- genClusteredDataWeighted(n=25, p=19, k_unclustered=2, cluster_size=5,
n_strong_cluster_vars=3, n_clusters=3,
sig_clusters=2, rho_high=.99, rho_low=.5,
beta_latent=1.5, beta_unclustered=-2,
snr=1, sigma_eps_sq=NA)
testthat::expect_true(is.list(ret))
testthat::expect_identical(names(ret), c("X", "y", "Z", "mu"))
# If sigma_eps_sq is specified, snr should be ignored. (Set an SNR that
# implies a very large noise variance to test this)
ret <- genClusteredDataWeighted(n=25, p=19, k_unclustered=2, cluster_size=5,
n_strong_cluster_vars=3, n_clusters=3,
sig_clusters=2, rho_high=.99, rho_low=.5,
beta_latent=1.5, beta_unclustered=-2,
snr=.01, sigma_eps_sq=.25)
testthat::expect_true(is.list(ret))
testthat::expect_identical(names(ret), c("X", "y", "Z", "mu"))
# Because y is Gaussian with mean mu and standard deviation .5 conditional on
# mu, expect all observations to lie within 5 sds of mu
testthat::expect_true(all(abs(ret$y - ret$mu) < 5*sqrt(.25)))
# Try a single latent variable (z should be a one-column matrix)
ret <- genClusteredDataWeighted(n=25, p=19, k_unclustered=2, cluster_size=5,
n_strong_cluster_vars=3, n_clusters=1,
sig_clusters=1, rho_high=.99, rho_low=.5,
beta_latent=1.5, beta_unclustered=-2,
snr=NA, sigma_eps_sq=.5)
testthat::expect_true(is.list(ret))
testthat::expect_identical(names(ret), c("X", "y", "Z", "mu"))
testthat::expect_true(is.numeric(ret$Z))
testthat::expect_true(is.matrix(ret$Z))
testthat::expect_equal(nrow(ret$Z), 25)
testthat::expect_equal(ncol(ret$Z), 1)
})## Test passed with 34 successes 🥳.
genClusteredDataWeightedRandom()
#' Generate randomly sampled data including noisy observations of latent
#' variables, where proxies differ in their relevance (noise level)
#'
#' Generate a data set including latent features Z, observed features X (which
#' may include noisy or noiseless observations of the latent features in Z),
#' an observed response y which is a linear model of features from Z and X as
#' well as independent mean zero noise, and mu (the responses from y without
#' the added noise).
#' @param n Integer or numeric; the number of observations to generate. (The
#' generated X and Z will have n rows, and the generated y and mu will have
#' length n.)
#' @param p Integer or numeric; the number of features to generate. The
#' generated X will have p columns.
#' @param k_unclustered Integer or numeric; the number of features in X that
#' will have nonzero coefficients in the true model for y among those features
#' not generated from the n_clusters latent variables (called "weak signal"
#' features in the simulations from Faletto and Bien 2022). The coefficients on
#' these features will be determined by beta_unclustered.
#' @param cluster_size Integer or numeric; for each of the n_clusters latent
#' variables, X will contain cluster_size noisy proxies that are correlated with
#' the latent variable.
#' @param n_clusters Integer or numeric; the number of latent variables to
#' generate, each of which will be associated with an observed cluster in X.
#' Must be at least 1. Default is 1.
#' @param sig_clusters Integer or numeric; the number of generated latent
#' features that will have nonzero coefficients in the true model for y (all of
#' them will have coefficient beta_latent). Must be less than or equal to
#' n_clusters. Default is 1.
#' @param rho_high Integer or numeric; the maximum correlation of the proxies
#' each cluster with each other. Default is 1.
#' @param rho_low Integer or numeric; the minimum correlation of the proxies in
#' each cluster with each other. rho_low cannot equal 0 and must be no larger
#' than rho_high. Default is 0.5.
#' @param beta_latent Integer or numeric; the coefficient used for all
#' sig_clusters latent variables that have nonzero coefficients in the true
#' model for y. Can't equal 0. Default is 1.5.
#' @param beta_unclustered Integer or numeric; the maximum coefficient in the
#' model for y among the k_unclustered features in X not generated from the
#' latent variables. The coefficients of the features will be
#' beta_unclustered/sqrt(1:k_unclustered). Can't equal 0. Default is 1.
#' @param snr Integer or numeric; the signal-to-noise ratio of the response
#' y. If sigma_eps_sq is not specified, the variance of the noise in y will be
#' calculated using the formula sigma_eps_sq = sum(mu^2)/(n * snr). Only one of
#' snr and sigma_eps_sq must be specified. Default is NA.
#' @param sigma_eps_sq Integer or numeric; the variance on the noise added
#' to y. Only one of snr and sigma_eps_sq must be specified. Default is NA.
#' @return A list of the following elements. \item{X}{An n x p numeric matrix of
#' n observations from a p-dimensional multivariate normal distribution
#' generated using the specified parameters. The first n_clusters times
#' cluster_size features will be the clusters of features correlated with the
#' n_clusters latent variables. The next k_unclustered features will be the
#' "weak signal" features, and the remaining p - n_clusters*cluster_size -
#' k_unclustered features will be the unclustered noise features.} \item{y}{A
#' length n numeric vector; the response generated from X, the latent features
#' from Z, and the coefficient vector, along with additive noise.} \item{Z}{The
#' latent features; either a numeric vector (if n_clusters > 1) or a numeric
#' matrix (if n_clusters > 1). Note that (X, Z) is multivariate Gaussian.}
#' \item{mu}{A length `n` numeric vector; the expected response given X, Z, and
#' the true coefficient vector (equal to y minus the added noise).}
#' @author Gregory Faletto, Jacob Bien
#' @references
<<faletto2022>>
#' @export
genClusteredDataWeightedRandom <- function(n, p, k_unclustered, cluster_size,
n_clusters=1, sig_clusters=1, rho_high=1, rho_low=0.5, beta_latent=1.5,
beta_unclustered=1, snr=as.numeric(NA), sigma_eps_sq=as.numeric(NA)){
# Check inputs
checkGenClusteredDataWeightedRandomInputs(p, k_unclustered, cluster_size,
n_clusters, sig_clusters, rho_high, rho_low,
beta_latent, beta_unclustered, snr, sigma_eps_sq)
ret <- genZmuY(n=n, p=p, k_unclustered=k_unclustered,
cluster_size=cluster_size, n_clusters=n_clusters,
sig_clusters=sig_clusters, beta_latent=beta_latent,
beta_unclustered=beta_unclustered, snr=snr, sigma_eps_sq=sigma_eps_sq)
Z <- ret$Z
y <- ret$y
mu <- ret$mu
other_X <- ret$other_X
# Finally, generate clusters of proxies to complete X.
# Create matrix of proxies
Z <- as.matrix(Z)
proxy_mat <- matrix(as.numeric(NA), n, n_clusters*cluster_size)
for(i in 1:n_clusters){
for(j in 1:cluster_size){
# Choose correlation at random
rho_ij <- runif(n=1, min=rho_low, max=rho_high)
# Get noise variance
noise_var_ij <- getNoiseVar(rho_ij)
proxy_mat[, (i - 1)*cluster_size + j] <- Z[, i] + rnorm(n,
mean=0, sd=sqrt(noise_var_ij))
}
}
X <- cbind(proxy_mat, other_X)
# Check output
stopifnot(length(mu) == n)
stopifnot(nrow(X) == n)
stopifnot(ncol(X) == p)
if(any(!is.na(Z))){
stopifnot(nrow(Z) == n)
stopifnot(ncol(Z) == n_clusters)
}
return(list(X=X, y=y, Z=Z, mu=mu))
}checkGenClusteredDataWeightedRandomInputs()
#' Check inputs to genClusteredDataWeightedRandom
#'
#' @param p Integer or numeric; the number of features to generate. The
#' generated X will have p columns.
#' @param k_unclustered Integer or numeric; the number of features in X that
#' will have nonzero coefficients in the true model for y among those features
#' not generated from the n_clusters latent variables (called "weak signal"
#' features in the simulations from Faletto and Bien 2022). The coefficients on
#' these features will be determined by beta_unclustered.
#' @param cluster_size Integer or numeric; for each of the n_clusters latent
#' variables, X will contain cluster_size noisy proxies that are correlated with
#' the latent variable.
#' @param n_clusters Integer or numeric; the number of latent variables to
#' generate, each of which will be associated with an observed cluster in X.
#' Must be at least 1. Default is 1.
#' @param sig_clusters Integer or numeric; the number of generated latent
#' features that will have nonzero coefficients in the true model for y (all of
#' them will have coefficient beta_latent). Must be less than or equal to
#' n_clusters. Default is 1.
#' @param rho_high Integer or numeric; the maximum correlation of the proxies
#' each cluster with each other. Default is 1.
#' @param rho_low Integer or numeric; the minimum correlation of the proxies in
#' each cluster with each other. rho_low cannot equal 0 and must be no larger
#' than rho_high. Default is 0.5.
#' @param beta_latent Integer or numeric; the coefficient used for all
#' sig_clusters latent variables that have nonzero coefficients in the true
#' model for y. Can't equal 0. Default is 1.5.
#' @param beta_unclustered Integer or numeric; the maximum coefficient in the
#' model for y among the k_unclustered features in X not generated from the
#' latent variables. The coefficients of the features will be
#' beta_unclustered/sqrt(1:k_unclustered). Can't equal 0. Default is 1.
#' @param snr Integer or numeric; the signal-to-noise ratio of the response
#' y. If sigma_eps_sq is not specified, the variance of the noise in y will be
#' calculated using the formula sigma_eps_sq = sum(mu^2)/(n * snr). Only one of
#' snr and sigma_eps_sq must be specified. Default is NA.
#' @param sigma_eps_sq Integer or numeric; the variance on the noise added
#' to y. Only one of snr and sigma_eps_sq must be specified. Default is NA.
#' @return A list of the following elements. \item{X}{An n x p numeric matrix of
#' n observations from a p-dimensional multivariate normal distribution
#' generated using the specified parameters. The first n_clusters times
#' cluster_size features will be the clusters of features correlated with the
#' n_clusters latent variables. The next k_unclustered features will be the
#' "weak signal" features, and the remaining p - n_clusters*cluster_size -
#' k_unclustered features will be the unclustered noise features.} \item{y}{A
#' length n numeric vector; the response generated from X, the latent features
#' from Z, and the coefficient vector, along with additive noise.} \item{Z}{The
#' latent features; either a numeric vector (if n_clusters > 1) or a numeric
#' matrix (if n_clusters > 1). Note that (X, Z) is multivariate Gaussian.}
#' \item{mu}{A length `n` numeric vector; the expected response given X, Z, and
#' the true coefficient vector (equal to y minus the added noise).}
#' @author Gregory Faletto, Jacob Bien
checkGenClusteredDataWeightedRandomInputs <- function(p, k_unclustered,
cluster_size, n_clusters, sig_clusters, rho_high, rho_low, beta_latent,
beta_unclustered, snr, sigma_eps_sq){
stopifnot(is.numeric(sig_clusters) | is.integer(sig_clusters))
stopifnot(sig_clusters <= n_clusters)
stopifnot(sig_clusters >= 0)
stopifnot(sig_clusters == round(sig_clusters))
stopifnot(is.numeric(n_clusters) | is.integer(n_clusters))
stopifnot(n_clusters == round(n_clusters))
# TODO(gregfaletto): is it easy to remove the requirement that n_clusters is
# at least 1 (so that it's possible to generate data with no latent
# features)? If so, should only check that cluster_size >= 1 if n_clusters
# >= 1, and in makeCovarianceMatrix function only need block_size >= 1
# rather than 2.
stopifnot(n_clusters >= 1)
stopifnot(cluster_size >= 2)
stopifnot(rho_high <= 1)
stopifnot(rho_high > 0)
stopifnot(rho_low > 0)
stopifnot(rho_high >= rho_low)
stopifnot(beta_latent != 0)
stopifnot(beta_unclustered != 0)
stopifnot(is.numeric(k_unclustered) | is.integer(k_unclustered))
stopifnot(k_unclustered >= 2)
stopifnot(k_unclustered == round(k_unclustered))
stopifnot(p >= n_clusters*cluster_size + k_unclustered)
# Same as make_sparse_blocked_linear_model_random, but ith coefficient
# of weak signal features is beta_unclustered/sqrt(i) in order to have
# a definitive ranking of weak signal features.
if(is.na(snr) & is.na(sigma_eps_sq)){
stop("Must specify one of snr or sigma_eps_sq")
}
if(!is.na(snr)){
stopifnot(snr > 0)
}
if(!is.na(sigma_eps_sq)){
stopifnot(sigma_eps_sq > 0)
}
}No tests for checkGenClusteredDataWeightedRandomInputs()
Tests for genClusteredDataWeightedRandom()
testthat::test_that("genClusteredDataWeightedRandom works", {
set.seed(23478)
ret <- genClusteredDataWeightedRandom(n=25, p=19, k_unclustered=2, cluster_size=5,
n_clusters=3,
sig_clusters=2, rho_high=1, rho_low=.5,
beta_latent=1.5, beta_unclustered=-2,
snr=NA, sigma_eps_sq=.5)
testthat::expect_true(is.list(ret))
testthat::expect_identical(names(ret), c("X", "y", "Z", "mu"))
testthat::expect_true(is.numeric(ret$X))
testthat::expect_true(is.matrix(ret$X))
testthat::expect_equal(ncol(ret$X), 19)
testthat::expect_equal(nrow(ret$X), 25)
# X is Gaussian with mean 0 and variance 4; expect all observations to lie
# within 5 standard deviations of mean
testthat::expect_true(all(abs(ret$X) < 5*2))
# Test that clusters are correlated--within-cluster correlation should be
# high, correlation with other features should be low
testthat::expect_true(min(cor(ret$X[, 1:3])) > .2)
testthat::expect_true(max(abs(cor(ret$X[, 1:5], ret$X[, 6:19]))) < .6)
testthat::expect_true(min(cor(ret$X[, 6:8])) > .2)
testthat::expect_true(max(abs(cor(ret$X[, 6:10],
ret$X[, c(1:5, 11:19)]))) < .6)
testthat::expect_true(min(cor(ret$X[, 11:13])) > .2)
testthat::expect_true(max(abs(cor(ret$X[, 11:15],
ret$X[, c(1:10, 16:19)]))) < .7)
cor_indeps <- cor(ret$X[, 16:19])
testthat::expect_true(max(abs(cor_indeps[lower.tri(cor_indeps)])) < .6)
testthat::expect_true(is.numeric(ret$y))
testthat::expect_equal(length(ret$y), 25)
testthat::expect_true(is.numeric(ret$Z))
testthat::expect_true(is.matrix(ret$Z))
testthat::expect_equal(nrow(ret$Z), 25)
testthat::expect_equal(ncol(ret$Z), 3)
testthat::expect_true(is.numeric(ret$mu))
testthat::expect_equal(length(ret$mu), 25)
# Because y is Gaussian with mean mu and standard deviation .5 conditional on
# mu, expect all observations to lie within 5 sds of mu
testthat::expect_true(all(abs(ret$y - ret$mu) < 5*.5))
# Specify SNR instead of sigma_eps_sq
ret <- genClusteredDataWeightedRandom(n=25, p=19, k_unclustered=2, cluster_size=5,
n_clusters=3,
sig_clusters=2, rho_high=1, rho_low=.5,
beta_latent=1.5, beta_unclustered=-2,
snr=1, sigma_eps_sq=NA)
testthat::expect_true(is.list(ret))
testthat::expect_identical(names(ret), c("X", "y", "Z", "mu"))
# If sigma_eps_sq is specified, snr should be ignored. (Set an SNR that
# implies a very large noise variance to test this)
ret <- genClusteredDataWeightedRandom(n=25, p=19, k_unclustered=2, cluster_size=5,
n_clusters=3,
sig_clusters=2, rho_high=.99, rho_low=.5,
beta_latent=1.5, beta_unclustered=-2,
snr=.01, sigma_eps_sq=.25)
testthat::expect_true(is.list(ret))
testthat::expect_identical(names(ret), c("X", "y", "Z", "mu"))
# Because y is Gaussian with mean mu and standard deviation .5 conditional on
# mu, expect all observations to lie within 5 sds of mu
testthat::expect_true(all(abs(ret$y - ret$mu) < 5*sqrt(.25)))
# Try a single latent variable (z should be a one-column matrix)
ret <- genClusteredDataWeightedRandom(n=25, p=19, k_unclustered=2, cluster_size=5,
n_clusters=1,
sig_clusters=1, rho_high=1, rho_low=.5,
beta_latent=1.5, beta_unclustered=-2,
snr=NA, sigma_eps_sq=.5)
testthat::expect_true(is.list(ret))
testthat::expect_identical(names(ret), c("X", "y", "Z", "mu"))
testthat::expect_true(is.numeric(ret$Z))
testthat::expect_true(is.matrix(ret$Z))
testthat::expect_equal(nrow(ret$Z), 25)
testthat::expect_equal(ncol(ret$Z), 1)
})## Test passed with 34 successes 😸.
#' Get lambda value for lasso
#'
#' Chooses a lambda value for the lasso used on a subsample of size n/2 (as in
#' cluster stability selection) by cross-validation.
#' @param X An n x p numeric matrix (preferably) or a data.frame (which will
#' be coerced internally to a matrix by the function model.matrix) containing
#' the p >= 2 features/predictors that will be used by cluster stability
#' selection.
#' @param y The response; an n-dimensional numeric or integer vector. (Unlike
#' in the more general css setup, this response must be real-valued since
#' lambda will be determined using the lasso with cross-validation.)
#' @param lambda_choice Character; either "min" or "1se". If "min", chooses
#' the lambda that minimizes the cross-validated error; if "1se", chooses the
#' largest lambda within one standard error of the minimum error lambda
#' (resulting in a smaller selected set, which may be desirable because the
#' model size corresponding to the minimum error lambda tends to be larger
#' than optimal. See, for example, Bühlmann and Meinshausen 2006, Prop. 1 and
#' Bühlmann and van de Geer 2011, Section 2.5.1.). Default is "1se".
#' @param nfolds Numeric or integer; the number of folds for cross-validation.
#' Must be at least 4 (as specified by cv.glmnet). Default is 10.
#' @param alpha Numeric; the elastic net mixing parameter. Default is 1 (in
#' which case the penalty is for lasso)
#' @return A numeric; the selected value of lambda.
#' @author Gregory Faletto, Jacob Bien
#' @references Bühlmann, P., & Meinshausen, N. (2006). High-Dimensional Graphs
#' and Variable Selection With the Lasso. \emph{The Annals of Statistics},
#' 34(3), 1436–1462. \url{https://doi.org/10.1214/009053606000000281}.
#' \cr Peter Bühlmann and Sara van de Geer. Statistics for High-Dimensional
#' Data. \emph{Springer Series in Statistics}. Springer, Heidelberg, 2011. ISBN
#' 978-3-642-20191-2. \url{http://dx.doi.org/10.1007/978-3-642-20192-9}. \cr
#' Jerome Friedman, Trevor Hastie, Robert Tibshirani (2010). Regularization
#' Paths for Generalized Linear Models via Coordinate Descent. \emph{Journal of
#' Statistical Software}, 33(1), 1-22. URL \url{https://www.jstatsoft.org/v33/i01/}.
#' @export
getLassoLambda <- function(X, y, lambda_choice="1se", nfolds=10, alpha=1){
stopifnot(is.character(lambda_choice))
stopifnot(length(lambda_choice) == 1)
stopifnot(!is.na(lambda_choice))
stopifnot(lambda_choice %in% c("min", "1se"))
stopifnot(is.matrix(X))
stopifnot(is.numeric(X) | is.integer(X))
n <- nrow(X)
stopifnot(is.numeric(nfolds) | is.integer(nfolds))
stopifnot(length(nfolds) == 1)
stopifnot(nfolds == round(nfolds))
stopifnot(nfolds > 3)
stopifnot(is.numeric(alpha) | is.integer(alpha))
stopifnot(length(alpha) == 1)
stopifnot(!is.na(alpha))
stopifnot(alpha >= 0)
stopifnot(alpha <= 1)
# Since we are using the lasso, we require y to be a real-valued response
# (unlike for the general cluster stability selection procedure, where y
# can have a more general format as long as a suitable feature selection
# function is provided by the user)
stopifnot(is.numeric(y) | is.integer(y))
stopifnot(n == length(y))
# Sample size to use: inflate n/2 by a factor of nfolds/(nfolds - 1),
# so that each individual lasso fit is of size floor(n/2)
n_sample <- min(round(n/2*nfolds/(nfolds - 1)), n)
nfolds <- min(nfolds, n_sample)
inds_size <- sample(1:n, n_sample)
size_results <- glmnet::cv.glmnet(x=X[inds_size, ], y=y[inds_size],
family="gaussian", nfolds=nfolds, alpha=alpha)
lambda_ret <- size_results[[paste("lambda.", lambda_choice, sep="")]]
# Check output
stopifnot(length(lambda_ret) == 1)
stopifnot(is.numeric(lambda_ret) | is.integer(lambda_ret))
stopifnot(lambda_ret >= 0)
return(lambda_ret)
}Tests for getLassoLambda():
testthat::test_that("getLassoLambda works", {
set.seed(7252)
x <- matrix(stats::rnorm(10*6), nrow=10, ncol=6)
y <- stats::rnorm(10)
ret <- getLassoLambda(X=x, y=y, lambda_choice="1se", nfolds=4)
testthat::expect_true(is.numeric(ret) | is.integer(ret))
testthat::expect_true(!is.na(ret))
testthat::expect_equal(length(ret), 1)
testthat::expect_true(ret >= 0)
ret <- getLassoLambda(X=x, y=y, lambda_choice="min", nfolds=5)
testthat::expect_true(is.numeric(ret) | is.integer(ret))
testthat::expect_true(!is.na(ret))
testthat::expect_equal(length(ret), 1)
testthat::expect_true(ret >= 0)
# Different mixing parameter
ret <- getLassoLambda(X=x, y=y, lambda_choice="min", nfolds=5, alpha=0.5)
testthat::expect_true(is.numeric(ret) | is.integer(ret))
testthat::expect_true(!is.na(ret))
testthat::expect_equal(length(ret), 1)
testthat::expect_true(ret >= 0)
# Bad inputs
testthat::expect_error(getLassoLambda(X="x", y=y), "is.matrix(X) is not TRUE",
fixed=TRUE)
testthat::expect_error(getLassoLambda(X=x[1:9, ], y=y),
"n == length(y) is not TRUE", fixed=TRUE)
testthat::expect_error(getLassoLambda(X=x, y=TRUE),
"is.numeric(y) | is.integer(y) is not TRUE",
fixed=TRUE)
# Error has quotation marks in it
testthat::expect_error(getLassoLambda(X=x, y=y, lambda_choice="mni"))
testthat::expect_error(getLassoLambda(X=x, y=y,
lambda_choice=c("min", "1se")),
"length(lambda_choice) == 1 is not TRUE",
fixed=TRUE)
testthat::expect_error(getLassoLambda(X=x, y=y, lambda_choice=1),
"is.character(lambda_choice) is not TRUE",
fixed=TRUE)
testthat::expect_error(getLassoLambda(X=x, y=y, nfolds="5"),
"is.numeric(nfolds) | is.integer(nfolds) is not TRUE",
fixed=TRUE)
testthat::expect_error(getLassoLambda(X=x, y=y, nfolds=1:4),
"length(nfolds) == 1 is not TRUE", fixed=TRUE)
testthat::expect_error(getLassoLambda(X=x, y=y, nfolds=3.2),
"nfolds == round(nfolds) is not TRUE", fixed=TRUE)
testthat::expect_error(getLassoLambda(X=x, y=y, nfolds=3),
"nfolds > 3 is not TRUE", fixed=TRUE)
testthat::expect_error(getLassoLambda(X=x, y=y, nfolds=4, alpha=1.2),
"alpha <= 1 is not TRUE", fixed=TRUE)
})## ── Warning: getLassoLambda works ───────────────────────────────────────────────
## Option grouped=FALSE enforced in cv.glmnet, since < 3 observations per fold
## Backtrace:
## â–†
## 1. └─litr (local) getLassoLambda(X = x, y = y, lambda_choice = "1se", nfolds = 4)
## 2. └─glmnet::cv.glmnet(...)
## 3. └─glmnet:::cv.glmnet.raw(...)
## ── Warning: getLassoLambda works ───────────────────────────────────────────────
## Option grouped=FALSE enforced in cv.glmnet, since < 3 observations per fold
## Backtrace:
## â–†
## 1. └─litr (local) getLassoLambda(X = x, y = y, lambda_choice = "min", nfolds = 5)
## 2. └─glmnet::cv.glmnet(...)
## 3. └─glmnet:::cv.glmnet.raw(...)
## ── Warning: getLassoLambda works ───────────────────────────────────────────────
## Option grouped=FALSE enforced in cv.glmnet, since < 3 observations per fold
## Backtrace:
## â–†
## 1. └─litr (local) getLassoLambda(...)
## 2. └─glmnet::cv.glmnet(...)
## 3. └─glmnet:::cv.glmnet.raw(...)
## Test passed with 23 successes 😸.
getModelSize():
#' Automated estimation of model size
#'
#' This function is uses the lasso with cross-validation to estimate the
#' model size. Before using the lasso, in each cluster all features will be
#' dropped from X except the one feature with the highest marginal correlation
#' with y, as in the protolasso (Reid and Tibshirani 2016).
#'
#' @param X An n x p numeric matrix (preferably) or a data.frame (which will
#' be coerced internally to a matrix by the function model.matrix) containing
#' the p >= 2 features/predictors.
#' @param y A length-n numeric vector containing the responses; `y[i]` is the
#' response corresponding to observation `X[i, ]`. (Note that for the css
#' function, y does not have to be a numeric response, but for this function,
#' the underlying selection procedure is the lasso, so y must be a real-valued
#' response.)
#' @param clusters A named list where each entry is an integer vector of indices
#' of features that are in a common cluster, as in the output of css.
#' (The length of list clusters is equal to the number of clusters.) All
#' identified clusters must be non-overlapping, and all features must appear in
#' exactly one cluster (any unclustered features should be in their own
#' "cluster" of size 1). CAUTION: if the provided X is a data.frame that
#' contains a categorical feature with more than two levels, then the resulting
#' matrix made from model.matrix will have a different number of columns than
#' the provided data.frame, some of the feature numbers will change, and the
#' clusters argument will not work properly (in the current version of the
#' package). To get correct results in this case, please use model.matrix to
#' convert the data.frame to a numeric matrix on your own, then provide this
#' matrix and cluster assignments with respect to this matrix.
#' @return An integer; the estimated size of the model. The minimum returned
#' value is 1, even if the lasso with cross-validation chose no features.
#' @author Gregory Faletto, Jacob Bien
#' @references Reid, S., & Tibshirani, R. (2016). Sparse regression and marginal
#' testing using cluster prototypes. \emph{Biostatistics}, 17(2), 364–376.
#' \url{https://doi.org/10.1093/biostatistics/kxv049}.
#' @export
getModelSize <- function(X, y, clusters){
stopifnot(is.matrix(X) | is.data.frame(X))
# Check if x is a matrix; if it's a data.frame, convert to matrix.
if(is.data.frame(X)){
p <- ncol(X)
X <- stats::model.matrix(~ ., X)
X <- X[, colnames(X) != "(Intercept)"]
if(length(clusters) > 0 & (p != ncol(X))){
stop("When stats::model.matrix converted the provided data.frame X to a matrix, the number of columns changed (probably because the provided data.frame contained a factor variable with at least three levels). Please convert X to a matrix yourself using model.matrix and provide cluster assignments according to the columns of the new matrix.")
}
}
stopifnot(is.matrix(X))
stopifnot(all(!is.na(X)))
stopifnot(is.numeric(X) | is.integer(X))
n <- nrow(X)
# Since the model size will be determined by cross-validation, the provided
# y must be real-valued (this should be checked internally in other
# functions before getModelSize is called, but this check is here just in
# case).
if(!is.numeric(y) & !is.integer(y)){
stop("getModelSize is trying to determine max_num_clusts using the lasso with cross-validation, but the y provided to getModelSize was not real-valued.")
}
stopifnot(length(y) == n)
# Check clusters argument
clusters <- checkCssClustersInput(clusters)
### Format clusters into a list where all features are in exactly one
# cluster (any unclustered features are put in their own "cluster" of size
# 1).
clust_names <- as.character(NA)
if(!is.null(names(clusters)) & is.list(clusters)){
clust_names <- names(clusters)
}
clusters <- formatClusters(clusters, p=ncol(X),
clust_names=clust_names)$clusters
X_size <- X
if(length(clusters) > 0){
# Create modified design matrix by dropping all but one feature from
# each cluster
feats_to_drop <- integer()
for(i in 1:length(clusters)){
cluster_i <- clusters[[i]]
if(length(cluster_i) > 1){
feat_to_keep <- identifyPrototype(cluster_i, X, y)
feats_to_drop <- c(feats_to_drop, setdiff(cluster_i,
feat_to_keep))
}
}
if(length(feats_to_drop) > 0){
X_size <- X_size[, -1*feats_to_drop]
}
}
size_results <- glmnet::cv.glmnet(x=X_size, y=y, family="gaussian")
coefs <- as.numeric(glmnet::coef.glmnet(size_results, s="lambda.1se"))
# Number of nonzero coefficients (subtract one in order to ignore intercept)
size <- length(coefs[coefs != 0]) - 1
# Check output
stopifnot(is.numeric(size) | is.integer(size))
stopifnot(!is.na(size))
stopifnot(length(size) == 1)
stopifnot(size == round(size))
return(as.integer(max(size, 1)))
}Tests for getModelSize():
testthat::test_that("getModelSize works", {
set.seed(1723)
data <- genClusteredData(n=15, p=11, k_unclustered=2, cluster_size=3,
n_clusters=2, sigma_eps_sq=10^(-6))
x <- data$X
y <- data$y
good_clusters <- list(red_cluster=1L:3L, green_cluster=4L:6L)
ret <- getModelSize(X=x, y=y, clusters=good_clusters)
testthat::expect_true(is.numeric(ret) | is.integer(ret))
testthat::expect_true(!is.na(ret))
testthat::expect_equal(length(ret), 1)
testthat::expect_equal(ret, round(ret))
testthat::expect_true(ret >= 1)
# 11 features, but two clusters of size 3, so maximum size should
# be 11 - 2 - 2 = 7
testthat::expect_true(ret <= 7)
## Trying other inputs
unnamed_clusters <- list(1L:3L, 5L:8L)
ret <- getModelSize(X=x, y=y, clusters=unnamed_clusters)
testthat::expect_true(is.numeric(ret) | is.integer(ret))
testthat::expect_true(!is.na(ret))
testthat::expect_equal(length(ret), 1)
testthat::expect_equal(ret, round(ret))
testthat::expect_true(ret >= 1)
# 11 features, but 3 in one cluster and 4 in another, so maximum size should
# be 11 - 2 - 3 = 6
testthat::expect_true(ret <= 6)
# Single cluster
ret <- getModelSize(X=x, y=y, clusters=2:5)
testthat::expect_true(is.numeric(ret) | is.integer(ret))
testthat::expect_true(!is.na(ret))
testthat::expect_equal(length(ret), 1)
testthat::expect_equal(ret, round(ret))
testthat::expect_true(ret >= 1)
# 11 features, but 4 in one cluster, so maximum size should be 11 - 3 = 8
testthat::expect_true(ret <= 8)
# Intentionally don't provide clusters for all feature, mix up formatting,
# etc.
good_clusters <- list(red_cluster=1:3, 5:8)
ret <- getModelSize(X=x, y=y, clusters=good_clusters)
testthat::expect_true(is.numeric(ret) | is.integer(ret))
testthat::expect_true(!is.na(ret))
testthat::expect_equal(length(ret), 1)
testthat::expect_equal(ret, round(ret))
testthat::expect_true(ret >= 1)
# 11 features, but 3 in one cluster and 4 in another, so maximum size should
# be 11 - 2 - 3 = 6
testthat::expect_true(ret <= 6)
## Trying bad inputs
testthat::expect_error(getModelSize(X="x", y=y, clusters=good_clusters),
"is.matrix(X) | is.data.frame(X) is not TRUE", fixed=TRUE)
testthat::expect_error(getModelSize(X=x[1:5, ], y=y, clusters=good_clusters),
"length(y) == n is not TRUE", fixed=TRUE)
testthat::expect_error(getModelSize(X=x, y=FALSE, clusters=good_clusters),
"getModelSize is trying to determine max_num_clusts using the lasso with cross-validation, but the y provided to getModelSize was not real-valued.",
fixed=TRUE)
testthat::expect_error(getModelSize(X=x, y=y, clusters=list(3:7, 7:10)),
"Overlapping clusters detected; clusters must be non-overlapping. Overlapping clusters: 1, 2.",
fixed=TRUE)
testthat::expect_error(getModelSize(X=x, y=y, clusters=list(5:8, 5:8)),
"length(clusters) == length(unique(clusters)) is not TRUE",
fixed=TRUE)
testthat::expect_error(getModelSize(X=x, y=y, clusters=6:50),
"length(all_clustered_feats) == p is not TRUE",
fixed=TRUE)
testthat::expect_error(getModelSize(X=x, y=y,
clusters=list(2:3, as.integer(NA))),
"!is.na(clusters) are not all TRUE",
fixed=TRUE)
testthat::expect_error(getModelSize(X=x, y=y, clusters=list(2:3, c(4, 4, 5))),
"length(clusters[[i]]) == length(unique(clusters[[i]])) is not TRUE",
fixed=TRUE)
testthat::expect_error(getModelSize(X=x, y=y, clusters=list(1:4, -1)),
"all(clusters[[i]] >= 1) is not TRUE",
fixed=TRUE)
testthat::expect_error(getModelSize(X=x, y=y, clusters=list(1:4,
c(2.3, 1.2))),
"all(clusters[[i]] == round(clusters[[i]])) is not TRUE",
fixed=TRUE)
})## ── Warning: getModelSize works ─────────────────────────────────────────────────
## Option grouped=FALSE enforced in cv.glmnet, since < 3 observations per fold
## Backtrace:
## â–†
## 1. └─litr (local) getModelSize(X = x, y = y, clusters = good_clusters)
## 2. └─glmnet::cv.glmnet(x = X_size, y = y, family = "gaussian")
## 3. └─glmnet:::cv.glmnet.raw(...)
## ── Warning: getModelSize works ─────────────────────────────────────────────────
## Option grouped=FALSE enforced in cv.glmnet, since < 3 observations per fold
## Backtrace:
## â–†
## 1. └─litr (local) getModelSize(X = x, y = y, clusters = unnamed_clusters)
## 2. └─glmnet::cv.glmnet(x = X_size, y = y, family = "gaussian")
## 3. └─glmnet:::cv.glmnet.raw(...)
## ── Warning: getModelSize works ─────────────────────────────────────────────────
## Option grouped=FALSE enforced in cv.glmnet, since < 3 observations per fold
## Backtrace:
## â–†
## 1. └─litr (local) getModelSize(X = x, y = y, clusters = 2:5)
## 2. └─glmnet::cv.glmnet(x = X_size, y = y, family = "gaussian")
## 3. └─glmnet:::cv.glmnet.raw(...)
## ── Warning: getModelSize works ─────────────────────────────────────────────────
## Option grouped=FALSE enforced in cv.glmnet, since < 3 observations per fold
## Backtrace:
## â–†
## 1. └─litr (local) getModelSize(X = x, y = y, clusters = good_clusters)
## 2. └─glmnet::cv.glmnet(x = X_size, y = y, family = "gaussian")
## 3. └─glmnet:::cv.glmnet.raw(...)
## Test passed with 34 successes 😀.
printCssDf():
#' Prepares a data.frame summarazing cluster stability selection output to print
#'
#' Print a summary of the information from the css function.
#' @param css_results An object of class "cssr" (the output of the function
#' css).
#' @param cutoff Numeric; the outputted data.frame will display only those
#' clusters with selection proportions equal to at least cutoff. Must be between
#' 0 and 1. Default is 0 (in which case either all clusters are displayed, or
#' max_num_clusts are, if max_num_clusts is specified).
#' @param min_num_clusts Integer or numeric; the minimum number of clusters to
#' use regardless of cutoff. (That is, if the chosen cutoff returns fewer than
#' min_num_clusts clusters, the cutoff will be increased until at least
#' min_num_clusts clusters are selected.) Default is 1.
#' @param max_num_clusts Integer or numeric; the maximum number of clusters to
#' use regardless of cutoff. (That is, if the chosen cutoff returns more than
#' max_num_clusts clusters, the cutoff will be decreased until at most
#' max_num_clusts clusters are selected.) Default is NA (in which case
#' max_num_clusts is ignored).
#' @return A data.frame; each row contains a cluster, arranged in decreasing
#' order of cluster selection proportion from top to bottom. The columns are
#' ClustName (the name of the cluster that was either provided to css or made by
#' css if no name was provided); ClustProtoName (the name of the selection
#' prototype from the cluster, which is the feature with the greatest individual
#' selection proportion among all the cluster members, with ties broken by
#' choosing the feature with the highest correlation with the response if the
#' response is real-valued; only returned if the features are named),
#' ClustProtoNum (the column number of the prototype in the X matrix provided to
#' css), and ClustSize (the size of the cluster).
#' @author Gregory Faletto, Jacob Bien
#' @export
printCssDf <- function(css_results, cutoff=0, min_num_clusts=1,
max_num_clusts=NA){
# Check inputs
stopifnot(class(css_results) == "cssr")
checkCutoff(cutoff)
p <- ncol(css_results$feat_sel_mat)
checkMinNumClusts(min_num_clusts, p, length(css_results$clusters))
max_num_clusts <- checkMaxNumClusts(max_num_clusts, min_num_clusts, p,
length(css_results$clusters))
sel_clusts <- getCssSelections(css_results, cutoff=cutoff,
min_num_clusts=min_num_clusts,
max_num_clusts=max_num_clusts)$selected_clusts
# sel_clusts is guaranteed to have length at least 1 by
# getCssSelections
# Get prototypes (feature from each cluster with highest selection
# proportion, breaking ties by using marginal correlations of features with
# y from data provided to css if y is real-valued)
prototypes <- getSelectionPrototypes(css_results, sel_clusts)
# Cluster selection proportions
if(length(sel_clusts) > 1){
sel_clust_sel_props <- colMeans(css_results$clus_sel_mat[,
names(sel_clusts)])
} else{
sel_clust_sel_props <- mean(css_results$clus_sel_mat[,
names(sel_clusts)])
}
# Data.frame: name of cluster, cluster prototype, selection proportion,
# cluster size
if(!is.null(names(prototypes))){
print_df <- data.frame(ClustName=names(sel_clusts),
ClustProtoName=names(prototypes), ClustProtoNum=unname(prototypes),
ClustSelProp=sel_clust_sel_props, ClustSize=lengths(sel_clusts))
} else{
print_df <- data.frame(ClustName=names(sel_clusts),
ClustProtoNum=unname(prototypes), ClustSelProp=sel_clust_sel_props,
ClustSize=lengths(sel_clusts))
}
print_df <- print_df[order(print_df$ClustSelProp, decreasing=TRUE), ]
rownames(print_df) <- NULL
stopifnot(is.data.frame(print_df))
stopifnot(nrow(print_df) >= 1)
return(print_df)
}#' Identify selection prototypes from selected clusters
#'
#' Takes in list of selected clusters and returns an integer vector of the
#' indices of the features that were most frequently selected from each cluster
#'
#' @param css_results An object of class "cssr" (the output of the function
#' css).
#' @param selected_clusts A list of integer vectors; each vector must contain
#' the indices of features in a cluster.
#' @return An integer vector (of length equal to the number of clusters) of the
#' indices of the feature prototypes (the features from each cluster that were
#' selected the most often individually by the base method in cluster stability
#' selection). In the case of a tie, the tie is broken by choosing the feature
#' most correlated with the response in the full data set provided to css.
#' @author Gregory Faletto, Jacob Bien
getSelectionPrototypes <- function(css_results, selected_clusts){
# Check inputs
stopifnot(class(css_results) == "cssr")
stopifnot(is.list(selected_clusts))
n_selected_clusts <- length(selected_clusts)
stopifnot(n_selected_clusts >= 1)
stopifnot(all(lengths(selected_clusts) >= 1))
prototypes <- rep(as.integer(NA), n_selected_clusts)
for(i in 1:n_selected_clusts){
clust_i <- selected_clusts[[i]]
sel_props_i <- colMeans(css_results$feat_sel_mat)[clust_i]
proto_i <- clust_i[sel_props_i == max(sel_props_i)]
stopifnot(length(proto_i) >= 1)
if(length(proto_i) > 1){
if(is.numeric(css_results$y) | is.integer(css_results$y)){
# Break tie by looking at marginal correlations
corrs_i <- stats::cor(css_results$X[, proto_i], css_results$y)[, 1]
corrs_i <- abs(corrs_i)
proto_i <- proto_i[corrs_i == max(corrs_i)]
}
}
# If there is still a tie, break by choosing the first feature of those
# remaining
prototypes[i] <- proto_i[1]
names(prototypes)[i] <- colnames(css_results$X)[proto_i]
}
# Check output
stopifnot(is.integer(prototypes))
stopifnot(all(!is.na(prototypes)))
stopifnot(length(prototypes) == length(unique(prototypes)))
return(prototypes)
}Tests for getSelectionPrototypes():
testthat::test_that("getSelectionPrototypes works", {
set.seed(67234)
data <- genClusteredData(n=15, p=11, k_unclustered=2, cluster_size=3,
n_clusters=2, sig_clusters=1, sigma_eps_sq=10^(-6))
x <- data$X
y <- data$y
good_clusters <- list(red_cluster=1L:3L, green_cluster=4L:6L)
css_results <- css(X=x, y=y, lambda=0.01, clusters=good_clusters)
ret <- getSelectionPrototypes(css_results, selected_clusts=good_clusters)
testthat::expect_true(is.integer(ret))
testthat::expect_true(all(!is.na(ret)))
testthat::expect_equal(length(ret), length(good_clusters))
testthat::expect_equal(length(ret), length(unique(ret)))
for(i in 1:length(ret)){
testthat::expect_true(ret[i] %in% good_clusters[[i]])
# Find the largest selection proportion of any feature in cluster i
max_prop <- max(colMeans(css_results$feat_sel_mat[, good_clusters[[i]]]))
# Find the selection proportion of the identified prototype
proto_prop <- colMeans(css_results$feat_sel_mat)[ret[i]]
testthat::expect_equal(max_prop, proto_prop)
}
# Try with only one selected cluster (still should be in a list)
ret <- getSelectionPrototypes(css_results,
selected_clusts=list(red_cluster=1L:3L))
testthat::expect_true(is.integer(ret))
testthat::expect_true(!is.na(ret))
testthat::expect_equal(length(ret), 1)
testthat::expect_true(ret %in% 1L:3L)
# Find the largest selection proportion of any feature in the cluster
max_prop <- max(colMeans(css_results$feat_sel_mat[, 1L:3L]))
# Find the selection proportion of the identified prototype
proto_prop <- colMeans(css_results$feat_sel_mat)[ret]
testthat::expect_equal(max_prop, proto_prop)
## Trying bad inputs
# Error contains quotation marks
testthat::expect_error(getSelectionPrototypes(x, good_clusters))
testthat::expect_error(getSelectionPrototypes(css_results, 1L:3L),
"is.list(selected_clusts) is not TRUE", fixed=TRUE)
testthat::expect_error(getSelectionPrototypes(css_results, list()),
"n_selected_clusts >= 1 is not TRUE", fixed=TRUE)
testthat::expect_error(getSelectionPrototypes(css_results,
list(red_cluster=1L:3L,
green_cluster=4L:6L,
bad_cluster=integer())),
"all(lengths(selected_clusts) >= 1) is not TRUE",
fixed=TRUE)
})## Test passed with 17 successes 😸.
Tests for printCssDf():
testthat::test_that("printCssDf works", {
set.seed(67234)
data <- genClusteredData(n=15, p=11, k_unclustered=2, cluster_size=3,
n_clusters=2, sig_clusters=1, sigma_eps_sq=10^(-6))
x <- data$X
y <- data$y
good_clusters <- list(red_cluster=1L:3L, green_cluster=4L:6L)
css_results <- css(X=x, y=y, lambda=0.01, clusters=good_clusters, B=10)
ret <- printCssDf(css_results)
testthat::expect_true(is.data.frame(ret))
testthat::expect_identical(colnames(ret), c("ClustName", "ClustProtoNum",
"ClustSelProp", "ClustSize"))
# Total number of clusters is 11 - (3 - 1) - (3 - 1) = 7
testthat::expect_equal(nrow(ret), 7)
testthat::expect_true(is.character(ret$ClustName))
testthat::expect_true(all(names(good_clusters) %in% ret$ClustName))
testthat::expect_equal(length(ret$ClustName), length(unique(ret$ClustName)))
testthat::expect_true(is.integer(ret$ClustProtoNum))
testthat::expect_true(ret[ret$ClustName=="red_cluster",
"ClustProtoNum"] %in% 1L:3L)
testthat::expect_true(ret[ret$ClustName=="green_cluster",
"ClustProtoNum"] %in% 4L:6L)
other_rows <- !(ret$ClustName %in% c("red_cluster", "green_cluster"))
testthat::expect_true(all(ret[other_rows, "ClustProtoNum"] %in% 7L:11L))
testthat::expect_true(length(ret[other_rows, "ClustProtoNum"]) ==
length(unique(ret[other_rows, "ClustProtoNum"])))
testthat::expect_true(is.numeric(ret$ClustSelProp))
testthat::expect_identical(ret$ClustSelProp, sort(ret$ClustSelProp,
decreasing=TRUE))
testthat::expect_true(is.integer(ret$ClustSize))
testthat::expect_equal(ret[ret$ClustName=="red_cluster", "ClustSize"], 3)
testthat::expect_equal(ret[ret$ClustName=="green_cluster", "ClustSize"], 3)
testthat::expect_true(all(ret[other_rows, "ClustSize"] == 1))
# Try again naming features
x_named <- x
colnames(x_named) <- LETTERS[1:11]
css_results_name_feats <- css(X=x_named, y=y, lambda=0.01,
clusters=good_clusters, B=10)
ret <- printCssDf(css_results_name_feats)
testthat::expect_true(is.data.frame(ret))
testthat::expect_identical(colnames(ret), c("ClustName", "ClustProtoName",
"ClustProtoNum", "ClustSelProp",
"ClustSize"))
# Total number of clusters is 11 - (3 - 1) - (3 - 1) = 7
testthat::expect_equal(nrow(ret), 7)
testthat::expect_true(is.character(ret$ClustName))
testthat::expect_true(all(names(good_clusters) %in% ret$ClustName))
testthat::expect_equal(length(ret$ClustName), length(unique(ret$ClustName)))
testthat::expect_true(is.character(ret$ClustProtoName))
testthat::expect_true(ret[ret$ClustName=="red_cluster",
"ClustProtoName"] %in% LETTERS[1:3])
testthat::expect_true(ret[ret$ClustName=="green_cluster",
"ClustProtoName"] %in% LETTERS[4:6])
other_rows <- !(ret$ClustName %in% c("red_cluster", "green_cluster"))
testthat::expect_true(all(ret[other_rows, "ClustProtoName"] %in% LETTERS[7:11]))
testthat::expect_true(length(ret[other_rows, "ClustProtoName"]) ==
length(unique(ret[other_rows, "ClustProtoName"])))
testthat::expect_true(is.integer(ret$ClustProtoNum))
testthat::expect_true(ret[ret$ClustName=="red_cluster",
"ClustProtoNum"] %in% 1L:3L)
testthat::expect_true(ret[ret$ClustName=="green_cluster",
"ClustProtoNum"] %in% 4L:6L)
testthat::expect_true(all(ret[other_rows, "ClustProtoNum"] %in% 7L:11L))
testthat::expect_true(length(ret[other_rows, "ClustProtoNum"]) ==
length(unique(ret[other_rows, "ClustProtoNum"])))
testthat::expect_true(is.numeric(ret$ClustSelProp))
testthat::expect_identical(ret$ClustSelProp, sort(ret$ClustSelProp,
decreasing=TRUE))
testthat::expect_true(is.integer(ret$ClustSize))
testthat::expect_equal(ret[ret$ClustName=="red_cluster", "ClustSize"], 3)
testthat::expect_equal(ret[ret$ClustName=="green_cluster", "ClustSize"], 3)
testthat::expect_true(all(ret[other_rows, "ClustSize"] == 1))
# Unnamed clusters
unnamed_clusters <- list(1:3, 4:6)
css_results_unnamed <- css(X=x, y=y, lambda=0.01, clusters=unnamed_clusters,
B=10)
ret <- printCssDf(css_results_unnamed)
testthat::expect_true(is.data.frame(ret))
testthat::expect_identical(colnames(ret), c("ClustName", "ClustProtoNum",
"ClustSelProp", "ClustSize"))
# Total number of clusters is 11 - (3 - 1) - (3 - 1) = 7
testthat::expect_equal(nrow(ret), 7)
testthat::expect_true(is.character(ret$ClustName))
testthat::expect_equal(length(ret$ClustName), length(unique(ret$ClustName)))
testthat::expect_true(is.integer(ret$ClustProtoNum))
testthat::expect_true(is.numeric(ret$ClustSelProp))
testthat::expect_identical(ret$ClustSelProp, sort(ret$ClustSelProp,
decreasing=TRUE))
testthat::expect_true(is.integer(ret$ClustSize))
# Try other settings for cutoff, min_num_clusts, max_num_clusts, etc.
ret <- printCssDf(css_results, max_num_clusts=3)
testthat::expect_true(is.data.frame(ret))
testthat::expect_identical(colnames(ret), c("ClustName", "ClustProtoNum",
"ClustSelProp", "ClustSize"))
testthat::expect_true(nrow(ret) <= 3)
testthat::expect_true(is.character(ret$ClustName))
testthat::expect_equal(length(ret$ClustName), length(unique(ret$ClustName)))
testthat::expect_true(is.integer(ret$ClustProtoNum))
other_rows <- !(ret$ClustName %in% c("red_cluster", "green_cluster"))
testthat::expect_true(all(ret[other_rows, "ClustProtoNum"] %in% 7L:11L))
testthat::expect_true(length(ret[other_rows, "ClustProtoNum"]) ==
length(unique(ret[other_rows, "ClustProtoNum"])))
testthat::expect_true(is.numeric(ret$ClustSelProp))
testthat::expect_identical(ret$ClustSelProp, sort(ret$ClustSelProp,
decreasing=TRUE))
testthat::expect_true(is.integer(ret$ClustSize))
testthat::expect_true(all(ret[other_rows, "ClustSize"] == 1))
if("red_cluster" %in% ret$ClustName){
testthat::expect_true(ret[ret$ClustName=="red_cluster",
"ClustProtoNum"] %in% 1L:3L)
testthat::expect_equal(ret[ret$ClustName=="red_cluster", "ClustSize"], 3)
}
if("green_cluster" %in% ret$ClustName){
testthat::expect_true(ret[ret$ClustName=="green_cluster",
"ClustProtoNum"] %in% 4L:6L)
testthat::expect_equal(ret[ret$ClustName=="green_cluster", "ClustSize"], 3)
}
ret <- printCssDf(css_results, min_num_clusts=2, cutoff=1)
testthat::expect_true(is.data.frame(ret))
testthat::expect_identical(colnames(ret), c("ClustName", "ClustProtoNum",
"ClustSelProp", "ClustSize"))
# Total number of clusters is 11 - (3 - 1) - (3 - 1) = 7
testthat::expect_true(nrow(ret) >= 2)
testthat::expect_true(nrow(ret) <= 7)
testthat::expect_true(is.character(ret$ClustName))
testthat::expect_equal(length(ret$ClustName), length(unique(ret$ClustName)))
testthat::expect_true(is.integer(ret$ClustProtoNum))
other_rows <- !(ret$ClustName %in% c("red_cluster", "green_cluster"))
testthat::expect_true(all(ret[other_rows, "ClustProtoNum"] %in% 7L:11L))
testthat::expect_true(length(ret[other_rows, "ClustProtoNum"]) ==
length(unique(ret[other_rows, "ClustProtoNum"])))
testthat::expect_true(is.numeric(ret$ClustSelProp))
testthat::expect_identical(ret$ClustSelProp, sort(ret$ClustSelProp,
decreasing=TRUE))
testthat::expect_true(is.integer(ret$ClustSize))
testthat::expect_true(all(ret[other_rows, "ClustSize"] == 1))
if("red_cluster" %in% ret$ClustName){
testthat::expect_true(ret[ret$ClustName=="red_cluster",
"ClustProtoNum"] %in% 1L:3L)
testthat::expect_equal(ret[ret$ClustName=="red_cluster", "ClustSize"], 3)
}
if("green_cluster" %in% ret$ClustName){
testthat::expect_true(ret[ret$ClustName=="green_cluster",
"ClustProtoNum"] %in% 4L:6L)
testthat::expect_equal(ret[ret$ClustName=="green_cluster", "ClustSize"], 3)
}
#
ret <- printCssDf(css_results, cutoff=1)
testthat::expect_true(is.data.frame(ret))
testthat::expect_identical(colnames(ret), c("ClustName", "ClustProtoNum",
"ClustSelProp", "ClustSize"))
testthat::expect_true(nrow(ret) >= 1)
testthat::expect_true(nrow(ret) <= 7)
testthat::expect_true(is.character(ret$ClustName))
testthat::expect_equal(length(ret$ClustName), length(unique(ret$ClustName)))
testthat::expect_true(is.integer(ret$ClustProtoNum))
other_rows <- !(ret$ClustName %in% c("red_cluster", "green_cluster"))
testthat::expect_true(all(ret[other_rows, "ClustProtoNum"] %in% 7L:11L))
testthat::expect_true(length(ret[other_rows, "ClustProtoNum"]) ==
length(unique(ret[other_rows, "ClustProtoNum"])))
testthat::expect_true(is.numeric(ret$ClustSelProp))
testthat::expect_identical(ret$ClustSelProp, sort(ret$ClustSelProp,
decreasing=TRUE))
testthat::expect_true(is.integer(ret$ClustSize))
testthat::expect_true(all(ret[other_rows, "ClustSize"] == 1))
if("red_cluster" %in% ret$ClustName){
testthat::expect_true(ret[ret$ClustName=="red_cluster",
"ClustProtoNum"] %in% 1L:3L)
testthat::expect_equal(ret[ret$ClustName=="red_cluster", "ClustSize"], 3)
}
if("green_cluster" %in% ret$ClustName){
testthat::expect_true(ret[ret$ClustName=="green_cluster",
"ClustProtoNum"] %in% 4L:6L)
testthat::expect_equal(ret[ret$ClustName=="green_cluster", "ClustSize"], 3)
}
## Trying bad inputs
# Error has quotation marks in it
testthat::expect_error(printCssDf("css_results"))
testthat::expect_error(printCssDf(css_results, cutoff=-.1),
"cutoff >= 0 is not TRUE", fixed=TRUE)
testthat::expect_error(printCssDf(css_results, min_num_clusts=3.2),
"min_num_clusts == round(min_num_clusts) is not TRUE",
fixed=TRUE)
testthat::expect_error(printCssDf(css_results, max_num_clusts="5"),
"is.numeric(max_num_clusts) | is.integer(max_num_clusts) is not TRUE",
fixed=TRUE)
})## Test passed with 98 successes 🎉.
#' Print cluster stability selection output
#'
#' Print a summary of the information from the css function (using output from
#' printCssDf function).
#' @param x An object of class "cssr" (the output of the function css).
#' @param cutoff Numeric; print.cssr will display only those
#' clusters with selection proportions equal to at least cutoff. Must be between
#' 0 and 1. Default is 0 (in which case either all clusters are displayed, or
#' max_num_clusts are, if max_num_clusts is specified).
#' @param min_num_clusts Integer or numeric; the minimum number of clusters to
#' use regardless of cutoff. (That is, if the chosen cutoff returns fewer than
#' min_num_clusts clusters, the cutoff will be increased until at least
#' min_num_clusts clusters are selected.) Default is 1.
#' @param max_num_clusts Integer or numeric; the maximum number of clusters to
#' use regardless of cutoff. (That is, if the chosen cutoff returns more than
#' max_num_clusts clusters, the cutoff will be decreased until at most
#' max_num_clusts clusters are selected.) Default is NA (in which case
#' max_num_clusts is ignored).
#' @param ... Additional arguments to generic print.data.frame function
#' @return A data.frame; each row contains a cluster, arranged in decreasing
#' order of cluster selection proportion from top to bottom. The columns are
#' ClustName (the name of the cluster that was either provided to css or made by
#' css if no name was provided); ClustProtoName (the name of the selection
#' prototype from the cluster, which is the feature with the greatest individual
#' selection proportion among all the cluster members, with ties broken by
#' choosing the feature with the highest correlation with the response if the
#' response is real-valued; only returned if the features are named),
#' ClustProtoNum (the column number of the prototype in the X matrix provided to
#' css), and ClustSize (the size of the cluster).
#' @author Gregory Faletto, Jacob Bien
#' @export
print.cssr <- function(x, cutoff=0, min_num_clusts=1, max_num_clusts=NA, ...){
df <- printCssDf(css_results=x, cutoff=cutoff,
min_num_clusts=min_num_clusts, max_num_clusts=max_num_clusts)
print.data.frame(df, ...)
}