5 Other useful functions
The following are some other functions that are useful for specific tasks.
genClusteredData()generates jointly Gaussian data sets that include clusters of features that are correlated with a latent feature (the latent features themselves are also provided), as in the simulation in Example 1, Section 5.1, and Section 5.2 of Faletto and Bien (2022).genClusteredDataWeighted()generates jointly Gaussian data sets that include clusters of features that are correlated with a latent feature with unequal correlations (the latent features themselves are also provided), as in the simulation Section 5.3 of Faletto and Bien (2022).genClusteredDataWeightedRandom()generates jointly Gaussian data sets that include clusters of features that are correlated with a latent feature with unequal correlations (the latent features themselves are also provided) that are determined at random.getLassoLambda()uses a data set to choose a good value of lambda for subsamples of size n/2 (as in cluster stability selection) by cross-validation.getModelSize()estimates the best size for a lasso model on a provided data set.printCssDf()converts the output of thecss()function (an object of class cssr) to a data.frame that is ready to be printed. We provide end-user access toprintCssDf()because this data.frame itself may be useful as an object that summarizes the output fromcss().print.cssr()is the print function for objects of class cssr.-
-
checkGenClusteredDataInputs()checks that the inputs togenClusteredData()are as expected. -
genZmuY()generates the latent Z, the weak signal and noise features in X, the latent mu, and the observed y from the specified parameters. -
getNoiseVar()calculates the variance of the noise to add to Z so that the proxies have the specified correlation with the latent variable Z.
-
-
-
checkGenClusteredDataWeightedInputs()verifies the inputs togenClusteredDataWeighted().
-
-
-
getSelectionPrototypes()identifies the prototypes from selected clusters (the feature in each cluster that is most correlated with the response)
-
#' Generate randomly sampled data including noisy observations of latent
#' variables
#'
#' Generate a data set including latent features Z, observed features X (which
#' may include noisy or noiseless observations of the latent features in Z),
#' an observed response y which is a linear model of features from Z and X as
#' well as independent mean zero noise, and mu (the responses from y without
#' the added noise). Data is generated in the same way as in the simulations
#' from Faletto and Bien (2022).
#' @param n Integer or numeric; the number of observations to generate. (The
#' generated X and Z will have n rows, and the generated y and mu will have
#' length n.)
#' @param p Integer or numeric; the number of features to generate. The
#' generated X will have p columns.
#' @param k_unclustered Integer or numeric; the number of features in X that
#' will have nonzero coefficients in the true model for y among those features
#' not generated from the n_clusters latent variables (called "weak signal"
#' features in the simulations from Faletto and Bien 2022). The coefficients on
#' these features will be determined by beta_unclustered. Must be at least 1.
#' @param cluster_size Integer or numeric; for each of the n_clusters latent
#' variables, X will contain cluster_size noisy proxies that are correlated with
#' the latent variable. Must be at least 2.
#' @param n_clusters Integer or numeric; the number of latent variables to
#' generate, each of which will be associated with an observed cluster in X.
#' Must be at least 1. Default is 1.
#' @param sig_clusters Integer or numeric; the number of generated latent
#' features that will have nonzero coefficients in the true model for y (all of
#' them will have coefficient beta_latent). Must be less than or equal to
#' n_clusters. Default is 1.
#' @param rho Integer or numeric; the correlation of the proxies in each cluster
#' with the latent variable. Must be greater than 0. Default is 0.9.
#' @param beta_latent Integer or numeric; the coefficient used for all
#' sig_clusters latent variables that have nonzero coefficients in the true
#' model for y. Can't equal 0. Default is 1.5.
#' @param beta_unclustered Integer or numeric; the maximum coefficient in the
#' model for y among the k_unclustered features in X not generated from the
#' latent variables. The coefficients of the features will be
#' beta_unclustered/sqrt(1:k_unclustered). Can't equal 0. Default is 1.
#' @param snr Integer or numeric; the signal-to-noise ratio of the response
#' y. If sigma_eps_sq is not specified, the variance of the noise in y will be
#' calculated using the formula sigma_eps_sq = sum(mu^2)/(n * snr). Only one of
#' snr and sigma_eps_sq must be specified. Default is NA.
#' @param sigma_eps_sq Integer or numeric; the variance on the noise added
#' to y (non-negative; 0 gives noiseless y). Only one of snr and sigma_eps_sq
#' must be specified. Default is NA.
#' @return A list of the following elements. \item{X}{An n x p numeric matrix of
#' n observations from a p-dimensional multivariate normal distribution
#' generated using the specified parameters. The first n_clusters times
#' cluster_size features will be the clusters of features correlated with the
#' n_clusters latent variables. The next k_unclustered features will be the
#' "weak signal" features, and the remaining p - n_clusters*cluster_size -
#' k_unclustered features will be the unclustered noise features.} \item{y}{A
#' length n numeric vector; the response generated from X, the latent features
#' from Z, and the coefficient vector, along with additive noise.} \item{Z}{The
#' latent features, as an n by n_clusters numeric matrix (one column per
#' latent variable). Note that (X, Z) is multivariate Gaussian.}
#' \item{mu}{A length `n` numeric vector; the expected response given X, Z, and
#' the true coefficient vector (equal to y minus the added noise).}
#' @author Gregory Faletto, Jacob Bien
#' @references
<<faletto2022>>
#' @importFrom stats rnorm runif
#' @examples
#' set.seed(1)
#' data <- genClusteredData(n = 50, p = 11, k_unclustered = 2,
#' cluster_size = 4, n_clusters = 1, snr = 3)
#' dim(data$X)
#' length(data$y)
#' @export
genClusteredData <- function(n, p, k_unclustered, cluster_size, n_clusters=1,
sig_clusters=1, rho=0.9, beta_latent=1.5, beta_unclustered=1,
snr=as.numeric(NA), sigma_eps_sq=as.numeric(NA)){
# Check inputs
checkGenClusteredDataInputs(p, k_unclustered, cluster_size, n_clusters,
sig_clusters, rho, beta_latent, beta_unclustered, snr,
sigma_eps_sq)
ret <- genZmuY(n=n, p=p, k_unclustered=k_unclustered,
cluster_size=cluster_size, n_clusters=n_clusters,
sig_clusters=sig_clusters, beta_latent=beta_latent,
beta_unclustered=beta_unclustered, snr=snr, sigma_eps_sq=sigma_eps_sq)
Z <- ret$Z
y <- ret$y
mu <- ret$mu
other_X <- ret$other_X
# Finally, generate clusters of proxies to complete X. First, get needed
# variances of noise to add
noise_var <- getNoiseVar(rho)
# Generate these noise features
noise_mat <- matrix(stats::rnorm(n*n_clusters*cluster_size, mean=0,
sd=sqrt(noise_var)), n, n_clusters*cluster_size)
# Create matrix of proxies. Each cluster's cluster_size proxies are its
# latent Z column plus the matching noise columns (#58); column-expand Z to
# the noise block layout in one vectorized add (replaces the per-cluster
# block loop; noise_mat was already drawn above, so RNG order is unchanged).
# For n_clusters == 1, genZmuY returns Z as a bare numeric vector, and
# as.matrix(Z) promotes it to a one-column matrix before the column-expand.
col_clust <- rep(1:n_clusters, each = cluster_size)
proxy_mat <- as.matrix(Z)[, col_clust, drop = FALSE] + noise_mat
return(finalizeGenClusteredData(proxy_mat, other_X, y, mu, Z, n, p,
n_clusters))
}The three genClusteredData* validators share the bulk of their checks. To
keep them from drifting apart (the source of the validation inconsistencies
fixed in #13), the common checks live in two shared helpers:
checkGenClusteredDataInputsPre() (the checks that precede each generator’s
correlation-specific checks) and checkGenClusteredDataInputsPost() (the
checks that follow them). Each validator calls Pre(), then its own
correlation/strength checks, then Post(), then its own snr/sigma_eps_sq
checks – preserving each validator’s exact original check order.
#' Shared leading input checks for the genClusteredData* generators
#'
#' The checks every genClusteredData* validator runs before its
#' correlation-specific checks (sig_clusters, n_clusters, and cluster_size).
#' @param sig_clusters,n_clusters,cluster_size As documented in
#' `checkGenClusteredDataInputs()`.
#' @return No return value; called for the side effect of erroring on bad input.
#' @author Gregory Faletto, Jacob Bien
#' @keywords internal
#' @noRd
checkGenClusteredDataInputsPre <- function(sig_clusters, n_clusters,
cluster_size){
stopifnot(is.numeric(sig_clusters) | is.integer(sig_clusters))
stopifnot(sig_clusters <= n_clusters)
stopifnot(sig_clusters >= 0)
stopifnot(sig_clusters == round(sig_clusters))
stopifnot(is.numeric(n_clusters) | is.integer(n_clusters))
stopifnot(n_clusters == round(n_clusters))
# TODO(gregfaletto): is it easy to remove the requirement that n_clusters is
# at least 1 (so that it's possible to generate data with no latent
# features)? If so, should only check that cluster_size >= 1 if n_clusters
# >= 1, and in makeCovarianceMatrix function only need block_size >= 1
# rather than 2.
stopifnot(n_clusters >= 1)
stopifnot(is.numeric(cluster_size) | is.integer(cluster_size))
stopifnot(cluster_size == round(cluster_size))
stopifnot(cluster_size >= 2)
}
#' Shared trailing input checks for the genClusteredData* generators
#'
#' The checks every genClusteredData* validator runs after its
#' correlation-specific checks: the beta coefficients, k_unclustered, the
#' feature-count lower bound, and that exactly one of snr / sigma_eps_sq is
#' given. (Each validator then runs its own type/range checks on whichever of
#' snr / sigma_eps_sq was supplied.)
#' @param p,k_unclustered,cluster_size,n_clusters,beta_latent,beta_unclustered,snr,sigma_eps_sq
#' As documented in `checkGenClusteredDataInputs()`.
#' @return No return value; called for the side effect of erroring on bad input.
#' @author Gregory Faletto, Jacob Bien
#' @keywords internal
#' @noRd
checkGenClusteredDataInputsPost <- function(p, k_unclustered, cluster_size,
n_clusters, beta_latent, beta_unclustered, snr, sigma_eps_sq){
stopifnot(beta_latent != 0)
stopifnot(beta_unclustered != 0)
stopifnot(is.numeric(k_unclustered) | is.integer(k_unclustered))
stopifnot(k_unclustered >= 1)
stopifnot(k_unclustered == round(k_unclustered))
stopifnot(is.numeric(p) | is.integer(p))
stopifnot(p == round(p))
stopifnot(p >= n_clusters*cluster_size + k_unclustered)
# Same as make_sparse_blocked_linear_model_random, but ith coefficient
# of weak signal features is beta_unclustered/sqrt(i) in order to have
# a definitive ranking of weak signal features.
if(is.na(snr) & is.na(sigma_eps_sq)){
stop("Must specify one of snr or sigma_eps_sq")
}
if(!is.na(snr) & !is.na(sigma_eps_sq)){
stop("Only one of snr and sigma_eps_sq may be specified")
}
}
#' Shared snr / sigma_eps_sq input checks for the genClusteredData* generators
#'
#' Validates whichever of snr / sigma_eps_sq was supplied (exactly one is non-NA
#' by this point, enforced by `checkGenClusteredDataInputsPost()`): it must be
#' numeric and length 1, and positive (snr) or non-negative (sigma_eps_sq).
#' Previously only `genClusteredData()` ran the full type/length checks; the two
#' weighted generators ran a leaner version that silently accepted a
#' non-numeric snr / sigma_eps_sq (which then failed later inside `genZmuY()`).
#' Sharing this helper makes the three consistent (#35).
#' @param snr,sigma_eps_sq As documented in `checkGenClusteredDataInputs()`.
#' @return No return value; called for the side effect of erroring on bad input.
#' @author Gregory Faletto, Jacob Bien
#' @keywords internal
#' @noRd
checkGenClusteredDataInputsSnrSigma <- function(snr, sigma_eps_sq){
if(is.na(snr)){
stopifnot(all(!is.na(sigma_eps_sq)))
stopifnot(is.numeric(sigma_eps_sq) | is.integer(sigma_eps_sq))
stopifnot(length(sigma_eps_sq) == 1)
stopifnot(sigma_eps_sq >= 0)
} else{
stopifnot(is.numeric(snr) | is.integer(snr))
stopifnot(length(snr) == 1)
stopifnot(snr > 0)
}
}#' Check inputs to genClusteredData
#'
#' @param p Integer or numeric; the number of features to generate. The
#' generated X will have p columns.
#' @param k_unclustered Integer or numeric; the number of features in X that
#' will have nonzero coefficients in the true model for y among those features
#' not generated from the n_clusters latent variables (called "weak signal"
#' features in the simulations from Faletto and Bien 2022). The coefficients on
#' these features will be determined by beta_unclustered.
#' @param cluster_size Integer or numeric; for each of the n_clusters latent
#' variables, X will contain cluster_size noisy proxies that are correlated with
#' the latent variable. Must be at least 2.
#' @param n_clusters Integer or numeric; the number of latent variables to
#' generate, each of which will be associated with an observed cluster in X.
#' Must be at least 1. Default is 1.
#' @param sig_clusters Integer or numeric; the number of generated latent
#' features that will have nonzero coefficients in the true model for y (all of
#' them will have coefficient beta_latent). Must be less than or equal to
#' n_clusters. Default is 1.
#' @param rho Integer or numeric; the correlation of the proxies in each cluster
#' with the latent variable. Must be greater than 0. Default is 0.9.
#' @param beta_latent Integer or numeric; the coefficient used for all
#' sig_clusters latent variables that have nonzero coefficients in the true
#' model for y. Can't equal 0. Default is 1.5.
#' @param beta_unclustered Integer or numeric; the maximum coefficient in the
#' model for y among the k_unclustered features in X not generated from the
#' latent variables. The coefficients of the features will be
#' beta_unclustered/sqrt(1:k_unclustered). Can't equal 0. Default is 1.
#' @param snr Integer or numeric; the signal-to-noise ratio of the response
#' y. If sigma_eps_sq is not specified, the variance of the noise in y will be
#' calculated using the formula sigma_eps_sq = sum(mu^2)/(n * snr). Only one of
#' snr and sigma_eps_sq must be specified. Default is NA.
#' @param sigma_eps_sq Integer or numeric; the variance on the noise added
#' to y. Only one of snr and sigma_eps_sq must be specified. Default is NA.
#' @return A list of the following elements. \item{X}{An n x p numeric matrix of
#' n observations from a p-dimensional multivariate normal distribution
#' generated using the specified parameters. The first n_clusters times
#' cluster_size features will be the clusters of features correlated with the
#' n_clusters latent variables. The next k_unclustered features will be the
#' "weak signal" features, and the remaining p - n_clusters*cluster_size -
#' k_unclustered features will be the unclustered noise features.} \item{y}{A
#' length n numeric vector; the response generated from X, the latent features
#' from Z, and the coefficient vector, along with additive noise.} \item{Z}{The
#' latent features, as an n by n_clusters numeric matrix (one column per
#' latent variable). Note that (X, Z) is multivariate Gaussian.}
#' \item{mu}{A length `n` numeric vector; the expected response given X, Z, and
#' the true coefficient vector (equal to y minus the added noise).}
#' @author Gregory Faletto, Jacob Bien
#' @keywords internal
#' @noRd
checkGenClusteredDataInputs <- function(p, k_unclustered, cluster_size,
n_clusters, sig_clusters, rho, beta_latent, beta_unclustered, snr,
sigma_eps_sq){
checkGenClusteredDataInputsPre(sig_clusters, n_clusters, cluster_size)
stopifnot(rho > 0)
checkGenClusteredDataInputsPost(p, k_unclustered, cluster_size, n_clusters,
beta_latent, beta_unclustered, snr, sigma_eps_sq)
checkGenClusteredDataInputsSnrSigma(snr, sigma_eps_sq)
}genZmuY()
#' Generates Z, weak signal features in X, noise features in X, mu, and y
#' from provided parameters
#'
#' @param n Integer or numeric; the number of observations to generate. (The
#' generated X and Z will have n rows, and the generated y and mu will have
#' length n.)
#' @param p Integer or numeric; the number of features to generate. The
#' generated X will have p columns.
#' @param k_unclustered Integer or numeric; the number of features in X that
#' will have nonzero coefficients in the true model for y among those features
#' not generated from the n_clusters latent variables (called "weak signal"
#' features in the simulations from Faletto and Bien 2022). The coefficients on
#' these features will be determined by beta_unclustered. Must be at least 1.
#' @param cluster_size Integer or numeric; for each of the n_clusters latent
#' variables, X will contain cluster_size noisy proxies that are correlated with
#' the latent variable. Must be at least 2.
#' @param n_clusters Integer or numeric; the number of latent variables to
#' generate, each of which will be associated with an observed cluster in X.
#' Must be at least 1. Default is 1.
#' @param sig_clusters Integer or numeric; the number of generated latent
#' features that will have nonzero coefficients in the true model for y (all of
#' them will have coefficient beta_latent). Must be less than or equal to
#' n_clusters. Default is 1.
#' @param beta_latent Integer or numeric; the coefficient used for all
#' sig_clusters latent variables that have nonzero coefficients in the true
#' model for y. Can't equal 0. Default is 1.5.
#' @param beta_unclustered Integer or numeric; the maximum coefficient in the
#' model for y among the k_unclustered features in X not generated from the
#' latent variables. The coefficients of the features will be
#' beta_unclustered/sqrt(1:k_unclustered). Can't equal 0. Default is 1.
#' @param snr Integer or numeric; the signal-to-noise ratio of the response
#' y. If sigma_eps_sq is not specified, the variance of the noise in y will be
#' calculated using the formula sigma_eps_sq = sum(mu^2)/(n * snr). Only one of
#' snr and sigma_eps_sq must be specified. Default is NA.
#' @param sigma_eps_sq Integer or numeric; the variance on the noise added
#' to y. Only one of snr and sigma_eps_sq must be specified. Default is NA.
#' @return A list of the following elements. \item{Z}{The
#' latent features; either a numeric vector (if n_clusters == 1) or a numeric
#' matrix (if n_clusters > 1). Note that (X, Z) is multivariate Gaussian.}
#' \item{mu}{A length `n` numeric vector; the expected response given X, Z, and
#' the true coefficient vector (equal to y minus the added noise).} \item{y}{A
#' length n numeric vector; the response generated from X, the latent features
#' from Z, and the coefficient vector, along with additive noise.}
#' \item{other_X}{A numeric matrix of n observations from a multivariate normal
#' distribution generated using the specified parameters, containing the weak
#' signal features and the noise features that will eventually be in X. (The
#' only missing features are the proxies for the latent features Z.)}
#'
#' @author Gregory Faletto, Jacob Bien
#' @references
<<faletto2022>>
#' @keywords internal
#' @noRd
genZmuY <- function(n, p, k_unclustered, cluster_size, n_clusters, sig_clusters,
beta_latent, beta_unclustered, snr, sigma_eps_sq){
# Reject n < 2 up front: n = 1 collapses orig_feat_mat[, cols] to a vector
# (no drop=FALSE below), which fails downstream with a cryptic "incorrect
# number of dimensions". n is not validated by the shared input checkers, so
# guard it here -- the one point every generator funnels through (#70).
stopifnot(is.numeric(n) | is.integer(n))
stopifnot(n == round(n))
stopifnot(n >= 2)
# Generate Z, weak signal features, and noise features (total of
# p - n_clusters*(cluster_size - 1)) features)
p_orig_feat_mat <- p - n_clusters*(cluster_size - 1)
stopifnot(p_orig_feat_mat >= k_unclustered + n_clusters)
orig_feat_mat <- matrix(stats::rnorm(n*p_orig_feat_mat), n, p_orig_feat_mat)
# First n_clusters features are Z. Next k_unclustered features are weak
# signal features. Any remaining features are noise features.
Z <- orig_feat_mat[, 1:n_clusters]
other_X <- orig_feat_mat[, (n_clusters + 1):p_orig_feat_mat, drop = FALSE]
# Ready to create mu and y
if(sig_clusters == 0){
mu <- rep(0, n) # null latent model; weak-signal features add below
} else if(n_clusters > 1){
if(sig_clusters > 1){
mu <- Z[, seq_len(sig_clusters)] %*% rep(beta_latent, sig_clusters)
} else{
mu <- Z[, 1] * beta_latent
}
} else{
mu <- Z * beta_latent
}
for(j in 1:k_unclustered){
mu <- mu + beta_unclustered/sqrt(j)*other_X[, j]
}
mu <- as.numeric(mu)
# If SNR is null, use sigma_eps_sq
if(!is.na(sigma_eps_sq)){
sd <- sqrt(sigma_eps_sq)
}else{
sd <- sqrt(sum(mu^2) / (n * snr)) # taking snr = ||mu||^2 /(n * sigma^2)
}
stopifnot(is.numeric(sd) | is.integer(sd))
stopifnot(length(sd) == 1)
stopifnot(!is.na(sd))
stopifnot(sd >= 0)
y <- as.numeric(mu + rnorm(n, mean=0, sd=sd))
return(list(Z=Z, mu=mu, y=y, other_X=other_X))
}No tests for genZmuY().
getNoiseVar():
#' Get variance of noise to add to Z in order to yield proxies X with desired
#' correlations with Z
#'
#' @param cor A numeric vector of desired correlations for each proxy to have
#' with Z. Note: correlations must be positive.
#' @return A vector of variances of independent Gaussian random variables to add
#' to Z in order to yield proxies with the desired correlations with Z.
#' @author Gregory Faletto, Jacob Bien
#' @examples
#' # Noise variance needed so a proxy Z + N(0, v) attains each target
#' # correlation with Z (correlations must be positive).
#' getNoiseVar(c(0.9, 0.5, 1))
#' @export
getNoiseVar <- function(cor){
# Correlation between standard normal Z and X = Z + epsilon where epsilon
# is normal, independent of Z, and has mean 0 and variance sig_eps_sq:
#
# E[Z X]/sqrt{Var(Z) Var(X)}
# = (E[Z^2] + E[Z*epsilon])/sqrt{1*(1 + sig_eps_sq)}
# = (1 + 0)/sqrt{1 + sig_eps_sq}
#
# So we have
# cor = 1/sqrt{1 + sig_eps_sq}
# \iff 1 + sig_eps_sq = 1/cor^2
# \iff sig_eps_sq = 1/cor^2 - 1
stopifnot(is.numeric(cor) | is.integer(cor))
stopifnot(all(!is.na(cor)))
stopifnot(length(cor) >= 1)
stopifnot(all(cor > 0))
stopifnot(all(cor <= 1))
return(1/cor^2 - 1)
}finalizeGenClusteredData() is the shared tail of all three genClusteredData*
generators: once each generator has built its cluster-proxy matrix (the only
step where the three differ), this helper column-binds the proxies with the
remaining features, coerces the latent features to a matrix, runs the common
output dimension checks, and returns the standard list(X, y, Z, mu).
#' Assemble and validate the output of a genClusteredData* generator
#'
#' Shared tail of `genClusteredData()`, `genClusteredDataWeighted()`, and
#' `genClusteredDataWeightedRandom()`: column-bind the freshly generated cluster
#' proxies with the remaining features, coerce the latent features to a matrix,
#' run the common output dimension checks, and return the standard
#' `list(X, y, Z, mu)`.
#' @param proxy_mat Numeric matrix; the n by n_clusters*cluster_size block of
#' generated cluster proxies.
#' @param other_X Numeric matrix; the weak-signal and noise features (the
#' non-proxy columns of X), as returned by `genZmuY()`.
#' @param y Numeric vector; the response, as returned by `genZmuY()`.
#' @param mu Numeric vector; the expected response, as returned by `genZmuY()`.
#' @param Z The latent features, as returned by `genZmuY()` (a numeric vector
#' when n_clusters == 1); coerced to a matrix here.
#' @param n Integer or numeric; the number of observations.
#' @param p Integer or numeric; the total number of features (columns of X).
#' @param n_clusters Integer or numeric; the number of latent variables.
#' @return A list with elements X (numeric matrix), y (numeric vector), Z
#' (numeric matrix), and mu (numeric vector).
#' @author Gregory Faletto, Jacob Bien
#' @keywords internal
#' @noRd
finalizeGenClusteredData <- function(proxy_mat, other_X, y, mu, Z, n, p,
n_clusters){
X <- cbind(proxy_mat, other_X)
Z <- as.matrix(Z)
# Check output
stopifnot(length(mu) == n)
stopifnot(nrow(X) == n)
stopifnot(ncol(X) == p)
if(any(!is.na(Z))){
stopifnot(nrow(Z) == n)
stopifnot(ncol(Z) == n_clusters)
}
return(list(X=X, y=y, Z=Z, mu=mu))
}No tests for finalizeGenClusteredData(); it is exercised through the three
genClusteredData* generators (including their characterization snapshots).
#' Generate randomly sampled data including noisy observations of latent
#' variables, where proxies differ in their relevance (noise level)
#'
#' Generate a data set including latent features Z, observed features X (which
#' may include noisy or noiseless observations of the latent features in Z),
#' an observed response y which is a linear model of features from Z and X as
#' well as independent mean zero noise, and mu (the responses from y without
#' the added noise). Data is generated in the same way as in the simulations
#' from Section 5.3 of Faletto and Bien (2022).
#' @param n Integer or numeric; the number of observations to generate. (The
#' generated X and Z will have n rows, and the generated y and mu will have
#' length n.)
#' @param p Integer or numeric; the number of features to generate. The
#' generated X will have p columns.
#' @param k_unclustered Integer or numeric; the number of features in X that
#' will have nonzero coefficients in the true model for y among those features
#' not generated from the n_clusters latent variables (called "weak signal"
#' features in the simulations from Faletto and Bien 2022). The coefficients on
#' these features will be determined by beta_unclustered. Must be at least 1.
#' @param cluster_size Integer or numeric; for each of the n_clusters latent
#' variables, X will contain cluster_size noisy proxies that are correlated with
#' the latent variable. Must be at least 2.
#' @param n_strong_cluster_vars Integer or numeric; among the cluster_size
#' proxies in each cluster, the first n_strong_cluster_vars will have a high
#' correlation (rho_high) with the latent variable and the next cluster_size -
#' n_strong_cluster_vars will have a low correlation (rho_low) with the latent
#' variable.
#' @param n_clusters Integer or numeric; the number of latent variables to
#' generate, each of which will be associated with an observed cluster in X.
#' Must be at least 1. Default is 1.
#' @param sig_clusters Integer or numeric; the number of generated latent
#' features that will have nonzero coefficients in the true model for y (all of
#' them will have coefficient beta_latent). Must be less than or equal to
#' n_clusters. Default is 1.
#' @param rho_high Integer or numeric; the correlation of the "strong proxies"
#' in each cluster with the latent variable. rho_high cannot equal 0 and must be
#' at least as large as rho_low. Default is 0.9.
#' @param rho_low Integer or numeric; the correlation of the "weak proxies" in
#' each cluster with the latent variable. rho_low cannot equal 0 and must be no
#' larger than rho_high. Default is 0.5.
#' @param beta_latent Integer or numeric; the coefficient used for all
#' sig_clusters latent variables that have nonzero coefficients in the true
#' model for y. Can't equal 0. Default is 1.5.
#' @param beta_unclustered Integer or numeric; the maximum coefficient in the
#' model for y among the k_unclustered features in X not generated from the
#' latent variables. The coefficients of the features will be
#' beta_unclustered/sqrt(1:k_unclustered). Can't equal 0. Default is 1.
#' @param snr Integer or numeric; the signal-to-noise ratio of the response
#' y. If sigma_eps_sq is not specified, the variance of the noise in y will be
#' calculated using the formula sigma_eps_sq = sum(mu^2)/(n * snr). Only one of
#' snr and sigma_eps_sq must be specified. Default is NA.
#' @param sigma_eps_sq Integer or numeric; the variance on the noise added
#' to y (non-negative; 0 gives noiseless y). Only one of snr and sigma_eps_sq
#' must be specified. Default is NA.
#' @return A list of the following elements. \item{X}{An n x p numeric matrix of
#' n observations from a p-dimensional multivariate normal distribution
#' generated using the specified parameters. The first n_clusters times
#' cluster_size features will be the clusters of features correlated with the
#' n_clusters latent variables. The next k_unclustered features will be the
#' "weak signal" features, and the remaining p - n_clusters*cluster_size -
#' k_unclustered features will be the unclustered noise features.} \item{y}{A
#' length n numeric vector; the response generated from X, the latent features
#' from Z, and the coefficient vector, along with additive noise.} \item{Z}{The
#' latent features, as an n by n_clusters numeric matrix (one column per
#' latent variable). Note that (X, Z) is multivariate Gaussian.}
#' \item{mu}{A length `n` numeric vector; the expected response given X, Z, and
#' the true coefficient vector (equal to y minus the added noise).}
#' @author Gregory Faletto, Jacob Bien
#' @references
<<faletto2022>>
#' @examples
#' set.seed(1)
#' data <- genClusteredDataWeighted(n = 50, p = 11, k_unclustered = 2,
#' cluster_size = 4, n_strong_cluster_vars = 2, n_clusters = 1, snr = 3)
#' dim(data$X)
#' @export
genClusteredDataWeighted <- function(n, p, k_unclustered, cluster_size,
n_strong_cluster_vars, n_clusters=1, sig_clusters=1, rho_high=0.9,
rho_low=0.5, beta_latent=1.5, beta_unclustered=1, snr=as.numeric(NA),
sigma_eps_sq=as.numeric(NA)){
# Check inputs
checkGenClusteredDataWeightedInputs(p, k_unclustered, cluster_size,
n_strong_cluster_vars, n_clusters, sig_clusters, rho_high, rho_low,
beta_latent, beta_unclustered, snr, sigma_eps_sq)
ret <- genZmuY(n=n, p=p, k_unclustered=k_unclustered,
cluster_size=cluster_size, n_clusters=n_clusters,
sig_clusters=sig_clusters, beta_latent=beta_latent,
beta_unclustered=beta_unclustered, snr=snr, sigma_eps_sq=sigma_eps_sq)
Z <- ret$Z
y <- ret$y
mu <- ret$mu
other_X <- ret$other_X
# Finally, generate clusters of proxies to complete X.
noise_var_high <- getNoiseVar(rho_high)
noise_var_low <- getNoiseVar(rho_low)
# Create matrix of proxies
Z <- as.matrix(Z)
proxy_mat <- matrix(as.numeric(NA), n, n_clusters*cluster_size)
for(i in 1:n_clusters){
for(j in 1:n_strong_cluster_vars){
proxy_mat[, (i - 1)*cluster_size + j] <- Z[, i] + rnorm(n,
mean=0, sd=sqrt(noise_var_high))
}
for(j in (n_strong_cluster_vars + 1):cluster_size){
proxy_mat[, (i - 1)*cluster_size + j] <- Z[, i] + rnorm(n,
mean=0, sd=sqrt(noise_var_low))
}
}
return(finalizeGenClusteredData(proxy_mat, other_X, y, mu, Z, n, p,
n_clusters))
}checkGenClusteredDataWeightedInputs()
#' Check inputs to genClusteredDataWeighted
#'
#' @param p Integer or numeric; the number of features to generate. The
#' generated X will have p columns.
#' @param k_unclustered Integer or numeric; the number of features in X that
#' will have nonzero coefficients in the true model for y among those features
#' not generated from the n_clusters latent variables (called "weak signal"
#' features in the simulations from Faletto and Bien 2022). The coefficients on
#' these features will be determined by beta_unclustered.
#' @param cluster_size Integer or numeric; for each of the n_clusters latent
#' variables, X will contain cluster_size noisy proxies that are correlated with
#' the latent variable. Must be at least 2.
#' @param n_strong_cluster_vars Integer or numeric; among the cluster_size
#' proxies in each cluster, n_strong_cluster_vars will have a high correlation
#' (rho_high) with the latent variable and cluster_size - n_strong_cluster_vars
#' will have a low correlation (rho_low) with the latent variable.
#' @param n_clusters Integer or numeric; the number of latent variables to
#' generate, each of which will be associated with an observed cluster in X.
#' Must be at least 1. Default is 1.
#' @param sig_clusters Integer or numeric; the number of generated latent
#' features that will have nonzero coefficients in the true model for y (all of
#' them will have coefficient beta_latent). Must be less than or equal to
#' n_clusters. Default is 1.
#' @param rho_high Integer or numeric; the correlation of the "strong proxies"
#' in each cluster with the latent variable. rho_high cannot equal 0 and must be
#' at least as large as rho_low. Default is 0.9.
#' @param rho_low Integer or numeric; the correlation of the "weak proxies" in
#' each cluster with the latent variable. rho_low cannot equal 0 and must be no
#' larger than rho_high. Default is 0.5.
#' @param beta_latent Integer or numeric; the coefficient used for all
#' sig_clusters latent variables that have nonzero coefficients in the true
#' model for y. Can't equal 0. Default is 1.5.
#' @param beta_unclustered Integer or numeric; the maximum coefficient in the
#' model for y among the k_unclustered features in X not generated from the
#' latent variables. The coefficients of the features will be
#' beta_unclustered/sqrt(1:k_unclustered). Can't equal 0. Default is 1.
#' @param snr Integer or numeric; the signal-to-noise ratio of the response
#' y. If sigma_eps_sq is not specified, the variance of the noise in y will be
#' calculated using the formula sigma_eps_sq = sum(mu^2)/(n * snr). Only one of
#' snr and sigma_eps_sq must be specified. Default is NA.
#' @param sigma_eps_sq Integer or numeric; the variance on the noise added
#' to y. Only one of snr and sigma_eps_sq must be specified. Default is NA.
#' @return A list of the following elements. \item{X}{An n x p numeric matrix of
#' n observations from a p-dimensional multivariate normal distribution
#' generated using the specified parameters. The first n_clusters times
#' cluster_size features will be the clusters of features correlated with the
#' n_clusters latent variables. The next k_unclustered features will be the
#' "weak signal" features, and the remaining p - n_clusters*cluster_size -
#' k_unclustered features will be the unclustered noise features.} \item{y}{A
#' length n numeric vector; the response generated from X, the latent features
#' from Z, and the coefficient vector, along with additive noise.} \item{Z}{The
#' latent features, as an n by n_clusters numeric matrix (one column per
#' latent variable). Note that (X, Z) is multivariate Gaussian.}
#' \item{mu}{A length `n` numeric vector; the expected response given X, Z, and
#' the true coefficient vector (equal to y minus the added noise).}
#' @author Gregory Faletto, Jacob Bien
#' @keywords internal
#' @noRd
checkGenClusteredDataWeightedInputs <- function(p, k_unclustered, cluster_size,
n_strong_cluster_vars, n_clusters, sig_clusters, rho_high, rho_low,
beta_latent, beta_unclustered, snr, sigma_eps_sq){
checkGenClusteredDataInputsPre(sig_clusters, n_clusters, cluster_size)
stopifnot(is.integer(n_strong_cluster_vars) |
is.numeric(n_strong_cluster_vars))
stopifnot(!is.na(n_strong_cluster_vars))
stopifnot(length(n_strong_cluster_vars) == 1)
stopifnot(n_strong_cluster_vars == round(n_strong_cluster_vars))
stopifnot(n_strong_cluster_vars >= 1)
stopifnot(n_strong_cluster_vars < cluster_size)
stopifnot(rho_high <= 1)
stopifnot(rho_high > 0)
stopifnot(rho_low > 0)
stopifnot(rho_high >= rho_low)
checkGenClusteredDataInputsPost(p, k_unclustered, cluster_size, n_clusters,
beta_latent, beta_unclustered, snr, sigma_eps_sq)
checkGenClusteredDataInputsSnrSigma(snr, sigma_eps_sq)
}No tests for checkGenClusteredDataWeightedInputs()
genClusteredDataWeightedRandom()
#' Generate randomly sampled data including noisy observations of latent
#' variables, where proxies differ in their relevance (noise level)
#'
#' Generate a data set including latent features Z, observed features X (which
#' may include noisy or noiseless observations of the latent features in Z),
#' an observed response y which is a linear model of features from Z and X as
#' well as independent mean zero noise, and mu (the responses from y without
#' the added noise).
#' @param n Integer or numeric; the number of observations to generate. (The
#' generated X and Z will have n rows, and the generated y and mu will have
#' length n.)
#' @param p Integer or numeric; the number of features to generate. The
#' generated X will have p columns.
#' @param k_unclustered Integer or numeric; the number of features in X that
#' will have nonzero coefficients in the true model for y among those features
#' not generated from the n_clusters latent variables (called "weak signal"
#' features in the simulations from Faletto and Bien 2022). The coefficients on
#' these features will be determined by beta_unclustered. Must be at least 1.
#' @param cluster_size Integer or numeric; for each of the n_clusters latent
#' variables, X will contain cluster_size noisy proxies that are correlated with
#' the latent variable. Must be at least 2.
#' @param n_clusters Integer or numeric; the number of latent variables to
#' generate, each of which will be associated with an observed cluster in X.
#' Must be at least 1. Default is 1.
#' @param sig_clusters Integer or numeric; the number of generated latent
#' features that will have nonzero coefficients in the true model for y (all of
#' them will have coefficient beta_latent). Must be less than or equal to
#' n_clusters. Default is 1.
#' @param rho_high Integer or numeric; the maximum correlation of the proxies
#' in each cluster with the latent variable. rho_high cannot equal 0 and must be
#' at least as large as rho_low. Default is 1.
#' @param rho_low Integer or numeric; the minimum correlation of the proxies in
#' each cluster with the latent variable. rho_low cannot equal 0 and must be no
#' larger than rho_high. Default is 0.5.
#' @param beta_latent Integer or numeric; the coefficient used for all
#' sig_clusters latent variables that have nonzero coefficients in the true
#' model for y. Can't equal 0. Default is 1.5.
#' @param beta_unclustered Integer or numeric; the maximum coefficient in the
#' model for y among the k_unclustered features in X not generated from the
#' latent variables. The coefficients of the features will be
#' beta_unclustered/sqrt(1:k_unclustered). Can't equal 0. Default is 1.
#' @param snr Integer or numeric; the signal-to-noise ratio of the response
#' y. If sigma_eps_sq is not specified, the variance of the noise in y will be
#' calculated using the formula sigma_eps_sq = sum(mu^2)/(n * snr). Only one of
#' snr and sigma_eps_sq must be specified. Default is NA.
#' @param sigma_eps_sq Integer or numeric; the variance on the noise added
#' to y (non-negative; 0 gives noiseless y). Only one of snr and sigma_eps_sq
#' must be specified. Default is NA.
#' @return A list of the following elements. \item{X}{An n x p numeric matrix of
#' n observations from a p-dimensional multivariate normal distribution
#' generated using the specified parameters. The first n_clusters times
#' cluster_size features will be the clusters of features correlated with the
#' n_clusters latent variables. The next k_unclustered features will be the
#' "weak signal" features, and the remaining p - n_clusters*cluster_size -
#' k_unclustered features will be the unclustered noise features.} \item{y}{A
#' length n numeric vector; the response generated from X, the latent features
#' from Z, and the coefficient vector, along with additive noise.} \item{Z}{The
#' latent features, as an n by n_clusters numeric matrix (one column per
#' latent variable). Note that (X, Z) is multivariate Gaussian.}
#' \item{mu}{A length `n` numeric vector; the expected response given X, Z, and
#' the true coefficient vector (equal to y minus the added noise).}
#' @author Gregory Faletto, Jacob Bien
#' @references
<<faletto2022>>
#' @examples
#' set.seed(1)
#' data <- genClusteredDataWeightedRandom(n = 50, p = 11, k_unclustered = 2,
#' cluster_size = 4, n_clusters = 1, snr = 3)
#' dim(data$X)
#' @export
genClusteredDataWeightedRandom <- function(n, p, k_unclustered, cluster_size,
n_clusters=1, sig_clusters=1, rho_high=1, rho_low=0.5, beta_latent=1.5,
beta_unclustered=1, snr=as.numeric(NA), sigma_eps_sq=as.numeric(NA)){
# Check inputs
checkGenClusteredDataWeightedRandomInputs(p, k_unclustered, cluster_size,
n_clusters, sig_clusters, rho_high, rho_low,
beta_latent, beta_unclustered, snr, sigma_eps_sq)
ret <- genZmuY(n=n, p=p, k_unclustered=k_unclustered,
cluster_size=cluster_size, n_clusters=n_clusters,
sig_clusters=sig_clusters, beta_latent=beta_latent,
beta_unclustered=beta_unclustered, snr=snr, sigma_eps_sq=sigma_eps_sq)
Z <- ret$Z
y <- ret$y
mu <- ret$mu
other_X <- ret$other_X
# Finally, generate clusters of proxies to complete X.
# Create matrix of proxies
Z <- as.matrix(Z)
proxy_mat <- matrix(as.numeric(NA), n, n_clusters*cluster_size)
for(i in 1:n_clusters){
for(j in 1:cluster_size){
# Choose correlation at random
rho_ij <- runif(n=1, min=rho_low, max=rho_high)
# Get noise variance
noise_var_ij <- getNoiseVar(rho_ij)
proxy_mat[, (i - 1)*cluster_size + j] <- Z[, i] + rnorm(n,
mean=0, sd=sqrt(noise_var_ij))
}
}
return(finalizeGenClusteredData(proxy_mat, other_X, y, mu, Z, n, p,
n_clusters))
}checkGenClusteredDataWeightedRandomInputs()
#' Check inputs to genClusteredDataWeightedRandom
#'
#' @param p Integer or numeric; the number of features to generate. The
#' generated X will have p columns.
#' @param k_unclustered Integer or numeric; the number of features in X that
#' will have nonzero coefficients in the true model for y among those features
#' not generated from the n_clusters latent variables (called "weak signal"
#' features in the simulations from Faletto and Bien 2022). The coefficients on
#' these features will be determined by beta_unclustered.
#' @param cluster_size Integer or numeric; for each of the n_clusters latent
#' variables, X will contain cluster_size noisy proxies that are correlated with
#' the latent variable. Must be at least 2.
#' @param n_clusters Integer or numeric; the number of latent variables to
#' generate, each of which will be associated with an observed cluster in X.
#' Must be at least 1. Default is 1.
#' @param sig_clusters Integer or numeric; the number of generated latent
#' features that will have nonzero coefficients in the true model for y (all of
#' them will have coefficient beta_latent). Must be less than or equal to
#' n_clusters. Default is 1.
#' @param rho_high Integer or numeric; the maximum correlation of the proxies
#' in each cluster with the latent variable. rho_high cannot equal 0 and must be
#' at least as large as rho_low. Default is 1.
#' @param rho_low Integer or numeric; the minimum correlation of the proxies in
#' each cluster with the latent variable. rho_low cannot equal 0 and must be no
#' larger than rho_high. Default is 0.5.
#' @param beta_latent Integer or numeric; the coefficient used for all
#' sig_clusters latent variables that have nonzero coefficients in the true
#' model for y. Can't equal 0. Default is 1.5.
#' @param beta_unclustered Integer or numeric; the maximum coefficient in the
#' model for y among the k_unclustered features in X not generated from the
#' latent variables. The coefficients of the features will be
#' beta_unclustered/sqrt(1:k_unclustered). Can't equal 0. Default is 1.
#' @param snr Integer or numeric; the signal-to-noise ratio of the response
#' y. If sigma_eps_sq is not specified, the variance of the noise in y will be
#' calculated using the formula sigma_eps_sq = sum(mu^2)/(n * snr). Only one of
#' snr and sigma_eps_sq must be specified. Default is NA.
#' @param sigma_eps_sq Integer or numeric; the variance on the noise added
#' to y. Only one of snr and sigma_eps_sq must be specified. Default is NA.
#' @return A list of the following elements. \item{X}{An n x p numeric matrix of
#' n observations from a p-dimensional multivariate normal distribution
#' generated using the specified parameters. The first n_clusters times
#' cluster_size features will be the clusters of features correlated with the
#' n_clusters latent variables. The next k_unclustered features will be the
#' "weak signal" features, and the remaining p - n_clusters*cluster_size -
#' k_unclustered features will be the unclustered noise features.} \item{y}{A
#' length n numeric vector; the response generated from X, the latent features
#' from Z, and the coefficient vector, along with additive noise.} \item{Z}{The
#' latent features, as an n by n_clusters numeric matrix (one column per
#' latent variable). Note that (X, Z) is multivariate Gaussian.}
#' \item{mu}{A length `n` numeric vector; the expected response given X, Z, and
#' the true coefficient vector (equal to y minus the added noise).}
#' @author Gregory Faletto, Jacob Bien
#' @keywords internal
#' @noRd
checkGenClusteredDataWeightedRandomInputs <- function(p, k_unclustered,
cluster_size, n_clusters, sig_clusters, rho_high, rho_low, beta_latent,
beta_unclustered, snr, sigma_eps_sq){
checkGenClusteredDataInputsPre(sig_clusters, n_clusters, cluster_size)
stopifnot(rho_high <= 1)
stopifnot(rho_high > 0)
stopifnot(rho_low > 0)
stopifnot(rho_high >= rho_low)
checkGenClusteredDataInputsPost(p, k_unclustered, cluster_size, n_clusters,
beta_latent, beta_unclustered, snr, sigma_eps_sq)
checkGenClusteredDataInputsSnrSigma(snr, sigma_eps_sq)
}No tests for checkGenClusteredDataWeightedRandomInputs()
#' Get lambda value for lasso
#'
#' Chooses a lambda value for the lasso used on a subsample of size n/2 (as in
#' cluster stability selection) by cross-validation.
#' @param X An n x p numeric matrix (preferably) or a data.frame (which will
#' be coerced internally to a matrix by the function model.matrix) containing
#' the p >= 2 features/predictors that will be used by cluster stability
#' selection. Must not contain missing (`NA`) values.
#' @param y The response; an n-dimensional numeric or integer vector. (Unlike
#' in the more general css setup, this response must be real-valued since
#' lambda will be determined using the lasso with cross-validation.)
#' @param lambda_choice Character; either "min" or "1se". If "min", chooses
#' the lambda that minimizes the cross-validated error; if "1se", chooses the
#' largest lambda within one standard error of the minimum error lambda
#' (resulting in a smaller selected set, which may be desirable because the
#' model size corresponding to the minimum error lambda tends to be larger
#' than optimal. See, for example, Bühlmann and Meinshausen 2006, Prop. 1 and
#' Bühlmann and van de Geer 2011, Section 2.5.1.). Default is "1se".
#' @param nfolds Numeric or integer; the number of folds for cross-validation.
#' Must be at least 4 (as specified by cv.glmnet). Default is 10.
#' @param alpha Numeric; the elastic net mixing parameter. Must be in `(0, 1]`.
#' Default is 1 (in which case the penalty is for lasso).
#' @return A numeric; the selected value of lambda.
#' @author Gregory Faletto, Jacob Bien
#' @references Bühlmann, P., & Meinshausen, N. (2006). High-Dimensional Graphs
#' and Variable Selection With the Lasso. \emph{The Annals of Statistics},
#' 34(3), 1436–1462. \url{https://doi.org/10.1214/009053606000000281}.
#' \cr Peter Bühlmann and Sara van de Geer. Statistics for High-Dimensional
#' Data. \emph{Springer Series in Statistics}. Springer, Heidelberg, 2011. ISBN
#' 978-3-642-20191-2. \url{http://dx.doi.org/10.1007/978-3-642-20192-9}. \cr
#' Jerome Friedman, Trevor Hastie, Robert Tibshirani (2010). Regularization
#' Paths for Generalized Linear Models via Coordinate Descent. \emph{Journal of
#' Statistical Software}, 33(1), 1-22. URL \url{https://www.jstatsoft.org/v33/i01/}.
#' @examples
#' set.seed(1)
#' data <- genClusteredData(n = 50, p = 11, k_unclustered = 2,
#' cluster_size = 4, n_clusters = 1, snr = 3)
#' lambda <- getLassoLambda(X = data$X, y = data$y, nfolds = 5)
#' lambda
#' @export
getLassoLambda <- function(X, y, lambda_choice="1se", nfolds=10, alpha=1){
stopifnot(is.character(lambda_choice))
stopifnot(length(lambda_choice) == 1)
stopifnot(!is.na(lambda_choice))
stopifnot(lambda_choice %in% c("min", "1se"))
checkNoNAs(X, "X")
# Accept a data.frame like the other X-accepting exports (#51): coerce to a
# matrix via the shared helper. clusters = list() makes the column-count
# guard inert (getLassoLambda has no clusters argument).
X <- coerceDataFrameToMatrix(X, clusters = list())
stopifnot(is.matrix(X))
stopifnot(is.numeric(X) | is.integer(X))
# Require at least two features, mirroring css()'s stopifnot(p >= 2). A 1-col
# X (e.g. cssSelect()/cssPredict() called with p = 1) would otherwise reach
# cv.glmnet with a single column and fail cryptically; this fires the same
# clean "p >= 2 is not TRUE" message as css().
p <- ncol(X)
stopifnot(p >= 2)
n <- nrow(X)
stopifnot(is.numeric(nfolds) | is.integer(nfolds))
stopifnot(length(nfolds) == 1)
stopifnot(nfolds == round(nfolds))
stopifnot(nfolds > 3)
checkAlpha(alpha)
# Since we are using the lasso, we require y to be a real-valued response
# (unlike for the general cluster stability selection procedure, where y
# can have a more general format as long as a suitable feature selection
# function is provided by the user). checkFiniteY also rejects NA/NaN/Inf
# here, BEFORE the random subsample below, so a non-finite y fails fast and
# deterministically rather than only when the bad value lands in the draw.
checkFiniteY(y, "y")
stopifnot(n == length(y))
# Sample size to use: inflate n/2 by a factor of nfolds/(nfolds - 1),
# so that each individual lasso fit is of size floor(n/2)
n_sample <- min(round(n/2*nfolds/(nfolds - 1)), n)
# n_sample is the number of observations actually passed to cv.glmnet. If it
# is too small for even 3-fold CV, error clearly here (reporting the
# requested nfolds) instead of letting cv.glmnet crash with "nfolds must be
# bigger than 3". Otherwise clamp nfolds into [3, n_sample] so we never end
# up below glmnet's minimum or above the available sample size.
if(n_sample < 3){
stop(paste0("Sample size n = ", n, " is too small to choose lambda by ", nfolds,
"-fold cross-validation; supply a larger n, a smaller nfolds, or lambda directly."))
}
nfolds <- max(3L, min(nfolds, n_sample))
inds_size <- sample(1:n, n_sample)
size_results <- glmnet::cv.glmnet(x=X[inds_size, ], y=y[inds_size],
family="gaussian", nfolds=nfolds, alpha=alpha)
lambda_ret <- size_results[[paste("lambda.", lambda_choice, sep="")]]
# Check output
stopifnot(length(lambda_ret) == 1)
stopifnot(is.numeric(lambda_ret) | is.integer(lambda_ret))
stopifnot(lambda_ret >= 0)
return(lambda_ret)
}getModelSize():
#' Automated estimation of model size
#'
#' This function is uses the lasso with cross-validation to estimate the
#' model size. Before using the lasso, in each cluster all features will be
#' dropped from X except the one feature with the highest marginal correlation
#' with y, as in the protolasso (Reid and Tibshirani 2016).
#'
#' @param X An n x p numeric matrix (preferably) or a data.frame (which will
#' be coerced internally to a matrix by the function model.matrix) containing
#' the p >= 2 features/predictors. Must not contain missing (`NA`) values.
#' @param y A length-n numeric vector containing the responses; `y[i]` is the
#' response corresponding to observation `X[i, ]`. (Note that for the css
#' function, y does not have to be a numeric response, but for this function,
#' the underlying selection procedure is the lasso, so y must be a real-valued
#' response.)
#' @param clusters A named list where each entry is an integer vector of indices
#' of features that are in a common cluster, as in the output of css.
#' (The length of list clusters is equal to the number of clusters.) All
#' identified clusters must be non-overlapping, and all features must appear in
#' exactly one cluster (any unclustered features should be in their own
#' "cluster" of size 1). CAUTION: if the provided X is a data.frame that
#' contains a categorical feature with more than two levels, then the resulting
#' matrix made from model.matrix will have a different number of columns than
#' the provided data.frame, some of the feature numbers will change, and the
#' clusters argument will not work properly (in the current version of the
#' package). To get correct results in this case, please use model.matrix to
#' convert the data.frame to a numeric matrix on your own, then provide this
#' matrix and cluster assignments with respect to this matrix.
#' @param alpha Numeric; the elastic net mixing parameter for the
#' cross-validated fit used to estimate the model size. Must be in `(0, 1]`.
#' Default is 1 (in which case the penalty is the lasso). Set alpha to match
#' the alpha used for feature selection so the model-size estimate is
#' consistent with the elastic-net selection (see `cssSelect` / `cssPredict`).
#' @return An integer; the estimated size of the model. The minimum returned
#' value is 1, even if the lasso with cross-validation chose no features.
#' @author Gregory Faletto, Jacob Bien
#' @references Reid, S., & Tibshirani, R. (2016). Sparse regression and marginal
#' testing using cluster prototypes. \emph{Biostatistics}, 17(2), 364–376.
#' \url{https://doi.org/10.1093/biostatistics/kxv049}.
#' @examples
#' set.seed(1)
#' data <- genClusteredData(n = 50, p = 11, k_unclustered = 2,
#' cluster_size = 4, n_clusters = 1, snr = 3)
#' clusters <- list(cluster1 = 1:4)
#' getModelSize(X = data$X, y = data$y, clusters = clusters)
#' @export
getModelSize <- function(X, y, clusters, alpha = 1){
# Validate alpha (a single number in (0, 1]). alpha is threaded into the
# cv.glmnet fit below so the model-size estimate uses the same elastic-net
# mixing as feature selection.
checkAlpha(alpha)
stopifnot(is.matrix(X) | is.data.frame(X))
checkNoNAs(X, "X")
# Check if x is a matrix; if it's a data.frame, convert to matrix.
X <- coerceDataFrameToMatrix(X, clusters)
stopifnot(is.matrix(X))
stopifnot(is.numeric(X) | is.integer(X))
n <- nrow(X)
# Since the model size will be determined by cross-validation, the provided
# y must be real-valued and finite (this should be checked internally in
# other functions before getModelSize is called, but this check is here just
# in case).
checkFiniteY(y, "y")
stopifnot(length(y) == n)
# Check clusters argument
clusters <- checkCssClustersInput(clusters)
### Format clusters into a list where all features are in exactly one
# cluster (any unclustered features are put in their own "cluster" of size
# 1).
clust_names <- as.character(NA)
if(!is.null(names(clusters)) & is.list(clusters)){
clust_names <- names(clusters)
}
clusters <- formatClusters(clusters, p=ncol(X),
clust_names=clust_names)$clusters
X_size <- X
if(length(clusters) > 0){
# Create modified design matrix by dropping all but one feature from
# each cluster
feats_to_drop <- integer()
for(i in 1:length(clusters)){
cluster_i <- clusters[[i]]
if(length(cluster_i) > 1){
feat_to_keep <- identifyPrototype(cluster_i, X, y)
feats_to_drop <- c(feats_to_drop, setdiff(cluster_i,
feat_to_keep))
}
}
if(length(feats_to_drop) > 0){
X_size <- X_size[, -1*feats_to_drop, drop=FALSE]
}
}
# cv.glmnet requires at least two columns; if cluster reduction leaves a
# single feature, the only possible model size is 1.
if(ncol(X_size) < 2){
return(1L)
}
size_results <- glmnet::cv.glmnet(x=X_size, y=y, family="gaussian",
alpha=alpha)
coefs <- as.numeric(glmnet::coef.glmnet(size_results, s="lambda.1se"))
# Number of nonzero coefficients (subtract one in order to ignore intercept)
size <- length(coefs[coefs != 0]) - 1
# Check output
stopifnot(is.numeric(size) | is.integer(size))
stopifnot(!is.na(size))
stopifnot(length(size) == 1)
stopifnot(size == round(size))
return(as.integer(max(size, 1)))
}buildCssDf() builds the printable cluster-summary data.frame from a css_results object and an already-obtained named vector of selected clusters. It is factored out of printCssDf() (issue #129) so that a caller which has already computed the selected clusters – notably summary.cssr(), which calls getCssSelections() itself – can reuse that result instead of triggering a second identical getCssSelections() call. It is an internal (@noRd) helper, so the exported printCssDf() keeps its exact signature and man page:
#' Build the printable cluster-summary data.frame from selected clusters
#'
#' Constructs the data.frame returned by printCssDf() (and summary.cssr()) from
#' a css_results object and an already-obtained named vector of selected
#' clusters. Factored out of printCssDf() (#129) so callers that have already
#' computed the selected clusters (e.g. summary.cssr(), which calls
#' getCssSelections() itself) can reuse that result rather than triggering a
#' second identical getCssSelections() call. Because cluster selection is
#' invariant to the weighting rule, the sel_clusts computed by summary.cssr()
#' for any weighting matches the "sparse"-default selection printCssDf() would
#' recompute, so the returned data.frame is byte-identical either way.
#'
#' @param css_results An object of class "cssr" (the output of the function
#' css).
#' @param sel_clusts A named numeric vector of selected clusters with their
#' selection proportions, as returned in the selected_clusts element of
#' getCssSelections() / getSelectedClusters() (the value printCssDf() itself
#' obtains via getCssSelections()). May be empty (length 0), in which case a
#' well-formed zero-row data.frame is returned.
#' @return A data.frame; each row contains a cluster, arranged in decreasing
#' order of cluster selection proportion from top to bottom (see printCssDf()
#' for the full column description).
#' @author Gregory Faletto, Jacob Bien
#' @keywords internal
#' @noRd
buildCssDf <- function(css_results, sel_clusts){
# An empty selection is valid post-#107 (e.g. cutoff = 1 with
# min_num_clusts = 0 when no cluster clears the cutoff). Return a well-formed
# zero-row data.frame whose columns mirror the populated branch below for the
# same object: 5 columns (with ClustProtoName) when X has column names, 4
# otherwise -- names(prototypes) is non-NULL iff colnames(css_results$X) is,
# since prototype names come from colnames(X). (#120)
if(length(sel_clusts) == 0){
if(!is.null(colnames(css_results$X))){
return(data.frame(ClustName=character(0),
ClustProtoName=character(0), ClustProtoNum=integer(0),
ClustSelProp=numeric(0), ClustSize=integer(0)))
}
return(data.frame(ClustName=character(0), ClustProtoNum=integer(0),
ClustSelProp=numeric(0), ClustSize=integer(0)))
}
# Get prototypes (feature from each cluster with highest selection
# proportion, breaking ties by using marginal correlations of features with
# y from data provided to css if y is real-valued)
prototypes <- getSelectionPrototypes(css_results, sel_clusts)
# Cluster selection proportions
if(length(sel_clusts) > 1){
sel_clust_sel_props <- colMeans(css_results$clus_sel_mat[,
names(sel_clusts)])
} else{
sel_clust_sel_props <- mean(css_results$clus_sel_mat[,
names(sel_clusts)])
}
# Data.frame: name of cluster, cluster prototype, selection proportion,
# cluster size
if(!is.null(names(prototypes))){
print_df <- data.frame(ClustName=names(sel_clusts),
ClustProtoName=names(prototypes), ClustProtoNum=unname(prototypes),
ClustSelProp=sel_clust_sel_props, ClustSize=lengths(sel_clusts))
} else{
print_df <- data.frame(ClustName=names(sel_clusts),
ClustProtoNum=unname(prototypes), ClustSelProp=sel_clust_sel_props,
ClustSize=lengths(sel_clusts))
}
print_df <- print_df[order(print_df$ClustSelProp, decreasing=TRUE), ]
rownames(print_df) <- NULL
stopifnot(is.data.frame(print_df))
stopifnot(nrow(print_df) >= 1)
return(print_df)
}printCssDf():
#' Prepares a data.frame summarazing cluster stability selection output to print
#'
#' Print a summary of the information from the css function.
#' @param css_results An object of class "cssr" (the output of the function
#' css).
#' @param cutoff Numeric; the outputted data.frame will display only those
#' clusters with selection proportions equal to at least cutoff. Must be between
#' 0 and 1. Default is 0 (in which case either all clusters are displayed, or
#' max_num_clusts are, if max_num_clusts is specified).
#' @param min_num_clusts Integer or numeric; the minimum number of clusters to
#' use regardless of cutoff. (That is, if the chosen cutoff returns fewer than
#' min_num_clusts clusters, the cutoff will be increased until at least
#' min_num_clusts clusters are selected.) Default is 1.
#' @param max_num_clusts Integer or numeric; the maximum number of clusters to
#' use regardless of cutoff. (That is, if the chosen cutoff returns more than
#' max_num_clusts clusters, the cutoff will be decreased until at most
#' max_num_clusts clusters are selected.) Default is NA (in which case
#' max_num_clusts is ignored).
#' @return A data.frame; each row contains a cluster, arranged in decreasing
#' order of cluster selection proportion from top to bottom. The columns are
#' ClustName (the name of the cluster that was either provided to css or made by
#' css if no name was provided); ClustProtoName (the name of the selection
#' prototype from the cluster, which is the feature with the greatest individual
#' selection proportion among all the cluster members, with ties broken by
#' choosing the feature with the highest correlation with the response if the
#' response is real-valued; only returned if the features are named),
#' ClustProtoNum (the column number of the prototype in the X matrix provided to
#' css), ClustSelProp (the cluster's selection proportion), and ClustSize (the
#' size of the cluster).
#' @author Gregory Faletto, Jacob Bien
#' @examples
#' set.seed(1)
#' data <- genClusteredData(n = 50, p = 11, k_unclustered = 2,
#' cluster_size = 4, n_clusters = 1, snr = 3)
#' clusters <- list(cluster1 = 1:4)
#' res <- css(X = data$X, y = data$y, lambda = 0.01, clusters = clusters,
#' B = 10)
#' printCssDf(res)
#' @export
printCssDf <- function(css_results, cutoff=0, min_num_clusts=1,
max_num_clusts=NA){
# Check inputs
stopifnot(inherits(css_results, "cssr"))
checkCutoff(cutoff)
p <- ncol(css_results$feat_sel_mat)
checkMinNumClusts(min_num_clusts, p, length(css_results$clusters))
max_num_clusts <- checkMaxNumClusts(max_num_clusts, min_num_clusts, p,
length(css_results$clusters))
sel_clusts <- getCssSelections(css_results, cutoff=cutoff,
min_num_clusts=min_num_clusts,
max_num_clusts=max_num_clusts)$selected_clusts
# printCssDf keeps its exact exported signature; the table-building work
# (empty-selection handling, prototypes, selection proportions, ordering)
# now lives in the internal buildCssDf() helper so summary.cssr() can reuse
# an already-computed sel_clusts instead of making a second identical
# getCssSelections() call. Byte-identical: buildCssDf() consumes sel_clusts
# exactly as this body did. (#129)
return(buildCssDf(css_results, sel_clusts))
}#' Identify selection prototypes from selected clusters
#'
#' Takes in list of selected clusters and returns an integer vector of the
#' indices of the features that were most frequently selected from each cluster
#'
#' @param css_results An object of class "cssr" (the output of the function
#' css).
#' @param selected_clusts A list of integer vectors; each vector must contain
#' the indices of features in a cluster.
#' @return An integer vector (of length equal to the number of clusters) of the
#' indices of the feature prototypes (the features from each cluster that were
#' selected the most often individually by the base method in cluster stability
#' selection). In the case of a tie, the tie is broken by choosing the feature
#' most correlated with the response in the full data set provided to css.
#' @author Gregory Faletto, Jacob Bien
#' @keywords internal
#' @noRd
getSelectionPrototypes <- function(css_results, selected_clusts){
# Check inputs
stopifnot(inherits(css_results, "cssr"))
stopifnot(is.list(selected_clusts))
n_selected_clusts <- length(selected_clusts)
# An empty selection is valid post-#107: getCssSelections() can return zero
# selected clusters (e.g. cutoff = 1 with min_num_clusts = 0 when no cluster
# clears the cutoff). Return an empty integer vector before the non-empty
# guards below; this also avoids the 1:0 foot-gun in the for loop. (#120)
if(n_selected_clusts == 0){
return(integer(0))
}
stopifnot(n_selected_clusts >= 1)
stopifnot(all(lengths(selected_clusts) >= 1))
prototypes <- rep(as.integer(NA), n_selected_clusts)
feat_sel_props <- colMeans(css_results$feat_sel_mat)
for(i in 1:n_selected_clusts){
clust_i <- selected_clusts[[i]]
sel_props_i <- feat_sel_props[clust_i]
proto_i <- clust_i[sel_props_i == max(sel_props_i)]
stopifnot(length(proto_i) >= 1)
if(length(proto_i) > 1){
if(is.numeric(css_results$y) | is.integer(css_results$y)){
# Break tie by marginal correlation. suppressWarnings + NA->0
# mirrors identifyPrototype (#59): a constant column has
# undefined correlation (cor() returns NA with a
# "standard deviation is zero" warning) -- treat it as 0 so
# it never wins the tie-break and never poisons max() (#68).
corrs_i <- suppressWarnings(abs(stats::cor(
css_results$X[, proto_i, drop = FALSE], css_results$y)[, 1]))
corrs_i[is.na(corrs_i)] <- 0
proto_i <- proto_i[corrs_i == max(corrs_i)]
}
}
# If there is still a tie, break by choosing the first feature of those
# remaining
prototypes[i] <- proto_i[1]
names(prototypes)[i] <- colnames(css_results$X)[proto_i]
}
# Check output
stopifnot(is.integer(prototypes))
stopifnot(all(!is.na(prototypes)))
stopifnot(length(prototypes) == length(unique(prototypes)))
return(prototypes)
}#' Print cluster stability selection output
#'
#' Print a summary of the information from the css function (using output from
#' printCssDf function).
#' @param x An object of class "cssr" (the output of the function css).
#' @param cutoff Numeric; print.cssr will display only those
#' clusters with selection proportions equal to at least cutoff. Must be between
#' 0 and 1. Default is 0 (in which case either all clusters are displayed, or
#' max_num_clusts are, if max_num_clusts is specified).
#' @param min_num_clusts Integer or numeric; the minimum number of clusters to
#' use regardless of cutoff. (That is, if the chosen cutoff returns fewer than
#' min_num_clusts clusters, the cutoff will be increased until at least
#' min_num_clusts clusters are selected.) Default is 1.
#' @param max_num_clusts Integer or numeric; the maximum number of clusters to
#' use regardless of cutoff. (That is, if the chosen cutoff returns more than
#' max_num_clusts clusters, the cutoff will be decreased until at most
#' max_num_clusts clusters are selected.) Default is NA (in which case
#' max_num_clusts is ignored).
#' @param ... Additional arguments to generic print.data.frame function
#' @return Invisibly, the unchanged `cssr` object `x` (following the standard
#' convention for `print` methods). Called for its side effect: printing a
#' summary table with one row per cluster, arranged in decreasing order of
#' cluster selection proportion from top to bottom. The printed columns are
#' ClustName (the name of the cluster that was either provided to css or made by
#' css if no name was provided); ClustProtoName (the name of the selection
#' prototype from the cluster, which is the feature with the greatest individual
#' selection proportion among all the cluster members, with ties broken by
#' choosing the feature with the highest correlation with the response if the
#' response is real-valued; only shown if the features are named); ClustProtoNum
#' (the column number of the prototype in the X matrix provided to css);
#' ClustSelProp (the cluster's selection proportion); and ClustSize (the size of
#' the cluster).
#' @author Gregory Faletto, Jacob Bien
#' @export
print.cssr <- function(x, cutoff=0, min_num_clusts=1, max_num_clusts=NA, ...){
df <- printCssDf(css_results=x, cutoff=cutoff,
min_num_clusts=min_num_clusts, max_num_clusts=max_num_clusts)
if(nrow(df) == 0){
# Empty selection (#120): no table to print.
cat(sprintf("(no clusters selected at cutoff %s)\n", format(cutoff)))
} else{
print.data.frame(df, ...)
}
invisible(x)
}
5.1 Ergonomic S3 accessors: selected() and summary.cssr()
The functions above (getCssSelections(), printCssDf(), print.cssr()) already expose everything an end user needs, but they take the cssr object as css_results and return heterogeneous output. For a more idiomatic R experience – mirroring the selected() and summary.stabsel() interface of the stabs package – we add a small S3 layer on top of these internals. selected() is a generic accessor that returns the selected clusters (or features) from a fitted cssr object, and summary.cssr() returns a compact, computable overview object with its own print method. Both are thin wrappers over getCssSelections()/printCssDf(); they introduce no new methodology.
selected() generic:
#' Extract the selected clusters or features from cluster stability selection
#'
#' A convenient S3 accessor for the selected clusters (the default) or the
#' selected features of a fitted `cssr` object, without re-running the
#' (computationally expensive) subsampling. This is a thin wrapper around
#' [getCssSelections()].
#'
#' Note that, unlike the `selected()` accessor in the \pkg{stabs} package (which
#' returns the selected *variables*), the default here returns the selected
#' *clusters* -- the natural unit of cluster stability selection. Pass
#' `type = "features"` to obtain the flat integer vector of selected features
#' instead.
#'
#' @param object An object of class "cssr" (the output of the function [css()]).
#' @param ... Additional arguments passed to methods (currently unused).
#' @return For the `cssr` method: if `type = "clusters"` (the default), a named
#' list of integer vectors (each the indices of the features in one selected
#' cluster; an empty list if no cluster is selected); if `type = "features"`, a
#' named integer vector of the indices of the selected features (an empty integer
#' vector if none).
#' @author Gregory Faletto, Jacob Bien
#' @references
<<faletto2022>>
#' @seealso [summary.cssr()] for an overview (counts plus a per-cluster table);
#' [getCssSelections()] for the underlying selection (clusters, features, and
#' weights together); [printCssDf()] and [print.cssr()] for the printed summary.
#' @examples
#' set.seed(1)
#' data <- genClusteredData(n = 50, p = 11, k_unclustered = 2,
#' cluster_size = 4, n_clusters = 1, snr = 3)
#' clusters <- list(cluster1 = 1:4)
#' res <- css(X = data$X, y = data$y, lambda = 0.01, clusters = clusters,
#' B = 10)
#' # Selected clusters (the default):
#' selected(res)
#' # Selected features (a flat integer vector):
#' selected(res, type = "features")
#' @export
selected <- function(object, ...){
UseMethod("selected")
}selected.cssr() method:
#' @param type Character; either "clusters" (the default) to return the named
#' list of selected clusters, or "features" to return the flat integer vector of
#' selected features. May be abbreviated.
#' @param cutoff Numeric; only those clusters with selection proportions equal to
#' at least cutoff will be selected. Must be between 0 and 1. Default is 0 (in
#' which case either all clusters are selected, or max_num_clusts are selected,
#' if max_num_clusts is specified).
#' @param min_num_clusts Integer or numeric; the minimum number of clusters to
#' use regardless of cutoff. (That is, if the chosen cutoff returns fewer than
#' min_num_clusts clusters, the cutoff will be lowered until at least
#' min_num_clusts clusters are selected.) Default is 1. May be set to 0 to allow
#' a pure cutoff-based (threshold) selection that returns an empty result when no
#' cluster's selection proportion meets the cutoff.
#' @param max_num_clusts Integer or numeric; the maximum number of clusters to
#' use regardless of cutoff. (That is, if the chosen cutoff returns more than
#' max_num_clusts clusters, the cutoff will be raised until at most
#' max_num_clusts clusters are selected.) Default is NA (in which case
#' max_num_clusts is ignored).
#' @param weighting Character; passed to [getCssSelections()] to determine the
#' weights of individual features within the selected clusters. This affects
#' ONLY `type = "features"` (it determines which cluster members have nonzero
#' weight and are therefore returned); it is a no-op for `type = "clusters"`,
#' whose selected clusters, selection proportions, and sizes do not depend on the
#' weighting. Must be one of "sparse", "weighted_avg", or "simple_avg". See
#' [getCssSelections()] for details. Default is "sparse".
#' @rdname selected
#' @export
selected.cssr <- function(object, type=c("clusters", "features"), cutoff=0,
min_num_clusts=1, max_num_clusts=NA, weighting="sparse", ...){
stopifnot(inherits(object, "cssr"))
type <- match.arg(type)
# Thin wrapper: getCssSelections validates all inputs (checkCutoff,
# checkWeighting, checkMinNumClusts, checkMaxNumClusts) and cleanly returns
# empty results on an empty selection (e.g. cutoff = 1, min_num_clusts = 0).
sel <- getCssSelections(object, weighting=weighting, cutoff=cutoff,
min_num_clusts=min_num_clusts, max_num_clusts=max_num_clusts)
if(type == "clusters"){
return(sel$selected_clusts)
}
return(sel$selected_feats)
}summary.cssr():
#' Summarize cluster stability selection output
#'
#' Produce a concise, computable overview of a fitted `cssr` object: a header
#' with the number of selected clusters and features (at the given cutoff), a
#' per-cluster table built on top of [printCssDf()], and a placeholder for
#' per-family error rate (PFER) error-control content.
#'
#' The returned object has class "summary.cssr" and an accompanying print method
#' ([print.summary.cssr()]). Its `table` element is exactly the data.frame
#' returned by [printCssDf()]: one row per selected cluster, sorted by selection
#' proportion, with columns ClustName, ClustProtoName (only if the features are
#' named), ClustProtoNum, ClustSelProp (the cluster's selection proportion) and
#' ClustSize.
#'
#' If the selection is empty (for example `cutoff = 1` with `min_num_clusts = 0`
#' when no cluster reaches the cutoff), `summary.cssr` returns a well-formed
#' zero-row object without error: `table` is NULL and the header reports
#' "0 clusters / 0 features selected". In this case [printCssDf()] and its helper
#' are deliberately NOT called, as they require at least one selected cluster.
#'
#' The `pfer` element is a documented placeholder (NA) reserved for error-control
#' content to be filled in by a future release (issue #87).
#'
#' @param object An object of class "cssr" (the output of the function [css()]).
#' @param cutoff Numeric; only those clusters with selection proportions equal to
#' at least cutoff are summarized. Must be between 0 and 1. Default is 0 (in
#' which case either all clusters are summarized, or max_num_clusts are, if
#' max_num_clusts is specified).
#' @param min_num_clusts Integer or numeric; the minimum number of clusters to
#' use regardless of cutoff. Default is 1. May be set to 0 to allow a pure
#' cutoff-based (threshold) selection that can be empty.
#' @param max_num_clusts Integer or numeric; the maximum number of clusters to
#' use regardless of cutoff. Default is NA (in which case max_num_clusts is
#' ignored).
#' @param weighting Character; passed to [getCssSelections()] to determine the
#' selected features and hence the selected-feature count in the header. As in
#' [selected()], it does NOT affect which clusters are selected or the
#' per-cluster table. Must be one of "sparse", "weighted_avg", or "simple_avg".
#' Default is "sparse".
#' @param ... Additional arguments (currently unused).
#' @return An object of class "summary.cssr": a named list with elements
#' \item{n_selected_clusts}{Integer; the number of selected clusters.}
#' \item{n_selected_feats}{Integer; the number of selected features.}
#' \item{cutoff}{Numeric; the cutoff that was used.}
#' \item{table}{A data.frame with one row per selected cluster (the output of
#' [printCssDf()]), or NULL when the selection is empty.}
#' \item{pfer}{A placeholder (NA) for per-family error rate error-control
#' content, to be filled in by a future release (issue #87).}
#' @author Gregory Faletto, Jacob Bien
#' @seealso [selected()] to extract just the selected clusters or features;
#' [getCssSelections()] for the underlying selection; [printCssDf()] for the
#' per-cluster data.frame; [print.cssr()] for the analogous printed summary.
#' @examples
#' set.seed(1)
#' data <- genClusteredData(n = 50, p = 11, k_unclustered = 2,
#' cluster_size = 4, n_clusters = 1, snr = 3)
#' clusters <- list(cluster1 = 1:4)
#' res <- css(X = data$X, y = data$y, lambda = 0.01, clusters = clusters,
#' B = 10)
#' summary(res)
#' @export
summary.cssr <- function(object, cutoff=0, min_num_clusts=1, max_num_clusts=NA,
weighting="sparse", ...){
stopifnot(inherits(object, "cssr"))
# Selected clusters + features. getCssSelections validates all inputs
# (checkCutoff / checkWeighting / checkMinNumClusts / checkMaxNumClusts) and
# is empty-safe (a clean empty result when e.g. cutoff = 1 and
# min_num_clusts = 0).
sel <- getCssSelections(object, weighting=weighting, cutoff=cutoff,
min_num_clusts=min_num_clusts, max_num_clusts=max_num_clusts)
n_selected_clusts <- length(sel$selected_clusts)
n_selected_feats <- length(sel$selected_feats)
# Empty-selection path: do NOT call printCssDf() / getSelectionPrototypes()
# here. Both assert at least one selected cluster (printCssDf: nrow >= 1;
# getSelectionPrototypes: n_selected_clusts >= 1) and so would die opaquely
# on an empty selection (#107). Return a well-formed zero-row summary.
if(n_selected_clusts >= 1){
# Reuse the sel$selected_clusts already computed above instead of a
# second identical getCssSelections() inside printCssDf(). Cluster
# selection is weighting-invariant, so sel$selected_clusts (for this
# weighting) matches the "sparse"-default selection printCssDf() would
# recompute -- the resulting table is byte-identical. (#129)
clust_table <- buildCssDf(object, sel$selected_clusts)
} else{
clust_table <- NULL
}
out <- list(n_selected_clusts=n_selected_clusts,
n_selected_feats=n_selected_feats, cutoff=cutoff,
table=clust_table, pfer=NA_real_)
class(out) <- "summary.cssr"
return(out)
}print.summary.cssr() (mirrors the print.cssr() S3-method pattern – roxygen with @export, no @examples):
#' @param x An object of class "summary.cssr" (the output of [summary.cssr()]).
#' @rdname summary.cssr
#' @export
print.summary.cssr <- function(x, ...){
cat("Cluster stability selection summary\n")
cat(sprintf("%d cluster%s / %d feature%s selected at cutoff %s\n",
x$n_selected_clusts, if(x$n_selected_clusts == 1) "" else "s",
x$n_selected_feats, if(x$n_selected_feats == 1) "" else "s",
format(x$cutoff)))
if(is.null(x$table)){
cat("(no clusters selected at this cutoff)\n")
} else{
cat("\n")
print.data.frame(x$table, ...)
}
# PFER placeholder: error-control content to be filled in by issue #87.
cat(sprintf("\nPFER: %s (not yet implemented; see issue #87)\n",
format(x$pfer)))
invisible(x)
}