5 Other useful functions

The following are some other functions that are useful for specific tasks.

  • genClusteredData() generates jointly Gaussian data sets that include clusters of features that are correlated with a latent feature (the latent features themselves are also provided), as in the simulation in Example 1, Section 5.1, and Section 5.2 of Faletto and Bien (2022).

  • genClusteredDataWeighted() generates jointly Gaussian data sets that include clusters of features that are correlated with a latent feature with unequal correlations (the latent features themselves are also provided), as in the simulation Section 5.3 of Faletto and Bien (2022).

  • genClusteredDataWeightedRandom() generates jointly Gaussian data sets that include clusters of features that are correlated with a latent feature with unequal correlations (the latent features themselves are also provided) that are determined at random.

  • getLassoLambda() uses a data set to choose a good value of lambda for subsamples of size n/2 (as in cluster stability selection) by cross-validation.

  • getModelSize() estimates the best size for a lasso model on a provided data set.

  • printCssDf() converts the output of the css() function (an object of class cssr) to a data.frame that is ready to be printed. We provide end-user access to printCssDf() because this data.frame itself may be useful as an object that summarizes the output from css().

  • print.cssr() is the print function for objects of class cssr.

  • genClusteredData()

    • checkGenClusteredDataInputs() checks that the inputs to genClusteredData() are as expected.
    • genZmuY() generates the latent Z, the weak signal and noise features in X, the latent mu, and the observed y from the specified parameters.
    • getNoiseVar() calculates the variance of the noise to add to Z so that the proxies have the specified correlation with the latent variable Z.
  • genClusteredDataWeighted()

  • getLassoLambda()

  • getModelSize()

  • printCssDf()

    • getSelectionPrototypes() identifies the prototypes from selected clusters (the feature in each cluster that is most correlated with the response)
  • print.cssr()

genClusteredData()

#' Generate randomly sampled data including noisy observations of latent
#' variables
#'
#' Generate a data set including latent features Z, observed features X (which
#' may include noisy or noiseless observations of the latent features in Z),
#' an observed response y which is a linear model of features from Z and X as
#' well as independent mean zero noise, and mu (the responses from y without
#' the added noise). Data is generated in the same way as in the simulations
#' from Faletto and Bien (2022).
#' @param n Integer or numeric; the number of observations to generate. (The
#' generated X and Z will have n rows, and the generated y and mu will have
#' length n.)
#' @param p Integer or numeric; the number of features to generate. The
#' generated X will have p columns.
#' @param k_unclustered Integer or numeric; the number of features in X that
#' will have nonzero coefficients in the true model for y among those features 
#' not generated from the n_clusters latent variables (called "weak signal" 
#' features in the simulations from Faletto and Bien 2022). The coefficients on
#' these features will be determined by beta_unclustered. Must be at least 1.
#' @param cluster_size Integer or numeric; for each of the n_clusters latent
#' variables, X will contain cluster_size noisy proxies that are correlated with
#' the latent variable. Must be at least 2.
#' @param n_clusters Integer or numeric; the number of latent variables to
#' generate, each of which will be associated with an observed cluster in X.
#' Must be at least 1. Default is 1.
#' @param sig_clusters Integer or numeric; the number of generated latent
#' features that will have nonzero coefficients in the true model for y (all of
#' them will have coefficient beta_latent). Must be less than or equal to
#' n_clusters. Default is 1.
#' @param rho Integer or numeric; the correlation of the proxies in each cluster
#' with the latent variable. Must be greater than 0. Default is 0.9.
#' @param beta_latent Integer or numeric; the coefficient used for all
#' sig_clusters latent variables that have nonzero coefficients in the true
#' model for y. Can't equal 0. Default is 1.5.
#' @param beta_unclustered Integer or numeric; the maximum coefficient in the
#' model for y among the k_unclustered features in X not generated from the
#' latent variables. The coefficients of the features will be
#' beta_unclustered/sqrt(1:k_unclustered). Can't equal 0. Default is 1.
#' @param snr Integer or numeric; the signal-to-noise ratio of the response
#' y. If sigma_eps_sq is not specified, the variance of the noise in y will be
#' calculated using the formula sigma_eps_sq = sum(mu^2)/(n * snr). Only one of
#' snr and sigma_eps_sq must be specified. Default is NA.
#' @param sigma_eps_sq Integer or numeric; the variance on the noise added
#' to y (non-negative; 0 gives noiseless y). Only one of snr and sigma_eps_sq
#' must be specified. Default is NA.
#' @return A list of the following elements. \item{X}{An n x p numeric matrix of
#' n observations from a p-dimensional multivariate normal distribution
#' generated using the specified parameters. The first n_clusters times
#' cluster_size features will be the clusters of features correlated with the
#' n_clusters latent variables. The next k_unclustered features will be the
#' "weak signal" features, and the remaining p - n_clusters*cluster_size -
#' k_unclustered features will be the unclustered noise features.} \item{y}{A
#' length n numeric vector; the response generated from X, the latent features
#' from Z, and the coefficient vector, along with additive noise.} \item{Z}{The
#' latent features, as an n by n_clusters numeric matrix (one column per
#' latent variable). Note that (X, Z) is multivariate Gaussian.}
#' \item{mu}{A length `n` numeric vector; the expected response given X, Z, and
#' the true coefficient vector (equal to y minus the added noise).}
#' @author Gregory Faletto, Jacob Bien
#' @references
<<faletto2022>>
#' @importFrom stats rnorm runif
#' @examples
#' set.seed(1)
#' data <- genClusteredData(n = 50, p = 11, k_unclustered = 2,
#'   cluster_size = 4, n_clusters = 1, snr = 3)
#' dim(data$X)
#' length(data$y)
#' @export
genClusteredData <- function(n, p, k_unclustered, cluster_size, n_clusters=1,
    sig_clusters=1, rho=0.9, beta_latent=1.5, beta_unclustered=1,
    snr=as.numeric(NA), sigma_eps_sq=as.numeric(NA)){

    # Check inputs
    checkGenClusteredDataInputs(p, k_unclustered, cluster_size, n_clusters,
        sig_clusters, rho, beta_latent, beta_unclustered, snr,
        sigma_eps_sq)

    ret <- genZmuY(n=n, p=p, k_unclustered=k_unclustered,
        cluster_size=cluster_size, n_clusters=n_clusters,
        sig_clusters=sig_clusters, beta_latent=beta_latent,
        beta_unclustered=beta_unclustered, snr=snr, sigma_eps_sq=sigma_eps_sq)

    Z <- ret$Z
    y <- ret$y
    mu <- ret$mu
    other_X <- ret$other_X

    # Finally, generate clusters of proxies to complete X. First, get needed
    # variances of noise to add
    noise_var <- getNoiseVar(rho)

    # Generate these noise features
    noise_mat <- matrix(stats::rnorm(n*n_clusters*cluster_size, mean=0,
        sd=sqrt(noise_var)), n, n_clusters*cluster_size)

    # Create matrix of proxies. Each cluster's cluster_size proxies are its
    # latent Z column plus the matching noise columns (#58); column-expand Z to
    # the noise block layout in one vectorized add (replaces the per-cluster
    # block loop; noise_mat was already drawn above, so RNG order is unchanged).
    # For n_clusters == 1, genZmuY returns Z as a bare numeric vector, and
    # as.matrix(Z) promotes it to a one-column matrix before the column-expand.
    col_clust <- rep(1:n_clusters, each = cluster_size)
    proxy_mat <- as.matrix(Z)[, col_clust, drop = FALSE] + noise_mat

    return(finalizeGenClusteredData(proxy_mat, other_X, y, mu, Z, n, p,
        n_clusters))
}

The three genClusteredData* validators share the bulk of their checks. To keep them from drifting apart (the source of the validation inconsistencies fixed in #13), the common checks live in two shared helpers: checkGenClusteredDataInputsPre() (the checks that precede each generator’s correlation-specific checks) and checkGenClusteredDataInputsPost() (the checks that follow them). Each validator calls Pre(), then its own correlation/strength checks, then Post(), then its own snr/sigma_eps_sq checks – preserving each validator’s exact original check order.

#' Shared leading input checks for the genClusteredData* generators
#'
#' The checks every genClusteredData* validator runs before its
#' correlation-specific checks (sig_clusters, n_clusters, and cluster_size).
#' @param sig_clusters,n_clusters,cluster_size As documented in
#' `checkGenClusteredDataInputs()`.
#' @return No return value; called for the side effect of erroring on bad input.
#' @author Gregory Faletto, Jacob Bien
#' @keywords internal
#' @noRd
checkGenClusteredDataInputsPre <- function(sig_clusters, n_clusters,
    cluster_size){

    stopifnot(is.numeric(sig_clusters) | is.integer(sig_clusters))
    stopifnot(sig_clusters <= n_clusters)
    stopifnot(sig_clusters >= 0)
    stopifnot(sig_clusters == round(sig_clusters))

    stopifnot(is.numeric(n_clusters) | is.integer(n_clusters))
    stopifnot(n_clusters == round(n_clusters))
    # TODO(gregfaletto): is it easy to remove the requirement that n_clusters is
    # at least 1 (so that it's possible to generate data with no latent
    # features)? If so, should only check that cluster_size >= 1 if n_clusters
    # >= 1, and in makeCovarianceMatrix function only need block_size >= 1
    # rather than 2.
    stopifnot(n_clusters >= 1)

    stopifnot(is.numeric(cluster_size) | is.integer(cluster_size))
    stopifnot(cluster_size == round(cluster_size))
    stopifnot(cluster_size >= 2)
}

#' Shared trailing input checks for the genClusteredData* generators
#'
#' The checks every genClusteredData* validator runs after its
#' correlation-specific checks: the beta coefficients, k_unclustered, the
#' feature-count lower bound, and that exactly one of snr / sigma_eps_sq is
#' given. (Each validator then runs its own type/range checks on whichever of
#' snr / sigma_eps_sq was supplied.)
#' @param p,k_unclustered,cluster_size,n_clusters,beta_latent,beta_unclustered,snr,sigma_eps_sq
#' As documented in `checkGenClusteredDataInputs()`.
#' @return No return value; called for the side effect of erroring on bad input.
#' @author Gregory Faletto, Jacob Bien
#' @keywords internal
#' @noRd
checkGenClusteredDataInputsPost <- function(p, k_unclustered, cluster_size,
    n_clusters, beta_latent, beta_unclustered, snr, sigma_eps_sq){

    stopifnot(beta_latent != 0)
    stopifnot(beta_unclustered != 0)

    stopifnot(is.numeric(k_unclustered) | is.integer(k_unclustered))
    stopifnot(k_unclustered >= 1)
    stopifnot(k_unclustered == round(k_unclustered))

    stopifnot(is.numeric(p) | is.integer(p))
    stopifnot(p == round(p))
    stopifnot(p >= n_clusters*cluster_size + k_unclustered)

    # Same as make_sparse_blocked_linear_model_random, but ith coefficient
    # of weak signal features is beta_unclustered/sqrt(i) in order to have
    # a definitive ranking of weak signal features.
    if(is.na(snr) & is.na(sigma_eps_sq)){
        stop("Must specify one of snr or sigma_eps_sq")
    }

    if(!is.na(snr) & !is.na(sigma_eps_sq)){
        stop("Only one of snr and sigma_eps_sq may be specified")
    }
}

#' Shared snr / sigma_eps_sq input checks for the genClusteredData* generators
#'
#' Validates whichever of snr / sigma_eps_sq was supplied (exactly one is non-NA
#' by this point, enforced by `checkGenClusteredDataInputsPost()`): it must be
#' numeric and length 1, and positive (snr) or non-negative (sigma_eps_sq).
#' Previously only `genClusteredData()` ran the full type/length checks; the two
#' weighted generators ran a leaner version that silently accepted a
#' non-numeric snr / sigma_eps_sq (which then failed later inside `genZmuY()`).
#' Sharing this helper makes the three consistent (#35).
#' @param snr,sigma_eps_sq As documented in `checkGenClusteredDataInputs()`.
#' @return No return value; called for the side effect of erroring on bad input.
#' @author Gregory Faletto, Jacob Bien
#' @keywords internal
#' @noRd
checkGenClusteredDataInputsSnrSigma <- function(snr, sigma_eps_sq){
    if(is.na(snr)){
        stopifnot(all(!is.na(sigma_eps_sq)))
        stopifnot(is.numeric(sigma_eps_sq) | is.integer(sigma_eps_sq))
        stopifnot(length(sigma_eps_sq) == 1)
        stopifnot(sigma_eps_sq >= 0)
    } else{
        stopifnot(is.numeric(snr) | is.integer(snr))
        stopifnot(length(snr) == 1)
        stopifnot(snr > 0)
    }
}

checkGenClusteredDataInputs()

#' Check inputs to genClusteredData
#'
#' @param p Integer or numeric; the number of features to generate. The
#' generated X will have p columns.
#' @param k_unclustered Integer or numeric; the number of features in X that
#' will have nonzero coefficients in the true model for y among those features 
#' not generated from the n_clusters latent variables (called "weak signal" 
#' features in the simulations from Faletto and Bien 2022). The coefficients on
#' these features will be determined by beta_unclustered.
#' @param cluster_size Integer or numeric; for each of the n_clusters latent
#' variables, X will contain cluster_size noisy proxies that are correlated with
#' the latent variable. Must be at least 2.
#' @param n_clusters Integer or numeric; the number of latent variables to
#' generate, each of which will be associated with an observed cluster in X.
#' Must be at least 1. Default is 1.
#' @param sig_clusters Integer or numeric; the number of generated latent
#' features that will have nonzero coefficients in the true model for y (all of
#' them will have coefficient beta_latent). Must be less than or equal to
#' n_clusters. Default is 1.
#' @param rho Integer or numeric; the correlation of the proxies in each cluster
#' with the latent variable. Must be greater than 0. Default is 0.9.
#' @param beta_latent Integer or numeric; the coefficient used for all
#' sig_clusters latent variables that have nonzero coefficients in the true
#' model for y. Can't equal 0. Default is 1.5.
#' @param beta_unclustered Integer or numeric; the maximum coefficient in the
#' model for y among the k_unclustered features in X not generated from the
#' latent variables. The coefficients of the features will be
#' beta_unclustered/sqrt(1:k_unclustered). Can't equal 0. Default is 1.
#' @param snr Integer or numeric; the signal-to-noise ratio of the response
#' y. If sigma_eps_sq is not specified, the variance of the noise in y will be
#' calculated using the formula sigma_eps_sq = sum(mu^2)/(n * snr). Only one of
#' snr and sigma_eps_sq must be specified. Default is NA.
#' @param sigma_eps_sq Integer or numeric; the variance on the noise added
#' to y. Only one of snr and sigma_eps_sq must be specified. Default is NA.
#' @return A list of the following elements. \item{X}{An n x p numeric matrix of
#' n observations from a p-dimensional multivariate normal distribution
#' generated using the specified parameters. The first n_clusters times
#' cluster_size features will be the clusters of features correlated with the
#' n_clusters latent variables. The next k_unclustered features will be the
#' "weak signal" features, and the remaining p - n_clusters*cluster_size -
#' k_unclustered features will be the unclustered noise features.} \item{y}{A
#' length n numeric vector; the response generated from X, the latent features
#' from Z, and the coefficient vector, along with additive noise.} \item{Z}{The
#' latent features, as an n by n_clusters numeric matrix (one column per
#' latent variable). Note that (X, Z) is multivariate Gaussian.}
#' \item{mu}{A length `n` numeric vector; the expected response given X, Z, and
#' the true coefficient vector (equal to y minus the added noise).}
#' @author Gregory Faletto, Jacob Bien
#' @keywords internal
#' @noRd
checkGenClusteredDataInputs <- function(p, k_unclustered, cluster_size,
    n_clusters, sig_clusters, rho, beta_latent, beta_unclustered, snr,
    sigma_eps_sq){

    checkGenClusteredDataInputsPre(sig_clusters, n_clusters, cluster_size)

    stopifnot(rho > 0)

    checkGenClusteredDataInputsPost(p, k_unclustered, cluster_size, n_clusters,
        beta_latent, beta_unclustered, snr, sigma_eps_sq)

    checkGenClusteredDataInputsSnrSigma(snr, sigma_eps_sq)
}

genZmuY()

#' Generates Z, weak signal features in X, noise features in X, mu, and y
#' from provided parameters
#'
#' @param n Integer or numeric; the number of observations to generate. (The
#' generated X and Z will have n rows, and the generated y and mu will have
#' length n.)
#' @param p Integer or numeric; the number of features to generate. The
#' generated X will have p columns.
#' @param k_unclustered Integer or numeric; the number of features in X that
#' will have nonzero coefficients in the true model for y among those features 
#' not generated from the n_clusters latent variables (called "weak signal" 
#' features in the simulations from Faletto and Bien 2022). The coefficients on
#' these features will be determined by beta_unclustered. Must be at least 1.
#' @param cluster_size Integer or numeric; for each of the n_clusters latent
#' variables, X will contain cluster_size noisy proxies that are correlated with
#' the latent variable. Must be at least 2.
#' @param n_clusters Integer or numeric; the number of latent variables to
#' generate, each of which will be associated with an observed cluster in X.
#' Must be at least 1. Default is 1.
#' @param sig_clusters Integer or numeric; the number of generated latent
#' features that will have nonzero coefficients in the true model for y (all of
#' them will have coefficient beta_latent). Must be less than or equal to
#' n_clusters. Default is 1.
#' @param beta_latent Integer or numeric; the coefficient used for all
#' sig_clusters latent variables that have nonzero coefficients in the true
#' model for y. Can't equal 0. Default is 1.5.
#' @param beta_unclustered Integer or numeric; the maximum coefficient in the
#' model for y among the k_unclustered features in X not generated from the
#' latent variables. The coefficients of the features will be
#' beta_unclustered/sqrt(1:k_unclustered). Can't equal 0. Default is 1.
#' @param snr Integer or numeric; the signal-to-noise ratio of the response
#' y. If sigma_eps_sq is not specified, the variance of the noise in y will be
#' calculated using the formula sigma_eps_sq = sum(mu^2)/(n * snr). Only one of
#' snr and sigma_eps_sq must be specified. Default is NA.
#' @param sigma_eps_sq Integer or numeric; the variance on the noise added
#' to y. Only one of snr and sigma_eps_sq must be specified. Default is NA.
#' @return A list of the following elements. \item{Z}{The
#' latent features; either a numeric vector (if n_clusters == 1) or a numeric
#' matrix (if n_clusters > 1). Note that (X, Z) is multivariate Gaussian.}
#' \item{mu}{A length `n` numeric vector; the expected response given X, Z, and
#' the true coefficient vector (equal to y minus the added noise).} \item{y}{A
#' length n numeric vector; the response generated from X, the latent features
#' from Z, and the coefficient vector, along with additive noise.}
#' \item{other_X}{A numeric matrix of n observations from a multivariate normal
#' distribution generated using the specified parameters, containing the weak
#' signal features and the noise features that will eventually be in X. (The
#' only missing features are the proxies for the latent features Z.)}
#'
#' @author Gregory Faletto, Jacob Bien
#' @references
<<faletto2022>>
#' @keywords internal
#' @noRd
genZmuY <- function(n, p, k_unclustered, cluster_size, n_clusters, sig_clusters,
    beta_latent, beta_unclustered, snr, sigma_eps_sq){
    # Reject n < 2 up front: n = 1 collapses orig_feat_mat[, cols] to a vector
    # (no drop=FALSE below), which fails downstream with a cryptic "incorrect
    # number of dimensions". n is not validated by the shared input checkers, so
    # guard it here -- the one point every generator funnels through (#70).
    stopifnot(is.numeric(n) | is.integer(n))
    stopifnot(n == round(n))
    stopifnot(n >= 2)
    # Generate Z, weak signal features, and noise features (total of
    # p - n_clusters*(cluster_size - 1)) features)
    p_orig_feat_mat <- p - n_clusters*(cluster_size - 1)
    stopifnot(p_orig_feat_mat >= k_unclustered + n_clusters)
    orig_feat_mat <- matrix(stats::rnorm(n*p_orig_feat_mat), n, p_orig_feat_mat)
    # First n_clusters features are Z. Next k_unclustered features are weak
    # signal features. Any remaining features are noise features.
    Z <- orig_feat_mat[, 1:n_clusters]

    other_X <- orig_feat_mat[, (n_clusters + 1):p_orig_feat_mat, drop = FALSE]

    # Ready to create mu and y
    if(sig_clusters == 0){
        mu <- rep(0, n)                 # null latent model; weak-signal features add below
    } else if(n_clusters > 1){
        if(sig_clusters > 1){
            mu <- Z[, seq_len(sig_clusters)] %*% rep(beta_latent, sig_clusters)
        } else{
            mu <- Z[, 1] * beta_latent
        }
    } else{
        mu <- Z * beta_latent
    }
    for(j in 1:k_unclustered){
        mu <- mu + beta_unclustered/sqrt(j)*other_X[, j]
    }
    mu <- as.numeric(mu)

    # If SNR is null, use sigma_eps_sq
    if(!is.na(sigma_eps_sq)){
        sd <- sqrt(sigma_eps_sq)
    }else{
        sd <- sqrt(sum(mu^2) / (n * snr)) # taking snr = ||mu||^2 /(n * sigma^2)
    }

    stopifnot(is.numeric(sd) | is.integer(sd))
    stopifnot(length(sd) == 1)
    stopifnot(!is.na(sd))
    stopifnot(sd >= 0)

    y <- as.numeric(mu + rnorm(n, mean=0, sd=sd))

    return(list(Z=Z, mu=mu, y=y, other_X=other_X))
}

No tests for genZmuY().

getNoiseVar():

#' Get variance of noise to add to Z in order to yield proxies X with desired
#' correlations with Z
#'
#' @param cor A numeric vector of desired correlations for each proxy to have
#' with Z. Note: correlations must be positive.
#' @return A vector of variances of independent Gaussian random variables to add
#' to Z in order to yield proxies with the desired correlations with Z.
#' @author Gregory Faletto, Jacob Bien
#' @examples
#' # Noise variance needed so a proxy Z + N(0, v) attains each target
#' # correlation with Z (correlations must be positive).
#' getNoiseVar(c(0.9, 0.5, 1))
#' @export
getNoiseVar <- function(cor){
    # Correlation between standard normal Z and X = Z + epsilon where epsilon
    # is normal, independent of Z, and has mean 0 and variance sig_eps_sq:
    # 
    #   E[Z X]/sqrt{Var(Z) Var(X)}
    # = (E[Z^2] + E[Z*epsilon])/sqrt{1*(1 + sig_eps_sq)}
    # = (1 + 0)/sqrt{1 + sig_eps_sq}
    #
    # So we have
    #                 cor = 1/sqrt{1 + sig_eps_sq}
    # \iff 1 + sig_eps_sq = 1/cor^2
    # \iff     sig_eps_sq = 1/cor^2 - 1
    stopifnot(is.numeric(cor) | is.integer(cor))
    stopifnot(all(!is.na(cor)))
    stopifnot(length(cor) >= 1)
    stopifnot(all(cor > 0))
    stopifnot(all(cor <= 1))
    return(1/cor^2 - 1)
}

finalizeGenClusteredData() is the shared tail of all three genClusteredData* generators: once each generator has built its cluster-proxy matrix (the only step where the three differ), this helper column-binds the proxies with the remaining features, coerces the latent features to a matrix, runs the common output dimension checks, and returns the standard list(X, y, Z, mu).

#' Assemble and validate the output of a genClusteredData* generator
#'
#' Shared tail of `genClusteredData()`, `genClusteredDataWeighted()`, and
#' `genClusteredDataWeightedRandom()`: column-bind the freshly generated cluster
#' proxies with the remaining features, coerce the latent features to a matrix,
#' run the common output dimension checks, and return the standard
#' `list(X, y, Z, mu)`.
#' @param proxy_mat Numeric matrix; the n by n_clusters*cluster_size block of
#' generated cluster proxies.
#' @param other_X Numeric matrix; the weak-signal and noise features (the
#' non-proxy columns of X), as returned by `genZmuY()`.
#' @param y Numeric vector; the response, as returned by `genZmuY()`.
#' @param mu Numeric vector; the expected response, as returned by `genZmuY()`.
#' @param Z The latent features, as returned by `genZmuY()` (a numeric vector
#' when n_clusters == 1); coerced to a matrix here.
#' @param n Integer or numeric; the number of observations.
#' @param p Integer or numeric; the total number of features (columns of X).
#' @param n_clusters Integer or numeric; the number of latent variables.
#' @return A list with elements X (numeric matrix), y (numeric vector), Z
#' (numeric matrix), and mu (numeric vector).
#' @author Gregory Faletto, Jacob Bien
#' @keywords internal
#' @noRd
finalizeGenClusteredData <- function(proxy_mat, other_X, y, mu, Z, n, p,
    n_clusters){
    X <- cbind(proxy_mat, other_X)
    Z <- as.matrix(Z)

    # Check output
    stopifnot(length(mu) == n)

    stopifnot(nrow(X) == n)
    stopifnot(ncol(X) == p)

    if(any(!is.na(Z))){
        stopifnot(nrow(Z) == n)
        stopifnot(ncol(Z) == n_clusters)
    }

    return(list(X=X, y=y, Z=Z, mu=mu))
}

No tests for finalizeGenClusteredData(); it is exercised through the three genClusteredData* generators (including their characterization snapshots).

genClusteredDataWeighted()

#' Generate randomly sampled data including noisy observations of latent
#' variables, where proxies differ in their relevance (noise level)
#'
#' Generate a data set including latent features Z, observed features X (which
#' may include noisy or noiseless observations of the latent features in Z),
#' an observed response y which is a linear model of features from Z and X as
#' well as independent mean zero noise, and mu (the responses from y without
#' the added noise). Data is generated in the same way as in the simulations
#' from Section 5.3 of Faletto and Bien (2022).
#' @param n Integer or numeric; the number of observations to generate. (The
#' generated X and Z will have n rows, and the generated y and mu will have
#' length n.)
#' @param p Integer or numeric; the number of features to generate. The
#' generated X will have p columns.
#' @param k_unclustered Integer or numeric; the number of features in X that
#' will have nonzero coefficients in the true model for y among those features
#' not generated from the n_clusters latent variables (called "weak signal"
#' features in the simulations from Faletto and Bien 2022). The coefficients on
#' these features will be determined by beta_unclustered. Must be at least 1.
#' @param cluster_size Integer or numeric; for each of the n_clusters latent
#' variables, X will contain cluster_size noisy proxies that are correlated with
#' the latent variable. Must be at least 2.
#' @param n_strong_cluster_vars Integer or numeric; among the cluster_size
#' proxies in each cluster, the first n_strong_cluster_vars will have a high
#' correlation (rho_high) with the latent variable and the next cluster_size -
#' n_strong_cluster_vars will have a low correlation (rho_low) with the latent
#' variable.
#' @param n_clusters Integer or numeric; the number of latent variables to
#' generate, each of which will be associated with an observed cluster in X.
#' Must be at least 1. Default is 1.
#' @param sig_clusters Integer or numeric; the number of generated latent
#' features that will have nonzero coefficients in the true model for y (all of
#' them will have coefficient beta_latent). Must be less than or equal to
#' n_clusters. Default is 1.
#' @param rho_high Integer or numeric; the correlation of the "strong proxies"
#' in each cluster with the latent variable. rho_high cannot equal 0 and must be
#' at least as large as rho_low. Default is 0.9.
#' @param rho_low Integer or numeric; the correlation of the "weak proxies" in
#' each cluster with the latent variable. rho_low cannot equal 0 and must be no
#' larger than rho_high. Default is 0.5.
#' @param beta_latent Integer or numeric; the coefficient used for all
#' sig_clusters latent variables that have nonzero coefficients in the true
#' model for y. Can't equal 0. Default is 1.5.
#' @param beta_unclustered Integer or numeric; the maximum coefficient in the
#' model for y among the k_unclustered features in X not generated from the
#' latent variables. The coefficients of the features will be
#' beta_unclustered/sqrt(1:k_unclustered). Can't equal 0. Default is 1.
#' @param snr Integer or numeric; the signal-to-noise ratio of the response
#' y. If sigma_eps_sq is not specified, the variance of the noise in y will be
#' calculated using the formula sigma_eps_sq = sum(mu^2)/(n * snr). Only one of
#' snr and sigma_eps_sq must be specified. Default is NA.
#' @param sigma_eps_sq Integer or numeric; the variance on the noise added
#' to y (non-negative; 0 gives noiseless y). Only one of snr and sigma_eps_sq
#' must be specified. Default is NA.
#' @return A list of the following elements. \item{X}{An n x p numeric matrix of
#' n observations from a p-dimensional multivariate normal distribution
#' generated using the specified parameters. The first n_clusters times
#' cluster_size features will be the clusters of features correlated with the
#' n_clusters latent variables. The next k_unclustered features will be the
#' "weak signal" features, and the remaining p - n_clusters*cluster_size -
#' k_unclustered features will be the unclustered noise features.} \item{y}{A
#' length n numeric vector; the response generated from X, the latent features
#' from Z, and the coefficient vector, along with additive noise.} \item{Z}{The
#' latent features, as an n by n_clusters numeric matrix (one column per
#' latent variable). Note that (X, Z) is multivariate Gaussian.}
#' \item{mu}{A length `n` numeric vector; the expected response given X, Z, and
#' the true coefficient vector (equal to y minus the added noise).}
#' @author Gregory Faletto, Jacob Bien
#' @references
<<faletto2022>>
#' @examples
#' set.seed(1)
#' data <- genClusteredDataWeighted(n = 50, p = 11, k_unclustered = 2,
#'   cluster_size = 4, n_strong_cluster_vars = 2, n_clusters = 1, snr = 3)
#' dim(data$X)
#' @export
genClusteredDataWeighted <- function(n, p, k_unclustered, cluster_size,
    n_strong_cluster_vars, n_clusters=1, sig_clusters=1, rho_high=0.9,
    rho_low=0.5, beta_latent=1.5, beta_unclustered=1, snr=as.numeric(NA),
    sigma_eps_sq=as.numeric(NA)){

    # Check inputs
    checkGenClusteredDataWeightedInputs(p, k_unclustered, cluster_size,
        n_strong_cluster_vars, n_clusters,  sig_clusters, rho_high, rho_low,
        beta_latent, beta_unclustered, snr, sigma_eps_sq)

    ret <- genZmuY(n=n, p=p, k_unclustered=k_unclustered,
        cluster_size=cluster_size, n_clusters=n_clusters,
        sig_clusters=sig_clusters, beta_latent=beta_latent,
        beta_unclustered=beta_unclustered, snr=snr, sigma_eps_sq=sigma_eps_sq)

    Z <- ret$Z
    y <- ret$y
    mu <- ret$mu
    other_X <- ret$other_X

    # Finally, generate clusters of proxies to complete X.
    noise_var_high <- getNoiseVar(rho_high)
    noise_var_low <- getNoiseVar(rho_low)

    # Create matrix of proxies
    Z <- as.matrix(Z)
    proxy_mat <- matrix(as.numeric(NA), n, n_clusters*cluster_size)
    for(i in 1:n_clusters){
        for(j in 1:n_strong_cluster_vars){
            proxy_mat[, (i - 1)*cluster_size + j] <- Z[, i] + rnorm(n,
                mean=0, sd=sqrt(noise_var_high))
        }
        for(j in (n_strong_cluster_vars + 1):cluster_size){
            proxy_mat[, (i - 1)*cluster_size + j] <- Z[, i] + rnorm(n,
                mean=0, sd=sqrt(noise_var_low))
        }
    }

    return(finalizeGenClusteredData(proxy_mat, other_X, y, mu, Z, n, p,
        n_clusters))
}

checkGenClusteredDataWeightedInputs()

#' Check inputs to genClusteredDataWeighted
#'
#' @param p Integer or numeric; the number of features to generate. The
#' generated X will have p columns.
#' @param k_unclustered Integer or numeric; the number of features in X that
#' will have nonzero coefficients in the true model for y among those features 
#' not generated from the n_clusters latent variables (called "weak signal" 
#' features in the simulations from Faletto and Bien 2022). The coefficients on
#' these features will be determined by beta_unclustered.
#' @param cluster_size Integer or numeric; for each of the n_clusters latent
#' variables, X will contain cluster_size noisy proxies that are correlated with
#' the latent variable. Must be at least 2.
#' @param n_strong_cluster_vars Integer or numeric; among the cluster_size
#' proxies in each cluster, n_strong_cluster_vars will have a high correlation
#' (rho_high) with the latent variable and cluster_size - n_strong_cluster_vars
#' will have a low correlation (rho_low) with the latent variable.
#' @param n_clusters Integer or numeric; the number of latent variables to
#' generate, each of which will be associated with an observed cluster in X.
#' Must be at least 1. Default is 1.
#' @param sig_clusters Integer or numeric; the number of generated latent
#' features that will have nonzero coefficients in the true model for y (all of
#' them will have coefficient beta_latent). Must be less than or equal to
#' n_clusters. Default is 1.
#' @param rho_high Integer or numeric; the correlation of the "strong proxies"
#' in each cluster with the latent variable. rho_high cannot equal 0 and must be
#' at least as large as rho_low. Default is 0.9.
#' @param rho_low Integer or numeric; the correlation of the "weak proxies" in
#' each cluster with the latent variable. rho_low cannot equal 0 and must be no
#' larger than rho_high. Default is 0.5.
#' @param beta_latent Integer or numeric; the coefficient used for all
#' sig_clusters latent variables that have nonzero coefficients in the true
#' model for y. Can't equal 0. Default is 1.5.
#' @param beta_unclustered Integer or numeric; the maximum coefficient in the
#' model for y among the k_unclustered features in X not generated from the
#' latent variables. The coefficients of the features will be
#' beta_unclustered/sqrt(1:k_unclustered). Can't equal 0. Default is 1.
#' @param snr Integer or numeric; the signal-to-noise ratio of the response
#' y. If sigma_eps_sq is not specified, the variance of the noise in y will be
#' calculated using the formula sigma_eps_sq = sum(mu^2)/(n * snr). Only one of
#' snr and sigma_eps_sq must be specified. Default is NA.
#' @param sigma_eps_sq Integer or numeric; the variance on the noise added
#' to y. Only one of snr and sigma_eps_sq must be specified. Default is NA.
#' @return A list of the following elements. \item{X}{An n x p numeric matrix of
#' n observations from a p-dimensional multivariate normal distribution
#' generated using the specified parameters. The first n_clusters times
#' cluster_size features will be the clusters of features correlated with the
#' n_clusters latent variables. The next k_unclustered features will be the
#' "weak signal" features, and the remaining p - n_clusters*cluster_size -
#' k_unclustered features will be the unclustered noise features.} \item{y}{A
#' length n numeric vector; the response generated from X, the latent features
#' from Z, and the coefficient vector, along with additive noise.} \item{Z}{The
#' latent features, as an n by n_clusters numeric matrix (one column per
#' latent variable). Note that (X, Z) is multivariate Gaussian.}
#' \item{mu}{A length `n` numeric vector; the expected response given X, Z, and
#' the true coefficient vector (equal to y minus the added noise).}
#' @author Gregory Faletto, Jacob Bien
#' @keywords internal
#' @noRd
checkGenClusteredDataWeightedInputs <- function(p, k_unclustered, cluster_size,
        n_strong_cluster_vars, n_clusters, sig_clusters, rho_high, rho_low,
        beta_latent, beta_unclustered, snr, sigma_eps_sq){

    checkGenClusteredDataInputsPre(sig_clusters, n_clusters, cluster_size)

    stopifnot(is.integer(n_strong_cluster_vars) |
        is.numeric(n_strong_cluster_vars))
    stopifnot(!is.na(n_strong_cluster_vars))
    stopifnot(length(n_strong_cluster_vars) == 1)
    stopifnot(n_strong_cluster_vars == round(n_strong_cluster_vars))
    stopifnot(n_strong_cluster_vars >= 1)
    stopifnot(n_strong_cluster_vars < cluster_size)

    stopifnot(rho_high <= 1)
    stopifnot(rho_high > 0)
    stopifnot(rho_low > 0)
    stopifnot(rho_high >= rho_low)

    checkGenClusteredDataInputsPost(p, k_unclustered, cluster_size, n_clusters,
        beta_latent, beta_unclustered, snr, sigma_eps_sq)

    checkGenClusteredDataInputsSnrSigma(snr, sigma_eps_sq)
}

No tests for checkGenClusteredDataWeightedInputs()

genClusteredDataWeightedRandom()

#' Generate randomly sampled data including noisy observations of latent
#' variables, where proxies differ in their relevance (noise level)
#'
#' Generate a data set including latent features Z, observed features X (which
#' may include noisy or noiseless observations of the latent features in Z),
#' an observed response y which is a linear model of features from Z and X as
#' well as independent mean zero noise, and mu (the responses from y without
#' the added noise).
#' @param n Integer or numeric; the number of observations to generate. (The
#' generated X and Z will have n rows, and the generated y and mu will have
#' length n.)
#' @param p Integer or numeric; the number of features to generate. The
#' generated X will have p columns.
#' @param k_unclustered Integer or numeric; the number of features in X that
#' will have nonzero coefficients in the true model for y among those features
#' not generated from the n_clusters latent variables (called "weak signal"
#' features in the simulations from Faletto and Bien 2022). The coefficients on
#' these features will be determined by beta_unclustered. Must be at least 1.
#' @param cluster_size Integer or numeric; for each of the n_clusters latent
#' variables, X will contain cluster_size noisy proxies that are correlated with
#' the latent variable. Must be at least 2.
#' @param n_clusters Integer or numeric; the number of latent variables to
#' generate, each of which will be associated with an observed cluster in X.
#' Must be at least 1. Default is 1.
#' @param sig_clusters Integer or numeric; the number of generated latent
#' features that will have nonzero coefficients in the true model for y (all of
#' them will have coefficient beta_latent). Must be less than or equal to
#' n_clusters. Default is 1.
#' @param rho_high Integer or numeric; the maximum correlation of the proxies
#' in each cluster with the latent variable. rho_high cannot equal 0 and must be
#' at least as large as rho_low. Default is 1.
#' @param rho_low Integer or numeric; the minimum correlation of the proxies in
#' each cluster with the latent variable. rho_low cannot equal 0 and must be no
#' larger than rho_high. Default is 0.5.
#' @param beta_latent Integer or numeric; the coefficient used for all
#' sig_clusters latent variables that have nonzero coefficients in the true
#' model for y. Can't equal 0. Default is 1.5.
#' @param beta_unclustered Integer or numeric; the maximum coefficient in the
#' model for y among the k_unclustered features in X not generated from the
#' latent variables. The coefficients of the features will be
#' beta_unclustered/sqrt(1:k_unclustered). Can't equal 0. Default is 1.
#' @param snr Integer or numeric; the signal-to-noise ratio of the response
#' y. If sigma_eps_sq is not specified, the variance of the noise in y will be
#' calculated using the formula sigma_eps_sq = sum(mu^2)/(n * snr). Only one of
#' snr and sigma_eps_sq must be specified. Default is NA.
#' @param sigma_eps_sq Integer or numeric; the variance on the noise added
#' to y (non-negative; 0 gives noiseless y). Only one of snr and sigma_eps_sq
#' must be specified. Default is NA.
#' @return A list of the following elements. \item{X}{An n x p numeric matrix of
#' n observations from a p-dimensional multivariate normal distribution
#' generated using the specified parameters. The first n_clusters times
#' cluster_size features will be the clusters of features correlated with the
#' n_clusters latent variables. The next k_unclustered features will be the
#' "weak signal" features, and the remaining p - n_clusters*cluster_size -
#' k_unclustered features will be the unclustered noise features.} \item{y}{A
#' length n numeric vector; the response generated from X, the latent features
#' from Z, and the coefficient vector, along with additive noise.} \item{Z}{The
#' latent features, as an n by n_clusters numeric matrix (one column per
#' latent variable). Note that (X, Z) is multivariate Gaussian.}
#' \item{mu}{A length `n` numeric vector; the expected response given X, Z, and
#' the true coefficient vector (equal to y minus the added noise).}
#' @author Gregory Faletto, Jacob Bien
#' @references
<<faletto2022>>
#' @examples
#' set.seed(1)
#' data <- genClusteredDataWeightedRandom(n = 50, p = 11, k_unclustered = 2,
#'   cluster_size = 4, n_clusters = 1, snr = 3)
#' dim(data$X)
#' @export
genClusteredDataWeightedRandom <- function(n, p, k_unclustered, cluster_size,
    n_clusters=1, sig_clusters=1, rho_high=1, rho_low=0.5, beta_latent=1.5,
    beta_unclustered=1, snr=as.numeric(NA), sigma_eps_sq=as.numeric(NA)){

    # Check inputs
    checkGenClusteredDataWeightedRandomInputs(p, k_unclustered, cluster_size,
        n_clusters, sig_clusters, rho_high, rho_low,
        beta_latent, beta_unclustered, snr, sigma_eps_sq)

    ret <- genZmuY(n=n, p=p, k_unclustered=k_unclustered,
        cluster_size=cluster_size, n_clusters=n_clusters,
        sig_clusters=sig_clusters, beta_latent=beta_latent,
        beta_unclustered=beta_unclustered, snr=snr, sigma_eps_sq=sigma_eps_sq)

    Z <- ret$Z
    y <- ret$y
    mu <- ret$mu
    other_X <- ret$other_X

    # Finally, generate clusters of proxies to complete X.
    # Create matrix of proxies
    Z <- as.matrix(Z)
    proxy_mat <- matrix(as.numeric(NA), n, n_clusters*cluster_size)
    for(i in 1:n_clusters){
        for(j in 1:cluster_size){
            # Choose correlation at random
            rho_ij <- runif(n=1, min=rho_low, max=rho_high)
            # Get noise variance
            noise_var_ij <- getNoiseVar(rho_ij)
            proxy_mat[, (i - 1)*cluster_size + j] <- Z[, i] + rnorm(n,
                mean=0, sd=sqrt(noise_var_ij))
        }
    }

    return(finalizeGenClusteredData(proxy_mat, other_X, y, mu, Z, n, p,
        n_clusters))
}

checkGenClusteredDataWeightedRandomInputs()

#' Check inputs to genClusteredDataWeightedRandom
#'
#' @param p Integer or numeric; the number of features to generate. The
#' generated X will have p columns.
#' @param k_unclustered Integer or numeric; the number of features in X that
#' will have nonzero coefficients in the true model for y among those features 
#' not generated from the n_clusters latent variables (called "weak signal" 
#' features in the simulations from Faletto and Bien 2022). The coefficients on
#' these features will be determined by beta_unclustered.
#' @param cluster_size Integer or numeric; for each of the n_clusters latent
#' variables, X will contain cluster_size noisy proxies that are correlated with
#' the latent variable. Must be at least 2.
#' @param n_clusters Integer or numeric; the number of latent variables to
#' generate, each of which will be associated with an observed cluster in X.
#' Must be at least 1. Default is 1.
#' @param sig_clusters Integer or numeric; the number of generated latent
#' features that will have nonzero coefficients in the true model for y (all of
#' them will have coefficient beta_latent). Must be less than or equal to
#' n_clusters. Default is 1.
#' @param rho_high Integer or numeric; the maximum correlation of the proxies
#' in each cluster with the latent variable. rho_high cannot equal 0 and must be
#' at least as large as rho_low. Default is 1.
#' @param rho_low Integer or numeric; the minimum correlation of the proxies in
#' each cluster with the latent variable. rho_low cannot equal 0 and must be no
#' larger than rho_high. Default is 0.5.
#' @param beta_latent Integer or numeric; the coefficient used for all
#' sig_clusters latent variables that have nonzero coefficients in the true
#' model for y. Can't equal 0. Default is 1.5.
#' @param beta_unclustered Integer or numeric; the maximum coefficient in the
#' model for y among the k_unclustered features in X not generated from the
#' latent variables. The coefficients of the features will be
#' beta_unclustered/sqrt(1:k_unclustered). Can't equal 0. Default is 1.
#' @param snr Integer or numeric; the signal-to-noise ratio of the response
#' y. If sigma_eps_sq is not specified, the variance of the noise in y will be
#' calculated using the formula sigma_eps_sq = sum(mu^2)/(n * snr). Only one of
#' snr and sigma_eps_sq must be specified. Default is NA.
#' @param sigma_eps_sq Integer or numeric; the variance on the noise added
#' to y. Only one of snr and sigma_eps_sq must be specified. Default is NA.
#' @return A list of the following elements. \item{X}{An n x p numeric matrix of
#' n observations from a p-dimensional multivariate normal distribution
#' generated using the specified parameters. The first n_clusters times
#' cluster_size features will be the clusters of features correlated with the
#' n_clusters latent variables. The next k_unclustered features will be the
#' "weak signal" features, and the remaining p - n_clusters*cluster_size -
#' k_unclustered features will be the unclustered noise features.} \item{y}{A
#' length n numeric vector; the response generated from X, the latent features
#' from Z, and the coefficient vector, along with additive noise.} \item{Z}{The
#' latent features, as an n by n_clusters numeric matrix (one column per
#' latent variable). Note that (X, Z) is multivariate Gaussian.}
#' \item{mu}{A length `n` numeric vector; the expected response given X, Z, and
#' the true coefficient vector (equal to y minus the added noise).}
#' @author Gregory Faletto, Jacob Bien
#' @keywords internal
#' @noRd
checkGenClusteredDataWeightedRandomInputs <- function(p, k_unclustered,
    cluster_size, n_clusters, sig_clusters, rho_high, rho_low, beta_latent,
    beta_unclustered, snr, sigma_eps_sq){

    checkGenClusteredDataInputsPre(sig_clusters, n_clusters, cluster_size)

    stopifnot(rho_high <= 1)
    stopifnot(rho_high > 0)
    stopifnot(rho_low > 0)
    stopifnot(rho_high >= rho_low)

    checkGenClusteredDataInputsPost(p, k_unclustered, cluster_size, n_clusters,
        beta_latent, beta_unclustered, snr, sigma_eps_sq)

    checkGenClusteredDataInputsSnrSigma(snr, sigma_eps_sq)
}

No tests for checkGenClusteredDataWeightedRandomInputs()

getLassoLambda():

#' Get lambda value for lasso
#'
#' Chooses a lambda value for the lasso used on a subsample of size n/2 (as in
#' cluster stability selection) by cross-validation.
#' @param X An n x p numeric matrix (preferably) or a data.frame (which will
#' be coerced internally to a matrix by the function model.matrix) containing
#' the p >= 2 features/predictors that will be used by cluster stability
#' selection. Must not contain missing (`NA`) values.
#' @param y The response; an n-dimensional numeric or integer vector. (Unlike
#' in the more general css setup, this response must be real-valued since
#' lambda will be determined using the lasso with cross-validation.)
#' @param lambda_choice Character; either "min" or "1se". If "min", chooses
#' the lambda that minimizes the cross-validated error; if "1se", chooses the
#' largest lambda within one standard error of the minimum error lambda
#' (resulting in a smaller selected set, which may be desirable because the
#' model size corresponding to the minimum error lambda tends to be larger
#' than optimal. See, for example, Bühlmann and Meinshausen 2006, Prop. 1 and
#' Bühlmann and van de Geer 2011, Section 2.5.1.). Default is "1se".
#' @param nfolds Numeric or integer; the number of folds for cross-validation.
#' Must be at least 4 (as specified by cv.glmnet). Default is 10.
#' @param alpha Numeric; the elastic net mixing parameter. Must be in `(0, 1]`.
#' Default is 1 (in which case the penalty is for lasso).
#' @return A numeric; the selected value of lambda.
#' @author Gregory Faletto, Jacob Bien
#' @references Bühlmann, P., & Meinshausen, N. (2006). High-Dimensional Graphs
#' and Variable Selection With the Lasso. \emph{The Annals of Statistics},
#' 34(3), 1436–1462. \url{https://doi.org/10.1214/009053606000000281}.
#' \cr Peter Bühlmann and Sara van de Geer. Statistics for High-Dimensional
#' Data. \emph{Springer Series in Statistics}. Springer, Heidelberg, 2011. ISBN
#' 978-3-642-20191-2. \url{http://dx.doi.org/10.1007/978-3-642-20192-9}. \cr
#' Jerome Friedman, Trevor Hastie, Robert Tibshirani (2010). Regularization
#' Paths for Generalized Linear Models via Coordinate Descent. \emph{Journal of
#' Statistical Software}, 33(1), 1-22. URL \url{https://www.jstatsoft.org/v33/i01/}.
#' @examples
#' set.seed(1)
#' data <- genClusteredData(n = 50, p = 11, k_unclustered = 2,
#'   cluster_size = 4, n_clusters = 1, snr = 3)
#' lambda <- getLassoLambda(X = data$X, y = data$y, nfolds = 5)
#' lambda
#' @export
getLassoLambda <- function(X, y, lambda_choice="1se", nfolds=10, alpha=1){
    stopifnot(is.character(lambda_choice))
    stopifnot(length(lambda_choice) == 1)
    stopifnot(!is.na(lambda_choice))
    stopifnot(lambda_choice %in% c("min", "1se"))

    checkNoNAs(X, "X")

    # Accept a data.frame like the other X-accepting exports (#51): coerce to a
    # matrix via the shared helper. clusters = list() makes the column-count
    # guard inert (getLassoLambda has no clusters argument).
    X <- coerceDataFrameToMatrix(X, clusters = list())
    stopifnot(is.matrix(X))
    stopifnot(is.numeric(X) | is.integer(X))
    # Require at least two features, mirroring css()'s stopifnot(p >= 2). A 1-col
    # X (e.g. cssSelect()/cssPredict() called with p = 1) would otherwise reach
    # cv.glmnet with a single column and fail cryptically; this fires the same
    # clean "p >= 2 is not TRUE" message as css().
    p <- ncol(X)
    stopifnot(p >= 2)
    n <- nrow(X)

    stopifnot(is.numeric(nfolds) | is.integer(nfolds))
    stopifnot(length(nfolds) == 1)
    stopifnot(nfolds == round(nfolds))
    stopifnot(nfolds > 3)

    checkAlpha(alpha)

    # Since we are using the lasso, we require y to be a real-valued response
    # (unlike for the general cluster stability selection procedure, where y
    # can have a more general format as long as a suitable feature selection
    # function is provided by the user). checkFiniteY also rejects NA/NaN/Inf
    # here, BEFORE the random subsample below, so a non-finite y fails fast and
    # deterministically rather than only when the bad value lands in the draw.
    checkFiniteY(y, "y")
    stopifnot(n == length(y))

    # Sample size to use: inflate n/2 by a factor of nfolds/(nfolds - 1),
    # so that each individual lasso fit is of size floor(n/2)

    n_sample <- min(round(n/2*nfolds/(nfolds - 1)), n)
    # n_sample is the number of observations actually passed to cv.glmnet. If it
    # is too small for even 3-fold CV, error clearly here (reporting the
    # requested nfolds) instead of letting cv.glmnet crash with "nfolds must be
    # bigger than 3". Otherwise clamp nfolds into [3, n_sample] so we never end
    # up below glmnet's minimum or above the available sample size.
    if(n_sample < 3){
        stop(paste0("Sample size n = ", n, " is too small to choose lambda by ", nfolds,
            "-fold cross-validation; supply a larger n, a smaller nfolds, or lambda directly."))
    }
    nfolds <- max(3L, min(nfolds, n_sample))

    inds_size <- sample(1:n, n_sample)
    size_results <- glmnet::cv.glmnet(x=X[inds_size, ], y=y[inds_size],
        family="gaussian", nfolds=nfolds, alpha=alpha)

    lambda_ret <- size_results[[paste("lambda.", lambda_choice, sep="")]]

    # Check output
    stopifnot(length(lambda_ret) == 1)
    stopifnot(is.numeric(lambda_ret) | is.integer(lambda_ret))
    stopifnot(lambda_ret >= 0)

    return(lambda_ret)
}

getModelSize():

#' Automated estimation of model size
#'
#' This function is uses the lasso with cross-validation to estimate the
#' model size. Before using the lasso, in each cluster all features will be
#' dropped from X except the one feature with the highest marginal correlation
#' with y, as in the protolasso (Reid and Tibshirani 2016).
#' 
#' @param X An n x p numeric matrix (preferably) or a data.frame (which will
#' be coerced internally to a matrix by the function model.matrix) containing
#' the p >= 2 features/predictors. Must not contain missing (`NA`) values.
#' @param y A length-n numeric vector containing the responses; `y[i]` is the
#' response corresponding to observation `X[i, ]`. (Note that for the css
#' function, y does not have to be a numeric response, but for this function,
#' the underlying selection procedure is the lasso, so y must be a real-valued
#' response.)
#' @param clusters A named list where each entry is an integer vector of indices
#' of features that are in a common cluster, as in the output of css.
#' (The length of list clusters is equal to the number of clusters.) All
#' identified clusters must be non-overlapping, and all features must appear in
#' exactly one cluster (any unclustered features should be in their own
#' "cluster" of size 1). CAUTION: if the provided X is a data.frame that
#' contains a categorical feature with more than two levels, then the resulting
#' matrix made from model.matrix will have a different number of columns than
#' the provided data.frame, some of the feature numbers will change, and the
#' clusters argument will not work properly (in the current version of the
#' package). To get correct results in this case, please use model.matrix to
#' convert the data.frame to a numeric matrix on your own, then provide this
#' matrix and cluster assignments with respect to this matrix.
#' @param alpha Numeric; the elastic net mixing parameter for the
#' cross-validated fit used to estimate the model size. Must be in `(0, 1]`.
#' Default is 1 (in which case the penalty is the lasso). Set alpha to match
#' the alpha used for feature selection so the model-size estimate is
#' consistent with the elastic-net selection (see `cssSelect` / `cssPredict`).
#' @return An integer; the estimated size of the model. The minimum returned
#' value is 1, even if the lasso with cross-validation chose no features.
#' @author Gregory Faletto, Jacob Bien
#' @references Reid, S., & Tibshirani, R. (2016). Sparse regression and marginal
#' testing using cluster prototypes. \emph{Biostatistics}, 17(2), 364–376.
#' \url{https://doi.org/10.1093/biostatistics/kxv049}.
#' @examples
#' set.seed(1)
#' data <- genClusteredData(n = 50, p = 11, k_unclustered = 2,
#'   cluster_size = 4, n_clusters = 1, snr = 3)
#' clusters <- list(cluster1 = 1:4)
#' getModelSize(X = data$X, y = data$y, clusters = clusters)
#' @export
getModelSize <- function(X, y, clusters, alpha = 1){

    # Validate alpha (a single number in (0, 1]). alpha is threaded into the
    # cv.glmnet fit below so the model-size estimate uses the same elastic-net
    # mixing as feature selection.
    checkAlpha(alpha)

    stopifnot(is.matrix(X) | is.data.frame(X))
    checkNoNAs(X, "X")

    # Check if x is a matrix; if it's a data.frame, convert to matrix.
    X <- coerceDataFrameToMatrix(X, clusters)

    stopifnot(is.matrix(X))
    stopifnot(is.numeric(X) | is.integer(X))
    n <- nrow(X)

    # Since the model size will be determined by cross-validation, the provided
    # y must be real-valued and finite (this should be checked internally in
    # other functions before getModelSize is called, but this check is here just
    # in case).
    checkFiniteY(y, "y")
    stopifnot(length(y) == n)

    # Check clusters argument
    clusters <- checkCssClustersInput(clusters)

    ### Format clusters into a list where all features are in exactly one
    # cluster (any unclustered features are put in their own "cluster" of size
    # 1).
    clust_names <- as.character(NA)
    if(!is.null(names(clusters)) & is.list(clusters)){
        clust_names <- names(clusters)
    }

    clusters <- formatClusters(clusters, p=ncol(X),
        clust_names=clust_names)$clusters

    X_size <- X

    if(length(clusters) > 0){
        # Create modified design matrix by dropping all but one feature from
        # each cluster
        feats_to_drop <- integer()
        for(i in 1:length(clusters)){
            cluster_i <- clusters[[i]]
            if(length(cluster_i) > 1){
                feat_to_keep <- identifyPrototype(cluster_i, X, y)
                feats_to_drop <- c(feats_to_drop, setdiff(cluster_i,
                    feat_to_keep))
            }
        }
        if(length(feats_to_drop) > 0){
            X_size <- X_size[, -1*feats_to_drop, drop=FALSE]
        }
    }

    # cv.glmnet requires at least two columns; if cluster reduction leaves a
    # single feature, the only possible model size is 1.
    if(ncol(X_size) < 2){
        return(1L)
    }

    size_results <- glmnet::cv.glmnet(x=X_size, y=y, family="gaussian",
        alpha=alpha)
    coefs <- as.numeric(glmnet::coef.glmnet(size_results, s="lambda.1se"))

    # Number of nonzero coefficients (subtract one in order to ignore intercept)
    size <- length(coefs[coefs != 0]) - 1

    # Check output
    stopifnot(is.numeric(size) | is.integer(size))
    stopifnot(!is.na(size))
    stopifnot(length(size) == 1)
    stopifnot(size == round(size))

    return(as.integer(max(size, 1)))
}

buildCssDf() builds the printable cluster-summary data.frame from a css_results object and an already-obtained named vector of selected clusters. It is factored out of printCssDf() (issue #129) so that a caller which has already computed the selected clusters – notably summary.cssr(), which calls getCssSelections() itself – can reuse that result instead of triggering a second identical getCssSelections() call. It is an internal (@noRd) helper, so the exported printCssDf() keeps its exact signature and man page:

#' Build the printable cluster-summary data.frame from selected clusters
#'
#' Constructs the data.frame returned by printCssDf() (and summary.cssr()) from
#' a css_results object and an already-obtained named vector of selected
#' clusters. Factored out of printCssDf() (#129) so callers that have already
#' computed the selected clusters (e.g. summary.cssr(), which calls
#' getCssSelections() itself) can reuse that result rather than triggering a
#' second identical getCssSelections() call. Because cluster selection is
#' invariant to the weighting rule, the sel_clusts computed by summary.cssr()
#' for any weighting matches the "sparse"-default selection printCssDf() would
#' recompute, so the returned data.frame is byte-identical either way.
#'
#' @param css_results An object of class "cssr" (the output of the function
#' css).
#' @param sel_clusts A named numeric vector of selected clusters with their
#' selection proportions, as returned in the selected_clusts element of
#' getCssSelections() / getSelectedClusters() (the value printCssDf() itself
#' obtains via getCssSelections()). May be empty (length 0), in which case a
#' well-formed zero-row data.frame is returned.
#' @return A data.frame; each row contains a cluster, arranged in decreasing
#' order of cluster selection proportion from top to bottom (see printCssDf()
#' for the full column description).
#' @author Gregory Faletto, Jacob Bien
#' @keywords internal
#' @noRd
buildCssDf <- function(css_results, sel_clusts){
    # An empty selection is valid post-#107 (e.g. cutoff = 1 with
    # min_num_clusts = 0 when no cluster clears the cutoff). Return a well-formed
    # zero-row data.frame whose columns mirror the populated branch below for the
    # same object: 5 columns (with ClustProtoName) when X has column names, 4
    # otherwise -- names(prototypes) is non-NULL iff colnames(css_results$X) is,
    # since prototype names come from colnames(X). (#120)
    if(length(sel_clusts) == 0){
        if(!is.null(colnames(css_results$X))){
            return(data.frame(ClustName=character(0),
                ClustProtoName=character(0), ClustProtoNum=integer(0),
                ClustSelProp=numeric(0), ClustSize=integer(0)))
        }
        return(data.frame(ClustName=character(0), ClustProtoNum=integer(0),
            ClustSelProp=numeric(0), ClustSize=integer(0)))
    }

    # Get prototypes (feature from each cluster with highest selection
    # proportion, breaking ties by using marginal correlations of features with
    # y from data provided to css if y is real-valued)
    prototypes <- getSelectionPrototypes(css_results, sel_clusts)

    # Cluster selection proportions
    if(length(sel_clusts) > 1){
        sel_clust_sel_props <- colMeans(css_results$clus_sel_mat[,
            names(sel_clusts)])
    } else{
        sel_clust_sel_props <- mean(css_results$clus_sel_mat[,
            names(sel_clusts)])
    }

    # Data.frame: name of cluster, cluster prototype, selection proportion,
    # cluster size

    if(!is.null(names(prototypes))){
        print_df <- data.frame(ClustName=names(sel_clusts),
            ClustProtoName=names(prototypes), ClustProtoNum=unname(prototypes),
            ClustSelProp=sel_clust_sel_props, ClustSize=lengths(sel_clusts))
    } else{
        print_df <- data.frame(ClustName=names(sel_clusts),
            ClustProtoNum=unname(prototypes), ClustSelProp=sel_clust_sel_props,
            ClustSize=lengths(sel_clusts))
    }

    print_df <- print_df[order(print_df$ClustSelProp, decreasing=TRUE), ]

    rownames(print_df) <- NULL

    stopifnot(is.data.frame(print_df))
    stopifnot(nrow(print_df) >= 1)

    return(print_df)
}

printCssDf():

#' Prepares a data.frame summarazing cluster stability selection output to print
#'
#' Print a summary of the information from the css function.
#' @param css_results An object of class "cssr" (the output of the function
#' css).
#' @param cutoff Numeric; the outputted data.frame will display only those
#' clusters with selection proportions equal to at least cutoff. Must be between
#' 0 and 1. Default is 0 (in which case either all clusters are displayed, or
#' max_num_clusts are, if max_num_clusts is specified).
#' @param min_num_clusts Integer or numeric; the minimum number of clusters to
#' use regardless of cutoff. (That is, if the chosen cutoff returns fewer than
#' min_num_clusts clusters, the cutoff will be increased until at least
#' min_num_clusts clusters are selected.) Default is 1.
#' @param max_num_clusts Integer or numeric; the maximum number of clusters to
#' use regardless of cutoff. (That is, if the chosen cutoff returns more than
#' max_num_clusts clusters, the cutoff will be decreased until at most
#' max_num_clusts clusters are selected.) Default is NA (in which case
#' max_num_clusts is ignored).
#' @return A data.frame; each row contains a cluster, arranged in decreasing
#' order of cluster selection proportion from top to bottom. The columns are
#' ClustName (the name of the cluster that was either provided to css or made by
#' css if no name was provided); ClustProtoName (the name of the selection
#' prototype from the cluster, which is the feature with the greatest individual
#' selection proportion among all the cluster members, with ties broken by
#' choosing the feature with the highest correlation with the response if the
#' response is real-valued; only returned if the features are named),
#' ClustProtoNum (the column number of the prototype in the X matrix provided to
#' css), ClustSelProp (the cluster's selection proportion), and ClustSize (the
#' size of the cluster).
#' @author Gregory Faletto, Jacob Bien
#' @examples
#' set.seed(1)
#' data <- genClusteredData(n = 50, p = 11, k_unclustered = 2,
#'   cluster_size = 4, n_clusters = 1, snr = 3)
#' clusters <- list(cluster1 = 1:4)
#' res <- css(X = data$X, y = data$y, lambda = 0.01, clusters = clusters,
#'   B = 10)
#' printCssDf(res)
#' @export
printCssDf <- function(css_results, cutoff=0, min_num_clusts=1,
    max_num_clusts=NA){
    # Check inputs
    stopifnot(inherits(css_results, "cssr"))
    checkCutoff(cutoff)

    p <- ncol(css_results$feat_sel_mat)

    checkMinNumClusts(min_num_clusts, p, length(css_results$clusters))

    max_num_clusts <- checkMaxNumClusts(max_num_clusts, min_num_clusts, p,
        length(css_results$clusters))

    sel_clusts <- getCssSelections(css_results, cutoff=cutoff,
        min_num_clusts=min_num_clusts,
        max_num_clusts=max_num_clusts)$selected_clusts

    # printCssDf keeps its exact exported signature; the table-building work
    # (empty-selection handling, prototypes, selection proportions, ordering)
    # now lives in the internal buildCssDf() helper so summary.cssr() can reuse
    # an already-computed sel_clusts instead of making a second identical
    # getCssSelections() call. Byte-identical: buildCssDf() consumes sel_clusts
    # exactly as this body did. (#129)
    return(buildCssDf(css_results, sel_clusts))
}

getSelectionPrototypes()

#' Identify selection prototypes from selected clusters
#'
#' Takes in list of selected clusters and returns an integer vector of the
#' indices of the features that were most frequently selected from each cluster
#'
#' @param css_results An object of class "cssr" (the output of the function
#' css).
#' @param selected_clusts A list of integer vectors; each vector must contain
#' the indices of features in a cluster.
#' @return An integer vector (of length equal to the number of clusters) of the
#' indices of the feature prototypes (the features from each cluster that were
#' selected the most often individually by the base method in cluster stability
#' selection). In the case of a tie, the tie is broken by choosing the feature
#' most correlated with the response in the full data set provided to css.
#' @author Gregory Faletto, Jacob Bien
#' @keywords internal
#' @noRd
getSelectionPrototypes <- function(css_results, selected_clusts){
    
    # Check inputs
    stopifnot(inherits(css_results, "cssr"))

    stopifnot(is.list(selected_clusts))
    n_selected_clusts <- length(selected_clusts)

    # An empty selection is valid post-#107: getCssSelections() can return zero
    # selected clusters (e.g. cutoff = 1 with min_num_clusts = 0 when no cluster
    # clears the cutoff). Return an empty integer vector before the non-empty
    # guards below; this also avoids the 1:0 foot-gun in the for loop. (#120)
    if(n_selected_clusts == 0){
        return(integer(0))
    }

    stopifnot(n_selected_clusts >= 1)
    stopifnot(all(lengths(selected_clusts) >= 1))

    prototypes <- rep(as.integer(NA), n_selected_clusts)
    feat_sel_props <- colMeans(css_results$feat_sel_mat)
    for(i in 1:n_selected_clusts){
        clust_i <- selected_clusts[[i]]
        sel_props_i <- feat_sel_props[clust_i]
        proto_i <- clust_i[sel_props_i == max(sel_props_i)]
        stopifnot(length(proto_i) >= 1)
        if(length(proto_i) > 1){
            if(is.numeric(css_results$y) | is.integer(css_results$y)){
                # Break tie by marginal correlation. suppressWarnings + NA->0
                # mirrors identifyPrototype (#59): a constant column has
                # undefined correlation (cor() returns NA with a
                # "standard deviation is zero" warning) -- treat it as 0 so
                # it never wins the tie-break and never poisons max() (#68).
                corrs_i <- suppressWarnings(abs(stats::cor(
                    css_results$X[, proto_i, drop = FALSE], css_results$y)[, 1]))
                corrs_i[is.na(corrs_i)] <- 0
                proto_i <- proto_i[corrs_i == max(corrs_i)]
            }
        }
        # If there is still a tie, break by choosing the first feature of those
        # remaining
        prototypes[i] <- proto_i[1]
        names(prototypes)[i] <- colnames(css_results$X)[proto_i]
    }

    # Check output

    stopifnot(is.integer(prototypes))
    stopifnot(all(!is.na(prototypes)))
    stopifnot(length(prototypes) == length(unique(prototypes)))

    return(prototypes)
}

print.cssr()

#' Print cluster stability selection output
#'
#' Print a summary of the information from the css function (using output from
#' printCssDf function).
#' @param x An object of class "cssr" (the output of the function css).
#' @param cutoff Numeric; print.cssr will display only those
#' clusters with selection proportions equal to at least cutoff. Must be between
#' 0 and 1. Default is 0 (in which case either all clusters are displayed, or
#' max_num_clusts are, if max_num_clusts is specified).
#' @param min_num_clusts Integer or numeric; the minimum number of clusters to
#' use regardless of cutoff. (That is, if the chosen cutoff returns fewer than
#' min_num_clusts clusters, the cutoff will be increased until at least
#' min_num_clusts clusters are selected.) Default is 1.
#' @param max_num_clusts Integer or numeric; the maximum number of clusters to
#' use regardless of cutoff. (That is, if the chosen cutoff returns more than
#' max_num_clusts clusters, the cutoff will be decreased until at most
#' max_num_clusts clusters are selected.) Default is NA (in which case
#' max_num_clusts is ignored).
#' @param ... Additional arguments to generic print.data.frame function
#' @return Invisibly, the unchanged `cssr` object `x` (following the standard
#' convention for `print` methods). Called for its side effect: printing a
#' summary table with one row per cluster, arranged in decreasing order of
#' cluster selection proportion from top to bottom. The printed columns are
#' ClustName (the name of the cluster that was either provided to css or made by
#' css if no name was provided); ClustProtoName (the name of the selection
#' prototype from the cluster, which is the feature with the greatest individual
#' selection proportion among all the cluster members, with ties broken by
#' choosing the feature with the highest correlation with the response if the
#' response is real-valued; only shown if the features are named); ClustProtoNum
#' (the column number of the prototype in the X matrix provided to css);
#' ClustSelProp (the cluster's selection proportion); and ClustSize (the size of
#' the cluster).
#' @author Gregory Faletto, Jacob Bien
#' @export
print.cssr <- function(x, cutoff=0, min_num_clusts=1, max_num_clusts=NA, ...){
    df <- printCssDf(css_results=x, cutoff=cutoff,
        min_num_clusts=min_num_clusts, max_num_clusts=max_num_clusts)
    if(nrow(df) == 0){
        # Empty selection (#120): no table to print.
        cat(sprintf("(no clusters selected at cutoff %s)\n", format(cutoff)))
    } else{
        print.data.frame(df, ...)
    }
    invisible(x)
}

5.1 Ergonomic S3 accessors: selected() and summary.cssr()

The functions above (getCssSelections(), printCssDf(), print.cssr()) already expose everything an end user needs, but they take the cssr object as css_results and return heterogeneous output. For a more idiomatic R experience – mirroring the selected() and summary.stabsel() interface of the stabs package – we add a small S3 layer on top of these internals. selected() is a generic accessor that returns the selected clusters (or features) from a fitted cssr object, and summary.cssr() returns a compact, computable overview object with its own print method. Both are thin wrappers over getCssSelections()/printCssDf(); they introduce no new methodology.

selected() generic:

#' Extract the selected clusters or features from cluster stability selection
#'
#' A convenient S3 accessor for the selected clusters (the default) or the
#' selected features of a fitted `cssr` object, without re-running the
#' (computationally expensive) subsampling. This is a thin wrapper around
#' [getCssSelections()].
#'
#' Note that, unlike the `selected()` accessor in the \pkg{stabs} package (which
#' returns the selected *variables*), the default here returns the selected
#' *clusters* -- the natural unit of cluster stability selection. Pass
#' `type = "features"` to obtain the flat integer vector of selected features
#' instead.
#'
#' @param object An object of class "cssr" (the output of the function [css()]).
#' @param ... Additional arguments passed to methods (currently unused).
#' @return For the `cssr` method: if `type = "clusters"` (the default), a named
#' list of integer vectors (each the indices of the features in one selected
#' cluster; an empty list if no cluster is selected); if `type = "features"`, a
#' named integer vector of the indices of the selected features (an empty integer
#' vector if none).
#' @author Gregory Faletto, Jacob Bien
#' @references
<<faletto2022>>
#' @seealso [summary.cssr()] for an overview (counts plus a per-cluster table);
#' [getCssSelections()] for the underlying selection (clusters, features, and
#' weights together); [printCssDf()] and [print.cssr()] for the printed summary.
#' @examples
#' set.seed(1)
#' data <- genClusteredData(n = 50, p = 11, k_unclustered = 2,
#'   cluster_size = 4, n_clusters = 1, snr = 3)
#' clusters <- list(cluster1 = 1:4)
#' res <- css(X = data$X, y = data$y, lambda = 0.01, clusters = clusters,
#'   B = 10)
#' # Selected clusters (the default):
#' selected(res)
#' # Selected features (a flat integer vector):
#' selected(res, type = "features")
#' @export
selected <- function(object, ...){
    UseMethod("selected")
}

selected.cssr() method:

#' @param type Character; either "clusters" (the default) to return the named
#' list of selected clusters, or "features" to return the flat integer vector of
#' selected features. May be abbreviated.
#' @param cutoff Numeric; only those clusters with selection proportions equal to
#' at least cutoff will be selected. Must be between 0 and 1. Default is 0 (in
#' which case either all clusters are selected, or max_num_clusts are selected,
#' if max_num_clusts is specified).
#' @param min_num_clusts Integer or numeric; the minimum number of clusters to
#' use regardless of cutoff. (That is, if the chosen cutoff returns fewer than
#' min_num_clusts clusters, the cutoff will be lowered until at least
#' min_num_clusts clusters are selected.) Default is 1. May be set to 0 to allow
#' a pure cutoff-based (threshold) selection that returns an empty result when no
#' cluster's selection proportion meets the cutoff.
#' @param max_num_clusts Integer or numeric; the maximum number of clusters to
#' use regardless of cutoff. (That is, if the chosen cutoff returns more than
#' max_num_clusts clusters, the cutoff will be raised until at most
#' max_num_clusts clusters are selected.) Default is NA (in which case
#' max_num_clusts is ignored).
#' @param weighting Character; passed to [getCssSelections()] to determine the
#' weights of individual features within the selected clusters. This affects
#' ONLY `type = "features"` (it determines which cluster members have nonzero
#' weight and are therefore returned); it is a no-op for `type = "clusters"`,
#' whose selected clusters, selection proportions, and sizes do not depend on the
#' weighting. Must be one of "sparse", "weighted_avg", or "simple_avg". See
#' [getCssSelections()] for details. Default is "sparse".
#' @rdname selected
#' @export
selected.cssr <- function(object, type=c("clusters", "features"), cutoff=0,
    min_num_clusts=1, max_num_clusts=NA, weighting="sparse", ...){
    stopifnot(inherits(object, "cssr"))
    type <- match.arg(type)

    # Thin wrapper: getCssSelections validates all inputs (checkCutoff,
    # checkWeighting, checkMinNumClusts, checkMaxNumClusts) and cleanly returns
    # empty results on an empty selection (e.g. cutoff = 1, min_num_clusts = 0).
    sel <- getCssSelections(object, weighting=weighting, cutoff=cutoff,
        min_num_clusts=min_num_clusts, max_num_clusts=max_num_clusts)

    if(type == "clusters"){
        return(sel$selected_clusts)
    }
    return(sel$selected_feats)
}

summary.cssr():

#' Summarize cluster stability selection output
#'
#' Produce a concise, computable overview of a fitted `cssr` object: a header
#' with the number of selected clusters and features (at the given cutoff), a
#' per-cluster table built on top of [printCssDf()], and a placeholder for
#' per-family error rate (PFER) error-control content.
#'
#' The returned object has class "summary.cssr" and an accompanying print method
#' ([print.summary.cssr()]). Its `table` element is exactly the data.frame
#' returned by [printCssDf()]: one row per selected cluster, sorted by selection
#' proportion, with columns ClustName, ClustProtoName (only if the features are
#' named), ClustProtoNum, ClustSelProp (the cluster's selection proportion) and
#' ClustSize.
#'
#' If the selection is empty (for example `cutoff = 1` with `min_num_clusts = 0`
#' when no cluster reaches the cutoff), `summary.cssr` returns a well-formed
#' zero-row object without error: `table` is NULL and the header reports
#' "0 clusters / 0 features selected". In this case [printCssDf()] and its helper
#' are deliberately NOT called, as they require at least one selected cluster.
#'
#' The `pfer` element is a documented placeholder (NA) reserved for error-control
#' content to be filled in by a future release (issue #87).
#'
#' @param object An object of class "cssr" (the output of the function [css()]).
#' @param cutoff Numeric; only those clusters with selection proportions equal to
#' at least cutoff are summarized. Must be between 0 and 1. Default is 0 (in
#' which case either all clusters are summarized, or max_num_clusts are, if
#' max_num_clusts is specified).
#' @param min_num_clusts Integer or numeric; the minimum number of clusters to
#' use regardless of cutoff. Default is 1. May be set to 0 to allow a pure
#' cutoff-based (threshold) selection that can be empty.
#' @param max_num_clusts Integer or numeric; the maximum number of clusters to
#' use regardless of cutoff. Default is NA (in which case max_num_clusts is
#' ignored).
#' @param weighting Character; passed to [getCssSelections()] to determine the
#' selected features and hence the selected-feature count in the header. As in
#' [selected()], it does NOT affect which clusters are selected or the
#' per-cluster table. Must be one of "sparse", "weighted_avg", or "simple_avg".
#' Default is "sparse".
#' @param ... Additional arguments (currently unused).
#' @return An object of class "summary.cssr": a named list with elements
#' \item{n_selected_clusts}{Integer; the number of selected clusters.}
#' \item{n_selected_feats}{Integer; the number of selected features.}
#' \item{cutoff}{Numeric; the cutoff that was used.}
#' \item{table}{A data.frame with one row per selected cluster (the output of
#' [printCssDf()]), or NULL when the selection is empty.}
#' \item{pfer}{A placeholder (NA) for per-family error rate error-control
#' content, to be filled in by a future release (issue #87).}
#' @author Gregory Faletto, Jacob Bien
#' @seealso [selected()] to extract just the selected clusters or features;
#' [getCssSelections()] for the underlying selection; [printCssDf()] for the
#' per-cluster data.frame; [print.cssr()] for the analogous printed summary.
#' @examples
#' set.seed(1)
#' data <- genClusteredData(n = 50, p = 11, k_unclustered = 2,
#'   cluster_size = 4, n_clusters = 1, snr = 3)
#' clusters <- list(cluster1 = 1:4)
#' res <- css(X = data$X, y = data$y, lambda = 0.01, clusters = clusters,
#'   B = 10)
#' summary(res)
#' @export
summary.cssr <- function(object, cutoff=0, min_num_clusts=1, max_num_clusts=NA,
    weighting="sparse", ...){
    stopifnot(inherits(object, "cssr"))

    # Selected clusters + features. getCssSelections validates all inputs
    # (checkCutoff / checkWeighting / checkMinNumClusts / checkMaxNumClusts) and
    # is empty-safe (a clean empty result when e.g. cutoff = 1 and
    # min_num_clusts = 0).
    sel <- getCssSelections(object, weighting=weighting, cutoff=cutoff,
        min_num_clusts=min_num_clusts, max_num_clusts=max_num_clusts)

    n_selected_clusts <- length(sel$selected_clusts)
    n_selected_feats <- length(sel$selected_feats)

    # Empty-selection path: do NOT call printCssDf() / getSelectionPrototypes()
    # here. Both assert at least one selected cluster (printCssDf: nrow >= 1;
    # getSelectionPrototypes: n_selected_clusts >= 1) and so would die opaquely
    # on an empty selection (#107). Return a well-formed zero-row summary.
    if(n_selected_clusts >= 1){
        # Reuse the sel$selected_clusts already computed above instead of a
        # second identical getCssSelections() inside printCssDf(). Cluster
        # selection is weighting-invariant, so sel$selected_clusts (for this
        # weighting) matches the "sparse"-default selection printCssDf() would
        # recompute -- the resulting table is byte-identical. (#129)
        clust_table <- buildCssDf(object, sel$selected_clusts)
    } else{
        clust_table <- NULL
    }

    out <- list(n_selected_clusts=n_selected_clusts,
        n_selected_feats=n_selected_feats, cutoff=cutoff,
        table=clust_table, pfer=NA_real_)

    class(out) <- "summary.cssr"

    return(out)
}

print.summary.cssr() (mirrors the print.cssr() S3-method pattern – roxygen with @export, no @examples):

#' @param x An object of class "summary.cssr" (the output of [summary.cssr()]).
#' @rdname summary.cssr
#' @export
print.summary.cssr <- function(x, ...){
    cat("Cluster stability selection summary\n")
    cat(sprintf("%d cluster%s / %d feature%s selected at cutoff %s\n",
        x$n_selected_clusts, if(x$n_selected_clusts == 1) "" else "s",
        x$n_selected_feats, if(x$n_selected_feats == 1) "" else "s",
        format(x$cutoff)))
    if(is.null(x$table)){
        cat("(no clusters selected at this cutoff)\n")
    } else{
        cat("\n")
        print.data.frame(x$table, ...)
    }
    # PFER placeholder: error-control content to be filled in by issue #87.
    cat(sprintf("\nPFER: %s (not yet implemented; see issue #87)\n",
        format(x$pfer)))
    invisible(x)
}