4 Selection and prediction functions
Now that css() has been defined, we write functions to work with the output of css(). The output of these functions will be of more direct interest for most end users than the output of css(). These functions are defined separately from css() because the most computationally intensive steps happen within css(). css() can be called only once on a data set, and then the functions that follow can be explored relatively quickly (one can try different parameters, etc.).
-
getCssSelections()takes in the results ofcss()along with user-defined parameters on how to select clusters (a minimum or maximum number of clusters to select, along with a cutoff for cluster selection proportions) and selects clusters as well as features from those clusters. -
getCssDesign()takes in the same inputs asgetCssSelections()along with an unlabeled test matrix of features X (containing the same features as the X matrix provided originally tocss()). It uses the results fromcss()to select clusters likegetCssSelections(), then it uses a user-selected weighting scheme to compute weighted averages of the cluster members. It returns a test matrix of cluster representatives, which can be used for downstream predictive tasks. - Finally,
getCssPreds()has the same inputs asgetCssDesign(), except it also accepts a set of labeled training data (where the response must be real-valued).getCssPreds()selects clusters, forms matrices of cluster representatives on the training and test data, uses the training matrix of cluster representatives (along with the vector of responses for the training data) to estimate a linear model via ordinary least squares, and finally generates predictions on the test data using this linear model.
As in the previous section, we first define each function and then define the helper functions called by that function. As before, the tests for these functions are collected in the Tests chapters at the end of this document (the Tests part).
-
getCssSelections()- Calls
checkCutoff(), which checks that the specified cutoff input is as expected -
checkWeighting()verifies the weighting input togetCssSelections() -
checkMinNumClusts()confirms that themin_num_clustsparameter provided togetCssSelections()is as expected - Likewise for
checkMaxNumClusts() -
getSelectedClusters()is the workhorse function ofgetCssSelections(), doing most of the work to get the selected clusters (and the selected features from those clusters) with the verified inputs-
checkSelectedClusters()checks internally that the selected clusters are as expected -
getAllClustWeights()gets the weights for each of the members of the selected clusters-
getClustWeights()gets the weights for a single cluster
-
-
checkGetSelectedClustersOutput()verifies the output ofgetSelectedClusters()
-
- Calls
-
getCssDesign()-
checkNewXProvided()confirms that the design matrix provided togetCssDesign()matches the characteristics of the matrix that was provided tocss().-
checkXInputResults()also verifies these inputs (and is also used bygetCssPreds()on both the training and test data)
-
-
formCssDesign()is the workhorse function, generating a matrix of cluster representatives with the verified inputs togetCssDesign()-
checkFormCssDesignInputs()verifies the inputs toformCssDesign()(somewhat redundantly here, butformCssDesign()is called by more than one function, so this verifies the inputs toformCssDesign()regardless of where it is called).
-
-
-
getCssPreds()-
checkGetCssPredsInputs()verifies the inputs togetCssPreds()
-
#' Obtain a selected set of clusters and features
#'
#' Generate sets of selected clusters and features from cluster stability
#' selection.
#' @param css_results An object of class "cssr" (the output of the function
#' css).
#' @param weighting Character; determines how to calculate the weights for
#' individual features within the selected clusters. Only those features with
#' nonzero weight within the selected clusters will be returned. Must be one of
#' "sparse", "weighted_avg", or "simple_avg". For "sparse", all the weight is
#' put on the most frequently
#' selected individual cluster member (or divided equally among all the clusters
#' that are tied for the top selection proportion if there is a tie). For
#' "weighted_avg", only the features within a selected cluster that were
#' themselves selected on at least one subsample will have nonzero weight. For
#' "simple_avg", each cluster member gets equal weight regardless of the
#' individual feature selection proportions (that is, all cluster members within
#' each selected cluster will be returned.). See Faletto and Bien (2022) for
#' details. Default is "sparse".
#' @param cutoff Numeric; getCssSelections will select and return only of those
#' clusters with selection proportions equal to at least cutoff. Must be between
#' 0 and 1. Default is 0 (in which case either all clusters are selected, or
#' max_num_clusts are selected, if max_num_clusts is specified).
#' @param min_num_clusts Integer or numeric; the minimum number of clusters to
#' use regardless of cutoff. (That is, if the chosen cutoff returns fewer than
#' min_num_clusts clusters, the cutoff will be increased until at least
#' min_num_clusts clusters are selected.) Default is 1. May be set to 0 to allow
#' a pure cutoff-based (threshold) selection that returns an empty set when no
#' cluster's selection proportion meets the cutoff.
#' @param max_num_clusts Integer or numeric; the maximum number of clusters to
#' use regardless of cutoff. (That is, if the chosen cutoff returns more than
#' max_num_clusts clusters, the cutoff will be decreased until at most
#' max_num_clusts clusters are selected.) Default is NA (in which case
#' max_num_clusts is ignored).
#' @return A named list with three items. \item{selected_clusts}{A named list of
#' integer vectors; each vector contains the indices of the features in one of
#' the selected clusters.} \item{selected_feats}{A named integer vector; the
#' indices of the features with nonzero weights from all of the selected
#' clusters.} \item{weights}{A named list of the same length as selected_clusts.
#' Each list element `weights[[j]]` is a numeric vector of the weights to use for
#' the jth selected cluster, and it has the same name as the cluster it
#' corresponds to.}
#' @author Gregory Faletto, Jacob Bien
#' @references
<<faletto2022>>
#' @examples
#' set.seed(1)
#' data <- genClusteredData(n = 50, p = 11, k_unclustered = 2,
#' cluster_size = 4, n_clusters = 1, snr = 3)
#' clusters <- list(cluster1 = 1:4)
#' res <- css(X = data$X, y = data$y, lambda = 0.01, clusters = clusters,
#' B = 10)
#' getCssSelections(res)
#' @export
getCssSelections <- function(css_results, weighting="sparse", cutoff=0,
min_num_clusts=1, max_num_clusts=NA){
# Check inputs
stopifnot(inherits(css_results, "cssr"))
checkCutoff(cutoff)
checkWeighting(weighting)
p <- ncol(css_results$feat_sel_mat)
checkMinNumClusts(min_num_clusts, p, length(css_results$clusters))
max_num_clusts <- checkMaxNumClusts(max_num_clusts, min_num_clusts, p,
length(css_results$clusters))
sel_results <- getSelectedClusters(css_results, weighting, cutoff,
min_num_clusts, max_num_clusts)
# sel_results$selected_clusts is guaranteed to have length at least 1 by
# getSelectedClusters
sel_clust_names <- names(sel_results$selected_clusts)
stopifnot(length(sel_clust_names) >= 0)
stopifnot(all(sel_clust_names %in% names(css_results$clusters)))
sel_clusts <- list()
for(i in seq_along(sel_clust_names)){
sel_clusts[[i]] <- css_results$clusters[[sel_clust_names[i]]]
names(sel_clusts)[i] <- sel_clust_names[i]
}
stopifnot(is.list(sel_clusts))
stopifnot(length(sel_clusts) == length(sel_clust_names))
# sel_results$selected_feats is guaranteed to have length at least as long
# as sel_results$selected_clusts by getSelectedClusters
return(list(selected_clusts=sel_clusts,
selected_feats=sel_results$selected_feats, weights=sel_results$weights))
}checkCutoff():
#' Helper function to confirm that the argument cutoff to several functions is
#' as expected
#'
#' @param cutoff Numeric; only those clusters with selection proportions equal
#' to at least cutoff will be selected by cluster stability selection. Must be
#' between 0 and 1.
#' @author Gregory Faletto, Jacob Bien
#' @keywords internal
#' @noRd
checkCutoff <- function(cutoff){
stopifnot(is.numeric(cutoff) | is.integer(cutoff))
stopifnot(length(cutoff) == 1)
stopifnot(!is.na(cutoff))
stopifnot(cutoff >= 0)
stopifnot(cutoff <= 1)
}#' Helper function to confirm that the argument weighting to several
#' functions is as expected
#'
#' @param weighting Character; determines how to calculate the weights to
#' combine features from the selected clusters into weighted averages, called
#' cluster representatives. Must be one of "sparse", "weighted_avg", or
#' "simple_avg".
#' @author Gregory Faletto, Jacob Bien
#' @keywords internal
#' @noRd
checkWeighting <- function(weighting){
stopifnot(length(weighting)==1)
stopifnot(!is.na(weighting))
if(!is.character(weighting)){
stop("Weighting must be a character")
}
if(!(weighting %in% c("sparse", "simple_avg", "weighted_avg"))){
stop("Weighting must be a character and one of sparse, simple_avg, or weighted_avg")
}
}#' Helper function to confirm that the argument min_num_clusts to several
#' functions is as expected
#'
#' @param min_num_clusts Integer or numeric; the minimum number of clusters to
#' use regardless of cutoff. (That is, if the chosen cutoff returns fewer than
#' min_num_clusts clusters, the cutoff will be increased until at least
#' min_num_clusts clusters are selected.)
#' @param p The number of features; since this is an upper bound on the number
#' of clusters of features, it is also an upper bound on min_num_clusts.
#' @param n_clusters The number of clusters; note that this is an upper bound
#' on min_num_clusts
#' @author Gregory Faletto, Jacob Bien
#' @keywords internal
#' @noRd
checkMinNumClusts <- function(min_num_clusts, p, n_clusters){
stopifnot(length(min_num_clusts) == 1)
stopifnot(is.numeric(min_num_clusts) | is.integer(min_num_clusts))
stopifnot(!is.na(min_num_clusts))
stopifnot(min_num_clusts == round(min_num_clusts))
stopifnot(min_num_clusts >= 0)
stopifnot(min_num_clusts <= p)
stopifnot(min_num_clusts <= n_clusters)
}#' Helper function to confirm that the argument max_num_clusts to several
#' functions is as expected
#'
#' @param max_num_clusts Integer or numeric; the maximum number of clusters to
#' use regardless of cutoff. (That is, if the chosen cutoff returns more than
#' max_num_clusts clusters, the cutoff will be decreased until at most
#' max_num_clusts clusters are selected.) Can be NA, in which case
#' max_num_clusts will be ignored.
#' @param min_num_clusts Integer or numeric; the minimum number of clusters to
#' use regardless of cutoff. (That is, if the chosen cutoff returns fewer than
#' min_num_clusts clusters, the cutoff will be increased until at least
#' min_num_clusts clusters are selected.) max_num_clusts must be at least as
#' large as min_num_clusts.
#' @param p The number of features; since this is an upper bound on the number
#' of clusters of features, it is also an upper bound on max_num_clusts.
#' @param n_clusters The number of clusters; note that this is an upper bound
#' on max_num_clusts
#' @return The provided max_num_clusts, coerced to an integer if needed, and
#' coerced to be less than or equal to the total number of clusters.
#' @author Gregory Faletto, Jacob Bien
#' @keywords internal
#' @noRd
checkMaxNumClusts <- function(max_num_clusts, min_num_clusts, p, n_clusters){
stopifnot(length(max_num_clusts) == 1)
if(!is.na(max_num_clusts)){
stopifnot(is.numeric(max_num_clusts) | is.integer(max_num_clusts))
stopifnot(max_num_clusts == round(max_num_clusts))
stopifnot(max_num_clusts >= 1)
stopifnot(max_num_clusts <= p)
max_num_clusts <- as.integer(min(n_clusters, max_num_clusts))
stopifnot(max_num_clusts >= min_num_clusts)
}
return(max_num_clusts)
}#' From css output, obtain names of selected clusters and selection proportions,
#' indices of all selected features, and weights of individual cluster members
#'
#' If cutoff is too high for at least min_num_clusts clusters to be selected,
#' then it will be lowered until min_num_clusts can be selected. After that, if
#' the cutoff is too low such that more than max_num_clusts are selected, then
#' the cutoff will be increased until no more than max_num_clusts are selected.
#' Note that because clusters can have tied selection proportions, it is
#' possible that the number of selected clusters will be strictly lower than
#' max_num_clusts or strictly greater than min_num_clusts. In fact, it is
#' possible that both cutoffs won't be able to be satisfied simulteaneously,
#' even if there is a strictly positive difference between max_num_clusts and
#' min_num_clusts. If this occurs, max_num_clusts will take precedence over
#' min_num_clusts. getSelectedClusters will throw an error if the provided
#' inputs don't allow it to select any clusters.
#'
#' @param css_results An object of class "cssr" (the output of the function
#' css).
#' @param weighting Character; determines how to calculate the weights for
#' individual features within the selected clusters. Only those features with
#' nonzero weight within the selected clusters will be returned. Must be one of
#' "sparse", "weighted_avg", or "simple_avg". For "sparse", all the weight is
#' put on the most frequently
#' selected individual cluster member (or divided equally among all the clusters
#' that are tied for the top selection proportion if there is a tie). For
#' "weighted_avg", only the features within a selected cluster that were
#' themselves selected on at least one subsample will have nonzero weight. For
#' "simple_avg", each cluster member gets equal weight regardless of the
#' individual feature selection proportions (that is, all cluster members within
#' each selected cluster will be returned.). See Faletto and Bien (2022) for
#' details.
#' @param cutoff Numeric; getCssSelections will select and return only of those
#' clusters with selection proportions equal to at least cutoff. Must be between
#' 0 and 1.
#' @param min_num_clusts Integer or numeric; the minimum number of clusters to
#' use regardless of cutoff. (That is, if the chosen cutoff returns fewer than
#' min_num_clusts clusters, the cutoff will be increased until at least
#' min_num_clusts clusters are selected.)
#' @param max_num_clusts Integer or numeric; the maximum number of clusters to
#' use regardless of cutoff. (That is, if the chosen cutoff returns more than
#' max_num_clusts clusters, the cutoff will be decreased until at most
#' max_num_clusts clusters are selected.) If NA, max_num_clusts is ignored.
#' @return A named list with the following elements: \item{selected_clusts}{A
#' named numeric vector containing the selection proportions for the selected
#' clusters. The name of each entry is the name of the corresponding cluster.}
#' \item{selected_feats}{A named integer vector; the indices of the features
#' with nonzero weights from all of the selected clusters.} \item{weights}{A
#' named list of the same length as the number of selected clusters. Each list
#' element `weights[[j]]` is a numeric vector of the weights to use for the jth
#' selected cluster, and it has the same name as the cluster it corresponds
#' to.}
#' @author Gregory Faletto, Jacob Bien
#' @keywords internal
#' @noRd
getSelectedClusters <- function(css_results, weighting, cutoff, min_num_clusts,
max_num_clusts){
# Check input
stopifnot(inherits(css_results, "cssr"))
# Eliminate clusters with selection proportions below cutoff
clus_sel_props <- colMeans(css_results$clus_sel_mat)
B <- nrow(css_results$feat_sel_mat)
# Selection proportions are multiples of 1/B, but `cutoff` is adjusted by
# repeated +/- 1/B in the loops below, which accumulates floating-point
# error. Compare against `cutoff` minus half a step so a cluster sitting
# exactly at the threshold is included robustly (snaps to the nearest
# count); otherwise a tie could be dropped and, e.g., the max_num_clusts
# loop could break early and return more clusters than max_num_clusts.
tol <- 1 / (2 * B)
# Get selected clusters
selected_clusts <- clus_sel_props[clus_sel_props >= cutoff - tol]
# Check that selected_clusts has length at least min_num_clusts
while(length(selected_clusts) < min_num_clusts){
# Guard against an unbounded decrement if min_num_clusts exceeds the
# number of clusters: once every cluster is selected we can't select
# more, so stop (mirrors the max-loop's break) (#69). Unreachable via the
# public API (checkMinNumClusts enforces min_num_clusts <= n_clusters),
# but a defensive break for any internal caller.
if(length(selected_clusts) == length(clus_sel_props)){
break
}
cutoff <- cutoff - 1/B
selected_clusts <- clus_sel_props[clus_sel_props >= cutoff - tol]
}
# Check that selected_clusts has length at most max_num_clusts
if(!is.na(max_num_clusts)){
n_clusters <- ncol(css_results$clus_sel_mat)
while(length(selected_clusts) > max_num_clusts){
cutoff <- cutoff + 1/B
# Apply the same tol to the upper bound: cutoff accumulates +1/B and
# B*(1/B) can float just above 1, which would break before the
# cutoff == 1 filter runs and leave a cluster below proportion 1 in
# the selection (#42).
if(cutoff > 1 + tol){
break
}
# Make sure we don't reduce to a selected set of size 0
if(any(clus_sel_props >= cutoff - tol)){
selected_clusts <- clus_sel_props[clus_sel_props >= cutoff - tol]
} else{
break
}
}
}
stopifnot(length(selected_clusts) >= 0)
clust_names <- names(selected_clusts)
n_sel_clusts <- length(selected_clusts)
# Check that n_sel_clusts is as expected, and throw warnings or an error if
# not
checkSelectedClusters(n_sel_clusts, min_num_clusts, max_num_clusts,
max(clus_sel_props))
### Get selected features from selected clusters
clusters <- css_results$clusters
stopifnot(all(clust_names %in% names(clusters)))
# Get a list of weights for all of the selected clusters
weights <- getAllClustWeights(css_results, selected_clusts, weighting)
# Get selected features from each cluster (those features with nonzero
# weights)
selected_feats <- integer()
for(i in seq_len(n_sel_clusts)){
clus_i_name <- clust_names[i]
clust_i <- clusters[[clus_i_name]]
weights_i <- weights[[i]]
selected_feats <- c(selected_feats, clust_i[weights_i != 0])
}
feat_names <- colnames(css_results$feat_sel_mat)
names(selected_feats) <- feat_names[selected_feats]
# Check output (already checked weights wihin getAllClustWeights)
checkGetSelectedClustersOutput(selected_clusts, selected_feats,
weights, n_clusters=length(clusters), p=ncol(css_results$feat_sel_mat))
return(list(selected_clusts=selected_clusts,
selected_feats=selected_feats, weights=weights))
}#' Helper function to check operations within getSelectedClusters function
#'
#' @param n_sel_clusts The number of selected clusters; should be constrained
#' by min_num_clusts and max_num_clusts (though it may not be possible to
#' satisfy both constraints simulteneously, in which case a warning will be
#' thrown).
#' @param min_num_clusts Integer or numeric; the minimum number of clusters to
#' use regardless of cutoff. (That is, if the chosen cutoff returns fewer than
#' min_num_clusts clusters, the cutoff will be increased until at least
#' min_num_clusts clusters are selected.)
#' @param max_num_clusts Integer or numeric; the maximum number of clusters to
#' use regardless of cutoff. (That is, if the chosen cutoff returns more than
#' max_num_clusts clusters, the cutoff will be decreased until at most
#' max_num_clusts clusters are selected.) If NA, max_num_clusts is ignored.
#' @param max_sel_prop Numeric; the maximum selection proportion observed for
#' any cluster.
#' @author Gregory Faletto, Jacob Bien
#' @keywords internal
#' @noRd
checkSelectedClusters <- function(n_sel_clusts, min_num_clusts, max_num_clusts,
max_sel_prop){
if(n_sel_clusts == 0 && min_num_clusts >= 1){
err <- paste("No clusters selected with this cutoff (try a cutoff below the maximum cluster selection proportion, ",
max_sel_prop, ")", sep="")
stop(err)
}
stopifnot(n_sel_clusts >= 0)
# It may be impossible to get at least min_num_clusts or at most
# max_num_clusts; if so, give a warning
if(n_sel_clusts < min_num_clusts){
warn <- paste("Returning fewer than min_num_clusts = ", min_num_clusts,
" clusters because decreasing the cutoff any further would require returning more than max_num_clusts = ",
max_num_clusts, " clusters", sep="")
warning(warn)
}
if(!is.na(max_num_clusts)){
if(n_sel_clusts > max_num_clusts){
warn <- paste("Returning more than max_num_clusts = ",
max_num_clusts,
" clusters because increasing the cutoff any further would require returning 0 clusters",
sep="")
warning(warn)
}
}
}#' Calculate weights for each cluster member of all of the selected clusters.
#'
#' @param css_results An object of class "cssr" (the output of the function
#' css).
#' @param sel_clusters A named numeric vector containing the selection
#' proportions for the selected clusters. The name of each entry is the name
#' of the corresponding cluster.
#' @param weighting Character; determines how to calculate the weights for
#' individual features within the selected clusters. Only those features with
#' nonzero weight within the selected clusters will be returned. Must be one of
#' "sparse", "weighted_avg", or "simple_avg". For "sparse", all the weight is
#' put on the most frequently
#' selected individual cluster member (or divided equally among all the clusters
#' that are tied for the top selection proportion if there is a tie). For
#' "weighted_avg", only the features within a selected cluster that were
#' themselves selected on at least one subsample will have nonzero weight. For
#' "simple_avg", each cluster member gets equal weight regardless of the
#' individual feature selection proportions (that is, all cluster members within
#' each selected cluster will be returned.). See Faletto and Bien (2022) for
#' details.
#' @return A named list of the same length as sel_clusters of numeric vectors.
#' `weights[[j]]` is the weights to use for the jth selected cluster, and it has
#' the same name as the cluster it corresponds to.
#' @author Gregory Faletto, Jacob Bien
#' @keywords internal
#' @noRd
getAllClustWeights <- function(css_results, sel_clusters, weighting){
# Check inputs
stopifnot(inherits(css_results, "cssr"))
stopifnot(is.numeric(sel_clusters))
p_ret <- length(sel_clusters)
stopifnot(length(unique(names(sel_clusters))) == p_ret)
stopifnot(p_ret >= 0)
checkWeighting(weighting)
# Get selection proportions and clusters
feat_sel_props <- colMeans(css_results$feat_sel_mat)
p <- length(feat_sel_props)
stopifnot(p >= p_ret)
clusters <- css_results$clusters
stopifnot(all(names(sel_clusters) %in% names(clusters)))
# Identify weights
weights <- list()
for(j in seq_len(p_ret)){
# Find the members of the cluster feature j is a member of
cluster_j <- clusters[[names(sel_clusters)[j]]]
# Get the weights for this cluster and add them to the list
weights[[j]] <- getClustWeights(cluster_j, weighting, feat_sel_props)
}
# Add names to weights
names(weights) <- names(sel_clusters)
# Check output
stopifnot(length(weights) == p_ret)
stopifnot(is.list(weights))
for(i in seq_len(p_ret)){
stopifnot(length(clusters[[names(sel_clusters)[i]]]) ==
length(weights[[i]]))
stopifnot(all(weights[[i]] >= 0))
stopifnot(all(weights[[i]] <= 1))
stopifnot(abs(sum(weights[[i]]) - 1) < 10^(-6))
}
return(weights)
}#' Calculate weights for members of a cluster using selection proportions
#'
#' Given a cluster of features, the selection proportions for each cluster
#' member, and a specified weighting scheme, calculate the appropriate weights
#' for the cluster.
#' @param cluster_i An integer vector containing the indices of the members
#' of a cluster.
#' @param weighting Character; determines how to calculate the weights for
#' individual features within the selected clusters. Only those features with
#' nonzero weight within the selected clusters will be returned. Must be one of
#' "sparse", "weighted_avg", or "simple_avg". For "sparse", all the weight is
#' put on the most frequently
#' selected individual cluster member (or divided equally among all the clusters
#' that are tied for the top selection proportion if there is a tie). For
#' "weighted_avg", only the features within a selected cluster that were
#' themselves selected on at least one subsample will have nonzero weight. For
#' "simple_avg", each cluster member gets equal weight regardless of the
#' individual feature selection proportions (that is, all cluster members within
#' each selected cluster will be returned.). See Faletto and Bien (2022) for
#' details.
#' @param feat_sel_props A numeric vector of selection proportions corresponding
#' to each of the p features.
#' @return A numeric vector of the same length as cluster_i containing the
#' weights corresponding to each of the features in cluster_i. The weights
#' will all be nonnegative and sum to 1.
#' @author Gregory Faletto, Jacob Bien
#' @keywords internal
#' @noRd
getClustWeights <- function(cluster_i, weighting, feat_sel_props){
stopifnot(is.integer(cluster_i) | is.numeric(cluster_i))
stopifnot(all(cluster_i == round(cluster_i)))
n_weights <- length(cluster_i)
stopifnot(length(unique(cluster_i)) == n_weights)
p <- length(feat_sel_props)
stopifnot(all(cluster_i %in% 1:p))
# Get the selection proportions of each cluster member
sel_props <- feat_sel_props[cluster_i]
stopifnot(all(sel_props >= 0))
stopifnot(all(sel_props <= 1))
weights_i <- rep(as.numeric(NA), n_weights)
# Weighted or simple average?
if(weighting == "sparse"){
# Sparse cluster stability selection: All features in cluster with
# selection proportion equal to the max
# for the cluster get equal weight; rest of cluster gets 0 weight
if(sum(sel_props) == 0){
weights_i <- rep(1/n_weights, n_weights)
} else{
maxes <- sel_props==max(sel_props)
stopifnot(sum(maxes) > 0)
stopifnot(sum(maxes) <= n_weights)
weights_i <- rep(0, n_weights)
weights_i[maxes] <- 1/sum(maxes)
}
} else if(weighting == "weighted_avg"){
# Get weights for weighted average
if(sum(sel_props) == 0){
weights_i <- rep(1/n_weights, n_weights)
} else{
weights_i <- sel_props/sum(sel_props)
}
} else if(weighting == "simple_avg"){
weights_i <- rep(1/n_weights, n_weights)
} else{
stop("weighting must be one of sparse, simple_avg, or weighted_avg")
}
stopifnot(abs(sum(weights_i) - 1) < 10^(-6))
stopifnot(length(weights_i) == n_weights)
stopifnot(length(weights_i) >= 1)
stopifnot(all(weights_i >= 0))
stopifnot(all(weights_i <= 1))
return(weights_i)
}checkGetSelectedClustersOutput():
#' Helper function to check that output of getSelectedClusters is as expected
#'
#' @param selected_clusts A named numeric vector containing the selection
#' proportions for the selected clusters. The name of each entry is the name of
#' the corresponding cluster.
#' @param selected_feats A named integer vector; the indices of the features
#' with nonzero weights from all of the selected clusters.
#' @param weights A named list of the same length as the number of selected
#' clusters. Each list element `weights[[j]]` is a numeric vector of the weights
#' to use for the jth selected cluster, and it has the same name as the cluster
#' it corresponds to.
#' @param n_clusters Integer; the number of clusters in the data (upper bound
#' for the length of selected_clusts)
#' @param p Integer; number of features in the data (all selected_feats should
#' be in 1:p)
#' @author Gregory Faletto, Jacob Bien
#' @keywords internal
#' @noRd
checkGetSelectedClustersOutput <- function(selected_clusts, selected_feats,
weights, n_clusters, p){
stopifnot(is.numeric(selected_clusts))
stopifnot(all(selected_clusts >= 0))
stopifnot(all(selected_clusts <= 1))
stopifnot(length(selected_clusts) >= 0)
stopifnot(length(selected_clusts) <= n_clusters)
stopifnot(length(names(selected_clusts)) ==
length(unique(names(selected_clusts))))
stopifnot(!is.null(names(selected_clusts)))
stopifnot(all(!is.na(names(selected_clusts)) &
names(selected_clusts) != ""))
stopifnot(length(names(selected_clusts)) == length(selected_clusts))
stopifnot(is.integer(selected_feats))
stopifnot(length(selected_feats) == length(unique(selected_feats)))
stopifnot(all(selected_feats %in% 1:p))
stopifnot(length(selected_clusts) <= length(selected_feats))
stopifnot(identical(names(weights), names(selected_clusts)))
stopifnot(length(weights) == length(selected_clusts))
}#' Obtain a design matrix of cluster representatives
#'
#' Takes a matrix of observations from the original feature space and returns
#' a matrix of representatives from the selected clusters based on the results
#' of cluster stability selection.
#' @param css_results An object of class "cssr" (the output of the function
#' css).
#' @param newX A numeric matrix (preferably) or a data.frame (which will
#' be coerced internally to a matrix by the function model.matrix) containing
#' the data that will be used to generate the design matrix of cluster
#' representatives. Must contain the same features (in the same
#' number of columns) as the X matrix provided to css, and if the columns of
#' newX are labeled, the names must match the variable names provided to css.
#' Must not contain missing (`NA`) values. newX may be omitted if train_inds
#' were provided to css to set aside observations for model estimation. If this
#' is the case, then when newX is omitted getCssDesign will return a design
#' matrix of cluster representatives formed from the train_inds observations
#' from the matrix X provided to css. (If no train_inds were provided to css,
#' newX must be provided to getCssDesign.) Default is NA.
#' @param weighting Character; determines how to calculate the weights to
#' combine features from the selected clusters into weighted averages, called
#' cluster representatives. Must be one of "sparse", "weighted_avg", or
#' "simple_avg". For "sparse", all the weight is put on the most frequently
#' selected individual cluster member (or divided equally among all the clusters
#' that are tied for the top selection proportion if there is a tie). For
#' "weighted_avg", the weight used for each cluster member is calculated in
#' proportion to the individual selection proportions of each feature. For
#' "simple_avg", each cluster member gets equal weight regardless of the
#' individual feature selection proportions (that is, the cluster representative
#' is just a simple average of all the cluster members). See Faletto and Bien
#' (2022) for details. Default is "weighted_avg".
#' @param cutoff Numeric; getCssDesign will only include those clusters with
#' selection proportions equal to at least cutoff. Must be between 0 and 1.
#' Default is 0 (in which case either all clusters are used, or max_num_clusts
#' are used, if max_num_clusts is specified).
#' @param min_num_clusts Integer or numeric; the minimum number of clusters to
#' use regardless of cutoff. (That is, if the chosen cutoff returns fewer than
#' min_num_clusts clusters, the cutoff will be increased until at least
#' min_num_clusts clusters are selected.) Default is 1. May be 0, but an empty
#' selection then raises an error (no design/predictions possible).
#' @param max_num_clusts Integer or numeric; the maximum number of clusters to
#' use regardless of cutoff. (That is, if the chosen cutoff returns more than
#' max_num_clusts clusters, the cutoff will be decreased until at most
#' max_num_clusts clusters are selected.) Default is NA (in which case
#' max_num_clusts is ignored).
#' @return A design matrix with either nrow(newX) (or length(train_inds), if
#' train_inds was provided to css and newX was not provided to getCssDesign)
#' observations and number of columns equal to the number of selected clusters,
#' containing the cluster representatives for each cluster.
#' @author Gregory Faletto, Jacob Bien
#' @examples
#' set.seed(1)
#' train <- genClusteredData(n = 50, p = 11, k_unclustered = 2,
#' cluster_size = 4, n_clusters = 1, snr = 3)
#' newdata <- genClusteredData(n = 10, p = 11, k_unclustered = 2,
#' cluster_size = 4, n_clusters = 1, snr = 3)
#' clusters <- list(cluster1 = 1:4)
#' res <- css(X = train$X, y = train$y, lambda = 0.01, clusters = clusters,
#' B = 10)
#' X_design <- getCssDesign(res, newX = newdata$X)
#' dim(X_design)
#' @export
getCssDesign <- function(css_results, newX=NA, weighting="weighted_avg",
cutoff=0, min_num_clusts=1, max_num_clusts=NA){
# Check inputs
stopifnot(inherits(css_results, "cssr"))
check_results <- checkNewXProvided(newX, css_results)
newX <- check_results$newX
newXProvided <- check_results$newXProvided
rm(check_results)
n_train <- nrow(newX)
results <- checkXInputResults(newX, css_results$X)
newX <- results$newx
feat_names <- results$feat_names
rm(results)
n <- nrow(newX)
p <- ncol(newX)
checkCutoff(cutoff)
checkWeighting(weighting)
checkMinNumClusts(min_num_clusts, p, length(css_results$clusters))
max_num_clusts <- checkMaxNumClusts(max_num_clusts, min_num_clusts, p,
length(css_results$clusters))
# Take provided training design matrix and testX and turn them into
# matrices of cluster representatives using information from css_results
if(newXProvided){
newX_clusters <- formCssDesign(css_results, weighting, cutoff,
min_num_clusts, max_num_clusts, newx=newX)
} else{
newX_clusters <- formCssDesign(css_results, weighting, cutoff,
min_num_clusts, max_num_clusts)
}
return(newX_clusters)
}#' Helper function to confirm that the new X matrix provided to getCssDesign or
#' getCssPreds matches the characteristics of the X that was provided to css.
#'
#' @param trainX A numeric matrix (preferably) or a data.frame (which will
#' be coerced internally to a matrix by the function model.matrix). Must contain
#' the same features (in the same number of columns) as the X matrix provided to
#' css, and if the columns of trainX are labeled, the names must match the
#' variable names provided to css. trainX may be omitted if train_inds were
#' provided to css to set aside observations.
#' @param css_results An object of class "cssr" (the output of the function
#' css).
#' @return A named list with the following elements: \item{newX}{If trainX was
#' provided, this is the provided trainX matrix, coerced from a data.frame to a
#' matrix if the provided trainX was a data.frame. If trainX was not provided,
#' this is a matrix made up of the training indices provided to css in the
#' train_inds argument.} \item{newXProvided}{Logical; indicates whether a valid
#' trainX input was provided.}
#' @author Gregory Faletto, Jacob Bien
#' @keywords internal
#' @noRd
checkNewXProvided <- function(trainX, css_results){
newXProvided <- FALSE
# "Was newX provided?" -- the documented default for trainX is the scalar NA
# sentinel, so treat trainX as provided UNLESS it is exactly that default
# (an atomic, length-1 NA). A previous length()-based heuristic
# (length(trainX) > 1) misclassified a numeric vector or a one-column
# data.frame (length 1) as "not provided" and silently fell back to
# train_inds. The is.atomic/length/is.na test instead routes any matrix or
# data.frame to the provided branch -- including an NA-CONTAINING matrix
# (length > 1, so the is.na operand is short-circuited and never collapses a
# multi-element matrix to a length-1 logical), which is then rejected by
# checkXInputResults' checkNoNAs rather than silently replaced by the
# train_inds data (#71).
if(!(is.atomic(trainX) && length(trainX) == 1 && is.na(trainX))){
# trainX was provided -- it must be a matrix or data.frame (a bare
# vector is not a valid design and used to fail with a cryptic
# is.matrix() stopifnot downstream).
if(!(is.matrix(trainX) || is.data.frame(trainX))){
stop("newX must be a matrix or data.frame with the same columns as the X provided to css().", call. = FALSE)
}
newXProvided <- TRUE
trainX <- checkXInputResults(trainX, css_results$X)$newx
n_train <- nrow(trainX)
# A single-row newX is fine -- the design is just one row of cluster
# representatives, and getCssPreds/cssPredict already accept a 1-row test
# set -- so don't require more than one row here (#44).
stopifnot(n_train >= 1)
} else{
if(length(css_results$train_inds) == 0){
stop("css was not provided with indices to set aside for model training (train_inds), so must provide new X in order to generate a design matrix")
}
trainX <- css_results$X[css_results$train_inds, , drop = FALSE]
}
stopifnot(is.matrix(trainX))
stopifnot(is.numeric(trainX) | is.integer(trainX))
stopifnot(all(!is.na(trainX)))
stopifnot(ncol(trainX) >= 2)
return(list(newX=trainX, newXProvided=newXProvided))
}#' Helper function to confirm that inputs to several functions are as expected,
#' and modify inputs if needed
#'
#' @param newx A numeric matrix (preferably) or a data.frame (which will
#' be coerced internally to a matrix by the function model.matrix) containing
#' the data that will be used to generate the design matrix of cluster
#' representatives. Must contain the same features (in the same
#' number of columns) as the X matrix provided to css, and if the columns of
#' newX are labeled, the names must match the variable names provided to css.
#' @param css_X The X matrix provided to css, as in the output of the css
#' function (after having been coerced from a data.frame to a matrix by css if
#' needed).
#' @return A named list with the following elements. \item{feat_names}{A
#' character vector containing the column names of newx (if the provided newx
#' had column names). If the provided newx did not have column names, feat_names
#' will be NA.} \item{newx}{The provided newx matrix, coerced from a data.frame
#' to a matrix if the provided newx was a data.frame.}
#' @author Gregory Faletto, Jacob Bien
#' @keywords internal
#' @noRd
checkXInputResults <- function(newx, css_X){
checkNoNAs(newx, "newx")
# Check if x is a matrix; if it's a data.frame, convert to matrix.
newx <- coerceDataFrameToMatrix(newx, clusters = list(), arg_name = "newx")
feat_names <- as.character(NA)
if(!is.null(colnames(newx))){
feat_names <- colnames(newx)
stopifnot(identical(feat_names, colnames(css_X)))
} else{
# In this case, newx has no column names, so same better be true of
# css_X
if(!is.null(colnames(css_X))){
warning("New X provided had no variable names (column names) even though the X provided to css did.")
}
}
stopifnot(is.matrix(newx))
n <- nrow(newx)
p <- ncol(newx)
stopifnot(p >= 2)
if(length(feat_names) > 1){
stopifnot(length(feat_names) == p)
stopifnot(!("(Intercept)" %in% feat_names))
} else{
stopifnot(is.na(feat_names))
}
colnames(newx) <- character()
# Confirm that newx matches css_results$X
if(p != ncol(css_X)){
err <- paste("Number of columns in newx must match number of columns from matrix provided to css. Number of columns in new provided X: ",
p, ". Number of columns in matrix provided to css: ", ncol(css_X),
".", sep="")
stop(err)
}
if(length(feat_names) != 1 & all(!is.na(feat_names))){
if(!identical(feat_names, colnames(css_X))){
stop("Provided feature names for newx do not match feature names provided to css")
}
}
return(list(feat_names=feat_names, newx=newx))
}#' Create design matrix of cluster representatives from matrix of raw features
#' using results of css function
#'
#' @param css_results An object of class "cssr" (the output of the function
#' css).
#' @param weighting Character; determines how to calculate the weights to
#' combine features from the selected clusters into weighted averages, called
#' cluster representatives. Must be one of "sparse", "weighted_avg", or
#' "simple_avg". For "sparse", all the weight is put on the most frequently
#' selected individual cluster member (or divided equally among all the clusters
#' that are tied for the top selection proportion if there is a tie). For
#' "weighted_avg", the weight used for each cluster member is calculated in
#' proportion to the individual selection proportions of each feature. For
#' "simple_avg", each cluster member gets equal weight regardless of the
#' individual feature selection proportions (that is, the cluster representative
#' is just a simple average of all the cluster members). See Faletto and Bien
#' (2022) for details. Default is "weighted_avg".
#' @param cutoff Numeric; css will return only those clusters with selection
#' proportions equal to at least cutoff. Must be between 0 and 1. Default is 0
#' (in which case all clusters are returned in decreasing order of selection
#' proportion).
#' @param min_num_clusts Integer or numeric; the minimum number of clusters to
#' use regardless of cutoff. (That is, if the chosen cutoff returns fewer than
#' min_num_clusts clusters, the cutoff will be increased until at least
#' min_num_clusts clusters are selected.) Default is 1.
#' @param max_num_clusts Integer or numeric; the maximum number of clusters to
#' use regardless of cutoff. (That is, if the chosen cutoff returns more than
#' max_num_clusts clusters, the cutoff will be decreased until at most
#' max_num_clusts clusters are selected.) Default is NA (in which case
#' max_num_clusts is ignored).
#' @param newx A numeric matrix (preferably) or a data.frame (which will
#' be coerced internally to a matrix by the function model.matrix) containing
#' the data that will be used to generate the design matrix of cluster
#' representatives. Must contain the same features (in the same
#' number of columns) as the X matrix provided to css, and if the columns of
#' newx are labeled, the names must match the variable names provided to css.
#' newx may be omitted if train_inds were provided to css to set aside
#' observations for model estimation. If this is the case, then when newx is
#' omitted formCssDesign will return a design matrix of cluster representatives
#' formed from the train_inds observations from the matrix X provided to css.
#' (If no train_inds were provided to css, newX must be provided to
#' formCssDesign.) Default is NA.
#' @return A design matrix with the same number of rows as newx (or the
#' train_inds provided to css) where the columns are the constructed cluster
#' representatives.
#' @author Gregory Faletto, Jacob Bien
#' @references
<<faletto2022>>
#' @keywords internal
#' @noRd
formCssDesign <- function(css_results, weighting="weighted_avg", cutoff=0,
min_num_clusts=1, max_num_clusts=NA, newx=NA, weights=NULL){
# Check inputs
ret <- checkFormCssDesignInputs(css_results, weighting, cutoff,
min_num_clusts, max_num_clusts, newx)
newx <- ret$newx
max_num_clusts <- ret$max_num_clusts
rm(ret)
n <- nrow(newx)
p <- ncol(newx)
# Get the names of the selected clusters and the weights for the features
# within each cluster, according to the provided weighting rule. These depend
# only on css_results and the (weighting, cutoff, min/max_num_clusts) args,
# NOT on newx, so a caller that forms several designs from one css_results
# (e.g. getCssPreds, which builds a train design and a test design) can
# compute them once and pass them in via `weights` to skip the repeated
# getSelectedClusters() work. Byte-identical: a passed-in `weights` equals
# what this call would otherwise recompute (checkMaxNumClusts is idempotent,
# so the max_num_clusts re-derived by checkFormCssDesignInputs above matches
# the caller's already-adjusted value). (#129)
if(is.null(weights)){
weights <- getSelectedClusters(css_results, weighting, cutoff,
min_num_clusts, max_num_clusts)$weights
}
n_sel_clusts <- length(weights)
if(n_sel_clusts == 0){
stop("No clusters were selected, so a design matrix / predictions cannot be formed; lower the cutoff or set min_num_clusts >= 1.")
}
# Form matrix of cluster representatives of selected clusters
X_clus_reps <- matrix(rep(as.numeric(NA), n*n_sel_clusts), nrow=n,
ncol=n_sel_clusts)
colnames(X_clus_reps) <- rep(as.character(NA), n_sel_clusts)
for(i in seq_len(n_sel_clusts)){
clust_i_name <- names(weights)[i]
stopifnot(length(clust_i_name) == 1)
stopifnot(clust_i_name %in% names(weights))
colnames(X_clus_reps)[i] <- clust_i_name
clust_i <- css_results$clusters[[clust_i_name]]
stopifnot(length(clust_i) >= 1)
stopifnot(all(clust_i %in% 1:p))
weights_i <- weights[[clust_i_name]]
stopifnot(length(clust_i) == length(weights_i))
if(length(weights_i) > 1){
X_clus_reps[, i] <- newx[, clust_i] %*% weights_i
} else{
X_clus_reps[, i] <- newx[, clust_i]*weights_i
}
}
# Check output
stopifnot(all(!is.na(X_clus_reps)))
stopifnot(ncol(X_clus_reps) == n_sel_clusts)
stopifnot(nrow(X_clus_reps) == n)
return(X_clus_reps)
}#' Helper function to check that the inputs to formCssDesign are as expected
#'
#' @param css_results An object of class "cssr" (the output of the function
#' css).
#' @param weighting Character; determines how to calculate the weights to
#' combine features from the selected clusters into weighted averages, called
#' cluster representatives. Must be one of "sparse", "weighted_avg", or
#' "simple_avg".
#' @param cutoff Numeric; css will return only those clusters with selection
#' proportions equal to at least cutoff. Must be between 0 and 1.
#' @param min_num_clusts Integer or numeric; the minimum number of clusters to
#' use regardless of cutoff. (That is, if the chosen cutoff returns fewer than
#' min_num_clusts clusters, the cutoff will be increased until at least
#' min_num_clusts clusters are selected.)
#' @param max_num_clusts Integer or numeric; the maximum number of clusters to
#' use regardless of cutoff. (That is, if the chosen cutoff returns more than
#' max_num_clusts clusters, the cutoff will be decreased until at most
#' max_num_clusts clusters are selected.)
#' @param newx A numeric matrix (preferably) or a data.frame (which will
#' be coerced internally to a matrix by the function model.matrix) containing
#' the data that will be used to generate the design matrix of cluster
#' representatives. Must contain the same features (in the same
#' number of columns) as the X matrix provided to css, and if the columns of
#' newx are labeled, the names must match the variable names provided to css.
#' newx may be omitted if train_inds were provided to css to set aside
#' observations for model estimation. If this is the case, then when newx is
#' omitted formCssDesign will return a design matrix of cluster representatives
#' formed from the train_inds observations from the matrix X provided to css.
#' (If no train_inds were provided to css, newX must be provided to
#' formCssDesign.)
#' @return A named list with the following elements: \item{newx}{If newx was
#' provided, the provided newx matrix, coerced from a data.frame to a matrix if
#' needed. If newx was not provided, a matrix formed by the train_inds set
#' aside in the original function call to css.} \item{max_num_clusts}{The
#' provided max_num_clusts, coerced to an integer if needed, and coerced to be
#' less than or equal to the total number of clusters.}
#' @author Gregory Faletto, Jacob Bien
#' @keywords internal
#' @noRd
checkFormCssDesignInputs <- function(css_results, weighting, cutoff,
min_num_clusts, max_num_clusts, newx){
stopifnot(inherits(css_results, "cssr"))
if(length(newx) == 1){
if(is.na(newx)){
if(length(css_results$train_inds) == 0){
stop("If css was not provided with indices to set aside for model training, then newx must be provided to formCssDesign")
}
newx <- css_results$X[css_results$train_inds, , drop = FALSE]
# feat_names <- colnames(newx)
} else{
results <- checkXInputResults(newx, css_results$X)
newx <- results$newx
# feat_names <- results$feat_names
rm(results)
}
} else{
results <- checkXInputResults(newx, css_results$X)
newx <- results$newx
# feat_names <- results$feat_names
rm(results)
}
p <- ncol(newx)
checkCutoff(cutoff)
checkWeighting(weighting)
checkMinNumClusts(min_num_clusts, p, length(css_results$clusters))
max_num_clusts <- checkMaxNumClusts(max_num_clusts, min_num_clusts, p,
length(css_results$clusters))
return(list(newx=newx, max_num_clusts=max_num_clusts))
}#' Fit model and generate predictions from new data
#'
#' Generate predictions on test data using cluster stability-selected model.
#' @param css_results An object of class "cssr" (the output of the function
#' css).
#' @param testX A numeric matrix (preferably) or a data.frame (which will
#' be coerced internally to a matrix by the function model.matrix) containing
#' the data that will be used to generate predictions. Must contain the same
#' features (in the same number of columns) as the matrix provided to css, and
#' if the columns of testX are labeled, the names must match the variable names
#' provided to css. Must not contain missing (`NA`) values.
#' @param weighting Character; determines how to calculate the weights to
#' combine features from the selected clusters into weighted averages, called
#' cluster representatives. Must be one of "sparse", "weighted_avg", or
#' "simple_avg". For "sparse", all the weight is put on the most frequently
#' selected individual cluster member (or divided equally among all the clusters
#' that are tied for the top selection proportion if there is a tie). For
#' "weighted_avg", the weight used for each cluster member is calculated in
#' proportion to the individual selection proportions of each feature. For
#' "simple_avg", each cluster member gets equal weight regardless of the
#' individual feature selection proportions (that is, the cluster representative
#' is just a simple average of all the cluster members). See Faletto and Bien
#' (2022) for details. Default is "weighted_avg".
#' @param cutoff Numeric; getCssPreds will make use only of those clusters with
#' selection proportions equal to at least cutoff. Must be between 0 and 1.
#' Default is 0 (in which case either all clusters are used, or max_num_clusts
#' are used, if max_num_clusts is specified).
#' @param min_num_clusts Integer or numeric; the minimum number of clusters to
#' use regardless of cutoff. (That is, if the chosen cutoff returns fewer than
#' min_num_clusts clusters, the cutoff will be increased until at least
#' min_num_clusts clusters are selected.) Default is 1. May be 0, but an empty
#' selection then raises an error (no design/predictions possible).
#' @param max_num_clusts Integer or numeric; the maximum number of clusters to
#' use regardless of cutoff. (That is, if the chosen cutoff returns more than
#' max_num_clusts clusters, the cutoff will be decreased until at most
#' max_num_clusts clusters are selected.) Default is NA (in which case
#' max_num_clusts is ignored).
#' @param trainX A numeric matrix (preferably) or a data.frame (which will
#' be coerced internally to a matrix by the function model.matrix) containing
#' the data that will be used to estimate the linear model from the selected
#' clusters. trainX is only necessary to provide if no train_inds were
#' designated in the css function call to set aside observations for model
#' estimation (though even if train_inds was provided, trainX and trainY will be
#' used for model estimation if they are both provided to getCssPreds). Must
#' contain the same features (in the same number of columns) as the matrix
#' provided to css, and if the columns of trainX are labeled, the names must
#' match the variable names provided to css. Must not contain missing (`NA`)
#' values. Default is NA (in which case getCssPreds uses the observations from
#' the train_inds that were provided to css to estimate a linear model).
#' @param trainY The response corresponding to trainX. Must be a real-valued
#' response (unlike in the general css setup) because predictions will be
#' generated by an ordinary least squares model. Must have the same length as
#' the number of rows of trainX. Like trainX, only needs to be provided if no
#' observations were set aside for model estimation by the parameter train_inds
#' in the css function call. Default is NA (in which case getCssPreds uses the
#' observations from the train_inds that were provided to css).
#' @return A vector of predictions corresponding to the observations from testX.
#' @author Gregory Faletto, Jacob Bien
#' @references
<<faletto2022>>
#' @examples
#' set.seed(1)
#' data <- genClusteredData(n = 60, p = 11, k_unclustered = 2,
#' cluster_size = 4, n_clusters = 1, snr = 3)
#' clusters <- list(cluster1 = 1:4)
#' # Split into selection / training / test rows.
#' res <- css(X = data$X[1:20, ], y = data$y[1:20], lambda = 0.01,
#' clusters = clusters, B = 10)
#' preds <- getCssPreds(res, testX = data$X[51:60, ],
#' trainX = data$X[21:50, ], trainY = data$y[21:50])
#' preds
#' @export
getCssPreds <- function(css_results, testX, weighting="weighted_avg", cutoff=0,
min_num_clusts=1, max_num_clusts=NA, trainX=NA, trainY=NA){
# TODO(gregfaletto) Consider adding an argument for a user-provided prediction
# function in order to allow for more general kinds of predictions than
# OLS.
# Check inputs
check_list <- checkGetCssPredsInputs(css_results, testX, weighting, cutoff,
min_num_clusts, max_num_clusts, trainX, trainY)
trainXProvided <- check_list$trainXProvided
trainX <- check_list$trainX
testX <- check_list$testX
feat_names <- check_list$feat_names
max_num_clusts <- check_list$max_num_clusts
rm(check_list)
n_train <- nrow(trainX)
n <- nrow(testX)
p <- ncol(testX)
# The cluster weights depend only on css_results and the (weighting, cutoff,
# min/max_num_clusts) arguments -- not on the design matrix -- so compute
# them ONCE here and reuse for every formCssDesign() call below (train +
# test) instead of letting each call recompute the same getSelectedClusters()
# (colMeans + cutoff loops + getAllClustWeights). max_num_clusts is already
# checkMaxNumClusts-adjusted by checkGetCssPredsInputs, and checkMaxNumClusts
# is idempotent, so these weights are byte-identical to what each
# formCssDesign() would otherwise compute internally. (#129)
sel_weights <- getSelectedClusters(css_results, weighting, cutoff,
min_num_clusts, max_num_clusts)$weights
# Take provided training design matrix and testX and turn them into
# matrices of cluster representatives using information from css_results
if(trainXProvided){
train_X_clusters <- formCssDesign(css_results, weighting, cutoff,
min_num_clusts, max_num_clusts, newx=trainX, weights=sel_weights)
if(!is.numeric(trainY) & !is.integer(trainY)){
stop("The provided trainY must be real-valued, because predictions will be generated by ordinary least squares regression.")
}
y_train <- trainY
} else{
train_X_clusters <- formCssDesign(css_results, weighting, cutoff,
min_num_clusts, max_num_clusts, weights=sel_weights)
y_train <- css_results$y[css_results$train_inds]
if(!is.numeric(y_train) & !is.integer(y_train)){
stop("Can't generate predictions from the data that was provided to css because the provided y was not real-valued (getCssPreds generated predictions using ordinary least squares regression).")
}
}
stopifnot(length(y_train) == nrow(train_X_clusters))
testX_clusters <- formCssDesign(css_results, weighting, cutoff,
min_num_clusts, max_num_clusts, newx=testX, weights=sel_weights)
stopifnot(ncol(testX_clusters) == ncol(train_X_clusters))
# Get names for clusters
clust_X_names <- paste("c_fit_", 1:ncol(testX_clusters), sep="")
if(!is.null(colnames(train_X_clusters))){
stopifnot(identical(colnames(train_X_clusters), colnames(testX_clusters)))
clust_X_names <- colnames(train_X_clusters)
}
# Fit linear model on training data via OLS. lm(y ~ .) adds an intercept, so
# the fit is rank-deficient unless there are strictly more training
# observations than clusters (nrow >= ncol + 1); guard with <= so the
# n_train == n_clusters boundary errors clearly instead of silently fitting a
# rank-deficient model with NA coefficients.
if(nrow(train_X_clusters) <= ncol(train_X_clusters)){
err_mess <- paste("css not provided with enough indices to fit OLS model for predictions (number of training indices: ",
nrow(train_X_clusters), ", number of clusters: ",
ncol(train_X_clusters),
"). The OLS model includes an intercept, so it needs at least one more training observation than the number of clusters. Try reducing number of clusters by increasing cutoff, or re-run css with a larger number of training indices.",
sep="")
stop(err_mess)
}
df <- data.frame(y=y_train, train_X_clusters)
colnames(df)[2:ncol(df)] <- clust_X_names
model <- stats::lm(y ~., data=df)
# Use fitted model to generate predictions on testX. Route through
# olsPredictRankSafe() so a rank-deficient OLS fit predicts from its
# estimable coefficients instead of crashing under R >= 4.4's predict.lm
# default (see the helper's note and issue #117).
df_test <- data.frame(testX_clusters)
colnames(df_test) <- clust_X_names
predictions <- olsPredictRankSafe(model, df_test)
names(predictions) <- NULL
# Check output
stopifnot(is.numeric(predictions) | is.integer(predictions))
stopifnot(length(predictions) == n)
stopifnot(all(is.finite(predictions)))
return(predictions)
}Internal OLS-prediction helper used by getCssPreds() (issue #117):
#' Predict from a possibly rank-deficient OLS fit without crashing on R >= 4.4
#'
#' On R >= 4.4 `stats::predict.lm()`'s default `rankdeficient = "warnif"` runs
#' `qr.Q(qr.default(t(qr.R(model$qr))))`, which aborts with "NA/NaN/Inf in
#' foreign function call" when `qr.R(model$qr)` contains NaN/Inf (a rank-deficient
#' fit plus a LINPACK numerical pathology seen on real high-dimensional data).
#' `rankdeficient = "simple"` restores the pre-4.4 behavior: predict from the
#' estimable coefficients, yielding finite predictions. That argument only exists
#' on R >= 4.4, so it is added conditionally (a plain retry with the argument
#' would itself error on older R with "unused argument"). The `"simple"` path
#' emits the stock "rank-deficient fit" warning; muffle only that specific
#' warning (its advice to use `rankdeficient = "NA"` would NA-fill non-estimable
#' cases and break the finiteness contract in `getCssPreds()`) so unrelated
#' warnings still surface. On full-rank fits this is bit-identical to the old
#' default.
#' @noRd
olsPredictRankSafe <- function(model, newdata) {
pred_args <- list(object = model, newdata = newdata)
if (getRversion() >= "4.4.0") {
pred_args$rankdeficient <- "simple"
}
withCallingHandlers(
do.call(stats::predict.lm, pred_args),
warning = function(w) {
if (grepl("rank-deficient fit", conditionMessage(w), fixed = TRUE)) {
invokeRestart("muffleWarning")
}
}
)
}#' Helper function to confirm that inputs to the function getCssPreds are as
#' expected, and modify inputs if needed.
#'
#' @param css_results An object of class "cssr" (the output of the function
#' css).
#' @param testX A numeric matrix (preferably) or a data.frame (which will
#' be coerced internally to a matrix by the function model.matrix) containing
#' the data that will be used to generate predictions. Must contain the same
#' features (in the same number of columns) as the matrix provided to css.
#' @param weighting Character; determines how to calculate the weights to
#' combine features from the selected clusters into weighted averages, called
#' cluster representatives. Must be one of "sparse", "weighted_avg", or
#' "simple_avg". For "sparse", all the weight is put on the most frequently
#' selected individual cluster member (or divided equally among all the clusters
#' that are tied for the top selection proportion if there is a tie). For
#' "weighted_avg", the weight used for each cluster member is calculated in
#' proportion to the individual selection proportions of each feature. For
#' "simple_avg", each cluster member gets equal weight regardless of the
#' individual feature selection proportions (that is, the cluster representative
#' is just a simple average of all the cluster members). See Faletto and Bien
#' (2022) for details. Default is "weighted_avg".
#' @param cutoff Numeric; getCssPreds will make use only of those clusters with
#' selection proportions equal to at least cutoff. Must be between 0 and 1.
#' Default is 0 (in which case either all clusters are used, or max_num_clusts
#' are used, if max_num_clusts is specified).
#' @param min_num_clusts Integer or numeric; the minimum number of clusters to
#' use regardless of cutoff. (That is, if the chosen cutoff returns fewer than
#' min_num_clusts clusters, the cutoff will be increased until at least
#' min_num_clusts clusters are selected.) Default is 1.
#' @param max_num_clusts Integer or numeric; the maximum number of clusters to
#' use regardless of cutoff. (That is, if the chosen cutoff returns more than
#' max_num_clusts clusters, the cutoff will be decreased until at most
#' max_num_clusts clusters are selected.) Default is NA (in which case
#' max_num_clusts is ignored).
#' @param trainX A numeric matrix (preferably) or a data.frame (which will
#' be coerced internally to a matrix by the function model.matrix) containing
#' the data that will be used to estimate the linear model from the selected
#' clusters. trainX is only necessary to provide if no train_inds were
#' designated in the css function call to set aside observations for model
#' estimation (though even if train_inds was provided, trainX and trainY will be
#' used for model estimation if they are both provided to getCssPreds). Must
#' contain the same features (in the same number of columns) as the matrix
#' provided to css, and if the columns of trainX are labeled, the names must
#' match the variable names provided to css. Must not contain missing (`NA`)
#' values. Default is NA (in which case getCssPreds uses the observations from
#' the train_inds that were provided to css to estimate a linear model).
#' @param trainY The response corresponding to trainX. Must be a real-valued
#' response (unlike in the general css setup) because predictions will be
#' generated by an ordinary least squares model. Must have the same length as
#' the number of rows of trainX. Like trainX, only needs to be provided if no
#' observations were set aside for model estimation by the parameter train_inds
#' in the css function call. Default is NA (in which case getCssPreds uses the
#' observations from the train_inds that were provided to css).
#' @return A named list with the following elements: \item{trainXProvided}{
#' Logical; indicates whether a valid trainX input was provided.} \item{trainX}{
#' The provided trainX matrix, coerced from a data.frame to a matrix if the
#' provided trainX was a data.frame. (If a valid trainX was not provided, this
#' output simply passes whatever was provided as trainX.)} \item{testX}{The
#' provided testX matrix, coerced from a data.frame to a matrix if the provided
#' testX was a data.frame.} \item{feat_names}{A character vector containing the
#' column names of testX (if the provided testX had column names). If the
#' provided testX did not have column names, feat_names will be NA.}
#' \item{max_num_clusts}{The provided max_num_clusts, coerced to an integer if
#' needed, and coerced to be less than or equal to the total number of clusters
#' from the output of css_results.}
#' @author Gregory Faletto, Jacob Bien
#' @keywords internal
#' @noRd
checkGetCssPredsInputs <- function(css_results, testX, weighting, cutoff,
min_num_clusts, max_num_clusts, trainX, trainY){
# Check inputs
stopifnot(inherits(css_results, "cssr"))
check_results <- checkNewXProvided(trainX, css_results)
trainX <- check_results$newX
trainXProvided <- check_results$newXProvided
rm(check_results)
n_train <- nrow(trainX)
if(trainXProvided){
if(all(!is.na(trainY)) & length(trainY) > 1){
stopifnot(is.numeric(trainY))
stopifnot(n_train == length(trainY))
} else{
if(length(css_results$train_inds) == 0){
stop("css was not provided with indices to set aside for model training (train_inds), so must provide both trainX and trainY in order to generate predictions")
}
trainXProvided <- FALSE
warning("trainX provided but no trainY provided; instead, training model using the train_inds observations provided to css to set aside for model training.")
}
} else{
if(length(css_results$train_inds) == 0){
stop("css was not provided with indices to set aside for model training (train_inds), so must provide both trainX and trainY in order to generate predictions")
}
if(all(!is.na(trainY)) & length(trainY) > 1){
warning("trainY provided but no trainX provided; instead, training model using the train_inds observations provided to css to set aside for model training.")
}
}
results <- checkXInputResults(testX, css_results$X)
testX <- results$newx
feat_names <- results$feat_names
if(all(!is.na(feat_names))){
stopifnot(length(feat_names) == ncol(testX))
stopifnot(!("(Intercept)" %in% feat_names))
colnames(testX) <- feat_names
}
rm(results)
n <- nrow(testX)
p <- ncol(testX)
stopifnot(n >= 1)
stopifnot(p == ncol(trainX))
if(!is.null(colnames(trainX)) & is.null(colnames(testX))){
warning("Column names were provided for trainX but not for testX (are you sure they both contain identical features in the same order?)")
}
if(is.null(colnames(trainX)) & !is.null(colnames(testX))){
warning("Column names were provided for testX but not for trainX (are you sure they both contain identical features in the same order?)")
}
if(!is.null(colnames(trainX)) & !is.null(colnames(testX))){
stopifnot(all(colnames(trainX) == colnames(testX)))
}
checkCutoff(cutoff)
checkWeighting(weighting)
checkMinNumClusts(min_num_clusts, p, length(css_results$clusters))
max_num_clusts <- checkMaxNumClusts(max_num_clusts, min_num_clusts, p,
length(css_results$clusters))
return(list(trainXProvided=trainXProvided, trainX=trainX, testX=testX,
feat_names=feat_names, max_num_clusts=max_num_clusts))
}