Fit model and generate predictions from new data

Generate predictions on test data using a cluster stability-selected model.

Usage

getCssPreds(
  css_results,
  testX,
  weighting = "weighted_avg",
  cutoff = 0,
  min_num_clusts = 1,
  max_num_clusts = NA,
  trainX = NA,
  trainY = NA
)

Arguments

css_results: An object of class "cssr" (the output of the function css).
testX: A numeric matrix (preferably) or a data.frame (which will be coerced internally to a matrix by the function model.matrix) containing the data that will be used to generate predictions. Must contain the same features (in the same number of columns) as the matrix provided to css, and if the columns of testX are labeled, the names must match the variable names provided to css.
weighting: Character; determines how to calculate the weights to combine features from the selected clusters into weighted averages, called cluster representatives. Must be one of "sparse", "weighted_avg", or "simple_avg". For "sparse", all the weight is put on the most frequently selected individual cluster member (or divided equally among all the cluster members that are tied for the top selection proportion if there is a tie). For "weighted_avg", the weight used for each cluster member is calculated in proportion to the individual selection proportions of each feature. For "simple_avg", each cluster member gets equal weight regardless of the individual feature selection proportions (that is, the cluster representative is just a simple average of all the cluster members). See Faletto and Bien (2022) for details. Default is "weighted_avg".
cutoff: Numeric; getCssPreds will make use only of those clusters with selection proportions equal to at least cutoff. Must be between 0 and 1. Default is 0 (in which case either all clusters are used, or max_num_clusts are used, if max_num_clusts is specified).
min_num_clusts: Integer or numeric; the minimum number of clusters to use regardless of cutoff. (That is, if the chosen cutoff returns fewer than min_num_clusts clusters, the cutoff will be decreased until at least min_num_clusts clusters are selected.) Default is 1.
max_num_clusts: Integer or numeric; the maximum number of clusters to use regardless of cutoff. (That is, if the chosen cutoff returns more than max_num_clusts clusters, the cutoff will be increased until at most max_num_clusts clusters are selected.) Default is NA (in which case max_num_clusts is ignored).
trainX: A numeric matrix (preferably) or a data.frame (which will be coerced internally to a matrix by the function model.matrix) containing the data that will be used to estimate the linear model from the selected clusters. trainX need only be provided if no train_inds were designated in the css function call to set aside observations for model estimation (though even if train_inds was provided, trainX and trainY will be used for model estimation if they are both provided to getCssPreds). Must contain the same features (in the same number of columns) as the matrix provided to css, and if the columns of trainX are labeled, the names must match the variable names provided to css. Default is NA (in which case getCssPreds uses the observations from the train_inds that were provided to css to estimate a linear model).
trainY: The response corresponding to trainX. Must be a real-valued response (unlike in the general css setup) because predictions will be generated by an ordinary least squares model. Must have the same length as the number of rows of trainX. Like trainX, only needs to be provided if no observations were set aside for model estimation by the parameter train_inds in the css function call. Default is NA (in which case getCssPreds uses the observations from the train_inds that were provided to css).
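
The weighting, cutoff, min_num_clusts, and max_num_clusts arguments described above can be illustrated with a small base R sketch. This is not the package's internal code: the helper names cluster_weights and select_clusters are hypothetical, as are the equal-weight fallback when no cluster member was ever selected and the index-order handling of ties between clusters.

```r
# Hypothetical helper (not part of cssr): weights for the members of one
# cluster, computed from their individual selection proportions.
cluster_weights <- function(sel_props, weighting = "weighted_avg") {
  stopifnot(is.numeric(sel_props), length(sel_props) >= 1,
            all(sel_props >= 0))
  n <- length(sel_props)
  if (weighting == "sparse") {
    # All weight on the most frequently selected member; ties share equally.
    top <- sel_props == max(sel_props)
    as.numeric(top) / sum(top)
  } else if (weighting == "weighted_avg") {
    # Weights proportional to the individual selection proportions.
    # Assumption: equal weights if no member was ever selected.
    if (sum(sel_props) > 0) sel_props / sum(sel_props) else rep(1 / n, n)
  } else if (weighting == "simple_avg") {
    # Equal weight for every member.
    rep(1 / n, n)
  } else {
    stop('weighting must be "sparse", "weighted_avg", or "simple_avg"')
  }
}

# Hypothetical helper (not part of cssr): indices of the clusters to use,
# given cluster selection proportions and the three tuning arguments.
select_clusters <- function(clus_sel_props, cutoff = 0, min_num_clusts = 1,
                            max_num_clusts = NA) {
  stopifnot(cutoff >= 0, cutoff <= 1, min_num_clusts >= 1,
            is.na(max_num_clusts) || max_num_clusts >= min_num_clusts)
  ranked <- order(clus_sel_props, decreasing = TRUE)
  # Clusters whose selection proportion is at least cutoff.
  n_keep <- sum(clus_sel_props >= cutoff)
  # Too few: equivalent to lowering the cutoff until enough clusters pass.
  n_keep <- max(n_keep, min(min_num_clusts, length(clus_sel_props)))
  # Too many: equivalent to raising the cutoff until few enough pass.
  if (!is.na(max_num_clusts)) n_keep <- min(n_keep, max_num_clusts)
  ranked[seq_len(n_keep)]
}

cluster_weights(c(0.6, 0.6, 0.2), "sparse")      # 0.5 0.5 0.0
props <- c(A = 0.9, B = 0.5, C = 0.3, D = 0.1)
select_clusters(props, cutoff = 0.4)                      # clusters 1, 2
select_clusters(props, cutoff = 0.4, min_num_clusts = 3)  # clusters 1, 2, 3
select_clusters(props, max_num_clusts = 1)                # cluster 1
```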

Value

A vector of predictions corresponding to the observations from testX.
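
As a rough picture of how such a prediction vector arises, the base R sketch below builds cluster representatives as weighted averages of cluster members, fits ordinary least squares on the training representatives, and predicts on the test representatives. It is an illustration under stated assumptions, not the package's implementation: the simulated data, the selected clusters, their weights, and the helper make_reps are all hypothetical.

```r
set.seed(1)
n_train <- 50; n_test <- 10; p <- 5
trainX <- matrix(rnorm(n_train * p), n_train, p)
testX  <- matrix(rnorm(n_test * p), n_test, p)

# Hypothetical selected clusters (column indices) and member weights.
clusters <- list(c1 = c(1, 2), c2 = 4)
weights  <- list(c1 = c(0.75, 0.25), c2 = 1)

# Hypothetical helper: one column per cluster, each a weighted average
# of that cluster's member features.
make_reps <- function(X, clusters, weights) {
  reps <- matrix(
    vapply(seq_along(clusters), function(j) {
      drop(X[, clusters[[j]], drop = FALSE] %*% weights[[j]])
    }, numeric(nrow(X))),
    nrow = nrow(X)
  )
  colnames(reps) <- names(clusters)
  reps
}

train_reps <- make_reps(trainX, clusters, weights)
test_reps  <- make_reps(testX, clusters, weights)

# Simulated real-valued response that depends on the representatives.
trainY <- 1 + 2 * train_reps[, "c1"] - 3 * train_reps[, "c2"] +
  rnorm(n_train, sd = 0.1)

# Ordinary least squares on the training representatives, then predict.
fit <- lm(y ~ ., data = data.frame(y = trainY, train_reps))
preds <- unname(predict(fit, newdata = data.frame(test_reps)))

length(preds) == nrow(testX)  # one prediction per row of testX
```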

References

Faletto, G., & Bien, J. (2022). Cluster Stability Selection. arXiv preprint arXiv:2201.00494. https://arxiv.org/abs/2201.00494.

Author

Gregory Faletto, Jacob Bien