Cluster Stability Selection
css.Rd
Executes the cluster stability selection algorithm. Takes subsamples of the data, executes a feature selection algorithm on each subsample, and returns matrices of feature selection indicators as well as cluster selection indicators.
Arguments
- X
An n x p numeric matrix (preferably) or a data.frame (which will be coerced internally to a matrix by the function model.matrix) containing p >= 2 features/predictors.
- y
The response; can be anything that takes the form of an n-dimensional vector, with the ith entry corresponding to the ith row of X. Typically (and for the default fitfun = cssLasso), y should be an n-dimensional numeric vector.
- lambda
A tuning parameter or set of tuning parameters that may be used by the feature selection method fitfun. In the default case when fitfun = cssLasso, lambda should be a numeric: the penalty to use for each lasso fit. (css() does not require lambda to be any particular object, because for a user-specified feature selection method fitfun, lambda can be an arbitrary object. See the description of fitfun below.)
- clusters
A list of integer vectors; each vector should contain the indices of a cluster of features (a subset of 1:p). (If there is only one cluster, clusters can either be a list of length 1 or an integer vector.) All of the provided clusters must be non-overlapping. Every feature not appearing in any cluster will be assumed to be unclustered (that is, it will be treated as if it is in a "cluster" containing only itself). If clusters is a list of length 0 (or a list containing only clusters of length 1), then css() returns the same results as stability selection (so the returned feat_sel_mat will be identical to clus_sel_mat). Names for the clusters will be needed later; any clusters that are not given names in the provided list will be given names automatically by css(). CAUTION: if the provided X is a data.frame that contains a categorical feature with more than two levels, then the resulting matrix made from model.matrix will have a different number of columns than the provided data.frame, some of the feature numbers will change, and the clusters argument will not work properly (in the current version of the package). To get correct results in this case, please use model.matrix to convert the data.frame to a numeric matrix yourself, then provide this matrix and cluster assignments with respect to this matrix. Default is list() (so no clusters are specified).
- fitfun
A function; the feature selection function used on each subsample by cluster stability selection. This can be any feature selection method; the only requirement is that it accepts the arguments (and only the arguments) X, y, and lambda and returns an integer vector that is a subset of 1:p. For example, fitfun could be best subset selection, forward stepwise selection, or LARS, and lambda could be the desired model size; or fitfun could be the elastic net, and lambda could be a length-two vector specifying lambda and alpha. Default is cssLasso, an implementation of the lasso (relying on the R package glmnet), where lambda must be a positive numeric specifying the L1 penalty for the lasso. (A minimal sketch of a user-supplied fitfun appears after this argument list.)
- sampling_type
A character vector; either "SS" or "MB". For "MB", all B subsamples are drawn randomly (as proposed by Meinshausen and Bühlmann 2010). For "SS", in addition to these B subsamples, the B complementary pair subsamples will be drawn as well (see Faletto and Bien 2022 or Shah and Samworth 2013 for details). Default is "SS", and "MB" is not supported yet.
- B
Integer or numeric; the number of subsamples. Note: for sampling_type == "MB" the total number of subsamples will be B; for sampling_type == "SS" the number of subsamples will be 2*B. Default is 100 for sampling_type == "MB" and 50 for sampling_type == "SS".
- prop_feats_remove
Numeric; if prop_feats_remove is greater than 0, then on each subsample, each feature is randomly dropped from the design matrix that is provided to fitfun with probability prop_feats_remove (independently across features). That is, in a typical subsample, prop_feats_remove*p features will be dropped (though this number will vary). This is similar in spirit to (but distinct from) extended stability selection (Beinrucker et al. 2016); see their paper for some of the benefits of dropping features (besides increasing computational speed and decreasing memory requirements). For sampling_type == "SS", the features dropped in each complementary pair of subsamples are identical, in order to ensure that the theoretical guarantees of Faletto and Bien (2022) are retained within each individual pair of subsamples. (Note that this feature is not investigated either theoretically or in simulations by Faletto and Bien 2022.) Must be between 0 and 1. Default is 0.
- train_inds
Optional; an integer or numeric vector containing the indices of observations in X and y to set aside for model training by the function getCssPreds() after feature selection. (This will only work if y is real-valued, because getCssPreds() uses ordinary least squares regression to generate predictions.) If train_inds is not provided, all of the observations in the provided data set will be used for feature selection.
- num_cores
Optional; an integer. If using parallel processing, the number of cores to use (num_cores will be supplied internally as the mc.cores argument of parallel::mclapply()).
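As an illustration of the fitfun contract described above, here is a minimal sketch of a user-supplied feature selection function based on marginal correlation screening. This is not part of the package: the function name screenCorFitfun and the interpretation of lambda as a model size are illustrative assumptions.

```r
# Illustrative fitfun sketch: marginal correlation screening.
# Here lambda is interpreted as the number of features to select.
screenCorFitfun <- function(X, y, lambda) {
  # Absolute marginal correlation of each feature with the response
  cors <- abs(cor(X, y))
  # Indices of the lambda most correlated features: an integer vector
  # that is a subset of 1:p, as css() requires
  as.integer(order(cors, decreasing = TRUE)[seq_len(lambda)])
}
```

Under these assumptions, the function could be supplied as fitfun = screenCorFitfun with, for example, lambda = 5.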
Value
A list containing the following items:
- feat_sel_mat
A B (or 2*B for sampling_type == "SS") x p numeric (binary) matrix. feat_sel_mat[i, j] = 1 if feature j was selected by the base feature selection method on subsample i, and 0 otherwise.
- clus_sel_mat
A B (or 2*B for sampling_type == "SS") x length(clusters) numeric (binary) matrix. clus_sel_mat[i, j] = 1 if at least one feature from cluster j was selected by the base feature selection method on subsample i, and 0 otherwise.
- X
The X matrix provided to css(), coerced from a data.frame to a matrix if needed.
- y
The y vector provided to css().
- clusters
A named list of integer vectors containing all of the clusters provided to css(), as well as size-1 clusters of any features not listed in any of the clusters provided to css(). All clusters will have names; any clusters not provided with a name in the input to css() will be given names automatically by css() (of the form c1, etc.).
- train_inds
Identical to the train_inds provided to css().
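Examples
The following usage sketch is illustrative rather than taken from the package documentation: it simulates data in which two features are highly correlated proxies for the same signal, declares them as a cluster, and runs css() with the default fitfun = cssLasso. The data-generating settings and the value of lambda are assumptions made for this example.

```r
# Usage sketch (illustrative settings)
set.seed(1)
n <- 50
p <- 10
X <- matrix(rnorm(n * p), nrow = n, ncol = p)
X[, 2] <- X[, 1] + 0.1 * rnorm(n)   # features 1 and 2 form a natural cluster
y <- X[, 1] + rnorm(n)

res <- css(X = X, y = y, lambda = 0.1,
           clusters = list(sig_clust = 1:2),
           B = 50)

# Each row of feat_sel_mat / clus_sel_mat corresponds to one subsample;
# column means give selection proportions.
colMeans(res$feat_sel_mat)  # per-feature selection proportions
colMeans(res$clus_sel_mat)  # per-cluster selection proportions (a cluster is
                            # selected if any of its features was selected)
```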
References
Faletto, G., & Bien, J. (2022). Cluster Stability Selection. arXiv preprint arXiv:2201.00494. https://arxiv.org/abs/2201.00494.
Shah, R. D., & Samworth, R. J. (2013). Variable selection with error control: Another look at stability selection. Journal of the Royal Statistical Society. Series B: Statistical Methodology, 75(1), 55–80. https://doi.org/10.1111/j.1467-9868.2011.01034.x.
Meinshausen, N., & Bühlmann, P. (2010). Stability Selection. Journal of the Royal Statistical Society. Series B: Statistical Methodology, 72(4), 417–473. https://rss.onlinelibrary.wiley.com/doi/full/10.1111/j.1467-9868.2010.00740.x.
Beinrucker, A., Dogan, Ü., & Blanchard, G. (2016). Extensions of stability selection using subsamples of observations and covariates. Statistics and Computing, 26(5), 1059-1077. https://doi.org/10.1007/s11222-015-9589-y.
Jerome Friedman, Trevor Hastie, Robert Tibshirani (2010). Regularization Paths for Generalized Linear Models via Coordinate Descent. Journal of Statistical Software, 33(1), 1-22. URL https://www.jstatsoft.org/v33/i01/.