Cluster Stability Selection
css.Rd
Executes cluster stability selection algorithm. Takes subsamples of data, executes feature selection algorithm on each subsample, and returns matrices of feature selection indicators as well as cluster selection indicators.
Arguments
- X
An n x p numeric matrix (preferably) or a data.frame (which will be coerced internally to a matrix by the function model.matrix) containing p >= 2 features/predictors.
- y
The response; can be anything that takes the form of an n-dimensional vector, with the ith entry corresponding to the ith row of X. Typically (and for default
fitfun = cssLasso
),y
should be an n-dimensional numeric vector.- lambda
A tuning parameter or set of tuning parameters that may be used by the feature selection method
fitfun
. In the default case whenfitfun = cssLasso
, lambda should be a numeric: the penalty to use for each lasso fit. (css()
does not require lambda to be any particular object because for a user-specified feature selection methodfitfun
, lambda can be an arbitrary object. See the description offitfun
below.)- clusters
A list of integer vectors; each vector should contain the indices of a cluster of features (a subset of
1:p
). (If there is only one cluster, clusters can either be a list of length 1 or an integer vector.) All of the provided clusters must be non-overlapping. Every feature not appearing in any cluster will be assumed to be unclustered (that is, they will be treated as if they are in a "cluster" containing only themselves). If clusters is a list of length 0 (or a list only containing clusters of length 1), thencss()
returns the same results as stability selection (so the returnedfeat_sel_mat
will be identical toclus_sel_mat
). Names for the clusters will be needed later; any clusters that are not given names in the provided list will be given names automatically bycss()
. #' CAUTION: if the provided X is a data.frame that contains a categorical feature with more than two levels, then the resulting matrix made from model.matrix will have a different number of columns than the provided data.frame, some of the feature numbers will change, and the clusters argument will not work properly (in the current version of the package). To get correct results in this case, please use model.matrix to convert the data.frame to a numeric matrix on your own, then provide this matrix and cluster assignments with respect to this matrix. Default islist()
(so no clusters are specified).- fitfun
A function; the feature selection function used on each subsample by cluster stability selection. This can be any feature selection method; the only requirement is that it accepts the arguments (and only the arguments)
X
,y
, andlambda
and returns an integer vector that is a subset of1:p
. For example,fitfun
could be best subset selection or forward stepwise selection or LARS andlambda
could be the desired model size; orfitfun
could be the elastic net andlambda
could be a length-two vector specifying lambda and alpha. Default iscssLasso
, an implementation of lasso (relying on the R packageglmnet
), wherelambda
must be a positive numeric specifying the L1 penalty for thelasso
.- sampling_type
A character vector; either "SS" or "MB". For "MB", all B subsamples are drawn randomly (as proposed by Meinshausen and Bühlmann 2010). For "SS", in addition to these B subsamples, the B complementary pair subsamples will be drawn as well (see Faletto and Bien 2022 or Shah and Samworth 2013 for details). Default is "SS", and "MB" is not supported yet.
- B
Integer or numeric; the number of subsamples. Note: For
sampling_type=="MB"
the total number of subsamples will beB
; forsampling_type="SS"
the number of subsamples will be2*B
. Default is 100 forsampling_type="MB"
and 50 forsampling_type="SS"
.- prop_feats_remove
Numeric; if
prop_feats_remove
is greater than 0, then on each subsample, each feature is randomly dropped from the design matrix that is provided tofitfun
with probabilityprop_feats_remove
(independently across features). That is, in a typical subsample,prop_feats_remove*p
features will be dropped (though this number will vary). This is similar in spirit (but distinct from) extended stability selection (Beinrucker et. al. 2016); see their paper for some of the benefits of dropping features (besides increasing computational speed and decreasing memory requirements). Forsampling_type="SS"
, the features dropped in each complementary pair of subsamples are identical in order to ensure that the theoretical guarantees of Faletto and Bien (2022) are retained within each individual pair of subsamples. (Note that this feature is not investigated either theoretically or in simulations by Faletto and Bien 2022). Must be between 0 and 1. Default is 0.- train_inds
Optional; an integer or numeric vector containing the indices of observations in
X
andy
to set aside for model training by the functiongetCssPreds()
after feature selection. (This will only work ify
is real-valued, becausegetCssPreds()
using ordinary least squares regression to generate predictions.) Iftrain_inds
is not provided, all of the observations in the provided data set will be used for feature selection.- num_cores
Optional; an integer. If using parallel processing, the number of cores to use for parallel processing (
num_cores
will be supplied internally as themc.cores
argument ofparallel::mclapply()
).
Value
A list containing the following items:
feat_sel_mat
A
B
(or2*B
forsampling_type="SS"
) xp
numeric (binary) matrix.feat_sel_mat[i, j] = 1
if featurej
was selected by the base feature selection method on subsamplei
, and 0 otherwise.clus_sel_mat
A
B
(or2*B
for SS sampling) xlength(clusters)
numeric (binary) matrix.clus_sel_mat[i, j] = 1
if at least one feature from cluster j was selected by the base feature selection method on subsamplei
, and 0 otherwise.X
The
X
matrix provided tocss()
, coerced from a data.frame to a matrix if needed.y
The
y
vector provided tocss()
.clusters
A named list of integer vectors containing all of the clusters provided to
css()
, as well as size 1 clusters of any features not listed in any of the clusters provided tocss()
. All clusters will have names; any clusters not provided with a name in the input tocss()
will be given names automatically bycss()
(of the form c1, etc.).train_inds
Identical to the
train_inds
provided tocss()
.
References
Faletto, G., & Bien, J. (2022). Cluster Stability Selection. arXiv preprint arXiv:2201.00494. https://arxiv.org/abs/2201.00494.
Shah, R. D., & Samworth, R. J. (2013). Variable selection with error control: Another look at stability selection. Journal of the Royal Statistical Society. Series B: Statistical Methodology, 75(1), 55–80. https://doi.org/10.1109/RITA.2014.2302071.
Meinshausen, N., & Bühlmann, P. (2010). Stability Selection. Journal of the Royal Statistical Society. Series B: Statistical Methodology, 72(4), 417–473. https://rss.onlinelibrary.wiley.com/doi/full/10.1111/j.1467-9868.2010.00740.x.
Beinrucker, A., Dogan, Ü., & Blanchard, G. (2016). Extensions of stability selection using subsamples of observations and covariates. Statistics and Computing, 26(5), 1059- 1077. https://doi.org/10.1007/s11222-015-9589-y.
Jerome Friedman, Trevor Hastie, Robert Tibshirani (2010). Regularization Paths for Generalized Linear Models via Coordinate Descent. Journal of Statistical Software, 33(1), 1-22. URL https://www.jstatsoft.org/v33/i01/.