2 Overview
2.1 Background
2.1.1 Complementary pairs stability selection
The following paper introduces complementary pairs subsampling:
###"shah2013"###
#' Shah, R. D., & Samworth, R. J.
#' (2013). Variable selection with error control: Another look at stability
#' selection. \emph{Journal of the Royal Statistical Society. Series B:
#' Statistical Methodology}, 75(1), 55–80.
#' \url{https://doi.org/10.1111/j.1467-9868.2011.01034.x}.
Subsamples \(A_1, \ldots, A_B \subset [n]\) of size \(\lfloor n/2 \rfloor\) are drawn, as well as subsamples \(\overline{A}_b \subset [n]\) of the same size with \(A_b \cap \overline{A}_b = \emptyset\). Our function getSubsamps()
implements this subsampling.
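As a rough illustration of this subsampling (not the getSubsamps() implementation defined later in this document), the following sketch draws \(B\) complementary pairs of disjoint subsamples of size \(\lfloor n/2 \rfloor\); the function name drawComplementaryPairs() is made up for this example:
# Illustration only: draw B complementary pairs of disjoint subsamples of [n],
# each of size floor(n / 2). (The package's getSubsamps() is defined later.)
drawComplementaryPairs <- function(n, B) {
  half <- floor(n / 2)
  lapply(seq_len(B), function(b) {
    perm <- sample.int(n)  # random permutation of 1, ..., n
    list(A = sort(perm[seq_len(half)]),              # A_b
         A_bar = sort(perm[(half + 1):(2 * half)]))  # disjoint from A_b, same size
  })
}

pairs <- drawComplementaryPairs(n = 11, B = 3)
stopifnot(length(intersect(pairs[[1]]$A, pairs[[1]]$A_bar)) == 0)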
Writing \(\hat{S}^\lambda(A)\) for the selected features from applying a variable selection procedure with tuning parameter \(\lambda\) on the data in a subset \(A\subset[n]\), Shah and Samworth suggest computing the following:
\[ \hat{\Pi}_B^{\text{(SS)}}(j) = \frac{1}{2B} \sum_{b=1}^B \left[1\left\{j \in \hat{S}^\lambda \left( A_b \right)\right\} + 1\left\{j \in \hat{S}^\lambda (\overline{A}_b)\right\} \right]. \]
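In code, if the selection results are stored in a binary matrix with one row per subsample (\(2B\) rows in total) and one column per feature, then \(\hat{\Pi}_B^{\text{(SS)}}(j)\) is simply the mean of column \(j\). Here is a minimal sketch using a simulated placeholder matrix (not output from this package):
# Hypothetical (2B x p) 0/1 matrix: entry (i, j) indicates whether feature j was
# selected on the i-th subsample (rows pair up as A_b and its complement).
set.seed(1)
B <- 50; p <- 4
feat_sel_mat <- matrix(rbinom(2 * B * p, size = 1, prob = 0.3), nrow = 2 * B)

# \hat{\Pi}_B^{(SS)}(j) = (1 / (2B)) * (number of subsamples on which j was selected)
pi_hat <- colMeans(feat_sel_mat)
pi_hat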
2.1.2 Cluster stability selection
We compute the proportion of the time that a feature \(j \in [p]\) is selected for at least one \(\lambda \in \Lambda\): \[ \hat{\Pi}_B(j) := \frac{1}{2B} \sum_{b=1}^B \left[ 1\left\{ j \in \bigcup_{\lambda \in \Lambda} \hat{S}^\lambda \left( A_b \right) \right\} + 1\left\{ j \in \bigcup_{\lambda \in \Lambda} \hat{S}^\lambda (\overline{A}_b)\right\} \right]. \]
We also compute a similar quantity at the level of clusters: For every cluster \(C_k\), we calculate the proportion of the time that at least one feature from \(C_k\) is selected for at least one \(\lambda \in \Lambda\):
\[ \hat{\Theta}_B(C_k) := \frac{1}{2B} \sum_{b=1}^B \left[ 1 \left\{ C_k \cap \bigcup_{\lambda \in \Lambda} \hat{S}^{\lambda}\left(A_b \right) \neq \emptyset \right\} + 1 \left\{ C_k \cap \bigcup_{\lambda \in \Lambda} \hat{S}^{\lambda}\left(\overline{A}_b \right) \neq \emptyset \right\} \right]. \]
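In terms of the same kind of feature-level indicator matrix, \(\hat{\Theta}_B(C_k)\) is the proportion of rows (subsamples) on which at least one member of \(C_k\) was selected for some \(\lambda \in \Lambda\). Below is a minimal self-contained sketch; the indicator matrix and the cluster are simulated placeholders, not output from the package:
# Hypothetical (2B x p) 0/1 matrix: entry (i, j) indicates whether feature j was
# selected (for at least one lambda) on the i-th subsample.
set.seed(2)
B <- 50; p <- 5
feat_sel_mat <- matrix(rbinom(2 * B * p, size = 1, prob = 0.3), nrow = 2 * B)

C_k <- c(1, 3)  # illustrative cluster containing features 1 and 3

# Row i contributes a 1 if any member of C_k was selected on subsample i
any_member <- apply(feat_sel_mat[, C_k, drop = FALSE], 1, max)
theta_hat <- mean(any_member)  # \hat{\Theta}_B(C_k)
theta_hat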
We then calculate cluster representatives \(\boldsymbol{X}_{\cdot C_k}^{\text{rep}}\) as weighted averages of the cluster members:
\[ \boldsymbol{X}_{\cdot C_k}^{\text{rep}}:= \sum_{j \in C_k } w_{kj} \boldsymbol{X}_{\cdot j}. \] We allow for several choices of weights (a small sketch computing them follows this list):
- Weighted averaged cluster stability selection:
\[ w_{kj} = \frac{\hat{\Pi}_B(j)}{\sum_{j' \in C_k} \hat{\Pi}_B(j')} \qquad \forall j \in C_k . \]
- Simple averaged cluster stability selection:
\[ w_{kj} = \frac{1}{\left| C_k \right|} \qquad \forall j \in C_k . \]
- Sparse cluster stability selection: For each \(j \in C_k\),
\[ w_{kj} = \left. 1 \left\{ j \in \underset{j' \in C_k}{\arg \max} \left\{ \hat{\Pi}_B(j') \right\} \right\} \middle/ \left| \underset{j' \in C_k}{\arg \max} \left\{ \hat{\Pi}_B(j') \right\} \right| \right. . \]
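To make the three weighting schemes concrete, here is a small sketch that computes each set of weights from feature-level selection proportions and forms the representative \(\boldsymbol{X}_{\cdot C_k}^{\text{rep}}\); all object names here are illustrative placeholders rather than the package's API:
# Hypothetical inputs: a design matrix X, feature-level selection proportions
# pi_hat (one per feature), and a cluster C_k of column indices.
set.seed(3)
n <- 20; p <- 4
X <- matrix(rnorm(n * p), nrow = n)
pi_hat <- c(0.9, 0.1, 0.9, 0.4)
C_k <- c(1, 3, 4)

# Weighted averaged: weights proportional to selection proportions within C_k
w_weighted <- pi_hat[C_k] / sum(pi_hat[C_k])

# Simple averaged: equal weights over the cluster members
w_simple <- rep(1 / length(C_k), length(C_k))

# Sparse: all weight on the most-selected member(s), split evenly across ties
top <- which(pi_hat[C_k] == max(pi_hat[C_k]))
w_sparse <- replace(numeric(length(C_k)), top, 1 / length(top))

# Cluster representative: weighted average of the cluster members' columns
X_rep <- as.vector(X[, C_k, drop = FALSE] %*% w_weighted)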
2.2 Outline
We will define the functions in this package in steps.
- First we define the workhorse function of the package, css(). This function takes as inputs a data set (\(X\), \(y\)), a feature selection method \(\hat S^\lambda(\cdot)\), and clusters of features appearing in the data: \(C_1,\ldots,C_K\). It draws random subsamples of the data (\(A_1,\ldots, A_B\) and \(\bar A_1,\ldots, \bar A_B\)) and obtains the selected set of features on each subsample (i.e., \(\hat S^\lambda(A_b)\) and \(\hat S^\lambda(\bar A_b)\)). It returns a \(2B\times p\) matrix of indicator variables for whether each feature was selected on each subsample (one row for each subsample, one column for each feature). It also returns a similar \(2B\times K\) matrix of indicator variables for whether any feature from each cluster was selected on each subsample. This is the most computationally intensive step of cluster stability selection, so it is isolated in its own function that is designed to be run only once on a data set.
- Next we define functions that take these indicator matrices and return useful output, such as selected sets of features or predictions on test data. These outputs depend on the selected features, which themselves depend on user-selected parameters (for example, the cutoff on selection proportions above which clusters are selected); a small sketch of this thresholding step appears after this list.
- Next, we define wrapper functions that compute all of the above in one step in a user-friendly way. (These functions are not recommended for large data sets or for “power users” who may want to call them multiple times on the same data set, because each call to a wrapper makes a new call to css(), which is slow and computationally wasteful if the inputs to css() have not changed.)
- Finally, we define some other useful functions, such as functions that generate clustered data for simulations or testing, or that select a tuning parameter for the lasso via cross-validation.
2.3 Package setup
Before proceeding to the actual functionality of the package, we start by specifying the information needed in the DESCRIPTION file of the R package.
usethis::create_package(
  path = ".",
  fields = list(
    Package = params$package_name,
    Version = "0.1.1",
    Title = "Cluster Stability Selection",
    Description = "Implementation of Cluster Stability Selection (Faletto and Bien 2022).",
    `Authors@R` = c(
      person(
        given = "Gregory",
        family = "Faletto",
        email = "gregory.faletto@marshall.usc.edu",
        role = c("aut", "cre")
      ),
      person(
        given = "Jacob",
        family = "Bien",
        email = "jbien@usc.edu",
        role = c("aut")
      )
    )
  )
)
usethis::use_mit_license(copyright_holder = "F. Last")
We also define the package-level documentation that shows up when someone types package?cssr
in the console:
#' Cluster Stability Selection
#'
#' Cluster stability selection is a feature selection method designed to allow
#' stability selection to work effectively in the presence of highly correlated
#' features. It was proposed by Faletto and Bien (2022),
#' <https://arxiv.org/abs/2201.00494>. To learn
#' more about this package, please visit its website
#' <https://gregfaletto.github.io/cssr-project/>.
#'
#' @docType package
#' @seealso \code{\link{css}} \code{\link{cssSelect}}