Check the inputs to protolasso and clusterRepLasso, format clusters, and identify prototypes for each cluster

Usage

processClusterLassoInputs(X, y, clusters, nlambda)

Arguments

X: An n x p numeric matrix (preferably) or a data.frame (which will be coerced internally to a matrix by the function model.matrix) containing p >= 2 features/predictors.
y: The response; a length n numeric (or integer) real-valued vector.
clusters: A list of integer vectors; each vector should contain the indices of a cluster of features (a subset of 1:p). (If there is only one cluster, clusters can either be a list of length 1 or an integer vector.) All of the provided clusters must be non-overlapping. Every feature not appearing in any cluster will be assumed to be unclustered (that is, it will be treated as if it is in a "cluster" containing only itself). Default is list() (so no clusters are specified).
nlambda: Integer; the number of lambda values to use in the lasso fit for the protolasso. Default is 100 (following the default for glmnet). For now, nlambda must be at least 2 (using a single lambda is not supported).
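
As an illustration of the argument requirements above, the following base R sketch builds a set of inputs and checks the stated constraints by hand. The simulated data and the specific checks are assumptions made for illustration only; they are not the function's actual validation code.

```r
# Simulated inputs (hypothetical data, for illustration only)
set.seed(1)
n <- 20
p <- 6
X <- matrix(rnorm(n * p), nrow = n, ncol = p)
y <- rnorm(n)

# Two non-overlapping clusters; feature 6 is left unclustered
clusters <- list(c(1L, 2L, 3L), c(4L, 5L))
nlambda <- 100L

# Hand-written checks mirroring the documented requirements
all_idx <- unlist(clusters)
stopifnot(
  is.matrix(X), is.numeric(X), ncol(X) >= 2,  # n x p numeric matrix, p >= 2
  is.numeric(y), length(y) == nrow(X),        # length n numeric response
  all(all_idx %in% seq_len(p)),               # indices are a subset of 1:p
  !anyDuplicated(all_idx),                    # clusters do not overlap
  nlambda >= 2                                # a single lambda is unsupported
)
```

Inputs that pass these checks could then be supplied as processClusterLassoInputs(X, y, clusters, nlambda).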

Value

A list with four elements.

x: The provided X, converted to a matrix if it was provided as a data.frame, and with column names removed.
clusters: A named list where each entry is an integer vector of indices of features that are in a common cluster. (The length of list clusters is equal to the number of clusters.) All identified clusters are non-overlapping. All features appear in exactly one cluster (any unclustered features will be put in their own "cluster" of size 1).
prototypes: An integer vector whose length is equal to the number of clusters. Entry i is the index of the feature belonging to cluster i that is most highly correlated with y (that is, the prototype for the cluster, as in the protolasso; see Reid and Tibshirani 2016).
var_names: If the provided X matrix had column names, the names of the features in the provided X matrix. If no names were provided, var_names will be NA.
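
The clusters and prototypes elements described above can be sketched in base R as follows. This is a minimal illustration, not the package's implementation: the cluster names ("c1", "c2", ...) are hypothetical, and taking the largest absolute correlation with y is an assumption about how "most highly correlated" is measured.

```r
# Simulated inputs (hypothetical data, for illustration only)
set.seed(1)
n <- 50
p <- 5
X <- matrix(rnorm(n * p), nrow = n)
y <- X[, 2] + rnorm(n, sd = 0.1)  # response driven mainly by feature 2
clusters <- list(c(1L, 2L, 3L))   # features 4 and 5 are unclustered

# Place each unclustered feature in its own "cluster" of size 1
unclustered <- setdiff(seq_len(p), unlist(clusters))
all_clusters <- c(clusters, as.list(unclustered))
names(all_clusters) <- paste0("c", seq_along(all_clusters))  # hypothetical names

# Prototype of each cluster: the member with the largest absolute
# sample correlation with y (assumed criterion)
prototypes <- vapply(all_clusters, function(idx) {
  cors <- abs(cor(X[, idx, drop = FALSE], y))
  idx[which.max(cors)]
}, integer(1))
```

Here all_clusters covers every feature exactly once, and prototypes has one entry per cluster; a singleton cluster's prototype is its only member.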

References

Reid, S., & Tibshirani, R. (2016). Sparse regression and marginal testing using cluster prototypes. Biostatistics, 17(2), 364–376. https://doi.org/10.1093/biostatistics/kxv049.

Author

Gregory Faletto, Jacob Bien