
Check the inputs to protolasso and clusterRepLasso, format clusters, and identify prototypes for each cluster

Usage

processClusterLassoInputs(X, y, clusters, nlambda)

Arguments

X

An n x p numeric matrix (preferably) or a data.frame (which will be coerced internally to a matrix by the function model.matrix) containing p >= 2 features/predictors.
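As a rough illustration of the data.frame coercion described above, the following sketch uses stats::model.matrix directly. The formula `~ .` and the removal of the intercept column are assumptions made for this example; the function's internal call may differ. The data frame `df` is made up.

```r
# Made-up data frame with two numeric columns and one factor
df <- data.frame(
  a = c(1.5, 2.0, 3.5),
  b = c(0, 1, 0),
  g = factor(c("u", "v", "u"))
)

# Assumption: use all columns as predictors, then drop the intercept column
M <- stats::model.matrix(~ ., data = df)
M <- M[, colnames(M) != "(Intercept)", drop = FALSE]

# Factors are expanded into dummy columns, so the number of columns of M
# need not equal the number of columns of df in general
dim(M)
colnames(M)
```

Because factor columns are expanded into indicator columns, cluster indices supplied alongside a data.frame may not line up with the columns of the coerced matrix; passing a numeric matrix avoids that ambiguity.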

y

The response: a length-n numeric (or integer) real-valued vector.

clusters

A list of integer vectors; each vector should contain the indices of a cluster of features (a subset of 1:p). (If there is only one cluster, clusters can either be a list of length 1 or an integer vector.) All of the provided clusters must be non-overlapping. Every feature not appearing in any cluster is assumed to be unclustered (that is, it is treated as if it were in a "cluster" containing only itself). Default is list() (so no clusters are specified).
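The handling of unclustered features can be sketched in base R as follows. This is a minimal illustration, not the package's implementation: `format_clusters_sketch` is a hypothetical helper, and the cluster names ("c1", "c2", ...) are an assumed naming scheme.

```r
# Hypothetical helper (not part of the package): add a singleton cluster
# for every feature in 1:p that does not appear in any supplied cluster.
format_clusters_sketch <- function(clusters, p) {
  # A bare integer vector is treated as a single cluster
  if (!is.list(clusters)) {
    clusters <- list(clusters)
  }
  clustered <- unlist(clusters)
  # Clusters must be non-overlapping subsets of 1:p
  stopifnot(all(clustered %in% seq_len(p)), !anyDuplicated(clustered))
  unclustered <- setdiff(seq_len(p), clustered)
  # Append each unclustered feature as its own cluster of size 1
  clusters <- c(clusters, as.list(unclustered))
  # Assumed naming scheme, for illustration only
  names(clusters) <- paste0("c", seq_along(clusters))
  clusters
}

# Two clusters supplied for p = 7 features; features 3 and 7 are unclustered
res <- format_clusters_sketch(list(c(1L, 2L), c(4L, 5L, 6L)), p = 7)
str(res)
```

In this example the result has four entries: the two supplied clusters followed by singleton clusters for features 3 and 7, so every feature in 1:7 appears exactly once.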

nlambda

Integer; the number of lambda values to use in the lasso fit for the protolasso. Default is 100 (following the default for glmnet). For now, nlambda must be at least 2 (using a single lambda is not supported).

Value

A list with four elements.

x

The provided X, converted to a matrix if it was provided as a data.frame, and with column names removed.

clusters

A named list where each entry is an integer vector of indices of features that are in a common cluster. (The length of list clusters is equal to the number of clusters.) All identified clusters are non-overlapping. All features appear in exactly one cluster (any unclustered features will be put in their own "cluster" of size 1).

prototypes

An integer vector whose length is equal to the number of clusters. Entry i is the index of the feature belonging to cluster i that is most highly correlated with y (that is, the prototype for the cluster, as in the protolasso; see Reid and Tibshirani 2016).
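Prototype selection can be sketched in base R as below. This is an illustration under stated assumptions rather than the package's code: `find_prototypes_sketch` is a hypothetical helper, the data are simulated, and "most highly correlated" is taken here to mean the largest absolute marginal correlation with y.

```r
# Hypothetical helper (not part of the package): pick one prototype per cluster
find_prototypes_sketch <- function(X, y, clusters) {
  vapply(clusters, function(idx) {
    # Assumption: compare absolute marginal correlations with y
    cors <- abs(stats::cor(X[, idx, drop = FALSE], y))
    # Return the feature index (not the position within the cluster)
    idx[which.max(cors)]
  }, integer(1), USE.NAMES = FALSE)
}

# Simulated data in which y is driven almost entirely by feature 2
set.seed(1)
n <- 50
X <- matrix(rnorm(n * 4), nrow = n)
y <- X[, 2] + 0.1 * rnorm(n)

# One cluster of two features plus two singleton clusters
clusters <- list(c1 = c(1L, 2L), c2 = 3L, c3 = 4L)
prototypes <- find_prototypes_sketch(X, y, clusters)
prototypes
```

Here feature 2 is chosen as the prototype of the first cluster, and each singleton cluster is represented by its only member.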

var_names

If the provided X matrix had column names, the names of the features in the provided X matrix. If no names were provided, var_names will be NA.

References

Reid, S., & Tibshirani, R. (2016). Sparse regression and marginal testing using cluster prototypes. Biostatistics, 17(2), 364–376. https://doi.org/10.1093/biostatistics/kxv049.

Author

Gregory Faletto, Jacob Bien