
This function uses the lasso with cross-validation to estimate the model size. Before the lasso is fit, all features in each cluster are dropped from X except the one feature with the highest marginal correlation with y, as in the protolasso (Reid and Tibshirani 2016).
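The prototype step described above can be sketched in base R as follows. This is only an illustration, not the package's internal code: the helper name select_prototypes and the simulated data are hypothetical, and the sketch assumes that "highest marginal correlation" means highest absolute correlation.

```r
set.seed(42)
n <- 200
z <- rnorm(n)                       # latent signal shared by the first cluster

X <- cbind(
  z + rnorm(n, sd = 0.1),           # feature 1: strong proxy for z
  z + rnorm(n, sd = 1.0),           # feature 2: noisy proxy for z
  rnorm(n),                         # feature 3: unclustered noise
  rnorm(n)                          # feature 4: unclustered noise
)
y <- z + rnorm(n, sd = 0.1)

# Same shape as the clusters argument: a named list of integer index vectors.
clusters <- list(c1 = c(1L, 2L), c2 = 3L, c3 = 4L)

# Hypothetical helper: for each cluster, return the index of the feature
# with the largest absolute marginal correlation with y (an assumption).
select_prototypes <- function(X, y, clusters) {
  vapply(clusters, function(idx) {
    cors <- abs(cor(X[, idx, drop = FALSE], y))
    idx[which.max(cors)]
  }, integer(1))
}

prototypes <- select_prototypes(X, y, clusters)

# Keep one column per cluster; the lasso would then be fit on this matrix.
X_reduced <- X[, prototypes, drop = FALSE]
```

In this sketch X_reduced has one column per cluster, so the lasso sees three candidate features rather than four.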

Usage

getModelSize(X, y, clusters)

Arguments

X

An n x p numeric matrix (preferably) or a data.frame (which will be coerced internally to a matrix by the function model.matrix) containing the p >= 2 features/predictors.

y

A length-n numeric vector containing the responses; y[i] is the response corresponding to observation X[i, ]. (Note that for the css function, y does not have to be a numeric response, but for this function, the underlying selection procedure is the lasso, so y must be a real-valued response.)

clusters

A named list where each entry is an integer vector of indices of features that are in a common cluster, as in the output of css. (The length of the list clusters is equal to the number of clusters.) All identified clusters must be non-overlapping, and every feature must appear in exactly one cluster (any unclustered feature should be in its own "cluster" of size 1). CAUTION: if the provided X is a data.frame that contains a categorical feature with more than two levels, then the matrix produced by model.matrix will have a different number of columns than the provided data.frame, some of the feature numbers will change, and the clusters argument will not work properly (in the current version of the package). To get correct results in this case, use model.matrix to convert the data.frame to a numeric matrix yourself, then provide that matrix together with cluster assignments defined with respect to its columns.
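The column shift described in the caution can be seen with a small base R example. The data frame and the formula ~ . are assumptions made for illustration; the exact model.matrix call used inside the package may differ.

```r
# Toy data.frame with one three-level categorical feature (hypothetical data).
df <- data.frame(
  height = c(1.2, 3.4, 2.2, 5.1, 4.0, 2.7),
  color  = factor(c("red", "green", "blue", "red", "green", "blue")),
  weight = c(10, 12, 9, 15, 11, 13)
)

# Convert to a numeric matrix ourselves and drop the intercept column.
X_mat <- model.matrix(~ ., data = df)[, -1]

ncol(df)         # 3 columns in the data.frame
ncol(X_mat)      # 4 columns in the matrix: color became two dummy columns
colnames(X_mat)  # "height" "colorgreen" "colorred" "weight"

# weight was column 3 of df but is column 4 of X_mat, so clusters must be
# written against X_mat. Grouping the two dummy columns into one cluster is
# a modeling choice made here for illustration only.
clusters <- list(height = 1L, color = c(2L, 3L), weight = 4L)
```

Passing X_mat and this clusters list, rather than df and indices based on df, keeps the feature numbers consistent with the matrix the lasso actually sees.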

Value

An integer; the estimated size of the model. The minimum returned value is 1, even if the lasso with cross-validation chose no features.
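The floor of 1 on the returned value can be illustrated with the sketch below. The helper floor_model_size is hypothetical, and the use of glmnet::cv.glmnet with the lambda.min rule is an assumption; this page does not state which cross-validation rule the package applies. The glmnet part runs only if that package is installed.

```r
# Hypothetical helper: convert a count of nonzero lasso coefficients
# into a model size that is never smaller than 1.
floor_model_size <- function(n_nonzero) {
  as.integer(max(1L, n_nonzero))
}

floor_model_size(0)  # lasso selected nothing, size is still 1
floor_model_size(4)  # otherwise the count is returned unchanged

if (requireNamespace("glmnet", quietly = TRUE)) {
  set.seed(1)
  n <- 100
  X <- matrix(rnorm(n * 5), n, 5)
  y <- rnorm(n)  # pure noise, so the cross-validated lasso may select nothing

  cvfit <- glmnet::cv.glmnet(X, y)

  # cvfit$nzero holds the number of nonzero coefficients at each lambda;
  # pick the entry for lambda.min (an assumed choice of rule).
  n_nonzero <- cvfit$nzero[which(cvfit$lambda == cvfit$lambda.min)]
  size <- floor_model_size(n_nonzero)
}
```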

References

Reid, S., & Tibshirani, R. (2016). Sparse regression and marginal testing using cluster prototypes. Biostatistics, 17(2), 364–376. https://doi.org/10.1093/biostatistics/kxv049

Author

Gregory Faletto, Jacob Bien