When the LASSO fails???

By Gabriel Vasconcelos


The LASSO has two important uses: the first is forecasting and the second is variable selection. We are going to talk about the second. The variable-selection objective is to recover the correct set of variables that generate the data, or at least the best approximation given the candidate variables. The LASSO has attracted a lot of attention lately because it allows us to estimate a linear regression with thousands of variables and have the model select the right ones for us. However, what many people ignore is when the LASSO fails.

Like any model, the LASSO relies on assumptions in order to work. The first is sparsity, i.e. only a small number of the candidate variables are actually relevant. If this assumption does not hold, there is no hope of using the LASSO for variable selection. Another assumption is that the irrepresentable condition must hold. This condition may look very technical, but it only says that the relevant variables may not be too correlated with the irrelevant ones.

Suppose your candidate variables are represented by the matrix X, where each column is a variable and each row is an observation. We can calculate the covariance matrix \Sigma=n^{-1} X'X, which is a symmetric matrix. This matrix may be broken into four pieces:

\displaystyle \Sigma=\left( \begin{array}{cc} C_{1,1} & C_{1,2} \\ C_{2,1} & C_{2,2} \end{array} \right)

The first piece, C_{1,1}, contains the covariances among the relevant variables only, C_{2,2} is the covariance matrix of the irrelevant variables, and C_{1,2} and C_{2,1} contain the covariances between relevant and irrelevant variables. With that in mind, the irrepresentable condition is:

\displaystyle |C_{2,1}C_{1,1}^{-1}sign(\beta)|<1

The inequality above must hold element-wise. Here \beta is the vector of true coefficients of the relevant variables: sign(\beta) is 1 for positive values of \beta, -1 for negative values and 0 if \beta=0.
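
To make the check concrete before the full example below, here is a minimal sketch with a small hypothetical covariance matrix (the numbers are made up purely for illustration): two relevant variables and one irrelevant variable that is correlated with both of them.

# = Toy illustration (hypothetical numbers): 2 relevant variables, 1 irrelevant = #
Sigma.toy=matrix(c(1,0.3,0.8, 0.3,1,0.8, 0.8,0.8,1),nrow=3) # = assumed population covariance = #
beta.toy=c(1,1) # = true coefficients of the two relevant variables = #
C11=Sigma.toy[1:2,1:2] # = relevant block = #
C21=Sigma.toy[3,1:2,drop=FALSE] # = irrelevant x relevant block = #
abs(C21%*%solve(C11)%*%sign(beta.toy)) # = about 1.23 > 1: the condition fails = #
abs((C21/2)%*%solve(C11)%*%sign(beta.toy)) # = halving the cross-covariances gives about 0.62 < 1: it holds = #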

Example

For this example we are going to generate two covariance matrices, one that satisfies the irrepresentable condition and one that violates it. Our design will be very simple: only 10 candidate variables, of which five are relevant.

library(mvtnorm)
library(corrplot)
library(glmnet)
library(clusterGeneration)
k=10 # = Number of Candidate Variables
p=5 # = Number of Relevant Variables
N=500 # = Number of observations
betas=(-1)^(1:p) # = Values for beta
set.seed(12345) # = Seed for replication
sigma1=genPositiveDefMat(k,"unifcorrmat")$Sigma # = Sigma1 violates the irc
sigma2=sigma1 # = Sigma2 satisfies the irc
sigma2[(p+1):k,1:p]=0
sigma2[1:p,(p+1):k]=0

# = Verify the irrepresentable condition
irc1=sort(abs(sigma1[(p+1):k,1:p]%*%solve(sigma1[1:p,1:p])%*%sign(betas)))
irc2=sort(abs(sigma2[(p+1):k,1:p]%*%solve(sigma2[1:p,1:p])%*%sign(betas)))
c(max(irc1),max(irc2))
## [1] 3.222599 0.000000
# = Have a look at the correlation matrices
par(mfrow=c(1,2))
corrplot(cov2cor(sigma1))
corrplot(cov2cor(sigma2))

[Figure: correlation matrices of sigma1 (left, violates the IRC) and sigma2 (right, satisfies the IRC)]

As you can see, the first design (irc1) violates the irrepresentable condition and the second (irc2) does not. The correlation matrix that satisfies the irrepresentable condition is block diagonal: the relevant variables have no correlation with the irrelevant ones. This is an extreme case; you may have some correlation between the two blocks and still satisfy the condition.

Now let us check how the LASSO works for both covariance matrices. First we need to understand what the regularization path is. The LASSO objective function penalizes the size of the coefficients, and this penalization is controlled by a hyper-parameter \lambda. We can find the exact \lambda_0 that is just big enough to shrink all coefficients to zero; for any value smaller than \lambda_0 some variable will be included. As we decrease \lambda, more variables are included until we have a model with all variables (or the biggest model the LASSO can identify when we have more variables than observations). This path between the model with no variables and the model with all variables is the regularization path. The code below generates data from multivariate normal distributions for the covariance matrix that violates the irrepresentable condition and for the one that satisfies it. Then I estimate the regularization path for both cases and summarize the information in plots.

X1=rmvnorm(N,sigma = sigma1) # = Variables for the design that violates the IRC = #
X2=rmvnorm(N,sigma = sigma2) # = Variables for the design that satisfies the IRC = #
e=rnorm(N) # = Error = #
y1=X1[,1:p]%*%betas+e # = Generate y for design 1 = #
y2=X2[,1:p]%*%betas+e # = Generate y for design 2 = #

lasso1=glmnet(X1,y1,nlambda = 100) # = Estimation for design 1 = #
lasso2=glmnet(X2,y2,nlambda = 100) # = Estimation for design 2 = #
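
Before looking at the paths, we can also do a rough numerical check of the \lambda_0 described above for the first design. This is only a sketch: it assumes glmnet's internal standardization (columns scaled by their 1/N standard deviations, response centered), so treat the comparison as approximate.

# = Approximate check of lambda_0: the smallest lambda that zeros out every coefficient = #
Xs=scale(X1)*sqrt(N/(N-1)) # = rescale columns to 1/N standard deviations, mimicking glmnet (assumption) = #
lambda0=max(abs(crossprod(Xs,y1-mean(y1))))/N # = max_j |x_j'(y - mean(y))|/N = #
c(lambda0,max(lasso1$lambda)) # = the two numbers should be roughly the same = #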

## == Regularization path == ##
par(mfrow=c(1,2))
l1=log(lasso1$lambda)
matplot(as.matrix(l1),t(coef(lasso1)[-1,]),type="l",lty=1,col=c(rep(1,9),2),ylab="coef",xlab="log(lambda)",main="Violates IRC")
l2=log(lasso2$lambda)
matplot(as.matrix(l2),t(coef(lasso2)[-1,]),type="l",lty=1,col=c(rep(1,9),2),ylab="coef",xlab="log(lambda)",main="Satisfies IRC")

[Figure: regularization paths, coefficients vs. log(lambda); left panel "Violates IRC", right panel "Satisfies IRC"]

The plot on the left shows the results when the irrepresentable condition is violated and the plot on the right the case when it is satisfied. The five black lines that slowly converge to zero are the five relevant variables and the red line is an irrelevant variable. As you can see, when the IRC is satisfied the irrelevant variable (red line) is shrunk to zero very fast as we increase lambda and plays no role in most of the path. However, when the IRC is violated one irrelevant variable starts with a very small coefficient that slowly increases before going to zero only at the very end of the path. This variable is selected through the entire path, and it is virtually impossible to recover the correct set of variables in this case unless you apply a different penalty to each variable.

This is precisely what the adaptive LASSO does. Does that mean that the adaLASSO is free from the irrepresentable condition??? The answer is: partially. The adaptive LASSO requires a less restrictive condition called the weighted irrepresentable condition, which is much easier to satisfy. The two plots below show the regularization path for the LASSO and the adaLASSO in the case where the IRC is violated. As you can see, the adaLASSO selects the correct set of variables along the entire path.

lasso1.1=cv.glmnet(X1,y1) # = First step: cross-validated LASSO = #
w.=(abs(coef(lasso1.1)[-1])+1/N)^(-1) # = adaLASSO weights: inverse of the absolute first-step coefficients (the 1/N avoids division by zero) = #
adalasso1=glmnet(X1,y1,penalty.factor = w.) # = Second step: LASSO with variable-specific penalty weights = #

par(mfrow=c(1,2))
l1=log(lasso1$lambda)
matplot(as.matrix(l1),t(coef(lasso1)[-1,]),type="l",lty=1,col=c(rep(1,9),2),ylab="coef",xlab="log(lambda)",main="LASSO")
l2=log(adalasso1$lambda)
matplot(as.matrix(l2),t(coef(adalasso1)[-1,]),type="l",lty=1,col=c(rep(1,9),2),ylab="coef",xlab="log(lambda)",main="adaLASSO")

[Figure: regularization paths when the IRC is violated; left panel "LASSO", right panel "adaLASSO"]
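
As a complement to the plots, here is a small numeric check (a sketch using the objects estimated above): for each of the five irrelevant variables, it counts at how many values of lambda the coefficient is nonzero in each path.

# = Number of lambdas with a nonzero coefficient, per irrelevant variable = #
rowSums(coef(lasso1)[-1,][(p+1):k,]!=0) # = LASSO path (IRC violated) = #
rowSums(coef(adalasso1)[-1,][(p+1):k,]!=0) # = adaLASSO path = #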

The biggest problem is that the irrepresentable condition and its less restrictive weighted version are not testable in the real world, because we would need the population covariance matrix and the true betas that generate the data. The solution is to study your data as much as possible to at least have an idea of the situation.
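
Still, one informal diagnostic you can compute is a sample version of the condition, replacing the unknown quantities with estimates: take the variables picked by cross-validated LASSO as a provisional relevant set, use the signs of their estimated coefficients in place of sign(\beta), and use the sample covariance matrix in place of \Sigma. The sketch below (reusing the lasso1.1 fit from above) is only a heuristic, not a formal test.

# = Informal diagnostic: sample IRC with estimated relevant set, signs and covariance = #
bhat=coef(lasso1.1,s="lambda.min")[-1] # = first-step coefficients from cv.glmnet = #
sel=which(bhat!=0) # = provisional "relevant" set = #
irr=setdiff(1:ncol(X1),sel) # = everything else treated as irrelevant = #
S=cov(X1) # = sample covariance in place of the population Sigma = #
sort(abs(S[irr,sel,drop=FALSE]%*%solve(S[sel,sel])%*%sign(bhat[sel]))) # = values near or above 1 are a warning sign = #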

Some articles on the topic

Zhao, Peng, and Bin Yu. “On model selection consistency of Lasso.” Journal of Machine Learning Research 7 (2006): 2541-2563.

Meinshausen, Nicolai, and Bin Yu. “Lasso-type recovery of sparse representations for high-dimensional data.” The Annals of Statistics 37.1 (2009): 246-270.


Responses to When the LASSO fails???

  1. Amazing, I finally get the often-forgotten problems of the LASSO.




  2. juan says:

    Hi.
    Does elastic-net suffer from the same problem?
    What other alternatives do we have?


    • insightr says:

      I don’t know the exact condition for the elastic net. But its penalization is less restrictive than the LASSO’s, in the sense that it normally selects more variables. It probably has a similar problem. The adaptive versions are the best you can get as far as I know.


      • ag says:

        The elastic net is a linear combination of ridge and LASSO. Unless your elastic net says that straight LASSO is best, it will contain some ridge and will not remove coefficients. I can expand on this if you like.


      • insightr says:

        Yes. But it also needs a condition called elastic irrepresentable condition to have model selection consistency. It is more complicated because you also need to know the values of lambda for both the l1 and the l2 norm.


  3. aginensky says:

    What do you mean by |C_{2,1}C_{1,1}^{-1} sign(\beta)| < 1? In particular, what does “|” mean? Is it a norm? The C are matrices and can be multiplied, but do you want to take their determinant and then the ordinary absolute value, or is this a norm like the Frobenius norm?



  4. Hi, thanks for this post. Maybe I overlooked it, but in the formulas beta is used before it is defined, while in the code it is set as betas=(-1)^(1:p) # = Values for beta, and then below there is the criterion. What does beta stand for? Thank you!


    • insightr says:

      These betas defined in the code are the real betas used to generate the data. The irrepresentable condition is not testable because you need to know these betas. Since in this case I generate the data, I know exactly the true betas but this is not true if you are working with real data.


  5. Emilie says:

    Thank you for this article!
    How do you define “relevant” and “irrelevant” variables?
    Are relevant variables those selected at the end by the model? Or those that you think are important before running the LASSO selection?


  6. Emilie says:

    And in “real world” analysis? How do you know which explanatory variables are relevant and which ones are not?


    • insightr says:

      You don’t. But you have an idea of how correlated your variables are and of whether the LASSO might work or not. If the condition is satisfied you will recover the correct set of variables with high probability. If it fails you will probably still recover the correct set of variables plus some irrelevant ones.


      • Emilie says:

        Ok, so collinearity actually is an issue for variable selection with the LASSO (and I guess the elastic net too). I was advised to use this kind of variable selection because of the high collinearity in my dataset, and to have an “automatic” variable selection instead of manually selecting variables based on a threshold of correlation between explanatory variables. But the first results were quite disappointing. In particular, two highly correlated variables were selected, but one had a negative coefficient while it showed a positive relationship with the response variable…


      • insightr says:

        Collinearity between relevant and irrelevant variables, yes. Try the adaLASSO to increase your chances, because its condition is less restrictive.


  7. Emilie says:

    I will! Thank you!


