I'm using the package 'lars' in R with the following code:

> library(lars)
> set.seed(3)
> n <- 1000
> x1 <- rnorm(n)
> x2 <- x1+rnorm(n)*0.5
> x3 <- rnorm(n)
> x4 <- rnorm(n)
> x5 <- rexp(n)
> y <- 5*x1 + 4*x2 + 2*x3 + 7*x4 + rnorm(n)
> x <- cbind(x1,x2,x3,x4,x5)
> cor(cbind(y,x))
            y          x1           x2           x3          x4          x5
y  1.00000000  0.74678534  0.743536093  0.210757777  0.59218321  0.03943133
x1 0.74678534  1.00000000  0.892113559  0.015302566 -0.03040464  0.04952222
x2 0.74353609  0.89211356  1.000000000 -0.003146131 -0.02172854  0.05703270
x3 0.21075778  0.01530257 -0.003146131  1.000000000  0.05437726  0.01449142
x4 0.59218321 -0.03040464 -0.021728535  0.054377256  1.00000000 -0.02166716
x5 0.03943133  0.04952222  0.057032700  0.014491422 -0.02166716  1.00000000
> m <- lars(x,y,"step",trace=T)
Forward Stepwise sequence
Computing X'X .....
LARS Step 1 :  Variable 1  added
LARS Step 2 :  Variable 4  added
LARS Step 3 :  Variable 3  added
LARS Step 4 :  Variable 2  added
LARS Step 5 :  Variable 5  added
Computing residuals, RSS etc .....

I have a dataset with 5 continuous predictors and I'm trying to fit a model for a single dependent variable y. Two of my predictors (x1, x2) are highly correlated with each other. As the example above shows, the lars function with the 'stepwise' option first chooses the variable that is most correlated with y. The next variable to enter the model is the one that is most correlated with the residuals. Indeed, it is x4:

> round((cor(cbind(resid(lm(y~x1)),x))[1,3:6]),4)
    x2     x3     x4     x5
0.1163 0.2997 0.9246 0.0037

Now, if I use the 'lasso' option:

> m <- lars(x,y,"lasso",trace=T)
LASSO sequence
Computing X'X ....
LARS Step 1 :  Variable 1  added
LARS Step 2 :  Variable 2  added
LARS Step 3 :  Variable 4  added
LARS Step 4 :  Variable 3  added
LARS Step 5 :  Variable 5  added

it adds both of the correlated variables to the model in the first two steps. This is the opposite of what I have read in several papers. Most of them say that if there is a group of variables among which the correlations are very high, then the lasso tends to select only one variable from the group at random.

Can someone provide an example of this behavior? Or explain why my variables x1 and x2 are added to the model one after the other (together)?
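
For anyone who wants to compare the two entry orders without reading the trace output, here is a minimal sketch. It regenerates the data from the post and assumes that the `actions` component of the object returned by lars() records the variable added (or, if negative, dropped) at each step; the names m_step and m_lasso are only illustrative and do not appear in the original code.

```r
library(lars)

# Same simulated data as in the post
set.seed(3)
n  <- 1000
x1 <- rnorm(n)
x2 <- x1 + rnorm(n) * 0.5   # x2 is strongly correlated with x1
x3 <- rnorm(n)
x4 <- rnorm(n)
x5 <- rexp(n)
y  <- 5*x1 + 4*x2 + 2*x3 + 7*x4 + rnorm(n)
x  <- cbind(x1, x2, x3, x4, x5)

# Fit both paths (trace omitted; the order is read from the objects instead)
m_step  <- lars(x, y, type = "stepwise")
m_lasso <- lars(x, y, type = "lasso")

# Entry order: column index of the variable acted on at each step
# (assumed: positive = added, negative = dropped)
step_order  <- unlist(m_step$actions)
lasso_order <- unlist(m_lasso$actions)
print(step_order)
print(lasso_order)

# Coefficients along the lasso path, one row per step (row 1 is the null model)
print(round(coef(m_lasso), 3))
```

Looking at the rows of coef(m_lasso) next to the entry order shows how large the x1 coefficient already is at the step where x2 enters, which may help when comparing against the stepwise fit.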