Lasso regression is based on the idea of solving
$$\widehat{\mathbf{\beta}}_{\lambda}=\text{argmin}\lbrace -\log\mathcal{L}(\mathbf{\beta}|\mathbf{x},\mathbf{y})+\lambda\|\mathbf{\beta}\|_{\ell_1}\rbrace$$
where $\Vert\mathbf{a} \Vert_{\ell_1}=\sum_{i=1}^d |a_i|$ for any $\mathbf{a}\in\mathbb{R}^d$. In a recent post, we’ve seen computational aspects of the optimization problem. But I went quickly through the story of the $\ell_1$-norm. Because the penalty simply adds up $|\beta_1|$ and $|\beta_2|$, it means, somehow, that the values of $\beta_1$ and $\beta_2$ should be comparable. Yet with two significant variables on very different scales, we should expect the orders (or relative magnitudes) of $\widehat{\beta}_1$ and $\widehat{\beta}_2$ to be very different. So people say that it is therefore necessary to center and reduce (or standardize) the variables.
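To make the scale issue concrete, here is a small worked sketch (the rescaling constant $c>0$ and the tilde notation are introduced here for illustration only). Suppose variable $x_j$ is expressed in another unit, $\tilde{x}_j=c\,x_j$. The likelihood only depends on the product of the variable and its coefficient, so the fit is unchanged as long as $\tilde{\beta}_j=\beta_j/c$, but the penalty is not:

```latex
\tilde{x}_j\tilde{\beta}_j=(c\,x_j)\left(\frac{\beta_j}{c}\right)=x_j\beta_j
\qquad\text{while}\qquad
\lambda|\tilde{\beta}_j|=\frac{\lambda}{c}\,|\beta_j|
```

So changing the unit of $x_j$ by a factor $c$ amounts to penalizing its original coefficient with $\lambda/c$ instead of $\lambda$: a variable measured on a large scale is penalized less.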

Consider the following (simulated) dataset

    library(mnormt)
    # set the seed before the first random draw, so that X is reproducible too
    set.seed(123)
    Sigma = matrix(c(1,.8,.2,.8,1,.4,.2,.4,1),3,3)
    n = 1000
    X = rmnorm(n,rep(0,3),Sigma)
    df = data.frame(X1=X[,1],X2=X[,2],X3=X[,3],X4=rnorm(n),
                    X5=runif(n),X6=exp(X[,3]),
                    X7=sample(c("A","B"),size=n,replace=TRUE,prob=c(.5,.5)),
                    X8=sample(c("C","D"),size=n,replace=TRUE,prob=c(.5,.5)))
    # only X1, X4 and X7 are used to generate Y
    df$Y = 1+df$X1-df$X4+5*(df$X7=="A")+rnorm(n)
    # design matrix: intercept column, X1 to X6, and dummies for X7 and X8
    X = model.matrix(lm(Y~.,data=df))

Use the following colors for the graphs and the values of $\lambda$

    library("RColorBrewer")
    colrs = c(brewer.pal(8,"Set1"))[c(1,4,5,2,6,3,7,8)]
    vlambda = exp(seq(-8,1,length=201))

The first regression we can run is a non-standardized one

    library(glmnet)
    lasso = glmnet(x=X,y=df[,"Y"],family="gaussian",alpha=1,
                   lambda=vlambda,standardize=FALSE)

We can visualize the graphs of $\lambda\mapsto\widehat{\beta}_\lambda$

    # variables whose coefficient is non-null for at least one value of lambda
    idx = which(apply(lasso$beta,1,function(x) sum(x==0))<length(vlambda))
    plot(lasso,col=colrs,xvar='lambda',xlim=c(-5.5,2.3),lwd=2)
    legend(1.2,.9,legend=paste('X',0:8,sep='')[idx],col=colrs,lty=1,lwd=2)

At least, observe that the most significant variables are the ones that were used to generate the data.

Now, consider the case where we standardize the data

    lasso = glmnet(x=X,y=df[,"Y"],family="gaussian",alpha=1,
                   lambda=vlambda,standardize=TRUE)

The graphs of $\lambda\mapsto\widehat{\beta}_\lambda$

The graph is (strangely) very similar to the previous one, except perhaps for the green curve. Maybe categorical variables are not similar to continuous ones… Because somehow, standardization of categorical variables might not be natural…

Why not consider some home-made function? Let us transform (linearly) all the variables in the X matrix (except the first one, which is the intercept)

    Xc = X
    # center and scale every column but the intercept
    for(j in 2:ncol(X)) Xc[,j]=(Xc[,j]-mean(Xc[,j]))/sd(Xc[,j])
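When comparing the plots, it may help to keep in mind that, according to the glmnet documentation, coefficients are always returned on the original scale, even with `standardize=TRUE`, while a home-made standardization yields coefficients on the standardized scale. The sketch below is only an illustration in base R: it uses `lm` (not the lasso) on a small simulated variable, and all names (`x`, `xc`, `b_orig`, `b_std`, `sd_n`) are hypothetical. It shows how the two scales relate, and that `sd()` divides by $n-1$ (I understand that glmnet relies on a $1/n$ formula internally, which I take here as an assumption).

```r
set.seed(1)
n <- 200
x <- rnorm(n, mean = 3, sd = 10)   # a variable on a large scale
y <- 1 + 0.5 * x + rnorm(n)

# home-made standardization, as in the loop above (sd() divides by n - 1)
xc <- (x - mean(x)) / sd(x)

b_orig <- coef(lm(y ~ x))["x"]     # coefficient on the original scale
b_std  <- coef(lm(y ~ xc))["xc"]   # coefficient on the standardized scale

# the standardized coefficient is the original one multiplied by sd(x)
print(all.equal(unname(b_std), unname(b_orig * sd(x))))

# the 1/n standard deviation differs from sd() by a factor sqrt((n - 1) / n)
sd_n <- sqrt(mean((x - mean(x))^2))
print(all.equal(sd_n, sd(x) * sqrt((n - 1) / n)))
```

With $n=1000$ the factor $\sqrt{(n-1)/n}$ is very close to 1, so that part can only explain tiny differences between the graphs; the change of scale of the coefficients is the larger effect.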

Now, we can run our lasso regression on that one (keeping the intercept, since all the covariates are now centered, but y is not)

    lasso = glmnet(x=Xc,y=df$Y,family="gaussian",alpha=1,
                   intercept=TRUE,lambda=vlambda)

The plot is now

    plot(lasso,col=colrs,"lambda",xlim=c(-6.7,1.3),lwd=2)
    idx = which(apply(lasso$beta,1,function(x) sum(x==0))<length(vlambda))
    legend(.15,.45,legend=paste('X',0:8,sep='')[idx],col=colrs,lty=1,bty="n",lwd=2)

Actually, why not also center and scale the y variable, and remove the intercept as well

    Yc = (df[,"Y"]-mean(df[,"Y"]))/sd(df[,"Y"])
    lasso = glmnet(x=Xc,y=Yc,family="gaussian",alpha=1,
                   intercept=FALSE,lambda=vlambda)
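The effect of dividing $y$ by its standard deviation can be worked out by hand. This is only a sketch, assuming the Gaussian objective given in the glmnet documentation, $\frac{1}{2n}\text{RSS}+\lambda\|\mathbf{\beta}\|_{\ell_1}$; the notations $s$ (the standard deviation of $y$), $\tilde{\mathbf{y}}$ (the centered response), $\mathbf{X}_c$ (the standardized covariates) and $\mathbf{b}$ are introduced here for illustration. Writing $\mathbf{b}=\mathbf{\beta}/s$ for the coefficients of the regression of $\tilde{\mathbf{y}}/s$:

```latex
\frac{1}{2n}\left\|\frac{\tilde{\mathbf{y}}}{s}-\mathbf{X}_c\mathbf{b}\right\|^2+\lambda\|\mathbf{b}\|_{\ell_1}
=\frac{1}{s^2}\left[\frac{1}{2n}\left\|\tilde{\mathbf{y}}-\mathbf{X}_c\mathbf{\beta}\right\|^2+s\lambda\|\mathbf{\beta}\|_{\ell_1}\right],
\qquad\mathbf{\beta}=s\,\mathbf{b}
```

So, under that assumption, the path obtained with the standardized response at $\lambda$ should be the path for the centered response at $s\lambda$, divided by $s$: the curves keep their shape, but both axes are rescaled (a shift of $\log s$ on the $\log\lambda$ axis).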

Fortunately, those graphs are very consistent (and if we use them for variable selection, they suggest using the variables that were actually used to generate the dataset). And having both qualitative and quantitative variables is not a big deal. But still, I do not feel comfortable with the differences…

Hey Arthur,

I like your way of making complex concepts so easy to understand,

Thanks.

Arthur, I really liked your LASSO posts. In my mind’s eye I have also struggled with having a mixed variable model (e.g., independent variables which are both continuous and categorical). Post standardization, I have not seen a clear interpretation of the outputted coefficients. I have accepted, but have not seen this stated anywhere, that for a standardized binary independent variable the coefficient would be the change in the dependent variable for a standard deviation change in the prevalence of the binary variable. I work in a more interpretive field, so I need to be able to convey the meaning of the effect estimates. Any comments would be appreciated.

P.S., On your other post you use the bootstrap with LASSO, which provides a great visualization. Though, I have heard Rob Tibshirani mention that BS would not be appropriate for getting precision values if that was ever the purpose of using it.

Thanks again,

Hayden

I always wondered what one should use for the norm in $ \mathbb{R}^{n} $ spaces.

You use $ {\ell}_{1} $ which was my first thought.

Yet I was told on Math Exchange that I was mistaken, and that I should use $ {L}_{1} $, as $ {\ell}_{n} $ is reserved for infinite series which are summable with respect to the $ {\ell}_{n} $ norm.

You have to use L1! That’s the trick! If you use a norm Lp with p in [0,1], it is a “sparse” regression, and if p is at least 1, it is a convex optimization problem. So you have to use L1… L2 would be the Ridge regression. But here, my concern is that those norms are used on “standardized” variables… and I am still not sure what that means…

I’m not talking about the problem. I only talked about Math Conventions.

I think you should use $ {\ell}_{n} $ for Infinite Sequences.

You should use $ {L}_{n} $ for Finite Spaces.

In your post you use $ {\ell}_{n} $ for a finite-dimensional vector.

ok, right, I get your point…