Thanks for all your help and I apologize for not being clear in the beginning. I will try the "group lasso" packages. From the paper, it seems like that is what I want to do. Thanks again!
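For reference, a minimal sketch of how the columns of a dummy-coded design matrix map back to their source variables, which is the grouping information a group lasso fit needs. It uses base R only. The toy data frame `X_toy` and the names `mm`, `x`, and `group` are made up for illustration, and how such an index is passed on depends on the group lasso package chosen, so check its documentation.

```r
# Toy data (hypothetical, not the data from this thread):
# two factors and one numeric column.
X_toy <- data.frame(
  X1 = factor(c("A", "B", "C", "A", "B", "C")),
  X2 = factor(c("E", "F", "E", "F", "E", "F")),
  X3 = c(51, 46, 50, 44, 43, 50)
)

# Dummy-code the factors with treatment contrasts (the R default).
mm <- model.matrix(~ ., data = X_toy)

# The "assign" attribute records which term each column came from.
# 0 marks the intercept, so drop that entry together with its column.
group <- attr(mm, "assign")[-1]
x <- mm[, -1, drop = FALSE]

# One group id per remaining column: both X1 dummies share one id,
# so a group lasso can keep or drop the whole factor at once.
print(data.frame(column = colnames(x), group = group))
```

The point of the index is that all dummy columns of one factor enter or leave the model together, rather than level by level.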
On Tue, May 3, 2011 at 2:40 AM, Nick Sabbe <nick.sa...@ugent.be> wrote:
> For performance reasons, I advise using the following function instead of
> model.matrix:
>
> factorsToDummyVariables<-function(dfr, betweenColAndLevel="")
> {
>     nc<-dim(dfr)[2]
>     firstRow<-dfr[1,]
>     coln<-colnames(dfr)
>     retval<-do.call(cbind, lapply(seq(nc), function(ci){
>         if(is.factor(firstRow[,ci]))
>         {
>             lvls<-levels(firstRow[,ci])[-1]
>             stretchedcols<-sapply(lvls, function(lvl){
>                 rv<-dfr[,ci]==lvl
>                 mode(rv)<-"integer"
>                 return(rv)
>             })
>             if(!is.matrix(stretchedcols))
>                 stretchedcols<-matrix(stretchedcols, nrow=1)
>             colnames(stretchedcols)<-paste(coln[ci],
>                 lvls, sep=betweenColAndLevel)
>             return(stretchedcols)
>         }
>         else
>         {
>             curcol<-matrix(dfr[,ci], ncol=1)
>             colnames(curcol)<-coln[ci]
>             return(curcol)
>         }
>     }))
>     rownames(retval)<-rownames(dfr)
>     return(retval)
> }
>
>
> Just for comparison: here is my old version of the same function, using
> model.matrix:
>
> factorsToDummyVariables.old<-function(dfrPredictors,
>     form=paste("~",paste(colnames(dfrPredictors), collapse="+"), sep=""))
> {
>     #note: this function seems to operate quite slowly!
>     #Because it is used often, it may be worth improving its speed
>     dfrTmp<-model.frame(dfrPredictors, na.action=na.pass)
>     frm<-as.formula(form)
>     mm<-model.matrix(frm, data=dfrTmp)
>     retval<-as.matrix(mm)[,-1]
>
>     return(retval)
> }
>
> In a testcase with a reasonably big dataset, I compared the speeds:
>
> #system.time(tmp.fd.convds.full.man<-manualFactorsToDummyVariables(ds))
> ##   user  system elapsed
> ##   9.44    0.00    9.48
> #system.time(tmp.fd.convds.full<-factorsToDummyVariables.old(ds))
> ##   user  system elapsed
> ##  15.49    0.00   15.64
> #system.time(invisible(factorsToDummyVariables (ds[10,])))
> ##   user  system elapsed
> ##   0.36    0.00    0.36
> #system.time(invisible(factorsToDummyVariables.old (ds[10,])))
> ##   user  system elapsed
> ##   2.18    0.00    2.20
> #system.time(invisible(factorsToDummyVariables (ds[20:30,])))
> ##   user  system elapsed
> ##   0.34    0.00    0.38
> #system.time(invisible(factorsToDummyVariables.old (ds[20:30,])))
> ##   user  system elapsed
> ##   2.11    0.00    2.15
>
> If you have to do this quite often, the difference surely adds up...
> More improvements may be possible.
> This function only works if you don't include interactions, though.
>
>
> Nick Sabbe
> --
> ping: nick.sa...@ugent.be
> link: http://biomath.ugent.be
> wink: A1.056, Coupure Links 653, 9000 Gent
> ring: 09/264.59.36
>
> -- Do Not Disapprove
>
>
> -----Original Message-----
> From: r-help-boun...@r-project.org [mailto:r-help-boun...@r-project.org] On
> Behalf Of David Winsemius
> Sent: Monday 2 May 2011 20:48
> To: Steve Lianoglou
> Cc: r-help@r-project.org
> Subject: Re: [R] Lasso with Categorical Variables
>
>
> On May 2, 2011, at 10:51 AM, Steve Lianoglou wrote:
>
>> Hi,
>>
>> On Mon, May 2, 2011 at 12:45 PM, Clemontina Alexander <ckale...@ncsu.edu> wrote:
>>> Hi! This is my first time posting. I've read the general rules and
>>> guidelines, but please bear with me if I make some fatal error in
>>> posting.
>>> Anyway, I have a continuous response and 29 predictors made
>>> up of continuous variables and nominal and ordinal categorical
>>> variables. I'd like to do lasso on these, but I get an error. The way
>>> I am using "lars" doesn't allow for the factors. Is there a special
>>> option or some other method in order to do lasso with cat. variables?
>>>
>>> Here is an example (considering ordinal variables as just nominal):
>>>
>>> set.seed(1)
>>> Y <- rnorm(10,0,1)
>>> X1 <- factor(sample(x=LETTERS[1:4], size=10, replace = TRUE))
>>> X2 <- factor(sample(x=LETTERS[5:10], size=10, replace = TRUE))
>>> X3 <- sample(x=30:55, size=10, replace=TRUE) # think age
>>> X4 <- rchisq(10, df=4, ncp=0)
>>> X <- data.frame(X1,X2,X3,X4)
>>>
>>> > str(X)
>>> 'data.frame':   10 obs. of  4 variables:
>>>  $ X1: Factor w/ 4 levels "A","B","C","D": 4 1 3 1 2 2 1 2 4 2
>>>  $ X2: Factor w/ 5 levels "E","F","G","H",..: 3 4 3 2 5 5 5 1 5 3
>>>  $ X3: int  51 46 50 44 43 50 30 42 49 48
>>>  $ X4: num  2.86 1.55 1.94 2.45 2.75 ...
>>>
>>>
>>> I'd like to do:
>>> obj <- lars(x=X, y=Y, type = "lasso")
>>>
>>> Instead, what I have been doing is converting all data to continuous
>>> but I think this is really bad!
>>
>> Yeah, it is.
>>
>> Check out the "Categorical Predictor Variables" section here for a way
>> to handle such predictor vars:
>> http://www.psychstat.missouristate.edu/multibook/mlt08m.html
>
> Steve's citation is somewhat helpful, but not sufficient to take the
> next steps. You can find details regarding the mechanics of typical
> linear regression in R on the ?lm page where you find that the factor
> variables are typically handled by model.matrix.
See below: > > > model.matrix(~X1 + X2 + X3 + X4, X) > (Intercept) X1B X1C X1D X2F X2G X2H X2I X3 X4 > 1 1 0 0 1 0 1 0 0 51 2.8640884 > 2 1 0 0 0 0 0 1 0 46 1.5462243 > 3 1 0 1 0 0 1 0 0 50 1.9430901 > 4 1 0 0 0 1 0 0 0 44 2.4504180 > 5 1 1 0 0 0 0 0 1 43 2.7535052 > 6 1 1 0 0 0 0 0 1 50 1.6200326 > 7 1 0 0 0 0 0 0 1 30 0.5750533 > 8 1 1 0 0 0 0 0 0 42 5.9224777 > 9 1 0 0 1 0 0 0 1 49 2.0401528 > 10 1 1 0 0 0 1 0 0 48 6.2995288 > attr(,"assign") > [1] 0 1 1 1 2 2 2 2 3 4 > attr(,"contrasts") > attr(,"contrasts")$X1 > [1] "contr.treatment" > > attr(,"contrasts")$X2 > [1] "contr.treatment" > > The numeric variables are passed through, while the dummy variables > for factor columns are constructed (as treatment contrasts) and the > whole thing it returned in a neat package. > > -- > David. >> >> HTH, >> -steve >> > -- > David Winsemius, MD > Heritage Laboratories > West Hartford, CT > > ______________________________________________ > R-help@r-project.org mailing list > https://stat.ethz.ch/mailman/listinfo/r-help > PLEASE do read the posting guide http://www.R-project.org/posting-guide.html > and provide commented, minimal, self-contained, reproducible code. > > ______________________________________________ R-help@r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/r-help PLEASE do read the posting guide http://www.R-project.org/posting-guide.html and provide commented, minimal, self-contained, reproducible code.