Dear Peter and Henrik,

Thanks for your replies. This speeds things up a bit, but I thought there
would be something much faster.
What I mean is that I thought that a particular value of a level could
be accessed instantly, similarly to a "hash" key. Since I've got about
6000 levels in that data frame, it means that making a list L of the
form

L[[1]] = values of name "1"
L[[2]] = values of name "2"
L[[3]] = values of name "3"
...

would take ~1 hour.

Best,

Emmanuel


2008/8/12 Henrik Bengtsson <[EMAIL PROTECTED]>:
> To simplify:
>
> n <- 2.7e6;
> x <- factor(c(rep("A", n/2), rep("B", n/2)));
>
> # Identify 'A':s
> t1 <- system.time(res <- which(x == "A"));
>
> # To compare a factor to a string, the factor is in practice
> # coerced to a character vector.
> t2 <- system.time(res <- which(as.character(x) == "A"));
>
> # Interestingly enough, this seems to be faster (repeated many times)
> # Don't know why.
> print(t2/t1);
>     user   system  elapsed
> 0.632653 1.600000 0.754717
>
> # Avoid coercing the factor, but instead coerce the level compared to
> t3 <- system.time(res <- which(x == match("A", levels(x))));
>
> # ...but gives no speed up
> print(t3/t1);
>     user   system  elapsed
> 1.041667 1.000000 1.018182
>
> # But coercing the factor to integers does
> t4 <- system.time(res <- which(as.integer(x) == match("A", levels(x))))
> print(t4/t1);
>      user    system   elapsed
> 0.4166667 0.0000000 0.3636364
>
> So, the latter seems to be the fastest way to identify those elements.
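
A minimal sketch of the integer-code comparison from the quoted benchmark,
to check that it selects the same elements as the plain label comparison.
The shortened length (n = 1e5) and the names `code`, `idx_int` and
`idx_chr` are my own choices for illustration, not from the thread. No
timings are claimed here; they will differ by machine and R version.

```r
# Small stand-in for the 2.7e6-element factor in the quoted benchmark.
n <- 1e5
x <- factor(c(rep("A", n / 2), rep("B", n / 2)))

# Look up the integer code of level "A" once, then compare plain integers.
code    <- match("A", levels(x))
idx_int <- which(as.integer(x) == code)

# Reference result: compare the factor against the level label.
idx_chr <- which(x == "A")

# Both approaches should pick out exactly the same positions.
identical(idx_int, idx_chr)
```

As far as I can tell, comparing the factor itself to an integer code (the
t3 line above) goes through the factor's `==` method, which compares level
labels rather than codes, so that variant may not select the same elements
and its timing is probably not comparable.
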
>
> My $.02
>
> /Henrik
>
>
> On Tue, Aug 12, 2008 at 7:31 PM, Peter Cowan <[EMAIL PROTECTED]> wrote:
>> Emmanuel,
>>
>> On Tue, Aug 12, 2008 at 4:35 PM, Emmanuel Levy <[EMAIL PROTECTED]> wrote:
>>> Dear All,
>>>
>>> I have a large data frame (2700000 lines and 14 columns), and I would
>>> like to extract the information in a particular way illustrated below:
>>>
>>>
>>> Given a data frame "df":
>>>
>>> > col1=sample(c(0,1),10, rep=T)
>>> > names = factor(c(rep("A",5),rep("B",5)))
>>> > df = data.frame(names,col1)
>>> > df
>>>    names col1
>>> 1      A    1
>>> 2      A    0
>>> 3      A    1
>>> 4      A    0
>>> 5      A    1
>>> 6      B    0
>>> 7      B    0
>>> 8      B    1
>>> 9      B    0
>>> 10     B    0
>>>
>>> I would like to transform it in the form:
>>>
>>> > index = c("A","B")
>>> > col1[[1]]=df$col1[which(df$name=="A")]
>>> > col1[[2]]=df$col1[which(df$name=="B")]
>>
>> I'm not sure I fully understand your problem; your example would not run
>> for me.
>>
>> You could get a small speedup by omitting which(), since you can also
>> subset by a logical vector.
>>
>> > n <- 2700000
>> > foo <- data.frame(
>> +   one = sample(c(0,1), n, rep = T),
>> +   two = factor(c(rep("A", n/2 ),rep("B", n/2 )))
>> + )
>> > system.time(out <- which(foo$two=="A"))
>>    user  system elapsed
>>   0.566   0.146   0.761
>> > system.time(out <- foo$two=="A")
>>    user  system elapsed
>>   0.429   0.075   0.588
>>
>> You might also find use for unstack(), though I didn't see a speedup.
>> > system.time(out <- unstack(foo))
>>    user  system elapsed
>>   1.068   0.697   2.004
>>
>> HTH
>>
>> Peter
>>
>>> My problem is that the command: *** which(df$name=="A") ***
>>> takes about 1 second because df is so big.
>>>
>>> I was thinking that a "level" could maybe be accessed instantly but I am
>>> not sure about how to do it.
>>>
>>> I would be very grateful for any advice that would allow me to speed
>>> this up.
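
One possible way to get the per-level list asked for above without
scanning the whole column once per level is base R's split(), which groups
a vector by a factor in a single call. This is only a sketch on the
10-row example; the variable names `nm` and `L` and the set.seed() call are
my own additions, and I have not timed it on the full 2.7 million rows.

```r
set.seed(1)  # only so the example is repeatable

# Rebuild the small example data frame from the question.
nm   <- factor(c(rep("A", 5), rep("B", 5)))
col1 <- sample(c(0, 1), 10, replace = TRUE)
df   <- data.frame(names = nm, col1 = col1)

# Group col1 by the factor: a named list with one element per level.
L <- split(df$col1, df$names)

L[["A"]]  # values of col1 where names == "A"
L[[2]]    # positional access also works; here the same as L[["B"]]
```

The design point is that the grouping happens once for all levels, so with
about 6000 levels it should avoid running 6000 separate comparisons over
the full column.
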
>>>
>>> Best wishes,
>>>
>>> Emmanuel
>>
>> ______________________________________________
>> R-help@r-project.org mailing list
>> https://stat.ethz.ch/mailman/listinfo/r-help
>> PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
>> and provide commented, minimal, self-contained, reproducible code.