split is probably what you are after.  Here is an example:

> n <- 2700000
> x <- data.frame(name=sample(1:6000,n,TRUE), value=runif(n))
> # split the values into a list of 6000 vectors, one per name
> system.time(y <- split(x$value, x$name))
   user  system elapsed
   0.80    0.20    1.07
> str(y[1:10])
List of 10
 $ 1 : num [1:454] 0.270 0.380 0.238 0.048 0.715 ...
 $ 2 : num [1:440] 0.769 0.822 0.832 0.527 0.808 ...
 $ 3 : num [1:444] 0.626 0.324 0.918 0.916 0.743 ...
 $ 4 : num [1:455] 0.341 0.482 0.134 0.237 0.324 ...
 $ 5 : num [1:430] 0.610 0.217 0.245 0.716 0.600 ...
 $ 6 : num [1:443] 0.460 0.335 0.503 0.798 0.181 ...
 $ 7 : num [1:424] 0.4417 0.4759 0.7436 0.0863 0.1770 ...
 $ 8 : num [1:480] 0.0712 0.6774 0.2995 0.8378 0.1902 ...
 $ 9 : num [1:431] 0.892 0.836 0.397 0.612 0.395 ...
 $ 10: num [1:448] 0.984 0.601 0.793 0.363 0.898 ...
>

It takes about 1 second to split the values into 6000 groups.
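A rough sketch of how the result of split() can then be used for the "hash"-like lookup asked about further down the thread. The smaller data size, the seed, and the variable names grp and same are assumptions made for illustration and are not from the original post:

```r
# Illustrative sketch only: a smaller, seeded data set (sizes are assumptions).
set.seed(1)
n <- 10000
x <- data.frame(name = sample(1:60, n, TRUE), value = runif(n))

# One pass over the data builds a named list with one numeric vector per name.
y <- split(x$value, x$name)

# The list names are the group labels as character strings, so a group is
# fetched by name instead of rescanning the whole data frame each time.
grp <- y[["42"]]

# Check that this matches subsetting the data frame for that one name.
same <- identical(grp, x$value[x$name == 42])
```

The point of the design is that the expensive scan happens once inside split(); each later lookup such as y[["42"]] only indexes the list.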
On Wed, Aug 13, 2008 at 9:03 AM, Emmanuel Levy <[EMAIL PROTECTED]> wrote:
> Wow great! Split was exactly what was needed. It takes about 1 second
> for the whole operation :D
>
> Thanks again - I can't believe I never used this function in the past.
>
> All the best,
>
> Emmanuel
>
>
> 2008/8/13 Erik Iverson <[EMAIL PROTECTED]>:
>> I still don't understand what you are doing.  Can you make a small example
>> that shows what you have and what you want?
>>
>> Is ?split what you are after?
>>
>> Emmanuel Levy wrote:
>>>
>>> Dear Peter and Henrik,
>>>
>>> Thanks for your replies - this helps speed up a bit, but I thought
>>> there would be something much faster.
>>>
>>> What I mean is that I thought that a particular value of a level
>>> could be accessed instantly, similarly to a "hash" key.
>>>
>>> Since I've got about 6000 levels in that data frame, it means that
>>> making a list L of the form
>>> L[[1]] = values of name "1"
>>> L[[2]] = values of name "2"
>>> L[[3]] = values of name "3"
>>> ...
>>> would take ~1hour.
>>>
>>> Best,
>>>
>>> Emmanuel
>>>
>>>
>>>
>>>
>>> 2008/8/12 Henrik Bengtsson <[EMAIL PROTECTED]>:
>>>>
>>>> To simplify:
>>>>
>>>> n <- 2.7e6;
>>>> x <- factor(c(rep("A", n/2), rep("B", n/2)));
>>>>
>>>> # Identify 'A':s
>>>> t1 <- system.time(res <- which(x == "A"));
>>>>
>>>> # To compare a factor to a string, the factor is in practice
>>>> # coerced to a character vector.
>>>> t2 <- system.time(res <- which(as.character(x) == "A"));
>>>>
>>>> # Interestingly enough, this seems to be faster (repeated many times)
>>>> # Don't know why.
>>>> print(t2/t1);
>>>>     user   system  elapsed
>>>> 0.632653 1.600000 0.754717
>>>>
>>>> # Avoid coercing the factor, but instead coerce the level compared to
>>>> t3 <- system.time(res <- which(x == match("A", levels(x))));
>>>>
>>>> # ...but gives no speed up
>>>> print(t3/t1);
>>>>     user   system  elapsed
>>>> 1.041667 1.000000 1.018182
>>>>
>>>> # But coercing the factor to integers does
>>>> t4 <- system.time(res <- which(as.integer(x) == match("A", levels(x))))
>>>> print(t4/t1);
>>>>      user    system   elapsed
>>>> 0.4166667 0.0000000 0.3636364
>>>>
>>>> So, the latter seems to be the fastest way to identify those elements.
>>>>
>>>> My $.02
>>>>
>>>> /Henrik
>>>>
>>>>
>>>> On Tue, Aug 12, 2008 at 7:31 PM, Peter Cowan <[EMAIL PROTECTED]> wrote:
>>>>>
>>>>> Emmanuel,
>>>>>
>>>>> On Tue, Aug 12, 2008 at 4:35 PM, Emmanuel Levy <[EMAIL PROTECTED]>
>>>>> wrote:
>>>>>>
>>>>>> Dear All,
>>>>>>
>>>>>> I have a large data frame (2700000 lines and 14 columns), and I
>>>>>> would like to extract the information in a particular way
>>>>>> illustrated below:
>>>>>>
>>>>>>
>>>>>> Given a data frame "df":
>>>>>>
>>>>>>> col1=sample(c(0,1),10, rep=T)
>>>>>>> names = factor(c(rep("A",5),rep("B",5)))
>>>>>>> df = data.frame(names,col1)
>>>>>>> df
>>>>>>
>>>>>>    names col1
>>>>>> 1      A    1
>>>>>> 2      A    0
>>>>>> 3      A    1
>>>>>> 4      A    0
>>>>>> 5      A    1
>>>>>> 6      B    0
>>>>>> 7      B    0
>>>>>> 8      B    1
>>>>>> 9      B    0
>>>>>> 10     B    0
>>>>>>
>>>>>> I would like to transform it in the form:
>>>>>>
>>>>>>> index = c("A","B")
>>>>>>> col1[[1]]=df$col1[which(df$name=="A")]
>>>>>>> col1[[2]]=df$col1[which(df$name=="B")]
>>>>>
>>>>> I'm not sure I fully understand your problem; your example would not
>>>>> run for me.
>>>>>
>>>>> You could get a small speedup by omitting which(), since you can
>>>>> also subset by a logical vector.
>>>>>
>>>>>> n <- 2700000
>>>>>> foo <- data.frame(
>>>>>
>>>>> +   one = sample(c(0,1), n, rep = T),
>>>>> +   two = factor(c(rep("A", n/2 ),rep("B", n/2 )))
>>>>> + )
>>>>>>
>>>>>> system.time(out <- which(foo$two=="A"))
>>>>>
>>>>>    user  system elapsed
>>>>>   0.566   0.146   0.761
>>>>>>
>>>>>> system.time(out <- foo$two=="A")
>>>>>
>>>>>    user  system elapsed
>>>>>   0.429   0.075   0.588
>>>>>
>>>>> You might also find use for unstack(), though I didn't see a speedup.
>>>>>>
>>>>>> system.time(out <- unstack(foo))
>>>>>
>>>>>    user  system elapsed
>>>>>   1.068   0.697   2.004
>>>>>
>>>>> HTH
>>>>>
>>>>> Peter
>>>>>
>>>>>> My problem is that the command: *** which(df$name=="A") ***
>>>>>> takes about 1 second because df is so big.
>>>>>>
>>>>>> I was thinking that a "level" could maybe be accessed instantly but I
>>>>>> am not
>>>>>> sure about how to do it.
>>>>>>
>>>>>> I would be very grateful for any advice that would allow me to speed
>>>>>> this up.
>>>>>>
>>>>>> Best wishes,
>>>>>>
>>>>>> Emmanuel
>>>>>
>>>>> ______________________________________________
>>>>> R-help@r-project.org mailing list
>>>>> https://stat.ethz.ch/mailman/listinfo/r-help
>>>>> PLEASE do read the posting guide
>>>>> http://www.R-project.org/posting-guide.html
>>>>> and provide commented, minimal, self-contained, reproducible code.
>>>>>
>>
>

--
Jim Holtman
Cincinnati, OH
+1 513 646 9390

What is the problem that you are trying to solve?