If you want the index, then use:

> system.time(y <- split(seq(nrow(x)), x$name))
   user  system elapsed
   0.81    0.06    0.88
> str(y[1:10])
List of 10
 $ 1 : int [1:454] 6924 17503 26880 39197 42881 50835 57896 62624 65767 75359 ...
 $ 2 : int [1:440] 9954 25619 25761 33776 56651 60372 61042 63134 64414 64491 ...
 $ 3 : int [1:444] 5413 6831 15780 21652 29423 37000 38661 60977 72267 74839 ...
 $ 4 : int [1:455] 23859 24748 27221 34886 40538 41326 45065 79769 81783 83951 ...
 $ 5 : int [1:430] 2572 3514 9934 24969 33844 35409 38122 38161 40113 45593 ...
 $ 6 : int [1:443] 7145 25184 26348 31182 39965 44191 49114 52791 69855 74272 ...
 $ 7 : int [1:424] 4596 11762 24949 30324 57906 59043 64833 70769 88878 90594 ...
 $ 8 : int [1:480] 14809 17604 18958 28436 31449 45339 51829 57725 65243 73260 ...
 $ 9 : int [1:431] 10748 14579 27153 27685 31930 32593 34605 35680 35828 50490 ...
 $ 10: int [1:448] 5292 13049 21132 22673 22983 28324 40099 43709 55505 70957 ...
>
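The row index built by split() can then be used as a lookup table, which is roughly the "hash key" style of access asked for further down the thread. The following is only a minimal sketch: the seed, the smaller sizes (100000 rows, 600 names) and the names idx, rows_42 and vals_42 are assumptions chosen for illustration and do not come from the original posts.

```r
set.seed(42)      # arbitrary seed, only so the sketch is reproducible
n <- 100000       # deliberately smaller than the 2.7e6 rows in the thread
x <- data.frame(name = sample(1:600, n, TRUE), value = runif(n))

# Build the index once: a named list of row numbers, one element per name.
# seq_len() is used instead of seq() so that a zero-row data frame yields
# an empty index rather than c(1, 0).
idx <- split(seq_len(nrow(x)), x$name)

# Each later lookup is a list access by name, not a scan of the whole column.
rows_42 <- idx[["42"]]        # row numbers where name == 42
vals_42 <- x$value[rows_42]   # the matching values
```

The list names are the character form of the grouping values, so the lookup key is "42" rather than the number 42.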

On Wed, Aug 13, 2008 at 9:09 AM, jim holtman <[EMAIL PROTECTED]> wrote:
> split is probably what you are after.  Here is an example:
>
>> n <- 2700000
>> x <- data.frame(name=sample(1:6000,n,TRUE), value=runif(n))
>> # split it into 6000 lists
>> system.time(y <- split(x$value, x$name))
>    user  system elapsed
>    0.80    0.20    1.07
>> str(y[1:10])
> List of 10
>  $ 1 : num [1:454] 0.270 0.380 0.238 0.048 0.715 ...
>  $ 2 : num [1:440] 0.769 0.822 0.832 0.527 0.808 ...
>  $ 3 : num [1:444] 0.626 0.324 0.918 0.916 0.743 ...
>  $ 4 : num [1:455] 0.341 0.482 0.134 0.237 0.324 ...
>  $ 5 : num [1:430] 0.610 0.217 0.245 0.716 0.600 ...
>  $ 6 : num [1:443] 0.460 0.335 0.503 0.798 0.181 ...
>  $ 7 : num [1:424] 0.4417 0.4759 0.7436 0.0863 0.1770 ...
>  $ 8 : num [1:480] 0.0712 0.6774 0.2995 0.8378 0.1902 ...
>  $ 9 : num [1:431] 0.892 0.836 0.397 0.612 0.395 ...
>  $ 10: num [1:448] 0.984 0.601 0.793 0.363 0.898 ...
>>
> Takes less than 1 second to split into 6000 lists.
>
> On Wed, Aug 13, 2008 at 9:03 AM, Emmanuel Levy <[EMAIL PROTECTED]> wrote:
>> Wow great! Split was exactly what was needed. It takes about 1 second
>> for the whole operation :D
>>
>> Thanks again - I can't believe I never used this function in the past.
>>
>> All the best,
>>
>> Emmanuel
>>
>>
>> 2008/8/13 Erik Iverson <[EMAIL PROTECTED]>:
>>> I still don't understand what you are doing.  Can you make a small example
>>> that shows what you have and what you want?
>>>
>>> Is ?split what you are after?
>>>
>>> Emmanuel Levy wrote:
>>>>
>>>> Dear Peter and Henrik,
>>>>
>>>> Thanks for your replies - this helps speed up a bit, but I thought
>>>> there would be something much faster.
>>>>
>>>> What I mean is that I thought that a particular value of a level
>>>> could be accessed instantly, similarly to a "hash" key.
>>>>
>>>> Since I've got about 6000 levels in that data frame, it means that
>>>> making a list L of the form
>>>> L[[1]] = values of name "1"
>>>> L[[2]] = values of name "2"
>>>> L[[3]] = values of name "3"
>>>> ...
>>>> would take ~1hour.
>>>>
>>>> Best,
>>>>
>>>> Emmanuel
>>>>
>>>>
>>>> 2008/8/12 Henrik Bengtsson <[EMAIL PROTECTED]>:
>>>>>
>>>>> To simplify:
>>>>>
>>>>> n <- 2.7e6;
>>>>> x <- factor(c(rep("A", n/2), rep("B", n/2)));
>>>>>
>>>>> # Identify 'A':s
>>>>> t1 <- system.time(res <- which(x == "A"));
>>>>>
>>>>> # To compare a factor to a string, the factor is in practice
>>>>> # coerced to a character vector.
>>>>> t2 <- system.time(res <- which(as.character(x) == "A"));
>>>>>
>>>>> # Interestingly enough, this seems to be faster (repeated many times)
>>>>> # Don't know why.
>>>>> print(t2/t1);
>>>>>     user   system  elapsed
>>>>> 0.632653 1.600000 0.754717
>>>>>
>>>>> # Avoid coercing the factor, but instead coerce the level compared to
>>>>> t3 <- system.time(res <- which(x == match("A", levels(x))));
>>>>>
>>>>> # ...but gives no speed up
>>>>> print(t3/t1);
>>>>>     user   system  elapsed
>>>>> 1.041667 1.000000 1.018182
>>>>>
>>>>> # But coercing the factor to integers does
>>>>> t4 <- system.time(res <- which(as.integer(x) == match("A", levels(x))))
>>>>> print(t4/t1);
>>>>>      user    system   elapsed
>>>>> 0.4166667 0.0000000 0.3636364
>>>>>
>>>>> So, the latter seems to be the fastest way to identify those elements.
>>>>>
>>>>> My $.02
>>>>>
>>>>> /Henrik
>>>>>
>>>>>
>>>>> On Tue, Aug 12, 2008 at 7:31 PM, Peter Cowan <[EMAIL PROTECTED]> wrote:
>>>>>>
>>>>>> Emmanuel,
>>>>>>
>>>>>> On Tue, Aug 12, 2008 at 4:35 PM, Emmanuel Levy <[EMAIL PROTECTED]>
>>>>>> wrote:
>>>>>>>
>>>>>>> Dear All,
>>>>>>>
>>>>>>> I have a large data frame (2700000 lines and 14 columns), and I would
>>>>>>> like to
>>>>>>> extract the information in a particular way illustrated below:
>>>>>>>
>>>>>>>
>>>>>>> Given a data frame "df":
>>>>>>>
>>>>>>>> col1=sample(c(0,1),10, rep=T)
>>>>>>>> names = factor(c(rep("A",5),rep("B",5)))
>>>>>>>> df = data.frame(names,col1)
>>>>>>>> df
>>>>>>>
>>>>>>>    names col1
>>>>>>> 1      A    1
>>>>>>> 2      A    0
>>>>>>> 3      A    1
>>>>>>> 4      A    0
>>>>>>> 5      A    1
>>>>>>> 6      B    0
>>>>>>> 7      B    0
>>>>>>> 8      B    1
>>>>>>> 9      B    0
>>>>>>> 10     B    0
>>>>>>>
>>>>>>> I would like to transform it in the form:
>>>>>>>
>>>>>>>> index = c("A","B")
>>>>>>>> col1[[1]]=df$col1[which(df$name=="A")]
>>>>>>>> col1[[2]]=df$col1[which(df$name=="B")]
>>>>>>
>>>>>> I'm not sure I fully understand your problem; your example would not run
>>>>>> for me.
>>>>>>
>>>>>> You could get a small speedup by omitting which(), since you can also
>>>>>> subset by a logical vector directly.
>>>>>>
>>>>>>> n <- 2700000
>>>>>>> foo <- data.frame(
>>>>>>
>>>>>> +   one = sample(c(0,1), n, rep = T),
>>>>>> +   two = factor(c(rep("A", n/2 ),rep("B", n/2 )))
>>>>>> +   )
>>>>>>>
>>>>>>> system.time(out <- which(foo$two=="A"))
>>>>>>
>>>>>>    user  system elapsed
>>>>>>   0.566   0.146   0.761
>>>>>>>
>>>>>>> system.time(out <- foo$two=="A")
>>>>>>
>>>>>>    user  system elapsed
>>>>>>   0.429   0.075   0.588
>>>>>>
>>>>>> You might also find use for unstack(), though I didn't see a speedup.
>>>>>>>
>>>>>>> system.time(out <- unstack(foo))
>>>>>>
>>>>>>    user  system elapsed
>>>>>>   1.068   0.697   2.004
>>>>>>
>>>>>> HTH
>>>>>>
>>>>>> Peter
>>>>>>
>>>>>>> My problem is that the command:   *** which(df$name=="A") ***
>>>>>>> takes about 1 second because df is so big.
>>>>>>>
>>>>>>> I was thinking that a "level" could maybe be accessed instantly but I
>>>>>>> am not
>>>>>>> sure about how to do it.
>>>>>>>
>>>>>>> I would be very grateful for any advice that would allow me to speed
>>>>>>> this up.
>>>>>>>
>>>>>>> Best wishes,
>>>>>>>
>>>>>>> Emmanuel
>>>>>>
>>>>>> ______________________________________________
>>>>>> R-help@r-project.org mailing list
>>>>>> https://stat.ethz.ch/mailman/listinfo/r-help
>>>>>> PLEASE do read the posting guide
>>>>>> http://www.R-project.org/posting-guide.html
>>>>>> and provide commented, minimal, self-contained, reproducible code.

--
Jim Holtman
Cincinnati, OH
+1 513 646 9390

What is the problem that you are trying to solve?
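
For reference, the small A/B example from the original question can be handled by a single split() call. This is a minimal sketch under stated assumptions: the seed and the name by_name are illustrative, and the data frame is built with explicit column names rather than exactly as in the original post.

```r
set.seed(1)   # arbitrary seed, only so the sketch is reproducible

# Same shape as the toy data in the question: 5 rows of "A", 5 rows of "B"
df <- data.frame(
  names = factor(c(rep("A", 5), rep("B", 5))),
  col1  = sample(c(0, 1), 10, replace = TRUE)
)

# One pass over the data: a named list with one vector of col1 per level
by_name <- split(df$col1, df$names)

by_name[["A"]]   # same values as df$col1[df$names == "A"]
by_name[["B"]]   # same values as df$col1[df$names == "B"]
```

Because the result is a named list, levels(df$names) plays the role of the separate "index" vector in the question.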