To simplify:

n <- 2.7e6;
x <- factor(c(rep("A", n/2), rep("B", n/2)));

# Identify 'A':s
t1 <- system.time(res <- which(x == "A"));

# To compare a factor to a string, the factor is in practice
# coerced to a character vector.
t2 <- system.time(res <- which(as.character(x) == "A"));

# Interestingly enough, this seems to be faster (repeated many times)
# Don't know why.
print(t2/t1);
    user   system  elapsed
0.632653 1.600000 0.754717

# Avoid coercing the factor, but instead coerce the level compared to
# (note: the factor is still compared via its labels here, i.e. against "1",
# so res comes back empty)
t3 <- system.time(res <- which(x == match("A", levels(x))));

# ...but gives no speed up
print(t3/t1);
    user   system  elapsed
1.041667 1.000000 1.018182

# But coercing the factor to integers does
t4 <- system.time(res <- which(as.integer(x) == match("A", levels(x))))
print(t4/t1);
     user    system   elapsed
0.4166667 0.0000000 0.3636364

So, the latter seems to be the fastest way to identify those elements.

My $.02

/Henrik

On Tue, Aug 12, 2008 at 7:31 PM, Peter Cowan <[EMAIL PROTECTED]> wrote:
> Emmanuel,
>
> On Tue, Aug 12, 2008 at 4:35 PM, Emmanuel Levy <[EMAIL PROTECTED]> wrote:
>> Dear All,
>>
>> I have a large data frame (2700000 lines and 14 columns), and I would like
>> to extract the information in a particular way illustrated below:
>>
>>
>> Given a data frame "df":
>>
>> > col1=sample(c(0,1),10, rep=T)
>> > names = factor(c(rep("A",5),rep("B",5)))
>> > df = data.frame(names,col1)
>> > df
>>    names col1
>> 1      A    1
>> 2      A    0
>> 3      A    1
>> 4      A    0
>> 5      A    1
>> 6      B    0
>> 7      B    0
>> 8      B    1
>> 9      B    0
>> 10     B    0
>>
>> I would like to transform it in the form:
>>
>> > index = c("A","B")
>> > col1[[1]]=df$col1[which(df$name=="A")]
>> > col1[[2]]=df$col1[which(df$name=="B")]
>
> I'm not sure I fully understand your problem; your example would not run
> for me.
>
> You could get a small speedup by omitting which(); you can also subset by
> a logical vector directly.
>
> > n <- 2700000
> > foo <- data.frame(
> +   one = sample(c(0,1), n, rep = T),
> +   two = factor(c(rep("A", n/2 ),rep("B", n/2 )))
> + )
> > system.time(out <- which(foo$two=="A"))
>    user  system elapsed
>   0.566   0.146   0.761
> > system.time(out <- foo$two=="A")
>    user  system elapsed
>   0.429   0.075   0.588
>
> You might also find use for unstack(), though I didn't see a speedup.
> > system.time(out <- unstack(foo))
>    user  system elapsed
>   1.068   0.697   2.004
>
> HTH
>
> Peter
>
>> My problem is that the command: *** which(df$name=="A") ***
>> takes about 1 second because df is so big.
>>
>> I was thinking that a "level" could maybe be accessed instantly but I am not
>> sure about how to do it.
>>
>> I would be very grateful for any advice that would allow me to speed this up.
>>
>> Best wishes,
>>
>> Emmanuel
>
> ______________________________________________
> R-help@r-project.org mailing list
> https://stat.ethz.ch/mailman/listinfo/r-help
> PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
> and provide commented, minimal, self-contained, reproducible code.
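
For reference, here is a minimal sketch that puts the integer-code comparison from this thread into a reusable form and applies it to the small data frame from the original question. The helper name which_level() is made up for illustration and is not part of base R. The split() call at the end is an alternative I am suggesting for the grouping step, not something timed in this thread, so it may or may not be faster on the full 2.7 million rows.

```r
# Small reproducible example modelled on the data frame in the original question.
set.seed(1)
df <- data.frame(
  names = factor(c(rep("A", 5), rep("B", 5))),
  col1  = sample(c(0, 1), 10, replace = TRUE)
)

# Compare integer codes instead of strings, as in the t4 timing above.
# which_level() is a hypothetical helper name, not a base R function.
which_level <- function(f, level) {
  code <- match(level, levels(f))  # integer code of the level (NA if absent)
  which(as.integer(f) == code)     # which() drops NA, so an absent level gives integer(0)
}

idx_A <- which_level(df$names, "A")
col1_A <- df$col1[idx_A]

# Alternative for the grouping step (an assumption, untimed here):
# split() partitions col1 by factor level in one call and returns
# a named list with one element per level.
by_name <- split(df$col1, df$names)
```

With this, by_name[["A"]] and by_name[["B"]] play the role of col1[[1]] and col1[[2]] in the original question, and names(by_name) plays the role of index.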