Re: [R] which(df$name=="A") takes ~1 second! (df is very large), but can it be speeded up?

Erik Iverson Wed, 13 Aug 2008 08:41:35 -0700

I still don't understand what you are doing. Can you make a small example that shows what you have and what you want?


Is ?split what you are after?
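
For reference, a minimal sketch of what split() does on a toy data frame shaped like the one in the original post. The seed and the name `by_name` are assumptions made for illustration only.

```r
# Toy data shaped like the example in the original post
set.seed(1)
df <- data.frame(
  names = factor(c(rep("A", 5), rep("B", 5))),
  col1  = sample(c(0, 1), 10, replace = TRUE)
)

# One pass over the data: a named list with one element per level
by_name <- split(df$col1, df$names)

names(by_name)    # "A" "B"
by_name[["A"]]    # same values as df$col1[df$names == "A"]
```

Because the grouping is done once, each later lookup by level name is a plain list access rather than a scan of the whole column.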


Emmanuel Levy wrote:

Dear Peter and Henrik,

Thanks for your replies - they help speed things up a bit, but I thought
there would be something much faster.

What I mean is that I thought that a particular value of a level
could be accessed instantly, similarly to a "hash" key.

Since I've got about 6000 levels in that data frame, it means that
making a list L of the form
L[[1]] = values of name "1"
L[[2]] = values of name "2"
L[[3]] = values of name "3"
...
would take ~1 hour.
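
One way to get something close to hash-style lookup, without calling which() once per level, is to group the row indices in a single pass. This is only a sketch: the factor `x`, its made-up level names, and the sizes below are assumptions, not the poster's data.

```r
# Hypothetical factor with 6000 levels, 20 rows per level, shuffled
set.seed(42)
n_levels <- 6000
x <- factor(sample(rep(sprintf("id%04d", seq_len(n_levels)), each = 20)))

# Group row indices by level in one pass instead of 6000 which() calls
idx <- split(seq_along(x), x)

# Looking up one level is then a list access by name
rows <- idx[["id0001"]]
all(x[rows] == "id0001")
```

The list `idx` plays the role of L in the description above, except that it stores row positions, so the same index can be reused for any column of the data frame.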

Best,

Emmanuel




2008/8/12 Henrik Bengtsson <[EMAIL PROTECTED]>:

To simplify:

n <- 2.7e6;
x <- factor(c(rep("A", n/2), rep("B", n/2)));

# Identify 'A':s
t1 <- system.time(res <- which(x == "A"));

# To compare a factor to a string, the factor is in practice
# coerced to a character vector.
t2 <- system.time(res <- which(as.character(x) == "A"));

# Interestingly enough, this seems to be faster (repeated many times)
# Don't know why.
print(t2/t1);
   user   system  elapsed
0.632653 1.600000 0.754717

# Avoid coercing the factor; instead convert the level being compared
# to its integer code
t3 <- system.time(res <- which(x == match("A", levels(x))));

# ...but gives no speed up
print(t3/t1);
   user   system  elapsed
1.041667 1.000000 1.018182

# But coercing the factor to integers does
t4 <- system.time(res <- which(as.integer(x) == match("A", levels(x))))
print(t4/t1);
    user    system   elapsed
0.4166667 0.0000000 0.3636364

So, the latter seems to be the fastest way to identify those elements.
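
A small check of the integer-code approach on a tiny factor, as a sketch (the names `code`, `by_code` and `by_label` are illustrative). One caveat, based on how `==` behaves for factors in current versions of R: the comparison is made on the level labels, so comparing the factor itself to an integer code does not test the underlying codes. That is why the explicit as.integer() matters.

```r
x <- factor(c("A", "B", "A", "B", "A"))

# Integer code of the level "A" (its position in levels(x))
code <- match("A", levels(x))

# Compare the underlying integer codes rather than the labels
by_code  <- which(as.integer(x) == code)
by_label <- which(x == "A")
identical(by_code, by_label)   # TRUE

# Comparing the factor itself to the code matches nothing here,
# because the labels "A"/"B" are compared with "1"
which(x == code)               # integer(0)
```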

My $.02

/Henrik


On Tue, Aug 12, 2008 at 7:31 PM, Peter Cowan <[EMAIL PROTECTED]> wrote:

Emmanuel,

On Tue, Aug 12, 2008 at 4:35 PM, Emmanuel Levy <[EMAIL PROTECTED]> wrote:

Dear All,

I have a large data frame (2700000 lines and 14 columns), and I would like to
extract the information in a particular way illustrated below:


Given a data frame "df":

col1=sample(c(0,1),10, rep=T)
names = factor(c(rep("A",5),rep("B",5)))
df = data.frame(names,col1)
df

  names col1
1      A    1
2      A    0
3      A    1
4      A    0
5      A    1
6      B    0
7      B    0
8      B    1
9      B    0
10     B    0

I would like to transform it into the form:

index = c("A","B")
col1[[1]]=df$col1[which(df$name=="A")]
col1[[2]]=df$col1[which(df$name=="B")]

I'm not sure I fully understand your problem; your example would not run for me.

You could get a small speedup by omitting which(), since you can also
subset directly with a logical vector.

n <- 2700000
foo <- data.frame(
  one = sample(c(0, 1), n, replace = TRUE),
  two = factor(c(rep("A", n/2), rep("B", n/2)))
)

system.time(out <- which(foo$two=="A"))

  user  system elapsed
 0.566   0.146   0.761

system.time(out <- foo$two=="A")

  user  system elapsed
 0.429   0.075   0.588
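
To illustrate the point about skipping which(): a logical vector can be used as an index directly and gives the same result, provided the factor contains no NAs. A sketch on small made-up data (the values and the names `via_which` and `via_logical` are assumptions):

```r
foo <- data.frame(
  one = c(1, 0, 1, 1, 0, 0),
  two = factor(c("A", "A", "A", "B", "B", "B"))
)

# Index with the positions returned by which() ...
via_which   <- foo$one[which(foo$two == "A")]
# ... or directly with the logical vector
via_logical <- foo$one[foo$two == "A"]

identical(via_which, via_logical)   # TRUE

# Caveat: if the factor contains NAs, logical indexing returns NA
# entries for them, while which() drops those positions
```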

You might also find use for unstack(), though I didn't see a speedup.

system.time(out <- unstack(foo))

  user  system elapsed
 1.068   0.697   2.004

HTH

Peter

My problem is that the command:  *** which(df$name=="A") ***
takes about 1 second because df is so big.

I was thinking that a "level" could maybe be accessed instantly but I am not
sure about how to do it.

I would be very grateful for any advice that would allow me to speed this up.

Best wishes,

Emmanuel

______________________________________________
R-help@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.
