Re: [Rd] [datatable-help] speeding up perception

luke-tierney Wed, 06 Jul 2011 07:09:39 -0700

On Wed, 6 Jul 2011, Simon Urbanek wrote:

Interesting, and I stand corrected:

n = 100000
x = data.frame(a=1:n,b=1:n)
.Internal(inspect(x))

@103511c00 19 VECSXP g0c2 [OBJ,NAM(2),ATT] (len=2, tl=0)
 @102c7b000 13 INTSXP g0c7 [] (len=100000, tl=0) 1,2,3,4,5,...
 @102af3000 13 INTSXP g0c7 [] (len=100000, tl=0) 1,2,3,4,5,...

x[1,1]=42L
.Internal(inspect(x))

@10349c720 19 VECSXP g0c2 [OBJ,NAM(2),ATT] (len=2, tl=0)
 @102c19000 13 INTSXP g0c7 [] (len=100000, tl=0) 42,2,3,4,5,...
 @102b55000 13 INTSXP g0c7 [] (len=100000, tl=0) 1,2,3,4,5,...

x[[1]][1]=42L
.Internal(inspect(x))

@103511a78 19 VECSXP g1c2 [OBJ,MARK,NAM(2),ATT] (len=2, tl=0)
 @102e65000 13 INTSXP g0c7 [] (len=100000, tl=0) 42,2,3,4,5,...
 @101f14000 13 INTSXP g1c7 [MARK] (len=100000, tl=0) 1,2,3,4,5,...

x[[1]][1]=42L
.Internal(inspect(x))

@10349c800 19 VECSXP g0c2 [OBJ,NAM(2),ATT] (len=2, tl=0)
 @102a2f000 13 INTSXP g0c7 [] (len=100000, tl=0) 42,2,3,4,5,...
 @102ec7000 13 INTSXP g0c7 [] (len=100000, tl=0) 1,2,3,4,5,...


I have R to release ;) so I won't be looking into this right now, but it's 
something worth investigating ... Since all the inner contents have NAMED=0 I 
would not expect any duplication to be needed, but apparently it becomes so at 
some point ...



The internals assume in various places that deep copies are made (one
of the reasons NAMED settings are not propagated to sub-structure).
The main issues are avoiding cycles and that there is no easy way to
check for sharing.  There may be some circumstances in which a shallow
copy would be OK but making sure it would be in all cases is probably
more trouble than it is worth at this point. (I've tried this in the
past in a few cases and always had to back off.)


Best,

luke


Cheers,
Simon


On Jul 6, 2011, at 4:36 AM, Matthew Dowle wrote:


On Tue, 2011-07-05 at 21:11 -0400, Simon Urbanek wrote:

No subassignment function satisfies that condition, because you can always call 
them directly. However, that doesn't stop the default method from making that 
assumption, so I'm not sure it's an issue.

David, Just to clarify - the data frame content is not copied, we are talking 
about the vector holding columns.


If it is just the vector holding the columns that is copied (and not the
columns themselves), why does n make a difference in this test (on R
2.13.0)?

n = 1000
x = data.frame(a=1:n,b=1:n)
system.time(for (i in 1:1000) x[1,1] <- 42L)

  user  system elapsed
 0.628   0.000   0.628

n = 100000
x = data.frame(a=1:n,b=1:n)      # still 2 columns, but longer columns
system.time(for (i in 1:1000) x[1,1] <- 42L)

  user  system elapsed
20.145   1.232  21.455


With $<- :

n = 1000
x = data.frame(a=1:n,b=1:n)
system.time(for (i in 1:1000) x$a[1] <- 42L)

  user  system elapsed
 0.304   0.000   0.307

n = 100000
x = data.frame(a=1:n,b=1:n)
system.time(for (i in 1:1000) x$a[1] <- 42L)

  user  system elapsed
37.586   0.388  38.161


If it's because the 1st column needs to be copied (only) because that's
the one being assigned to (in this test), that magnitude of slow down
doesn't seem consistent with the time of a vector copy of the 1st
column :

n=100000
v = 1:n
system.time(for (i in 1:1000) v[1] <- 42L)

  user  system elapsed
 0.016   0.000   0.017

system.time(for (i in 1:1000) {v2=v;v2[1] <- 42L})

  user  system elapsed
 1.816   1.076   2.900

Finally, increasing the number of columns, again only the 1st is
assigned to :

n=100000
x = data.frame(rep(list(1:n),100))
dim(x)

[1] 100000    100

system.time(for (i in 1:1000) x[1,1] <- 42L)

  user  system elapsed
167.974  50.903 219.711
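
The timings above can be reproduced in one place with a small helper. This is only a sketch: `time_assign`, its arguments, and the column names `V1`, `V2`, ... are hypothetical and not part of the original messages, and absolute numbers depend heavily on the R version and machine.

```r
# Hypothetical helper: time `reps` single-cell assignments on a data frame
# with `ncols` integer columns of length `n`.
time_assign <- function(n, ncols = 2, reps = 100) {
  x <- data.frame(rep(list(1:n), ncols))
  names(x) <- paste0("V", seq_len(ncols))
  # system.time() evaluates the loop in this function's frame, so `x` is
  # the local data frame created above.
  elapsed <- system.time(for (i in seq_len(reps)) x[1, 1] <- 42L)[["elapsed"]]
  stopifnot(x[1, 1] == 42L)
  elapsed
}

# Same number of columns, increasing column length.
timings <- sapply(c(1e3, 1e4, 1e5), time_assign)
```

If the elapsed time grows with `n` while the number of columns stays fixed, more than the vector holding the columns is being copied on each assignment.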


Cheers,
Simon

Sent from my iPhone

On Jul 5, 2011, at 9:01 PM, David Winsemius <dwinsem...@comcast.net> wrote:


On Jul 5, 2011, at 7:18 PM, <luke-tier...@uiowa.edu> <luke-tier...@uiowa.edu> 
wrote:

On Tue, 5 Jul 2011, Simon Urbanek wrote:


On Jul 5, 2011, at 2:08 PM, Matthew Dowle wrote:

Simon (and all),

I've tried to make assignment as fast as calling `[<-.data.table`
directly, for user convenience. Profiling shows (IIUC) that it isn't
dispatch, but x being copied. Is there a way to prevent '[<-' from
copying x?


Good point, and conceptually, no. It's a subassignment after all - see R-lang 
3.4.4 - it is equivalent to

`*tmp*` <- x
x <- `[<-`(`*tmp*`, i, j, value)
rm(`*tmp*`)

so there is always a copy involved.
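
A minimal sketch of that equivalence, assuming a small two-column data frame (the names `x` and `y` are illustrative only). The replacement syntax and the expanded form should produce the same object:

```r
x <- data.frame(a = 1:3, b = 4:6)
y <- data.frame(a = 1:3, b = 4:6)

# Replacement syntax.
x[1, 1] <- 42L

# Expanded form, following R-lang 3.4.4; `[<-` dispatches to the
# data.frame method here.
`*tmp*` <- y
y <- `[<-`(`*tmp*`, 1, 1, value = 42L)
rm(`*tmp*`)

same <- identical(x, y)
```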

Now, a conceptual copy doesn't mean real copy in R since R tries to keep the 
pass-by-value illusion while passing references in cases where it knows that 
modifications cannot occur and/or they are safe. The default subassign method 
uses that feature which means it can afford to not duplicate if there is only 
one reference -- then it's safe to not duplicate as we are replacing that only 
existing reference. And in the case of a matrix, that will be true at the 
latest from the second subassignment on.
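
A small sketch of the pass-by-value illusion, using plain integer vectors (the names `a` and `b` are illustrative). Whether R duplicates internally is an implementation detail; what is guaranteed is that the other binding never sees the change:

```r
a <- c(1L, 2L, 3L)
b <- a            # conceptually a copy; internally both names may share one vector

b[1] <- 42L       # two references exist, so R must duplicate before modifying

# `a` is untouched, `b` carries the change.
a_unchanged <- identical(a, c(1L, 2L, 3L))
b_changed   <- identical(b, c(42L, 2L, 3L))
```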

Unfortunately the method dispatch (AFAICS) introduces one more reference in the 
dispatch chain so there will always be two references so duplication is 
necessary. Since we have only 0 / 1 / 2+ information on the references, we 
can't distinguish whether the second reference is due to the dispatch or due to 
the passed object having more than one reference, so we have to duplicate in 
any case. That is unfortunate, and I don't see a way around (unless we handle 
subassignment methods in some special way).


I don't believe dispatch is bumping NAMED (and a quick experiment
seems to confirm this though I don't guarantee I did that right). The
issue is that a replacement function implemented as a closure, which
is the only option for a package, will always see NAMED on the object
to be modified as 2 (because the value is obtained by forcing the
argument promise) and so any R level assignments will duplicate.  This
also isn't really an issue of imprecise reference counting -- there
really are (at least) two legitimate references -- one through the
argument and one through the caller's environment.
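
To make the closure case concrete, here is a sketch of a package-style replacement function. The name `first<-` is hypothetical and chosen only for illustration. The same closure can be reached through replacement syntax or called directly, and in both cases the R-level assignment inside it works on a value that is also referenced from the caller:

```r
# Hypothetical replacement function written as a closure.
`first<-` <- function(x, value) {
  x[1] <- value   # R-level assignment; x is also referenced by the caller
  x
}

v <- 1:5
first(v) <- 42L            # replacement context: v is rebound to the result

w <- 1:5
w2 <- `first<-`(w, 42L)    # direct call: w itself must stay unchanged
```

In the direct call `w` has to remain `1:5`, which is why the closure cannot simply modify its argument in place without some protocol telling it which context it is in.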

It would be good if we could come up with a way for packages to be
able to define replacement functions that do not duplicate in cases
where we really don't want them to, but this would require coming up
with some sort of protocol, minimally involving an efficient way to
detect whether a replacement function is being called in a replacement
context or directly.


Would "$<-" always satisfy that condition? It would be a big help to me if it 
could be designed to avoid duplicating the rest of the data.frame.



There are some replacement functions that use C code to cheat, but
these may create problems if called directly, so I won't advertise
them.

Best,

luke


Cheers,
Simon


--
Luke Tierney
Statistics and Actuarial Science
Ralph E. Wareham Professor of Mathematical Sciences
University of Iowa                  Phone:             319-335-3386
Department of Statistics and        Fax:               319-335-3017
Actuarial Science
241 Schaeffer Hall                  email:      l...@stat.uiowa.edu
Iowa City, IA 52242                 WWW:  http://www.stat.uiowa.edu

______________________________________________
R-devel@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-devel


David Winsemius, MD
West Hartford, CT


