Re: [Rd] [datatable-help] speeding up perception

Simon Urbanek Mon, 11 Jul 2011 18:26:55 -0700

Matthew,

I was hoping I misunderstood your first proposal, but I suspect I did not ;).


Personally, I find DT[1,V1 <- 3] highly disturbing - I would expect it to 
evaluate to
{ V1 <- 3; DT[1, V1] }
thus returning the first element of the third column.
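
To make that expectation concrete, here is a minimal sketch of what standard evaluation does with an assignment in the j position of a plain data.frame. It uses base R only, and the small three-column DF is made up for the illustration (it is not the 100-column table from the benchmark):

```r
# A small made-up data.frame; the names mimic the default V1..V3 columns
DF <- data.frame(V1 = c(10, 11), V2 = c(20, 21), V3 = c(30, 31))

# The j argument is an ordinary promise: forcing it runs `V1 <- 3` in the
# calling environment and yields the value 3, which then selects column 3.
res <- DF[1, V1 <- 3]

res     # first element of the third column
V1      # the assignment created a variable in the caller, not in DF
DF$V1   # the column itself is untouched
```

So with the usual semantics of [ the assignment is a side effect in the caller and the data is left alone.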

I do understand that within(foo, expr, ...) was the motivation for passing 
expressions, but unlike within() the subsetting operator [ is not expected to 
take an expression as its second argument. Such abuse is quite unexpected and I 
would say dangerous.

That said, I don't think it works, either. Taking your example and data.table 
from R-Forge:

> m = matrix(1,nrow=100000,ncol=100)
> DF = as.data.frame(m)
> DT = as.data.table(m)
> for (i in 1:1000) DT[1,V1 <- 3]
> DT[1,]
     V1 V2 V3 V4 V5 V6 V7 V8 V9 V10 V11 V12 V13 V14 V15 V16 V17 V18 V19 V20 V21
[1,]  1  1  1  1  1  1  1  1  1   1   1   1   1   1   1   1   1   1   1   1   1

as you can see, DT is not modified.

Also I suspect there is something quite amiss because even trivial things don't 
work:

> DF[1:4,1:4]
  V1 V2 V3 V4
1  3  1  1  1
2  1  1  1  1
3  1  1  1  1
4  1  1  1  1
> DT[1:4,1:4]
[1] 1 2 3 4


When I first saw your proposal, I thought you rather had something like
within(DT, V1[1] <- 3)
in mind which looks innocent enough but performs terribly (note that I had to 
scale down the loop by a factor of 100!!!):

> system.time(for (i in 1:10) within(DT, V1[1] <- 3))
   user  system elapsed 
  2.701   4.437   7.138 

With the for loop inside, something like within(DT, for (i in 1:1000) V1[i] <- 3) 
performs reasonably:

> system.time(within(DT, for (i in 1:1000) V1[i] <- 3))
   user  system elapsed 
  0.392   0.613   1.003 

(Note: system.time() can be misleading when within() is involved, because the 
expression is evaluated in a different environment so within() won't actually 
change the object in the global environment - it also interacts with the 
possible duplication)
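
A minimal sketch of that caveat, using base R's within() on a made-up two-column data.frame (the names DF and DF2 are only for illustration):

```r
DF <- data.frame(V1 = c(1, 1, 1), V2 = c(1, 1, 1))

# within() evaluates the expression against a copy of the data and returns
# the modified copy; DF itself changes only if the result is assigned back.
DF2 <- within(DF, V1[1] <- 3)

DF$V1[1]    # original is unchanged
DF2$V1[1]   # the returned copy carries the update
```

A timing of a bare within(...) call therefore measures building a result that is then thrown away, not an update of the object itself.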

Cheers,
Simon

On Jul 11, 2011, at 8:21 PM, Matthew Dowle wrote:

> Thanks for the replies and info. An attempt at fast
> assign is now committed to data.table v1.6.3 on
> R-Forge. From NEWS :
> 
> o   Fast update is now implemented, FR#200.
>    DT[i,j]<-value is now handled by data.table in C rather
>    than falling through to data.frame methods.
> 
>    Thanks to Ivo Welch for raising speed issues on r-devel,
>    to Simon Urbanek for the suggestion, and Luke Tierney and
>    Simon for information on R internals.
> 
>    [<- syntax still incurs one working copy of the whole
>    table (as of R 2.13.0) due to R's [<- dispatch mechanism
>    copying to `*tmp*`, so, for ultimate speed and brevity,
>    'within' syntax is now available as follows.
> 
> o   A new 'within' argument has been added to [.data.table,
>    by default TRUE. It is very similar to the within()
>    function in base R. If an assignment appears in j, it
>    assigns to the column of DT, by reference; e.g.,
> 
>    DT[i,colname<-value]
> 
>    This syntax makes no copies of any part of memory at all.
> 
>> m = matrix(1,nrow=100000,ncol=100)
>> DF = as.data.frame(m)
>> DT = as.data.table(m)
>> system.time(for (i in 1:1000) DF[1,1] <- 3)
>       user  system elapsed 
>    287.730 323.196 613.453 
>> system.time(for (i in 1:1000) DT[1,V1 <- 3])
>       user  system elapsed 
>      1.152   0.004   1.161         # 528 times faster
> 
> Please note :
> 
>    *******************************************************
>    **  Within syntax is presently highly experimental.  **
>    *******************************************************
> 
> http://datatable.r-forge.r-project.org/
> 
> 
> On Wed, 2011-07-06 at 09:08 -0500, [email protected] wrote:
>> On Wed, 6 Jul 2011, Simon Urbanek wrote:
>> 
>>> Interesting, and I stand corrected:
>>> 
>>>> x = data.frame(a=1:n,b=1:n)
>>>> .Internal(inspect(x))
>>> @103511c00 19 VECSXP g0c2 [OBJ,NAM(2),ATT] (len=2, tl=0)
>>> @102c7b000 13 INTSXP g0c7 [] (len=100000, tl=0) 1,2,3,4,5,...
>>> @102af3000 13 INTSXP g0c7 [] (len=100000, tl=0) 1,2,3,4,5,...
>>> 
>>>> x[1,1]=42L
>>>> .Internal(inspect(x))
>>> @10349c720 19 VECSXP g0c2 [OBJ,NAM(2),ATT] (len=2, tl=0)
>>> @102c19000 13 INTSXP g0c7 [] (len=100000, tl=0) 42,2,3,4,5,...
>>> @102b55000 13 INTSXP g0c7 [] (len=100000, tl=0) 1,2,3,4,5,...
>>> 
>>>> x[[1]][1]=42L
>>>> .Internal(inspect(x))
>>> @103511a78 19 VECSXP g1c2 [OBJ,MARK,NAM(2),ATT] (len=2, tl=0)
>>> @102e65000 13 INTSXP g0c7 [] (len=100000, tl=0) 42,2,3,4,5,...
>>> @101f14000 13 INTSXP g1c7 [MARK] (len=100000, tl=0) 1,2,3,4,5,...
>>> 
>>>> x[[1]][1]=42L
>>>> .Internal(inspect(x))
>>> @10349c800 19 VECSXP g0c2 [OBJ,NAM(2),ATT] (len=2, tl=0)
>>> @102a2f000 13 INTSXP g0c7 [] (len=100000, tl=0) 42,2,3,4,5,...
>>> @102ec7000 13 INTSXP g0c7 [] (len=100000, tl=0) 1,2,3,4,5,...
>>> 
>>> 
>>> I have R to release ;) so I won't be looking into this right now, but it's 
>>> something worth investigating ... Since all the inner contents have NAMED=0 
>>> I would not expect any duplication to be needed, but apparently it becomes so 
>>> at some point ...
>> 
>> 
>> The internals assume in various places that deep copies are made (one
>> of the reasons NAMED settings are not propagated to sub-structure).
>> The main issues are avoiding cycles and that there is no easy way to
>> check for sharing.  There may be some circumstances in which a shallow
>> copy would be OK but making sure it would be in all cases is probably
>> more trouble than it is worth at this point. (I've tried this in the
>> past in a few cases and always had to back off.)
>> 
>> 
>> Best,
>> 
>> luke
>> 
>>> 
>>> Cheers,
>>> Simon
>>> 
>>> 
>>> On Jul 6, 2011, at 4:36 AM, Matthew Dowle wrote:
>>> 
>>>> 
>>>> On Tue, 2011-07-05 at 21:11 -0400, Simon Urbanek wrote:
>>>>> No subassignment function satisfies that condition, because you can 
>>>>> always call them directly. However, that doesn't stop the default method 
>>>>> from making that assumption, so I'm not sure it's an issue.
>>>>> 
>>>>> David, Just to clarify - the data frame content is not copied, we are 
>>>>> talking about the vector holding columns.
>>>> 
>>>> If it is just the vector holding the columns that is copied (and not the
>>>> columns themselves), why does n make a difference in this test (on R
>>>> 2.13.0)?
>>>> 
>>>>> n = 1000
>>>>> x = data.frame(a=1:n,b=1:n)
>>>>> system.time(for (i in 1:1000) x[1,1] <- 42L)
>>>>  user  system elapsed
>>>> 0.628   0.000   0.628
>>>>> n = 100000
>>>>> x = data.frame(a=1:n,b=1:n)      # still 2 columns, but longer columns
>>>>> system.time(for (i in 1:1000) x[1,1] <- 42L)
>>>>  user  system elapsed
>>>> 20.145   1.232  21.455
>>>>> 
>>>> 
>>>> With $<- :
>>>> 
>>>>> n = 1000
>>>>> x = data.frame(a=1:n,b=1:n)
>>>>> system.time(for (i in 1:1000) x$a[1] <- 42L)
>>>>  user  system elapsed
>>>> 0.304   0.000   0.307
>>>>> n = 100000
>>>>> x = data.frame(a=1:n,b=1:n)
>>>>> system.time(for (i in 1:1000) x$a[1] <- 42L)
>>>>  user  system elapsed
>>>> 37.586   0.388  38.161
>>>>> 
>>>> 
>>>> If it's because the 1st column needs to be copied (only) because that's
>>>> the one being assigned to (in this test), that magnitude of slow down
>>>> doesn't seem consistent with the time of a vector copy of the 1st
>>>> column :
>>>> 
>>>>> n=100000
>>>>> v = 1:n
>>>>> system.time(for (i in 1:1000) v[1] <- 42L)
>>>>  user  system elapsed
>>>> 0.016   0.000   0.017
>>>>> system.time(for (i in 1:1000) {v2=v;v2[1] <- 42L})
>>>>  user  system elapsed
>>>> 1.816   1.076   2.900
>>>> 
>>>> Finally, increasing the number of columns, again only the 1st is
>>>> assigned to :
>>>> 
>>>>> n=100000
>>>>> x = data.frame(rep(list(1:n),100))
>>>>> dim(x)
>>>> [1] 100000    100
>>>>> system.time(for (i in 1:1000) x[1,1] <- 42L)
>>>>  user  system elapsed
>>>> 167.974  50.903 219.711
>>>>> 
>>>> 
>>>> 
>>>> 
>>>>> 
>>>>> Cheers,
>>>>> Simon
>>>>> 
>>>>> Sent from my iPhone
>>>>> 
>>>>> On Jul 5, 2011, at 9:01 PM, David Winsemius <[email protected]> 
>>>>> wrote:
>>>>> 
>>>>>> 
>>>>>> On Jul 5, 2011, at 7:18 PM, <[email protected]> 
>>>>>> <[email protected]> wrote:
>>>>>> 
>>>>>>> On Tue, 5 Jul 2011, Simon Urbanek wrote:
>>>>>>> 
>>>>>>>> 
>>>>>>>> On Jul 5, 2011, at 2:08 PM, Matthew Dowle wrote:
>>>>>>>> 
>>>>>>>>> Simon (and all),
>>>>>>>>> 
>>>>>>>>> I've tried to make assignment as fast as calling `[<-.data.table`
>>>>>>>>> directly, for user convenience. Profiling shows (IIUC) that it isn't
>>>>>>>>> dispatch, but x being copied. Is there a way to prevent '[<-' from
>>>>>>>>> copying x?
>>>>>>>> 
>>>>>>>> Good point, and conceptually, no. It's a subassignment after all - see 
>>>>>>>> R-lang 3.4.4 - it is equivalent to
>>>>>>>> 
>>>>>>>> `*tmp*` <- x
>>>>>>>> x <- `[<-`(`*tmp*`, i, j, value)
>>>>>>>> rm(`*tmp*`)
>>>>>>>> 
>>>>>>>> so there is always a copy involved.
>>>>>>>> 
>>>>>>>> Now, a conceptual copy doesn't mean a real copy in R since R tries to 
>>>>>>>> keep the pass-by-value illusion while passing references in cases 
>>>>>>>> where it knows that modifications cannot occur and/or they are safe. 
>>>>>>>> The default subassign method uses that feature which means it can 
>>>>>>>> afford to not duplicate if there is only one reference -- then it's 
>>>>>>>> safe to not duplicate as we are replacing that only existing 
>>>>>>>> reference. And in the case of a matrix, that will be true at the 
>>>>>>>> latest from the second subassignment on.
>>>>>>>> 
>>>>>>>> Unfortunately the method dispatch (AFAICS) introduces one more 
>>>>>>>> reference in the dispatch chain so there will always be two references 
>>>>>>>> so duplication is necessary. Since we have only 0 / 1 / 2+ information 
>>>>>>>> on the references, we can't distinguish whether the second reference 
>>>>>>>> is due to the dispatch or due to the passed object having more than 
>>>>>>>> one reference, so we have to duplicate in any case. That is 
>>>>>>>> unfortunate, and I don't see a way around (unless we handle 
>>>>>>>> subassignment methods in some special way).
>>>>>>> 
>>>>>>> I don't believe dispatch is bumping NAMED (and a quick experiment
>>>>>>> seems to confirm this though I don't guarantee I did that right). The
>>>>>>> issue is that a replacement function implemented as a closure, which
>>>>>>> is the only option for a package, will always see NAMED on the object
>>>>>>> to be modified as 2 (because the value is obtained by forcing the
>>>>>>> argument promise) and so any R level assignments will duplicate.  This
>>>>>>> also isn't really an issue of imprecise reference counting -- there
>>>>>>> really are (at least) two legitimate references -- one through the
>>>>>>> argument and one through the caller's environment.
>>>>>>> 
>>>>>>> It would be good if we could come up with a way for packages to be
>>>>>>> able to define replacement functions that do not duplicate in cases
>>>>>>> where we really don't want them to, but this would require coming up
>>>>>>> with some sort of protocol, minimally involving an efficient way to
>>>>>>> detect whether a replacement function is being called in a replacement
>>>>>>> context or directly.
>>>>>> 
>>>>>> Would "$<-" always satisfy that condition? It would be a big help to me if 
>>>>>> it could be designed to avoid duplicating the rest of the data.frame.
>>>>>> 
>>>>>> --
>>>>>> 
>>>>>>> 
>>>>>>> There are some replacement functions that use C code to cheat, but
>>>>>>> these may create problems if called directly, so I won't advertise
>>>>>>> them.
>>>>>>> 
>>>>>>> Best,
>>>>>>> 
>>>>>>> luke
>>>>>>> 
>>>>>>>> 
>>>>>>>> Cheers,
>>>>>>>> Simon
>>>>>>>> 
>>>>>>>> 
>>>>>>>> 
>>>>>>> 
>>>>>>> --
>>>>>>> Luke Tierney
>>>>>>> Statistics and Actuarial Science
>>>>>>> Ralph E. Wareham Professor of Mathematical Sciences
>>>>>>> University of Iowa                  Phone:             319-335-3386
>>>>>>> Department of Statistics and        Fax:               319-335-3017
>>>>>>> Actuarial Science
>>>>>>> 241 Schaeffer Hall                  email:      [email protected]
>>>>>>> Iowa City, IA 52242                 WWW:  http://www.stat.uiowa.edu
>>>>>>> ______________________________________________
>>>>>>> [email protected] mailing list
>>>>>>> https://stat.ethz.ch/mailman/listinfo/r-devel
>>>>>> 
>>>>>> David Winsemius, MD
>>>>>> West Hartford, CT
>>>>>> 
>>>>>> 
>>>> 
>>>> 
>>>> 
>>> 
>>> 
>> 
>> -- 
>> Luke Tierney
>> Statistics and Actuarial Science
>> Ralph E. Wareham Professor of Mathematical Sciences
>> University of Iowa                  Phone:             319-335-3386
>> Department of Statistics and        Fax:               319-335-3017
>>    Actuarial Science
>> 241 Schaeffer Hall                  email:      [email protected]
>> Iowa City, IA 52242                 WWW:  http://www.stat.uiowa.edu
> 
> 
> 

______________________________________________
[email protected] mailing list
https://stat.ethz.ch/mailman/listinfo/r-devel
