On Wednesday 13 December 2006 6:01 am, Martin Maechler wrote: > > - Vladimir, have you verified your 'take2' against recent versions > of R-devel?
Yes. > > - If they still work, could you re-post them to R-devel, this > time using a proper MIME type, > i.e. most probably one of > application/x-tar > application/x-compressed-tar > application/x-gzip > > In case you don't know how to achieve this, > I'd be interested to get it by "private" e-mail. No problem. The old e-mail did have a mime type: "text/x-diff". I am resending the patch - now compressed, hopefully it will get pass whatever filters are in place. With regard to speedups in R, here is my wish list - I would greatly appreciate comments on what makes sense here or not, etc: 1. I greatly miss equivalents of Tcl append and lappend commands - not the function performed by these commands but efficiency (they are O(1) on average). Tcl easily handles lists with 1e6 components and strings of 10s of megabytes in length. 2. It would be nice to have true hashed arrays in R (i.e. O(1) access times). So far I have used named lists for this, but they are O(n): > L<-list(); system.time(for(i in 1:10000)L[[paste(i)]]<-i); [1] 2.864 0.004 2.868 0.000 0.000 > L<-list(); system.time(for(i in 1:20000)L[[paste(i)]]<-i); [1] 11.789 0.216 12.004 0.000 0.000 3. Efficient manipulation of large numbers of strings. The big reason character row.names are slow is because they require a large number of string objects that slow down garbage collector. This is possibly not a problem that has an easy solution, here are a couple of approaches I have considered: a) Inline strings - use a structure like union { struct { unsigned char size; char body[15]; } inlined_string; /* use this when size<16 */ struct { unsigned char flag; char reserved[7]; /* for 64 bit */ CHRSXP ptr; } indirect_string; /* use this when flag=255 */ } This basically turns small strings into an enum type stored within a 128-bit integer. This would greatly decrease required number of CHRSXP in many common cases (in particular for many rownames). The biggest disadvantage is more complicated access to string data. Also this does not solve the issue of how to deal with 1e6 long strings - though I feel like 15 characters should be good enough for most uses. b) CHRSXPs are always leaf nodes. One could implement true reference counting and create a separate garbage collector pool for them. This way one can rely on reference counting to free string objects during normal operation, but also keep track of the number of referenced strings during garbage collector passes - and trigger string garbage collection passes (with a warning) when the number of referenced strings is much smaller the number of objects in string pool. This gets rid of overhead that strings impose on garbage collector. The disadvantage are very large changes to R code. best Vladimir Dergachev
subset.patch.2.diff.gz
Description: GNU Zip compressed data
______________________________________________ R-devel@r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/r-devel