Gabor Grothendieck wrote:
> There was nothing attached in the copy that came through
> to me.
I would like to see that patch also.

> By the way, there was some discussion earlier this year
> on a light-weight data.frame class but I don't think anyone
> ever posted any code.

It may have been me. I am working on a bit-packed data.frame which
uses only 2 bits per unit of data, so 4 units fit in each byte of a
RAWSXP (work in progress, nothing to show). So I am very interested
to see the patch.

Yes, I took a couple of weeks reading and learning where all the
memory has gone in data.frame. The rowname/column name allocation is
a bit stupid. Each rowname and each column name is a full R object
(a CHARSXP), so there is a 32 (or 28) byte overhead just from managing
that, before the characters of the actual string, which are another
X bytes. So for a 1 x N data.frame with integers for content, the
content is 4 bytes * N, but the rownames/column names are about
32 bytes * N (a 9x increase in total size).

A word is 32 bits on most people's machines, and I am counting one
extra word because the address of each SEXPREC has to be kept
somewhere, so it is 7+1 = 8 words (32 bytes), if I understand it
correctly. Here is the relevant comment, quoted verbatim from around
line 225 of "src/include/Rinternals.h":

/* The generational collector uses a reduced version of SEXPREC as a
   header in vector nodes. The layout MUST be kept consistent with
   the SEXPREC definition. The standard SEXPREC takes up 7 words on
   most hardware; this reduced version should take up only 6 words.
   In addition to slightly reducing memory use, this can lead to more
   favorable data alignment on 32-bit architectures like the Intel
   Pentium III where odd word alignment of doubles is allowed but much
   less efficient than even word alignment. */

Hin-Tak Leung

> On 12/9/05, Matthew Dowle <[EMAIL PROTECTED]> wrote:
>
>>Hi,
>>
>>Please see below for post on r-help regarding data.frame() and the
>>possibility of dropping rownames, for space and time reasons.
>>I've made some changes, attached, and it seems to be working well. I see the
>>expected space (90% saved) and time (10 times faster) savings. There are no
>>doubt some bugs, and needs more work and testing, but I thought I would post
>>first at this stage.
>>
>>Could some changes along these lines be made to R ? I'm happy to help with
>>testing and further work if required. In the meantime I can work with
>>overloaded functions which fixes the problems in my case.
>>
>>Functions affected :
>>
>> dim.data.frame
>> format.data.frame
>> print.data.frame
>> data.frame
>> [.data.frame
>> as.matrix.data.frame
>>
>>Modified source code attached.
>>
>>Regards,
>>Matthew
>>
>>
>>-----Original Message-----
>>From: Matthew Dowle
>>Sent: 09 December 2005 09:44
>>To: 'Peter Dalgaard'
>>Cc: 'r-help@stat.math.ethz.ch'
>>Subject: RE: [R] data.frame() size
>>
>>
>>
>>That explains it. Thanks. I don't need rownames though, as I'll only ever
>>use integer subscripts. Is there any way to drop them, or even better not
>>create them in the first place? The memory saved (90%) by not having them
>>and 10 times speed up would be very useful. I think I need a data.frame
>>rather than a matrix because I have columns of different types in real life.
>>
>>
>>>rownames(d) = NULL
>>
>>Error in "dimnames<-.data.frame"(`*tmp*`, value = list(NULL, c("a", "b" :
>> invalid 'dimnames' given for data frame
>>
>>
>>-----Original Message-----
>>From: [EMAIL PROTECTED] [mailto:[EMAIL PROTECTED] On Behalf Of Peter
>>Dalgaard
>>Sent: 08 December 2005 18:57
>>To: Matthew Dowle
>>Cc: 'r-help@stat.math.ethz.ch'
>>Subject: Re: [R] data.frame() size
>>
>>
>>Matthew Dowle <[EMAIL PROTECTED]> writes:
>>
>>
>>>Hi,
>>>
>>>In the example below why is d 10 times bigger than m, according to
>>>object.size ? It also takes around 10 times as long to create, which
>>>fits with object.size() being truthful. gcinfo(TRUE) also indicates a
>>>great deal more garbage collector activity caused by data.frame() than
>>>matrix().
>>>
>>>$ R --vanilla
>>>....
>>>
>>>>nr = 1000000
>>>>system.time(m<<-matrix(integer(1), nrow=nr, ncol=2))
>>>
>>>[1] 0.22 0.01 0.23 0.00 0.00
>>>
>>>>system.time(d<<-data.frame(a=integer(nr), b=integer(nr)))
>>>
>>>[1] 2.81 0.20 3.01 0.00 0.00 # 10 times longer
>>>
>>>
>>>>dim(m)
>>>
>>>[1] 1000000 2
>>>
>>>>dim(d)
>>>
>>>[1] 1000000 2 # same dimensions
>>>
>>>
>>>>storage.mode(m)
>>>
>>>[1] "integer"
>>>
>>>>sapply(d, storage.mode)
>>>
>>>        a         b
>>>"integer" "integer" # same storage.mode
>>>
>>>
>>>>object.size(m)/1024^2
>>>
>>>[1] 7.629616
>>>
>>>>object.size(d)/1024^2
>>>
>>>[1] 76.29482 # but 10 times bigger
>>>
>>>
>>>>sum(sapply(d, object.size))/1024^2
>>>
>>>[1] 7.629501 # or is it ? If it's not
>>>really 10 times bigger, why 10 times longer above ?
>>
>>Row names!!
>>
>>
>>
>>>r <- as.character(1:1e6)
>>>object.size(r)
>>
>>[1] 72000056
>>
>>>object.size(r)/1024^2
>>
>>[1] 68.6646
>>
>>'nuff said?
>>
>>--
>>   O__  ---- Peter Dalgaard             Øster Farimagsgade 5, Entr.B
>>  c/ /'_ --- Dept. of Biostatistics     PO Box 2099, 1014 Cph. K
>> (*) \(*) -- University of Copenhagen   Denmark      Ph:  (+45) 35327918
>>~~~~~~~~~~ - ([EMAIL PROTECTED])              FAX: (+45) 35327907
>>
>>
>>
>>
>>______________________________________________
>>R-devel@r-project.org mailing list
>>https://stat.ethz.ch/mailman/listinfo/r-devel
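
For anyone following the arithmetic in Hin-Tak Leung's reply in this thread, here is a minimal sketch in R of the per-name overhead estimate. It assumes only the figures stated in the thread: 32-bit words, a 7-word standard SEXPREC (or the 6-word reduced header), plus one word for the pointer to each object. The variable names are made up for illustration, and treating the 28-byte case as "6 + 1 words" is an inferred reading of the "32 (or 28)" remark. It is a back-of-envelope model, not a measurement.

```r
# Back-of-envelope model of the overhead figures quoted in the thread.
# Assumptions (taken from the message, not measured):
#   - a word is 4 bytes (32-bit machine)
#   - standard SEXPREC header = 7 words, reduced vector header = 6 words
#   - 1 extra word to hold the address of each object
bytes_per_word <- 4
header_words   <- c(standard = 7, reduced = 6)
pointer_words  <- 1

# Overhead per row/column name, before the characters of the string itself.
overhead_bytes <- (header_words + pointer_words) * bytes_per_word
print(overhead_bytes)

# Total cost per element relative to the 4-byte integer content alone.
int_bytes <- 4
ratio <- (int_bytes + overhead_bytes[["standard"]]) / int_bytes
print(ratio)

# Comparing against a live session: the observed numbers depend on the
# R version and platform, so no expected values are given here.
n <- 1e5
bytes_per_name <- as.numeric(object.size(as.character(seq_len(n)))) / n
bytes_per_int  <- as.numeric(object.size(integer(n))) / n
print(c(per_name = bytes_per_name, per_int = bytes_per_int))
```

The first two results reproduce the 32 (or 28) byte figure and the 9x ratio from the reply; the last part only prints whatever the running R reports, which on a current 64-bit build will differ from the 2005 numbers above.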