On 04/14/2013 07:11 PM, luke-tier...@uiowa.edu wrote:
There were a couple of bug fixes to somewhat obscure compound-assignment-related bugs that required bumping up internal reference counts. It's possible that one or more of these is responsible. If so, it is unavoidable for now, but it's worth finding out for sure. With some stripped-down test examples it should be possible to identify when things changed. I won't have time to look for some time, but if someone else wanted to nail this down that would be useful.
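A stripped-down test of the kind described above could be built around tracemem(), which reports each time a traced vector is duplicated, so an extra copy introduced by a compound-assignment fix shows up directly. This is only a sketch, assuming a build where tracemem() is enabled (it is disabled if R was configured with --disable-memory-profiling):

```r
## Sketch: detect duplications of a column vector during a
## data frame operation. tracemem() prints a message each time
## the traced object is copied.
x <- runif(10)
tracemem(x)          # start reporting duplications of x
d <- data.frame(x)   # any copy of x made here is reported
d$x[1] <- 0          # compound assignment; further copies reported
untracemem(x)        # stop tracing
```

Running this under each R version in question would show at which release the number of reported duplications changed.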
I can't quite tell from Tim's script what he's documenting. In R-2.15.3 I have

    > Rprofmem(); Rprofmem(NULL); readLines("Rprofmem.out", warn=FALSE)
    character(0)

(or sometimes [1] "new page:new page:\"Rprofmem\" ") whereas in R-3.0.0

    > Rprofmem(); Rprofmem(NULL); readLines("Rprofmem.out", warn=FALSE)
    [1] "320040 :80040 :240048 :320040 :80040 :240048 :"

I think these are the allocations Tim is seeing. They come from the parser (see below) rather than from as.data.frame. For Tim's example

    y <- 1:10^4 + 0.0
    Rprofmem(); d <- as.data.frame(y); Rprofmem(NULL); readLines("Rprofmem.out")
    [1] "320040 :80040 :240048 :320040 :80040 :240048 :80040 :\"as.data.frame.numeric\" \"as.data.frame\" "
    [2] "320040 :80040 :240048 :320040 :80040 :240048 :"

only the 80040 allocation is from as.data.frame (judging from the call-stack output). Under R -d gdb:

    (gdb) b R_OutputStackTrace
    (gdb) r
    > Rprofmem(); Rprofmem(NULL)
    Breakpoint 1, R_OutputStackTrace (file=0xbd43f0)
        at /home/mtmorgan/src/R-3-0-branch/src/main/memory.c:3434
    3434    {
    (gdb) bt
    #0  R_OutputStackTrace (file=0xbd43f0)
        at /home/mtmorgan/src/R-3-0-branch/src/main/memory.c:3434
    #1  0x00007ffff792ff83 in R_ReportAllocation (size=320040)
        at /home/mtmorgan/src/R-3-0-branch/src/main/memory.c:3456
    #2  Rf_allocVector (type=13, length=80000)
        at /home/mtmorgan/src/R-3-0-branch/src/main/memory.c:2478
    #3  0x00007ffff790bedf in growData () at gram.y:3391

and the memory allocations are from these lines in the parser, gram.y:

    PROTECT( bigger = allocVector( INTSXP, data_size * DATA_ROWS ) ) ;
    PROTECT( biggertext = allocVector( STRSXP, data_size ) );

I'm not sure why these show up under R 3.0.0, though.

    $ R-2-15-branch/bin/R --version
    R version 2.15.3 Patched (2013-03-13 r62579) -- "Security Blanket"
    Copyright (C) 2013 The R Foundation for Statistical Computing
    ISBN 3-900051-07-0
    Platform: x86_64-unknown-linux-gnu (64-bit)

    R-3-0-branch$ bin/R --version
    R version 3.0.0 Patched (2013-04-14 r62579) -- "Masked Marvel"
    Copyright (C) 2013 The R Foundation for Statistical Computing
    Platform: x86_64-unknown-linux-gnu (64-bit)

Martin
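One way to check whether the extra allocations really are parser-side rather than from as.data.frame itself is to parse the expression once, outside the profiled region, and profile only its evaluation. A minimal sketch, assuming a build configured with --enable-memory-profiling:

```r
## Sketch: separate parser allocations from as.data.frame allocations.
## quote() parses the call once, before profiling starts, so any
## allocations recorded below belong to evaluation, not parsing.
y <- 1:10^4 + 0.0
expr <- quote(as.data.frame(y))   # parsed outside the profiled region

Rprofmem("Rprofmem.out")
d <- eval(expr)                   # only evaluation is profiled
Rprofmem(NULL)

readLines("Rprofmem.out", warn = FALSE)
```

If the 240048/320040 entries disappear and only the ~80040 data-vector allocation remains, that would confirm the transcript above: the large extra allocations come from growData() in gram.y, not from as.data.frame.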
Best,

luke

On Sun, 14 Apr 2013, Tim Hesterberg wrote:

I did some benchmarking of data frame code, and it appears that R 3.0.0 is far worse than earlier versions of R in terms of how many large objects it allocates space for in data frame operations - creation, subscripting, subscript replacement. For a data frame with n rows, it makes either 2 or 4 extra copies of all of:

    8n bytes (e.g. double precision)
    24n bytes
    32n bytes

E.g., for as.data.frame(numeric vector), instead of allocations totalling ~8n bytes, it allocates 33 times that much. Here, compare columns 3 and 5 (columns 2 and 4 are with the dataframe package).

    # Summary
    #                          R-2.14.2      R-2.15.3      R-3.0.0
    #                          w/o    with   w/o    with   w/o
    # as.data.frame(y)         3      1      1      1      5;4;4
    # data.frame(y)            7      3      4      2      6;2;2
    # data.frame(y, z)         7 each 3 each 4      2      8;4;4
    # as.data.frame(l)         8      3      5      2      9;4;4
    # data.frame(l)            13     5      8      3      12;4;4
    # d$z <- z                 3,2    1,1    3,1    2,1    7;4;4,1
    # d[["z"]] <- z            4,3    1,1    3,1    2,1    7;4;4,1
    # d[, "z"] <- z            6,4,2  2,2,1  4,2,2  3,2,1  8;4;4,2,2
    # d["z"] <- z              6,5,2  2,2,1  4,2,2  3,2,1  8;4;4,2,2
    # d["z"] <- list(z=z)      6,3,2  2,2,1  4,2,2  3,2,1  8;4;4,2,2
    # d["z"] <- Z #list(z=z)   6,2,2  2,1,1  4,1,2  3,1,1  8;4;4,1,2
    # a <- d["y"]              2      1      2      1      6;4;4
    # a <- d[, "y", drop=F]    2      1      2      1      6;4;4
    #
    # Where two numbers are given, they refer to:
    #   (copies of the old data frame),
    #   (copies of the new column)
    # A third number refers to the number of
    #   (copies made of an integer vector of row names)
    # For R 3.0.0, I'm getting astounding results - many more copies,
    # and also some copies of larger objects; in addition to the data
    # vectors of size 80K and 160K, also 240K and 320K.
    # Where three numbers are given in the form a;c;d, they refer to
    #   (copies of 80K; 240K; 320K)

The benchmarks are at http://www.timhesterberg.net/r-packages/memory.R

I'm using versions of R I installed from source on a Linux box, using e.g.

    ./configure --prefix=(my path) --enable-memory-profiling --with-readline=no
    make
    make install

______________________________________________
R-devel@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-devel
--
Computational Biology / Fred Hutchinson Cancer Research Center
1100 Fairview Ave. N. PO Box 19024, Seattle, WA 98109
Location: Arnold Building M1 B861
Phone: (206) 667-2793