On 11/14/2011 11:47 AM, dhi...@sonic.net wrote:
dhi...@sonic.net wrote:
Martin Morgan <mtmor...@fhcrc.org> wrote:
Allocating many small objects triggers numerous garbage collections as R
grows its memory, seriously degrading performance. The specific use case
is creating a STRSXP of several million elements of 60-100
characters each; a simplified illustration that understates the effects
(because there is initially little to garbage collect, in contrast to an
R session with several packages loaded) is below.
What a coincidence -- I was just going to post a question about why it
is so slow to create a STRSXP of ~10,000,000 unique elements, each ~10
characters long. I had noticed that this seemed to show much worse
than linear scaling. I had not thought of garbage collection as the
culprit -- but indeed it is. By manipulating the GC trigger, I can
make this operation take as little as 3 seconds (with no GC) or as
long as 76 seconds (with 31 garbage collections).
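The worse-than-linear scaling can be illustrated with a toy cost model. The sketch below is not R's collector: it assumes every string stays reachable, that the cost of one collection equals the number of live objects it must mark, and that the trigger grows by a fixed, hypothetical `step` after each collection. The name `toy_mark_work` and all of its parameters are invented for illustration.

```c
#include <assert.h>
#include <stddef.h>

/* Toy model, NOT R's collector. Each allocation adds one live object and
 * nothing is ever freed. A collection runs whenever the live count has
 * reached the trigger; its cost is the live count (the mark phase visits
 * every live object). After each collection the trigger grows by `step`
 * (a hypothetical fixed-growth policy). Returns the total mark work and
 * stores the number of collections in *n_gc. */
unsigned long long toy_mark_work(unsigned long long n_alloc,
                                 unsigned long long trigger,
                                 unsigned long long step,
                                 unsigned long long *n_gc)
{
    unsigned long long live = 0, work = 0, count = 0;

    assert(n_gc != NULL);
    for (unsigned long long i = 0; i < n_alloc; i++) {
        if (live >= trigger) {
            work += live;     /* mark every live object */
            count++;
            trigger += step;  /* let the heap grow a little */
        }
        live++;               /* the new string stays reachable */
    }
    *n_gc = count;
    return work;
}
```

Under these assumptions, with `trigger` and `step` both 100, doubling the allocation count from 1000 to 2000 raises the number of collections from 9 to 19 and the total mark work from 4500 to 19000. A trigger at or above the allocation count yields no collections at all.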
I had done some Google searches on this issue, since it seemed like it
should not be too uncommon, but the only other hit I could come up
with was a thread from 2006:
https://stat.ethz.ch/pipermail/r-devel/2006-November/043446.html
In any case, one issue with your suggested workaround is that it
requires knowing how much additional storage is needed, which may be
an expensive operation to determine. I've just tried implementing a
different approach, which is to define two new functions to either
disable or enable GC. The function to disable GC first invokes
R_gc_full() to shrink the heap as much as possible, then sets a flag.
Then in R_gc_internal(), I first check that flag, and if it is set, I
call AdjustHeapSize(size_needed) and exit immediately.
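The control flow described above can be sketched on a toy heap. This is only an illustration under stated assumptions, not the actual patch: `toy_heap` and every `toy_*` function are hypothetical stand-ins, sizes are abstract units, nothing is really allocated, and the "full collection" is reduced to incrementing a counter.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Toy heap: counts abstract units instead of managing real memory. */
typedef struct {
    size_t   heap_size;    /* current capacity */
    size_t   in_use;       /* units handed out so far */
    bool     gc_disabled;  /* the proposed flag */
    unsigned gc_runs;      /* collections that actually ran */
} toy_heap;

/* Grow the heap just enough to satisfy the pending request. */
void toy_adjust_heap_size(toy_heap *h, size_t size_needed)
{
    if (h->in_use + size_needed > h->heap_size)
        h->heap_size = h->in_use + size_needed;
}

/* Stand-in for the collector entry point: with the flag set it only
 * grows the heap and returns; otherwise it "collects" first. */
void toy_gc_internal(toy_heap *h, size_t size_needed)
{
    if (h->gc_disabled) {
        toy_adjust_heap_size(h, size_needed);
        return;
    }
    h->gc_runs++;  /* toy collection: everything is live, nothing freed */
    toy_adjust_heap_size(h, size_needed);
}

/* Allocation enters the collector only when the heap is exhausted. */
void toy_alloc(toy_heap *h, size_t n)
{
    if (h->in_use + n > h->heap_size)
        toy_gc_internal(h, n);
    assert(h->in_use + n <= h->heap_size);
    h->in_use += n;
}

/* Disable: run one full collection first, then set the flag. */
void toy_gc_disable(toy_heap *h)
{
    h->gc_runs++;  /* stands in for the full collection */
    h->gc_disabled = true;
}

void toy_gc_enable(toy_heap *h)
{
    h->gc_disabled = false;
}
```

In this toy, ten one-unit allocations on a heap of capacity 4 run six collections with the flag clear, but only the single up-front collection when bracketed by `toy_gc_disable()` and `toy_gc_enable()`.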
I think this is a better approach; mine seriously understated the
complexity of figuring out the required size.
These calls could be used to bracket any code section that expects to
make lots of calls to R's memory allocator. The downside is that
every path out of such a code section (including error handling)
must take care to unset the GC-disabled flag. I would want to hear
from someone on the R team about
whether they think this is a good idea.
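One way to keep the flag from leaking is to route every bracketed section through a single wrapper that restores the flag on both the normal and the error path. The sketch below is a self-contained toy, assuming errors leave the section through a non-local jump (as R's C-level errors do); `gc_disabled`, `toy_error`, `with_gc_disabled`, and the two example bodies are hypothetical names, not part of R's API.

```c
#include <assert.h>
#include <setjmp.h>
#include <stdbool.h>
#include <stddef.h>

bool gc_disabled = false;       /* stand-in for the proposed flag */
jmp_buf *error_target = NULL;   /* where toy_error() jumps to */

/* Toy error: abandons the current code via a non-local jump. */
void toy_error(void)
{
    assert(error_target != NULL);
    longjmp(*error_target, 1);
}

/* Runs body(data) with the flag set and restores the previous state on
 * every way out. Returns 0 on normal completion, 1 if body raised an
 * error. Locals read after the jump are not modified after setjmp(). */
int with_gc_disabled(void (*body)(void *), void *data)
{
    jmp_buf here;
    jmp_buf *saved_target = error_target;
    bool was_disabled = gc_disabled;
    int failed;

    error_target = &here;
    gc_disabled = true;
    if (setjmp(here) == 0) {
        body(data);
        failed = 0;             /* normal path */
    } else {
        failed = 1;             /* error path: arrived via longjmp */
    }
    gc_disabled = was_disabled; /* restored on both paths */
    error_target = saved_target;
    return failed;
}

/* Example bodies: one records the flag and returns, one fails. */
void body_ok(void *data)
{
    *(int *)data = gc_disabled ? 1 : 0;
}

void body_fails(void *data)
{
    (void)data;
    toy_error();
}
```

Restoring the previous value, instead of unconditionally clearing the flag, lets such sections nest. A real implementation would also have to decide whether to re-raise the error after the cleanup; this toy only reports it through the return value.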
A final alternative might be to provide a vectorized version of mkChar
that would accept a char ** and use one of these methods internally,
rather than exporting the underlying methods as part of R's API. I
don't know if there are other clear use cases where GC is a serious
bottleneck, besides constructing large vectors of mostly unique
strings. Such a function would be less generally useful since it
would require that the full vector of C strings be assembled at one
time.
Another place where this comes up is during package load, especially for
packages with many S4 instances.
> gcinfo(TRUE)
> library(Matrix)
Garbage collection 2 = 1+0+1 (level 0) ...
7.6 Mbytes of cons cells used (40%)
1.1 Mbytes of vectors used (18%)
...
Garbage collection 58 = 39+9+10 (level 2) ...
39.4 Mbytes of cons cells used (75%)
2.9 Mbytes of vectors used (47%)
and continuing
> library(IRanges)
...
Garbage collection 89 = 60+14+15 (level 1) ...
63.1 Mbytes of cons cells used (80%)
4.3 Mbytes of vectors used (53%)
Also, something like
> system.time(as.character(1:10000000))
...
Garbage collection 124 = 60+14+50 (level 2) ...
596.1 Mbytes of cons cells used (95%)
226.3 Mbytes of vectors used (69%)
   user  system elapsed
 61.908   0.297  62.303
might be an R-level manifestation of the same problem.
Being able to disable / enable the GC seems like a useful patch, and I
hope this is interesting enough for the R-core team.
A more fundamental issue seems to be the cost of garbage collection
when there are a lot of SEXPs in play:
> system.time(gc())
   user  system elapsed
  0.236   0.000   0.236
There's a hierarchy of CHARSXP / STRSXP, so maybe that could be
exploited in the mark phase?
Martin
-- Dave
______________________________________________
R-devel@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-devel
--
Computational Biology
Fred Hutchinson Cancer Research Center
1100 Fairview Ave. N. PO Box 19024 Seattle, WA 98109
Location: M1-B861
Telephone: 206 667-2793