dhi...@sonic.net wrote:
> Martin Morgan <mtmor...@fhcrc.org> wrote:
> > Allocating many small objects triggers numerous garbage collections as R
> > grows its memory, seriously degrading performance. The specific use case
> > is in creating a STRSXP of several 1,000,000's of elements of 60-100
> > characters each; a simplified illustration understating the effects
> > (because there is initially little to garbage collect, in contrast to an
> > R session with several packages loaded) is below.
>
> What a coincidence -- I was just going to post a question about why it
> is so slow to create a STRSXP of ~10,000,000 unique elements, each ~10
> characters long. I had noticed that this seemed to show much worse
> than linear scaling. I had not thought of garbage collection as the
> culprit -- but indeed it is. By manipulating the GC trigger, I can
> make this operation take as little as 3 seconds (with no GC) or as
> long as 76 seconds (with 31 garbage collections).

I had done some Google searches on this issue, since it seemed like it
should not be too uncommon, but the only other hit I could come up with
was a thread from 2006:

https://stat.ethz.ch/pipermail/r-devel/2006-November/043446.html

In any case, one issue with your suggested workaround is that it
requires knowing how much additional storage is needed, which may be
expensive to determine.

I've just tried implementing a different approach, which is to define
two new functions to either disable or enable GC. The function to
disable GC first invokes R_gc_full() to shrink the heap as much as
possible, then sets a flag. Then in R_gc_internal(), I first check that
flag, and if it is set, I call AdjustHeapSize(size_needed) and return
immediately.

These calls could be used to bracket any code section that expects to
make lots of calls to R's memory allocator. The downside is that every
path out of such a code section (including error handling) must take
care to unset the GC-disabled flag. I think I would want to hear from
someone on the R team about whether they think this is a good idea.

A final alternative might be to provide a vectorized version of mkChar
that would accept a char ** and use one of these methods internally,
rather than exporting the underlying methods as part of R's API. I don't
know if there are other clear use cases where GC is a serious
bottleneck, besides constructing large vectors of mostly unique strings.
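To make the disable/enable bracketing idea concrete, here is a toy model
in plain C. It is not R's memory manager: ToyHeap, toy_gc_internal,
toy_adjust_heap_size, toy_gc_disable, toy_gc_enable and toy_run are all
hypothetical names, and the doubling growth policy and the 1024-byte
starting trigger are assumptions chosen only to make the collection
counts easy to follow. Every allocated object is assumed to stay live, as
with a vector of unique strings that is still referenced.

```c
#include <assert.h>
#include <stddef.h>

/* Toy model only: hypothetical names, not taken from R's memory.c. */
typedef struct {
    size_t used;        /* bytes currently allocated */
    size_t trigger;     /* a collection runs when used would exceed this */
    int    gc_disabled; /* the proposed "GC disabled" flag */
    int    collections; /* number of collections performed so far */
} ToyHeap;

/* Stand-in for heap growth: double the trigger until `needed` more bytes fit. */
static void toy_adjust_heap_size(ToyHeap *h, size_t needed)
{
    assert(h->trigger > 0);  /* doubling zero would never terminate */
    while (h->used + needed > h->trigger)
        h->trigger *= 2;
}

/* Stand-in for the collector entry point. */
static void toy_gc_internal(ToyHeap *h, size_t needed)
{
    if (h->gc_disabled) {
        /* Flag set: grow the heap and return without collecting. */
        toy_adjust_heap_size(h, needed);
        return;
    }
    h->collections++;                /* a real collector would mark and sweep here */
    toy_adjust_heap_size(h, needed); /* nothing was freed, so grow anyway */
}

/* Allocate n bytes, invoking the collector when the trigger is exceeded. */
static void toy_alloc(ToyHeap *h, size_t n)
{
    if (h->used + n > h->trigger)
        toy_gc_internal(h, n);
    h->used += n;
}

/* One full collection up front, then set the flag. */
static void toy_gc_disable(ToyHeap *h)
{
    h->collections++;
    h->gc_disabled = 1;
}

static void toy_gc_enable(ToyHeap *h)
{
    h->gc_disabled = 0;
}

/* Allocate `count` objects of `size` bytes; return the number of collections. */
static int toy_run(int bracket, size_t count, size_t size)
{
    ToyHeap h = { 0, 1024, 0, 0 };
    size_t i;

    if (bracket)
        toy_gc_disable(&h);
    for (i = 0; i < count; i++)
        toy_alloc(&h, size);
    if (bracket)
        toy_gc_enable(&h);
    return h.collections;
}
```

In this model, toy_run(0, 1000000, 16) performs 14 collections, while
toy_run(1, 1000000, 16) performs only the single up-front collection. The
model says nothing about real timings, and a real implementation would
still have to clear the flag on every error exit.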
A vectorized mkChar would be less generally useful, since it would
require that the full vector of C strings be assembled at one time.

-- Dave

______________________________________________
R-devel@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-devel