Hi Luke,

As proposed by you, I have just added this to the wishlist in Bugzilla: 
https://bugs.r-project.org/show_bug.cgi?id=18233

Best,
Andreas

2021-11-04 17:04 GMT+01:00 luke-tier...@uiowa.edu:
> Can you please submit this as a wishlist item to bugzilla? it is
> easier to keep track of there. You could also submit your threads
> based suggestion there, again to keep it easier to keep track of and
> possibly get back to in the future.
> 
> I will have a look at your approach when I get a chance, but I am
> exploring a different approach to avoid scanning old generations that
> may be simpler.
> 
> Best,
> 
> luke
> 
> On Wed, 3 Nov 2021, Andreas Kersting wrote:
> 
>> Hi,
>>
>> In https://stat.ethz.ch/pipermail/r-devel/2021-October/081147.html I 
>> proposed to speed up the CHARSXP cache maintenance during GC using 
>> threading. This was rejected by Luke in 
>> https://stat.ethz.ch/pipermail/r-devel/2021-October/081172.html.
>>
>> Here I want to propose an alternative approach to significantly speed up 
>> CHARSXP cache maintenance during partial GCs. A patch which passes `make 
>> check-devel` is attached. Compared to R devel (revision 81110) I get the 
>> following performance improvements on my system:
>>
>> Elapsed time for five non-full gc in a session after
>>
>> x <- as.character(runif(5e7))[]
>> gc(full = TRUE)
>>
>> +20sec -> ~1sec.
>>
>>
>> This patch introduces (theoretical) overheads to mkCharLenCE() and full GCs. 
>> However, I did not measure dramatic differences:
>>
>> y <- "old_CHARSXP"
>>
>> after
>>
>> x <- "old_CHARSXP"; gc(); gc()
>>
>> takes a median 32 nanoseconds with and without the patch.
>>
>>
>> gc(full = TRUE)
>>
>> in a new session takes a median 16 milliseconds with and 14 without the 
>> patch.
>>
>>
>> The basic idea is to maintain the CHARSXP cache using subtables in 
>> R_StringHash, one for each of the (NUM_GC_GENERATIONS := NUM_OLD_GENERATIONS 
>> + 1) GC generations. New CHARSXPs are added by mkCharLenCE() to the subtable 
>> of the youngest generation. After a partial GC, only the chains anchored at 
>> the subtables of the youngest (num_old_gens_to_collect + 1) generations need 
>> to be searched for and cleaned of unmarked nodes. Afterwards, these chains 
>> need to be merged into those of the respective next generation, if any. This 
>> approach relies on the fact that an object/CHARSXP can never become younger 
>> again. It is OK though if an object/CHARSXP "skips" a GC generation.
>>
>> R_StringHash, which is now of length (NUM_GC_GENERATIONS * char_hash_size), 
>> is structured such that the chains for the same hashcode but for different 
>> generations are anchored at slots of R_StringHash which are next to each 
>> other in memory. This is because we often need to access two or more (i.e. 
>> currently all three) of them for one operation and this avoids cache misses.
>>
>> HASHPRI, i.e. the number of occupied primary slots, is computed and stored 
>> as NUM_GC_GENERATIONS times the number of slots which are occupied in at 
>> least one of the subtables. This is done because in mkCharLenCE() we need to 
>> iterate through one or more chains if and only if there is a chain for the 
>> particular hashcode in at least one subtable.
>>
>> I tried to keep the patch as minimal as possible. In particular, I did not 
>> add long vector support to R_StringHash. I rather reduced the max value of 
>> char_hash_size from 2^30 to 2^29, assuming that NUM_OLD_GENERATIONS is (not 
>> larger than) 2. I also did not yet adjust do_show_cache() and 
>> do_write_cache(), but I could do so if the patch is accepted.
>>
>> Thanks for your consideration and feedback.
>>
>> Regards,
>> Andreas
>>
>>
>> P.S. I had a hard time to get the indentation right in the patch due the mix 
>> of tabs and spaces. Sorry, if I screwed this up.
> 
> -- 
> Luke Tierney
> Ralph E. Wareham Professor of Mathematical Sciences
> University of Iowa                  Phone:             319-335-3386
> Department of Statistics and        Fax:               319-335-3017
>    Actuarial Science
> 241 Schaeffer Hall                  email:   luke-tier...@uiowa.edu
> Iowa City, IA 52242                 WWW:  http://www.stat.uiowa.edu
> 
______________________________________________
R-devel@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-devel

Reply via email to