Taylor Simpson <[email protected]> writes:
>> -----Original Message-----
>> From: Richard Henderson <[email protected]>
>> Sent: Monday, April 18, 2022 10:38 AM
>> To: Taylor Simpson <[email protected]>; [email protected]
>> Cc: Philippe Mathieu-Daudé <[email protected]>
>> Subject: Re: Question about direct block chaining
>>
>> On 4/18/22 07:54, Taylor Simpson wrote:
>> > I implemented both approaches for inner loops and didn't see speedup
>> > in my benchmark. So, I have a couple of questions
>> > 1) What are the pros and cons of the two approaches
>> >    (lookup_and_goto_ptr and goto_tb + exit_tb)?
>>
>> goto_tb can only be used within a single page (plus other restrictions, see
>> translator_use_goto_tb). In addition, as documented, the change in cpu
>> state must be constant, beginning with a direct jump.
>>
>> lookup_and_goto_ptr can handle any change in cpu state, including indirect
>> jumps.
>>
>> > 2) How can I verify that direct block chaining is working properly?
>> > With -d exec, I see lines like the following with goto_tb + exit_tb
>> > but NOT lookup_and_goto_ptr
>> > Linking TBs 0x7fda44172e00 [0050ac38] index 1 -> 0x7fda44173b40
>> > [0050ac6c]
>>
>> Well, that's one way. I would have also suggested simply looking at -d op
>> output, for the various branchy cases you're considering, to see that all of
>> the exits are as expected.
>
> Thanks!!
>
> I created a synthetic benchmark with a loop with a very small body and a very
> high number of iterations. I can see differences in execution time.
>
> Here are my observations:
> - goto_tb + exit_tb gives the fastest execution time because it will
>   patch the native jump address

As we would expect.

> - lookup_and_goto_ptr is an improvement over tcg_gen_exit_tb(NULL, 0)

Yes, mainly by saving the cost of the prologue and of coming out of
generated code to the main loop. However, once we get to tb_lookup and
miss in the tb_jump_cache, it's going to take some time to get a block
via QHT.
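
A rough standalone model of that fast path / slow path split, which
also counts hits and misses, might look like the following. This is
only a sketch and not QEMU's code: JumpCacheModel, jc_hash, jc_lookup,
jc_hit_rate, the 12-bit table size and the hash itself are all made up
for illustration, and the real slow path (the QHT lookup) is reduced to
simply refilling the slot.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Toy model only: names, sizes and the hash are illustrative assumptions. */
#define JC_BITS 12
#define JC_SIZE (1u << JC_BITS)

typedef struct {
    uint64_t pc[JC_SIZE];     /* guest PC remembered in each slot */
    uint8_t  valid[JC_SIZE];  /* non-zero once the slot has been filled */
    uint64_t hits;
    uint64_t misses;
} JumpCacheModel;

/* Hypothetical hash: fold some upper PC bits into the slot index. */
static unsigned jc_hash(uint64_t pc)
{
    return (unsigned)((pc >> 2) ^ (pc >> (2 + JC_BITS))) & (JC_SIZE - 1);
}

/*
 * Direct-mapped lookup. Returns 1 on a hit. On a miss it returns 0 and
 * overwrites the slot, standing in for the expensive lookup that would
 * find (or translate) the block and then refill the cache.
 */
static int jc_lookup(JumpCacheModel *jc, uint64_t pc)
{
    assert(jc != NULL);
    unsigned idx = jc_hash(pc);

    if (jc->valid[idx] && jc->pc[idx] == pc) {
        jc->hits++;
        return 1;
    }
    jc->misses++;
    jc->pc[idx] = pc;
    jc->valid[idx] = 1;
    return 0;
}

/* Fraction of lookups that hit; 0.0 if nothing has been looked up yet. */
static double jc_hit_rate(const JumpCacheModel *jc)
{
    uint64_t total = jc->hits + jc->misses;
    return total ? (double)jc->hits / (double)total : 0.0;
}
```

Feeding a model like this with the PCs from a -d exec trace would give
a first estimate of how often two hot blocks evict each other, since in
a direct-mapped table two PCs that hash to the same slot miss on every
alternation.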

The tb_jump_cache is pretty simple in its implementation, but I don't
know if we've ever decently characterised the hit rate or whether it
could be improved. I think we already have slightly different hashing
functions for user-mode vs softmmu.

(As an aside, I suspect the trace_vcpu_dstate check can now be removed,
which should save a bit of time on the hash function.)

-- 
Alex Bennée
