https://bugs.kde.org/show_bug.cgi?id=402781
            Bug ID: 402781
           Summary: Redo the cache used to process indirect branch
                    targets
           Product: valgrind
           Version: 3.15 SVN
          Platform: Other
                OS: Linux
            Status: REPORTED
          Severity: normal
          Priority: NOR
         Component: general
          Assignee: jsew...@acm.org
          Reporter: jsew...@acm.org
  Target Milestone: ---

In the baseline simulator, jumps to guest code addresses that are not
known at JIT time have to be looked up in a guest->host mapping table.
That means: indirect branches, indirect calls and, most commonly,
returns.  Since there are huge numbers of these (often 10+
million/second), the mapping mechanism needs to be extremely cheap.

Currently this is implemented using a direct-mapped cache, VG_(tt_fast),
with 2^15 (guest_addr, host_addr) pairs.  This is queried in handwritten
assembly in VG_(disp_cp_xindir) in dispatch-<arch>-<os>.S.  If there is
a miss in the cache then we fall back out to C land and do a slow lookup
using VG_(search_transtab).

Because the size of the translation table(s) has expanded significantly
in recent years, in order to keep pace with increasing application
sizes, two bad things have happened: (1) the cost of a miss in the fast
cache has risen significantly, and (2) the miss rate on the fast cache
has also increased significantly.  This means that large
(~ one-million-basic-blocks-JITted) applications that run for a long
time end up spending a lot of time in VG_(search_transtab).

The proposed fix is to increase the associativity of the fast cache from
1 (direct mapped) to 4.  Simulations of various cache configurations,
using indirect-branch traces from a large application, show that this is
the best of the configurations tested.  In an extreme case with 5.7
billion indirect branches:

* Increasing the associativity from 1 way to 4 ways, whilst keeping the
  overall cache size the same (32k guest/host pairs), reduces the miss
  rate by around a factor of 3, from 4.02% to 1.30%.

* Using a slightly better hash function than merely slicing off the
  bottom 15 bits of the address reduces the miss rate further, from
  1.30% to 0.53%.

Overall the VG_(tt_fast) miss rate is almost unchanged on small
workloads, but is reduced by a factor of up to almost 8 on large
workloads.

By implementing each (4-entry) cache set with a move-to-front scheme on
hits in ways 1, 2 or 3, the vast majority of hits can be made to happen
in way 0.  Hence the cost of the extra associativity is almost zero in
the case of a hit.  The improved hash function costs an extra 2 ALU ops
(a shift and an xor), but overall this seems to range from
performance-neutral to a small win.
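For concreteness, here is a minimal C sketch of the existing
direct-mapped scheme.  The real lookup is handwritten assembly in
VG_(disp_cp_xindir); the names fast_cache, slow_lookup and
lookup_direct_mapped below are hypothetical stand-ins for VG_(tt_fast),
VG_(search_transtab) and the dispatcher hot path.

  #include <stdint.h>

  #define N_FAST_ENTRIES (1 << 15)   /* 2^15 (guest_addr, host_addr) pairs */

  typedef struct { uint64_t guest; uint64_t host; } FastCacheEntry;

  /* Hypothetical stand-in for VG_(tt_fast). */
  static FastCacheEntry fast_cache[N_FAST_ENTRIES];

  /* Hypothetical stand-in for the slow C-land lookup,
     VG_(search_transtab). */
  extern uint64_t slow_lookup(uint64_t guest_addr);

  uint64_t lookup_direct_mapped(uint64_t guest_addr)
  {
     /* Current hash: merely slice off the bottom 15 bits. */
     uint32_t ix = (uint32_t)guest_addr & (N_FAST_ENTRIES - 1);
     if (fast_cache[ix].guest == guest_addr)
        return fast_cache[ix].host;             /* hit: the fast path */
     uint64_t host = slow_lookup(guest_addr);   /* miss: slow lookup */
     fast_cache[ix].guest = guest_addr;         /* install for next time */
     fast_cache[ix].host  = host;
     return host;
  }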
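And a sketch of the proposed 4-way, move-to-front replacement, under the
same naming assumptions.  The shift-and-xor hash shown is illustrative
only; the exact mixing function is not specified above, beyond costing
one shift and one xor.

  #include <stdint.h>

  #define N_SETS  (1 << 13)   /* 8192 sets x 4 ways = 32k pairs, same size */
  #define N_WAYS  4

  typedef struct {
     uint64_t guest[N_WAYS];
     uint64_t host [N_WAYS];
  } FastCacheSet;

  static FastCacheSet fast_cache[N_SETS];

  extern uint64_t slow_lookup(uint64_t guest_addr);

  /* Illustrative hash: one shift and one xor (2 extra ALU ops) to fold
     higher address bits in, instead of taking only the low bits. */
  static inline uint32_t hash_addr(uint64_t a)
  {
     return (uint32_t)(a ^ (a >> 13)) & (N_SETS - 1);
  }

  uint64_t lookup_4way_mtf(uint64_t guest_addr)
  {
     FastCacheSet* set = &fast_cache[hash_addr(guest_addr)];

     /* Way 0 is checked first; move-to-front keeps nearly all hits
        here, so the common case is a single compare. */
     if (set->guest[0] == guest_addr)
        return set->host[0];

     for (int w = 1; w < N_WAYS; w++) {
        if (set->guest[w] == guest_addr) {
           /* Hit in way 1, 2 or 3: move the entry to the front. */
           uint64_t g = set->guest[w], h = set->host[w];
           for (int i = w; i > 0; i--) {
              set->guest[i] = set->guest[i-1];
              set->host [i] = set->host [i-1];
           }
           set->guest[0] = g;
           set->host [0] = h;
           return h;
        }
     }

     /* Miss: slow lookup, then install at the front, evicting way 3. */
     uint64_t host = slow_lookup(guest_addr);
     for (int i = N_WAYS-1; i > 0; i--) {
        set->guest[i] = set->guest[i-1];
        set->host [i] = set->host [i-1];
     }
     set->guest[0] = guest_addr;
     set->host [0] = host;
     return host;
  }

Because move-to-front keeps the most recently used entry in way 0, the
hit path in the common case costs the same as the direct-mapped version,
which is why the extra associativity is almost free on a hit.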