Hello,

On Wed, 12 Jun 2024, Paolo Bonzini wrote:
> I didn't do this because of RHEL9, I did it because it's silly that
> QEMU cannot use POPCNT and has to waste 2% of the L1 d-cache to
> compute the x86 parity flag (and POPCNT was introduced at the same
> time as SSE4.2).

I do not see where the 2% figure is coming from: even considering that
the 256-byte LUT may take an extra cache line due to misalignment,
320 bytes is still less than 1% of a 32 KB L1D.

More importantly, the way this comment is phrased made me think that
QEMU eagerly computes PF. But the comment in target/i386/cpu.h says that
all flags are computed on demand. Considering that software pretty much
never uses PF, why would the parity table be resident in L1D?

As far as I can see, the cost is rather a cache miss and perhaps a TLB
miss when PF is computed (mostly when EFLAGS are accessed all together
on context switches, I think).

Is there something I'm not seeing?

Thanks.
Alexander
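
For reference, the two ways of computing PF discussed above, and the
footprint arithmetic, can be sketched roughly as follows. This is a
minimal illustration and not QEMU's actual code: the names pf_lut,
pf_lut_init, pf_from_lut, pf_from_popcount and worst_case_footprint are
hypothetical, the 64-byte cache line and 32 KB L1D are assumed typical
values, and __builtin_popcount assumes GCC or Clang.

```c
#include <stdint.h>

/* Bit 2 of EFLAGS is the parity flag (PF). */
#define PF_MASK 0x04u

/* Hypothetical 256-byte lookup table: one entry per low-byte value. */
static uint8_t pf_lut[256];

/* Fill the table: PF is set when the low byte has an even number of 1 bits. */
static void pf_lut_init(void)
{
    for (unsigned v = 0; v < 256; v++) {
        unsigned ones = 0;
        for (unsigned b = 0; b < 8; b++) {
            ones += (v >> b) & 1u;
        }
        pf_lut[v] = (ones % 2 == 0) ? PF_MASK : 0;
    }
}

/* Table-based PF: one load, which may miss in the cache if the table is cold. */
static unsigned pf_from_lut(uint32_t result)
{
    return pf_lut[result & 0xffu];
}

/* POPCNT-based PF: no memory access (assumes a GCC/Clang builtin). */
static unsigned pf_from_popcount(uint32_t result)
{
    return (__builtin_popcount(result & 0xffu) & 1) ? 0 : PF_MASK;
}

/*
 * Worst-case bytes of cache occupied by a table that is not aligned to a
 * cache line: the lines it would need if aligned, plus one extra line.
 */
static unsigned worst_case_footprint(unsigned table_bytes, unsigned line_bytes)
{
    unsigned lines = (table_bytes + line_bytes - 1) / line_bytes + 1;
    return lines * line_bytes;
}
```

With these assumptions, worst_case_footprint(256, 64) gives 320 bytes,
and 320 / 32768 is about 0.98%, which is where the "less than 1%"
estimate comes from. The table only occupies those lines if it is
actually touched, which is the point of the question about lazy flag
evaluation.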
