On 17/03/2023 11:22 am, Roger Pau Monné wrote: > On Mon, Jul 15, 2019 at 02:39:04PM +0000, Jan Beulich wrote: >> This is faster than using the software implementation, and the insn is >> available on all half-way recent hardware. Therefore convert >> generic_hweight<N>() to out-of-line functions (without affecting Arm) >> and use alternatives patching to replace the function calls. >> >> Note that the approach doesn#t work for clang, due to it not recognizing >> -ffixed-*. > I've been giving this a look, and I wonder if it would be fine to > simply push and pop the scratch registers in the 'call' path of the > alternative, as that won't require any specific compiler option.
It's been a long while, and in that time I've learnt a lot more about performance, but my root objection to the approach taken here still stands - it is penalising the common case to optimise some pointless corner cases. Yes - on the call path, an extra push/pop pair (or few) to get temp registers is basically free. Right now, the generic_hweight() helpers are static inline in xen/bitops.h and this is nonsense. The absolute best they should be is extern inline in our new lib/ and I'll bet that the compilers stop inlining them there and then. Given new abi's like x86_64-v2 (which guarantees POPCNT as an available feature), it would be nice to arrange to use __builtin_popcnt() to let the compiler optimise to its hearts content, but outside of this case, it is actively damaging to try and optimise for memory operands or to inline the 8/16 case. So, for x86 specifically, we want: if ( CONFIG_POPCNT ) __builtin_popcnt() else ALT( "popcnt D -> a", "call arch_popcnt$N" ) and we can write arch_popcnt* in x86's lib/ and in assembly, because these are trivial enough and we do need to be careful with registers/etc. I'm not sure if a "+D" vs "D" will matter at the top level. Probably not, so it might be an easy way to get one tempt register. Other temp registers can come from push/pop. While we're at it, we should split hweight out of bitops and write the common header in such a way that it defaults to the generic implementations in lib/, and that will subsume the ARM header and also make this work on RISC-V for free. ~Andrew
